X-VC

Official code release for X-VC: Zero-shot Streaming Voice Conversion in Codec Space.

Online Demo

You can try our real-time voice conversion system on the online demo page. First, choose a preset target voice, or upload or record your own reference audio, then use the preview player to check the target voice. Click Start Stream to begin real-time conversion. When you finish speaking, click Stop to end the session; the saved input and converted output audio then appear in the Saved Audio section for playback. Headphones are recommended for a better experience. Please report any issues you encounter.

Environment Setup

1. Clone

git clone https://github.com/Jerrister/X-VC.git
cd X-VC

2. Create conda environment and install dependencies

conda create -n xvc python=3.10 -y
conda activate xvc
pip install -U pip
pip install -r requirements.txt

3. Prepare pretrained models

Prepare:

GLM-4-Voice-Tokenizer (for semantic tokenization)
ERes2Net speaker encoder (for speaker feature extraction)

Then set paths in configs/xvc.yaml, especially:

model.generator.semantic_encoder.encoder.from_pretrained
model.generator.semantic_encoder.cfg
model.generator.speaker_encoder.pretrained_dir
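
Before running inference, it can help to confirm that these keys are actually set. The sketch below assumes the YAML file has already been loaded into a nested dictionary (for example with PyYAML's `yaml.safe_load`) and that the dotted names above map directly onto nested keys. The helpers `get_by_dotted_key` and `missing_keys` are hypothetical and not part of the X-VC codebase.

```python
from typing import Any, Mapping

# Dotted keys listed in this README; the nesting is assumed to mirror the dots.
REQUIRED_KEYS = [
    "model.generator.semantic_encoder.encoder.from_pretrained",
    "model.generator.semantic_encoder.cfg",
    "model.generator.speaker_encoder.pretrained_dir",
]

_MISSING = object()  # sentinel, so that a stored None is distinguishable from "absent"


def get_by_dotted_key(cfg: Mapping[str, Any], dotted: str) -> Any:
    """Walk a nested mapping along a dotted key; return _MISSING if a level is absent."""
    node: Any = cfg
    for part in dotted.split("."):
        if not isinstance(node, Mapping) or part not in node:
            return _MISSING
        node = node[part]
    return node


def missing_keys(cfg: Mapping[str, Any], required=REQUIRED_KEYS) -> list[str]:
    """Return the dotted keys that are absent or left empty (None or "")."""
    out = []
    for key in required:
        value = get_by_dotted_key(cfg, key)
        if value is _MISSING or value is None or value == "":
            out.append(key)
    return out
```

Calling `missing_keys(cfg)` on the loaded config returns an empty list when all three paths are filled in. It does not check that the paths exist on disk.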

4. Prepare checkpoints

Put X-VC checkpoints under ckpts/, for example:

ckpts/
  xvc.pt

Inference

Single-pair Inference

Use scripts/infer_single.sh.

bash scripts/infer_single.sh

Key arguments in this script:

current=0 for offline inference.
current>0 for streaming inference.
chunk/current/future/smooth control streaming behavior.

Outputs are saved under save_dir (default: outputs/xvc_single).

Batch Offline Inference (SeedTTS-eval as example)

Use scripts/batch_infer_seedtts_offline.sh.

bash scripts/batch_infer_seedtts_offline.sh

This script reports:

saved_dir
total_rtf
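
For readers unfamiliar with the metric: if total_rtf follows the usual definition of real-time factor (an assumption, since the script's exact computation is not documented here), it is the total processing time divided by the total duration of the processed audio. Values below 1 then mean faster than real time. A minimal sketch with hypothetical function names:

```python
from typing import Iterable, Tuple


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Conventional RTF: time spent processing divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def total_rtf(runs: Iterable[Tuple[float, float]]) -> float:
    """Aggregate RTF over (processing_seconds, audio_seconds) pairs.

    Assumption: totals are summed first and divided once, rather than
    averaging per-utterance ratios. The script may aggregate differently.
    """
    runs = list(runs)
    total_processing = sum(p for p, _ in runs)
    total_audio = sum(a for _, a in runs)
    return real_time_factor(total_processing, total_audio)
```

For example, two utterances of 4 seconds each that take 1 second each to convert give `total_rtf([(1.0, 4.0), (1.0, 4.0)])`, which is 0.25.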

Batch Streaming Inference (SeedTTS-eval as example)

Use scripts/batch_infer_seedtts_stream.sh.

bash scripts/batch_infer_seedtts_stream.sh

This script reports:

saved_dir
avg_latency_ms

Training

Step 1: Prepare pretrained dependencies

Before training, prepare the required pretrained dependencies:

SAC pretrained checkpoint(s) (for model initialization)

Then set corresponding paths in configs/xvc.yaml, especially:

model.generator.checkpoint
model.discriminator.checkpoint

Step 2: Prepare training data

Organize your training/validation data in JSONL format and set:

datasets.train
datasets.val

in configs/xvc.yaml.

Step 3: Modify training configs

You can adjust training behavior in:

configs/xvc.yaml (main training config)
configs/ds_stage2.json (DeepSpeed config)

Step 4: Start training

Use scripts/train.sh.

bash scripts/train.sh

Notes:

The default training engine is DeepSpeed (configs/ds_stage2.json).
Main experiment config is configs/xvc.yaml.
Set your WANDB_API_KEY in scripts/train.sh before running if you use wandb logging.

Data Format

The training config, configs/xvc.yaml, points to the JSONL files through these keys:

datasets.train
datasets.val

Each JSONL line should be a JSON object.

Required fields:

target_utt
source_wav_path
target_wav_path

Optional field:

source_utt

Minimal example:

{"source_utt":"utt_0001","source_wav_path":"<path_to_source>","target_utt":"utt_0002","target_wav_path":"<path_to_target>"}
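
A quick way to check a manifest against the rules above before training is sketched here. It only checks that each non-empty line is a JSON object containing the three required fields. The function name `validate_jsonl_lines` is hypothetical and not part of the X-VC codebase, and the check does not verify that the audio paths exist.

```python
import json

# Required fields as listed in this README; source_utt is optional.
REQUIRED_FIELDS = ("target_utt", "source_wav_path", "target_wav_path")


def validate_jsonl_lines(lines):
    """Return a list of (line_number, problem) for lines that fail the checks."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((lineno, "not a JSON object"))
            continue
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            problems.append((lineno, "missing fields: " + ", ".join(missing)))
    return problems
```

Because a file object iterates line by line, a manifest can be checked with `with open(path) as f: problems = validate_jsonl_lines(f)`, where `path` is whatever datasets.train or datasets.val points to.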

Acknowledgements

This codebase builds upon open-source components from SAC and the broader audio generation ecosystem.

Citation

If you find our work useful in your research, please consider citing:

@misc{zheng2026xvczeroshotstreamingvoice,
      title={X-VC: Zero-shot Streaming Voice Conversion in Codec Space}, 
      author={Qixi Zheng and Yuxiang Zhao and Tianrui Wang and Wenxi Chen and Kele Xu and Yikang Li and Qinyuan Chen and Xipeng Qiu and Kai Yu and Xie Chen},
      year={2026},
      eprint={2604.12456},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2604.12456},
}

License

This project is licensed under the MIT License.
