Official code release for X-VC: Zero-shot Streaming Voice Conversion in Codec Space.
You may visit our real-time voice conversion system on the online demo page. First, choose a preset target voice or upload/record your own reference audio, then use the preview player to check the target voice. Click Start Stream to start real-time conversion. After speaking, click Stop to end the session; the saved input and converted output audio will appear in the Saved Audio section for playback. Please use headphones for a better experience, and report any issues you encounter.
git clone https://github.com/Jerrister/X-VC.git
cd X-VC
conda create -n xvc python=3.10 -y
conda activate xvc
pip install -U pip
pip install -r requirements.txt
Prepare:
- GLM-4-Voice-Tokenizer (for semantic tokenization)
- ERes2Net speaker encoder (for speaker feature extraction)
Then set paths in configs/xvc.yaml, especially:
- model.generator.semantic_encoder.encoder.from_pretrained
- model.generator.semantic_encoder.cfg
- model.generator.speaker_encoder.pretrained_dir
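Before running inference, it can help to confirm that these paths are actually filled in. The following is a minimal sketch, not part of the X-VC codebase: `get_dotted` and `unset_keys` are hypothetical helper names, and the code assumes that the dotted keys above map directly onto nested mappings in configs/xvc.yaml.

```python
from typing import Any, Iterable, List

# Keys named in this README. That they correspond to nested YAML mappings
# in configs/xvc.yaml is an assumption of this sketch.
REQUIRED_KEYS = (
    "model.generator.semantic_encoder.encoder.from_pretrained",
    "model.generator.semantic_encoder.cfg",
    "model.generator.speaker_encoder.pretrained_dir",
)


def get_dotted(cfg: dict, dotted_key: str) -> Any:
    """Walk a nested dict along a dotted key; return None if any level is missing."""
    node: Any = cfg
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def unset_keys(cfg: dict, keys: Iterable[str] = REQUIRED_KEYS) -> List[str]:
    """Return the dotted keys whose value is missing, None, or an empty string."""
    return [key for key in keys if get_dotted(cfg, key) in (None, "")]
```

With PyYAML installed, the config can be loaded with `yaml.safe_load` and passed to `unset_keys`; an empty result means all three entries have a value. The sketch does not check that the paths exist on disk.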
Put X-VC checkpoints under ckpts/, for example:
ckpts/
  xvc.pt
bash scripts/infer_single.sh
Key arguments in this script:
- current=0 for offline inference.
- current>0 for streaming inference.
- chunk/current/future/smooth control streaming behavior.
Outputs are saved under save_dir (default: outputs/xvc_single).
Use scripts/batch_infer_seedtts_offline.sh.
bash scripts/batch_infer_seedtts_offline.sh
This script reports:
- saved_dir
- total_rtf
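For readers unfamiliar with the metric, the real-time factor (RTF) is conventionally the processing time divided by the duration of the audio processed, so values below 1 mean faster than real time. The sketch below illustrates that convention only; the function names are hypothetical, and aggregating as total processing time over total audio duration is an assumption about how total_rtf is computed, not something confirmed by the script.

```python
from typing import Iterable, Tuple


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF for one utterance: processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def aggregate_rtf(pairs: Iterable[Tuple[float, float]]) -> float:
    """RTF over many utterances, given (processing_seconds, audio_seconds) pairs.

    Assumption: the aggregate is total processing time over total audio
    duration, rather than the mean of per-utterance RTFs.
    """
    pairs = list(pairs)
    total_processing = sum(processing for processing, _ in pairs)
    total_audio = sum(audio for _, audio in pairs)
    return real_time_factor(total_processing, total_audio)
```

The two aggregation choices differ when utterance lengths vary: the ratio of totals weights long utterances more heavily than a mean of per-utterance values would.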
Use scripts/batch_infer_seedtts_stream.sh.
bash scripts/batch_infer_seedtts_stream.sh
This script reports:
- saved_dir
- avg_latency_ms
Before training, prepare the required pretrained dependencies:
- SAC pretrained checkpoint(s) (for model initialization)
Then set corresponding paths in configs/xvc.yaml, especially:
- model.generator.checkpoint
- model.discriminator.checkpoint
Organize your training/validation data in JSONL format and set:
- datasets.train
- datasets.val
in configs/xvc.yaml.
You can adjust training behavior in:
- configs/xvc.yaml (main training config)
- configs/ds_stage2.json (DeepSpeed config)
Use scripts/train.sh.
bash scripts/train.sh
Notes:
- Default training engine is DeepSpeed (configs/ds_stage2.json).
- Main experiment config is configs/xvc.yaml.
- Set your WANDB_API_KEY in scripts/train.sh before running if you use wandb logging.
The training config, configs/xvc.yaml, points to the JSONL files through:
- datasets.train
- datasets.val
Each JSONL line should be a JSON object.
Required fields:
- target_utt
- source_wav_path
- target_wav_path
Optional field:
source_utt
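The field requirements above can be checked before training starts. The following is a minimal sketch rather than part of the X-VC codebase: `check_jsonl_lines` is a hypothetical helper, it only verifies that each line parses as a JSON object containing the required fields, and it assumes blank lines are acceptable and can be skipped.

```python
import json
from typing import Iterable, List

# Required fields as listed in this README; source_utt is optional.
REQUIRED_FIELDS = ("target_utt", "source_wav_path", "target_wav_path")


def check_jsonl_lines(lines: Iterable[str]) -> List[str]:
    """Return one human-readable problem per malformed line (empty list if none)."""
    problems = []
    for lineno, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            continue  # assumption: blank lines are tolerated
        try:
            record = json.loads(text)
        except json.JSONDecodeError as exc:
            problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {lineno}: expected a JSON object")
            continue
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            problems.append(f"line {lineno}: missing {', '.join(missing)}")
    return problems
```

An open file handle can be passed directly, since iterating a text file yields its lines. The sketch does not check that the audio paths exist or that the files are readable.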
Minimal example:
{"source_utt":"utt_0001","source_wav_path":"<path_to_source>","target_utt":"utt_0002","target_wav_path":"<path_to_target>"}
This codebase builds upon open-source components from SAC and the broader audio generation ecosystem.
If you find our work useful in your research, please consider citing:
@misc{zheng2026xvczeroshotstreamingvoice,
title={X-VC: Zero-shot Streaming Voice Conversion in Codec Space},
author={Qixi Zheng and Yuxiang Zhao and Tianrui Wang and Wenxi Chen and Kele Xu and Yikang Li and Qinyuan Chen and Xipeng Qiu and Kai Yu and Xie Chen},
year={2026},
eprint={2604.12456},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2604.12456},
}
This project is licensed under the MIT License.

