Cong Phuoc Huynh
2025
VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing
Zhisheng Zheng | Puyuan Peng | Anuj Diwan | Cong Phuoc Huynh | Xiaohang Sun | Zhu Liu | Vimal Bhat | David Harwath
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
We introduce VoiceCraft-X, an autoregressive neural codec language model which unifies multilingual speech editing and zero-shot text-to-speech (TTS) synthesis across 11 languages: English, Mandarin, Korean, Japanese, Spanish, French, German, Dutch, Italian, Portuguese, and Polish. VoiceCraft-X utilizes the Qwen3 large language model for phoneme-free cross-lingual text processing and a novel token reordering mechanism with time-aligned text and speech tokens to handle both tasks as a single sequence generation problem. The model generates high-quality, natural-sounding speech, seamlessly creating new audio or editing existing recordings within one framework. VoiceCraft-X shows robust performance in diverse linguistic settings, even with limited per-language data, underscoring the power of unified autoregressive approaches for advancing complex, real-world multilingual speech applications. Audio samples are available at https://zhishengzheng.com/voicecraft-x/.