UniVocal: Unified Speech-Singing Code-Switching Synthesis
YuFei Shi, Qian Chen, Wen Wang, Xiangang Li, Zhen-Hua Ling, Yang Ai
Abstract
We propose UniVocal, a unified framework that pioneers Speech-Singing Code-Switching (SCS) synthesis, a task in which transitions between speech and singing are driven autonomously by textual semantics, akin to seamless human language blending. Unlike single-mode generation or systems that rely on switching-control tags, UniVocal infers vocal modes implicitly and solely from text context. To achieve this, we employ a data-efficient two-stage curriculum learning strategy that progressively trains a competitive TTS system to acquire the desired SCS capability. To address data scarcity, we introduce a scalable pipeline for synthesizing diverse code-switching data that is both semantically and acoustically natural, alongside a new multi-scenario benchmark, SCSBench. To address the limitations of semantic tokenizers in capturing acoustic details, we also introduce a refined cent token and Chain-of-Thought (CoT) generation that plans prosody before content generation, improving both empathetic speech generation and singing melody. Experimental results demonstrate that UniVocal achieves state-of-the-art performance on SCSBench while maintaining competitive performance on regular speech and singing tasks. Audio samples are available at https://project-univocal-demo.github.io/demo/. The code and dataset are released at https://github.com/FunAudioLLM/FunResearch/tree/main/UniVocal.
- Anthology ID: 2026.acl-long.1452
- Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month: July
- Year: 2026
- Address: San Diego, California, United States
- Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue: ACL
- Publisher: Association for Computational Linguistics
- Pages: 31479–31496
- URL: https://preview.aclanthology.org/check-for-anonymous-pdfs/2026.acl-long.1452/
- Cite (ACL): YuFei Shi, Qian Chen, Wen Wang, Xiangang Li, Zhen-Hua Ling, and Yang Ai. 2026. UniVocal: Unified Speech-Singing Code-Switching Synthesis. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 31479–31496, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal): UniVocal: Unified Speech-Singing Code-Switching Synthesis (Shi et al., ACL 2026)
- PDF: https://preview.aclanthology.org/check-for-anonymous-pdfs/2026.acl-long.1452.pdf