Lao-English Code-Switched Speech Synthesis Via Neural Codec Language Modeling

Yaping Liu, Linqin Wang, Shengxiang Gao, Zhengtao Yu, Ling Dong, Tian Tian


Abstract
This paper addresses the challenges of data scarcity and limited speaker resources in Lao-English code-switched speech synthesis. We propose a neural encoder-decoder-based method for mixed-lingual speech synthesis. The method first extracts phoneme-level speech representations and then uses a dot-product attention mechanism to map Lao and English phonemes into a shared latent space, thereby enhancing the model’s capability to represent cross-lingual phonetic information. In addition, the language ID embedding module is extended to explicitly indicate the language of each input token, helping the model distinguish and adapt to language-specific pronunciation characteristics. Experiments are conducted on the open-source English dataset LibriTTS and a proprietary Lao speech corpus. Both subjective evaluations (MOS, AB preference tests) and objective metrics (RMSE) demonstrate that the proposed approach significantly outperforms the baseline VALL-E X model in terms of naturalness and language-switching fluency. Furthermore, ablation studies confirm that both the shared phoneme latent space and the language ID module play critical roles in improving synthesis quality. This approach offers a novel solution for integrating low-resource languages into mixed-lingual speech synthesis.
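The two mechanisms named in the abstract, a per-token language ID embedding and dot-product attention into a shared latent space, can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives no architectural details, so the table sizes, the additive combination of phoneme and language embeddings, the scaling by the square root of the dimension, and every name below (`embed_tokens`, `project_to_shared_space`, the shared key/value slots) are assumptions made only for illustration, with random values standing in for learned parameters.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def make_table(n_rows, dim, rng):
    """Random stand-in for a learned embedding table (hypothetical)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_rows)]


def embed_tokens(phoneme_ids, lang_ids, phoneme_table, lang_table):
    """Add a language ID embedding to each phoneme embedding (assumed additive)."""
    return [
        [p + l for p, l in zip(phoneme_table[pid], lang_table[lid])]
        for pid, lid in zip(phoneme_ids, lang_ids)
    ]


def project_to_shared_space(queries, shared_keys, shared_values):
    """Scaled dot-product attention of each token over shared latent slots."""
    scale = math.sqrt(len(shared_keys[0]))
    dim_out = len(shared_values[0])
    outputs, weights = [], []
    for q in queries:
        w = softmax([dot(q, k) / scale for k in shared_keys])
        out = [sum(wi * v[d] for wi, v in zip(w, shared_values)) for d in range(dim_out)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


rng = random.Random(0)
DIM, N_SLOTS = 8, 4                      # illustrative sizes, not from the paper
phoneme_table = make_table(10, DIM, rng)
lang_table = make_table(2, DIM, rng)     # assumed IDs: 0 = Lao, 1 = English
shared_keys = make_table(N_SLOTS, DIM, rng)
shared_values = make_table(N_SLOTS, DIM, rng)

phoneme_ids = [3, 7, 1, 5]               # a toy code-switched phoneme sequence
lang_ids = [0, 0, 1, 1]                  # two Lao tokens, then two English tokens

tokens = embed_tokens(phoneme_ids, lang_ids, phoneme_table, lang_table)
latents, attn = project_to_shared_space(tokens, shared_keys, shared_values)
```

Because both languages attend over the same set of slots, every output vector is a convex combination of the same shared values, which is one simple way to realize a common latent space for Lao and English phonemes.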
Anthology ID:
2025.ccl-1.80
Volume:
Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025)
Month:
August
Year:
2025
Address:
Jinan, China
Editors:
Maosong Sun, Peiyong Duan, Zhiyuan Liu, Ruifeng Xu, Weiwei Sun
Venue:
CCL
Publisher:
Chinese Information Processing Society of China
Pages:
1067–1077
URL:
https://preview.aclanthology.org/ingest-ccl/2025.ccl-1.80/
Cite (ACL):
Yaping Liu, Linqin Wang, Shengxiang Gao, Zhengtao Yu, Ling Dong, and Tian Tian. 2025. Lao-English Code-Switched Speech Synthesis Via Neural Codec Language Modeling. In Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025), pages 1067–1077, Jinan, China. Chinese Information Processing Society of China.
Cite (Informal):
Lao-English Code-Switched Speech Synthesis Via Neural Codec Language Modeling (Liu et al., CCL 2025)
PDF:
https://preview.aclanthology.org/ingest-ccl/2025.ccl-1.80.pdf