A Fast and High-quality Text-to-Speech Method with Compressed Auxiliary Corpus and Limited Target Speaker Corpus

Ye Tao, Chaofeng Lu, Meng Liu, Kai Xu, Tianyu Liu, Yunlong Tian, Yongjie Du


Abstract
When an auxiliary corpus (a non-target speaker corpus) is used for model pre-training, Text-to-Speech (TTS) methods can generate high-quality speech from a limited target speaker corpus. However, this approach incurs high training costs. To address this challenge, we propose a high-quality TTS method that significantly reduces training costs while maintaining the naturalness of the synthesized speech. First, we propose an auxiliary corpus compression algorithm that reduces the training cost without significantly degrading the naturalness of the synthesized speech. We then use the compressed corpus to pre-train the proposed TTS model, CMDTTS, which fuses multi-level (phoneme and word) prosody modeling components and denoises the generated mel-spectrograms using denoising diffusion probabilistic models (DDPMs). In addition, we introduce a fine-tuning step based on a conditional generative adversarial network (cGAN), which uses the target speaker corpus to embed the target speaker's features and improve speech quality. Experiments on Chinese and English single-speaker corpora show that the method effectively balances model training speed against synthesized speech quality and outperforms current models.
Anthology ID:
2024.lrec-main.46
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
525–535
URL:
https://aclanthology.org/2024.lrec-main.46
Cite (ACL):
Ye Tao, Chaofeng Lu, Meng Liu, Kai Xu, Tianyu Liu, Yunlong Tian, and Yongjie Du. 2024. A Fast and High-quality Text-to-Speech Method with Compressed Auxiliary Corpus and Limited Target Speaker Corpus. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 525–535, Torino, Italia. ELRA and ICCL.
Cite (Informal):
A Fast and High-quality Text-to-Speech Method with Compressed Auxiliary Corpus and Limited Target Speaker Corpus (Tao et al., LREC-COLING 2024)
PDF:
https://preview.aclanthology.org/landing_page/2024.lrec-main.46.pdf