CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-Training

Zhenhui Ye, Rongjie Huang, Yi Ren, Ziyue Jiang, Jinglin Liu, Jinzheng He, Xiang Yin, Zhou Zhao


Abstract
Improving text representation has attracted much attention to achieve expressive text-to-speech (TTS). However, existing works only implicitly learn the prosody with masked token reconstruction tasks, which leads to low training efficiency and difficulty in prosody modeling. We propose CLAPSpeech, a cross-modal contrastive pre-training framework that learns from the prosody variance of the same text token under different contexts. Specifically, 1) with the design of a text encoder and a prosody encoder, we encourage the model to connect the text context with its corresponding prosody pattern in the joint multi-modal space; 2) we introduce a multi-scale pre-training pipeline to capture prosody patterns in multiple levels. 3) we show how to incorporate CLAPSpeech into existing TTS models for better prosody. Experiments on three datasets not only show that CLAPSpeech could improve the prosody prediction for existing TTS methods, but also demonstrate its generalization ability to adapt to multiple languages and multi-speaker text-to-speech. We also deeply analyze the principle behind the performance of CLAPSpeech. Ablation studies demonstrate the necessity of each component in CLAPSpeech. Source code and audio samples are available at https://clapspeech.github.io.
Anthology ID:
2023.acl-long.518
Volume:
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2023
Address:
Toronto, Canada
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
9317–9331
Language:
URL:
https://aclanthology.org/2023.acl-long.518
DOI:
10.18653/v1/2023.acl-long.518
Bibkey:
Cite (ACL):
Zhenhui Ye, Rongjie Huang, Yi Ren, Ziyue Jiang, Jinglin Liu, Jinzheng He, Xiang Yin, and Zhou Zhao. 2023. CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-Training. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9317–9331, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-Training (Ye et al., ACL 2023)
Copy Citation:
PDF:
https://preview.aclanthology.org/remove-xml-comments/2023.acl-long.518.pdf