Scaling Rich Style-Prompted Text-to-Speech Datasets

Anuj Diwan; Zhisheng Zheng; David Harwath; Eunsol Choi

Abstract

We introduce Paralinguistic Speech Captions (ParaSpeechCaps), a large-scale dataset that annotates speech utterances with rich style captions. While rich abstract tags (e.g., guttural, nasal, pained) have been explored in small-scale human-annotated datasets, existing large-scale datasets only cover basic tags (e.g., low-pitched, slow, loud). We combine off-the-shelf text and speech embedders, classifiers, and an audio language model to automatically scale rich tag annotations for the first time. ParaSpeechCaps covers a total of 59 style tags, including both speaker-level intrinsic tags and utterance-level situational tags. It consists of 282 hours of human-labelled data (PSC-Base) and 2427 hours of automatically annotated data (PSC-Scaled). We finetune Parler-TTS, an open-source style-prompted TTS model, on ParaSpeechCaps and achieve improved style consistency (+7.9% Consistency MOS) and speech quality (+15.5% Naturalness MOS) over the best-performing baseline, which combines existing rich style tag datasets. We ablate several of our dataset design choices to lay the foundation for future work in this space. Our dataset, models, and code are released at https://github.com/ajd12342/paraspeechcaps.

Anthology ID: 2025.emnlp-main.180
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 3639–3659
URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.180/
Cite (ACL): Anuj Diwan, Zhisheng Zheng, David Harwath, and Eunsol Choi. 2025. Scaling Rich Style-Prompted Text-to-Speech Datasets. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 3639–3659, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Scaling Rich Style-Prompted Text-to-Speech Datasets (Diwan et al., EMNLP 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.180.pdf
Checklist: 2025.emnlp-main.180.checklist.pdf