Prompt-Guided Selective Masking Loss for Context-Aware Emotive Text-to-Speech

Yejin Jeon, Youngjae Kim, Jihyun Lee, Gary Lee


Abstract
Emotional dialogue speech synthesis (EDSS) aims to generate expressive speech by leveraging the dialogue context between interlocutors. This is typically done by concatenating global representations of previous utterances as conditions for text-to-speech (TTS) systems. However, such approaches overlook the importance of integrating localized acoustic cues that convey emotion. To address this, we introduce a novel approach that utilizes a large language model (LLM) to generate holistic emotion tags based on prior dialogue context, while also pinpointing key words in the target utterance that align with the predicted emotional state. Furthermore, we enhance the emotional richness of synthesized speech by incorporating concentrated acoustic features of these key words through a novel selective audio masking loss function. This methodology not only improves emotional expressiveness, but also facilitates automatic emotional speech generation during inference by eliminating the need for manual emotion tag selection. Comprehensive subjective and objective evaluations and analyses demonstrate the effectiveness of the proposed approach.
Anthology ID:
2025.findings-naacl.38
Volume:
Findings of the Association for Computational Linguistics: NAACL 2025
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
638–650
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.38/
Cite (ACL):
Yejin Jeon, Youngjae Kim, Jihyun Lee, and Gary Lee. 2025. Prompt-Guided Selective Masking Loss for Context-Aware Emotive Text-to-Speech. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 638–650, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Prompt-Guided Selective Masking Loss for Context-Aware Emotive Text-to-Speech (Jeon et al., Findings 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.38.pdf