Chenda Li


2026

Emotional Text-to-Speech (TTS) aims to synthesize speech with human-like naturalness and expressiveness. However, existing systems rely on sentence-level labels, which fail to capture the subtle nuances of human affect. Drawing on cognitive appraisal theories, we argue that emotional expression is not generated in isolation but is deeply influenced by the speaker’s Personal Experience and the conversational Context. To overcome the information bottleneck inherent in traditional annotations, we present Emotional-Context-Speech, a large-scale, context-aware speech corpus derived from multi-speaker audiobooks. The dataset provides not only transcriptions but also dialogue context, personal experience, open-vocabulary emotion labels, and paralinguistic descriptions. Experimental results demonstrate that a TTS model trained with the additional context and experience descriptions as inputs, which we call Emotional-Context-TTS, significantly outperforms existing methods in emotional expression accuracy and naturalness.