Dongjie Fu

2026

While diffusion and flow-matching models have advanced TTS, generating high-arousal emotions remains a persistent challenge due to the trade-off between stability and expressiveness. Existing systems often suffer from linguistic collapse when pursuing high intensity or fail to meet target emotional levels under stable settings. In this work, we identify that standard Gaussian initialization inevitably introduces a neutral prosody bias, while uniform Classifier-Free Guidance often distorts the acoustic manifold, leading to artifacts. To address this, we propose an inference framework that rectifies the emotional trajectory. An Emotion-Rectified Noise Prior injects a semantic gradient at initialization to align sampling with the target emotional manifold, and Likelihood-Inverse Guidance adaptively schedules guidance via a conditional/unconditional likelihood ratio, strengthening guidance only when the trajectory drifts toward a neutral fallback. Extensive experiments demonstrate that our method effectively resolves the stability bottleneck in high-intensity scenarios, achieving superior linguistic accuracy and emotional fidelity without model retraining. Audio samples are available at https://showtts.github.io/emotionTTS/.

2025

Extensive research on LLM-based spoken dialogue systems has significantly advanced the development of intelligent voice assistants. However, the integration of role information within speech remains an underexplored area, limiting its application in real-world scenarios, particularly in multi-party dialogue settings. With the growing demand for personalization, voice assistants that can recognize and remember users establish a deeper connection with them. We focus on equipping LLMs with speaker-awareness capabilities and enhancing their understanding of character settings through synthetic data to generate contextually appropriate responses. We introduce Persona-Dialogue, the first large-scale multi-party spoken dialogue dataset that incorporates speaker profiles. Based on this dataset, we propose PAChat, an architecture that simultaneously models both linguistic content and speaker features, allowing LLMs to map character settings to speaker identities in speech. Through extensive experiments, we demonstrate that PAChat successfully achieves speaker-specific responses, character understanding, and the generation of targeted replies in multi-party dialogue scenarios, surpassing existing spoken dialogue systems.

Venues

ACL (1)
EMNLP (1)