Chain-Talker: Chain Understanding and Rendering for Empathetic Conversational Speech Synthesis

Yifan Hu, Rui Liu, Yi Ren, Xiang Yin, Haizhou Li


Abstract
Conversational Speech Synthesis (CSS) aims to align synthesized speech with the emotional and stylistic context of user-agent interactions to achieve empathy. Current generative CSS models face interpretability limitations due to insufficient emotional perception and redundant discrete speech coding. To address these issues, we present Chain-Talker, a three-stage framework that mimics human cognition: Emotion Understanding derives context-aware emotion descriptors from the dialogue history; Semantic Understanding generates compact semantic codes via serialized prediction; and Empathetic Rendering synthesizes expressive speech by integrating both outputs. To support emotion modeling, we develop CSS-EmCap, an LLM-driven automated pipeline for generating precise conversational speech emotion captions. Experiments on three benchmark datasets demonstrate that Chain-Talker produces more expressive and empathetic speech than existing methods, with CSS-EmCap contributing to reliable emotion modeling. The code and demos are available at: https://github.com/AI-S2-Lab/Chain-Talker.
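
The three-stage chain described in the abstract can be pictured as a simple function composition. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the names `Turn`, `chain_talk`, and the three stage callables are hypothetical, the stage signatures are guesses based on the abstract alone, and the toy stages stand in for the real models.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    """One utterance in the dialogue history (hypothetical container)."""
    speaker: str
    text: str


# Assumed stage signatures; the abstract does not specify them.
EmotionFn = Callable[[List[Turn]], str]             # history -> emotion descriptor
SemanticFn = Callable[[List[Turn], str], List[int]]  # history, reply text -> semantic codes
RenderFn = Callable[[str, List[int]], bytes]         # descriptor, codes -> audio bytes


def chain_talk(
    history: List[Turn],
    reply_text: str,
    understand_emotion: EmotionFn,
    understand_semantics: SemanticFn,
    render: RenderFn,
) -> bytes:
    """Run the three stages in order and return the rendered output."""
    descriptor = understand_emotion(history)          # Emotion Understanding
    codes = understand_semantics(history, reply_text)  # Semantic Understanding
    return render(descriptor, codes)                  # Empathetic Rendering


# Toy stand-ins so the sketch runs end to end; they are NOT real models.
def toy_emotion(history: List[Turn]) -> str:
    return "neutral" if not history else f"responding to: {history[-1].text}"


def toy_semantics(history: List[Turn], reply_text: str) -> List[int]:
    # One fake "code" per word: the word length.
    return [len(word) for word in reply_text.split()]


def toy_render(descriptor: str, codes: List[int]) -> bytes:
    # Pack the fake codes into bytes in place of a waveform.
    return bytes(code % 256 for code in codes)


audio = chain_talk(
    [Turn("user", "I failed my exam.")],
    "I am so sorry",
    toy_emotion,
    toy_semantics,
    toy_render,
)
```

Passing the stages as callables keeps the chain structure visible while leaving each stage replaceable; the actual models and interfaces are in the repository linked above.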
Anthology ID:
2025.findings-acl.101
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venues:
Findings | WS
Publisher:
Association for Computational Linguistics
Pages:
1988–2003
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.101/
Cite (ACL):
Yifan Hu, Rui Liu, Yi Ren, Xiang Yin, and Haizhou Li. 2025. Chain-Talker: Chain Understanding and Rendering for Empathetic Conversational Speech Synthesis. In Findings of the Association for Computational Linguistics: ACL 2025, pages 1988–2003, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Chain-Talker: Chain Understanding and Rendering for Empathetic Conversational Speech Synthesis (Hu et al., Findings 2025)
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.101.pdf