Berrak Sisman
2025
Multimodal Fine-grained Context Interaction Graph Modeling for Conversational Speech Synthesis
Zhenqi Jia | Rui Liu | Berrak Sisman | Haizhou Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Conversational Speech Synthesis (CSS) aims to generate speech with natural prosody by understanding the multimodal dialogue history (MDH). Recent work predicts the prosodic expression of the target utterance by modeling utterance-level interactions between the MDH and the target utterance. However, the MDH also contains fine-grained semantic and prosodic knowledge at the word level, and existing methods overlook interaction modeling at this level. To address this gap, we propose MFCIG-CSS, a novel Multimodal Fine-grained Context Interaction Graph-based CSS system. Our approach constructs two specialized multimodal fine-grained dialogue interaction graphs: a semantic interaction graph and a prosody interaction graph. These two graphs encode the interactions between word-level semantics and prosody, as well as their influence on subsequent utterances in the MDH. The encoded interaction features are then used to enhance synthesized speech with natural conversational prosody. Experiments on the DailyTalk dataset demonstrate that MFCIG-CSS outperforms all baseline models in terms of prosodic expressiveness. Code and speech samples are available at https://github.com/AI-S2-Lab/MFCIG-CSS.
2021
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue
Haizhou Li | Gina-Anne Levow | Zhou Yu | Chitralekha Gupta | Berrak Sisman | Siqi Cai | David Vandyke | Nina Dethlefs | Yan Wu | Junyi Jessy Li