Multi-Condition Guided Diffusion Network for Multimodal Emotion Recognition in Conversation

Wenjin Tian, Xianying Huang, Shihao Zou


Abstract
Emotion recognition in conversation (ERC) involves identifying the emotion label of each utterance in a conversation, a task essential for developing empathetic robots. Current research emphasizes contextual factors, speaker influence, and the extraction of complementary information across modalities, but it often overlooks cross-modal noise at the semantic level and the redundant information carried by the features themselves. This study introduces a diffusion-based approach that addresses redundant information and unexpected noise while robustly capturing shared semantics, thus facilitating the learning of compact and representative features from multimodal data. Specifically, we present the Multi-Condition Guided Diffusion Network (McDiff). McDiff employs a modal prior knowledge extraction strategy to derive a prior distribution for each modality, enhancing each modality's regional attention and applying the generated prior distribution at each diffusion step. Furthermore, we propose learning the mutual information of each modality under specific objective constraints before the forward process, which improves inter-modal interaction and mitigates the effects of noise and redundancy. Comprehensive experiments on two multimodal datasets, IEMOCAP and MELD, demonstrate that McDiff significantly surpasses state-of-the-art methods, confirming the generalizability and efficacy of the proposed model.
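To make the condition-guided mechanism concrete, the following is a minimal, purely illustrative sketch (not the authors' code) of a DDPM-style reverse process in which hypothetical per-modality prior vectors (text, audio, visual) condition a stand-in noise predictor at every diffusion step, as the abstract's description of McDiff suggests; the schedule, predictor, and fusion rule are all assumptions.

```python
import numpy as np

# Illustrative assumption: a standard linear DDPM noise schedule.
rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_theta(x_t, t, priors):
    """Stand-in noise predictor; a real model would be a trained network
    that attends to the per-modality prior distributions (hypothetical)."""
    cond = np.tanh(sum(priors))          # fuse text/audio/visual priors
    return 0.1 * (x_t - cond)            # toy prior-conditioned prediction

# Hypothetical modality priors (in McDiff these come from a modal prior
# knowledge extraction strategy; here they are random placeholders).
priors = [rng.standard_normal(8) * 0.1 for _ in range(3)]
x = rng.standard_normal(8)               # start from Gaussian noise x_T

for t in reversed(range(T)):             # condition-guided reverse steps
    eps = eps_theta(x, t, priors)
    mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
    noise = rng.standard_normal(8) if t > 0 else 0.0
    x = mean + np.sqrt(betas[t]) * noise # add noise except at the last step

print(x.shape)  # denoised multimodal feature vector of shape (8,)
```

The key point the sketch mirrors is that the generated prior distribution enters the predictor at every reverse step, rather than only at initialization.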
Anthology ID:
2025.findings-naacl.177
Volume:
Findings of the Association for Computational Linguistics: NAACL 2025
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
3215–3227
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.177/
Cite (ACL):
Wenjin Tian, Xianying Huang, and Shihao Zou. 2025. Multi-Condition Guided Diffusion Network for Multimodal Emotion Recognition in Conversation. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 3215–3227, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Multi-Condition Guided Diffusion Network for Multimodal Emotion Recognition in Conversation (Tian et al., Findings 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.177.pdf