Emotion-Wheel-Guided Audio-Referred Text Representation for Multimodal Emotion Recognition in Conversation
Eunseon Seong, Harim Lee, Dahye Kim, Changhyun Kim, Dong-Kyu Chae
Abstract
Multimodal Emotion Recognition in Conversation aims to identify emotions within a dialogue with multimodal data, including audio, visual, and textual features. While existing methods have made significant improvements, there are two fundamental limitations to be addressed. From the modality fusion perspective, current approaches treat all modalities as functionally equivalent during fusion, overlooking their distinct communicative roles and information capacities: text conveys explicit semantic meaning, while audio provides paralinguistic cues. From the emotion label perspective, many works ignore the continuous structure of emotion characterized by psychological theory and apply uniform penalties regardless of affective proximity. To address these limitations, we propose EMART, EMotion-Wheel-Guided Audio-Referred Text Representation for ERC, specifically focusing on audio and text modalities. First, we propose a modality-aware fusion strategy capturing linguistic features from text as the primary source and audio as a complementary component. Second, we propose an emotion-wheel-guided supervised contrastive loss to encode emotional proximity based on Russell’s circumplex model. Experimental results on IEMOCAP and MELD demonstrate outstanding performance. The code is available at: https://github.com/DILAB-HYU/EMART.git.
- Anthology ID:
- 2026.acl-long.1875
- Volume:
- Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 40392–40403
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.acl-long.1875/
- Cite (ACL):
- Eunseon Seong, Harim Lee, Dahye Kim, Changhyun Kim, and Dong-Kyu Chae. 2026. Emotion-Wheel-Guided Audio-Referred Text Representation for Multimodal Emotion Recognition in Conversation. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 40392–40403, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Emotion-Wheel-Guided Audio-Referred Text Representation for Multimodal Emotion Recognition in Conversation (Seong et al., ACL 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.acl-long.1875.pdf