Emotion-Wheel-Guided Audio-Referred Text Representation for Multimodal Emotion Recognition in Conversation

Eunseon Seong, Harim Lee, Dahye Kim, Changhyun Kim, Dong-Kyu Chae


Abstract
Multimodal Emotion Recognition in Conversation (ERC) aims to identify emotions within a dialogue from multimodal data, including audio, visual, and textual features. While existing methods have made significant improvements, two fundamental limitations remain. From the modality fusion perspective, current approaches treat all modalities as functionally equivalent during fusion, overlooking their distinct communicative roles and information capacities: text conveys explicit semantic meaning, while audio provides paralinguistic cues. From the emotion label perspective, many works ignore the continuous structure of emotion described by psychological theory and apply uniform penalties regardless of affective proximity. To address these limitations, we propose EMART, EMotion-Wheel-Guided Audio-Referred Text Representation for ERC, focusing specifically on the audio and text modalities. First, we propose a modality-aware fusion strategy that treats text as the primary source of linguistic features and audio as a complementary component. Second, we propose an emotion-wheel-guided supervised contrastive loss that encodes emotional proximity based on Russell’s circumplex model. Experimental results on IEMOCAP and MELD demonstrate outstanding performance. The code is available at: https://github.com/DILAB-HYU/EMART.git.
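To make the idea of non-uniform penalties concrete, the following is a minimal sketch of a supervised contrastive loss in which negatives are weighted by their distance in a valence-arousal plane. It is not the EMART implementation: the coordinates in `VA_COORDS`, the `1 + distance` weighting, and the function names are all assumptions chosen for illustration.

```python
import math

# Illustrative (valence, arousal) coordinates. These values are assumptions
# made for this sketch and are not taken from the paper.
VA_COORDS = {
    "happy": (0.8, 0.5),
    "excited": (0.6, 0.8),
    "angry": (-0.6, 0.8),
    "sad": (-0.7, -0.5),
    "neutral": (0.0, 0.0),
}


def affective_distance(a, b):
    """Euclidean distance between two emotion labels in the valence-arousal plane."""
    (va, aa), (vb, ab) = VA_COORDS[a], VA_COORDS[b]
    return math.hypot(va - vb, aa - ab)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm


def proximity_weighted_supcon(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss with distance-weighted negatives.

    Same-label pairs are positives. Every other sample enters the denominator
    with weight (1 + affective distance), so confusing an anchor with a distant
    emotion costs more than confusing it with a neighbouring one.
    The weighting scheme is an assumption of this sketch.
    """
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        denom = 0.0
        for k in range(n):
            if k == i:
                continue
            weight = 1.0 + affective_distance(labels[i], labels[k])
            sim = cosine(embeddings[i], embeddings[k]) / temperature
            denom += weight * math.exp(sim)
        loss_i = 0.0
        for j in positives:
            sim = cosine(embeddings[i], embeddings[j]) / temperature
            loss_i += math.log(denom) - sim
        total += loss_i / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Same embeddings, different negative label: a distant negative is penalised more.
embeddings = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.2]]
loss_near = proximity_weighted_supcon(embeddings, ["happy", "happy", "excited"])
loss_far = proximity_weighted_supcon(embeddings, ["happy", "happy", "sad"])
```

With identical embeddings, `loss_far` exceeds `loss_near` because "sad" lies farther from "happy" than "excited" does in the assumed coordinates, which is the behaviour a uniform penalty cannot express.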
Anthology ID:
2026.acl-long.1875
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
40392–40403
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1875/
Cite (ACL):
Eunseon Seong, Harim Lee, Dahye Kim, Changhyun Kim, and Dong-Kyu Chae. 2026. Emotion-Wheel-Guided Audio-Referred Text Representation for Multimodal Emotion Recognition in Conversation. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 40392–40403, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Emotion-Wheel-Guided Audio-Referred Text Representation for Multimodal Emotion Recognition in Conversation (Seong et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1875.pdf
Checklist:
2026.acl-long.1875.checklist.pdf