Locate and Explain: Joint Multimodal Emotion Cause Extraction and Summarization in Conversation

Jikun Wan, Chen Gong, Guohong Fu


Abstract
Multimodal emotion cause analysis in conversation aims to identify the causes of emotions by leveraging multimodal information. Existing studies mainly formulate this problem as either utterance-level emotion cause extraction, which provides clear cause localization but limited explanation, or multimodal emotion cause generation, which offers fine-grained explanations but lacks explicit traceability to source utterances. Moreover, existing datasets rely heavily on human judgment and lack well-defined, structured theoretical guidance, leading to subjective and inconsistent annotations. To address these issues, we introduce joint Multimodal Emotion Cause Extraction and Summarization in conversation (MECES), a new task that simultaneously extracts emotion cause utterances and generates cause summaries, enabling both precise localization and interpretable explanations of emotion causes. We further construct a MECES dataset guided by the Activating Events–Beliefs–Consequences theory from psychology. This dataset consists of 5,787 emotion utterances annotated with causes, comprising 12,231 emotion-cause pairs and 6,040 cause summaries. We also propose an effective end-to-end joint learning approach for the MECES task, establishing strong benchmark results for this newly introduced task and dataset.
Anthology ID:
2026.acl-long.2012
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
43472–43489
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.2012/
Cite (ACL):
Jikun Wan, Chen Gong, and Guohong Fu. 2026. Locate and Explain: Joint Multimodal Emotion Cause Extraction and Summarization in Conversation. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 43472–43489, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Locate and Explain: Joint Multimodal Emotion Cause Extraction and Summarization in Conversation (Wan et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.2012.pdf
Checklist:
 2026.acl-long.2012.checklist.pdf