JX4MEI: Multimodal Semantically-Enhanced LLM for Joint Multimodal Emotion-Intent Explanation and Classification
YiJie Huang, Xiaocui Yang, Shi Feng, Daling Wang, Yifei Zhang, Ning Yuan, Zhuoyue Jia, Wen Zhang
Abstract
Existing multimodal emotion and intent recognition tasks predominantly focus on classification, overlooking the underlying rationale and intrinsic connections between these states. To bridge this gap, we propose **Joint Multimodal Emotion-Intent Explanation and Classification (JX4MEI)**, a novel task that requires a model to jointly predict emotion and intent while generating natural language explanations for why they co-occur. To support this, we present **XMEI-dataset**, a large-scale benchmark of 15,461 multimodal samples spanning 7 emotion and 9 intent categories across text, audio, and visual modalities. Unlike prior work, our dataset provides fine-grained rationales for emotion, intent, and their causal interplay, curated via a rigorous pipeline involving Chain-of-Thought generation and strict human refinement to eliminate model artifacts. Furthermore, we propose **XMEI-Qwen**, a model equipped with a novel **Language-Query Former (LQ-Former)**. By using modality-specific captions as semantic queries, LQ-Former injects explicit semantic guidance into feature alignment, significantly enhancing reasoning capabilities. Experiments demonstrate that XMEI-Qwen sets a new state of the art on the JX4MEI task, outperforming competitive baselines in both prediction and explanation generation. Code: https://github.com/OrangeYeah1027/JX4MEI
- Anthology ID:
- 2026.findings-acl.1012
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 20242–20261
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1012/
- Cite (ACL):
- YiJie Huang, Xiaocui Yang, Shi Feng, Daling Wang, Yifei Zhang, Ning Yuan, Zhuoyue Jia, and Wen Zhang. 2026. JX4MEI: Multimodal Semantically-Enhanced LLM for Joint Multimodal Emotion-Intent Explanation and Classification. In Findings of the Association for Computational Linguistics: ACL 2026, pages 20242–20261, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- JX4MEI: Multimodal Semantically-Enhanced LLM for Joint Multimodal Emotion-Intent Explanation and Classification (Huang et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1012.pdf