Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning

Jiazheng Liu, Sipeng Zheng, Börje F. Karlsson, Zongqing Lu


Abstract
Multimodal large language models (MLLMs), built on large-scale pre-trained vision towers and language models, have shown strong capabilities in multimodal understanding. However, most existing MLLMs are trained on single-turn vision question-answering tasks, which do not accurately reflect real-world human conversations. In this paper, we introduce MMDiag, a new large-scale multi-turn multimodal dialogue dataset. The dataset is generated through a combination of deliberately designed rules and GPT assistance, and features complex dialogues with contextual dependencies that force models to track, ground, and recall information across multiple turns and disparate visual regions. MMDiag serves as a strong benchmark for multi-turn multimodal dialogue learning and poses greater challenges to the grounding and reasoning capabilities of MLLMs. Further, inspired by human vision processing, we present DiagNote, an MLLM equipped with multimodal grounding and reasoning capabilities. DiagNote adopts a novel dual-module architecture that explicitly separates reasoning from grounding: a reasoning module (Deliberate) performs step-by-step Chain-of-Thought reasoning, while a grounding module (Gaze) provides precise visual focus by predicting bounding-box annotations. These modules interact iteratively, enabling DiagNote to dynamically refine its understanding. We empirically demonstrate the advantages of DiagNote over existing MLLMs in both grounding and jointly processing and reasoning over vision and language information.
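
As a rough illustration of the dual-module design described in the abstract, the Python sketch below shows how a reasoning step (Deliberate) and a grounding step (Gaze) might alternate and accumulate "notes" over a dialogue turn. All names, signatures, the data structures, and the fixed number of iterations are assumptions made for illustration only; they are not the authors' actual implementation.

# Hypothetical sketch of an iterative Deliberate/Gaze loop.
# Interfaces, data structures, and the stopping criterion are
# illustrative assumptions, not the paper's actual implementation.

from dataclasses import dataclass, field
from typing import List, Tuple

BoundingBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized coordinates

@dataclass
class DialogueState:
    history: List[str]                                               # previous dialogue turns
    question: str                                                    # current user question
    thoughts: List[str] = field(default_factory=list)                # Chain-of-Thought steps (notes)
    focus_regions: List[BoundingBox] = field(default_factory=list)   # Gaze outputs (visual focus)

def deliberate(state: DialogueState) -> str:
    """Reasoning module: produce the next Chain-of-Thought step,
    conditioned on dialogue history and the current focus regions."""
    # In a real system this would be an LLM call; here it is a stub.
    return f"thought about '{state.question}' given {len(state.focus_regions)} focus regions"

def gaze(state: DialogueState, thought: str) -> BoundingBox:
    """Grounding module: predict a bounding box that the current thought refers to."""
    # Stub: a real module would regress box coordinates from image and text features.
    return (0.1, 0.1, 0.5, 0.5)

def answer_turn(state: DialogueState, max_steps: int = 3) -> str:
    """Alternate Deliberate and Gaze so each step can refine the other,
    then emit a final answer based on the accumulated notes and regions."""
    for _ in range(max_steps):
        thought = deliberate(state)
        state.thoughts.append(thought)
        state.focus_regions.append(gaze(state, thought))
    return f"answer derived from {len(state.thoughts)} thoughts and {len(state.focus_regions)} regions"

if __name__ == "__main__":
    state = DialogueState(history=["Q1", "A1"], question="What is to the left of the cup?")
    print(answer_turn(state))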
Anthology ID:
2025.emnlp-main.1690
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
33291–33312
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1690/
Cite (ACL):
Jiazheng Liu, Sipeng Zheng, Börje F. Karlsson, and Zongqing Lu. 2025. Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 33291–33312, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning (Liu et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1690.pdf
Checklist:
 2025.emnlp-main.1690.checklist.pdf