Benchmarking Egocentric Clinical Intent Understanding Capability for Medical Multimodal Large Language Models

Shaonan Liu, Guo Yu, Xiaoling Luo, Shiyi Zheng, Jie Liu, Wenting Chen, Linlin Shen


Abstract
Medical Multimodal Large Language Models (Med-MLLMs) require egocentric clinical intent understanding for real-world deployment, yet existing benchmarks fail to evaluate this critical capability. We introduce MedGaze-Bench, the first benchmark leveraging clinician gaze as a Cognitive Cursor to assess intent understanding across surgery, emergency simulation, and diagnostic interpretation. Our benchmark addresses three fundamental challenges: visual homogeneity of anatomical structures, strict temporal-causal dependencies in clinical workflows, and implicit adherence to safety protocols. We propose a Three-Dimensional Clinical Intent Framework that evaluates (1) Spatial Intent: discriminating precise targets amid visual noise; (2) Temporal Intent: inferring causal rationale through retrospective and prospective reasoning; and (3) Standard Intent: verifying protocol compliance through safety checks. Beyond accuracy metrics, we introduce Trap QA mechanisms to stress-test clinical reliability by penalizing hallucinations and cognitive sycophancy. Experiments reveal that current MLLMs struggle with egocentric intent due to over-reliance on global features, leading to fabricated observations and uncritical acceptance of invalid instructions.
Anthology ID:
2026.acl-long.1228
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
26682–26697
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1228/
Cite (ACL):
Shaonan Liu, Guo Yu, Xiaoling Luo, Shiyi Zheng, Jie Liu, Wenting Chen, and Linlin Shen. 2026. Benchmarking Egocentric Clinical Intent Understanding Capability for Medical Multimodal Large Language Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26682–26697, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Benchmarking Egocentric Clinical Intent Understanding Capability for Medical Multimodal Large Language Models (Liu et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1228.pdf
Checklist:
2026.acl-long.1228.checklist.pdf