EchoMLLM: Incentivizing Echocardiographic Video Understanding with Keyframe Grounding and Report Generation
Heyu Huang, Wanran Sun, Chi Chen, Bo Chen, Zonghao Guo, Yuhua Li, Ruixuan Li, Kunlun He, Maosong Sun
Abstract
Echocardiography analysis demands a dual capability: rigorous quantitative keyframe localization for evidence verification and comprehensive qualitative synthesis for diagnostic reporting. However, current Multi-Modal Large Language Models (MLLMs) struggle to meet these clinical requirements due to a misalignment with diagnostic workflows, a scarcity of video instruction data, and the critical challenge of cyclic temporal ambiguity, where the repetitive nature of cardiac cycles renders standard single-frame supervision ill-posed. To bridge this gap, we introduce EchoMLLM, a unified framework designed for real-world echocardiography video understanding. First, we align model capabilities with clinical needs by defining two fine-grained tasks: cycle- and pathology-conditioned keyframe grounding and video report generation. To facilitate this, we curate EchoMM-120k, a large-scale instruction dataset specifically constructed to support temporal localization and professional reporting. Furthermore, to resolve the cyclic ambiguity, we propose a multi-stage training paradigm incorporating a novel cycle-aware Reinforcement Learning (RL) strategy. By prioritizing logical consistency over rigid index matching, our approach moves beyond rote memorization to elicit invariant reasoning. Extensive experiments demonstrate that EchoMLLM reduces temporal grounding errors by up to 76% and improves report generation quality by 65% over its backbone, achieving state-of-the-art performance against both generalist and medical baselines.
- Anthology ID: 2026.findings-acl.1001
- Volume: Findings of the Association for Computational Linguistics: ACL 2026
- Month: July
- Year: 2026
- Address: San Diego, California, United States
- Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 20053–20071
- URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1001/
- Cite (ACL): Heyu Huang, Wanran Sun, Chi Chen, Bo Chen, Zonghao Guo, Yuhua Li, Ruixuan Li, Kunlun He, and Maosong Sun. 2026. EchoMLLM: Incentivizing Echocardiographic Video Understanding with Keyframe Grounding and Report Generation. In Findings of the Association for Computational Linguistics: ACL 2026, pages 20053–20071, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal): EchoMLLM: Incentivizing Echocardiographic Video Understanding with Keyframe Grounding and Report Generation (Huang et al., Findings 2026)
- PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1001.pdf