EchoMLLM: Incentivizing Echocardiographic Video Understanding with Keyframe Grounding and Report Generation

Heyu Huang; Wanran Sun; Chi Chen; Bo Chen; Zonghao Guo; Yuhua Li; Ruixuan Li; Kunlun He; Maosong Sun (孙茂松)

Abstract

Echocardiography analysis demands a dual capability: rigorous quantitative keyframe localization for evidence verification and comprehensive qualitative synthesis for diagnostic reporting. However, current Multi-Modal Large Language Models (MLLMs) struggle to meet these clinical requirements due to a misalignment with diagnostic workflows, a scarcity of video instruction data, and the critical challenge of cyclic temporal ambiguity—where the repetitive nature of cardiac cycles renders standard single-frame supervision ill-posed. To bridge this gap, we introduce EchoMLLM, a unified framework designed for real-world echocardiography video understanding. First, we align model capabilities with clinical needs by defining two fine-grained tasks: cycle- and pathology-conditioned keyframe grounding and video report generation. To facilitate this, we curate EchoMM-120k, a large-scale instruction dataset specifically constructed to support temporal localization and professional reporting. Furthermore, to resolve the cyclic ambiguity, we propose a multi-stage training paradigm incorporating a novel cycle-aware Reinforcement Learning (RL) strategy. By prioritizing logical consistency over rigid index matching, our approach moves beyond rote memorization to elicit invariant reasoning. Extensive experiments demonstrate that EchoMLLM reduces temporal grounding errors by up to 76% and improves report generation quality by 65% over its backbone, achieving state-of-the-art performance against both generalist and medical baselines.

Anthology ID: 2026.findings-acl.1001
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 20053–20071
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1001/
Cite (ACL): Heyu Huang, Wanran Sun, Chi Chen, Bo Chen, Zonghao Guo, Yuhua Li, Ruixuan Li, Kunlun He, and Maosong Sun. 2026. EchoMLLM: Incentivizing Echocardiographic Video Understanding with Keyframe Grounding and Report Generation. In Findings of the Association for Computational Linguistics: ACL 2026, pages 20053–20071, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): EchoMLLM: Incentivizing Echocardiographic Video Understanding with Keyframe Grounding and Report Generation (Huang et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1001.pdf
Checklist: 2026.findings-acl.1001.checklist.pdf
