MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays

Sunghwan Steve Cho, Yunseok Han, Jaeyoung Do


Abstract
Longitudinal chest X-ray (CXR) interpretation requires reasoning over disease evolution across multiple patient visits, yet most existing medical VQA benchmarks focus on single images or short-horizon image pairs. We introduce **MI-CXR**, a benchmark for standardized evaluation of **M**ulti-**I**nterval longitudinal reasoning over multi-visit **CXR** sequences, without requiring free-form report generation or additional clinical context. MI-CXR comprises five-way multiple-choice questions over five-visit patient timelines and instantiates three complementary task families: Temporal Event Localization, Interval-wise Change Reasoning, and Global Trajectory Summarization, which assess clinically grounded visual reasoning over time. Evaluating 14 state-of-the-art vision–language models (VLMs) shows low overall performance (29.3% accuracy), only modestly above random guessing. Using stage-wise diagnostic probing, we find that models often produce locally plausible interval descriptions but fail to enforce temporal constraints or compose evidence into globally consistent decisions over the full timeline. These findings reveal key limitations of current VLMs and establish MI-CXR as a principled benchmark for longitudinal medical reasoning. The benchmark is available at: https://github.com/AIDASLab/MI-CXR
Anthology ID:
2026.findings-acl.1512
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
30241–30273
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1512/
Cite (ACL):
Sunghwan Steve Cho, Yunseok Han, and Jaeyoung Do. 2026. MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays. In Findings of the Association for Computational Linguistics: ACL 2026, pages 30241–30273, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays (Cho et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1512.pdf
Checklist:
2026.findings-acl.1512.checklist.pdf