Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?

Yingjin Song, Yupei Du, Denis Paperno, Albert Gatt


Abstract
This paper introduces TempVS, a benchmark that targets the temporal grounding and reasoning capabilities of Multimodal Large Language Models (MLLMs) on image sequences. TempVS consists of three main tests (event relation inference, sentence ordering, and image ordering), each accompanied by a basic grounding test. TempVS requires MLLMs to rely on both visual and linguistic modalities to understand the temporal order of events. We evaluate 38 state-of-the-art MLLMs and show that they struggle to solve TempVS, with a substantial performance gap relative to humans. We also provide fine-grained insights that suggest promising directions for future research. Our TempVS benchmark data and code are available at https://github.com/yjsong22/TempVS.
Anthology ID: 2025.findings-acl.1248
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 24316–24342
URL: https://preview.aclanthology.org/mtsummit-25-ingestion/2025.findings-acl.1248/
DOI: 10.18653/v1/2025.findings-acl.1248
Cite (ACL):
Yingjin Song, Yupei Du, Denis Paperno, and Albert Gatt. 2025. Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?. In Findings of the Association for Computational Linguistics: ACL 2025, pages 24316–24342, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences? (Song et al., Findings 2025)
PDF: https://preview.aclanthology.org/mtsummit-25-ingestion/2025.findings-acl.1248.pdf