Are Multimodal LLMs Movie Buffs?

Carlo Bretti, Pascal Mettes, Nanne Van Noord


Abstract
No. While Multimodal Large Language Models (MLLMs) have been shown to perform very well on general video data, we systematically show that their performance on movies lags behind. This is surprising, as MLLMs are increasingly used for movie understanding. To measure the performance of MLLMs on movies, we explore three pillars of movie mastery: movie knowledge, cinematographic knowledge, and critical analysis. Through a combination of quantitative and in-depth qualitative evaluations, we identify where MLLMs show promise and, in particular, where they fail. Our findings show that in small-scale settings involving factual knowledge, MLLMs are able to outperform existing methods. However, once cinematographic and critical analysis is required, MLLMs cannot extract enough meaningful information from the visual modality to provide useful insights. The data and project page are available at https://carlobretti.github.io/moviebuff.
Anthology ID: 2026.findings-eacl.139
Volume: Findings of the Association for Computational Linguistics: EACL 2026
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Vera Demberg, Kentaro Inui, Lluís Marquez
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 2661–2677
URL: https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.139/
Cite (ACL): Carlo Bretti, Pascal Mettes, and Nanne Van Noord. 2026. Are Multimodal LLMs Movie Buffs?. In Findings of the Association for Computational Linguistics: EACL 2026, pages 2661–2677, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): Are Multimodal LLMs Movie Buffs? (Bretti et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.139.pdf
Checklist: 2026.findings-eacl.139.checklist.pdf