InfiniBench: A Benchmark for Large Multi-Modal Models in Long-Form Movies and TV Shows
Kirolos Ataallah, Eslam Mohamed Bakr, Mahmoud Ahmed, Chenhui Gou, Khushbu Pahwa, Jian Ding, Mohamed Elhoseiny
Abstract
Understanding long-form videos, such as movies and TV episodes ranging from tens of minutes to two hours, remains a significant challenge for multi-modal models. Existing benchmarks often fail to test the full range of cognitive skills needed to process these temporally rich and narratively complex inputs. We therefore introduce InfiniBench, a comprehensive benchmark designed to rigorously evaluate the capabilities of models in long video understanding. InfiniBench offers: (1) over 1,000 hours of video content, with an average video length of 53 minutes; (2) the largest set of question-answer pairs for long video comprehension, totaling around 87.7K; (3) eight diverse skills spanning both grounding-based abilities (e.g., scene transitions, character actions) and reasoning-based abilities (e.g., deep context understanding, multi-event linking); and (4) rich annotation formats, including both multiple-choice and open-ended questions. We conducted an in-depth evaluation across both commercial models (GPT-4o, Gemini 2.0 Flash) and the most recent open-source vision-language models, such as Qwen2.5-VL and InternVL3.0. Results reveal that: (1) models struggle across the board: even the best model, GPT-4o, achieves only 47.1% on grounding-based skills, with most models performing near or just above random chance; (2) models rely strongly on world knowledge: they achieve surprisingly high scores using only metadata (e.g., video titles), highlighting a tendency to rely on pre-trained knowledge rather than actual visual or temporal understanding; and (3) multi-modal input matters: when provided with full video and subtitle context, models show substantial improvements, confirming the critical role of multimodal input in video understanding. Our findings underscore the inherent challenges in long-video comprehension and point to the need for substantial advancements in both the grounding and reasoning capabilities of MLLMs.
- Anthology ID:
- 2025.emnlp-main.984
- Volume:
- Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 19496–19523
- URL:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.984/
- Cite (ACL):
- Kirolos Ataallah, Eslam Mohamed Bakr, Mahmoud Ahmed, Chenhui Gou, Khushbu Pahwa, Jian Ding, and Mohamed Elhoseiny. 2025. InfiniBench: A Benchmark for Large Multi-Modal Models in Long-Form Movies and TV Shows. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 19496–19523, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- InfiniBench: A Benchmark for Large Multi-Modal Models in Long-Form Movies and TV Shows (Ataallah et al., EMNLP 2025)
- PDF:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.984.pdf