Wang Xiao
2026
Video-MMMU: Evaluating Knowledge Acquisition from Multidisciplinary Professional Videos
Kairui Hu | Penghao Wu | Fanyi Pu | Wang Xiao | Xiang Yue | Bo Li | Yuanhan Zhang | Ziwei Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve as an effective medium for knowledge acquisition, facilitating a natural progression through these learning stages. However, existing video benchmarks fail to evaluate the knowledge acquisition capabilities of Large Multimodal Models (LMMs). To address this gap, we introduce Video-MMMU, a multi-modal, multi-discipline, multi-track benchmark that evaluates LMMs’ ability to acquire knowledge from college-level educational videos. Video-MMMU features a collection of 300 videos and 900 human-annotated questions across six disciplines, evaluating knowledge acquisition through stage-aligned question-answer pairs: Perception, Comprehension, and Adaptation. Beyond measuring final accuracy, Video-MMMU proposes a performance gain metric that quantifies an LMM’s learning gain from video, shifting the focus of evaluation from absolute performance to learning efficiency. Our evaluation reveals a substantial gap between human learners and current LMMs, highlighting the need to improve models’ ability to learn and adapt knowledge from video content.
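The abstract names a performance gain metric but does not give its formula. The following is a minimal sketch, assuming a normalized-gain formulation: the accuracy improvement after watching the video, divided by the headroom that remained before watching. The function name `normalized_gain`, its parameters, and the percent scale are illustrative assumptions, not the authors' stated definition.

```python
def normalized_gain(acc_before: float, acc_after: float) -> float:
    """Hypothetical learning-gain score, in percent.

    acc_before: accuracy (0-100) on the questions without the video.
    acc_after:  accuracy (0-100) on the same questions with the video.

    Assumed formula: (after - before) / (100 - before) * 100.
    A negative result means accuracy dropped after watching the video.
    """
    for name, value in (("acc_before", acc_before), ("acc_after", acc_after)):
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} must be in [0, 100], got {value}")
    if acc_before == 100.0:
        # No headroom left, so the normalized gain is undefined.
        raise ValueError("gain is undefined when acc_before is 100")
    return (acc_after - acc_before) / (100.0 - acc_before) * 100.0


if __name__ == "__main__":
    # Illustrative numbers only, not results from the benchmark.
    print(normalized_gain(40.0, 55.0))
```

Under this assumption, normalizing by headroom keeps a model that starts with high accuracy from being penalized for having little room to improve, which fits the stated shift from absolute performance to learning efficiency.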