Fanyi Pu
2026
Video-MMMU: Evaluating Knowledge Acquisition from Multidisciplinary Professional Videos
Kairui Hu | Penghao Wu | Fanyi Pu | Wang Xiao | Xiang Yue | Bo Li | Yuanhan Zhang | Ziwei Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve as an effective medium for knowledge acquisition, facilitating a natural progression through these learning stages. However, existing video benchmarks fail to evaluate the knowledge acquisition capabilities of Large Multimodal Models (LMMs). To address this gap, we introduce Video-MMMU, a multi-modal, multi-discipline, multi-track benchmark that evaluates LMMs’ ability to acquire knowledge from college-level, educational videos. Video-MMMU features a collection of 300 videos and 900 human-annotated questions across six disciplines, evaluating knowledge acquisition through stage-aligned question-answer pairs: Perception, Comprehension, and Adaptation. Beyond measuring final accuracy, Video-MMMU proposes the performance gain metric that quantifies an LMM’s learning gain from video, shifting the focus of evaluation from absolute performance to learning efficiency. Our evaluation reveals a substantial gap between human learners and current LMMs, highlighting the need to improve models’ ability to learn and adapt knowledge from video content.
2025
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
Kaichen Zhang | Bo Li | Peiyuan Zhang | Fanyi Pu | Joshua Adrian Cahyono | Kairui Hu | Shuai Liu | Yuanhan Zhang | Jingkang Yang | Chunyuan Li | Ziwei Liu
Findings of the Association for Computational Linguistics: NAACL 2025
The advances of large foundation models necessitate wide-coverage, low-cost, and zero-contamination benchmarks. Despite continuous exploration of language model evaluations, comprehensive studies on the evaluation of Large Multi-modal Models (LMMs) remain limited. In this work, we introduce LMMS-EVAL, a unified and standardized multimodal benchmark framework with over 50 tasks and more than 10 models to promote transparent and reproducible evaluations. Although LMMS-EVAL offers comprehensive coverage, we find it still falls short in achieving low cost and zero contamination. To approach this evaluation trilemma, we further introduce LMMS-EVAL LITE, a pruned evaluation toolkit that emphasizes both coverage and efficiency. Additionally, we present Multimodal LIVEBENCH that utilizes continuously updating news and online forums to assess models’ generalization abilities in the wild, featuring a low-cost and zero-contamination evaluation approach. In summary, our work highlights the importance of considering the evaluation trilemma and provides practical solutions to navigate the trade-offs in evaluating large multi-modal models, paving the way for more effective and reliable benchmarking of LMMs.