PedagogyBench: A Cognitive-Driven Benchmark for Multimodal Instructional Video Understanding
Xiaokang Jin, Jia Zhu, Jingjiang Liu, Yabing Shi, Jueqi Guan, Hao Chen, Pasquale De Meo
Abstract
Existing video understanding benchmarks mainly emphasize general visual recognition and reasoning, but do not adequately capture the pedagogical logic embedded in instructional videos. To address this gap, we present PedagogyBench, a multimodal benchmark for instructional video understanding grounded in pedagogical cognition. We introduce a pedagogy-driven segmentation strategy and a dual-stream semantic injection pipeline that combines machine pre-annotation with expert refinement, enabling the construction of a dataset organized around a cognitive pyramid with four levels and 20 fine-grained tasks. We further propose the Cognitive Fidelity Score (CFS) to measure the balance of model performance across pedagogical cognitive dimensions. Experiments on 12 multimodal large language models reveal a clear generative gap, where models perform relatively well on discriminative tasks but degrade on higher-order pedagogical diagnosis, often relying on parametric memory rather than grounded visual perception. Project resources are available at https://github.com/Shallcom/PedagogyBench.
- Anthology ID:
- 2026.findings-acl.614
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 12621–12647
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.614/
- Cite (ACL):
- Xiaokang Jin, Jia Zhu, Jingjiang Liu, Yabing Shi, Jueqi Guan, Hao Chen, and Pasquale De Meo. 2026. PedagogyBench: A Cognitive-Driven Benchmark for Multimodal Instructional Video Understanding. In Findings of the Association for Computational Linguistics: ACL 2026, pages 12621–12647, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- PedagogyBench: A Cognitive-Driven Benchmark for Multimodal Instructional Video Understanding (Jin et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.614.pdf