PedagogyBench: A Cognitive-Driven Benchmark for Multimodal Instructional Video Understanding

Xiaokang Jin, Jia Zhu, Jingjiang Liu, Yabing Shi, Jueqi Guan, Hao Chen, Pasquale De Meo


Abstract
Existing video understanding benchmarks mainly emphasize general visual recognition and reasoning, but do not adequately capture the pedagogical logic embedded in instructional videos. To address this gap, we present PedagogyBench, a multimodal benchmark for instructional video understanding grounded in pedagogical cognition. We introduce a pedagogy-driven segmentation strategy and a dual-stream semantic injection pipeline that combines machine pre-annotation with expert refinement, enabling the construction of a dataset organized around a cognitive pyramid with four levels and 20 fine-grained tasks. We further propose the Cognitive Fidelity Score (CFS) to measure the balance of model performance across pedagogical cognitive dimensions. Experiments on 12 multimodal large language models reveal a clear generative gap, where models perform relatively well on discriminative tasks but degrade on higher-order pedagogical diagnosis, often relying on parametric memory rather than grounded visual perception. Project resources are available at https://github.com/Shallcom/PedagogyBench.
Anthology ID:
2026.findings-acl.614
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
12621–12647
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.614/
Cite (ACL):
Xiaokang Jin, Jia Zhu, Jingjiang Liu, Yabing Shi, Jueqi Guan, Hao Chen, and Pasquale De Meo. 2026. PedagogyBench: A Cognitive-Driven Benchmark for Multimodal Instructional Video Understanding. In Findings of the Association for Computational Linguistics: ACL 2026, pages 12621–12647, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
PedagogyBench: A Cognitive-Driven Benchmark for Multimodal Instructional Video Understanding (Jin et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.614.pdf
Checklist:
2026.findings-acl.614.checklist.pdf