Xiaokang Jin
2026
KCVR: Knowledge-Centric Video Reconstruction for Structured Pedagogical Summarization via Dynamic Graph Planning
Jingjiang Liu | Jia Zhu | Hanghui Guo | Weijie Shi | Yue Cui | Xiaokang Jin | Yilin Wang | Qingyu Niu | Jiawei Shen | Guoqing Ma | Yidan Liang | Shimin Di | Jiajie Xu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Existing video summarization methods mainly compress content for gist browsing, but they often break the prerequisite logic in instructional videos and induce logical inversions (e.g., conclusions before premises). We formalize this problem as Structure-Pedagogical Reconstruction (SPR). SPR raises two challenges: (1) Structure Hallucination, where retrieved knowledge is topologically valid but not evidence-grounded by the blackboard; and (2) Logical Inversion, where soft prompt-level graph injection fails to enforce prerequisite order during decoding. To address these challenges, we propose Knowledge-Centric Video Reconstruction (KCVR), a Plan-then-Generate neuro-symbolic framework that decouples epistemic planning from content generation. KCVR prunes a Dual-Layer Epistemic Graph into a minimal video-supported plan, then realizes the plan with visually anchored attention and topology-constrained decoding. We additionally release EduStruct, a 10-discipline benchmark for SPR and structure-centric evaluation. Experiments show that KCVR outperforms strong end-to-end baselines on Knowledge Progression Consistency and Learning Objective Coverage. Our code and data are available at https://github.com/mark1001-ljj/video_sum.
PedagogyBench: A Cognitive-Driven Benchmark for Multimodal Instructional Video Understanding
Xiaokang Jin | Jia Zhu | Jingjiang Liu | Yabing Shi | Jueqi Guan | Hao Chen | Pasquale De Meo
Findings of the Association for Computational Linguistics: ACL 2026
Existing video understanding benchmarks mainly emphasize general visual recognition and reasoning, but do not adequately capture the pedagogical logic embedded in instructional videos. To address this gap, we present PedagogyBench, a multimodal benchmark for instructional video understanding grounded in pedagogical cognition. We introduce a pedagogy-driven segmentation strategy and a dual-stream semantic injection pipeline that combines machine pre-annotation with expert refinement, enabling the construction of a dataset organized around a cognitive pyramid with four levels and 20 fine-grained tasks. We further propose the Cognitive Fidelity Score (CFS) to measure the balance of model performance across pedagogical cognitive dimensions. Experiments on 12 multimodal large language models reveal a clear generative gap, where models perform relatively well on discriminative tasks but degrade on higher-order pedagogical diagnosis, often relying on parametric memory rather than grounded visual perception. Project resources are available at https://github.com/Shallcom/PedagogyBench.