MSVBench: Towards Human-Level Evaluation of Multi-Shot Video Generation

Haoyuan Shi, Yunxin Li, Nanhao Deng, Zhenran Xu, Xinyu Chen, Longyue Wang, Baotian Hu, Min Zhang


Abstract
The evolution of video generation toward complex, multi-shot narratives has exposed a critical deficit in current evaluation methods. Existing benchmarks remain anchored to single-shot paradigms, lacking the comprehensive story assets and cross-shot metrics required to assess long-form coherence and appeal. To bridge this gap, we introduce MSVBench, the first comprehensive benchmark featuring hierarchical scripts and reference images tailored for Multi-Shot Video generation. We propose a hybrid evaluation framework that synergizes the high-level semantic reasoning of Large Multimodal Models (LMMs) with the fine-grained perceptual rigor of domain-specific expert models. Evaluating 20 video generation methods across diverse paradigms, we find that current models—despite strong visual fidelity—primarily behave as visual interpolators rather than true world models. We further validate the reliability of our benchmark by demonstrating a state-of-the-art Spearman’s rank correlation of 0.944 with human judgments. Finally, MSVBench extends beyond evaluation by providing a scalable supervisory signal. Fine-tuning a lightweight model on its pipeline-refined reasoning traces yields human-aligned performance comparable to commercial models like Gemini-2.5-Flash.
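The abstract reports agreement with human judgments as a Spearman's rank correlation. As a minimal sketch of how such a figure is computed in general, the following ranks two score lists (averaging tied ranks) and takes the Pearson correlation of the ranks. The function names and the sample scores are hypothetical illustrations, not data or code from the paper, and the paper's actual aggregation protocol is not described on this page.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for five systems (illustrative only, not from the paper).
benchmark_scores = [0.62, 0.71, 0.55, 0.90, 0.84]
human_scores = [3.1, 3.4, 2.8, 4.2, 4.5]
rho = spearman_rho(benchmark_scores, human_scores)
```

With these made-up scores, the benchmark and the human raters disagree only on the order of the top two systems, so `rho` is high but below 1.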
Anthology ID:
2026.findings-acl.1203
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
24034–24058
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1203/
Cite (ACL):
Haoyuan Shi, Yunxin Li, Nanhao Deng, Zhenran Xu, Xinyu Chen, Longyue Wang, Baotian Hu, and Min Zhang. 2026. MSVBench: Towards Human-Level Evaluation of Multi-Shot Video Generation. In Findings of the Association for Computational Linguistics: ACL 2026, pages 24034–24058, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
MSVBench: Towards Human-Level Evaluation of Multi-Shot Video Generation (Shi et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1203.pdf
Checklist:
 2026.findings-acl.1203.checklist.pdf