QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering

Woojun Jung, Junyeong Kim


Abstract
Video-to-text summarization still lacks comprehensive evaluation methods. Traditional n-gram overlap-based metrics and recent large language model (LLM)-based approaches depend heavily on human-written reference summaries, which limits their practicality and their sensitivity to nuanced semantic aspects. In this paper, we propose QEVA, a reference-free metric that evaluates candidate summaries directly against source videos through multimodal question answering. QEVA assesses summaries along three clear dimensions: Coverage, Factuality, and Temporal Coherence. We also introduce MLVU(VS)-Eval, a new annotated benchmark derived from the MLVU dataset, comprising 800 summaries generated from 200 videos using state-of-the-art video-language multimodal models. This dataset establishes a transparent and consistent framework for evaluation. Experimental results demonstrate that QEVA correlates more strongly with human judgments than existing approaches, as measured by Kendall's τ_b, τ_c, and Spearman's ρ. We hope that our benchmark and metric will facilitate meaningful progress in video-to-text summarization research and provide valuable insights for the development of future evaluation methods.
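As a minimal sketch of how the rank correlations named in the abstract (Kendall's τ_b, τ_c, and Spearman's ρ) are typically computed between automatic metric scores and human judgments, the following Python snippet uses SciPy. The variable names and example values are illustrative assumptions, not the authors' code or data.

# Hypothetical sketch: correlating metric scores with human judgments using
# the rank-correlation coefficients cited in the abstract. Scores are made up.
from scipy.stats import kendalltau, spearmanr

metric_scores = [0.72, 0.41, 0.88, 0.55, 0.63, 0.30, 0.91, 0.47]  # illustrative
human_scores = [4, 2, 5, 3, 4, 1, 5, 3]                            # illustrative

tau_b, p_b = kendalltau(metric_scores, human_scores, variant="b")
tau_c, p_c = kendalltau(metric_scores, human_scores, variant="c")
rho, p_r = spearmanr(metric_scores, human_scores)

print(f"Kendall tau-b: {tau_b:.3f} (p={p_b:.3f})")
print(f"Kendall tau-c: {tau_c:.3f} (p={p_c:.3f})")
print(f"Spearman rho:  {rho:.3f} (p={p_r:.3f})")

A higher coefficient indicates that the metric's ranking of summaries more closely matches the human ranking, which is the comparison the paper reports against existing approaches.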
Anthology ID:
2025.findings-emnlp.1340
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
24632–24642
URL:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1340/
DOI:
10.18653/v1/2025.findings-emnlp.1340
Cite (ACL):
Woojun Jung and Junyeong Kim. 2025. QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 24632–24642, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering (Jung & Kim, Findings 2025)
PDF:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1340.pdf
Checklist:
2025.findings-emnlp.1340.checklist.pdf