QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering

Woojun Jung, Junyeong Kim


Abstract
Video-to-text summarization still lacks comprehensive evaluation methods. Traditional n-gram overlap-based metrics and recent large language model (LLM)-based approaches depend heavily on human-written reference summaries, which limits their practicality and their sensitivity to nuanced semantic aspects. In this paper, we propose QEVA, a reference-free metric that evaluates candidate summaries directly against source videos through multimodal question answering. QEVA assesses summaries along three clear dimensions: Coverage, Factuality, and Temporal Coherence. We also introduce MLVU(VS)-Eval, a new annotated benchmark derived from the MLVU dataset, comprising 800 summaries generated from 200 videos using state-of-the-art video-language multimodal models. This dataset establishes a transparent and consistent framework for evaluation. Experimental results demonstrate that QEVA correlates more strongly with human judgments than existing approaches, as measured by Kendall's τ_b, τ_c, and Spearman's ρ. We hope that our benchmark and metric will facilitate meaningful progress in video-to-text summarization research and provide valuable insights for the development of future evaluation methods.
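As a minimal sketch of how the rank correlations named in the abstract (Kendall's τ_b, τ_c, and Spearman's ρ) are typically computed between automatic metric scores and human judgments, the following Python snippet uses SciPy. The variable names and example values are illustrative assumptions, not the authors' code or data.

# Hypothetical sketch: correlating metric scores with human judgments using
# the rank-correlation coefficients cited in the abstract. Scores are made up.
from scipy.stats import kendalltau, spearmanr

metric_scores = [0.72, 0.41, 0.88, 0.55, 0.63, 0.30, 0.91, 0.47]  # illustrative
human_scores = [4, 2, 5, 3, 4, 1, 5, 3]                            # illustrative

tau_b, p_b = kendalltau(metric_scores, human_scores, variant="b")
tau_c, p_c = kendalltau(metric_scores, human_scores, variant="c")
rho, p_r = spearmanr(metric_scores, human_scores)

print(f"Kendall tau-b: {tau_b:.3f} (p={p_b:.3f})")
print(f"Kendall tau-c: {tau_c:.3f} (p={p_c:.3f})")
print(f"Spearman rho:  {rho:.3f} (p={p_r:.3f})")

A higher coefficient indicates that the metric's ranking of summaries more closely matches the human ranking, which is the comparison the paper reports against existing approaches.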
Anthology ID:
2025.findings-emnlp.1340
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
24632–24642
URL:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1340/
DOI:
10.18653/v1/2025.findings-emnlp.1340
Cite (ACL):
Woojun Jung and Junyeong Kim. 2025. QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 24632–24642, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering (Jung & Kim, Findings 2025)
PDF:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1340.pdf
Checklist:
2025.findings-emnlp.1340.checklist.pdf