Bridging Multimodal and Video Summarization: A Unified Survey

Haopeng Zhang


Abstract
Multimodal summarization (MMS) and video summarization (VS) have traditionally evolved in separate communities—natural language processing (NLP) and computer vision (CV), respectively. MMS focuses on generating textual summaries from inputs such as text, images, or audio, while VS emphasizes selecting key visual content. With the recent rise of vision-language models (VLMs), these once-disparate tasks are converging under a unified framework that integrates visual and linguistic understanding. In this survey, we provide a unified perspective that bridges MMS and VS. We formalize the task landscape, review key datasets and evaluation metrics, and categorize major modeling approaches into a new taxonomy. In addition, we highlight core challenges and outline future directions toward building general-purpose multimodal summarization systems. By synthesizing insights from both NLP and CV communities, this survey aims to establish a coherent foundation for advancing this rapidly evolving field.
Anthology ID:
2025.newsum-main.11
Volume:
Proceedings of The 5th New Frontiers in Summarization Workshop
Month:
November
Year:
2025
Address:
Hybrid
Editors:
Yue Dong, Wen Xiao, Haopeng Zhang, Rui Zhang, Ori Ernst, Lu Wang, Fei Liu
Venues:
NewSum | WS
Publisher:
Association for Computational Linguistics
Pages:
157–171
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.newsum-main.11/
Cite (ACL):
Haopeng Zhang. 2025. Bridging Multimodal and Video Summarization: A Unified Survey. In Proceedings of The 5th New Frontiers in Summarization Workshop, pages 157–171, Hybrid. Association for Computational Linguistics.
Cite (Informal):
Bridging Multimodal and Video Summarization: A Unified Survey (Zhang, NewSum 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.newsum-main.11.pdf