Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation
Galann Pennec, Zhengyuan Liu, Nicholas Asher, Philippe Muller, Nancy F. Chen
Abstract
Vision-Language Models (VLMs) often struggle to balance visual and textual information when summarizing complex multimodal inputs, such as entire TV show episodes. In this paper, we propose a zero-shot video-to-text summarization approach that builds its own screenplay-like representation of an episode, effectively integrating key video moments, dialogue, and character information into a unified document. Unlike previous approaches, we simultaneously generate screenplays and name the characters in zero-shot, using only the audio, video, and transcripts as input. Additionally, we highlight that existing summarization metrics can fail to assess the multimodal content in summaries. To address this, we introduce MFactSum, a multimodal metric that evaluates summaries with respect to both vision and text modalities. Using MFactSum, we evaluate our screenplay summaries on the SummScreen3D dataset, demonstrating superiority against state-of-the-art VLMs such as Gemini 1.5 by generating summaries containing 20% more relevant visual information while requiring 75% less of the video as input.- Anthology ID:
- 2025.ijcnlp-long.129
- Volume:
- Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
- Month:
- December
- Year:
- 2025
- Address:
- Mumbai, India
- Editors:
- Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, Dhirendra Pratap Singh
- Venues:
- IJCNLP | AACL
- SIG:
- Publisher:
- The Asian Federation of Natural Language Processing and The Association for Computational Linguistics
- Note:
- Pages:
- 2403–2426
- Language:
- URL:
- https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-long.129/
- DOI:
- Cite (ACL):
- Galann Pennec, Zhengyuan Liu, Nicholas Asher, Philippe Muller, and Nancy F. Chen. 2025. Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 2403–2426, Mumbai, India. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics.
- Cite (Informal):
- Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation (Pennec et al., IJCNLP-AACL 2025)
- PDF:
- https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-long.129.pdf