What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations

Dongqi Liu; Chenxi Whitehouse; Xi Yu; Louis Mahon; Rohit Saxena; Zheng Zhao; Yifu Qiu; Mirella Lapata; Vera Demberg

Abstract

Transforming recorded videos into concise and accurate textual summaries is a growing challenge in multimodal learning. This paper introduces VISTA, a dataset specifically designed for video-to-text summarization in scientific domains. VISTA contains 18,599 recorded AI conference presentations paired with their corresponding paper abstracts. We benchmark the performance of state-of-the-art large models and apply a plan-based framework to better capture the structured nature of abstracts. Both human and automated evaluations confirm that explicit planning enhances summary quality and factual consistency. However, a considerable gap remains between models and human performance, highlighting the challenges of our dataset. This study aims to pave the way for future research on scientific video-to-text summarization.

Anthology ID: 2025.acl-long.310
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 6187–6210
URL: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.310/
Cite (ACL): Dongqi Liu, Chenxi Whitehouse, Xi Yu, Louis Mahon, Rohit Saxena, Zheng Zhao, Yifu Qiu, Mirella Lapata, and Vera Demberg. 2025. What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6187–6210, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations (Liu et al., ACL 2025)
PDF: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.310.pdf
