Xi Yu
2025
What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations
Dongqi Liu | Chenxi Whitehouse | Xi Yu | Louis Mahon | Rohit Saxena | Zheng Zhao | Yifu Qiu | Mirella Lapata | Vera Demberg
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Transforming recorded videos into concise and accurate textual summaries is a growing challenge in multimodal learning. This paper introduces VISTA, a dataset specifically designed for video-to-text summarization in scientific domains. VISTA contains 18,599 recorded AI conference presentations paired with their corresponding paper abstracts. We benchmark the performance of state-of-the-art large models and apply a plan-based framework to better capture the structured nature of abstracts. Both human and automated evaluations confirm that explicit planning enhances summary quality and factual consistency. However, a considerable gap remains between models and human performance, highlighting the challenges of our dataset. This study aims to pave the way for future research on scientific video-to-text summarization.
Co-authors
- Vera Demberg 1
- Mirella Lapata 1
- Dongqi Liu 1
- Louis Mahon 1
- Yifu Qiu 1
Venues
- ACL 1