Let’s Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought

Vaishnavi Himakunthala; Andy Ouyang; Daniel Rose; Ryan He; Alex Mei; Yujie Lu; Chinmay Sonar; Michael Saxon; William Wang

doi:10.18653/v1/2023.emnlp-main.15

Let’s Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought

Vaishnavi Himakunthala, Andy Ouyang, Daniel Rose, Ryan He, Alex Mei, Yujie Lu, Chinmay Sonar, Michael Saxon, William Wang

Abstract

Despite exciting recent results showing vision-language systems’ capacity to reason about images using natural language, their capacity for video reasoning remains underexplored. We motivate framing video reasoning as the sequential understanding of a small number of keyframes, thereby leveraging the power and robustness of vision-language while alleviating the computational complexities of processing videos. To evaluate this novel application, we introduce VIP, an inference-time challenge dataset designed to explore models’ reasoning capabilities through video chain-of-thought. Inspired by visually descriptive scene plays, we propose two formats for keyframe description: unstructured dense captions and structured scene descriptions that identify the focus, action, mood, objects, and setting (FAMOuS) of the keyframe. To evaluate video reasoning, we propose two tasks: Video Infilling and Video Prediction, which test abilities to generate multiple intermediate keyframes and predict future keyframes, respectively. We benchmark GPT-4, GPT-3, and VICUNA on VIP, demonstrate the performance gap in these complex video reasoning tasks, and encourage future work to prioritize language models for efficient and generalized video reasoning.

Anthology ID:: 2023.emnlp-main.15
Volume:: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:: December
Year:: 2023
Address:: Singapore
Editors:: Houda Bouamor, Juan Pino, Kalika Bali
Venue:: EMNLP
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 204–219
Language:
URL:: https://aclanthology.org/2023.emnlp-main.15
DOI:: 10.18653/v1/2023.emnlp-main.15
Bibkey:
Cite (ACL):: Vaishnavi Himakunthala, Andy Ouyang, Daniel Rose, Ryan He, Alex Mei, Yujie Lu, Chinmay Sonar, Michael Saxon, and William Wang. 2023. Let’s Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 204–219, Singapore. Association for Computational Linguistics.
Cite (Informal):: Let’s Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought (Himakunthala et al., EMNLP 2023)
Copy Citation:
PDF:: https://preview.aclanthology.org/ml4al-ingestion/2023.emnlp-main.15.pdf
Video:: https://preview.aclanthology.org/ml4al-ingestion/2023.emnlp-main.15.mp4

PDF Search Video