Generating Questions, Answers, and Distractors for Videos: Exploring Semantic Uncertainty of Object Motions

Wenjian Ding, Yao Zhang, Jun Wang, Adam Jatowt, Zhenglu Yang


Abstract
Video Question-Answer-Distractors (QADs) show promising value for assessing how well systems perceive and comprehend multimedia content. Given the significant cost and labor demands of manual annotation, existing large-scale video QAD benchmarks are typically generated automatically from video captions. Since video captions are incomplete representations of the visual content and susceptible to error propagation, generating QADs directly from video is crucial. This work first leverages a large vision-language model for video QAD generation. To enhance the consistency and diversity of the generated QADs, we propose describing video objects through their temporal motion. In addition, we design a selection mechanism that chooses diverse temporal object motions, so that the generated QADs focus on different objects and interactions and maximize the overall semantic uncertainty for a given video. Evaluation on the NExT-QA and Perception Test benchmarks demonstrates that the proposed approach significantly improves both the consistency and diversity of QADs generated by a range of large vision-language models, highlighting its effectiveness and generalizability.
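The selection mechanism mentioned in the abstract can be pictured as a diversity-maximizing subset selection over candidate object-motion descriptions. The sketch below is a minimal illustration, not the authors' released method: it assumes the candidate motions have already been embedded as vectors (`motion_embeddings`, an N×d array produced by some sentence or video encoder) and uses greedy farthest-point selection of `k` items as a stand-in for the paper's semantic-uncertainty objective.

```python
# Illustrative sketch only: greedily pick a subset of temporal object-motion
# descriptions whose pairwise embedding dissimilarity is maximal, as a proxy for
# maximizing the semantic uncertainty of the QADs generated from them.
# `motion_embeddings` and `k` are hypothetical inputs, not from the paper.
import numpy as np

def select_diverse_motions(motion_embeddings: np.ndarray, k: int) -> list[int]:
    """Greedy max-min (farthest-point) selection of k motion descriptions."""
    n = motion_embeddings.shape[0]
    # Normalize rows so dot products become cosine similarities.
    emb = motion_embeddings / np.linalg.norm(motion_embeddings, axis=1, keepdims=True)
    dist = 1.0 - emb @ emb.T  # pairwise cosine distance as dissimilarity

    # Start from the description least similar to the centroid (least "typical" motion).
    centroid = emb.mean(axis=0)
    selected = [int(np.argmin(emb @ centroid))]

    while len(selected) < min(k, n):
        # For each unselected candidate, its distance to the closest already-selected
        # item; pick the candidate that maximizes this min-distance (the most novel motion).
        min_dist_to_sel = dist[:, selected].min(axis=1)
        min_dist_to_sel[selected] = -np.inf
        selected.append(int(np.argmax(min_dist_to_sel)))
    return selected
```

Selected indices would then point to the motion descriptions used to prompt the vision-language model for QAD generation, so that each question targets a different object or interaction.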
Anthology ID:
2025.findings-acl.376
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
7207–7220
URL:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.376/
Cite (ACL):
Wenjian Ding, Yao Zhang, Jun Wang, Adam Jatowt, and Zhenglu Yang. 2025. Generating Questions, Answers, and Distractors for Videos: Exploring Semantic Uncertainty of Object Motions. In Findings of the Association for Computational Linguistics: ACL 2025, pages 7207–7220, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Generating Questions, Answers, and Distractors for Videos: Exploring Semantic Uncertainty of Object Motions (Ding et al., Findings 2025)
PDF:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.376.pdf