Dynamic Multistep Reasoning based on Video Scene Graph for Video Question Answering
Jianguo Mao, Wenbin Jiang, Xiangdong Wang, Zhifan Feng, Yajuan Lyu, Hong Liu, Yong Zhu
Abstract
Existing video question answering (video QA) models lack the capacity for deep video understanding and flexible multistep reasoning. We propose for video QA a novel model which performs dynamic multistep reasoning between questions and videos. It creates video semantic representation based on the video scene graph composed of semantic elements of the video and semantic relations among these elements. Then, it performs multistep reasoning for better answer decision between the representations of the question and the video, and dynamically integrate the reasoning results. Experiments show the significant advantage of the proposed model against previous methods in accuracy and interpretability. Against the existing state-of-the-art model, the proposed model dramatically improves more than 4%/3.1%/2% on the three widely used video QA datasets, MSRVTT-QA, MSRVTT multi-choice, and TGIF-QA, and displays better interpretability by backtracing along with the attention mechanisms to the video scene graphs.- Anthology ID:
- 2022.naacl-main.286
- Volume:
- Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
- Month:
- July
- Year:
- 2022
- Address:
- Seattle, United States
- Venue:
- NAACL
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 3894–3904
- Language:
- URL:
- https://aclanthology.org/2022.naacl-main.286
- DOI:
- 10.18653/v1/2022.naacl-main.286
- Cite (ACL):
- Jianguo Mao, Wenbin Jiang, Xiangdong Wang, Zhifan Feng, Yajuan Lyu, Hong Liu, and Yong Zhu. 2022. Dynamic Multistep Reasoning based on Video Scene Graph for Video Question Answering. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3894–3904, Seattle, United States. Association for Computational Linguistics.
- Cite (Informal):
- Dynamic Multistep Reasoning based on Video Scene Graph for Video Question Answering (Mao et al., NAACL 2022)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2022.naacl-main.286.pdf