VGNMN: Video-grounded Neural Module Networks for Video-Grounded Dialogue Systems

Hung Le; Nancy Chen; Steven Hoi

doi:10.18653/v1/2022.naacl-main.247

Abstract

Neural module networks (NMNs) have achieved success in image-grounded tasks such as Visual Question Answering (VQA) on synthetic images. However, NMNs have received very little study in video-grounded dialogue tasks. These tasks extend the complexity of traditional visual tasks with additional temporal variance in the visual input and cross-turn dependencies in the language. Motivated by recent NMN approaches to image-grounded tasks, we introduce the Video-grounded Neural Module Network (VGNMN), which models the information retrieval process in video-grounded language tasks as a pipeline of neural modules. VGNMN first decomposes all language components in dialogues to explicitly resolve any entity references and detect corresponding action-based inputs from the question. The detected entities and actions are then used as parameters to instantiate neural module networks and extract visual cues from the video. Our experiments show that VGNMN achieves promising performance on a challenging video-grounded dialogue benchmark as well as a video QA benchmark.
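
The pipeline described in the abstract (resolve entity references across turns, detect the action in the question, then use both as parameters for modules that extract cues from the video) can be illustrated with a purely symbolic toy. This is a minimal sketch under loud assumptions, not the authors' model: VGNMN uses learned neural modules over video features, whereas here the "video" is a list of frames annotated with (entity, action) pairs, and every name (`parse_question`, `find_module`, `filter_action_module`, the vocabularies) is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

# Hypothetical toy vocabularies; a real system would learn these.
PRONOUNS = {"he", "she", "it", "they"}
KNOWN_ENTITIES = {"man", "woman", "dog"}
KNOWN_ACTIONS = {"sit", "walk", "drink"}

# Toy video: each frame is a set of (entity, action) annotations.
Frame = Set[Tuple[str, str]]


@dataclass
class Program:
    """Parameters used to instantiate the modules for one question."""
    entity: Optional[str]
    action: Optional[str]


def parse_question(question: str, history_entities: List[str]) -> Program:
    """Resolve the entity (using dialogue history for pronouns) and detect the action."""
    tokens = question.lower().rstrip("?").split()
    entity = next((t for t in tokens if t in KNOWN_ENTITIES), None)
    # Cross-turn dependency: a pronoun refers to the most recently mentioned entity.
    if entity is None and any(t in PRONOUNS for t in tokens) and history_entities:
        entity = history_entities[-1]
    action = next((t for t in tokens if t in KNOWN_ACTIONS), None)
    return Program(entity, action)


def find_module(video: List[Frame], entity: Optional[str]) -> List[int]:
    """Return indices of frames in which the entity appears."""
    return [i for i, frame in enumerate(video) if any(e == entity for e, _ in frame)]


def filter_action_module(
    video: List[Frame], frames: List[int], entity: Optional[str], action: Optional[str]
) -> List[int]:
    """Keep only the frames in which the entity performs the action."""
    return [i for i in frames if (entity, action) in video[i]]


def run(video: List[Frame], question: str, history_entities: List[str]) -> List[int]:
    """Parse the question into a program, then execute the modules in sequence."""
    program = parse_question(question, history_entities)
    frames = find_module(video, program.entity)
    return filter_action_module(video, frames, program.entity, program.action)


video = [
    {("man", "walk")},
    {("man", "sit"), ("dog", "walk")},
    {("dog", "sit")},
]
# "he" is resolved to "man" from the dialogue history before the modules run.
matching_frames = run(video, "Does he sit?", ["man"])
```

In this toy, the parsed entity and action play the role of module parameters, and the returned frame indices stand in for the visual cues a response decoder would consume.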

Anthology ID: 2022.naacl-main.247
Volume: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month: July
Year: 2022
Address: Seattle, United States
Editors: Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 3377–3393
URL: https://preview.aclanthology.org/icon-24-ingestion/2022.naacl-main.247/
DOI: 10.18653/v1/2022.naacl-main.247
Cite (ACL): Hung Le, Nancy Chen, and Steven Hoi. 2022. VGNMN: Video-grounded Neural Module Networks for Video-Grounded Dialogue Systems. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3377–3393, Seattle, United States. Association for Computational Linguistics.
Cite (Informal): VGNMN: Video-grounded Neural Module Networks for Video-Grounded Dialogue Systems (Le et al., NAACL 2022)
PDF: https://preview.aclanthology.org/icon-24-ingestion/2022.naacl-main.247.pdf
Video: https://preview.aclanthology.org/icon-24-ingestion/2022.naacl-main.247.mp4
Data: Visual Question Answering
