Dual Attention Networks for Visual Reference Resolution in Visual Dialog

Gi-Cheon Kang, Jaeseo Lim, Byoung-Tak Zhang


Abstract
Visual dialog (VisDial) is a task which requires a dialog agent to answer a series of questions grounded in an image. Unlike in visual question answering (VQA), the agent must capture temporal context from the dialog history and utilize visually-grounded information to answer the series of questions. Visual reference resolution addresses these challenges, requiring the agent to resolve ambiguous references in a given question and to find those references in a given image. In this paper, we propose Dual Attention Networks (DAN) for visual reference resolution in VisDial. DAN consists of two kinds of attention modules, REFER and FIND. Specifically, the REFER module learns latent relationships between a given question and a dialog history by employing a multi-head attention mechanism. The FIND module takes image features and reference-aware representations (i.e., the output of the REFER module) as input, and performs visual grounding via a bottom-up attention mechanism. We qualitatively and quantitatively evaluate our model on the VisDial v1.0 and v0.9 datasets, showing that DAN outperforms the previous state-of-the-art model by a significant margin.
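The two-module pipeline described in the abstract can be sketched in PyTorch. This is a minimal illustration, not the authors' implementation: the module names mirror the paper, but the dimensions, residual fusion, and additive attention scoring are assumptions made for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReferModule(nn.Module):
    """Attends over dialog-history features with multi-head attention to
    produce a reference-aware question representation (hypothetical sketch)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, question, history):
        # question: (B, 1, dim); history: (B, T, dim) for T dialog rounds
        ctx, _ = self.attn(query=question, key=history, value=history)
        return question + ctx  # residual fusion of history context (assumed)

class FindModule(nn.Module):
    """Grounds the reference-aware representation in bottom-up image region
    features via soft attention over regions (hypothetical sketch)."""
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, ref, regions):
        # ref: (B, 1, dim); regions: (B, R, dim) bottom-up region features
        logits = self.score(torch.tanh(ref + regions))  # (B, R, 1)
        weights = F.softmax(logits, dim=1)              # attention over regions
        return (weights * regions).sum(dim=1)           # (B, dim) attended visual feature

# Toy forward pass with random features
B, T, R, D = 2, 10, 36, 512
q, h, v = torch.randn(B, 1, D), torch.randn(B, T, D), torch.randn(B, R, D)
ref = ReferModule(D)(q, h)      # reference-aware representation
vis = FindModule(D)(ref, v)     # visually grounded feature
print(vis.shape)                # torch.Size([2, 512])
```

In the full model, the attended visual feature would be fused with the question representation and fed to an answer decoder; that stage is omitted here.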
Anthology ID:
D19-1209
Volume:
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
Month:
November
Year:
2019
Address:
Hong Kong, China
Editors:
Kentaro Inui, Jing Jiang, Vincent Ng, Xiaojun Wan
Venues:
EMNLP | IJCNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
2024–2033
URL:
https://aclanthology.org/D19-1209
DOI:
10.18653/v1/D19-1209
Cite (ACL):
Gi-Cheon Kang, Jaeseo Lim, and Byoung-Tak Zhang. 2019. Dual Attention Networks for Visual Reference Resolution in Visual Dialog. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2024–2033, Hong Kong, China. Association for Computational Linguistics.
Cite (Informal):
Dual Attention Networks for Visual Reference Resolution in Visual Dialog (Kang et al., EMNLP-IJCNLP 2019)
PDF:
https://preview.aclanthology.org/naacl24-info/D19-1209.pdf
Code
gicheonkang/DAN-VisDial
Data
MS COCO | VisDial | Visual Question Answering