Multi-grained Attention with Object-level Grounding for Visual Question Answering

Pingping Huang, Jianhui Huang, Yuqing Guo, Min Qiao, Yong Zhu

[How to correct problems with metadata yourself]


Abstract
Attention mechanisms are widely used in Visual Question Answering (VQA) to search for visual clues related to the question. Most approaches train attention models from a coarse-grained association between sentences and images, which tends to fail on small objects or uncommon concepts. To address this problem, this paper proposes a multi-grained attention method. It learns explicit word-object correspondence by two types of word-level attention complementary to the sentence-image association. Evaluated on the VQA benchmark, the multi-grained attention model achieves competitive performance with state-of-the-art models. And the visualized attention maps demonstrate that addition of object-level groundings leads to a better understanding of the images and locates the attended objects more precisely.
Anthology ID:
P19-1349
Volume:
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2019
Address:
Florence, Italy
Editors:
Anna Korhonen, David Traum, Lluís Màrquez
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
3595–3600
Language:
URL:
https://aclanthology.org/P19-1349
DOI:
10.18653/v1/P19-1349
Bibkey:
Cite (ACL):
Pingping Huang, Jianhui Huang, Yuqing Guo, Min Qiao, and Yong Zhu. 2019. Multi-grained Attention with Object-level Grounding for Visual Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3595–3600, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Multi-grained Attention with Object-level Grounding for Visual Question Answering (Huang et al., ACL 2019)
Copy Citation:
PDF:
https://preview.aclanthology.org/teach-a-man-to-fish/P19-1349.pdf
Video:
 https://preview.aclanthology.org/teach-a-man-to-fish/P19-1349.mp4
Data
Visual Question AnsweringVisual Question Answering v2.0