Multi-grained Attention with Object-level Grounding for Visual Question Answering
Pingping Huang, Jianhui Huang, Yuqing Guo, Min Qiao, Yong Zhu
Abstract
Attention mechanisms are widely used in Visual Question Answering (VQA) to search for visual clues related to the question. Most approaches train attention models from a coarse-grained association between sentences and images, which tends to fail on small objects or uncommon concepts. To address this problem, this paper proposes a multi-grained attention method. It learns explicit word-object correspondence through two types of word-level attention that complement the sentence-image association. Evaluated on the VQA benchmark, the multi-grained attention model achieves performance competitive with state-of-the-art models. The visualized attention maps demonstrate that adding object-level grounding leads to a better understanding of the images and locates the attended objects more precisely.
- Anthology ID:
- P19-1349
- Volume:
- Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
- Month:
- July
- Year:
- 2019
- Address:
- Florence, Italy
- Editors:
- Anna Korhonen, David Traum, Lluís Màrquez
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 3595–3600
- URL:
- https://aclanthology.org/P19-1349
- DOI:
- 10.18653/v1/P19-1349
- Cite (ACL):
- Pingping Huang, Jianhui Huang, Yuqing Guo, Min Qiao, and Yong Zhu. 2019. Multi-grained Attention with Object-level Grounding for Visual Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3595–3600, Florence, Italy. Association for Computational Linguistics.
- Cite (Informal):
- Multi-grained Attention with Object-level Grounding for Visual Question Answering (Huang et al., ACL 2019)
- PDF:
- https://preview.aclanthology.org/ml4al-ingestion/P19-1349.pdf
- Data
- Visual Question Answering, Visual Question Answering v2.0