Jianhui Huang


Multi-grained Attention with Object-level Grounding for Visual Question Answering
Pingping Huang | Jianhui Huang | Yuqing Guo | Min Qiao | Yong Zhu
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Attention mechanisms are widely used in Visual Question Answering (VQA) to search for visual clues related to the question. Most approaches train attention models from a coarse-grained association between sentences and images, which tends to fail on small objects or uncommon concepts. To address this problem, this paper proposes a multi-grained attention method that learns explicit word-object correspondence through two types of word-level attention that complement the sentence-image association. Evaluated on the VQA benchmark, the multi-grained attention model performs competitively with state-of-the-art models. The visualized attention maps show that adding object-level grounding leads to a better understanding of the images and locates the attended objects more precisely.