Dual Capsule Attention Mask Network with Mutual Learning for Visual Question Answering

Weidong Tian, Haodong Li, Zhong-Qiu Zhao


Abstract
A Visual Question Answering (VQA) model must process images and questions simultaneously, both of which carry rich semantic information. The attention mechanism can highlight fine-grained features carrying critical information, ensuring that feature extraction emphasizes the objects related to the question. However, unattended coarse-grained information is also essential for questions involving global elements. We believe that global coarse-grained information and local fine-grained information can complement each other to provide richer, more comprehensive information. In this paper, we propose a dual capsule attention mask network with mutual learning for VQA. Specifically, it contains two branches that process coarse-grained features and fine-grained features, respectively. We also design a novel stackable dual capsule attention module to fuse features and locate evidence. The two branches are combined to make the final predictions. Experimental results show that our method outperforms the baselines in both VQA performance and interpretability, and achieves new state-of-the-art (SOTA) performance on the VQA-v2 dataset.
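The abstract does not spell out the module's internals, but the two-branch idea it describes can be illustrated with a minimal PyTorch sketch. Everything below is a hypothetical illustration, not the authors' implementation: the module names and dimensions are invented, a simple question-conditioned dot-product attention stands in for the paper's dual capsule attention mask module, and the branch coupling uses a generic deep-mutual-learning-style KL loss.

```python
# Hypothetical sketch of a two-branch VQA model with mutual learning.
# Fine-grained branch: question-guided attention over region features.
# Coarse-grained branch: unattended global average of the same regions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchVQA(nn.Module):
    def __init__(self, dim=512, num_answers=3129):
        super().__init__()
        self.attn = nn.Linear(dim, 1)             # scores each image region
        self.fine_head = nn.Linear(dim, num_answers)
        self.coarse_head = nn.Linear(dim, num_answers)

    def forward(self, regions, question):
        # regions: (B, N, dim) region features; question: (B, dim) question feature
        # Fine-grained branch: attention conditioned on the question.
        scores = self.attn(regions * question.unsqueeze(1))   # (B, N, 1)
        weights = torch.softmax(scores, dim=1)
        fine = (weights * regions).sum(dim=1)                 # attended feature
        # Coarse-grained branch: global pooling, no attention mask.
        coarse = regions.mean(dim=1)
        return self.fine_head(fine * question), self.coarse_head(coarse * question)

def mutual_learning_loss(logits_a, logits_b, targets):
    # Each branch learns from the labels and mimics the other branch's
    # predictions (deep-mutual-learning-style KL terms, peers detached).
    ce = F.cross_entropy(logits_a, targets) + F.cross_entropy(logits_b, targets)
    kl_ab = F.kl_div(F.log_softmax(logits_a, -1),
                     F.softmax(logits_b, -1).detach(), reduction="batchmean")
    kl_ba = F.kl_div(F.log_softmax(logits_b, -1),
                     F.softmax(logits_a, -1).detach(), reduction="batchmean")
    return ce + kl_ab + kl_ba

# Smoke test with random tensors (36 regions, batch of 2).
model = DualBranchVQA()
regions, question = torch.randn(2, 36, 512), torch.randn(2, 512)
fine_logits, coarse_logits = model(regions, question)
loss = mutual_learning_loss(fine_logits, coarse_logits, torch.tensor([0, 1]))
loss.backward()
```

At inference, the two branches' answer distributions would be combined (for example, averaged) to produce the final prediction, mirroring the abstract's statement that the branches are combined for the final answer.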
Anthology ID: 2022.coling-1.500
Volume: Proceedings of the 29th International Conference on Computational Linguistics
Month: October
Year: 2022
Address: Gyeongju, Republic of Korea
Editors: Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, Seung-Hoon Na
Venue: COLING
Publisher: International Committee on Computational Linguistics
Pages: 5678–5688
URL: https://aclanthology.org/2022.coling-1.500
Cite (ACL): Weidong Tian, Haodong Li, and Zhong-Qiu Zhao. 2022. Dual Capsule Attention Mask Network with Mutual Learning for Visual Question Answering. In Proceedings of the 29th International Conference on Computational Linguistics, pages 5678–5688, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal): Dual Capsule Attention Mask Network with Mutual Learning for Visual Question Answering (Tian et al., COLING 2022)
PDF: https://preview.aclanthology.org/ingest-acl-2023-videos/2022.coling-1.500.pdf
Data: Visual Question Answering