Abstract
Existing visual question answering (VQA) systems commonly use graph neural networks (GNNs) to extract visual relationships such as semantic or spatial relations. However, studies that use GNNs typically ignore the importance of each relation and simply concatenate the outputs from multiple relation encoders. In this paper, we propose a novel layer architecture that fuses multiple visual relations through an attention mechanism to address this issue. Specifically, we develop a model that uses the question embedding and the joint embedding of the encoders to obtain dynamic attention weights that depend on the type of question. Using these learnable attention weights, the proposed model can efficiently select the visual relation features needed for a given question. Experimental results on the VQA 2.0 dataset demonstrate that the proposed model outperforms existing graph attention network-based architectures. Additionally, we visualize the attention weights and show that the proposed model assigns higher weights to relations that are more relevant to the question.
- Anthology ID: 2021.maiworkshop-1.13
- Volume: Proceedings of the Third Workshop on Multimodal Artificial Intelligence
- Month: June
- Year: 2021
- Address: Mexico City, Mexico
- Editors: Amir Zadeh, Louis-Philippe Morency, Paul Pu Liang, Candace Ross, Ruslan Salakhutdinov, Soujanya Poria, Erik Cambria, Kelly Shi
- Venue: maiworkshop
- Publisher: Association for Computational Linguistics
- Pages: 87–96
- URL: https://aclanthology.org/2021.maiworkshop-1.13
- DOI: 10.18653/v1/2021.maiworkshop-1.13
- Cite (ACL): Jaewoong Lee, Heejoon Lee, Hwanhee Lee, and Kyomin Jung. 2021. Learning to Select Question-Relevant Relations for Visual Question Answering. In Proceedings of the Third Workshop on Multimodal Artificial Intelligence, pages 87–96, Mexico City, Mexico. Association for Computational Linguistics.
- Cite (Informal): Learning to Select Question-Relevant Relations for Visual Question Answering (Lee et al., maiworkshop 2021)
- PDF: https://preview.aclanthology.org/nschneid-patch-2/2021.maiworkshop-1.13.pdf
- Data: Visual Question Answering, Visual Question Answering v2.0
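The abstract describes replacing plain concatenation of relation-encoder outputs with a question-conditioned attention fusion. The sketch below is a minimal NumPy illustration of that idea, not the authors' implementation: the bilinear scoring matrix `score_weights` and all shapes are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_relation_features(question_emb, relation_feats, score_weights):
    """Fuse per-relation features with question-conditioned attention.

    question_emb:   (d,) question embedding
    relation_feats: (K, d) one feature vector per relation encoder
                    (e.g. semantic, spatial)
    score_weights:  (d, d) learnable scoring matrix (illustrative assumption)
    """
    # One scalar score per relation type, conditioned on the question.
    scores = relation_feats @ score_weights @ question_emb  # shape (K,)
    alpha = softmax(scores)               # attention weights, sum to 1
    # Weighted sum of relation features instead of plain concatenation.
    fused = alpha @ relation_feats        # shape (d,)
    return fused, alpha

# Toy usage: two relation encoders producing 8-dim features.
d, K = 8, 2
q = rng.standard_normal(d)
feats = rng.standard_normal((K, d))
W = rng.standard_normal((d, d))
fused, alpha = fuse_relation_features(q, feats, W)
```

Because `alpha` is a probability distribution over relation types, inspecting it directly gives the kind of relation-relevance visualization the abstract mentions.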