Towards Knowledge-Augmented Visual Question Answering

Maryam Ziaeefard, Freddy Lecue


Abstract
Visual Question Answering (VQA) remains algorithmically challenging, although it is effortless for humans. Humans combine visual observations with general and commonsense knowledge to answer questions about a given image. In this paper, we address the problem of incorporating general knowledge into VQA models while still leveraging the visual information. We propose a model that captures the interactions between objects in a visual scene and entities in an external knowledge source. Our model is a graph-based approach that combines scene graphs with concept graphs and learns a question-adaptive graph representation of related knowledge instances. We use Graph Attention Networks to assign higher importance to the key knowledge instances that are most relevant to each question. We use ConceptNet as the source of general knowledge and evaluate our model on the challenging OK-VQA dataset.
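
The idea of weighting knowledge instances by their relevance to the question can be illustrated with a minimal sketch. It is not the authors' implementation: it replaces the learned scoring function of a Graph Attention Network with a plain dot product between each node feature and the question embedding, and the function names, the toy graph, and the feature vectors are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def question_attention(node_feats, neighbors, question):
    """One question-conditioned attention step over a merged graph.

    node_feats: dict mapping node name -> feature vector (list of floats)
    neighbors:  dict mapping node name -> list of adjacent node names
    question:   question embedding with the same dimension as the features
    Returns a dict mapping node name -> updated feature vector.
    """
    dim = len(question)
    updated = {}
    for node, nbrs in neighbors.items():
        # Each node attends to itself and to its neighbors.
        cands = [node] + list(nbrs)
        # Assumed scoring: similarity of each candidate to the question.
        # A real GAT layer would use a learned scoring function instead.
        weights = softmax([dot(node_feats[c], question) for c in cands])
        # New feature = attention-weighted sum of candidate features.
        updated[node] = [
            sum(w * node_feats[c][i] for w, c in zip(weights, cands))
            for i in range(dim)
        ]
    return updated


# Toy merged graph: "dog" and "frisbee" stand in for scene-graph objects,
# "pet" for a concept-graph entity. All vectors are made up.
node_feats = {"dog": [1.0, 0.0], "frisbee": [0.0, 1.0], "pet": [1.0, 1.0]}
neighbors = {"dog": ["frisbee", "pet"], "frisbee": ["dog"], "pet": ["dog"]}
question = [2.0, 0.0]  # hypothetical question embedding

updated = question_attention(node_feats, neighbors, question)
print(updated["dog"])
```

In this toy setup the question vector aligns with the first feature dimension, so when updating "dog" the nodes "dog" and "pet" receive more weight than "frisbee". A trained model would learn both the node features and the scoring function rather than fixing them by hand.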
Anthology ID:
2020.coling-main.169
Volume:
Proceedings of the 28th International Conference on Computational Linguistics
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
1863–1873
URL:
https://aclanthology.org/2020.coling-main.169
DOI:
10.18653/v1/2020.coling-main.169
Cite (ACL):
Maryam Ziaeefard and Freddy Lecue. 2020. Towards Knowledge-Augmented Visual Question Answering. In Proceedings of the 28th International Conference on Computational Linguistics, pages 1863–1873, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Cite (Informal):
Towards Knowledge-Augmented Visual Question Answering (Ziaeefard & Lecue, COLING 2020)
PDF:
https://preview.aclanthology.org/auto-file-uploads/2020.coling-main.169.pdf
Code
 ziamaryam/kvqa
Data
OK-VQA, Visual Genome, Visual Question Answering