Abstract
Visual Question Answering (VQA) remains algorithmically challenging even though it is effortless for humans. Humans combine visual observations with general and commonsense knowledge to answer questions about a given image. In this paper, we address the problem of incorporating general knowledge into VQA models while leveraging the visual information. We propose a model that captures the interactions between objects in a visual scene and entities in an external knowledge source. Our model is a graph-based approach that combines scene graphs with concept graphs and learns a question-adaptive graph representation of related knowledge instances. We use Graph Attention Networks to assign higher importance to the key knowledge instances that are most relevant to each question. We use ConceptNet as the source of general knowledge and evaluate the performance of our model on the challenging OK-VQA dataset.
- Anthology ID:
- 2020.coling-main.169
- Volume:
- Proceedings of the 28th International Conference on Computational Linguistics
- Month:
- December
- Year:
- 2020
- Address:
- Barcelona, Spain (Online)
- Venue:
- COLING
- Publisher:
- International Committee on Computational Linguistics
- Pages:
- 1863–1873
- URL:
- https://aclanthology.org/2020.coling-main.169
- DOI:
- 10.18653/v1/2020.coling-main.169
- Cite (ACL):
- Maryam Ziaeefard and Freddy Lecue. 2020. Towards Knowledge-Augmented Visual Question Answering. In Proceedings of the 28th International Conference on Computational Linguistics, pages 1863–1873, Barcelona, Spain (Online). International Committee on Computational Linguistics.
- Cite (Informal):
- Towards Knowledge-Augmented Visual Question Answering (Ziaeefard & Lecue, COLING 2020)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2020.coling-main.169.pdf
- Code:
- ziamaryam/kvqa
- Data:
- OK-VQA, Visual Genome, Visual Question Answering
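
The abstract describes weighting knowledge instances by their relevance to each question. The sketch below illustrates that general idea only. It is not the authors' model and not a full Graph Attention Network: it uses plain scaled dot-product attention between a question vector and node vectors, with no graph edges and no learned parameters. The function names, the toy two-dimensional embeddings, and the node labels are all hypothetical and chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def question_attention(question_vec, node_vecs):
    """Weight each knowledge node by its similarity to the question.

    Assumption: similarity is a scaled dot product. A real GAT layer would
    learn projection matrices and attend only along graph edges.
    """
    dim = len(question_vec)
    scores = [
        sum(q * n for q, n in zip(question_vec, node)) / math.sqrt(dim)
        for node in node_vecs
    ]
    weights = softmax(scores)
    # Question-adaptive summary: attention-weighted sum of node vectors.
    pooled = [
        sum(w * node[i] for w, node in zip(weights, node_vecs))
        for i in range(dim)
    ]
    return weights, pooled


# Toy, made-up embeddings for a question and three concept nodes.
question = [1.0, 0.0]
nodes = {
    "umbrella": [2.0, 0.0],
    "rain": [1.0, 1.0],
    "piano": [-1.0, 0.5],
}

weights, pooled = question_attention(question, list(nodes.values()))
for name, weight in zip(nodes, weights):
    print(f"{name}: {weight:.3f}")
```

In this toy setup the node whose vector aligns best with the question receives the largest weight, so it dominates the pooled representation. That is the behaviour the abstract attributes to its attention mechanism, here reproduced only in a heavily simplified form.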