Abstract
Visual question answering (VQA) and image captioning require a shared body of general knowledge connecting language and vision. We present a novel approach to improving VQA performance that exploits this connection by jointly generating captions targeted to help answer a specific visual question. The model is trained using an existing caption dataset by automatically determining question-relevant captions with an online gradient-based method. Experimental results on the VQA v2 challenge demonstrate that our approach obtains state-of-the-art VQA performance (e.g., 68.4% on the Test-standard set using a single model) while simultaneously generating question-relevant captions.
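The abstract's "online gradient-based method" scores candidate captions by how well their training signal agrees with the question-answering signal. The sketch below is a minimal, illustrative toy of that idea only: it assumes both the VQA head and the caption head read the same joint feature vector and ranks captions by the cosine similarity of the two loss gradients with respect to that feature. The linear/softmax stand-ins, variable names, and toy data are hypothetical and are not the paper's implementation.

```python
import numpy as np

# Toy sketch: select "question-relevant" captions by comparing the gradient of
# the VQA answer loss with the gradient of each caption's loss, both taken
# w.r.t. a shared joint feature vector v. All models are toy stand-ins.

rng = np.random.default_rng(0)
D, A, V = 16, 10, 50               # feature dim, answer vocab, caption word vocab

W_vqa = rng.normal(size=(A, D))    # stand-in VQA answer classifier
W_cap = rng.normal(size=(V, D))    # stand-in caption word predictor

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_wrt_feature(W, v, target):
    """Gradient of a cross-entropy loss w.r.t. the joint feature v."""
    p = softmax(W @ v)
    p[target] -= 1.0               # dL/dlogits for cross-entropy
    return W.T @ p                 # chain rule back to the feature

def caption_relevance(v, answer_id, caption_word_ids):
    """Cosine similarity between the VQA-loss and caption-loss gradients."""
    g_vqa = grad_wrt_feature(W_vqa, v, answer_id)
    g_cap = sum(grad_wrt_feature(W_cap, v, w) for w in caption_word_ids)
    denom = np.linalg.norm(g_vqa) * np.linalg.norm(g_cap) + 1e-8
    return float(g_vqa @ g_cap / denom)

# Example: keep the caption whose gradient aligns best with the answer gradient.
v = rng.normal(size=D)                           # joint image+question feature
captions = [[3, 7, 19], [1, 42, 5, 8], [30, 2]]  # toy word-id sequences
scores = [caption_relevance(v, 4, c) for c in captions]
print("relevance scores:", scores, "-> keep caption", int(np.argmax(scores)))
```

In this toy, captions whose gradients point in the same direction as the answer gradient would reinforce the answering objective during joint training, which is the intuition behind treating them as question-relevant.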
- Anthology ID: P19-1348
- Volume: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
- Month: July
- Year: 2019
- Address: Florence, Italy
- Venue: ACL
- Publisher: Association for Computational Linguistics
- Pages: 3585–3594
- URL: https://aclanthology.org/P19-1348
- DOI: 10.18653/v1/P19-1348
- Cite (ACL): Jialin Wu, Zeyuan Hu, and Raymond Mooney. 2019. Generating Question Relevant Captions to Aid Visual Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3585–3594, Florence, Italy. Association for Computational Linguistics.
- Cite (Informal): Generating Question Relevant Captions to Aid Visual Question Answering (Wu et al., ACL 2019)
- PDF: https://preview.aclanthology.org/nodalida-main-page/P19-1348.pdf
- Data: Visual Genome, Visual Question Answering, Visual Question Answering v2.0