All You May Need for VQA are Image Captions
Soravit Changpinyo, Doron Kukliansy, Idan Szpektor, Xi Chen, Nan Ding, Radu Soricut
Abstract
Visual Question Answering (VQA) has benefited from increasingly sophisticated models, but has not enjoyed the same level of engagement in terms of data creation. In this paper, we propose a method that automatically derives VQA examples at volume, by leveraging the abundance of existing image-caption annotations combined with neural models for textual question generation. We show that the resulting data is of high-quality. VQA models trained on our data improve state-of-the-art zero-shot accuracy by double digits and achieve a level of robustness that lacks in the same model trained on human-annotated VQA data.
- Anthology ID:
- 2022.naacl-main.142
- Volume:
- Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
- Month:
- July
- Year:
- 2022
- Address:
- Seattle, United States
- Editors:
- Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
- Venue:
- NAACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1947–1963
- URL:
- https://preview.aclanthology.org/add_missing_videos/2022.naacl-main.142/
- DOI:
- 10.18653/v1/2022.naacl-main.142
- Cite (ACL):
- Soravit Changpinyo, Doron Kukliansy, Idan Szpektor, Xi Chen, Nan Ding, and Radu Soricut. 2022. All You May Need for VQA are Image Captions. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1947–1963, Seattle, United States. Association for Computational Linguistics.
- Cite (Informal):
- All You May Need for VQA are Image Captions (Changpinyo et al., NAACL 2022)
- PDF:
- https://preview.aclanthology.org/add_missing_videos/2022.naacl-main.142.pdf
- Code:
- google-research-datasets/maverics
- Data:
- MAVERICS, COCO-QA, Conceptual Captions, GQA, MS COCO, OK-VQA, SQuAD, VQG, Visual Question Answering, Visual Question Answering v2.0