Abstract
Understanding images and text together is an important aspect of cognition and of building advanced Artificial Intelligence (AI) systems. As a community, we have achieved good benchmark results in the language and vision domains separately; however, joint reasoning is still a challenge for state-of-the-art computer vision and natural language processing (NLP) systems. We propose a novel task that requires joint inference over a given image-text modality and compile the Visuo-Linguistic Question Answering (VLQA) challenge corpus in a question answering setting. Each dataset item consists of an image and a reading passage, and the questions are designed to combine both visual and textual information, i.e., ignoring either modality would make the question unanswerable. We first explore the best existing vision-language architectures on VLQA subsets and show that they are unable to reason well. We then develop a modular method with slightly better baseline performance, but it is still far behind human performance. We believe that VLQA will be a good benchmark for reasoning over a visuo-linguistic context. The dataset, code and leaderboard are available at https://shailaja183.github.io/vlqa/.
- Anthology ID:
- 2020.findings-emnlp.413
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2020
- Month:
- November
- Year:
- 2020
- Address:
- Online
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 4606–4616
- URL:
- https://aclanthology.org/2020.findings-emnlp.413
- DOI:
- 10.18653/v1/2020.findings-emnlp.413
- Cite (ACL):
- Shailaja Keyur Sampat, Yezhou Yang, and Chitta Baral. 2020. Visuo-Linguistic Question Answering (VLQA) Challenge. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4606–4616, Online. Association for Computational Linguistics.
- Cite (Informal):
- Visuo-Linguistic Question Answering (VLQA) Challenge (Sampat et al., Findings 2020)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2020.findings-emnlp.413.pdf
- Data:
- AI2D, ARC, TQA, VCR, Visual Question Answering