Abstract
Recent vision-language (VL) models are powerful, but can they reliably distinguish “right” from “left”? We curate three new corpora to quantify model comprehension of such basic spatial relations. These tests isolate spatial reasoning more precisely than existing datasets like VQAv2, e.g., our What’sUp benchmark contains sets of photographs varying only the spatial relations of objects, keeping their identity fixed (see Figure 1: models must comprehend not only the usual case of a dog under a table, but also the same dog on top of the same table). We evaluate 18 VL models, finding that all perform poorly, e.g., BLIP finetuned on VQAv2, which nears human parity on VQAv2, achieves 56% accuracy on our benchmarks vs. humans at 99%. We conclude by studying causes of this surprising behavior, finding: 1) that popular vision-language pretraining corpora like LAION-2B contain little reliable data for learning spatial relationships; and 2) that basic modeling interventions like up-weighting preposition-containing instances or fine-tuning on our corpora are not sufficient to address the challenges our benchmarks pose. We are hopeful that these corpora will facilitate further research, and we release our data and code at https://github.com/amitakamath/whatsup_vlms.
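The evaluation the abstract describes is a forced choice: a model scores one photograph against captions that differ only in their preposition, and it is counted correct only if the true caption scores highest. Below is a minimal sketch of that protocol using an off-the-shelf CLIP model via the HuggingFace transformers library; this is an illustration, not the paper's exact harness, and the image path and caption set are hypothetical stand-ins for a What'sUp example.

```python
# Sketch of the forced-choice protocol: score one image against captions
# that differ only in the preposition, and predict the highest-scoring one.
# Uses off-the-shelf CLIP (HuggingFace transformers), not the paper's harness.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog_under_table.jpg")  # hypothetical What'sUp-style photo
captions = [
    "a dog under a table",          # ground truth for this image
    "a dog on a table",
    "a dog to the left of a table",
    "a dog to the right of a table",
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    # logits_per_image has shape (1, num_captions): image-text similarity scores
    logits = model(**inputs).logits_per_image[0]

pred = captions[logits.argmax().item()]
print(f"predicted: {pred!r}")  # correct only if the model picks "under"
```

Because the candidate captions keep object identities fixed and vary only the spatial relation, a model cannot fall back on object recognition; it must actually resolve the preposition.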
- Anthology ID: 2023.emnlp-main.568
- Volume: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
- Month: December
- Year: 2023
- Address: Singapore
- Editors: Houda Bouamor, Juan Pino, Kalika Bali
- Venue: EMNLP
- Publisher: Association for Computational Linguistics
- Pages: 9161–9175
- URL: https://preview.aclanthology.org/add_missing_videos/2023.emnlp-main.568/
- DOI: 10.18653/v1/2023.emnlp-main.568
- Cite (ACL): Amita Kamath, Jack Hessel, and Kai-Wei Chang. 2023. What’s “up” with vision-language models? Investigating their struggle with spatial reasoning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9161–9175, Singapore. Association for Computational Linguistics.
- Cite (Informal): What’s “up” with vision-language models? Investigating their struggle with spatial reasoning (Kamath et al., EMNLP 2023)
- PDF: https://preview.aclanthology.org/add_missing_videos/2023.emnlp-main.568.pdf