@article{liu-etal-2023-visual,
title = "Visual Spatial Reasoning",
author = "Liu, Fangyu and
Emerson, Guy and
Collier, Nigel",
journal = "Transactions of the Association for Computational Linguistics",
volume = "11",
year = "2023",
address = "Cambridge, MA",
publisher = "MIT Press",
url = "https://aclanthology.org/2023.tacl-1.37/",
doi = "10.1162/tacl_a_00566",
pages = "635--651",
abstract = "Spatial relations are a basic part of human cognition. However, they are expressed in natural language in a variety of ways, and previous work has suggested that current vision-and-language models (VLMs) struggle to capture relational information. In this paper, we present Visual Spatial Reasoning (VSR), a dataset containing more than 10k natural text-image pairs with 66 types of spatial relations in English (e.g., under, in front of, facing). While using a seemingly simple annotation format, we show how the dataset includes challenging linguistic phenomena, such as varying reference frames. We demonstrate a large gap between human and model performance: The human ceiling is above 95{\%}, while state-of-the-art models only achieve around 70{\%}. We observe that VLMs' by-relation performances have little correlation with the number of training examples and the tested models are in general incapable of recognising relations concerning the orientations of objects."
}
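
For orientation, a minimal sketch of inspecting the VSR data described in the abstract, using the Hugging Face datasets library. The hub ID "cambridgeltl/vsr_random", the split name, and the field names (caption, label, relation) are assumptions for illustration, not details confirmed by the entry above.

# Minimal sketch: browsing a few VSR examples.
# Assumptions: hub ID "cambridgeltl/vsr_random", a "train" split, and
# fields "caption", "label" (1 = relation holds, 0 = it does not),
# and "relation" (one of the 66 spatial relation types).
from datasets import load_dataset

ds = load_dataset("cambridgeltl/vsr_random", split="train")

for example in ds.select(range(3)):
    # Each example pairs a caption asserting a spatial relation
    # with a binary truth label for the associated image.
    print(example["caption"], "->", example["label"], f"({example['relation']})")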