Evaluation of Multilingual Ability to Use Spatial Deictic Expressions in Vision-Language Models

Kaito Watanabe, Taisei Yamamoto, Tomoki Doi, Hitomi Yanaka


Abstract
One of the expected abilities of vision-language models (VLMs) is spatial reasoning ability based on a given text and image. To evaluate the spatial reasoning abilities of VLMs, we focus on the use of spatial deictic expressions, which are defined as spatial expressions whose referent is determined by their situational context, such as "this" and "that". To handle spatial deictic expressions, VLMs must jointly reason over language and visual space, grounding context-dependent references in the image's spatial structure. In addition, selecting appropriate spatial deictic expressions across languages requires VLMs to understand the language-specific spatial distinctions encoded by these expressions. In this paper, we develop a benchmark to evaluate the multilingual ability of VLMs to use spatial deictic expressions in four languages. Our experiments using this benchmark reveal that the tested models use demonstratives in a manner different from that of humans, particularly in selecting the appropriate demonstratives based on the distance from the object.
Anthology ID:
2026.acl-srw.106
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Santosh T.Y.S.S., Juan Diego Rodriguez, Ona de Gibert
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
1203–1211
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-srw.106/
Cite (ACL):
Kaito Watanabe, Taisei Yamamoto, Tomoki Doi, and Hitomi Yanaka. 2026. Evaluation of Multilingual Ability to Use Spatial Deictic Expressions in Vision-Language Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), pages 1203–1211, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Evaluation of Multilingual Ability to Use Spatial Deictic Expressions in Vision-Language Models (Watanabe et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-srw.106.pdf