Kaito Watanabe
2026
Evaluation of Multilingual Ability to Use Spatial Deictic Expressions in Vision-Language Models
Kaito Watanabe | Taisei Yamamoto | Tomoki Doi | Hitomi Yanaka
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
One of the expected abilities of vision-language models (VLMs) is spatial reasoning ability based on a given text and image. To evaluate the spatial reasoning abilities of VLMs, we focus on the use of spatial deictic expressions, which are defined as spatial expressions whose referent is determined by their situational context, such as this and that. To handle spatial deictic expressions, VLMs must jointly reason over language and visual space, grounding context-dependent references in the image’s spatial structure. In addition, selecting appropriate spatial deictic expressions across languages requires VLMs to understand the language-specific spatial distinctions encoded by these expressions. In this paper, we develop a benchmark to evaluate the multilingual ability of VLMs to use spatial deictic expressions in four languages. Our experiments using this benchmark reveal that the tested models use demonstratives in a manner different from that of humans, particularly in selecting the appropriate demonstratives based on the distance from the object.