Diagnosing Spatial Consistency across Perspectives and Viewpoints in Large Vision-Language Models

Yoonji Kim; Jieun Kim; Yujin Jeong; Sung-Bae Cho

Abstract

Consistent reasoning about 3D spatial relations across changing viewpoints is fundamental for Embodied AI agents operating in dynamic environments. While Large Vision-Language Models (LVLMs) have advanced multimodal perception, their ability to maintain spatial consistency across diverse perspectives remains underexplored. Existing benchmarks primarily assess spatial capabilities from a static, single-view, and egocentric perspective, failing to capture the dynamic nature of real-world spatial cognition. To address this gap, we introduce SCOPE (Spatial COnsistency across PErspectives and Viewpoints), a comprehensive benchmark designed to rigorously diagnose spatial reasoning capabilities. Grounded in human cognitive theories of dual spatial representations, SCOPE discretizes the 360° field into multiview scenarios to systematically evaluate both allocentric and egocentric reasoning capabilities. Our dataset comprises 20.1K spatial VQA pairs derived from high-quality 3D environments. Through an extensive evaluation of 26 state-of-the-art LVLMs, we identify two fundamental limitations that prevent consistent spatial understanding across viewpoints. We hope SCOPE facilitates the diagnosis of spatial reasoning, serving as a stepping stone toward reliable embodied action.

Anthology ID: 2026.acl-long.1514
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 32803–32827
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1514/
Cite (ACL): Yoonji Kim, Jieun Kim, Yujin Jeong, and Sung-Bae Cho. 2026. Diagnosing Spatial Consistency across Perspectives and Viewpoints in Large Vision-Language Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 32803–32827, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Diagnosing Spatial Consistency across Perspectives and Viewpoints in Large Vision-Language Models (Kim et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1514.pdf
Checklist: 2026.acl-long.1514.checklist.pdf