Diagnosing Spatial Consistency across Perspectives and Viewpoints in Large Vision-Language Models

Yoonji Kim, Jieun Kim, Yujin Jeong, Sung-Bae Cho


Abstract
Consistent reasoning about 3D spatial relations across changing viewpoints is fundamental for Embodied AI agents operating in dynamic environments. While Large Vision-Language Models (LVLMs) have advanced multimodal perception, their ability to maintain spatial consistency across diverse perspectives remains underexplored. Existing benchmarks primarily assess spatial capabilities from a static, single-view, and egocentric perspective, failing to capture the dynamic nature of real-world spatial cognition. To address this gap, we introduce SCOPE (Spatial COnsistency across PErspectives and Viewpoints), a comprehensive benchmark designed to rigorously diagnose spatial reasoning capabilities. Grounded in human cognitive theories of dual spatial representations, SCOPE discretizes the 360° field into multiview scenarios to systematically evaluate both allocentric and egocentric reasoning capabilities. Our dataset comprises 20.1K spatial VQA pairs derived from high-quality 3D environments. Through an extensive evaluation of 26 state-of-the-art LVLMs, we identify two fundamental limitations that prevent consistent spatial understanding across viewpoints. We hope SCOPE facilitates the diagnosis of spatial reasoning, serving as a stepping stone toward reliable embodied action.
Anthology ID:
2026.acl-long.1514
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
32803–32827
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1514/
Cite (ACL):
Yoonji Kim, Jieun Kim, Yujin Jeong, and Sung-Bae Cho. 2026. Diagnosing Spatial Consistency across Perspectives and Viewpoints in Large Vision-Language Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 32803–32827, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Diagnosing Spatial Consistency across Perspectives and Viewpoints in Large Vision-Language Models (Kim et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1514.pdf
Checklist:
2026.acl-long.1514.checklist.pdf