Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions

Zhongbin Guo; Zhen Yang; Yushan Li; Xinyue Zhang; Wenyu Gao; Jiacheng Wang; Chengzhi Li; Xiangrui Liu; Ping Jian (鉴萍)

Abstract

Recent advancements in Spatial Intelligence (SI) have predominantly relied on Vision-Language Models (VLMs), yet a critical question remains: does spatial understanding originate from visual encoders or from the fundamental reasoning backbone? Inspired by this question, we introduce **SiT-Bench**, a novel benchmark designed to evaluate the SI performance of Large Language Models (LLMs) without pixel-level input. It comprises over 3,800 expert-annotated items across five primary categories and 17 subtasks, ranging from egocentric navigation and perspective transformation to fine-grained robotic manipulation. By converting single- and multi-view scenes into high-fidelity, coordinate-aware textual descriptions, we challenge LLMs to perform symbolic textual reasoning rather than visual pattern matching. Evaluation results for state-of-the-art (SOTA) LLMs reveal that while models achieve proficiency in localized semantic tasks, a significant "spatial gap" remains in global consistency. Notably, we find that explicit spatial reasoning significantly boosts performance, suggesting that LLMs possess latent world-modeling potential. SiT-Bench serves as a foundational resource to foster the development of spatially grounded LLM backbones for future VLMs and embodied agents.

Anthology ID: 2026.findings-acl.90
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 1852–1897
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.90/
Cite (ACL): Zhongbin Guo, Zhen Yang, Yushan Li, Xinyue Zhang, Wenyu Gao, Jiacheng Wang, Chengzhi Li, Xiangrui Liu, and Ping Jian. 2026. Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions. In Findings of the Association for Computational Linguistics: ACL 2026, pages 1852–1897, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions (Guo et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.90.pdf
Checklist: 2026.findings-acl.90.checklist.pdf