From Pixels to BFS: High Maze Accuracy Does Not Imply Visual Planning

Alberto Gonzalo Rodriguez Salgado


Abstract
How do multimodal models solve visual spatial tasks—through genuine planning, or by brute-forcing solutions in token space? We introduce MazeBench, a benchmark of 110 procedurally generated maze images organized into nine controlled groups (diagnostic, grid scale, wall density, trap ablation, unreachable detection, and more), and evaluate 16 model configurations across four providers (OpenAI, Anthropic, Google, Alibaba) at multiple reasoning effort levels. GPT-5.4 solves 91% and Gemini 3.1 Pro 79%, but our analysis reveals these scores are misleading: models translate images into text grids and brute-force paths via serial enumeration, consuming 1,710–22,818 tokens per solve for a task humans do in seconds. Without added reasoning budgets, all configurations score only 2–12%; on 20x20 ultra-hard mazes, they hit token limits and give up. Qualitative analysis of model outputs confirms a universal two-stage strategy: image-to-grid translation followed by step-by-step path search in natural language—essentially BFS implemented in prose. A text-grid ablation shows Claude’s poor image performance (6%) jumps to 80% when given the correct grid directly, confirming vision quality, not reasoning ability, as the bottleneck for weaker models. Perhaps most striking, when we explicitly instruct models not to build a text grid and not to perform graph search—asking them to "reason visually, like a human"—they silently ignore the instruction and immediately fall back to the same grid-enumeration strategy. This suggests that brute-force token-level search is the dominant mechanism these models rely on for spatial planning in our setting.
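As an illustration of the second stage the abstract describes (path search over a text grid), the following is a minimal sketch of breadth-first search on a character grid. It is not the paper's code or any model's actual procedure. The function name `bfs_path` and the cell symbols (`S` for start, `G` for goal, `#` for wall, `.` for open) are assumptions chosen for this example and are not taken from MazeBench.

```python
from collections import deque


def bfs_path(grid, start="S", goal="G", wall="#"):
    """Return the shortest path as a list of (row, col) cells, or None.

    The grid is a list of strings; the symbols are illustrative assumptions.
    """
    # Index every cell by coordinate so bounds checks become dict lookups.
    cells = {(r, c): ch for r, row in enumerate(grid) for c, ch in enumerate(row)}
    src = next(p for p, ch in cells.items() if ch == start)
    dst = next(p for p, ch in cells.items() if ch == goal)

    parent = {src: None}  # doubles as the visited set
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        if cur == dst:
            # Walk parent links back to the start, then reverse.
            path = []
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        r, c = cur
        for nxt in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if nxt in cells and cells[nxt] != wall and nxt not in parent:
                parent[nxt] = cur
                queue.append(nxt)
    return None  # goal unreachable from start


if __name__ == "__main__":
    print(bfs_path(["S.#", ".##", "..G"]))
    print(bfs_path(["S#G"]))
```

Run as a program, this takes a handful of queue operations on a small grid. The abstract's point is that models carry out the equivalent enumeration in natural-language tokens, which is where the reported per-solve token costs arise.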
Anthology ID: 2026.alvr-main.13
Volume: Proceedings of the 4th Workshop on Advances in Language and Vision Research (ALVR)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Qianqi Yan, Syrielle Montariol, Yue Fan, Jing Gu, Jiayi Pan, Manling Li, Parisa Kordjamshidi, Alane Suhr, Xin Eric Wang
Venues: ALVR | WS
Publisher: Association for Computational Linguistics
Pages: 164–179
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.alvr-main.13/
Cite (ACL): Alberto Gonzalo Rodriguez Salgado. 2026. From Pixels to BFS: High Maze Accuracy Does Not Imply Visual Planning. In Proceedings of the 4th Workshop on Advances in Language and Vision Research (ALVR), pages 164–179, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): From Pixels to BFS: High Maze Accuracy Does Not Imply Visual Planning (Rodriguez Salgado, ALVR 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.alvr-main.13.pdf