Thinking in Pictures: A Diagnostic Study of Visual vs. Textual Chain-of-Thought Reasoning in Vision-Language Models

Ben Jenkins


Abstract
Chain-of-thought (CoT) reasoning has become a standard technique for eliciting complex reasoning in large language models, and recent work has extended it to vision-language models (VLMs). However, virtually all multimodal CoT methods generate intermediate reasoning steps in natural language, even for inherently visual problems such as spatial reasoning, geometric manipulation, and object tracking. We ask a fundamental question: when should a VLM reason in words, and when should it reason in pictures? We present VisCoT-Diag, a diagnostic benchmark of 1,200 instances across five visual reasoning categories, and compare four CoT paradigms across four VLMs. Our results reveal a striking modality gap: textual CoT degrades performance by up to 17.5% on spatial transformation and 13.2% on multi-object tracking, while visual CoT yields gains of up to 23.1%. We identify three failure modes (spatial state collapse, transformation hallucination, and tracking loss) and show that adaptive modality routing achieves 73.1% accuracy, versus 68.9% when visual CoT is applied to every task (V-CoT-everywhere). We recommend that practitioners use visual CoT for spatial tasks and textual CoT for compositional counting.
Anthology ID:
2026.alvr-main.1
Volume:
Proceedings of the 4th Workshop on Advances in Language and Vision Research (ALVR)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Qianqi Yan, Syrielle Montariol, Yue Fan, Jing Gu, Jiayi Pan, Manling Li, Parisa Kordjamshidi, Alane Suhr, Xin Eric Wang
Venues:
ALVR | WS
Publisher:
Association for Computational Linguistics
Pages:
1–12
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.alvr-main.1/
Cite (ACL):
Ben Jenkins. 2026. Thinking in Pictures: A Diagnostic Study of Visual vs. Textual Chain-of-Thought Reasoning in Vision-Language Models. In Proceedings of the 4th Workshop on Advances in Language and Vision Research (ALVR), pages 1–12, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
Thinking in Pictures: A Diagnostic Study of Visual vs. Textual Chain-of-Thought Reasoning in Vision-Language Models (Jenkins, ALVR 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.alvr-main.1.pdf