Sai Srinivas Kancheti


2026

Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT) based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of sixteen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that MRMs and CoT-prompted MLMs suffer from severe shortcut learning and hallucinate visual details from textual priors even when the image is absent. These findings challenge the efficacy of text-only CoT for spatial tasks and underscore the need for vision-centric reasoning paradigms.
Multimodal large language models (MLLMs) have achieved impressive progress on vision-language benchmarks, yet their capacity for cognitive and psychological reasoning remains largely unexplored. We introduce Mind’s Eye, a multiple-choice benchmark of eight visuo-cognitive tasks inspired by classic human intelligence tests and organized under a novel A–R–T taxonomy: Abstraction, Relation, and Transformation. The tasks probe core processes of fluid intelligence such as pattern induction, analogical Relation mapping, and mental Transformation. We evaluate a diverse suite of closed-source and open-source MLLMs and compare their performance with that of human participants. Humans achieve 80% accuracy, while top-performing MLLMs remain below 50%. Error analysis reveals failures in (i) visual attention allocation, (ii) internal perceptual manipulation, (iii) over-reliance on domain priors, and (iv) weak abstraction of underlying visual concepts. Our findings suggest that current MLLMs exhibit limited fluid reasoning and visuo-cognitive integration compared with human participants, highlighting the need for cognitively grounded evaluation frameworks like Mind’s Eye.