Mind’s Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

Rohit Sinha, Aditya Sanjiv Kanade, Sai Srinivas Kancheti, Vineeth N. Balasubramanian, Tanuja Ganu


Abstract
Multimodal large language models (MLLMs) have achieved impressive progress on vision-language benchmarks, yet their capacity for cognitive and psychological reasoning remains largely unexplored. We introduce Mind’s Eye, a multiple-choice benchmark of eight visuo-cognitive tasks inspired by classic human intelligence tests and organized under a novel A–R–T taxonomy: Abstraction, Relation, and Transformation. The tasks probe core processes of fluid intelligence such as pattern induction, analogical relation mapping, and mental transformation. We evaluate a diverse suite of closed-source and open-source MLLMs and compare their performance with that of human participants. Humans achieve 80% accuracy, while top-performing MLLMs remain below 50%. Error analysis reveals failures in (i) visual attention allocation, (ii) internal perceptual manipulation, (iii) over-reliance on domain priors, and (iv) weak abstraction of underlying visual concepts. Our findings suggest that current MLLMs exhibit limited fluid reasoning and visuo-cognitive integration compared with human participants, highlighting the need for cognitively grounded evaluation frameworks like Mind’s Eye.
Anthology ID:
2026.acl-long.2124
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
45794–45835
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.2124/
Cite (ACL):
Rohit Sinha, Aditya Sanjiv Kanade, Sai Srinivas Kancheti, Vineeth N. Balasubramanian, and Tanuja Ganu. 2026. Mind’s Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 45794–45835, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Mind’s Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs (Sinha et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.2124.pdf
Checklist:
 2026.acl-long.2124.checklist.pdf