Generative Giants, Retrieval Weaklings: Why do Multimodal Large Language Models Fail at Multimodal Retrieval?

Hengyi Feng; Zeang Sheng; Meiyi Qiang; Yang Li; Wentao Zhang

Generative Giants, Retrieval Weaklings: Why do Multimodal Large Language Models Fail at Multimodal Retrieval?

Hengyi Feng, Zeang Sheng, Meiyi Qiang, Yang Li, Wentao Zhang

Abstract

Despite the remarkable success of multimodal large language models (MLLMs) in generative tasks, we observe that they exhibit a counterintuitive deficiency in the zero-shot multimodal retrieval task. In this work, we investigate the underlying mechanisms that hinder MLLMs from being effective retrievers. With the help of sparse autoencoders (SAEs), we decompose MLLM output representations into interpretable semantic concepts to probe their intrinsic behavior. Our analysis reveals that the representation space of MLLMs is overwhelmingly dominated by textual semantics; and the visual semantics essential for multimodal retrieval only constitute a small portion. We find that this imbalance is compounded by the heavy focus of MLLMs on bridging image-text modalities, which facilitates generation but homogenizes embeddings and finally diminishes the discriminative power required for multimodal retrieval. We further discover that the specific feature components that contribute most to the similarity computations of MLLMs are actually distractors that greatly reduce retrieval performance. Building on these insights, we propose , a test-time adaptation approach that applies a whitening transformation to adjust the geometry of MLLM representation spaces. Empirical results show that this simple intervention consistently improves zero-shot multimodal retrieval performance across diverse MLLMs without fine-tuning efforts.

Anthology ID:: 2026.findings-acl.781
Volume:: Findings of the Association for Computational Linguistics: ACL 2026
Month:: July
Year:: 2026
Address:: San Diego, California, United States
Editors:: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:: Findings
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 15917–15933
Language:
URL:: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.781/
DOI:
Bibkey:
Cite (ACL):: Hengyi Feng, Zeang Sheng, Meiyi Qiang, Yang Li, and Wentao Zhang. 2026. Generative Giants, Retrieval Weaklings: Why do Multimodal Large Language Models Fail at Multimodal Retrieval?. In Findings of the Association for Computational Linguistics: ACL 2026, pages 15917–15933, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):: Generative Giants, Retrieval Weaklings: Why do Multimodal Large Language Models Fail at Multimodal Retrieval? (Feng et al., Findings 2026)
Copy Citation:
PDF:: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.781.pdf
Checklist:: 2026.findings-acl.781.checklist.pdf

PDF Cite Search Checklist Fix data