Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions
Ioanna Ntinou, Alexandros Xenos, Yassine Ouali, Adrian Bulat, Georgios Tzimiropoulos
Abstract
Contrastively-trained Vision-Language Models (VLMs), such as CLIP, have become the standard approach for learning discriminative vision-language representations. However, these models often exhibit shallow language understanding, manifesting bag-of-words behaviour. These limitations are reinforced by their dual-encoder design, which induces a modality gap. Additionally, the reliance on vast web-collected data corpora for training makes the process computationally expensive and introduces significant privacy concerns. To address these limitations, in this work, we challenge the necessity of vision encoders for retrieval tasks by introducing a vision-free, single-encoder retrieval pipeline. Departing from the traditional text-to-image retrieval paradigm, we migrate to a text-to-text paradigm with the assistance of VLLM-generated structured image descriptions. We demonstrate that this paradigm shift has significant advantages, including a substantial reduction of the modality gap, improved compositionality, and better performance on short and long caption queries, all attainable with only two hours of calibration on two GPUs. Additionally, substituting raw images with textual descriptions introduces a more privacy-friendly alternative for retrieval. To further assess generalisation and address some of the shortcomings of prior compositionality benchmarks, we release two benchmarks derived from Flickr30k and COCO, containing diverse compositional queries made of short captions, which we coin subFlickr and subCOCO. Our vision-free retriever matches and often surpasses traditional multimodal models. Importantly, our approach achieves state-of-the-art zero-shot performance on multiple retrieval and compositionality benchmarks, with models as small as 0.3B parameters.
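To make the text-to-text retrieval idea from the abstract concrete, here is a minimal sketch of the pipeline's retrieval step. It assumes the structured scene descriptions have already been generated offline by a VLLM (the example descriptions below are invented), and it uses an off-the-shelf sentence encoder ("all-MiniLM-L6-v2") purely as a stand-in for whichever single text encoder the paper calibrates; this is an illustration of the paradigm, not the authors' implementation.

```python
# Sketch: text-to-text retrieval over VLLM-generated scene descriptions.
# Assumptions (not from the paper): example descriptions and the encoder
# checkpoint below are placeholders for illustration only.
import numpy as np
from sentence_transformers import SentenceTransformer

# Offline step (assumed): one VLLM-generated structured description per image.
descriptions = [
    "A brown dog leaps over a fallen log in a sunlit forest clearing.",
    "Two children in raincoats share a red umbrella at a bus stop.",
    "A chef plates grilled salmon next to a bowl of green salad.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Index: embed the textual descriptions instead of the images themselves,
# so query and gallery live in the same (text) embedding space.
desc_emb = encoder.encode(descriptions, normalize_embeddings=True)

def retrieve(query: str, top_k: int = 2):
    """Rank images by cosine similarity between the query and their descriptions."""
    q = encoder.encode([query], normalize_embeddings=True)
    scores = (desc_emb @ q.T).squeeze(-1)  # cosine similarity (unit vectors)
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]

print(retrieve("dog jumping over a log"))
```

Because both sides of the similarity are text, there is no modality gap by construction, which is the property the abstract highlights.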
- Anthology ID:
- 2025.emnlp-main.709
- Volume:
- Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 14057–14073
- URL:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.709/
- Cite (ACL):
- Ioanna Ntinou, Alexandros Xenos, Yassine Ouali, Adrian Bulat, and Georgios Tzimiropoulos. 2025. Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 14057–14073, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions (Ntinou et al., EMNLP 2025)
- PDF:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.709.pdf