Zero-Shot Multimodal Retrieval with Multi-Scale Contextual Representations

Sourajit Saha, Tejas Gokhale


Abstract
In multimodal information retrieval (MMIR), candidates relevant to an input query need to be retrieved from a database, where the query and database items span different modalities. As real-world databases evolve, repeatedly annotating and indexing data and re-optimizing domain-specific models across modalities is impractical. We present MULTI-SCORE, a fine-tuning-free, two-stage MMIR approach that couples efficient candidate filtering with fine-grained multimodal re-ranking. Stage-1 adopts Matryoshka representations to efficiently filter out low-relevance candidates without expensive similarity computations on full-scale representations for the entire database. Stage-2 re-ranks the filtered candidates by computing their fine-grained multimodal contextual representations with two scoring functions for semantic alignment using chain-of-thought prompting and question-answering. Experiments demonstrate state-of-the-art zero-shot retrieval on 12 MMIR tasks across 32 datasets while outperforming supervised methods on 23 datasets.
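The two-stage flow described in the abstract can be illustrated with a minimal sketch. It assumes embeddings are already available as plain lists of floats and that, in Matryoshka style, a prefix of each vector is itself a usable coarse representation. All function names here (`coarse_filter`, `rerank`, `two_stage_retrieve`) are hypothetical, and the full-dimension cosine used in the second stage is only a stand-in: the paper's Stage-2 scoring functions rely on chain-of-thought prompting and question-answering, which are not specified on this page.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def coarse_filter(query: Vector, database: List[Vector], dim: int, keep: int) -> List[int]:
    """Stage 1 (sketch): score every item on the first `dim` coordinates only
    and return the indices of the `keep` best-scoring items."""
    scored = [(cosine(query[:dim], item[:dim]), idx) for idx, item in enumerate(database)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, ties by index
    return [idx for _, idx in scored[:keep]]


def rerank(
    query: Vector,
    database: List[Vector],
    candidates: List[int],
    score_fn: Callable[[Vector, Vector], float],
) -> List[int]:
    """Stage 2 (sketch): re-score only the surviving candidates with a finer function."""
    return sorted(candidates, key=lambda idx: (-score_fn(query, database[idx]), idx))


def two_stage_retrieve(
    query: Vector,
    database: List[Vector],
    dim: int,
    keep: int,
    score_fn: Callable[[Vector, Vector], float] = cosine,  # assumption: placeholder scorer
) -> List[int]:
    """Filter cheaply on truncated vectors, then re-rank the short list."""
    return rerank(query, database, coarse_filter(query, database, dim, keep), score_fn)


if __name__ == "__main__":
    query = [1.0, 0.0, 1.0, 0.0]
    database = [
        [1.0, 0.0, -1.0, 0.0],
        [0.9, 0.1, 1.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [-1.0, 0.0, 0.0, 0.0],
    ]
    print(two_stage_retrieve(query, database, dim=2, keep=2))
```

In this toy example the coarse stage touches every item but only two coordinates of each, while the finer scorer runs on just the two surviving candidates. That is the cost trade-off the abstract describes, shown here at a scale far smaller than any real database.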
Anthology ID:
2026.acl-long.930
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
20304–20324
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.930/
Cite (ACL):
Sourajit Saha and Tejas Gokhale. 2026. Zero-Shot Multimodal Retrieval with Multi-Scale Contextual Representations. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 20304–20324, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Zero-Shot Multimodal Retrieval with Multi-Scale Contextual Representations (Saha & Gokhale, ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.930.pdf
Checklist:
 2026.acl-long.930.checklist.pdf