@inproceedings{nguyen-etal-2025-serval,
title = "{SERVAL}: Surprisingly Effective Zero-Shot Visual Document Retrieval Powered by Large Vision and Language Models",
author = "Nguyen, Thong and
Lei, Yibin and
Ju, Jia-Huei and
Yates, Andrew",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1568/",
pages = "30795--30810",
ISBN = "979-8-89176-332-6",
abstract = "Visual Document Retrieval (VDR) typically operates as text-to-image retrieval using specialized bi-encoders trained to directly embed document images. We revisit a zero-shot generate-and-encode pipeline: a vision{--}language model first produces a detailed textual description of each document image, which is then embedded by a standard text encoder. On the ViDoRe-v2 benchmark, the method reaches 63.4{\%} nDCG@5, surpassing the strongest specialised multi-vector visual document encoder, and it scales similarly on MIRACL-VISION with broader multilingual coverage. Analysis shows that modern vision{--}language models capture complex textual and visual cues with sufficient granularity to act as a reusable semantic proxy. By off-loading modality alignment to pretrained vision{--}language models, our approach removes the need for computationally intensive text-image contrastive training and establishes a strong zero-shot baseline for future VDR systems."
}
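
The abstract describes a two-stage generate-and-encode pipeline: a vision-language model first writes a detailed textual description of each document image, and a standard text encoder then embeds those descriptions for ordinary dense retrieval. The sketch below illustrates that flow under stated assumptions; the model names, file paths, and query are illustrative placeholders, not the paper's released code or exact configuration.

```python
# Sketch of the zero-shot generate-and-encode pipeline from the abstract:
# (1) a vision-language model produces a textual description of each page image,
# (2) an off-the-shelf text encoder embeds descriptions and queries,
# (3) retrieval reduces to cosine similarity over text embeddings.
# Model choices here are assumptions for illustration, not the paper's setup.
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

# Step 1: VLM as a semantic proxy -- image -> text description.
describer = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def describe(image_path: str) -> str:
    # Returns the generated description for one document image.
    return describer(image_path)[0]["generated_text"]

# Step 2: embed the generated descriptions once, offline.
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

doc_images = ["page_001.png", "page_002.png"]  # hypothetical corpus
doc_texts = [describe(p) for p in doc_images]
doc_emb = encoder.encode(doc_texts, convert_to_tensor=True)

# Step 3: embed the query with the same encoder and rank pages by cosine similarity.
query_emb = encoder.encode("quarterly revenue table", convert_to_tensor=True)
scores = util.cos_sim(query_emb, doc_emb)[0]
best = int(scores.argmax())
print(doc_images[best], float(scores[best]))
```

Because modality alignment is off-loaded to the pretrained vision-language model in step 1, no text-image contrastive training is needed: steps 2 and 3 are standard zero-shot text retrieval.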