Attention as Selector: Unlocking VLM Attention for Long Document Page Retrieval

Minfeng Zhu; Linxin Bao; Wei Chen; Linchao Zhu

Abstract

Visual Language Models (VLMs) have become a robust foundation for document question answering. Processing long documents remains challenging due to limited context windows and computational budgets. Existing page-level retrieval methods offer a practical solution, typically encoding pages and queries into vectors and ranking them via cosine similarity. However, such embedding-based methods (i) lack query–page interaction before similarity scoring and (ii) usually require large-scale datasets to align visual and textual embeddings. In this paper, we observe that the cross-modal attention maps of well-trained VLMs are able to highlight semantically relevant regions. Building on this insight, we present CAPS (Cross-modal Attention as Page Selector), a retrieval framework that utilizes attention mechanisms inside VLMs for page selection. Specifically, CAPS first enhances attention-based retrieval capability with a small amount of contrastive data, then identifies the most effective attention head through expert head selection, and finally employs an adaptive filtering mechanism to obtain an appropriate number of relevant page candidates. Extensive experiments on four long-document benchmarks demonstrate that CAPS outperforms state-of-the-art embedding-based methods in both retrieval precision and downstream DocQA accuracy. Notably, CAPS achieves these gains using less than 10% of the training data required by competing baselines, highlighting the data efficiency of attention-based page retrieval.
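
As a point of reference for the embedding-based baseline the abstract contrasts against, the following is a minimal sketch of ranking pages by cosine similarity to a query vector. It is not the CAPS method. The function names (`cosine`, `rank_pages`), the `top_k` parameter, and the toy three-dimensional vectors are illustrative assumptions and do not come from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_pages(query_vec, page_vecs, top_k=2):
    """Return the top_k (page_index, score) pairs, highest similarity first."""
    scores = [(idx, cosine(query_vec, vec)) for idx, vec in enumerate(page_vecs)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]

# Toy embeddings (assumed values); a real system would use encoder outputs.
query = [1.0, 0.0, 1.0]
pages = [[0.9, 0.1, 0.8], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]
top = rank_pages(query, pages)
```

In this setup the query and each page are encoded independently and only meet at the similarity score, which is the lack of query–page interaction the abstract identifies as limitation (i).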

Anthology ID: 2026.acl-long.1117
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 24344–24365
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1117/
Cite (ACL): Minfeng Zhu, Linxin Bao, Wei Chen, and Linchao Zhu. 2026. Attention as Selector: Unlocking VLM Attention for Long Document Page Retrieval. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 24344–24365, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Attention as Selector: Unlocking VLM Attention for Long Document Page Retrieval (Zhu et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1117.pdf
Checklist: 2026.acl-long.1117.checklist.pdf
