Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation

Weiqing Luo; Zongye Hu; Xiao Wang; Zhiyuan Yu; Haofeng Zhang; Ziyi Huang (黄子怡)

Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation

Weiqing Luo, Zongye Hu, Xiao Wang, Zhiyuan Yu, Haofeng Zhang, Ziyi Huang

Abstract

Visual evidence selection is a critical component of multimodal retrieval-augmented generation (RAG), yet existing methods typically rely on semantic relevance or surface-level similarity, which are often misaligned with the actual utility of visual evidence for downstream reasoning. We reformulate multimodal evidence selection from an information-theoretic perspective by defining evidence utility as the information gain induced on a model’s output distribution. To overcome the intractability of answer-space optimization, we introduce a latent notion of evidence helpfulness and theoretically show that, under mild assumptions, ranking evidence by information gain on this latent variable is equivalent to answer-space utility. We further propose a training-free, surrogate-accelerated framework that efficiently estimates evidence utility using lightweight multimodal models. Experiments on MRAG-Bench and Visual-RAG across multiple model families demonstrate that our method consistently outperforms state-of-the-art RAG baselines while achieving substantial reductions in computational cost. We release our code at https://github.com/Hcnaeg/utility-mrag.

Anthology ID:: 2026.acl-long.1620
Volume:: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:: July
Year:: 2026
Address:: San Diego, California, United States
Editors:: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:: ACL
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 35091–35124
Language:
URL:: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1620/
DOI:
Bibkey:
Cite (ACL):: Weiqing Luo, Zongye Hu, Xiao Wang, Zhiyuan Yu, Haofeng Zhang, and Ziyi Huang. 2026. Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 35091–35124, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):: Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation (Luo et al., ACL 2026)
Copy Citation:
PDF:: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1620.pdf
Checklist:: 2026.acl-long.1620.checklist.pdf

PDF Cite Search Checklist Fix data