Ground Then Rank: Revisiting Knowledge-Based VQA with Training-Free Entity Identification

Qian Ma, Qiong Wu, Zhengyi Zhou, Yao Ma


Abstract
Knowledge-Based Visual Question Answering (KB-VQA) requires grounding visual queries in external knowledge beyond the directly observable content of images. While recent multi-modal large language models (MLLMs) show strong perceptual abilities, they struggle on KB-VQA tasks that require grounding at both the fine-grained entity level and the evidence level. Most existing multi-modal retrieval-augmented generation (MM-RAG) methods tightly couple entity discrimination and section-level evidence ranking into a single re-ranking stage, leading to high cost and limited generalization. In this work, we revisit existing MM-RAG solutions from a workflow perspective and argue that both entity-level and fact-level grounding are key bottlenecks. We observe that although MLLMs often fail under open-ended entity naming, they can better identify the correct entity when selecting from a small set of candidate names. Based on this insight, we propose a simple, training-free identify-before-answer (IBA) framework that decouples entity identification from section-level re-ranking. Our approach prompts an MLLM to select high-confidence entities using only candidate names, followed by an off-the-shelf textual re-ranker for evidence selection. Experiments on Encyclopedic-VQA and InfoSeek show that our method consistently outperforms fine-tuned multi-modal re-ranking baselines while reducing training and inference complexity. Additional analyses reveal that the improvements arise not only from better entity identification, but also from selecting more informative evidence once the correct entity is fixed. Our implementation is made public to ease reproducibility.
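The two-stage workflow described in the abstract (select entities from candidate names first, then rank sections of only those entities) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `identify_then_rank` and `token_overlap_score`, the `select_entities` callback standing in for the MLLM prompt, the dictionary-shaped knowledge base, and the token-overlap scorer standing in for an off-the-shelf textual re-ranker are all assumptions made for illustration.

```python
import re
from typing import Callable, Dict, List, Tuple


def token_overlap_score(question: str, section: str) -> float:
    """Hypothetical stand-in for a textual re-ranker:
    fraction of question tokens that also appear in the section."""
    q_tokens = set(re.findall(r"\w+", question.lower()))
    s_tokens = set(re.findall(r"\w+", section.lower()))
    return len(q_tokens & s_tokens) / len(q_tokens) if q_tokens else 0.0


def identify_then_rank(
    question: str,
    candidate_names: List[str],
    select_entities: Callable[[str, List[str]], List[str]],
    knowledge_base: Dict[str, List[str]],
    score: Callable[[str, str], float] = token_overlap_score,
    top_k: int = 1,
) -> List[Tuple[str, str]]:
    """Stage 1: pick entities from candidate names only.
    Stage 2: rank the sections of the picked entities against the question."""
    # Stage 1 (assumed interface): the selector sees names, not section text.
    # Any name it returns that is not a candidate is discarded.
    chosen = [n for n in select_entities(question, candidate_names)
              if n in candidate_names]

    # Stage 2: gather sections for the chosen entities and sort by score.
    sections = [(name, sec) for name in chosen
                for sec in knowledge_base.get(name, [])]
    ranked = sorted(sections, key=lambda pair: score(question, pair[1]),
                    reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    # Toy data, invented for this example.
    kb = {
        "Eiffel Tower": ["The tower was completed in 1889.",
                         "It is located in Paris."],
        "Tokyo Tower": ["The tower was completed in 1958."],
    }
    # A fixed stub replaces the MLLM call for demonstration purposes.
    stub_selector = lambda question, names: ["Eiffel Tower"]
    print(identify_then_rank("When was the tower completed?",
                             list(kb), stub_selector, kb))
```

Because the selector and the scorer are passed in as plain callables, either stage can be swapped independently, which mirrors the decoupling the abstract argues for.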
Anthology ID:
2026.findings-acl.563
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
11617–11635
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.563/
Cite (ACL):
Qian Ma, Qiong Wu, Zhengyi Zhou, and Yao Ma. 2026. Ground Then Rank: Revisiting Knowledge-Based VQA with Training-Free Entity Identification. In Findings of the Association for Computational Linguistics: ACL 2026, pages 11617–11635, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Ground Then Rank: Revisiting Knowledge-Based VQA with Training-Free Entity Identification (Ma et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.563.pdf
Checklist:
 2026.findings-acl.563.checklist.pdf