RG-VQA: Leveraging Retriever-Generator Pipelines for Knowledge Intensive Visual Question Answering

Settaluri Lakshmi Sravanthi; Pulkit Agarwal; Debjyoti Mondal; Rituraj Singh; Subhadarshi Panda; Ankit Mishra; Kiran Pradeep; Srihari K B; Godawari Sudhakar Rao; Pushpak Bhattacharyya

doi:10.18653/v1/2025.findings-emnlp.1306

Abstract

In this paper, we propose a method to improve the reasoning capabilities of Visual Question Answering (VQA) systems by integrating Dense Passage Retrievers (DPRs) with Vision Language Models (VLMs). While recent works focus on the application of knowledge graphs and chain-of-thought reasoning, we recognize that the complexity of graph neural networks and end-to-end training remain significant challenges. To address these issues, we introduce **R**elevance **G**uided **VQA** (**RG-VQA**), a retriever-generator pipeline that uses DPRs to efficiently extract relevant information from structured knowledge bases. Our approach ensures scalability to large graphs without significant computational overhead. Experiments on the ScienceQA dataset show that RG-VQA achieves state-of-the-art performance, surpassing human accuracy and outperforming GPT-4. This demonstrates the effectiveness of RG-VQA in boosting the reasoning capabilities of VQA systems and its potential for practical applications.
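
The retriever-generator idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `embed`, `retrieve`, and `build_prompt` are hypothetical, the hashed bag-of-words encoder is only a stand-in for a trained DPR encoder, and the call to a VLM generator is omitted. The sketch shows only the general pattern of scoring knowledge-base facts by inner product with the question and passing the top-ranked facts to the generator as context.

```python
import hashlib
import math
import re


def embed(text, dim=512):
    """Toy stand-in for a learned dense encoder (assumption, not DPR itself):
    hashed bag-of-words vector, L2-normalised."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # md5 gives a hash that is stable across processes, unlike hash().
        idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def retrieve(question, passages, k=2):
    """Rank passages by inner product with the question embedding, keep top k."""
    q = embed(question)
    scored = [(sum(a * b for a, b in zip(q, embed(p))), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]


def build_prompt(question, facts):
    """Prepend the retrieved facts to the question; a generator would consume
    this text together with the image (the generator call is omitted here)."""
    context = "\n".join(f"- {fact}" for fact in facts)
    return f"Context:\n{context}\nQuestion: {question}\nAnswer:"
```

Calling `retrieve(question, passages)` on a small list of fact strings and feeding the result to `build_prompt` yields the text that a generator would receive. In a real system the toy encoder would be replaced by trained question and passage encoders and an approximate nearest-neighbour index.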

Anthology ID: 2025.findings-emnlp.1306
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 24048–24060
URL: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.1306/
DOI: 10.18653/v1/2025.findings-emnlp.1306
Cite (ACL): Settaluri Lakshmi Sravanthi, Pulkit Agarwal, Debjyoti Mondal, Rituraj Singh, Subhadarshi Panda, Ankit Mishra, Kiran Pradeep, Srihari K B, Godawari Sudhakar Rao, and Pushpak Bhattacharyya. 2025. RG-VQA: Leveraging Retriever-Generator Pipelines for Knowledge Intensive Visual Question Answering. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 24048–24060, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): RG-VQA: Leveraging Retriever-Generator Pipelines for Knowledge Intensive Visual Question Answering (Sravanthi et al., Findings 2025)
PDF: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.1306.pdf
Checklist: 2025.findings-emnlp.1306.checklist.pdf
