Srihari K B
2025
RG-VQA: Leveraging Retriever-Generator Pipelines for Knowledge Intensive Visual Question Answering
Settaluri Lakshmi Sravanthi | Pulkit Agarwal | Debjyoti Mondal | Rituraj Singh | Subhadarshi Panda | Ankit Mishra | Kiran Pradeep | Srihari K B | Godawari Sudhakar Rao | Pushpak Bhattacharyya
Findings of the Association for Computational Linguistics: EMNLP 2025
In this paper, we propose a method to improve the reasoning capabilities of Visual Question Answering (VQA) systems by integrating Dense Passage Retrievers (DPRs) with Vision Language Models (VLMs). While recent works focus on the application of knowledge graphs and chain-of-thought reasoning, we recognize that the complexity of graph neural networks and end-to-end training remain significant challenges. To address these issues, we introduce **R**elevance **G**uided **VQA** (**RG-VQA**), a retriever-generator pipeline that uses DPRs to efficiently extract relevant information from structured knowledge bases. Our approach ensures scalability to large graphs without significant computational overhead. Experiments on the ScienceQA dataset show that RG-VQA achieves state-of-the-art performance, surpassing human accuracy and outperforming GPT-4 by more than . This demonstrates the effectiveness of RG-VQA in boosting the reasoning capabilities of VQA systems and its potential for practical applications.