ExpertNeurons at SciVQA-2025: Retrieval Augmented VQA with Vision Language Model (RAVQA-VLM)

Nagaraj N Bhat, Joydeb Mondal, Srijon Sarkar


Abstract
We introduce RAVQA-VLM, a novel Retrieval-Augmented Generation (RAG) architecture with a Vision Language Model for the SciVQA challenge, which targets closed-ended visual and non-visual questions over scientific figures drawn from ACL Anthology and arXiv papers (Borisova and Rehm, 2025). Our system first encodes each input figure and its accompanying metadata (caption, figure ID, type) into dense embeddings, then retrieves context passages from the full PDF of the source paper via a Dense Passage Retriever (Karpukhin et al., 2020). The retrieved contexts are concatenated with the question and passed to a vision-capable generative backbone (e.g., Phi-3.5, Pixtral-12B, Mixtral-24B-small, InternVL-3-14B) fine-tuned on the 15.1K SciVQA training examples (Yang et al., 2023; Pramanick et al., 2024). We jointly optimize retrieval and generation end-to-end to minimize answer loss and mitigate hallucinations (Lewis et al., 2020; Han and Castelli, 2024). On the SciVQA test set, RAVQA-VLM achieves significant improvements over parametric-only baselines, with relative gains of +5% ROUGE-1 and +5% ROUGE-L, demonstrating the efficacy of RAG for multimodal scientific QA.
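The abstract's retrieve-then-generate flow (encode the figure query, retrieve passages from the source paper, concatenate contexts with the question for the generative backbone) can be sketched as follows. This is a toy illustration, not the authors' implementation: the bag-of-words cosine embedding stands in for the DPR encoder, and all function and field names (`embed`, `retrieve`, `build_prompt`) are hypothetical.

```python
# Schematic sketch of a retrieve-then-generate pipeline like the one the
# abstract describes. Bag-of-words cosine is a toy stand-in for the dense
# DPR encoder; the final prompt would go to a vision-capable backbone.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "dense" embedding: token counts (DPR would use a trained encoder).
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    # Rank passages from the full paper by similarity to the figure+question query.
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

def build_prompt(caption: str, fig_type: str, question: str, contexts: list[str]) -> str:
    # Concatenate retrieved contexts with the figure metadata and the question,
    # as the abstract describes, before passing everything to the VLM.
    ctx = "\n".join(f"[CTX] {c}" for c in contexts)
    return f"Figure type: {fig_type}\nCaption: {caption}\n{ctx}\nQuestion: {question}\nAnswer:"

passages = [
    "Table 2 reports BLEU scores for the baseline systems.",
    "Figure 3 shows accuracy rising from 72% to 85% as model size grows.",
    "We describe the annotation guidelines in Appendix A.",
]
question = "What accuracy does the largest model reach in Figure 3?"
contexts = retrieve("Figure 3 accuracy model size " + question, passages, k=1)
prompt = build_prompt("Accuracy vs. model size.", "line chart", question, contexts)
print(prompt)
```

In the real system the retrieval and generation components are trained jointly against the answer loss, so the retriever learns to surface passages that actually help the VLM answer.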
Anthology ID:
2025.sdp-1.20
Volume:
Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Tirthankar Ghosal, Philipp Mayr, Amanpreet Singh, Aakanksha Naik, Georg Rehm, Dayne Freitag, Dan Li, Sonja Schimmler, Anita De Waard
Venues:
sdp | WS
Publisher:
Association for Computational Linguistics
Pages:
221–229
URL:
https://preview.aclanthology.org/mtsummit-25-ingestion/2025.sdp-1.20/
DOI:
10.18653/v1/2025.sdp-1.20
Cite (ACL):
Nagaraj N Bhat, Joydeb Mondal, and Srijon Sarkar. 2025. ExpertNeurons at SciVQA-2025: Retrieval Augmented VQA with Vision Language Model (RAVQA-VLM). In Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025), pages 221–229, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
ExpertNeurons at SciVQA-2025: Retrieval Augmented VQA with Vision Language Model (RAVQA-VLM) (Bhat et al., sdp 2025)
PDF:
https://preview.aclanthology.org/mtsummit-25-ingestion/2025.sdp-1.20.pdf