Srijon Sarkar


2025

ExpertNeurons at SciVQA-2025: Retrieval Augmented VQA with Vision Language Model (RAVQA-VLM)
Nagaraj N Bhat | Joydeb Mondal | Srijon Sarkar
Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025)

We introduce RAVQA-VLM, a novel Retrieval-Augmented Generation (RAG) architecture with a Vision Language Model for the SciVQA challenge, which targets closed-ended visual and non-visual questions over scientific figures drawn from ACL Anthology and arXiv papers (Borisova and Rehm, 2025). Our system first encodes each input figure and its accompanying metadata (caption, figure ID, type) into dense embeddings, then retrieves context passages from the full PDF of the source paper via a Dense Passage Retriever (Karpukhin et al., 2020). The retrieved contexts are concatenated with the question and passed to a vision-capable generative backbone (e.g., Phi-3.5, Pixtral-12B, Mixtral-24B-small, InterVL-3-14B) fine-tuned on the 15.1K SciVQA training examples (Yang et al., 2023; Pramanick et al., 2024). We jointly optimize retrieval and generation end-to-end to minimize answer loss and mitigate hallucinations (Lewis et al., 2020; Han and Castelli, 2024). On the SciVQA test set, RAVQA-VLM achieves significant improvements over parametric-only baselines, with relative gains of +5% ROUGE-1 and +5% ROUGE-L, demonstrating the efficacy of RAG for multimodal scientific QA.
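
The abstract describes a retrieve-then-generate pipeline: question plus figure metadata are embedded, passages from the source PDF are ranked by a Dense Passage Retriever, and the top passages are concatenated with the question for the vision-language backbone. The sketch below illustrates just that retrieval-and-prompting step using the public Hugging Face DPR checkpoints; the toy passages, the metadata string, and the prompt template are illustrative assumptions, not the authors' exact implementation, and the final VLM generation call is omitted.

```python
# Minimal sketch of the retrieve-then-prompt step: embed the question
# (with figure metadata) and candidate passages with DPR, rank passages
# by dot-product score, and assemble the textual prompt that a
# vision-capable generator would consume alongside the figure image.
import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

ctx_name = "facebook/dpr-ctx_encoder-single-nq-base"
q_name = "facebook/dpr-question_encoder-single-nq-base"
ctx_enc = DPRContextEncoder.from_pretrained(ctx_name)
ctx_tok = DPRContextEncoderTokenizer.from_pretrained(ctx_name)
q_enc = DPRQuestionEncoder.from_pretrained(q_name)
q_tok = DPRQuestionEncoderTokenizer.from_pretrained(q_name)

# Toy passages standing in for text chunks extracted from the source PDF.
passages = [
    "Figure 3 reports accuracy as a function of training set size.",
    "We evaluate on the SciVQA benchmark of scientific figures.",
    "Table 2 lists the hyperparameters used for fine-tuning.",
]
question = "What quantity is plotted on the y-axis of Figure 3?"
metadata = "caption: Accuracy vs. training set size | figure id: 3 | type: line chart"

with torch.no_grad():
    # Dense embeddings for question+metadata and for each candidate passage.
    q_emb = q_enc(**q_tok(f"{question} {metadata}",
                          return_tensors="pt")).pooler_output
    p_emb = ctx_enc(**ctx_tok(passages, return_tensors="pt", padding=True,
                              truncation=True)).pooler_output

# Rank passages by dot-product relevance and keep the top k.
scores = (q_emb @ p_emb.T).squeeze(0)
top_k = scores.topk(k=2).indices.tolist()
context = "\n".join(passages[i] for i in top_k)

# Prompt handed to the vision-language backbone together with the figure
# image; the generation call depends on the chosen VLM and is not shown.
prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
print(prompt)
```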