Nagaraj N Bhat


2025

ExpertNeurons at SciVQA-2025: Retrieval Augmented VQA with Vision Language Model (RAVQA-VLM)
Nagaraj N Bhat | Joydeb Mondal | Srijon Sarkar
Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025)

We introduce RAVQA-VLM, a novel Retrieval-Augmented Generation (RAG) architecture with a Vision Language Model for the SciVQA challenge, which targets closed-ended visual and non-visual questions over scientific figures drawn from ACL Anthology and arXiv papers (Borisova and Rehm, 2025). Our system first encodes each input figure and its accompanying metadata (caption, figure ID, type) into dense embeddings, then retrieves context passages from the full PDF of the source paper via a Dense Passage Retriever (Karpukhin et al., 2020). The retrieved contexts are concatenated with the question and passed to a vision-capable generative backbone (e.g., Phi-3.5, Pixtral-12B, Mixtral-24B-small, InterVL-3-14B) fine-tuned on the 15.1K SciVQA training examples (Yang et al., 2023; Pramanick et al., 2024). We jointly optimize retrieval and generation end-to-end to minimize answer loss and mitigate hallucinations (Lewis et al., 2020; Rujun Han and Castelli, 2024). On the SciVQA test set, RAVQA-VLM achieves significant improvements over parametric-only baselines, with relative gains of +5% ROUGE-1 and +5% ROUGE-L, demonstrating the efficacy of RAG for multimodal scientific QA.
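
A minimal Python sketch of the retrieve-then-generate flow the abstract describes, for illustration only: embed, retrieve, vlm_generate, and answer are hypothetical stand-ins for the DPR-style retriever and the fine-tuned vision-language backbone named above, not the authors' implementation, and the toy embedding merely makes the sketch runnable.

import numpy as np

def embed(text: str, dim: int = 768) -> np.ndarray:
    # Toy stand-in for a dense encoder; the paper uses a DPR-style bi-encoder
    # over figure metadata and passages extracted from the source PDF.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def retrieve(query: str, passages: list[str], k: int = 3) -> list[str]:
    # Rank full-paper passages by cosine similarity to the query embedding.
    q = embed(query)
    return sorted(passages, key=lambda p: float(q @ embed(p)), reverse=True)[:k]

def vlm_generate(prompt: str) -> str:
    # Placeholder for a vision-capable backbone (e.g., Phi-3.5 or Pixtral-12B);
    # the figure image itself would also be passed to the model, omitted here.
    return "<generated answer>"

def answer(question: str, figure_caption: str, passages: list[str]) -> str:
    # Concatenate retrieved context with the question and prompt the backbone.
    context = "\n".join(retrieve(f"{figure_caption} {question}", passages))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return vlm_generate(prompt)

In the full system, retrieval and generation are optimized jointly end-to-end on answer loss; the sketch above does not attempt to reproduce that training step.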