Naga Harshita Marupaka


2025

Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling
Prahitha Movva | Naga Harshita Marupaka
Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025)

Scholarly articles convey valuable information not only through unstructured text but also via (semi-)structured figures such as charts and diagrams. Automatically interpreting the semantics of the knowledge encoded in these figures can benefit downstream tasks such as question answering (QA). Current approaches to visual question answering often struggle with the precision required for scientific data interpretation, particularly in handling numerical values, performing multi-step reasoning over visual elements, and maintaining consistency between visual observation and textual reasoning. We present our approach to the SciVQA 2025 shared task, which focuses on answering visual and non-visual questions grounded in scientific figures from scholarly articles. Our strongest individual model, InternVL3, achieved ROUGE-1 and ROUGE-L F1 scores of 0.740 and a BERTScore of 0.983 on the SciVQA test split. We also developed an ensemble of multiple multimodal small language models (MSLMs). Informed by error analysis on the validation split, our ensemble approach achieves significant improvements over individual models, reaching ROUGE-1 and ROUGE-L F1 scores of 0.735 and 0.734, respectively, and a BERTScore of 0.979 on the SciVQA test split. Our findings underscore the effectiveness of prompt optimization, chain-of-thought reasoning, and ensemble modeling in improving model performance on visual question answering.
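The abstract reports ROUGE-1 and ROUGE-L F1 scores, which compare a model's generated answer against a reference answer by unigram overlap and longest common subsequence, respectively. As a minimal illustration (not the shared task's official scoring script, which may differ in tokenization and stemming), the two metrics can be sketched in plain Python:

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    ref, cand = reference.split(), candidate.split()
    # Clipped unigram overlap between the two token multisets
    overlap = sum((Counter(ref) & Counter(cand)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def rougeL_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1: F1 over the longest common subsequence of tokens."""
    ref, cand = reference.split(), candidate.split()
    # Dynamic-programming LCS length
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref):
        for j, c in enumerate(cand):
            dp[i + 1][j + 1] = (
                dp[i][j] + 1 if r == c else max(dp[i][j + 1], dp[i + 1][j])
            )
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

An exact answer scores 1.0 on both metrics; partial overlap yields a score between 0 and 1, which is why these metrics suit the short free-form answers produced in SciVQA.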