Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling

Prahitha Movva; Naga Harshita Marupaka

Abstract

Scholarly articles convey valuable information not only through unstructured text but also via (semi-)structured figures such as charts and diagrams. Automatically interpreting the semantics of the knowledge encoded in these figures can benefit downstream tasks such as question answering (QA). Current approaches to visual question answering often struggle with the precision required for scientific data interpretation, particularly in handling numerical values, performing multi-step reasoning over visual elements, and maintaining consistency between visual observation and textual reasoning. We present our approach to the SciVQA 2025 shared task, which focuses on answering visual and non-visual questions grounded in scientific figures from scholarly articles. Our strongest individual model, InternVL3, achieved ROUGE-1 and ROUGE-L F1 scores of 0.740 and a BERTScore of 0.983 on the SciVQA test split. We also developed an ensemble of multiple multimodal small language models (MSLMs). Through error analysis on the validation split, our ensemble approach achieves significant improvements over individual models; on the SciVQA test split, it achieved ROUGE-1 and ROUGE-L F1 scores of 0.735 and 0.734, respectively, and a BERTScore of 0.979. Our findings underscore the effectiveness of prompt optimization, chain-of-thought reasoning, and ensemble modeling in improving model performance on visual question answering.

Anthology ID: 2025.sdp-1.23
Volume: Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Tirthankar Ghosal, Philipp Mayr, Amanpreet Singh, Aakanksha Naik, Georg Rehm, Dayne Freitag, Dan Li, Sonja Schimmler, Anita De Waard
Venues: sdp | WS
Publisher: Association for Computational Linguistics
Pages: 252–262
URL: https://preview.aclanthology.org/display_plenaries/2025.sdp-1.23/
Cite (ACL): Prahitha Movva and Naga Harshita Marupaka. 2025. Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling. In Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025), pages 252–262, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling (Movva & Marupaka, sdp 2025)
PDF: https://preview.aclanthology.org/display_plenaries/2025.sdp-1.23.pdf
