Visual Question Answering on Scientific Charts Using Fine-Tuned Vision-Language Models

Florian Schleid; Jan Strich; Chris Biemann

Abstract

Scientific charts often encapsulate the core findings of research papers, making the ability to answer questions about these charts highly valuable. This paper explores recent advancements in scientific chart visual question answering (VQA) enabled by large Vision-Language Models (VLMs) and newly curated datasets. As part of the SciVQA shared task at the 5th Workshop on Scholarly Document Processing, we develop and evaluate multimodal systems capable of answering diverse question types, including multiple-choice, yes/no, unanswerable, and infinite answer set questions, based on chart images extracted from scientific literature. We investigate the effects of zero-shot and one-shot prompting, as well as supervised fine-tuning (SFT), on the performance of Qwen2.5-VL models (7B and 32B variants). We also experiment with additional training data from domain-specific datasets (SpiQA and ArXivQA). Our fine-tuned Qwen2.5-VL 32B model achieves a substantial improvement over the GPT-4o-mini baseline and places 4th in the shared task, highlighting the effectiveness of domain-specific fine-tuning. We have published the code for the experiments.

Anthology ID: 2025.sdp-1.19
Volume: Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Tirthankar Ghosal, Philipp Mayr, Amanpreet Singh, Aakanksha Naik, Georg Rehm, Dayne Freitag, Dan Li, Sonja Schimmler, Anita De Waard
Venues: sdp | WS
Publisher: Association for Computational Linguistics
Pages: 211–220
URL: https://preview.aclanthology.org/landing_page/2025.sdp-1.19/
Cite (ACL): Florian Schleid, Jan Strich, and Chris Biemann. 2025. Visual Question Answering on Scientific Charts Using Fine-Tuned Vision-Language Models. In Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025), pages 211–220, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Visual Question Answering on Scientific Charts Using Fine-Tuned Vision-Language Models (Schleid et al., sdp 2025)
PDF: https://preview.aclanthology.org/landing_page/2025.sdp-1.19.pdf
