@inproceedings{borisova-etal-2025-scivqa,
title = "{S}ci{VQA} 2025: Overview of the First Scientific Visual Question Answering Shared Task",
author = "Borisova, Ekaterina and
Rauscher, Nikolas and
Rehm, Georg",
editor = "Ghosal, Tirthankar and
Mayr, Philipp and
Singh, Amanpreet and
Naik, Aakanksha and
Rehm, Georg and
Freitag, Dayne and
Li, Dan and
Schimmler, Sonja and
De Waard, Anita",
booktitle = "Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/landing_page/2025.sdp-1.18/",
doi = "10.18653/v1/2025.sdp-1.18",
pages = "182--210",
ISBN = "979-8-89176-265-7",
abstract = "This paper provides an overview of the First Scientific Visual Question Answering (SciVQA) shared task conducted as part of the Fifth Scholarly Document Processing workshop (SDP 2025). SciVQA aims to explore the capabilities of current multimodal large language models (MLLMs) in reasoning over figures from scholarly publications for question answering (QA). The main focus of the challenge is on closed-ended visual and non-visual QA pairs. We developed the novel SciVQA benchmark comprising 3,000 images of figures and a total of 21,000 QA pairs. The shared task received seven submissions, with the best performing system achieving an average F1 score of approx. 0.86 across ROUGE-1, ROUGE-L, and BertScore metrics. Participating teams explored various fine-tuning and prompting strategies, as well as augmenting the SciVQA dataset with out-of-domain data and incorporating relevant context from source publications. The findings indicate that while MLLMs demonstrate strong performance on SciVQA, they face challenges in visual reasoning and still fall behind human judgments."
}