Datasets for Scientific Literature Understanding: A Survey

Yuanzhe Zhang, Xun Zhao, Maodi Hu, Xi Sun, Donghuan Song, Zhixiong Zhang


Abstract
Empowering machines to understand scientific literature is crucial for accelerating scientific discovery and advancing the AI for Science (AI4S) paradigm. In this paper, we present a comprehensive survey of datasets serving this domain. We propose a systematic taxonomy that organizes resources spanning structural understanding, text understanding, multimodal understanding, and pre-training/instruction fine-tuning. Beyond a structured overview, we discuss the evolution of the field, elucidating how the emergence of Large Language Models (LLMs) has reshaped research priorities in dataset construction. By synthesizing existing datasets and identifying critical future directions, this work provides a roadmap for advancing intelligent scientific research systems.
Anthology ID:
2026.findings-acl.1414
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
28369–28389
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1414/
Cite (ACL):
Yuanzhe Zhang, Xun Zhao, Maodi Hu, Xi Sun, Donghuan Song, and Zhixiong Zhang. 2026. Datasets for Scientific Literature Understanding: A Survey. In Findings of the Association for Computational Linguistics: ACL 2026, pages 28369–28389, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Datasets for Scientific Literature Understanding: A Survey (Zhang et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1414.pdf
Checklist:
2026.findings-acl.1414.checklist.pdf