Datasets for Scientific Literature Understanding: A Survey
Yuanzhe Zhang, Xun Zhao, Maodi Hu, Xi Sun, Donghuan Song, Zhixiong Zhang
Abstract
Empowering machines to understand scientific literature is crucial for accelerating scientific discovery and advancing the AI for Science (AI4S) paradigm. In this paper, we present a comprehensive survey of datasets serving this domain. We propose a systematic taxonomy that organizes resources spanning structural understanding, text understanding, multimodal understanding and pre-training/instruction fine-tuning. Beyond a structured overview, we discuss the evolution of the field, elucidating how the emergence of Large Language Models (LLMs) has reshaped research priorities of dataset construction. By synthesizing existing datasets and identifying critical future directions, this work provides a roadmap for advancing intelligent scientific research systems.
- Anthology ID:
- 2026.findings-acl.1414
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 28369–28389
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1414/
- Cite (ACL):
- Yuanzhe Zhang, Xun Zhao, Maodi Hu, Xi Sun, Donghuan Song, and Zhixiong Zhang. 2026. Datasets for Scientific Literature Understanding: A Survey. In Findings of the Association for Computational Linguistics: ACL 2026, pages 28369–28389, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Datasets for Scientific Literature Understanding: A Survey (Zhang et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1414.pdf