Donghuan Song
2026
Datasets for Scientific Literature Understanding: A Survey
Yuanzhe Zhang | Xun Zhao | Maodi Hu | Xi Sun | Donghuan Song | Zhixiong Zhang
Findings of the Association for Computational Linguistics: ACL 2026
Empowering machines to understand scientific literature is crucial for accelerating scientific discovery and advancing the AI for Science (AI4S) paradigm. In this paper, we present a comprehensive survey of datasets serving this domain. We propose a systematic taxonomy that organizes resources into four categories: structural understanding, text understanding, multimodal understanding, and pre-training/instruction fine-tuning. Beyond a structured overview, we discuss the evolution of the field, elucidating how the emergence of Large Language Models (LLMs) has reshaped research priorities in dataset construction. By synthesizing existing datasets and identifying critical future directions, this work provides a roadmap for advancing intelligent scientific research systems.