Xi Sun
2026
Datasets for Scientific Literature Understanding: A Survey
Yuanzhe Zhang | Xun Zhao | Maodi Hu | Xi Sun | Donghuan Song | Zhixiong Zhang
Findings of the Association for Computational Linguistics: ACL 2026
Empowering machines to understand scientific literature is crucial for accelerating scientific discovery and advancing the AI for Science (AI4S) paradigm. In this paper, we present a comprehensive survey of datasets serving this domain. We propose a systematic taxonomy that organizes resources spanning structural understanding, text understanding, multimodal understanding and pre-training/instruction fine-tuning. Beyond a structured overview, we discuss the evolution of the field, elucidating how the emergence of Large Language Models (LLMs) has reshaped research priorities of dataset construction. By synthesizing existing datasets and identifying critical future directions, this work provides a roadmap for advancing intelligent scientific research systems.
SudokuFill: A Multi-Agent Progressive Filling Framework for Document-Level Scientific Information Extraction
Yang Li | Yajiao Wang | Yu Zhang | Yuanzhe Zhang | Maodi Hu | Mengting Zhang | Xi Sun | Hua Yue | Zhixiong Zhang
Findings of the Association for Computational Linguistics: ACL 2026
Scientific information extraction (SciIE) is a key bottleneck for turning unstructured papers into computable knowledge bases, yet most existing systems still follow a “local extraction then global assembly” paradigm. This workflow is inherently lossy: by extracting fields in isolation, it breaks global correlations and discards high-confidence signals that could otherwise be reused as internal supervision, forcing systems to repeatedly restart from scratch, especially in long, multimodal scientific documents. In this paper, we propose a different view: SciIE should be solved as a progressive filling problem, similar to solving a Sudoku. Once a field is filled with high confidence, it should act as a constraint that guides the remaining uncertain fields. Based on this idea, we introduce SudokuFill, a multi-agent framework that maintains a Global Filling State and performs priority scheduling to establish reliable anchors first, then reuses them as internal supervision for iterative deliberation over harder fields. Evaluated on a specialized document-level adjuvant dataset, our framework achieves a state-of-the-art (SOTA) score of 51.83%. Crucially, SudokuFill enables a 7B model to outperform the vanilla GPT-4o, showing that structured architectural reasoning can effectively compensate for parameter scale.
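The progressive filling idea in the abstract above can be illustrated with a minimal sketch. This is not the SudokuFill implementation: the names `FillingState`, `progressive_fill`, and `toy_extract`, the confidence threshold of 0.8, and the single-function "extractor" standing in for the agents are all assumptions made for illustration, and the field names and values are made up.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical extractor signature: given a field name and the fields that
# are already filled (the anchors), return a (value, confidence) guess.
Extractor = Callable[[str, Dict[str, str]], Tuple[str, float]]


@dataclass
class FillingState:
    """Toy stand-in for a global filling state (an assumption, not the paper's)."""
    filled: Dict[str, str] = field(default_factory=dict)
    pending: List[str] = field(default_factory=list)


def progressive_fill(fields: List[str], extract: Extractor,
                     threshold: float = 0.8) -> FillingState:
    """Fill the most confident field first, then re-score the rest with it as an anchor."""
    state = FillingState(pending=list(fields))
    while state.pending:
        # Re-score every pending field against the current anchors.
        guesses = {name: extract(name, state.filled) for name in state.pending}
        # Priority scheduling: pick the pending field with the highest confidence.
        best = max(state.pending, key=lambda name: guesses[name][1])
        value, confidence = guesses[best]
        if confidence < threshold:
            break  # nothing reliable is left; harder fields stay pending
        state.filled[best] = value
        state.pending.remove(best)
    return state


def toy_extract(name: str, anchors: Dict[str, str]) -> Tuple[str, float]:
    """Made-up extractor: field 'b' only becomes confident once 'a' is anchored."""
    if name == "a":
        return ("A", 0.9)
    if name == "b":
        return ("B", 0.85 if "a" in anchors else 0.4)
    return ("?", 0.1)


state = progressive_fill(["b", "a", "c"], toy_extract)
print(state.filled, state.pending)
```

In this toy run, field `a` is filled first because it is the most confident, which in turn raises the confidence of `b`; field `c` never clears the threshold and stays pending, where a fuller system would hand it to further deliberation.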