READoc: A Unified Benchmark for Realistic Document Structured Extraction

Zichao Li; Aizier Abulaiti; Yaojie Lu; Xuanang Chen; Jia Zheng; Hongyu Lin; Xianpei Han; Shanshan Jiang; Bin Dong; Le Sun

doi:10.18653/v1/2025.findings-acl.1128

Abstract

Document Structured Extraction (DSE) aims to extract structured content from raw documents. Despite the emergence of numerous DSE systems, their unified evaluation remains inadequate, significantly hindering the field’s advancement. This problem is largely attributed to existing benchmark paradigms, which exhibit fragmented and localized characteristics. To offer a thorough evaluation of DSE systems, we introduce a novel benchmark named READoc, which defines DSE as a realistic task of converting unstructured PDFs into semantically rich Markdown. The READoc dataset is derived from 3,576 diverse and real-world documents from arXiv, GitHub, and Zenodo. In addition, we develop a DSE Evaluation S³uite comprising Standardization, Segmentation and Scoring modules, to conduct a unified evaluation of state-of-the-art DSE approaches. By evaluating a range of pipeline tools, expert visual models, and general Vision-Language Models, we identify the gap between current work and the unified, realistic DSE objective for the first time. We hope that READoc will catalyze future research in DSE, fostering more comprehensive and practical solutions.

Anthology ID: 2025.findings-acl.1128
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 21889–21905
URL: https://preview.aclanthology.org/mtsummit-25-ingestion/2025.findings-acl.1128/
DOI: 10.18653/v1/2025.findings-acl.1128
Cite (ACL): Zichao Li, Aizier Abulaiti, Yaojie Lu, Xuanang Chen, Jia Zheng, Hongyu Lin, Xianpei Han, Shanshan Jiang, Bin Dong, and Le Sun. 2025. READoc: A Unified Benchmark for Realistic Document Structured Extraction. In Findings of the Association for Computational Linguistics: ACL 2025, pages 21889–21905, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): READoc: A Unified Benchmark for Realistic Document Structured Extraction (Li et al., Findings 2025)
PDF: https://preview.aclanthology.org/mtsummit-25-ingestion/2025.findings-acl.1128.pdf
