eSciBench: An Extensible Scientific PDF Extraction Benchmark

Noah Tremblay Taillon, Phillippe Langlais


Abstract
Automatically extracting information from PDF documents (such as authors, affiliations, references, tables, and equations) could be transformative for the Digital Humanities, where the metadata accompanying a document is typically collected by hand, a cumbersome process. In this work, we systematically benchmark PDF extractors on a set of 100 scientific articles (1,949 pages) from the STEM domain that were processed automatically and then carefully curated. Our benchmark, named eSciBench, is openly accessible. Testing 13 extractors on it reveals that although some perform well overall, extracting information from scientific articles is far from a solved problem.
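
This page does not detail the paper's evaluation protocol, but as a rough illustration of what field-level benchmarking of a PDF extractor can look like, here is a minimal Python sketch that scores an extractor's predicted metadata against a curated gold record using token-overlap F1. The field names, the scoring choice, and the sample records are assumptions made for illustration, not the paper's actual setup.

    # Hypothetical sketch: score one extractor's output against a curated
    # gold metadata record, field by field. Field names, the token-F1
    # scoring choice, and the sample data are illustrative assumptions,
    # not eSciBench's actual protocol.

    def token_f1(pred: str, gold: str) -> float:
        """Token-overlap F1 between a predicted and a gold field value."""
        pred_tokens = pred.lower().split()
        gold_tokens = gold.lower().split()
        if not pred_tokens or not gold_tokens:
            # Both empty counts as a match; one empty counts as a miss.
            return float(pred_tokens == gold_tokens)
        # Multiset intersection: each gold token is consumed at most once.
        remaining = list(gold_tokens)
        common = 0
        for tok in pred_tokens:
            if tok in remaining:
                remaining.remove(tok)
                common += 1
        if common == 0:
            return 0.0
        precision = common / len(pred_tokens)
        recall = common / len(gold_tokens)
        return 2 * precision * recall / (precision + recall)

    def score_extractor(pred: dict, gold: dict) -> dict:
        """Per-field token F1 for every field present in the gold record."""
        return {field: token_f1(pred.get(field, ""), value)
                for field, value in gold.items()}

    if __name__ == "__main__":
        gold = {"title": "eSciBench: An Extensible Scientific PDF Extraction Benchmark",
                "authors": "Noah Tremblay Taillon ; Phillippe Langlais"}
        pred = {"title": "eSciBench: An Extensible Scientific PDF Extraction Benchmark",
                "authors": "Noah Tremblay Taillon"}
        print(score_extractor(pred, gold))  # perfect title, partial authors

Averaging such per-field scores over all 100 articles would give one way to rank the 13 extractors, though the paper may well use a different metric.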
Anthology ID:
2026.lrec-main.600
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resources Association
Pages:
7568–7580
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.600/
Cite (ACL):
Noah Tremblay Taillon and Phillippe Langlais. 2026. eSciBench: An Extensible Scientific PDF Extraction Benchmark. In Proceedings of the Fifteenth Language Resources and Evaluation Conference, pages 7568–7580, Palma de Mallorca, Spain. ELRA Language Resources Association.
Cite (Informal):
eSciBench: An Extensible Scientific PDF Extraction Benchmark (Tremblay Taillon & Langlais, LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.600.pdf