unarXive 2024: A Large-Scale Scientific Corpus for Citation-Aware Retrieval and Generation

Ines Besrour, Michael Färber


Abstract
Full-text collections of scientific papers are essential for NLP research and the training of language models. However, existing resources remain incomplete: they often lag behind the fast-paced growth of scientific publishing, lack comprehensive citation networks, and discard essential structural elements. In this work, we introduce unarXive 2024, a large-scale, richly structured corpus containing every arXiv submission from January 1991 to December 2024 – over 2.28 million documents across physics, mathematics, computer science, and other fields. Our release enhances each paper with detailed metadata, reconstructs a substantially more complete citation network than existing datasets, and preserves fine-grained structural information, including section boundaries, mathematical notation, and non-textual elements. Beyond the corpus itself, we provide dense and sparse indexes optimized for retrieval-augmented generation (RAG) over the full arXiv archive. All resources, including code and data, are publicly available: https://github.com/faerber-lab/unarXive-2024
Anthology ID:
2026.lrec-main.556
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resource Association
Pages:
6990–6997
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.556/
Cite (ACL):
Ines Besrour and Michael Färber. 2026. unarXive 2024: A Large-Scale Scientific Corpus for Citation-Aware Retrieval and Generation. International Conference on Language Resources and Evaluation, main:6990–6997.
Cite (Informal):
unarXive 2024: A Large-Scale Scientific Corpus for Citation-Aware Retrieval and Generation (Besrour & Färber, LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.556.pdf