Filling the Temporal Void: Recovering Missing Publication Years in the Project Gutenberg Corpus Using LLMs

Omar Momen, Manuel Schaaf, Alexander Mehler


Abstract
Analysing texts spanning long periods of time is critical for researchers in historical linguistics and related disciplines. However, publicly available corpora suitable for such analyses are scarce. The Project Gutenberg (PG) corpus presents a significant opportunity in this context, yet it remains underutilized because it lacks accurate temporal metadata. We take advantage of language models and information retrieval to explore four sources of information (the Open Web, Wikipedia, the Open Library API, and the texts of PG books) to add the missing temporal metadata to the PG corpus. Through 20 experiments employing state-of-the-art Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) methods, we estimate the production years of all PG books. We curate an enriched metadata repository for the PG corpus and propose a refined version of it, which includes 53,774 books with a total of 3.8 billion tokens in 11 languages, produced between 1600 and 2000. This work provides a new resource for computational linguistics and humanities studies focusing on diachronic analyses. The final dataset and all experiment data are publicly available (https://github.com/OmarMomen14/pg-dates).
Anthology ID:
2025.findings-acl.890
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
17318–17334
URL:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.890/
Cite (ACL):
Omar Momen, Manuel Schaaf, and Alexander Mehler. 2025. Filling the Temporal Void: Recovering Missing Publication Years in the Project Gutenberg Corpus Using LLMs. In Findings of the Association for Computational Linguistics: ACL 2025, pages 17318–17334, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Filling the Temporal Void: Recovering Missing Publication Years in the Project Gutenberg Corpus Using LLMs (Momen et al., Findings 2025)
PDF:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.890.pdf