Philippe Genêt

2026

National Library as Corpus: DeLiKo-2025@DNB – a Very Large Corpus of German-language Contemporary Literature
Marc Kupietz | Nils Diewald | Philippe Genêt | Andreas Witt
Proceedings of the Fifteenth Language Resources and Evaluation Conference

This paper introduces DeLiKo-2025@DNB, a very large, linguistically annotated corpus of German-language contemporary literature, freely accessible via https://korap.dnb.de/. The corpus currently comprises 21 billion words from over 287,000 books published between 2005 and the present, spanning pulp and genre fiction as well as literary award-winning works. It covers the entire holdings of EPUB-format fiction ebooks deposited with the German National Library (DNB). We provide a detailed account of the corpus composition, metadata, and key features. Additionally, we explain our strategy for enabling lawful and effective access through the deployment of the open-source corpus analysis platform KorAP at the DNB, and we discuss both the transferability of our approach and work to other national libraries and our ongoing and planned extensions and enhancements.


Text+ is the German distributed research data infrastructure for literary studies, linguistics, and spoken and written language. Its resources consist of contemporary and historical literary and media texts, deeply annotated material, transcripts of spoken and sign language, and original recordings. Text+ provides access to its resources according to the FAIR guidelines: Findable due to standard-conformant metadata, Accessible with single sign-on authentication, Interoperable via open data formats, and Reusable through web services and extensive documentation. The 30+ partners of Text+ are archives, libraries, universities, and other research institutions. The partners are autonomous, and they differ in the amount of data and processing capabilities they provide. In this paper, we describe the hub architecture of Text+, which gives users a central and FAIR point of access to research data that continues to be distributed across the Text+ partner institutions. The architecture serves as a blueprint for evolving research infrastructures that aim at maintaining (and empowering) their research data contributors.

