Evaluating Embedding Models on Danish Historical Newspapers: A Corpus and Benchmark Resource
Alie Lassche, Pascale Feldkamp, Yuri Bizzoni, Katrine Baunvig, Kristoffer Nielbo, Johan Heinsen
Abstract
We present an enriched dataset of almost five million Danish historical newspaper articles from the late seventeenth to the nineteenth century, augmented with semantic embeddings and an annotated subset, to enable semi-automated classification as well as thematic and linguistic exploration. Through three historical benchmark tasks that evaluate the performance of Danish and multilingual embedding models on this historical Danish corpus, we discuss how the choice of an embedding model depends on the type of task, and we enrich our corpus with embeddings from the overall best-performing model. As a showcase experiment, we look at the distribution of article categories in the three subgenres that can be observed in the corpus. This experiment highlights the potential of the corpus and its article-level embeddings for further exploration and analysis of the Danish historical mediascape. The resource is freely available for research use and aims to foster reproducible, data-driven studies of language and culture in the Danish nineteenth century.
- Anthology ID:
- 2026.lrec-main.287
- Volume:
- Proceedings of the Fifteenth Language Resources and Evaluation Conference
- Month:
- May
- Year:
- 2026
- Address:
- Palma de Mallorca, Spain
- Editors:
- Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
- Venue:
- LREC
- Publisher:
- ELRA Language Resource Association
- Pages:
- 3577–3589
- URL:
- https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.287/
- Cite (ACL):
- Alie Lassche, Pascale Feldkamp, Yuri Bizzoni, Katrine Baunvig, Kristoffer Nielbo, and Johan Heinsen. 2026. Evaluating Embedding Models on Danish Historical Newspapers: A Corpus and Benchmark Resource. International Conference on Language Resources and Evaluation, main:3577–3589.
- Cite (Informal):
- Evaluating Embedding Models on Danish Historical Newspapers: A Corpus and Benchmark Resource (Lassche et al., LREC 2026)
- PDF:
- https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.287.pdf