Evaluating Embedding Models on Danish Historical Newspapers: A Corpus and Benchmark Resource

Alie Lassche, Pascale Feldkamp, Yuri Bizzoni, Katrine Baunvig, Kristoffer Nielbo, Johan Heinsen


Abstract
We present an enriched dataset of almost five million Danish historical newspaper articles from the late seventeenth to the nineteenth century, augmented with semantic embeddings and an annotated subset, to enable semi-automated classification as well as thematic and linguistic exploration. Through three historical benchmark tasks that evaluate the performance of Danish and multilingual embedding models on this historical Danish corpus, we discuss how the choice of an embedding model depends on the type of task, and enrich our corpus with embeddings from the best-performing model overall. As a showcase experiment, we look at the distribution of article categories in the three subgenres that can be observed in the corpus. This experiment highlights the corpus and article-level embeddings’ potential for further exploration and analysis of the Danish historical mediascape. The resource is freely available for research use and aims to foster reproducible, data-driven studies of language and culture in the Danish nineteenth century.
Anthology ID:
2026.lrec-main.287
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resource Association
Pages:
3577–3589
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.287/
Cite (ACL):
Alie Lassche, Pascale Feldkamp, Yuri Bizzoni, Katrine Baunvig, Kristoffer Nielbo, and Johan Heinsen. 2026. Evaluating Embedding Models on Danish Historical Newspapers: A Corpus and Benchmark Resource. International Conference on Language Resources and Evaluation, main:3577–3589.
Cite (Informal):
Evaluating Embedding Models on Danish Historical Newspapers: A Corpus and Benchmark Resource (Lassche et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.287.pdf