Jens Madsen


2021

The Danish Gigaword Corpus
Leon Strømberg-Derczynski | Manuel Ciosici | Rebekah Baglini | Morten H. Christiansen | Jacob Aarup Dalsgaard | Riccardo Fusaroli | Peter Juel Henrichsen | Rasmus Hvingelby | Andreas Kirkedal | Alex Speed Kjeldsen | Claus Ladefoged | Finn Årup Nielsen | Jens Madsen | Malte Lau Petersen | Jonathan Hvithamar Rystrøm | Daniel Varab
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

Danish language technology has been hindered by a lack of broad-coverage corpora at the scale modern NLP prefers. This paper describes the Danish Gigaword Corpus, the result of a focused effort to provide a diverse and freely available one-billion-word corpus of Danish text. The Danish Gigaword Corpus covers a wide array of time periods, domains, speakers’ socio-economic status, and Danish dialects.