Jens Madsen
2021
The Danish Gigaword Corpus
Leon Strømberg-Derczynski | Manuel Ciosici | Rebekah Baglini | Morten H. Christiansen | Jacob Aarup Dalsgaard | Riccardo Fusaroli | Peter Juel Henrichsen | Rasmus Hvingelby | Andreas Kirkedal | Alex Speed Kjeldsen | Claus Ladefoged | Finn Årup Nielsen | Jens Madsen | Malte Lau Petersen | Jonathan Hvithamar Rystrøm | Daniel Varab
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)
Danish language technology has been hindered by a lack of broad-coverage corpora at the scale modern NLP prefers. This paper describes the Danish Gigaword Corpus, the result of a focused effort to provide a diverse and freely available one-billion-word corpus of Danish text. The Danish Gigaword Corpus covers a wide array of time periods, domains, speakers’ socio-economic status, and Danish dialects.