The Norwegian Colossal Corpus: A Text Corpus for Training Large Norwegian Language Models

Per Kummervold, Freddy Wetjen, Javier de la Rosa


Abstract
Norwegian has been one of many languages lacking sufficient available text to train quality language models. In an attempt to bridge this gap, we introduce the Norwegian Colossal Corpus (NCC), which comprises 49GB of clean Norwegian textual data containing over 7B words. The NCC is composed of different and varied sources, ranging from books and newspapers to government documents and public reports, showcasing the various uses of the Norwegian language in society. The corpus contains mainly Norwegian Bokmål and Norwegian Nynorsk. Each document in the corpus is tagged with metadata that enables the creation of sub-corpora for specific needs. Its structure makes it easy to combine with large web archives that for licensing reasons could not be distributed together with the NCC. By releasing this corpus openly to the public, we hope to foster the creation of both better Norwegian language models and multilingual language models with support for Norwegian.
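The per-document metadata mentioned in the abstract is what makes sub-corpus selection possible. As a minimal sketch of that idea, assuming each document is a record with a text field plus a few metadata fields, the following filters an in-memory sample by metadata values. The field names (`doc_type`, `lang`, `year`), the sample records, and the helper `select_subcorpus` are hypothetical and chosen only for illustration; they are not taken from the NCC documentation.

```python
from typing import Any, Dict, Iterable, Iterator

# Hypothetical records: the field names and values are illustrative
# assumptions, not the documented NCC schema.
SAMPLE_DOCS = [
    {"text": "Dette er en avisartikkel.", "doc_type": "newspaper", "lang": "nob", "year": 1998},
    {"text": "Dette er ei offentleg utgreiing.", "doc_type": "government", "lang": "nno", "year": 2015},
    {"text": "Dette er en bok.", "doc_type": "book", "lang": "nob", "year": 1950},
]


def select_subcorpus(docs: Iterable[Dict[str, Any]], **criteria: Any) -> Iterator[Dict[str, Any]]:
    """Yield the documents whose metadata equals every given criterion."""
    for doc in docs:
        # A document missing a requested field never matches.
        if all(key in doc and doc[key] == value for key, value in criteria.items()):
            yield doc


# Build two sub-corpora: all Bokmål documents, and Nynorsk government documents.
bokmaal_docs = list(select_subcorpus(SAMPLE_DOCS, lang="nob"))
nynorsk_government_docs = list(select_subcorpus(SAMPLE_DOCS, lang="nno", doc_type="government"))
```

Because the helper is a generator over any iterable of records, the same filter could in principle be applied while streaming a large file line by line instead of loading it into memory.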
Anthology ID:
2022.lrec-1.410
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
3852–3860
URL:
https://aclanthology.org/2022.lrec-1.410
Cite (ACL):
Per Kummervold, Freddy Wetjen, and Javier de la Rosa. 2022. The Norwegian Colossal Corpus: A Text Corpus for Training Large Norwegian Language Models. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 3852–3860, Marseille, France. European Language Resources Association.
Cite (Informal):
The Norwegian Colossal Corpus: A Text Corpus for Training Large Norwegian Language Models (Kummervold et al., LREC 2022)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2022.lrec-1.410.pdf
Data
mC4