Abstract
Norwegian has been one of many languages lacking sufficient available text to train quality language models. In an attempt to bridge this gap, we introduce the Norwegian Colossal Corpus (NCC), which comprises 49GB of clean Norwegian textual data containing over 7B words. The NCC is composed of different and varied sources, ranging from books and newspapers to government documents and public reports, showcasing the various uses of the Norwegian language in society. The corpus contains mainly Norwegian Bokmål and Norwegian Nynorsk. Each document in the corpus is tagged with metadata that enables the creation of sub-corpora for specific needs. Its structure makes it easy to combine with large web archives that for licensing reasons could not be distributed together with the NCC. By releasing this corpus openly to the public, we hope to foster the creation of both better Norwegian language models and multilingual language models with support for Norwegian.

- Anthology ID:
- 2022.lrec-1.410
- Volume:
- Proceedings of the Thirteenth Language Resources and Evaluation Conference
- Month:
- June
- Year:
- 2022
- Address:
- Marseille, France
- Venue:
- LREC
- Publisher:
- European Language Resources Association
- Pages:
- 3852–3860
- URL:
- https://aclanthology.org/2022.lrec-1.410
- Cite (ACL):
- Per Kummervold, Freddy Wetjen, and Javier de la Rosa. 2022. The Norwegian Colossal Corpus: A Text Corpus for Training Large Norwegian Language Models. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 3852–3860, Marseille, France. European Language Resources Association.
- Cite (Informal):
- The Norwegian Colossal Corpus: A Text Corpus for Training Large Norwegian Language Models (Kummervold et al., LREC 2022)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2022.lrec-1.410.pdf
- Data
- mC4