MC-19: A Corpus of 19th Century Icelandic Texts
Steinþór Steingrímsson, Einar Freyr Sigurðsson, Atli Jasonarson
Abstract
We present MC-19, a new Icelandic historical corpus containing texts from the period 1800–1920. We describe approaches for enhancing a corpus of historical texts by preparing the texts so that they can be processed with state-of-the-art NLP tools. We train encoder-decoder models to reduce the number of OCR errors while leaving other orthographic variation intact. We generate a separate modern spelling layer by normalizing the spelling to comply with modern spelling rules, using a statistical modernization ruleset as well as a dictionary of the most common words. This allows the texts to be PoS-tagged and lemmatized using available tools, making the corpus easier to use for researchers and language technologists. The published version of the corpus contains over 270 million tokens.
- Anthology ID:
- 2025.nodalida-1.68
- Volume:
- Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
- Month:
- March
- Year:
- 2025
- Address:
- Tallinn, Estonia
- Editors:
- Richard Johansson, Sara Stymne
- Venue:
- NoDaLiDa
- Publisher:
- University of Tartu Library
- Pages:
- 680–687
- URL:
- https://preview.aclanthology.org/fix-sig-urls/2025.nodalida-1.68/
- Cite (ACL):
- Steinþór Steingrímsson, Einar Freyr Sigurðsson, and Atli Jasonarson. 2025. MC-19: A Corpus of 19th Century Icelandic Texts. In Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), pages 680–687, Tallinn, Estonia. University of Tartu Library.
- Cite (Informal):
- MC-19: A Corpus of 19th Century Icelandic Texts (Steingrímsson et al., NoDaLiDa 2025)
- PDF:
- https://preview.aclanthology.org/fix-sig-urls/2025.nodalida-1.68.pdf