Small Languages, Big Models: A Study of Continual Training on Languages of Norway

David Samuel, Vladislav Mikhailov, Erik Velldal, Lilja Øvrelid, Lucas Georges Gabriel Charpentier, Andrey Kutuzov, Stephan Oepen


Abstract
Training large language models requires vast amounts of data, posing a challenge for less widely spoken languages like Norwegian, and even more so for truly low-resource languages like Northern Sámi. To address this issue, we present a novel three-stage continual training approach that substantially improves both downstream performance and inference efficiency for the target languages. Based on our findings, we train, evaluate, and openly release a new generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.
Anthology ID:
2025.nodalida-1.61
Volume:
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
Month:
March
Year:
2025
Address:
Tallinn, Estonia
Editors:
Richard Johansson, Sara Stymne
Venue:
NoDaLiDa
Publisher:
University of Tartu Library
Pages:
573–608
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.nodalida-1.61/
Cite (ACL):
David Samuel, Vladislav Mikhailov, Erik Velldal, Lilja Øvrelid, Lucas Georges Gabriel Charpentier, Andrey Kutuzov, and Stephan Oepen. 2025. Small Languages, Big Models: A Study of Continual Training on Languages of Norway. In Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), pages 573–608, Tallinn, Estonia. University of Tartu Library.
Cite (Informal):
Small Languages, Big Models: A Study of Continual Training on Languages of Norway (Samuel et al., NoDaLiDa 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.nodalida-1.61.pdf