Stefan Dumitrescu
2020
The birth of Romanian BERT
Stefan Dumitrescu
|
Andrei-Marius Avram
|
Sampo Pyysalo
Findings of the Association for Computational Linguistics: EMNLP 2020
Large-scale pretrained language models have become ubiquitous in Natural Language Processing. However, most of these models are available either in high-resource languages, in particular English, or as multilingual models that compromise performance on individual languages for coverage. This paper introduces Romanian BERT, the first purely Romanian transformer-based language model, pretrained on a large text corpus. We discuss corpus com-position and cleaning, the model training process, as well as an extensive evaluation of the model on various Romanian datasets. We opensource not only the model itself, but also a repository that contains information on how to obtain the corpus, fine-tune and use this model in production (with practical examples), and how to fully replicate the evaluation process.
Search