Ivan Smal
2025
Rubic2: Ensemble Model for Russian Lemmatization
Ilia Afanasev | Anna Glazkova | Olga Lyashevskaya | Dmitry Morozov | Ivan Smal | Natalia Vlasova
Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025)
Pre-trained language models have significantly advanced natural language processing (NLP), particularly the analysis of languages with complex morphological structures. This study addresses lemmatization for Russian, where errors can critically affect the performance of information retrieval, question answering, and other tasks. We present the results of experiments on generative lemmatization using pre-trained language models. Our findings demonstrate that combining generative models with existing solutions achieves performance that surpasses current results for the lemmatization of Russian. This paper also introduces Rubic2, a new ensemble approach that combines a generative BART-base model, fine-tuned on a manually annotated dataset of 2.1 million tokens, with Rubic, the neural model currently used for morphological annotation and lemmatization in the Russian National Corpus. Extensive experiments show that Rubic2 outperforms current solutions for the lemmatization of Russian across various text domains, contributing to advancements in NLP applications.