Rubic2: Ensemble Model for Russian Lemmatization
Ilia Afanasev, Anna Glazkova, Olga Lyashevskaya, Dmitry Morozov, Ivan Smal, Natalia Vlasova
Abstract
Pre-trained language models have significantly advanced natural language processing (NLP), particularly for languages with complex morphological structures. This study addresses lemmatization for Russian, where errors can critically affect the performance of information retrieval, question answering, and other tasks. We present the results of experiments on generative lemmatization using pre-trained language models. Our findings demonstrate that combining generative models with existing solutions achieves performance that surpasses current results for Russian lemmatization. This paper also introduces Rubic2, a new ensemble approach that combines the generative BART-base model, fine-tuned on a manually annotated dataset of 2.1 million tokens, with Rubic, the neural model currently used for morphological annotation and lemmatization in the Russian National Corpus. Extensive experiments show that Rubic2 outperforms current solutions for Russian lemmatization, offering superior results across various text domains and contributing to advances in NLP applications.
- Anthology ID: 2025.bsnlp-1.18
- Volume: Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025)
- Month: July
- Year: 2025
- Address: Vienna, Austria
- Editors: Jakub Piskorski, Pavel Přibáň, Preslav Nakov, Roman Yangarber, Michal Marcinczuk
- Venues: BSNLP | WS
- Publisher: Association for Computational Linguistics
- Pages: 157–170
- URL: https://preview.aclanthology.org/acl25-workshop-ingestion/2025.bsnlp-1.18/
- Cite (ACL): Ilia Afanasev, Anna Glazkova, Olga Lyashevskaya, Dmitry Morozov, Ivan Smal, and Natalia Vlasova. 2025. Rubic2: Ensemble Model for Russian Lemmatization. In Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025), pages 157–170, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal): Rubic2: Ensemble Model for Russian Lemmatization (Afanasev et al., BSNLP 2025)
- PDF: https://preview.aclanthology.org/acl25-workshop-ingestion/2025.bsnlp-1.18.pdf