Olga Lyashevskaya

2025

pdf bib abs
BERT-like Models for Slavic Morpheme Segmentation
Dmitry Morozov | Lizaveta Astapenka | Anna Glazkova | Timur Garipov | Olga Lyashevskaya
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Automatic morpheme segmentation algorithms are applicable in various tasks, such as building tokenizers and language education. For Slavic languages, the development of such algorithms is complicated by the rich derivational capabilities of these languages. Previous research has shown that, on average, these algorithms have already reached expert-level quality. However, a key unresolved issue is the significant decline in performance when segmenting words containing roots not present in the training data. This problem can be partially addressed by using pre-trained language models to better account for word semantics. In this work, we explored the possibility of fine-tuning BERT-like models for morpheme segmentation using data from Belarusian, Czech, and Russian. We found that for Czech and Russian, our models outperform all previously proposed approaches, achieving word-level accuracy of 92.5-95.1%. For Belarusian, this task was addressed for the first time. The best-performing approach for Belarusian was an ensemble of convolutional neural networks with word-level accuracy of 90.45%.

Pre-trained language models have significantly advanced natural language processing (NLP), particularly in analyzing languages with complex morphological structures. This study addresses lemmatization for the Russian language, the errors in which can critically affect the performance of information retrieval, question answering, and other tasks. We present the results of experiments on generative lemmatization using pre-trained language models. Our findings demonstrate that combining generative models with the existing solutions allows achieving performance that surpasses current results for the lemmatization of Russian. This paper also introduces Rubic2, a new ensemble approach that combines the generative BART-base model, fine-tuned on a manually annotated data set of 2.1 million tokens, with the neural model called Rubic which is currently used for morphological annotation and lemmatization in the Russian National Corpus. Extensive experiments show that Rubic2 outperforms current solutions for the lemmatization of Russian, offering superior results across various text domains and contributing to advancements in NLP applications.

pdf bib abs
The Application of Corpus-Based Language Distance Measurement to the Diatopic Variation Study (on the Material of the Old Novgorodian Birchbark Letters)
Ilia Afanasev | Olga Lyashevskaya
Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)

The paper presents a computer-assisted exploration of a set of texts, where qualitative analysis complements the linguistically-aware vector-based language distance measurements, interpreting them through close reading and thus proving or disproving their conclusions. It proposes using a method designed for small raw corpora to explore the individual, chronological, and gender-based differences within an extinct single territorial lect, known only by a scarce collection of documents. The material under consideration is the Novgorodian birchbark letters, a set of rather small manuscripts (not a single one is more than 1000 tokens) that are witnesses of the Old Novgorodian lect, spoken on the territories of modern Novgorod and Staraya Russa at the first half of the first millennium CE. The study shows the existence of chronological variation, a mild degree of individual variation, and almost absent gender-based differences. Possible prospects of the study include its application to the newly discovered birchbark letters and using an outgroup for more precise measurements.

2023

pdf bib abs
From web to dialects: how to enhance non-standard Russian lects lemmatisation?
Ilia Afanasev | Olga Lyashevskaya
Proceedings of the 2023 CLASP Conference on Learning with Small Data (LSD)

The growing need for using small data distinguished by a set of distributional properties becomes all the more apparent in the era of large language models (LLM). In this paper, we show that for the lemmatisation of the web as corpora texts, heterogeneous social media texts, and dialect texts, the morphological tagging by a model trained on a small dataset with specific properties generally works better than the morphological tagging by a model trained on a large dataset. The material we use is Russian non-standard texts and interviews with dialect speakers. The sequence-to-sequence lemmatisation with the help of taggers trained on smaller linguistically aware datasets achieves the average results of 85 to 90 per cent. These results are consistently (but not always), by 1-2 per cent. higher than the results of lemmatisation with the help of the large-dataset-trained taggers. We analyse these results and outline the possible further research directions.

2022

pdf bib abs
Constructing a Lexical Resource of Russian Derivational Morphology
Lukáš Kyjánek | Olga Lyashevskaya | Anna Nedoluzhko | Daniil Vodolazsky | Zdeněk Žabokrtský
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Words of any language are to some extent related thought the ways they are formed. For instance, the verb ‘exempl-ify’ and the noun ‘example-s’ are both based on the word ‘example’, but the verb is derived from it, while the noun is inflected. In Natural Language Processing of Russian, the inflection is satisfactorily processed; however, there are only a few machine-trackable resources that capture derivations even though Russian has both of these morphological processes very rich. Therefore, we devote this paper to improving one of the methods of constructing such resources and to the application of the method to a Russian lexicon, which results in the creation of the largest lexical resource of Russian derivational relations. The resulting database dubbed DeriNet.RU includes more than 300 thousand lexemes connected with more than 164 thousand binary derivational relations. To create such data, we combined the existing machine-learning methods that we improved to manage this goal. The whole approach is evaluated on our newly created data set of manual, parallel annotation. The resulting DeriNet.RU is freely available under an open license agreement.

2017

pdf bib
REALEC learner treebank: annotation principles and evaluation of automatic parsing
Olga Lyashevskaya | Irina Panteleeva
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories

2013

pdf bib
Learning Computational Linguistics through NLP Evaluation Events: the experience of Russian evaluation initiative
Anastasia Bonch-Osmolovskaya | Svetlana Toldova | Olga Lyashevskaya
Proceedings of the Fourth Workshop on Teaching NLP and CL

2012

2010

pdf bib abs
Bank of Russian Constructions and Valencies
Olga Lyashevskaya
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The Bank of Russian Constructions and Valencies (Russian FrameBank) is an annotation project that takes as input samples from the Russian National Corpus (http://www.ruscorpora.ru). Since Russian verbs and predicates from other POS classes have their particular and not always predictable case pattern, these words and their argument structures are to be described as lexical constructions. The slots of partially filled phrasal constructions (e.g. vzjal i uexal he suddenly (lit. took and) went away) are also under analysis. Thus, the notion of construction is understood in the sense of Fillmores Construction Grammar and is not limited to that of argument structure of verbs. FrameBank brings together the dictionary of constructions and the annotated collection of examples. Our goal is to mark the set of arguments and adjuncts of a certain construction. The main focus is on realization of the elements in the running text, to facilitate searches through pattern realizations by a certain combination of features. The relevant dataset involves lexical, POS and other morphosyntactic tags, semantic classes, as well as grammatical constructions that introduce or license the use of elements within a given construction.