Regina Nasyrova


2025

We introduce a modified version of the sequence tagging architecture proposed in (Omelianchuk et al., 2020) for Grammatical Error Correction of the Russian language. We propose a language-specific operation set and preprocessing algorithm, as well as a classification scheme that makes distinct predictions for insertions and all other operations. The best versions of our models outperform previous approaches and set a new SOTA on two Russian GEC benchmarks, RU-Lang8 and GERA, while achieving competitive performance on RULEC-GEC.
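The abstract does not spell out the classification scheme; as a minimal sketch, one way to realize distinct predictions for insertions versus other operations is a shared encoder with two independent per-token heads. The names below (DualHeadTagger, insert_head, edit_head) are illustrative assumptions, not the paper's code:

```python
import torch.nn as nn


class DualHeadTagger(nn.Module):
    """Hypothetical sketch: a GECTOR-style tagger whose per-token logits
    for insertions are computed separately from the logits for all other
    edit operations (keep, delete, replace, transform)."""

    def __init__(self, encoder, num_insert_labels, num_edit_labels):
        super().__init__()
        self.encoder = encoder  # any pretrained transformer encoder
        hidden = encoder.config.hidden_size
        self.insert_head = nn.Linear(hidden, num_insert_labels)
        self.edit_head = nn.Linear(hidden, num_edit_labels)

    def forward(self, input_ids, attention_mask):
        states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Two independent per-token distributions: an insertion after a
        # token no longer competes for probability mass with an edit of
        # the token itself.
        return self.insert_head(states), self.edit_head(states)
```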
We release LORuGEC, the first rule-annotated corpus for Russian Grammatical Error Correction. The corpus is designed for diagnostic purposes and contains 348 validation and 612 test sentences specifically selected to represent complex rules of Russian writing, which sets it apart from other Russian GEC corpora. We apply several large language models and approaches to our corpus; the best F0.5 score of 83% is achieved by 5-shot learning with the YandexGPT-5 Pro model. To push the boundaries of few-shot learning further, we are the first to apply a GECTOR-like encoder model to similar-example retrieval. GECTOR-based example selection significantly boosts few-shot performance, not only on LORuGEC but on other Russian GEC corpora as well. On LORuGEC, the GECTOR-based retriever can be further improved via contrastive tuning on the task of rule label prediction. All of these results hold for a broad class of large language models.
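The retrieval mechanics are not detailed in the abstract; a plausible minimal sketch is nearest-neighbour selection over sentence embeddings produced by the GECTOR-like encoder. The function name and the cosine-similarity choice are our assumptions; k=5 mirrors the 5-shot setting reported above:

```python
import numpy as np


def retrieve_examples(query_vec, pool_vecs, pool_examples, k=5):
    """Return the k pool examples whose embeddings are most similar
    (by cosine) to the query sentence's embedding.

    query_vec:     (d,) embedding of the sentence to correct
    pool_vecs:     (n, d) embeddings of annotated candidate examples
    pool_examples: list of n (source, correction) pairs
    """
    q = query_vec / np.linalg.norm(query_vec)
    p = pool_vecs / np.linalg.norm(pool_vecs, axis=1, keepdims=True)
    sims = p @ q                 # cosine similarity of each pool item
    top = np.argsort(-sims)[:k]  # indices of the k nearest examples
    return [pool_examples[i] for i in top]
```

Under this reading, the retrieved pairs are placed into the few-shot prompt, and contrastive tuning on rule label prediction would reshape the same embedding space so that sentences governed by the same rule land closer together.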