Anna Smirnova
2026
What Aggregate Scores Hide: Per-Rule Evaluation of Russian Grammatical Error Correction
Anna Smirnova | Artyom Kopan | Vladislav Makeev | George Chernishev
Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)
Russian grammar correction models can improve on aggregate benchmarks while getting worse at specific grammar rules. We show this through per-rule evaluation on a diagnostic benchmark of 48 prescriptive rules: finetuning on synthetic data improves overall F0.5 while driving subordinate-clause comma accuracy from 14% to 1%. The suppression is invisible under corpus-level metrics and undetectable with existing coarse, corpus-specific tagsets; it is recoverable only when diagnosed at rule granularity. To enable this analysis, we develop a 98-category error taxonomy grounded in Rozental’s reference grammar and SyntErr, an open-source synthetic data generator whose per-rule distribution is an explicit parameter, designed to support arbitrary rule sets and languages. Finetuning eight open models (0.8B–12B) on 39K synthetic examples yields up to 75.3 F0.5, approaching frontier API models with models small enough to run on device. We release the taxonomy, generator, per-rule evaluation data, and all training artifacts.
2022
Query Processing and Optimization for a Custom Retrieval Language
Yakov Kuzin | Anna Smirnova | Evgeniy Slobodkin | George Chernishev
Proceedings of the First Workshop on Pattern-based Approaches to NLP in the Age of Deep Learning
Data annotation has been a pressing issue ever since the rise of machine learning and associated areas. It is well-known that obtaining high-quality annotated data incurs high costs, be they financial or time-related. In our previous work, we have proposed a custom, SQL-like retrieval language used to query collections of short documents, such as chat transcripts or tweets. Its main purpose is enabling a human annotator to select “situations” from such collections, i.e. subsets of documents that are related both thematically and temporally. This language, named Matcher, was prototyped in our custom annotation tool. Entering the next stage of development of the tool, we have tested the prototype implementation. Given the language’s rich semantics, many possible execution options with various costs arise. We have found out we could provide tangible improvement in terms of speed and memory consumption by carefully selecting the execution strategy in each particular case. In this work, we present the improved algorithms and proposed optimization methods, as well as a benchmark suite whose results show the significance of the presented techniques. While this is an initial work and not a full-fledged optimization framework, it nevertheless yields good results, providing up to tenfold improvement.
2021
Situation-Based Multiparticipant Chat Summarization: a Concept, an Exploration-Annotation Tool and an Example Collection
Anna Smirnova | Evgeniy Slobodkin | George Chernishev
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop
Currently, text chatting is one of the primary means of communication. However, modern text chat generally still offers neither navigation nor full-featured search, although the high volume of messages demands it. To mitigate these inconveniences, we formulate the problem of situation-based summarization and propose a special data annotation tool intended for developing training and gold-standard data. A situation is a subset of messages revolving around a single event in both temporal and contextual senses: e.g., a group of friends arranging a meeting in chat, agreeing on date, time, and place. Situations can be extracted via information retrieval, natural language processing, and machine learning techniques. Since the task is novel, neither training nor gold-standard datasets for it have been created yet. In this paper, we present the formulation of the situation-based summarization problem. Next, we describe Chat Corpora Annotator (CCA): the first annotation system designed specifically for exploring and annotating chat log data. We also introduce a custom query language for semi-automatic situation extraction. Finally, we present the first gold-standard dataset for situation-based summarization. The software source code and the dataset are publicly available.