Artyom Kopan
2026
What Aggregate Scores Hide: Per-Rule Evaluation of Russian Grammatical Error Correction
Anna Smirnova | Artyom Kopan | Vladislav Makeev | George Chernishev
Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)
Russian grammar correction models can improve on aggregate benchmarks while getting worse at specific grammar rules. We show this through per-rule evaluation on a diagnostic benchmark of 48 prescriptive rules: finetuning on synthetic data improves overall F0.5 while driving subordinate-clause comma accuracy from 14% to 1%. The suppression is invisible under corpus-level metrics and undetectable with existing coarse, corpus-specific tagsets; it is recoverable only when diagnosed at rule granularity. To enable this analysis, we develop a 98-category error taxonomy grounded in Rozental’s reference grammar and SyntErr, an open-source synthetic data generator whose per-rule distribution is an explicit parameter, designed to support arbitrary rule sets and languages. Finetuning eight open models (0.8B–12B) on 39K synthetic examples yields up to 75.3 F0.5, approaching frontier API models with models small enough to run on device. We release the taxonomy, generator, per-rule evaluation data, and all training artifacts.
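The effect described in the abstract, an aggregate F0.5 that rises while one rule collapses, can be illustrated with a minimal sketch. It assumes edits are scored as true positives, false positives, and false negatives per rule. The rule names, the helper functions `f_beta` and `score`, and all counts below are hypothetical and invented for illustration; they are not taken from the paper or its released data.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F-beta from edit counts; beta=0.5 weights precision above recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def score(counts):
    """counts maps rule -> (tp, fp, fn).

    Returns the pooled (corpus-level) F0.5 and a dict of per-rule F0.5.
    """
    pooled = [sum(c[i] for c in counts.values()) for i in range(3)]
    per_rule = {rule: f_beta(*c) for rule, c in counts.items()}
    return f_beta(*pooled), per_rule


# Hypothetical counts (tp, fp, fn), invented for this example only.
baseline = {
    "spelling": (50, 30, 50),
    "subordinate_comma": (12, 8, 88),
}
finetuned = {
    "spelling": (80, 10, 20),
    "subordinate_comma": (2, 0, 98),
}

agg_base, rule_base = score(baseline)
agg_ft, rule_ft = score(finetuned)

print(f"aggregate F0.5: {agg_base:.3f} -> {agg_ft:.3f}")
for rule in baseline:
    print(f"{rule}: {rule_base[rule]:.3f} -> {rule_ft[rule]:.3f}")
```

With these toy counts the pooled score goes up because the frequent, well-handled rule dominates the totals, while the per-rule breakdown shows the comma rule getting worse. Only the second view exposes the regression.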