Spelling Correction for Russian: A Comparative Study of Datasets and Methods

Alla Rozovskaya


Abstract
We develop a minimally-supervised model for spelling correction and evaluate its performance on three datasets annotated for spelling errors in Russian. The first corpus is a dataset of Russian social media data that was recently used in a shared task on Russian spelling correction. The other two corpora contain texts produced by learners of Russian as a foreign language. Evaluating on three diverse datasets allows for a cross-corpus comparison. We compare the performance of the minimally-supervised model to two baseline models that do not use context for candidate re-ranking, as well as to a character-level statistical machine translation system with context-based re-ranking. We show that the minimally-supervised model outperforms all of the other models. We also present an analysis of the spelling errors and discuss the difficulty of the task compared to the spelling correction problem in English.
Anthology ID:
2021.ranlp-1.136
Volume:
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Month:
September
Year:
2021
Address:
Held Online
Venue:
RANLP
Publisher:
INCOMA Ltd.
Pages:
1206–1216
URL:
https://aclanthology.org/2021.ranlp-1.136
Cite (ACL):
Alla Rozovskaya. 2021. Spelling Correction for Russian: A Comparative Study of Datasets and Methods. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pages 1206–1216, Held Online. INCOMA Ltd.
Cite (Informal):
Spelling Correction for Russian: A Comparative Study of Datasets and Methods (Rozovskaya, RANLP 2021)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2021.ranlp-1.136.pdf