A Methodology for Generative Spelling Correction via Natural Spelling Errors Emulation across Multiple Domains and Languages

Nikita Martynov, Mark Baushenko, Anastasia Kozlova, Katerina Kolomeytseva, Aleksandr Abramov, Alena Fenogenova


Abstract
Large language models excel in text generation and generalization, however they face challenges in text editing tasks, especially in correcting spelling errors and mistyping.In this paper, we present a methodology for generative spelling correction (SC), tested on English and Russian languages and potentially can be extended to any language with minor changes. Our research mainly focuses on exploring natural spelling errors and mistyping in texts and studying how those errors can be emulated in correct sentences to enrich generative models’ pre-train procedure effectively. We investigate the effects of emulations in various text domains and examine two spelling corruption techniques: 1) first one mimics human behavior when making a mistake through leveraging statistics of errors from a particular dataset, and 2) second adds the most common spelling errors, keyboard miss clicks, and some heuristics within the texts.We conducted experiments employing various corruption strategies, models’ architectures, and sizes in the pre-training and fine-tuning stages and evaluated the models using single-domain and multi-domain test sets. As a practical outcome of our work, we introduce SAGE (Spell checking via Augmentation and Generative distribution Emulation).
Anthology ID:
2024.findings-eacl.10
Volume:
Findings of the Association for Computational Linguistics: EACL 2024
Month:
March
Year:
2024
Address:
St. Julian’s, Malta
Editors:
Yvette Graham, Matthew Purver
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
138–155
Language:
URL:
https://aclanthology.org/2024.findings-eacl.10
DOI:
Bibkey:
Cite (ACL):
Nikita Martynov, Mark Baushenko, Anastasia Kozlova, Katerina Kolomeytseva, Aleksandr Abramov, and Alena Fenogenova. 2024. A Methodology for Generative Spelling Correction via Natural Spelling Errors Emulation across Multiple Domains and Languages. In Findings of the Association for Computational Linguistics: EACL 2024, pages 138–155, St. Julian’s, Malta. Association for Computational Linguistics.
Cite (Informal):
A Methodology for Generative Spelling Correction via Natural Spelling Errors Emulation across Multiple Domains and Languages (Martynov et al., Findings 2024)
Copy Citation:
PDF:
https://preview.aclanthology.org/nschneid-patch-4/2024.findings-eacl.10.pdf
Software:
 2024.findings-eacl.10.software.zip
Note:
 2024.findings-eacl.10.note.zip