A Benchmark Corpus of English Misspellings and a Minimally-supervised Model for Spelling Correction

Michael Flor; Michael Fried; Alla Rozovskaya

doi:10.18653/v1/W19-4407

Abstract

Spelling correction has attracted a lot of attention in the NLP community. However, models have usually been evaluated on artificially-created or proprietary corpora. A publicly-available corpus of authentic misspellings, annotated in context, is still lacking. To address this, we present and release an annotated data set of 6,121 spelling errors in context, based on a corpus of essays written by English language learners. We also develop a minimally-supervised context-aware approach to spelling correction. It achieves strong results on our data: 88.12% accuracy. This approach can also train with a minimal amount of annotated data (performance reduced by less than 1%). Furthermore, this approach allows easy portability to new domains. We evaluate our model on data from a medical domain and demonstrate that it rivals the performance of a model trained and tuned on in-domain data.

Anthology ID: W19-4407
Volume: Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications
Month: August
Year: 2019
Address: Florence, Italy
Editors: Helen Yannakoudakis, Ekaterina Kochmar, Claudia Leacock, Nitin Madnani, Ildikó Pilán, Torsten Zesch
Venue: BEA
SIG: SIGEDU
Publisher: Association for Computational Linguistics
Pages: 76–86
URL: https://preview.aclanthology.org/add-emnlp-2024-awards/W19-4407/
DOI: 10.18653/v1/W19-4407
Cite (ACL): Michael Flor, Michael Fried, and Alla Rozovskaya. 2019. A Benchmark Corpus of English Misspellings and a Minimally-supervised Model for Spelling Correction. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 76–86, Florence, Italy. Association for Computational Linguistics.
Cite (Informal): A Benchmark Corpus of English Misspellings and a Minimally-supervised Model for Spelling Correction (Flor et al., BEA 2019)
PDF: https://preview.aclanthology.org/add-emnlp-2024-awards/W19-4407.pdf
Code: EducationalTestingService/toefl-spell
Data: MIMIC-III
