Abstract
Spelling correction has attracted a lot of attention in the NLP community. However, models have been usually evaluated on artificially-created or proprietary corpora. A publicly-available corpus of authentic misspellings, annotated in context, is still lacking. To address this, we present and release an annotated data set of 6,121 spelling errors in context, based on a corpus of essays written by English language learners. We also develop a minimally-supervised context-aware approach to spelling correction. It achieves strong results on our data: 88.12% accuracy. This approach can also train with a minimal amount of annotated data (performance reduced by less than 1%). Furthermore, this approach allows easy portability to new domains. We evaluate our model on data from a medical domain and demonstrate that it rivals the performance of a model trained and tuned on in-domain data.
- Anthology ID:
- W19-4407
- Volume:
- Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications
- Month:
- August
- Year:
- 2019
- Address:
- Florence, Italy
- Editors:
- Helen Yannakoudakis, Ekaterina Kochmar, Claudia Leacock, Nitin Madnani, Ildikó Pilán, Torsten Zesch
- Venue:
- BEA
- SIG:
- SIGEDU
- Publisher:
- Association for Computational Linguistics
- Pages:
- 76–86
- URL:
- https://aclanthology.org/W19-4407
- DOI:
- 10.18653/v1/W19-4407
- Cite (ACL):
- Michael Flor, Michael Fried, and Alla Rozovskaya. 2019. A Benchmark Corpus of English Misspellings and a Minimally-supervised Model for Spelling Correction. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 76–86, Florence, Italy. Association for Computational Linguistics.
- Cite (Informal):
- A Benchmark Corpus of English Misspellings and a Minimally-supervised Model for Spelling Correction (Flor et al., BEA 2019)
- PDF:
- https://preview.aclanthology.org/naacl24-info/W19-4407.pdf
- Code
- EducationalTestingService/toefl-spell
- Data
- MIMIC-III