Data Strategies for Low-Resource Grammatical Error Correction

Simon Flachs, Felix Stahlberg, Shankar Kumar


Abstract
Grammatical Error Correction (GEC) is a task that has been extensively investigated for the English language. However, for low-resource languages the best practices for training GEC systems have not yet been systematically determined. We investigate how best to take advantage of existing data sources for improving GEC systems for languages with limited quantities of high-quality training data. We show that methods for generating artificial training data for GEC can benefit from including morphological errors. We also demonstrate that noisy error correction data gathered from Wikipedia revision histories and the language learning website Lang8 are valuable data sources. Finally, we show that GEC systems pre-trained on noisy data sources can be fine-tuned effectively using small amounts of high-quality, human-annotated data.
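To make the idea of artificial training data with morphological errors concrete, the sketch below corrupts clean sentences by swapping inflectional suffixes and pairs each corrupted sentence with its original as a (source, target) training example. It is an illustration only, not the authors' method: the suffix table `SUFFIX_SWAPS`, the function names, and the corruption probability `p` are all hypothetical, and a real system would more plausibly draw inflected forms from a morphological lexicon for the target language.

```python
import random

# Hypothetical English suffix swaps standing in for a real morphological
# lexicon; each (old, new) pair turns one inflected form into another.
SUFFIX_SWAPS = [("ies", "y"), ("ing", "ed"), ("ed", "ing"), ("s", "")]


def morph_error(token):
    """Return the token with its inflectional suffix swapped."""
    for old, new in SUFFIX_SWAPS:
        # Require a stem of at least three characters before the suffix.
        if token.endswith(old) and len(token) > len(old) + 2:
            return token[: -len(old)] + new
    # No known suffix: add a spurious "s" instead.
    return token + "s"


def corrupt(sentence, rng, p=0.3):
    """Replace each whitespace token with a morphological error with probability p."""
    tokens = sentence.split()
    return " ".join(morph_error(t) if rng.random() < p else t for t in tokens)


def make_pairs(sentences, seed=0, p=0.3):
    """Build (corrupted source, clean target) pairs for GEC training."""
    rng = random.Random(seed)  # fixed seed keeps the data reproducible
    return [(corrupt(s, rng, p), s) for s in sentences]


if __name__ == "__main__":
    for source, target in make_pairs(["The dogs walked home"], seed=1):
        print(source, "->", target)
```

The clean sentence serves as the correction target, so any monolingual corpus can be turned into training pairs this way; the abstract's claim is that including this kind of error alongside other synthetic noise benefits the resulting GEC system.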
Anthology ID:
2021.bea-1.12
Volume:
Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications
Month:
April
Year:
2021
Address:
Online
Venue:
BEA
SIG:
SIGEDU
Publisher:
Association for Computational Linguistics
Pages:
117–122
URL:
https://aclanthology.org/2021.bea-1.12
Cite (ACL):
Simon Flachs, Felix Stahlberg, and Shankar Kumar. 2021. Data Strategies for Low-Resource Grammatical Error Correction. In Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications, pages 117–122, Online. Association for Computational Linguistics.
Cite (Informal):
Data Strategies for Low-Resource Grammatical Error Correction (Flachs et al., BEA 2021)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/2021.bea-1.12.pdf