Abstract
Synthetic data generation is widely known to boost the accuracy of neural grammatical error correction (GEC) systems, but existing methods often lack diversity or are too simplistic to generate the broad range of grammatical errors made by human writers. In this work, we use error type tags from automatic annotation tools such as ERRANT to guide synthetic data generation. We compare several models that can produce an ungrammatical sentence given a clean sentence and an error type tag. We use these models to build a new, large synthetic pre-training data set with error tag frequency distributions matching a given development set. Our synthetic data set yields large and consistent gains, improving the state-of-the-art on the BEA-19 and CoNLL-14 test sets. We also show that our approach is particularly effective in adapting a GEC system, trained on mixed native and non-native English, to a native English test set, even surpassing real training data consisting of high-quality sentence pairs.
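A minimal sketch (not the authors' code) of the tag-guided idea described in the abstract: error type tags are sampled so that their frequencies match a development set, each clean sentence is paired with a sampled tag, and a tagged corruption model turns the pair into an ungrammatical sentence. The tag counts and the `corrupt` function below are illustrative placeholders, not part of the released resources.

```python
import random
from collections import Counter

# Hypothetical ERRANT tag counts measured on a development set (illustrative only).
dev_tag_counts = Counter({
    "R:PUNCT": 180, "M:DET": 150, "R:PREP": 120, "R:VERB:TENSE": 90,
    "U:DET": 70, "R:SPELL": 60, "M:PUNCT": 50, "R:NOUN:NUM": 40,
})
tags, weights = zip(*dev_tag_counts.items())

def sample_tags(clean_sentences, seed=0):
    """Attach one error type tag per clean sentence, matching the dev-set tag frequencies."""
    rng = random.Random(seed)
    return [(rng.choices(tags, weights=weights, k=1)[0], s) for s in clean_sentences]

def corrupt(tag, sentence):
    """Placeholder for a tagged corruption model: given a clean sentence and a target
    error type, return an ungrammatical version exhibiting that error. In practice this
    would be a trained seq2seq model conditioned on the tag."""
    return f"<CORRUPTED[{tag}]> {sentence}"

if __name__ == "__main__":
    clean = ["She has lived in London for five years.",
             "The results of the experiment were surprising."]
    for tag, sent in sample_tags(clean):
        print(tag, "\t", corrupt(tag, sent))
```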
- Anthology ID:
- 2021.bea-1.4
- Volume:
- Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications
- Month:
- April
- Year:
- 2021
- Address:
- Online
- Editors:
- Jill Burstein, Andrea Horbach, Ekaterina Kochmar, Ronja Laarmann-Quante, Claudia Leacock, Nitin Madnani, Ildikó Pilán, Helen Yannakoudakis, Torsten Zesch
- Venue:
- BEA
- SIG:
- SIGEDU
- Publisher:
- Association for Computational Linguistics
- Pages:
- 37–47
- URL:
- https://aclanthology.org/2021.bea-1.4
- Cite (ACL):
- Felix Stahlberg and Shankar Kumar. 2021. Synthetic Data Generation for Grammatical Error Correction with Tagged Corruption Models. In Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications, pages 37–47, Online. Association for Computational Linguistics.
- Cite (Informal):
- Synthetic Data Generation for Grammatical Error Correction with Tagged Corruption Models (Stahlberg & Kumar, BEA 2021)
- PDF:
- https://preview.aclanthology.org/naacl24-info/2021.bea-1.4.pdf
- Code
- google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction
- Data
- FCE, JFLEG