2021
pdf
abs
Data Strategies for Low-Resource Grammatical Error Correction
Simon Flachs
|
Felix Stahlberg
|
Shankar Kumar
Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications
Grammatical Error Correction (GEC) is a task that has been extensively investigated for the English language. However, for low-resource languages the best practices for training GEC systems have not yet been systematically determined. We investigate how best to take advantage of existing data sources for improving GEC systems for languages with limited quantities of high quality training data. We show that methods for generating artificial training data for GEC can benefit from including morphological errors. We also demonstrate that noisy error correction data gathered from Wikipedia revision histories and the language learning website Lang8, are valuable data sources. Finally, we show that GEC systems pre-trained on noisy data sources can be fine-tuned effectively using small amounts of high quality, human-annotated data.
2020
pdf
abs
Grammatical Error Correction in Low Error Density Domains: A New Benchmark and Analyses
Simon Flachs
|
Ophélie Lacroix
|
Helen Yannakoudakis
|
Marek Rei
|
Anders Søgaard
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Evaluation of grammatical error correction (GEC) systems has primarily focused on essays written by non-native learners of English, which however is only part of the full spectrum of GEC applications. We aim to broaden the target domain of GEC and release CWEB, a new benchmark for GEC consisting of website text generated by English speakers of varying levels of proficiency. Website data is a common and important domain that contains far fewer grammatical errors than learner essays, which we show presents a challenge to state-of-the-art GEC systems. We demonstrate that a factor behind this is the inability of systems to rely on a strong internal language model in low error density domains. We hope this work shall facilitate the development of open-domain GEC models that generalize to different topics and genres.
2019
pdf
abs
Noisy Channel for Low Resource Grammatical Error Correction
Simon Flachs
|
Ophélie Lacroix
|
Anders Søgaard
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications
This paper describes our contribution to the low-resource track of the BEA 2019 shared task on Grammatical Error Correction (GEC). Our approach to GEC builds on the theory of the noisy channel by combining a channel model and language model. We generate confusion sets from the Wikipedia edit history and use the frequencies of edits to estimate the channel model. Additionally, we use two pre-trained language models: 1) Google’s BERT model, which we fine-tune for specific error types and 2) OpenAI’s GPT-2 model, utilizing that it can operate with previous sentences as context. Furthermore, we search for the optimal combinations of corrections using beam search.
pdf
abs
A Simple and Robust Approach to Detecting Subject-Verb Agreement Errors
Simon Flachs
|
Ophélie Lacroix
|
Marek Rei
|
Helen Yannakoudakis
|
Anders Søgaard
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
While rule-based detection of subject-verb agreement (SVA) errors is sensitive to syntactic parsing errors and irregularities and exceptions to the main rules, neural sequential labelers have a tendency to overfit their training data. We observe that rule-based error generation is less sensitive to syntactic parsing errors and irregularities than error detection and explore a simple, yet efficient approach to getting the best of both worlds: We train neural sequential labelers on the combination of large volumes of silver standard data, obtained through rule-based error generation, and gold standard data. We show that our simple protocol leads to more robust detection of SVA errors on both in-domain and out-of-domain data, as well as in the context of other errors and long-distance dependencies; and across four standard benchmarks, the induced model on average achieves a new state of the art.
pdf
abs
Historical Text Normalization with Delayed Rewards
Simon Flachs
|
Marcel Bollmann
|
Anders Søgaard
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Training neural sequence-to-sequence models with simple token-level log-likelihood is now a standard approach to historical text normalization, albeit often outperformed by phrase-based models. Policy gradient training enables direct optimization for exact matches, and while the small datasets in historical text normalization are prohibitive of from-scratch reinforcement learning, we show that policy gradient fine-tuning leads to significant improvements across the board. Policy gradient training, in particular, leads to more accurate normalizations for long or unseen words.