Abstract
In this paper, we present a new method for training a writing improvement model adapted to the writer’s first language (L1) that goes beyond grammatical error correction (GEC). Without using annotated training data, we rely solely on pre-trained language models fine-tuned with parallel corpora of reference translations aligned with machine translations. We evaluate our model with corpora of academic papers written in English by L1 Portuguese and L1 Spanish scholars and a reference corpus of expert academic English. We show that our model is able to address specific L1-influenced writing and more complex linguistic phenomena than existing methods, outperforming what a state-of-the-art GEC system can achieve in this regard. Our code and data are open to other researchers.

- Anthology ID: 2021.findings-emnlp.216
- Volume: Findings of the Association for Computational Linguistics: EMNLP 2021
- Month: November
- Year: 2021
- Address: Punta Cana, Dominican Republic
- Editors: Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
- Venue: Findings
- SIG: SIGDAT
- Publisher: Association for Computational Linguistics
- Pages: 2534–2540
- URL: https://aclanthology.org/2021.findings-emnlp.216
- DOI: 10.18653/v1/2021.findings-emnlp.216
- Cite (ACL): Gustavo Zomer and Ana Frankenberg-Garcia. 2021. Beyond Grammatical Error Correction: Improving L1-influenced research writing in English using pre-trained encoder-decoder models. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2534–2540, Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Cite (Informal): Beyond Grammatical Error Correction: Improving L1-influenced research writing in English using pre-trained encoder-decoder models (Zomer & Frankenberg-Garcia, Findings 2021)
- PDF: https://preview.aclanthology.org/nschneid-patch-3/2021.findings-emnlp.216.pdf
- Code: gzomer/beyondgec
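
The abstract describes fine-tuning on parallel corpora in which machine translations are aligned with reference translations, with no annotated error data. The following is a minimal sketch of how such aligned sentences might be turned into (source, target) training pairs. It is not the authors' code: the function name `build_training_pairs`, the `drop_identical` option, and the choice of machine translation as source and reference translation as target are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def build_training_pairs(
    machine_translations: Sequence[str],
    reference_translations: Sequence[str],
    drop_identical: bool = False,
) -> List[Tuple[str, str]]:
    """Pair each machine translation (assumed source) with its aligned
    reference translation (assumed target).

    Hypothetical helper: the pairing direction and the filtering rules
    are illustrative assumptions, not taken from the paper.
    """
    if len(machine_translations) != len(reference_translations):
        raise ValueError("corpora must be aligned sentence by sentence")

    pairs: List[Tuple[str, str]] = []
    for mt, ref in zip(machine_translations, reference_translations):
        # Collapse runs of whitespace so trivial spacing differences
        # do not count as edits.
        source = " ".join(mt.split())
        target = " ".join(ref.split())
        # Skip alignments where either side is empty.
        if not source or not target:
            continue
        # Optionally skip pairs that carry no rewriting signal.
        if drop_identical and source == target:
            continue
        pairs.append((source, target))
    return pairs


if __name__ == "__main__":
    # Invented example sentences, for illustration only.
    mt = ["We  realized an experiment.", ""]
    ref = ["We conducted an experiment.", "Unaligned sentence."]
    print(build_training_pairs(mt, ref))
```

Pairs built this way could then be passed to any sequence-to-sequence fine-tuning routine. Keeping identical pairs by default is a design choice in this sketch, on the assumption that they show the model when a sentence needs no change.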