Abstract
Grammatical error correction (GEC) is a challenging task for non-native second language (L2) learners and for learning machines. Data-driven GEC learning requires as much human-annotated genuine training data as possible, but producing such data at a larger scale is difficult, which makes synthetically generated large-scale parallel training data valuable for GEC systems. In this paper, we propose a method for rebuilding a corpus of synthetic parallel data using target sentences predicted by a GEC model to improve performance. Experimental results show that pre-training on the rebuilt datasets outperforms pre-training on the original synthetic datasets. Moreover, we show that our proposed training without human-annotated L2 learners’ corpora is as practical, in terms of accuracy, as conventional full-pipeline training with both synthetic datasets and L2 learners’ corpora.
- Anthology ID: 2023.bea-1.38
- Volume: Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)
- Month: July
- Year: 2023
- Address: Toronto, Canada
- Editors: Ekaterina Kochmar, Jill Burstein, Andrea Horbach, Ronja Laarmann-Quante, Nitin Madnani, Anaïs Tack, Victoria Yaneva, Zheng Yuan, Torsten Zesch
- Venue: BEA
- SIG: SIGEDU
- Publisher: Association for Computational Linguistics
- Pages: 455–465
- URL: https://aclanthology.org/2023.bea-1.38
- DOI: 10.18653/v1/2023.bea-1.38
- Cite (ACL): Mikio Oda. 2023. Training for Grammatical Error Correction Without Human-Annotated L2 Learners’ Corpora. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), pages 455–465, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal): Training for Grammatical Error Correction Without Human-Annotated L2 Learners’ Corpora (Oda, BEA 2023)
- PDF: https://preview.aclanthology.org/nschneid-patch-2/2023.bea-1.38.pdf