MixEdit: Revisiting Data Augmentation and Beyond for Grammatical Error Correction

Jingheng Ye, Yinghui Li, Yangning Li, Hai-Tao Zheng


Abstract
Data augmentation through pseudo-data generation has proven effective in mitigating data scarcity in Grammatical Error Correction (GEC). Various augmentation strategies have been widely explored, most of which are motivated by two heuristics: increasing the distribution similarity and the diversity of pseudo data. However, the mechanism underlying the effectiveness of these strategies remains poorly understood. In this paper, we aim to clarify how data augmentation improves GEC models. To this end, we introduce two interpretable and computationally efficient measures: Affinity and Diversity. Our findings indicate that a GEC data augmentation strategy characterized by high Affinity and appropriate Diversity better improves the performance of GEC models. Based on this observation, we propose MixEdit, a data augmentation approach that strategically and dynamically augments realistic data without requiring extra monolingual corpora. To verify our findings and the effectiveness of MixEdit, we conduct experiments on mainstream English and Chinese GEC datasets. The results show that MixEdit substantially improves GEC models and is complementary to traditional data augmentation methods. The source code of MixEdit is released at https://github.com/THUKElab/MixEdit.
Anthology ID:
2023.findings-emnlp.681
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
10161–10175
URL:
https://aclanthology.org/2023.findings-emnlp.681
DOI:
10.18653/v1/2023.findings-emnlp.681
Cite (ACL):
Jingheng Ye, Yinghui Li, Yangning Li, and Hai-Tao Zheng. 2023. MixEdit: Revisiting Data Augmentation and Beyond for Grammatical Error Correction. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10161–10175, Singapore. Association for Computational Linguistics.
Cite (Informal):
MixEdit: Revisiting Data Augmentation and Beyond for Grammatical Error Correction (Ye et al., Findings 2023)
PDF:
https://preview.aclanthology.org/nschneid-patch-5/2023.findings-emnlp.681.pdf