Abstract
Several metrics have been proposed for evaluating grammatical error correction (GEC) systems based on grammaticality, fluency, and adequacy of the output sentences. Previous studies of the correlation of these metrics with human quality judgments were inconclusive, due to the lack of appropriate significance tests, discrepancies in the methods, and the choice of datasets used. In this paper, we re-evaluate reference-based GEC metrics by measuring the system-level correlations with humans on a large dataset of human judgments of GEC outputs, and by properly conducting statistical significance tests. Our results show no significant advantage of GLEU over MaxMatch (M2), contradicting previous studies that claim GLEU to be superior. For a finer-grained analysis, we additionally evaluate these metrics for their agreement with human judgments at the sentence level. Our sentence-level analysis indicates that, between GLEU and M2, one metric may be more useful than the other depending on the scenario. We further qualitatively analyze these metrics, and our findings show that apart from being less interpretable and non-deterministic, GLEU also produces counter-intuitive scores in commonly occurring test examples.
- Anthology ID:
- C18-1231
- Volume:
- Proceedings of the 27th International Conference on Computational Linguistics
- Month:
- August
- Year:
- 2018
- Address:
- Santa Fe, New Mexico, USA
- Editors:
- Emily M. Bender, Leon Derczynski, Pierre Isabelle
- Venue:
- COLING
- Publisher:
- Association for Computational Linguistics
- Pages:
- 2730–2741
- URL:
- https://aclanthology.org/C18-1231
- Cite (ACL):
- Shamil Chollampatt and Hwee Tou Ng. 2018. A Reassessment of Reference-Based Grammatical Error Correction Metrics. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2730–2741, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
- Cite (Informal):
- A Reassessment of Reference-Based Grammatical Error Correction Metrics (Chollampatt & Ng, COLING 2018)
- PDF:
- https://preview.aclanthology.org/naacl24-info/C18-1231.pdf
- Code:
- nusnlp/gecmetrics
- Data:
- CoNLL-2014 Shared Task: Grammatical Error Correction
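
The system-level correlation analysis mentioned in the abstract can be illustrated with a minimal sketch. It computes Pearson and Spearman correlations between one metric's per-system scores and the corresponding human scores. The helper names (`pearson`, `ranks`, `spearman`) and all numbers are hypothetical and made up for illustration; they are not taken from the paper or from the nusnlp/gecmetrics repository, and the sketch does not include the significance tests the paper conducts.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with tied values receiving the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical per-system scores (illustrative only, not real results).
metric_scores = [0.31, 0.42, 0.38, 0.50, 0.27]
human_scores = [0.10, 0.35, 0.40, 0.60, 0.05]

print("Pearson r:", pearson(metric_scores, human_scores))
print("Spearman rho:", spearman(metric_scores, human_scores))
```

A high correlation on its own does not show that one metric is better than another; as the abstract stresses, the difference between two metrics' correlations would also need an appropriate significance test.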