Abstract
Model-based, reference-free evaluation metrics have been proposed as a fast and cost-effective approach to evaluate Natural Language Generation (NLG) systems. Despite promising recent results, we find evidence that reference-free evaluation metrics of summarization and dialog generation may be relying on spurious correlations with measures such as word overlap, perplexity, and length. We further observe that for text summarization, these metrics have high error rates when ranking current state-of-the-art abstractive summarization systems. We demonstrate that these errors can be mitigated by explicitly designing evaluation metrics to avoid spurious features in reference-free evaluation.
- Anthology ID:
- 2022.acl-long.102
- Volume:
- Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- May
- Year:
- 2022
- Address:
- Dublin, Ireland
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1443–1454
- URL:
- https://aclanthology.org/2022.acl-long.102
- DOI:
- 10.18653/v1/2022.acl-long.102
- Cite (ACL):
- Esin Durmus, Faisal Ladhak, and Tatsunori Hashimoto. 2022. Spurious Correlations in Reference-Free Evaluation of Text Generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1443–1454, Dublin, Ireland. Association for Computational Linguistics.
- Cite (Informal):
- Spurious Correlations in Reference-Free Evaluation of Text Generation (Durmus et al., ACL 2022)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2022.acl-long.102.pdf
- Code
- esdurmus/adversarial_eval
- Data
- DailyDialog