Abstract
Reference-based automatic evaluation metrics are notoriously limited for NLG due to their inability to fully capture the range of possible outputs. We examine a referenceless alternative: evaluating the adequacy of English sentences generated from Abstract Meaning Representation (AMR) graphs by parsing into AMR and comparing the parse directly to the input. We find that the errors introduced by automatic AMR parsing substantially limit the effectiveness of this approach, but a manual editing study indicates that as parsing improves, parsing-based evaluation has the potential to outperform most reference-based metrics.
- Anthology ID:
- 2021.eval4nlp-1.12
- Volume:
- Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems
- Month:
- November
- Year:
- 2021
- Address:
- Punta Cana, Dominican Republic
- Editors:
- Yang Gao, Steffen Eger, Wei Zhao, Piyawat Lertvittayakumjorn, Marina Fomicheva
- Venue:
- Eval4NLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 114–122
- URL:
- https://aclanthology.org/2021.eval4nlp-1.12
- DOI:
- 10.18653/v1/2021.eval4nlp-1.12
- Cite (ACL):
- Emma Manning and Nathan Schneider. 2021. Referenceless Parsing-Based Evaluation of AMR-to-English Generation. In Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems, pages 114–122, Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Cite (Informal):
- Referenceless Parsing-Based Evaluation of AMR-to-English Generation (Manning & Schneider, Eval4NLP 2021)
- PDF:
- https://preview.aclanthology.org/improve-issue-templates/2021.eval4nlp-1.12.pdf