Active Evaluation: Efficient NLG Evaluation with Few Pairwise Comparisons

Akash Kumar Mohankumar; Mitesh M. Khapra

doi:10.18653/v1/2022.acl-long.600


Abstract

Recent studies have shown the advantages of evaluating NLG systems using pairwise comparisons as opposed to direct assessment. Given $k$ systems, a naive approach for identifying the top-ranked system would be to uniformly obtain pairwise comparisons from all $\binom{k}{2}$ pairs of systems. However, this can be very expensive, as the number of human annotations required would grow quadratically with $k$. In this work, we introduce Active Evaluation, a framework to efficiently identify the top-ranked system by actively choosing system pairs for comparison using dueling bandit algorithms. We perform extensive experiments with 13 dueling bandit algorithms on 13 NLG evaluation datasets spanning 5 tasks and show that the number of human annotations can be reduced by 80%. To further reduce the number of human annotations, we propose model-based dueling bandit algorithms which combine automatic evaluation metrics with human evaluations. Specifically, we eliminate sub-optimal systems even before the human annotation process and perform human evaluations only on test examples where the automatic metric is highly uncertain. This further reduces the number of human annotations required by 89%. In effect, we show that identifying the top-ranked system requires only a few hundred human annotations, a number that grows linearly with $k$. Lastly, we provide practical recommendations and best practices to identify the top-ranked system efficiently. Our code has been made publicly available at https://github.com/akashkm99/duelnlg
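To make the quadratic-versus-linear contrast in the abstract concrete, here is a minimal Python sketch. It is not one of the dueling bandit algorithms evaluated in the paper and does not use the duelnlg code. It is a simple champion-versus-challenger elimination loop, assuming a clear best system exists, with a fixed-confidence Hoeffding radius used only as a stopping heuristic. The names `uniform_budget`, `active_top_system`, `duel`, `max_per_pair`, and `delta` are hypothetical and chosen for illustration.

```python
import math
from typing import Callable, Tuple


def uniform_budget(k: int, per_pair: int) -> int:
    """Annotations needed if each of the C(k, 2) pairs gets `per_pair` judgments."""
    return math.comb(k, 2) * per_pair


def active_top_system(
    k: int,
    duel: Callable[[int, int], bool],
    max_per_pair: int = 200,
    delta: float = 0.05,
) -> Tuple[int, int]:
    """Champion-vs-challenger elimination (illustrative assumption, not the paper's method).

    `duel(i, j)` stands in for one human annotation and returns True when
    system i is preferred over system j. Returns (winner, annotations used).
    """
    champion, used = 0, 0
    for challenger in range(1, k):
        wins = 0
        for n in range(1, max_per_pair + 1):
            wins += duel(champion, challenger)
            used += 1
            # Hoeffding radius: stop once the empirical win rate is
            # separated from 0.5, or when the per-pair budget runs out.
            radius = math.sqrt(math.log(2 / delta) / (2 * n))
            if abs(wins / n - 0.5) > radius:
                break
        if wins / n < 0.5:
            champion = challenger  # the challenger takes over as champion
    return champion, used
```

Because only k - 1 pairs are ever compared, the annotation count in this sketch is bounded by `(k - 1) * max_per_pair`, which is linear in k, whereas `uniform_budget` grows with C(k, 2). In practice `duel` could be backed by a simulated preference matrix or by a real annotation queue; the elimination scheme trades robustness for simplicity, since a single wrong elimination cannot be undone.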

Anthology ID: 2022.acl-long.600
Volume: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: May
Year: 2022
Address: Dublin, Ireland
Editors: Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 8761–8781
URL: https://aclanthology.org/2022.acl-long.600
DOI: 10.18653/v1/2022.acl-long.600
Award: Outstanding Paper
Cite (ACL): Akash Kumar Mohankumar and Mitesh Khapra. 2022. Active Evaluation: Efficient NLG Evaluation with Few Pairwise Comparisons. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8761–8781, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal): Active Evaluation: Efficient NLG Evaluation with Few Pairwise Comparisons (Mohankumar & Khapra, ACL 2022)
PDF: https://preview.aclanthology.org/naacl24-info/2022.acl-long.600.pdf
Video: https://preview.aclanthology.org/naacl24-info/2022.acl-long.600.mp4
Code: akashkm99/duelnlg
Data: CoNLL-2014 Shared Task: Grammatical Error Correction, ParaBank, WMT 2015, WMT 2016
