Towards Better Evaluation of Instruction-Following: A Case-Study in Summarization

Ondrej Skopek, Rahul Aralikatte, Sian Gooding, Victor Carbune


Abstract
Despite recent advances, evaluating how well large language models (LLMs) follow user instructions remains an open problem. While evaluation methods of language models have seen a rise in prompt-based approaches, limited work on the correctness of these methods has been conducted. In this work, we perform a meta-evaluation of a variety of metrics to quantify how accurately they measure the instruction-following abilities of LLMs. Our investigation is performed on grounded query-based summarization by collecting a new short-form, real-world dataset riSum, containing 300 document-instruction pairs with 3 answers each. All 900 answers are rated by 3 human annotators. Using riSum, we analyze the agreement between evaluation methods and human judgment. Finally, we propose new LLM-based reference-free evaluation methods that improve upon established baselines and perform on par with costly reference-based metrics that require high-quality summaries.
Anthology ID: 2023.conll-1.16
Volume: Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)
Month: December
Year: 2023
Address: Singapore
Editors: Jing Jiang, David Reitter, Shumin Deng
Venue: CoNLL
Publisher: Association for Computational Linguistics
Pages: 221–237
URL: https://aclanthology.org/2023.conll-1.16
DOI: 10.18653/v1/2023.conll-1.16
Cite (ACL): Ondrej Skopek, Rahul Aralikatte, Sian Gooding, and Victor Carbune. 2023. Towards Better Evaluation of Instruction-Following: A Case-Study in Summarization. In Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL), pages 221–237, Singapore. Association for Computational Linguistics.
Cite (Informal): Towards Better Evaluation of Instruction-Following: A Case-Study in Summarization (Skopek et al., CoNLL 2023)
PDF: https://preview.aclanthology.org/nschneid-patch-3/2023.conll-1.16.pdf