SummEQuAL: Summarization Evaluation via Question Answering using Large Language Models

Junyuan Liu, Zhengyan Shi, Aldo Lipani


Abstract
Summarization is hard to evaluate due to its diverse and abstract nature. Although n-gram-based metrics like BLEU and ROUGE are prevalent, they often do not align well with human evaluations. While model-based alternatives such as BERTScore improve on this, they typically require extensive labelled data. The advent of Large Language Models (LLMs) presents a promising avenue for evaluation. To this end, we introduce SummEQuAL, a novel content-based framework using LLMs for unified, reproducible summarization evaluation. SummEQuAL evaluates summaries by comparing their content with the source document, employing a question-answering approach to gauge both recall and precision. To validate SummEQuAL’s effectiveness, we develop a dataset based on MultiWOZ. We conduct experiments on SummEval and our MultiWOZ-based dataset, showing that SummEQuAL substantially improves the quality of summarization evaluation. Notably, SummEQuAL demonstrates a 19.7% improvement over QuestEval in terms of sample-level Pearson correlation with human assessments of consistency on the SummEval dataset. Furthermore, it exceeds the performance of the BERTScore baseline by achieving a 17.3% increase in Spearman correlation on our MultiWOZ-based dataset. Our study illuminates the potential of LLMs for a unified evaluation framework, setting a new paradigm for future summarization evaluation.
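The QA-based recall/precision idea in the abstract can be sketched as follows. This is an illustrative toy, not the paper's implementation: all names (`contains_answer`, `summ_equal_score`) are hypothetical, and a crude token-containment check stands in for the LLM question answering that SummEQuAL actually uses. Recall asks whether questions mined from the source are answerable from the summary; precision asks the reverse.

```python
def contains_answer(text: str, answer: str) -> bool:
    """Crude stand-in for LLM QA: are all answer tokens present in the text?"""
    tokens = set(text.lower().split())
    return all(tok in tokens for tok in answer.lower().split())


def summ_equal_score(source_qas, summary_qas, source: str, summary: str) -> float:
    """Toy QA-based summary score.

    source_qas / summary_qas: lists of (question, gold_answer) pairs mined
    from the source document and the summary respectively.
    Recall  = fraction of source questions answerable from the summary.
    Precision = fraction of summary questions answerable from the source.
    Returns their harmonic mean (F1).
    """
    recall = sum(contains_answer(summary, a) for _, a in source_qas) / len(source_qas)
    precision = sum(contains_answer(source, a) for _, a in summary_qas) / len(summary_qas)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative usage with hand-made QA pairs:
source = "alice met bob in paris on monday"
summary = "alice met bob in paris"
source_qas = [("who met bob?", "alice"), ("where?", "paris"), ("when?", "monday")]
summary_qas = [("who met bob?", "alice"), ("where?", "paris")]

score = summ_equal_score(source_qas, summary_qas, source, summary)
# recall = 2/3 (the summary omits "monday"), precision = 1.0, F1 = 0.8
```

In the actual framework, both the question generation and the answering are performed by an LLM, and answer agreement is judged semantically rather than by token containment.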
Anthology ID:
2024.nlrse-1.5
Volume:
Proceedings of the 2nd Workshop on Natural Language Reasoning and Structured Explanations (@ACL 2024)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Bhavana Dalvi Mishra, Greg Durrett, Peter Jansen, Ben Lipkin, Danilo Neves Ribeiro, Lionel Wong, Xi Ye, Wenting Zhao
Venues:
NLRSE | WS
Publisher:
Association for Computational Linguistics
Pages:
46–55
URL:
https://aclanthology.org/2024.nlrse-1.5
Cite (ACL):
Junyuan Liu, Zhengyan Shi, and Aldo Lipani. 2024. SummEQuAL: Summarization Evaluation via Question Answering using Large Language Models. In Proceedings of the 2nd Workshop on Natural Language Reasoning and Structured Explanations (@ACL 2024), pages 46–55, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
SummEQuAL: Summarization Evaluation via Question Answering using Large Language Models (Liu et al., NLRSE-WS 2024)
PDF:
https://preview.aclanthology.org/nschneid-patch-4/2024.nlrse-1.5.pdf