Revisiting Automatic Evaluation of Extractive Summarization Task: Can We Do Better than ROUGE?

Mousumi Akter, Naman Bansal, Shubhra Kanti Karmaker


Abstract
It has long been the norm to evaluate automatic summarization using the popular ROUGE metric. Although several past studies have highlighted the limitations of ROUGE, researchers have yet to reach a consensus on a better alternative. One major limitation of the traditional ROUGE metric is its lack of semantic understanding: it relies on direct n-gram overlap. In this paper, we focus exclusively on the extractive summarization task and propose a semantic-aware nCG (normalized cumulative gain)-based evaluation metric, called Sem-nCG, for evaluating this task. One fundamental contribution of the paper is that it demonstrates how to generate more reliable, semantic-aware ground truths for evaluating extractive summarization without any additional human intervention. To the best of our knowledge, this work is the first of its kind. We conducted extensive experiments with the new metric on the widely used CNN/DailyMail dataset. Experimental results show that the new Sem-nCG metric is indeed semantic-aware, exhibits a higher correlation with human judgement (i.e., is more reliable), and disagrees with the original ROUGE metric in a large number of cases (suggesting that ROUGE often leads to inaccurate conclusions, as also verified by humans).
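The abstract describes Sem-nCG only at a high level. The sketch below illustrates the generic nCG (normalized cumulative gain) computation that such a metric builds on; it is not the paper's exact procedure. In particular, the gain assignment here (cosine similarity between each source sentence and the reference summary, via a sentence encoder passed in as `encode`) and the toy bag-of-words encoder are assumptions made for illustration; see the PDF for the authors' actual construction.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors, 0.0 if either is zero."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

def ncg_at_k(model_summary_sents, doc_sents, reference_summary, encode, k=3):
    """nCG@k for an extractive summary.

    `encode` maps a string to a dense vector (any sentence encoder);
    gains come from semantic similarity to the reference summary, which
    is one plausible reading of a "semantic-aware ground truth". Model
    sentences are matched by exact string identity with `doc_sents`.
    """
    ref_vec = encode(reference_summary)
    # Gain of each source sentence = similarity to the reference summary.
    gains = {s: cosine(encode(s), ref_vec) for s in doc_sents}
    # Cumulative gain of the k sentences the model actually extracted.
    cg = sum(gains.get(s, 0.0) for s in model_summary_sents[:k])
    # Ideal cumulative gain: the k best-scoring source sentences.
    icg = sum(sorted(gains.values(), reverse=True)[:k])
    return cg / icg if icg > 0 else 0.0

# Toy usage with a hypothetical bag-of-words "encoder" (illustration only).
vocab = ["cat", "sat", "mat", "dog", "ran", "park"]
def encode(text):
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

doc = ["the cat sat on the mat", "the dog ran", "the dog ran in the park"]
summary = ["the cat sat on the mat"]
reference = "a cat was sitting on a mat"
print(ncg_at_k(summary, doc, reference, encode, k=1))  # 1.0: best sentence chosen
```

Note that nCG, unlike the nDCG familiar from ranking evaluation, applies no rank-position discount: it only compares the total gain of the selected sentences against the best achievable gain at the same budget k.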
Anthology ID:
2022.findings-acl.122
Volume:
Findings of the Association for Computational Linguistics: ACL 2022
Month:
May
Year:
2022
Address:
Dublin, Ireland
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1547–1560
URL:
https://aclanthology.org/2022.findings-acl.122
DOI:
10.18653/v1/2022.findings-acl.122
Cite (ACL):
Mousumi Akter, Naman Bansal, and Shubhra Kanti Karmaker. 2022. Revisiting Automatic Evaluation of Extractive Summarization Task: Can We Do Better than ROUGE?. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1547–1560, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Revisiting Automatic Evaluation of Extractive Summarization Task: Can We Do Better than ROUGE? (Akter et al., Findings 2022)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2022.findings-acl.122.pdf
Video:
https://preview.aclanthology.org/ingestion-script-update/2022.findings-acl.122.mp4