HOLMS: Alternative Summary Evaluation with Large Language Models

Yassine Mrabet, Dina Demner-Fushman


Abstract
Efficient document summarization requires evaluation measures that can not only rank a set of systems by an average score, but also indicate which individual summary is better than another. However, despite very active research on summarization approaches, few works have proposed new evaluation measures in recent years. The standard measures used to develop summarization systems are most often ROUGE and BLEU, which, although effective for overall system ranking, remain lexical in nature and offer limited potential as training signals for neural networks. In this paper, we present a new hybrid evaluation measure for summarization, called HOLMS, that combines language models pre-trained on large corpora with lexical similarity measures. Through several experiments, we show that HOLMS substantially outperforms ROUGE and BLEU in correlation with human judgments on several extractive summarization datasets, for both linguistic quality and Pyramid scores.
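The abstract describes HOLMS as a hybrid of pre-trained language model similarity and lexical similarity. The sketch below illustrates the general shape of such a hybrid measure, not the paper's actual formulation: the `embed` function is a bag-of-words stand-in for a real pre-trained LM embedding, and `alpha` is a hypothetical mixing weight.

```python
from collections import Counter
import math

def lexical_f1(candidate, reference):
    # ROUGE-1-style unigram overlap F-score between two texts.
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def embed(text, vocab):
    # Stand-in for a pre-trained LM sentence embedding: a plain
    # bag-of-words count vector over a shared vocabulary. HOLMS
    # itself uses representations from large pre-trained models.
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def hybrid_score(candidate, reference, alpha=0.5):
    # Weighted combination of a semantic (embedding-based) score
    # and a lexical (overlap-based) score; alpha is hypothetical.
    vocab = sorted(set(candidate.lower().split()) | set(reference.lower().split()))
    semantic = cosine(embed(candidate, vocab), embed(reference, vocab))
    return alpha * semantic + (1 - alpha) * lexical_f1(candidate, reference)
```

Identical texts score 1.0 and fully disjoint texts score 0.0 under this sketch; a real measure would replace `embed` with contextual LM representations.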
Anthology ID:
2020.coling-main.498
Volume:
Proceedings of the 28th International Conference on Computational Linguistics
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Venue:
COLING
International Committee on Computational Linguistics
Pages:
5679–5688
URL:
https://aclanthology.org/2020.coling-main.498
DOI:
10.18653/v1/2020.coling-main.498
Cite (ACL):
Yassine Mrabet and Dina Demner-Fushman. 2020. HOLMS: Alternative Summary Evaluation with Large Language Models. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5679–5688, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Cite (Informal):
HOLMS: Alternative Summary Evaluation with Large Language Models (Mrabet & Demner-Fushman, COLING 2020)
PDF:
https://preview.aclanthology.org/author-url/2020.coling-main.498.pdf