Red-faced ROUGE: Examining the Suitability of ROUGE for Opinion Summary Evaluation

Wenyi Tay, Aditya Joshi, Xiuzhen Zhang, Sarvnaz Karimi, Stephen Wan


Abstract
One of the most common metrics to automatically evaluate opinion summaries is ROUGE, a metric developed for text summarisation. ROUGE counts the overlap of words or word units between a candidate summary and reference summaries. This formulation treats all words in the reference summary equally. In opinion summaries, however, not all words in the reference are equally important. Opinion summarisation requires correctly pairing two types of semantic information between candidate and reference summaries: (1) the aspect or opinion target; and (2) the polarity. We investigate the suitability of ROUGE for evaluating opinion summaries of online reviews. Using three simulation-based experiments, we evaluate the behaviour of ROUGE for opinion summarisation on the ability to match aspect and polarity. We show that ROUGE cannot distinguish opinion summaries of similar or opposite polarities for the same aspect. Moreover, ROUGE scores have significant variance under different configuration settings. We therefore present three recommendations for future work that uses ROUGE to evaluate opinion summarisation.
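The polarity-blindness the abstract describes follows directly from ROUGE's overlap formulation. As a minimal sketch (not the authors' experimental code), the following implements ROUGE-N recall and shows that flipping the polarity word in an otherwise identical sentence barely changes the score:

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: fraction of reference n-grams that appear in the candidate."""
    def ngrams(text, n):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

# Hypothetical example sentences: opposite polarities about the same
# aspect (battery life) score nearly identically under ROUGE-1.
ref = "the battery life is very good"
print(rouge_n_recall("the battery life is very good", ref))  # 1.0
print(rouge_n_recall("the battery life is very bad", ref))   # 5/6 ≈ 0.83
```

A candidate that reverses the reference's opinion still matches five of six reference unigrams, illustrating why ROUGE alone cannot penalise a polarity error.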
Anthology ID:
U19-1008
Volume:
Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association
Month:
4--6 December
Year:
2019
Address:
Sydney, Australia
Venue:
ALTA
Publisher:
Australasian Language Technology Association
Pages:
52–60
URL:
https://aclanthology.org/U19-1008
Cite (ACL):
Wenyi Tay, Aditya Joshi, Xiuzhen Zhang, Sarvnaz Karimi, and Stephen Wan. 2019. Red-faced ROUGE: Examining the Suitability of ROUGE for Opinion Summary Evaluation. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, pages 52–60, Sydney, Australia. Australasian Language Technology Association.
Cite (Informal):
Red-faced ROUGE: Examining the Suitability of ROUGE for Opinion Summary Evaluation (Tay et al., ALTA 2019)
PDF:
https://preview.aclanthology.org/auto-file-uploads/U19-1008.pdf