Best Practices for Crowd-based Evaluation of German Summarization: Comparing Crowd, Expert and Automatic Evaluation

Neslihan Iskender, Tim Polzehl, Sebastian Möller

Abstract
One of the main challenges in developing summarization tools is evaluating summarization quality. On the one hand, human assessment of summarization quality by linguistic experts is slow, expensive, and still not a standardized procedure. On the other hand, automatic metrics are reported not to correlate sufficiently with human quality ratings. As a solution, we propose crowdsourcing as a fast, scalable, and cost-effective alternative to expert evaluation for assessing the intrinsic and extrinsic quality of summaries, comparing crowd ratings with expert ratings and automatic metrics such as ROUGE, BLEU, and BERTScore on a German summarization data set. Our results provide a basis for best practices in crowd-based summarization evaluation: they cover major influential factors such as the best annotation aggregation method, the influence of readability and reading effort on summarization evaluation, and the optimal number of crowd workers needed to achieve results comparable to experts, especially for dimensions such as overall quality, grammaticality, referential clarity, focus, structure & coherence, summary usefulness, and summary informativeness.
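To make the comparison concrete, below is a minimal sketch of how one might correlate aggregated crowd ratings with an automatic metric, assuming Python with the rouge_score and scipy packages. It is not the authors' pipeline; the German summaries and worker ratings are invented placeholders (not the paper's data set), and mean aggregation and Spearman correlation stand in for the aggregation and correlation analyses the paper compares.

```python
# Illustrative sketch only: correlate mean-aggregated crowd ratings with
# ROUGE-L. Texts and ratings are invented placeholders.
import numpy as np
from rouge_score import rouge_scorer
from scipy.stats import spearmanr

# Hypothetical German reference and candidate summaries.
references = [
    "Der Bericht fasst die wichtigsten Ergebnisse der Studie zusammen.",
    "Die Stadt plant neue Radwege entlang der Hauptstrasse.",
    "Die Forscher fanden keinen Zusammenhang zwischen Schlafdauer und Leistung.",
]
candidates = [
    "Der Bericht beschreibt die zentralen Ergebnisse der Studie.",
    "Neue Radwege sollen an der Hauptstrasse entstehen.",
    "Es wurde kein Zusammenhang zwischen Schlaf und Leistung gefunden.",
]

# Hypothetical crowd ratings (overall quality, 1-5 scale), several workers
# per summary; the mean is one of several aggregation methods to compare.
crowd_ratings = [
    [4, 5, 4, 3, 4],
    [3, 4, 3, 3, 2],
    [5, 4, 4, 5, 4],
]
aggregated = [float(np.mean(r)) for r in crowd_ratings]

# ROUGE-L F1 as the automatic metric; stemming is disabled because the
# library's stemmer (and default tokenizer) is English-oriented.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
rouge_l = [
    scorer.score(ref, cand)["rougeL"].fmeasure
    for ref, cand in zip(references, candidates)
]

# Rank correlation between aggregated crowd scores and the metric.
rho, p_value = spearmanr(aggregated, rouge_l)
print(f"ROUGE-L F1 per summary: {rouge_l}")
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
```

The same pattern extends to other metrics mentioned in the abstract (e.g., BLEU via sacrebleu or BERTScore with lang="de") and to other aggregation choices (median, majority vote) or subsets of workers when probing how many annotators are needed.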
Anthology ID:
2020.eval4nlp-1.16
Volume:
Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems
Month:
November
Year:
2020
Address:
Online
Editors:
Steffen Eger, Yang Gao, Maxime Peyrard, Wei Zhao, Eduard Hovy
Venue:
Eval4NLP
Publisher:
Association for Computational Linguistics
Pages:
164–175
URL:
https://aclanthology.org/2020.eval4nlp-1.16
DOI:
10.18653/v1/2020.eval4nlp-1.16
Bibkey:
Cite (ACL):
Neslihan Iskender, Tim Polzehl, and Sebastian Möller. 2020. Best Practices for Crowd-based Evaluation of German Summarization: Comparing Crowd, Expert and Automatic Evaluation. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, pages 164–175, Online. Association for Computational Linguistics.
Cite (Informal):
Best Practices for Crowd-based Evaluation of German Summarization: Comparing Crowd, Expert and Automatic Evaluation (Iskender et al., Eval4NLP 2020)
PDF:
https://preview.aclanthology.org/nschneid-patch-2/2020.eval4nlp-1.16.pdf
Video:
https://slideslive.com/38939713