ReproHum #0729-04: Partial reproduction of the human evaluation of the MemSum and NeuSum summarisation systems

Simon Mille, Michela Lorandi


Abstract
In this paper, we present our reproduction of part of the human evaluation originally carried out by Gu et al. (2022), conducted as part of Track B of ReproNLP 2025. Four human annotators were asked to rank two candidate summaries according to their overall quality, with a reference summary shown alongside the two candidates at evaluation time. We describe the original experiment and detail the steps we followed to carry out the reproduction, including the implementation of some missing pieces of code. Our results, in particular the high coefficients of variation and the low inter-annotator agreement, suggest a low level of reproducibility in the original experiment despite identical pairwise ranks. However, given the very small sample size (two systems, one rating), we remain cautious about drawing definitive conclusions.
Anthology ID:
2025.gem-1.57
Volume:
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
Month:
July
Year:
2025
Address:
Vienna, Austria and virtual meeting
Editors:
Kaustubh Dhole, Miruna Clinciu
Venues:
GEM | WS
Publisher:
Association for Computational Linguistics
Pages:
615–621
URL:
https://preview.aclanthology.org/corrections-2025-08/2025.gem-1.57/
Cite (ACL):
Simon Mille and Michela Lorandi. 2025. ReproHum #0729-04: Partial reproduction of the human evaluation of the MemSum and NeuSum summarisation systems. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²), pages 615–621, Vienna, Austria and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
ReproHum #0729-04: Partial reproduction of the human evaluation of the MemSum and NeuSum summarisation systems (Mille & Lorandi, GEM 2025)
PDF:
https://preview.aclanthology.org/corrections-2025-08/2025.gem-1.57.pdf