ReproHum: #0744-02: Investigating the Reproducibility of Semantic Preservation Human Evaluations

Mohammad Arvan, Natalie Parde


Abstract
Reproducibility remains a fundamental challenge for human evaluation in Natural Language Processing (NLP), particularly due to the inherent subjectivity and variability of human judgments. This paper presents a reproduction study of the human evaluation protocol introduced by Hosking and Lapata (2021), which assesses semantic preservation in paraphrase generation models. By faithfully reproducing the original experiment, with careful adaptation where necessary, and applying the Quantified Reproducibility Assessment framework (Belz and Thomson, 2024a; Belz, 2022), we demonstrate strong agreement with the original findings, confirming the semantic preservation ranking among the four paraphrase models. Our analyses reveal moderate inter-annotator agreement and low variability in key results, indicating a good degree of reproducibility despite practical deviations in participant recruitment and evaluation platform. These findings highlight both the feasibility of reproducing human evaluation studies in NLP and the challenges involved. We discuss implications for improving methodological rigor, transparent reporting, and standardized protocols to bolster reproducibility in future human evaluations. Our data and analysis scripts are publicly available to support ongoing community efforts toward reproducible evaluation in NLP and beyond.
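The abstract's "low variability in key results" refers to the Quantified Reproducibility Assessment (QRA) framework it cites, which quantifies reproducibility via a small-sample-corrected coefficient of variation (CV*) over repeated measurements of the same quantity. The following is a minimal sketch of that measure as described in Belz (2022), not the authors' actual analysis script; the function name and the example numbers are illustrative assumptions, and the exact correction factor should be checked against the cited paper.

```python
import statistics

def cv_star(measurements: list[float]) -> float:
    """Small-sample-corrected coefficient of variation (CV*), as a percentage.

    Sketch of the QRA-style measure (Belz, 2022): the sample standard
    deviation divided by the mean, scaled by the small-sample correction
    factor (1 + 1/(4n)) and expressed as a percentage. Lower values
    indicate that the repeated measurements, and hence the studies that
    produced them, agree more closely.
    """
    n = len(measurements)
    mean = statistics.mean(measurements)
    stdev = statistics.stdev(measurements)  # sample (n-1) standard deviation
    return (1 + 1 / (4 * n)) * (stdev / mean) * 100

# Hypothetical example: the same evaluation score obtained in the original
# study and in two reproductions.
print(cv_star([0.72, 0.70, 0.74]))  # small CV* -> high reproducibility
```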
Anthology ID:
2025.gem-1.54
Volume:
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
Month:
July
Year:
2025
Address:
Vienna, Austria and virtual meeting
Editors:
Kaustubh Dhole, Miruna Clinciu
Venues:
GEM | WS
Publisher:
Association for Computational Linguistics
Pages:
590–600
URL:
https://preview.aclanthology.org/corrections-2025-08/2025.gem-1.54/
Cite (ACL):
Mohammad Arvan and Natalie Parde. 2025. ReproHum: #0744-02: Investigating the Reproducibility of Semantic Preservation Human Evaluations. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²), pages 590–600, Vienna, Austria and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
ReproHum: #0744-02: Investigating the Reproducibility of Semantic Preservation Human Evaluations (Arvan & Parde, GEM 2025)
PDF:
https://preview.aclanthology.org/corrections-2025-08/2025.gem-1.54.pdf