Manuela Hürlimann
Other people with similar names: Manuela Huerlimann
2026
ReproHum #0031–01: Reproducing a Human Readability Evaluation for Question–Answer Generation Systems
Manuela Hürlimann | Mark Cieliebak
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Human evaluations play a central role in assessing natural language processing systems, yet their robustness and reproducibility remain incompletely understood. This paper reports on a reproduction of the human readability evaluation from Yao et al. (2022) for question–answer generation (QAG) systems, conducted within the ReproHum project and the ReproNLP 2026 shared task (Belz et al., 2026). The original evaluation compared three QAG systems with respect to three criteria. We reproduced the evaluation of one of these criteria, readability, using a new group of five evaluators. We report descriptive results, inter-annotator agreement, system-level comparisons, and cross-study robustness metrics compared to the original study and two previous reproductions. Our results support all conclusions of the original evaluation and are largely consistent with two previous reproductions.