ReproNLP 2026: A Third Replication of the Human Evaluation of a QAG System for Children’s Storybooks

Marcel Mroczek, Chiara Albarello, Paul-Emmanuel Floch, Maciej Gawinecki


Abstract
Reproducibility of human evaluations in Natural Language Processing remains a critical open challenge. This paper presents a third independent replication of the human evaluation from Yao et al. (2022), which assessed an automated Question-Answer Generation (QAG) system for children’s storybooks against a baseline system and human-authored ground truth, across three criteria (Readability, Question Relevance, and Answer Relevance), using five NLP-literate annotators. Our replication confirms the main findings of the original study: the QAG system outperforms the baseline on Readability and Question Relevance, and Ground Truth ranks highest across all criteria. System rankings are preserved across all three criteria, with the exception of a statistically non-significant difference in Answer Relevance. This holds true despite a severe drop in inter-annotator agreement for Readability. We further document several methodological concerns, some unreported in prior replications, including data quality issues and evaluation design limitations identified during our pilot study.
Anthology ID:
2026.gem-main.85
Volume:
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
Venues:
GEM | WS
Publisher:
Association for Computational Linguistics
Pages:
1082–1093
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.85/
Cite (ACL):
Marcel Mroczek, Chiara Albarello, Paul-Emmanuel Floch, and Maciej Gawinecki. 2026. ReproNLP 2026: A Third Replication of the Human Evaluation of a QAG System for Children’s Storybooks. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 1082–1093, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
ReproNLP 2026: A Third Replication of the Human Evaluation of a QAG System for Children’s Storybooks (Mroczek et al., GEM 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.85.pdf