Marcel Mroczek

2026

ReproNLP 2026: A Third Replication of the Human Evaluation of a QAG System for Children’s Storybooks
Marcel Mroczek | Chiara Albarello | Paul-Emmanuel Floch | Maciej Gawinecki
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)

Abstract: Reproducibility of human evaluations in Natural Language Processing remains a critical open challenge. This paper presents a third independent replication of the human evaluation from Yao et al. (2022), which assessed an automated Question-Answer Generation (QAG) system for children’s storybooks against a baseline system and human-authored ground truth, across three criteria — Readability, Question Relevance, and Answer Relevance — using five NLP-literate annotators. Our replication confirms the main findings of the original study: the QAG system outperforms the baseline on Readability and Question Relevance, and Ground Truth ranks highest across all criteria. System rankings are preserved across all three criteria, with the exception of a statistically non-significant difference in Answer Relevance. This holds true despite a severe drop in inter-annotator agreement for Readability. We further document several methodological concerns, some unreported in prior replications, including data quality issues and evaluation design limitations identified during our pilot study.

