Marcel Mroczek


2026

Abstract: Reproducibility of human evaluations in Natural Language Processing remains a critical open challenge. This paper presents a third independent replication of the human evaluation from Yao et al. (2022), which assessed an automated Question-Answer Generation (QAG) system for children’s storybooks against a baseline system and human-authored ground truth, across three criteria — Readability, Question Relevance, and Answer Relevance — using five NLP-literate annotators. Our replication confirms the main findings of the original study: the QAG system outperforms the baseline on Readability and Question Relevance, and Ground Truth ranks highest across all criteria. System rankings are preserved across all three criteria, with the exception of a statistically non-significant difference in Answer Relevance. This holds true despite a severe drop in inter-annotator agreement for Readability. We further document several methodological concerns, some unreported in prior replications, including data quality issues and evaluation design limitations identified during our pilot study.