“All that Glitters”: Techniques for Evaluations with Unreliable Model and Human Annotations

Michael Hardy


Abstract
“Gold” and “ground truth” human-mediated labels have error. This error can escape commonly reported metrics of label quality or obscure questions of accuracy, bias, fairness, and usefulness during model evaluation. This study demonstrates methods for answering such questions even when expert human labels have very low reliability. We analyze human labels and model ratings of classroom teaching quality from two LLM architecture families: transformer encoders and GPT decoders. First, we demonstrate that using standard metrics in the presence of poor labels can mask both label and model quality. The encoder family of models achieves state-of-the-art, even “super-human”, results across all classroom annotation tasks under standard metrics. However, evaluation techniques that account for unreliable labels reveal important flaws, including spurious correlations and nonrandom racial biases across models and humans. We estimate that if models were used in a human-in-the-loop context, the variance contributed by GPT model labels would worsen ratings. These techniques also highlight tasks where encoders could offer an 80% reduction in human costs while also reducing bias.
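The sketch below is not taken from the paper; it illustrates one standard technique for the kind of setting the abstract describes, Spearman's correction for attenuation, which estimates the true-score correlation between model and human ratings given each measure's reliability. The function name and all numbers are hypothetical, chosen only to show how a mediocre observed agreement can correspond to a much stronger underlying relationship when the “gold” labels themselves are noisy.

```python
# Illustrative sketch (not the paper's method): Spearman's correction for
# attenuation. Given an observed model-human correlation and estimates of
# each measure's reliability, it estimates the correlation between the
# underlying true scores. All values here are hypothetical.

import numpy as np

def disattenuated_correlation(r_observed: float,
                              reliability_model: float,
                              reliability_human: float) -> float:
    """Estimate the true-score correlation by correcting the observed
    correlation for unreliability in both measures."""
    return r_observed / np.sqrt(reliability_model * reliability_human)

# Hypothetical inputs: low human inter-rater reliability (0.45), higher
# model rating consistency (0.85), and a mediocre raw correlation (0.40).
r_obs = 0.40
rel_human = 0.45   # e.g., intraclass correlation among expert raters
rel_model = 0.85   # e.g., consistency across repeated model ratings

r_true = disattenuated_correlation(r_obs, rel_model, rel_human)
print(f"Observed r = {r_obs:.2f}, disattenuated r = {r_true:.2f}")
# Prints roughly 0.65: unreliable labels can make a genuinely strong
# model look weak when only raw agreement metrics are reported.
```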
Anthology ID:
2025.findings-naacl.120
Volume:
Findings of the Association for Computational Linguistics: NAACL 2025
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2250–2278
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.120/
Cite (ACL):
Michael Hardy. 2025. “All that Glitters”: Techniques for Evaluations with Unreliable Model and Human Annotations. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 2250–2278, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
“All that Glitters”: Techniques for Evaluations with Unreliable Model and Human Annotations (Hardy, Findings 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.120.pdf