Michael Hardy

2026

Knowledge without Wisdom: Measuring Misalignment between LLMs and Intended Impact
Michael Hardy | Yunsung Kim
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

LLMs increasingly excel on AI benchmarks, but doing so does not guarantee validity for downstream tasks. This study contrasts LLM alignment on benchmarks, downstream tasks, and, importantly the intended impact of those tasks. We evaluate the performance of leading LLMs (i.e., generative pre-trained base models) on difficult-to-verify tasks of the teaching and learning of schoolchildren. Across all LLMs, inter-model behaviors on disparate tasks correlate higher than they do with expert human behaviors on target tasks. These biases shared across LLMs are poorly aligned with downstream measures of teaching quality and often negatively aligned with the intended impact of student learning outcomes. Further, we find multi-model ensembles, both unanimous model voting and expert-weighting by benchmark performance, further exacerbate misalignment with learning. We measure that selection of LLM and/or prompting strategy only reliably accounts for 15% of all measured misalignment error and that variation in misalignment error is shared across LLMs, suggesting that common pretraining accounts for much of the misalignment in these tasks. We demonstrate methods for robustly measuring alignment of complex tasks and provide unique insights into practical applications of LLMs in high-noise contexts.

2025

pdf bib abs

“All that Glitters”: Techniques for Evaluations with Unreliable Model and Human Annotations
Michael Hardy
Findings of the Association for Computational Linguistics: NAACL 2025

“Gold” and “ground truth” human-mediated labels have error. This error can escape commonly reported metrics of label quality or obscure questions of accuracy, bias, fairness, and usefulness during model evaluation. This study demonstrates methods for answering such questions even in the context of very low reliabilities from expert humans. We analyze human labels, GPT model ratings, and transformer encoder model ratings of the quality of classroom teaching from two LLM architecture families–encoders and GPT decoders. First, we demonstrate that using standard metrics in the presence of poor labels can mask both label and model quality. The encoder family of models achieve state-of-the-art, even “super-human”, results across all classroom annotation tasks using standard metrics. However, evaluation techniques accounting for unreliable labels reveal important flaws, including spurious correlations and nonrandom racial biases across models and humans. We estimate that if models were used in a human-in-the-loop context, the variance contributed by GPT model labels would worsen ratings. These techniques also highlight tasks where encoders could offer 80% reduction in human costs while also reducing bias.

pdf bib abs

Measuring Teaching with LLMs
Michael Hardy
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers

This paper introduces custom Large Language Models using sentence-level embeddings to measure teaching quality. The models achieve human-level performance in analyzing classroom transcripts, outperforming average human rater correlation. Aggregate model scores align with student learning outcomes, establishing a powerful new methodology for scalable teacher feedback. Important limitations discussed.

Co-authors

Yunsung Kim 1

Venues

Fix author