Ask Me Like I’m Human: LLM-based Evaluation with For-Human Instructions Correlates Better with Human Evaluations than Human Judges

Rudali Huidrom, Anya Belz


Abstract
Human evaluation in NLP has high cost and expertise requirements, and instruction-tuned LLMs are increasingly seen as a viable alternative. However, reported correlations with human judgements vary across evaluation contexts and prompt types, and it is currently hard to predict whether an LLM-as-judge metric will work equally well for new evaluation contexts and prompts unless human evaluations are also carried out for comparison. Addressing two main factors behind this uncertainty, model suitability and prompt engineering, this focused contribution tests four LLMs and different ways of combining them, in conjunction with a standard approach to prompt formulation, namely using written-for-human instructions verbatim. We meta-evaluate performance against human evaluations on two data-to-text tasks and eight evaluation measures, also comparing against more conventional LLM prompt formulations. We find that the best LLMs and LLM combinations are excellent predictors of mean human judgements, and are particularly good at content-related evaluation (in contrast to form-related criteria such as Fluency). Moreover, the best LLMs correlate far more strongly with human evaluations than individual human judges do, across all scenarios.
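
A minimal Python sketch may help make the setup the abstract describes concrete: the written-for-human evaluation instructions are used verbatim as the LLM prompt, the numeric ratings are collected, and they are meta-evaluated by correlating them with mean human judgements. All specifics below (the OpenAI-compatible API, model name, instruction wording, rating scale, and helper names) are illustrative assumptions, not the paper's actual code or prompts.

import re
from openai import OpenAI          # any OpenAI-compatible chat endpoint (assumption)
from scipy.stats import pearsonr

client = OpenAI()

# Placeholder: in the paper, the original for-human evaluation instructions
# are passed to the LLM verbatim; this wording is illustrative only.
HUMAN_INSTRUCTIONS = """You will see a data table and a text written from it.
Rate the text for Fluency on a scale from 1 (very poor) to 5 (excellent).
Reply with the number only."""

def llm_rating(table: str, text: str, model: str = "gpt-4o") -> float:
    """Ask the LLM to judge one (table, text) pair using the for-human instructions."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": HUMAN_INSTRUCTIONS},
            {"role": "user", "content": f"Data:\n{table}\n\nText:\n{text}"},
        ],
    )
    # Extract the first number in the reply as the rating.
    match = re.search(r"\d+(?:\.\d+)?", response.choices[0].message.content)
    return float(match.group()) if match else float("nan")

def correlation_with_humans(items, mean_human_scores):
    """Meta-evaluate: Pearson correlation between LLM ratings and mean human judgements."""
    llm_scores = [llm_rating(table, text) for table, text in items]
    return pearsonr(llm_scores, mean_human_scores)

The same correlation can be computed per evaluation measure (e.g. content-related vs. Fluency) and, for comparison, between individual human judges and the human mean, which is the kind of contrast the abstract reports.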
Anthology ID:
2025.trl-workshop.9
Volume:
Proceedings of the 4th Table Representation Learning Workshop
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Shuaichen Chang, Madelon Hulsebos, Qian Liu, Wenhu Chen, Huan Sun
Venues:
TRL | WS
Publisher:
Association for Computational Linguistics
Pages:
98–108
URL:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.trl-workshop.9/
Cite (ACL):
Rudali Huidrom and Anya Belz. 2025. Ask Me Like I’m Human: LLM-based Evaluation with For-Human Instructions Correlates Better with Human Evaluations than Human Judges. In Proceedings of the 4th Table Representation Learning Workshop, pages 98–108, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Ask Me Like I’m Human: LLM-based Evaluation with For-Human Instructions Correlates Better with Human Evaluations than Human Judges (Huidrom & Belz, TRL 2025)
PDF:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.trl-workshop.9.pdf