CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists

Yukyung Lee; JoongHoon Kim; Jaehee Kim; Hyowon Cho; Jaewook Kang; Pilsung Kang; Najoung Kim

Abstract

Existing LLM-as-a-Judge approaches for evaluating text generation suffer from rating inconsistencies, with low agreement and high rating variance across different evaluator models. We attribute this to subjective evaluation criteria combined with Likert scale scoring in existing protocols. To address this issue, we introduce CheckEval, a checklist-based evaluation framework that improves rating reliability via decomposed binary questions. Through experiments with 12 evaluator models across multiple datasets, we first demonstrate that CheckEval strongly correlates with human judgments. More importantly, CheckEval improves the average agreement across evaluator models by 0.45 and reduces the score variance. CheckEval scores are also more interpretable, because the framework decomposes evaluation criteria into traceable binary decisions, allowing analysis of the specific attributes that drive quality judgments.
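To make the idea of decomposed binary questions concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that a checklist score is the fraction of questions answered "yes" and that agreement is the average share of questions on which two evaluator models give the same answer. The abstract specifies neither the aggregation rule nor the agreement metric, so both are illustrative stand-ins, and the function names, model names, and answers are hypothetical.

```python
from itertools import combinations


def checklist_score(answers):
    """Fraction of binary checklist questions answered 'yes' (True).

    Assumption: simple averaging; the abstract does not state the aggregation.
    """
    if not answers:
        raise ValueError("checklist must contain at least one answer")
    return sum(answers) / len(answers)


def mean_pairwise_agreement(answers_by_model):
    """Average share of questions on which two evaluator models agree.

    Assumption: an illustrative agreement measure, not the paper's metric.
    """
    pairs = list(combinations(answers_by_model.values(), 2))
    if not pairs:
        raise ValueError("need at least two evaluator models")
    per_pair = [
        sum(a == b for a, b in zip(x, y)) / len(x)
        for x, y in pairs
    ]
    return sum(per_pair) / len(per_pair)


# Hypothetical answers from three evaluator models to four binary questions
answers_by_model = {
    "model_a": [True, True, False, True],
    "model_b": [True, True, False, False],
    "model_c": [True, False, False, True],
}

scores = {name: checklist_score(a) for name, a in answers_by_model.items()}
agreement = mean_pairwise_agreement(answers_by_model)
```

Because every question is answered separately, a low score can be traced back to the specific questions that received "no", which is the interpretability benefit the abstract describes.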

Anthology ID: 2025.emnlp-main.796
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 15782–15809
URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.796/
Cite (ACL): Yukyung Lee, JoongHoon Kim, Jaehee Kim, Hyowon Cho, Jaewook Kang, Pilsung Kang, and Najoung Kim. 2025. CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 15782–15809, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists (Lee et al., EMNLP 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.796.pdf
Checklist: 2025.emnlp-main.796.checklist.pdf
