HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation

Mingxuan Li, Hanchen Li, Chenhao Tan


Abstract
Large language models (LLMs) have demonstrated great potential for automating the evaluation of natural language generation. Previous LLM-as-a-judge frameworks fall short in two ways: they either use a zero-shot setting without consulting any human input, which leads to low alignment, or fine-tune LLMs on labeled data, which requires a non-trivial number of samples. Moreover, previous methods often provide little reasoning behind automated evaluations. In this paper, we propose HypoEval, a Hypothesis-guided Evaluation framework, which first uses a small corpus of human evaluations to generate more detailed rubrics for human judgments and then incorporates a checklist-like approach that combines the LLM's assigned scores on each decomposed dimension into an overall score. With only 30 human evaluations, HypoEval achieves state-of-the-art performance in alignment with both human rankings (Spearman correlation) and human scores (Pearson correlation), on average outperforming G-Eval by 11.86% and a Llama-3.1-8B-Instruct model fine-tuned on at least 3 times more human evaluations by 11.95%. Furthermore, we conduct systematic studies to assess the robustness of HypoEval, highlighting its effectiveness as a reliable and interpretable automated evaluation framework.
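To make the two alignment measures in the abstract concrete, the following is a minimal sketch of how per-dimension scores might be combined and then compared with human judgments. It is not the authors' implementation: the abstract does not state how dimension scores are aggregated, so the simple mean used here is an assumption, and the dimension names, scores, and function names are all hypothetical.

```python
from math import sqrt
from statistics import mean


def combine_dimension_scores(dimension_scores):
    """Average one text's per-dimension scores (assumed aggregation rule)."""
    return mean(dimension_scores.values())


def pearson(xs, ys):
    """Pearson correlation: agreement between raw score values."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical LLM scores on three made-up dimensions for four texts.
llm_dimension_scores = [
    {"coherence": 4, "faithfulness": 5, "fluency": 4},
    {"coherence": 2, "faithfulness": 3, "fluency": 3},
    {"coherence": 3, "faithfulness": 2, "fluency": 4},
    {"coherence": 5, "faithfulness": 5, "fluency": 5},
]
# Hypothetical human overall scores for the same four texts.
human_scores = [4.0, 2.5, 3.5, 5.0]

overall = [combine_dimension_scores(d) for d in llm_dimension_scores]
print("Spearman:", spearman(overall, human_scores))
print("Pearson:", pearson(overall, human_scores))
```

Spearman correlation depends only on the ordering of the texts, so it measures agreement with human rankings, while Pearson correlation is sensitive to the score values themselves.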
Anthology ID:
2026.acl-long.1963
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
42424–42443
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1963/
Cite (ACL):
Mingxuan Li, Hanchen Li, and Chenhao Tan. 2026. HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 42424–42443, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation (Li et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1963.pdf
Checklist:
2026.acl-long.1963.checklist.pdf