Jiamu Zhang
2025
ReasonerRank: Redefining Language Model Evaluation with Ground-Truth-Free Ranking Frameworks
Jiamu Zhang | Jiayi Yuan | Andrew Wen | Hoang Anh Duy Le | Yu-Neng Chuang | Soo-Hyun Choi | Rui Chen | Xia Hu
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Models (LLMs) are increasingly adopted across real-world applications, yet traditional evaluations rely on expensive, domain-specific ground-truth labels that are often unavailable or infeasible to obtain. We introduce a ground-truth-free evaluation framework focused on reasoning consistency and instruction following, shifting the emphasis from correctness, which is elusive without labels, to transparent, coherent, evidence-based reasoning. Each model response must include a direct answer, a structured multi-step explanation, and supporting evidence, all assessed via semantic similarity and output adherence checks. We further propose TopK-ReRank, which refines rankings by constructing a consensus answer from the most reliable models, reducing ambiguity across diverse reasoning styles. Experiments show that our framework outperforms existing label-free methods, including majority voting, triplet ranking, and peer-review approaches, providing a more interpretable and efficient alternative for evaluating LLMs in the absence of ground-truth labels.
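The abstract does not spell out the scoring details, so the following is only a minimal sketch of the general idea behind consensus-based re-ranking, not the authors' implementation. It assumes a bag-of-words cosine similarity as a crude stand-in for the semantic similarity measure, an initial score equal to each model's mean agreement with its peers, and a consensus formed by simply concatenating the top-k responses. The names `cosine`, `initial_scores`, and `topk_rerank` are hypothetical.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (assumed stand-in for a semantic model)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def initial_scores(responses: dict[str, str]) -> dict[str, float]:
    """Score each model by its mean similarity to every other model's response."""
    scores = {}
    for name, text in responses.items():
        sims = [cosine(text, other) for n, other in responses.items() if n != name]
        scores[name] = sum(sims) / len(sims) if sims else 0.0
    return scores


def topk_rerank(responses: dict[str, str], k: int = 2) -> list[str]:
    """Re-rank all models against a consensus built from the k highest-scoring ones."""
    scores = initial_scores(responses)
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    # Assumption: the consensus is the concatenation of the top-k responses.
    consensus = " ".join(responses[name] for name in top)
    final = {name: cosine(text, consensus) for name, text in responses.items()}
    return sorted(final, key=final.get, reverse=True)


if __name__ == "__main__":
    demo = {
        "model_a": "paris is the capital of france",
        "model_b": "the capital of france is paris",
        "model_c": "bananas are yellow fruit",
    }
    print(topk_rerank(demo, k=2))
```

In this toy setup, a model whose response shares nothing with the consensus of its peers falls to the bottom of the ranking without any ground-truth label being consulted. A real system would replace the bag-of-words similarity with an embedding-based one and would also check output adherence, which this sketch omits.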