Ranking Unraveled: Recipes for LLM Rankings in Head-to-Head AI Combat
Roland Daynauth, Christopher Clarke, Krisztian Flautner, Lingjia Tang, Jason Mars
Abstract
Evaluating large language model (LLM) is a complex task. Pairwise ranking has emerged as state-of-the-art method to evaluate human preferences by having humans compare pairs of LLM outputs based on predefined criteria, enabling ranking across multiple LLMs by aggregating pairwise results through algorithms like Elo. However, applying these ranking algorithms in the context of LLM evaluation introduces several challenges, such as inconsistent ranking results when using ELO. Currently there is a lack of systematic study of those ranking algorithms in evaluating LLMs. In this paper, we explore the effectiveness of ranking systems for head-to-head comparisons of LLMs. We formally define a set of fundamental principles for effective ranking and conduct extensive evaluations on the robustness of several ranking algorithms in the context of LLMs. Our analysis uncovers key insights into the factors that affect ranking accuracy and efficiency, offering guidelines for selecting the most appropriate methods based on specific evaluation contexts and resource constraints.- Anthology ID:
- 2025.acl-long.1265
- Volume:
- Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- July
- Year:
- 2025
- Address:
- Vienna, Austria
- Editors:
- Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
- Venue:
- ACL
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 26078–26091
- Language:
- URL:
- https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1265/
- DOI:
- Cite (ACL):
- Roland Daynauth, Christopher Clarke, Krisztian Flautner, Lingjia Tang, and Jason Mars. 2025. Ranking Unraveled: Recipes for LLM Rankings in Head-to-Head AI Combat. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26078–26091, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal):
- Ranking Unraveled: Recipes for LLM Rankings in Head-to-Head AI Combat (Daynauth et al., ACL 2025)
- PDF:
- https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1265.pdf