Model Consistency as a Cheap yet Predictive Proxy for LLM Elo Scores

Ashwin Ramaswamy, Nestor Demeure, Ermal Rrapaj


Abstract
New large language models (LLMs) are being released every day. Some perform significantly better or worse than expected given their parameter count. Therefore, there is a need for a method to independently evaluate models. The current best way to evaluate a model is to measure its Elo score by comparing it to other models in a series of contests—an expensive operation since humans are ideally required to compare LLM outputs. We observe that when an LLM is asked to judge such contests, the consistency with which it selects a model as the best in a matchup produces a metric that is 91% correlated with its own human-produced Elo score. This provides a simple proxy for Elo scores that can be computed cheaply, without any human data or prior knowledge.
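As a rough illustration of the idea in the abstract, the sketch below computes one possible consistency metric for a judge model. The abstract does not give the exact definition, so everything here is an assumption: the function names, the data layout, and the choice to measure consistency as the share of repeated verdicts that agree with the most common verdict for each matchup, averaged over matchups. The paper's actual metric may differ.

```python
from collections import Counter


def matchup_consistency(verdicts):
    """Share of verdicts that agree with the most common verdict for one matchup.

    `verdicts` is a list of winner labels returned by the same judge model
    when it is asked about the same matchup several times (assumed setup).
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    top_count = counts.most_common(1)[0][1]
    return top_count / len(verdicts)


def judge_consistency(verdicts_by_matchup):
    """Mean per-matchup consistency over all matchups judged by one model."""
    if not verdicts_by_matchup:
        raise ValueError("need at least one matchup")
    scores = [matchup_consistency(v) for v in verdicts_by_matchup.values()]
    return sum(scores) / len(scores)


# Hypothetical data: two matchups, each judged four times by the same model.
verdicts = {
    ("model_a", "model_b"): ["model_a", "model_a", "model_a", "model_b"],
    ("model_a", "model_c"): ["model_c", "model_c", "model_c", "model_c"],
}
score = judge_consistency(verdicts)  # (0.75 + 1.0) / 2 = 0.875
```

Under this assumed definition, a judge that always picks the same winner in a given matchup scores 1.0, and a judge that splits its verdicts evenly between two contestants scores 0.5. No human labels are needed to compute it, which is what makes such a metric cheap.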
Anthology ID:
2025.emnlp-main.1534
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
30155–30163
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.emnlp-main.1534/
DOI:
10.18653/v1/2025.emnlp-main.1534
Cite (ACL):
Ashwin Ramaswamy, Nestor Demeure, and Ermal Rrapaj. 2025. Model Consistency as a Cheap yet Predictive Proxy for LLM Elo Scores. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 30155–30163, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Model Consistency as a Cheap yet Predictive Proxy for LLM Elo Scores (Ramaswamy et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.emnlp-main.1534.pdf
Checklist:
 2025.emnlp-main.1534.checklist.pdf