Elo Uncovered: Robustness and Best Practices in Language Model Evaluation
Meriem Boubdir, Edward Kim, Beyza Ermis, Sara Hooker, Marzieh Fadaee
Abstract
In Natural Language Processing (NLP), the Elo rating system, well-established for ranking dynamic competitors in games like chess, has seen increasing adoption for evaluating Large Language Models (LLMs) through “A vs B” paired comparisons. However, while popular, the system’s suitability for assessing entities with constant skill levels, such as LLMs, remains relatively unexplored. Our study investigates the sensitivity and reproducibility of Elo scores for LLMs, integrating both synthetic and human feedback. We show that Elo ratings for LLMs stabilize with 100 or more comparison permutations. A lower K-factor is preferable for closely matched models, whereas a higher K-factor better distinguishes models with clear performance differences. We also report that transitivity (A B and B C implies A C) does not consistently hold, particularly when models demonstrate similar performance. Our empirical findings provide guidelines for more reliable LLM evaluation.- Anthology ID:
- 2023.gem-1.28
- Volume:
- Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)
- Month:
- December
- Year:
- 2023
- Address:
- Singapore
- Editors:
- Sebastian Gehrmann, Alex Wang, João Sedoc, Elizabeth Clark, Kaustubh Dhole, Khyathi Raghavi Chandu, Enrico Santus, Hooman Sedghamiz
- Venues:
- GEM | WS
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 339–352
- Language:
- URL:
- https://aclanthology.org/2023.gem-1.28
- DOI:
- Cite (ACL):
- Meriem Boubdir, Edward Kim, Beyza Ermis, Sara Hooker, and Marzieh Fadaee. 2023. Elo Uncovered: Robustness and Best Practices in Language Model Evaluation. In Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pages 339–352, Singapore. Association for Computational Linguistics.
- Cite (Informal):
- Elo Uncovered: Robustness and Best Practices in Language Model Evaluation (Boubdir et al., GEM-WS 2023)
- PDF:
- https://preview.aclanthology.org/proper-vol2-ingestion/2023.gem-1.28.pdf