ZeroSumEval: An Extensible Framework For Scaling LLM Evaluation with Inter-Model Competition
Hisham Abdullah Alyahya, Haidar Khan, Yazeed Alnumay, M Saiful Bari, Bulent Yener
Abstract
We introduce ZeroSumEval, a dynamic, competition-based, and evolving evaluation framework for Large Language Models (LLMs) that leverages competitive games. ZeroSumEval encompasses a diverse suite of games, including security challenges (Capture the Flag), classic board games (chess), and knowledge tests (MathQuiz). These games are designed to evaluate a range of capabilities such as strategic reasoning, planning, knowledge application, safety, and adaptability. Building upon recent studies that highlight the effectiveness of game-based evaluations for LLMs, ZeroSumEval enhances these approaches by providing a standardized and extensible framework for easily implementing games and by leveraging DSPy to provide a better abstraction for LLM player strategies.
- Anthology ID:
- 2025.acl-demo.33
- Volume:
- Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
- Month:
- July
- Year:
- 2025
- Address:
- Vienna, Austria
- Editors:
- Pushkar Mishra, Smaranda Muresan, Tao Yu
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 340–350
- URL:
- https://preview.aclanthology.org/ingestion-acl-25/2025.acl-demo.33/
- Cite (ACL):
- Hisham Abdullah Alyahya, Haidar Khan, Yazeed Alnumay, M Saiful Bari, and Bulent Yener. 2025. ZeroSumEval: An Extensible Framework For Scaling LLM Evaluation with Inter-Model Competition. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 340–350, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal):
- ZeroSumEval: An Extensible Framework For Scaling LLM Evaluation with Inter-Model Competition (Alyahya et al., ACL 2025)
- PDF:
- https://preview.aclanthology.org/ingestion-acl-25/2025.acl-demo.33.pdf