ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments
Gili Lior, Eliya Habba, Shahar Levy, Avi Caciularu, Gabriel Stanovsky
Abstract
LLMs are highly sensitive to prompt phrasing, yet standard benchmarks typically report performance using a single prompt, raising concerns about the reliability of such evaluations. In this work, we argue for a stochastic method of moments evaluation over the space of meaning-preserving prompt perturbations. We introduce a formal definition of *reliable evaluation* that accounts for prompt sensitivity, and suggest ReliableEval - a method for estimating the number of prompt resamplings needed to obtain meaningful results. Using our framework, we stochastically evaluate five frontier LLMs and find that even top-performing models like GPT-4o and Claude-3.7-Sonnet exhibit substantial prompt sensitivity. Our approach is model-, task-, and metric-agnostic, offering a recipe for meaningful and robust LLM evaluation.- Anthology ID:
- 2025.findings-emnlp.594
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2025
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- Findings
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 11146–11153
- Language:
- URL:
- https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.594/
- DOI:
- 10.18653/v1/2025.findings-emnlp.594
- Cite (ACL):
- Gili Lior, Eliya Habba, Shahar Levy, Avi Caciularu, and Gabriel Stanovsky. 2025. ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 11146–11153, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments (Lior et al., Findings 2025)
- PDF:
- https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.594.pdf