Beyond the Singular: Revealing the Value of Multiple Generations in Benchmark Evaluation

Wenbo Zhang; Hengrui Cai; Wenyu Chen


Abstract

Large language models (LLMs) have demonstrated significant utility in real-world applications, exhibiting impressive capabilities in natural language processing and understanding. Benchmark evaluations are crucial for assessing the capabilities of LLMs, as they can provide a comprehensive assessment of their strengths and weaknesses. However, current evaluation methods often overlook the inherent randomness of LLMs by employing deterministic generation strategies or relying on a single random sample, resulting in unaccounted sampling variance and unreliable benchmark score estimates. In this paper, we propose a hierarchical statistical model that provides a more comprehensive representation of the benchmarking process by incorporating both benchmark characteristics and LLM randomness. We show that leveraging multiple generations improves the accuracy of the estimated benchmark score and reduces its variance. Multiple generations also allow us to define ℙ(correct), a prompt-level difficulty score based on the ratio of correct generations, providing fine-grained insights into individual prompts. Additionally, we create a data map that visualizes the difficulty and semantics of prompts, enabling error detection and quality control in benchmark construction.
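
As a rough illustration of the quantities named in the abstract, the following minimal sketch computes a per-prompt correct ratio, a benchmark score, and a standard error from several generations per prompt. It is not the authors' implementation: the function name `summarize`, the `correct` matrix layout, and the toy data are hypothetical, and the standard error is a plain across-prompt estimate that assumes prompts are sampled independently.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(correct):
    """Summarize a benchmark run with several generations per prompt.

    correct[i][j] is 1 if generation j for prompt i was judged correct, else 0.
    (Hypothetical layout chosen for this sketch.)
    """
    # Per-prompt correct ratio: an empirical estimate of P(correct) for each prompt.
    p_hat = [sum(row) / len(row) for row in correct]
    # Benchmark score: the average of the per-prompt correct ratios.
    score = mean(p_hat)
    # Across-prompt standard error of the score (assumes independent prompts).
    se = stdev(p_hat) / sqrt(len(p_hat))
    return p_hat, score, se


# Toy data: 4 prompts, 4 generations each (made up for illustration).
correct = [
    [1, 1, 1, 1],  # always answered correctly
    [1, 0, 1, 0],  # answered correctly half the time
    [0, 0, 0, 0],  # never answered correctly
    [1, 1, 0, 1],
]
p_hat, score, se = summarize(correct)
print(p_hat, score, se)
```

With a single generation per prompt, each entry of `p_hat` can only be 0 or 1, so prompts of intermediate difficulty are indistinguishable from very easy or very hard ones; averaging over several generations is what makes the per-prompt ratio informative.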

Anthology ID: 2026.findings-acl.488
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 10033–10043
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.488/
Cite (ACL): Wenbo Zhang, Hengrui Cai, and Wenyu Chen. 2026. Beyond the Singular: Revealing the Value of Multiple Generations in Benchmark Evaluation. In Findings of the Association for Computational Linguistics: ACL 2026, pages 10033–10043, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Beyond the Singular: Revealing the Value of Multiple Generations in Benchmark Evaluation (Zhang et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.488.pdf
Checklist: 2026.findings-acl.488.checklist.pdf
