@inproceedings{wadi-fredette-2025-monte,
title = "A {M}onte-{C}arlo Sampling Framework For Reliable Evaluation of Large Language Models Using Behavioral Analysis",
author = "Wadi, Davood and
Fredette, Marc",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.500/",
doi = "10.18653/v1/2025.findings-emnlp.500",
pages = "9414--9432",
ISBN = "979-8-89176-335-7",
    abstract = "Scientific evaluation of Large Language Models is an important topic that quantifies any degree of progress we make with new models. Even though current LLMs show a high level of accuracy on benchmark datasets, the single-sample approach to evaluating them is not sufficient, as it ignores the high entropy of LLM responses. We introduce a Monte-Carlo framework for evaluating LLMs that follows behavioral science methodologies and provides statistical guarantees for estimates of performance. We test our framework on multiple LLMs to see if they are susceptible to cognitive biases. We find a significant effect of prompts that induce cognitive biases in LLMs, raising questions about their reliability in the social sciences and business. We also see higher susceptibility of newer and larger LLMs to cognitive biases, which indicates a development towards more human-like and less rational LLM responses. We conclude by calling for the use of Monte-Carlo sampling, as opposed to pass@1, for broader LLM evaluation."
}
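
The abstract's contrast between Monte-Carlo sampling and pass@1 amounts to scoring many sampled responses per prompt and reporting an interval estimate, rather than scoring a single response. A minimal sketch of that idea follows; `query_model`, `is_correct`, the sample size, and the normal-approximation confidence interval are illustrative assumptions, not the paper's actual protocol.

```python
"""Minimal sketch: Monte-Carlo LLM evaluation vs. single-sample pass@1.

`query_model` and `is_correct` are hypothetical stand-ins for an LLM
API call and a task-specific grader; the statistical recipe (repeated
sampling plus a normal-approximation confidence interval) only
illustrates the general idea described in the abstract.
"""
import math
import random


def query_model(prompt: str, temperature: float = 1.0) -> str:
    # Hypothetical stand-in for a stochastic LLM API call.
    return random.choice(["answer A", "answer B"])


def is_correct(response: str) -> bool:
    # Hypothetical grader; replace with a task-specific check.
    return response == "answer A"


def monte_carlo_eval(prompt: str, n_samples: int = 200, z: float = 1.96):
    """Estimate accuracy on one prompt by repeated sampling.

    Returns the mean accuracy and a 95% normal-approximation CI.
    pass@1, by contrast, would score a single sampled response.
    """
    scores = [is_correct(query_model(prompt)) for _ in range(n_samples)]
    p_hat = sum(scores) / n_samples
    se = math.sqrt(p_hat * (1 - p_hat) / n_samples)
    return p_hat, (max(0.0, p_hat - z * se), min(1.0, p_hat + z * se))


if __name__ == "__main__":
    acc, (lo, hi) = monte_carlo_eval("Solve: 2 + 2 = ?")
    print(f"accuracy = {acc:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```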