Davood Wadi




2025

A Monte-Carlo Sampling Framework For Reliable Evaluation of Large Language Models Using Behavioral Analysis
Davood Wadi | Marc Fredette
Findings of the Association for Computational Linguistics: EMNLP 2025

Scientific evaluation of Large Language Models is an important topic that quantifies any degree of progress we make with new models. Even though current LLMs show a high level of accuracy on benchmark datasets, the single-sample approach to evaluating them is not sufficient, as it ignores the high entropy of LLM responses. We introduce a Monte-Carlo evaluation framework for LLMs that follows behavioral science methodologies and provides statistical guarantees for performance estimates. We test our framework on multiple LLMs to determine whether they are susceptible to cognitive biases. We find a significant effect of prompts that induce cognitive biases in LLMs, raising questions about their reliability in the social sciences and business. We also observe higher susceptibility to cognitive biases in newer and larger LLMs, which points to a development towards more human-like and less rational LLM responses. We conclude by calling for the use of Monte-Carlo sampling, as opposed to pass@1, for broader LLM evaluation.
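
The abstract contrasts single-draw pass@1 scoring with repeated (Monte-Carlo) sampling of model responses that yields statistical guarantees on performance estimates. The sketch below illustrates that general idea only; the query_model stand-in, the sample size, and the normal-approximation confidence interval are assumptions for illustration, not the paper's actual framework.

```python
import math
import random


def query_model(prompt: str) -> bool:
    """Hypothetical stand-in for one stochastic LLM call.

    Returns True if the sampled response is judged correct.
    In practice this would call an LLM API with temperature > 0.
    """
    return random.random() < 0.7  # placeholder success rate


def monte_carlo_accuracy(prompt: str, n_samples: int = 200, z: float = 1.96):
    """Estimate accuracy on one prompt from repeated samples.

    Returns the mean accuracy plus a normal-approximation 95% CI,
    instead of a single-sample pass@1 score.
    """
    successes = sum(query_model(prompt) for _ in range(n_samples))
    p_hat = successes / n_samples
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n_samples)
    return p_hat, (max(0.0, p_hat - half_width), min(1.0, p_hat + half_width))


if __name__ == "__main__":
    acc, ci = monte_carlo_accuracy("Is 17 a prime number? Answer yes or no.")
    print(f"estimated accuracy {acc:.3f}, 95% CI [{ci[0]:.3f}, {ci[1]:.3f}]")
```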