Shahar Levy
2025
ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments
Gili Lior | Eliya Habba | Shahar Levy | Avi Caciularu | Gabriel Stanovsky
Findings of the Association for Computational Linguistics: EMNLP 2025
LLMs are highly sensitive to prompt phrasing, yet standard benchmarks typically report performance using a single prompt, raising concerns about the reliability of such evaluations. In this work, we argue for a stochastic method of moments evaluation over the space of meaning-preserving prompt perturbations. We introduce a formal definition of *reliable evaluation* that accounts for prompt sensitivity, and suggest ReliableEval - a method for estimating the number of prompt resamplings needed to obtain meaningful results. Using our framework, we stochastically evaluate five frontier LLMs and find that even top-performing models like GPT-4o and Claude-3.7-Sonnet exhibit substantial prompt sensitivity. Our approach is model-, task-, and metric-agnostic, offering a recipe for meaningful and robust LLM evaluation.
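The abstract's core computation is estimating how many meaning-preserving prompt resamplings are needed before a benchmark score stabilizes. Below is a minimal sketch of that idea, not the paper's actual procedure: the `evaluate` callable, the pilot size, and the moment-based stopping rule are all illustrative assumptions.

```python
import random
import statistics

def required_resamplings(evaluate, prompts, pilot_n=10, epsilon=0.01, z=1.96):
    """Estimate how many prompt resamplings are needed so the z-level
    confidence half-width of the mean score falls below `epsilon`.

    `evaluate` maps one prompt phrasing to a score in [0, 1]; `prompts`
    is a pool of meaning-preserving perturbations. All names here are
    hypothetical, not the ReliableEval API.
    """
    # Pilot phase: score a small random sample of perturbations.
    scores = [evaluate(p) for p in random.sample(prompts, pilot_n)]

    # Method-of-moments estimates of the first two moments.
    mean = statistics.mean(scores)
    var = statistics.variance(scores)  # unbiased sample variance

    # Half-width z * sigma / sqrt(n) <= epsilon  =>  n >= (z * sigma / epsilon)^2.
    n = max(pilot_n, int((z ** 2 * var) / (epsilon ** 2)) + 1)
    return mean, n
```

Under this kind of rule, larger estimated score variance across paraphrases directly inflates the number of resamplings required, which is the sense in which even top-performing models can be prompt sensitive.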
More Documents, Same Length: Isolating the Challenge of Multiple Documents in RAG
Shahar Levy | Nir Mazor | Lihi Shalmon | Michael Hassid | Gabriel Stanovsky
Findings of the Association for Computational Linguistics: EMNLP 2025
Retrieval-Augmented Generation (RAG) enhances the accuracy of Large Language Model (LLM) responses by leveraging relevant external documents during generation. Although previous studies noted that retrieving many documents can degrade performance, they did not isolate how the quantity of documents affects performance while controlling for context length. We evaluate various language models on custom datasets derived from a multi-hop QA task. We keep the context length and position of relevant information constant while varying the number of documents, and find that increasing the document count in RAG settings poses significant challenges for most LLMs, reducing performance by up to 20%. However, Qwen2 maintained consistent results across increasing document counts, indicating better multi-document handling capability. Finally, our results indicate that processing multiple documents is a separate challenge from handling long contexts. We will publicly release the datasets and code upon publication to facilitate further research in multi-document retrieval.
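The experimental control described here, fixed context length while document count varies, can be sketched as follows. The function name, the equal-share trimming policy, and the whitespace `tokenize` callable are assumptions for illustration, not the released dataset-construction code.

```python
def build_context(relevant_docs, distractor_docs, num_docs, budget_tokens,
                  tokenize=str.split):
    """Assemble a multi-document context with exactly `num_docs` documents
    while holding total length near `budget_tokens` tokens, so that any
    performance change is attributable to document count, not length."""
    assert num_docs >= len(relevant_docs)
    docs = list(relevant_docs)
    # Fill the remaining slots with retrieved-but-irrelevant distractors,
    # keeping the relevant documents in fixed positions at the front.
    docs += distractor_docs[: num_docs - len(relevant_docs)]

    # Trim each document to an equal share of the token budget so the
    # total context length stays constant as `num_docs` grows.
    per_doc = budget_tokens // num_docs
    return "\n\n".join(" ".join(tokenize(d)[:per_doc]) for d in docs)
```

Sweeping `num_docs` with everything else held fixed is what allows the multi-document challenge to be separated from the long-context one.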
2021
Collecting a Large-Scale Gender Bias Dataset for Coreference Resolution and Machine Translation
Shahar Levy | Koren Lazar | Gabriel Stanovsky
Findings of the Association for Computational Linguistics: EMNLP 2021
Recent works have found evidence of gender bias in models of machine translation and coreference resolution using mostly synthetic diagnostic datasets. While these quantify bias in a controlled experiment, they often do so on a small scale and consist mostly of artificial, out-of-distribution sentences. In this work, we find grammatical patterns indicating stereotypical and non-stereotypical gender-role assignments (e.g., female nurses versus male dancers) in corpora from three domains, resulting in the first large-scale gender bias dataset of 108K diverse real-world English sentences. We manually verify the quality of our corpus and use it to evaluate gender bias in various coreference resolution and machine translation models. We find that all tested models tend to over-rely on gender stereotypes when presented with natural inputs, which may be especially harmful when deployed in commercial systems. Finally, we show that our dataset lends itself to finetuning a coreference resolution model, finding that it mitigates bias on a held-out set. Our dataset and models are publicly available at github.com/SLAB-NLP/BUG. We hope they will spur future research into gender bias evaluation and mitigation techniques in realistic settings.
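As a rough illustration of this kind of pattern-based corpus mining, the sketch below matches a profession noun later referred to by a gendered pronoun in the same sentence. The profession list and the regex are toy assumptions; the actual BUG patterns are syntactic and far more extensive.

```python
import re

# Toy pattern: a profession noun followed by a gendered pronoun, e.g.
# "The nurse said that she was tired." (stereotypical) versus
# "The nurse said that he was tired." (non-stereotypical).
PROFESSIONS = r"(nurse|doctor|dancer|engineer|teacher)"
PATTERN = re.compile(rf"\bthe {PROFESSIONS}\b[^.]*?\b(he|she)\b", re.IGNORECASE)

def find_gender_role_sentences(sentences):
    """Yield (sentence, profession, pronoun) for every matching sentence."""
    for sent in sentences:
        match = PATTERN.search(sent)
        if match:
            yield sent, match.group(1).lower(), match.group(2).lower()
```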
Co-authors
- Gabriel Stanovsky 3
- Avi Caciularu 1
- Eliya Habba 1
- Michael Hassid 1
- Koren Lazar 1