Erfan Nourbakhsh
2026
Are LLM Benchmarks Already Contaminated? A Systematic Review of Contamination Detection Methods
Erfan Nourbakhsh | Mohammad Sadegh Sirjani | Amir Mousavi | Khoa Nguyen | John Quarles | Mimi Xie | Rocky Slavin
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Large Language Models (LLMs) are trained on web-scale corpora, increasing the risk that benchmark test data appears in training sets and inflates reported performance. We present a systematic literature review of 55 studies on LLM benchmark contamination through late 2025. Our contributions are: (1) a four-tier contamination taxonomy (Exact, Syntactic, Semantic, Task-Level; T1–T4); (2) a comparative analysis of five detection families (string-matching, likelihood-based, membership inference, LLM-prompted detection, and benchmark auditing), including access assumptions and failure modes; (3) a synthesis of contamination evidence on MMLU, GSM8K, HumanEval, and HellaSwag by measurement construct; (4) a comparative evaluation of mitigation strategies across lifecycle points, access assumptions, and evidence maturity; and (5) a Contamination Transparency Card (CTC) framework for future releases. Across studies, no detection method is consistently reliable across contamination tiers, model-access settings, and training stages. We identify instruction tuning as a persistent blind spot, note that RL/post-training contamination auditing is only beginning to mature, and report inflation estimates spanning roughly 6%–40% under benchmark- and setting-dependent assumptions.
When Retrieval Doesn’t Help: A Large-Scale Study of Biomedical RAG
Erfan Nourbakhsh | Rocky Slavin | Ke Yang | Anthony Rios
BioNLP 2026
Medical question answering is a high-stakes setting where factual errors can have serious consequences. Retrieval-augmented generation (RAG) is widely viewed as a promising solution, and prior work has reported substantial gains for large medical QA models. We revisit this assumption across a broad range of open-weight instruction-tuned models spanning 7B to 72B parameters. Across five models, ten biomedical QA datasets, four retrieval methods, and four retrieval corpora, we find that retrieval yields only small and inconsistent improvements over a no-retrieval baseline, typically within 1–2 points. In contrast, the choice of backbone model has a much larger effect than the choice of retriever or corpus, and expert and layman retrieval sources perform similarly in most settings. These results suggest that the main bottleneck is not retrieval quality alone, but the model’s limited ability to use retrieved evidence effectively.