Mimi Xie
2026
Are LLM Benchmarks Already Contaminated? A Systematic Review of Contamination Detection Methods
Erfan Nourbakhsh | Mohammad Sadegh Sirjani | Amir Mousavi | Khoa Nguyen | John Quarles | Mimi Xie | Rocky Slavin
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Large Language Models (LLMs) are trained on web-scale corpora, increasing the risk that benchmark test data appears in training sets and inflates reported performance. We present a systematic literature review of 55 studies on LLM benchmark contamination through late 2025. Our contributions are: (1) a four-tier contamination taxonomy (Exact, Syntactic, Semantic, Task-Level; T1–T4); (2) a comparative analysis of five detection families (string-matching, likelihood-based, membership inference, LLM-prompted detection, and benchmark auditing), including access assumptions and failure modes; (3) a synthesis of contamination evidence on MMLU, GSM8K, HumanEval, and HellaSwag by measurement construct; (4) a comparative evaluation of mitigation strategies across lifecycle points, access assumptions, and evidence maturity; and (5) a Contamination Transparency Card (CTC) framework for future releases. Across studies, no detection method is consistently reliable across contamination tiers, model-access settings, and training stages. We identify instruction tuning as a persistent blind spot, note that RL/post-training contamination auditing is only beginning to mature, and report inflation estimates spanning roughly 6%–40% under benchmark- and setting-dependent assumptions.
2022
Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm
Shaoyi Huang | Dongkuan Xu | Ian Yen | Yijue Wang | Sung-En Chang | Bingbing Li | Shiyang Chen | Mimi Xie | Sanguthevar Rajasekaran | Hang Liu | Caiwen Ding
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Conventional wisdom in pruning Transformer-based language models is that pruning reduces the model expressiveness and thus is more likely to underfit rather than overfit. However, under the trending pretrain-and-finetune paradigm, we postulate a counter-traditional hypothesis, that is: pruning increases the risk of overfitting when performed at the fine-tuning phase. In this paper, we aim to address the overfitting problem and improve pruning performance via progressive knowledge distillation with error-bound properties. We show for the first time that reducing the risk of overfitting can help the effectiveness of pruning under the pretrain-and-finetune paradigm. Ablation studies and experiments on the GLUE benchmark show that our method outperforms the leading competitors across different tasks.