Max Obreiter
2026
Test of Time: Rethinking Temporal Signal of Benchmark Contamination
Terry Jingchen Zhang | Gopal Dev | Ning Wang | Max Obreiter | Wenyuan Jiang | Punya Syon Pandey | Keenan Samway | Yinya Huang | Bernhard Schölkopf | Mrinmaya Sachan | Zhijing Jin
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Post-cutoff performance decay has been widely interpreted as a temporal signal of benchmark contamination. We critically examine this belief and demonstrate that this temporal signal is highly sensitive to how benchmark questions are constructed. Specifically, we show that LLM-generated questions can produce remarkably different temporal patterns from fill-in-the-blank questions retrieved directly from the very same materials. We validate this finding on previous benchmarks that reported clear post-cutoff performance decay, such as LiveCodeBench, and further show that a simple LLM transformation can effectively remove this temporal pattern when evaluated on the same models. We also provide a mechanistic understanding of our observation using influence function analysis. Overall, this work offers a new perspective on the sensitivity of the temporal contamination signal and highlights the need for more robust contamination detection methods for reliable AI evaluation.