Test of Time: Rethinking Temporal Signal of Benchmark Contamination

Terry Jingchen Zhang, Gopal Dev, Ning Wang, Max Obreiter, Punya Syon Pandey, Keenan Samway, Wenyuan Jiang, Yinya Huang, Bernhard Schölkopf, Mrinmaya Sachan, Zhijing Jin


Abstract
Post-cutoff performance decay of LLMs has been widely interpreted as a temporal signal for benchmark contamination, where public information released before the training cutoff may have been included in training corpora and inflated model performance through memorization. We critically examine this view and demonstrate that this temporal signal is highly sensitive to how benchmark questions are constructed, even when the underlying source material remains invariant. Specifically, we show that LLM-transformed questions can produce remarkably different temporal patterns compared to fill-in-the-blank (cloze) questions directly retrieved from the very same documents. We validate this effect on a prior benchmark that reports clear post-cutoff decay (LiveCodeBench), and show that a simple LLM-driven transformation of the same problems can effectively remove the temporal pattern. We further provide a mechanistic understanding of this phenomenon using influence function analysis. Overall, our results suggest that post-cutoff performance decay is a contamination signal that is sensitive to question construction, motivating more robust contamination probes for reliable LLM evaluation.
Anthology ID:
2026.acl-long.1693
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
36538–36555
URL:
https://preview.aclanthology.org/corrections-2026-06/2026.acl-long.1693/
DOI:
10.18653/v1/2026.acl-long.1693
Cite (ACL):
Terry Jingchen Zhang, Gopal Dev, Ning Wang, Max Obreiter, Punya Syon Pandey, Keenan Samway, Wenyuan Jiang, Yinya Huang, Bernhard Schölkopf, Mrinmaya Sachan, and Zhijing Jin. 2026. Test of Time: Rethinking Temporal Signal of Benchmark Contamination. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 36538–36555, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Test of Time: Rethinking Temporal Signal of Benchmark Contamination (Zhang et al., ACL 2026)
PDF:
https://preview.aclanthology.org/corrections-2026-06/2026.acl-long.1693.pdf
Checklist:
 2026.acl-long.1693.checklist.pdf