Gap-K%: Measuring Top-1 Prediction Gap for Detecting Pretraining Data

Minseo Kwak, Jaehyung Kim


Abstract
The opacity of massive pretraining corpora in Large Language Models (LLMs) raises significant privacy and copyright concerns, making pretraining data detection a critical challenge. Existing state-of-the-art methods typically rely on token likelihoods, yet they often overlook the gap between the target token and the model’s top-1 prediction, as well as local correlations between adjacent tokens. In this work, we propose Gap-K%, a novel pretraining data detection method grounded in the optimization dynamics of LLM pretraining. By analyzing the next-token prediction objective, we observe that discrepancies between the model’s top-1 prediction and the target token induce strong gradient signals, which are explicitly penalized during training. Motivated by this, Gap-K% leverages the log probability gap between the top-1 predicted token and the target token, incorporating a sliding window strategy to capture local correlations and mitigate token-level fluctuations. Extensive experiments on the WikiMIA and MIMIR benchmarks demonstrate that Gap-K% achieves state-of-the-art performance, consistently outperforming prior baselines across various model sizes and input lengths.
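The scoring idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract specifies only the top-1 versus target log probability gap and a sliding window, so the window size, the use of a plain moving average, the choice to average the largest K% of smoothed gaps, and the sign convention are all assumptions made here for illustration. The function names (`token_gaps`, `sliding_mean`, `gap_k_score`) are hypothetical, and the per-position logits are assumed to come from any causal language model.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def token_gaps(logits_per_pos, target_ids):
    """Per-token gap: log p(top-1 token) - log p(target token), always >= 0."""
    gaps = []
    for logits, target in zip(logits_per_pos, target_ids):
        log_probs = log_softmax(logits)
        gaps.append(max(log_probs) - log_probs[target])
    return gaps

def sliding_mean(values, window):
    """Moving average over adjacent tokens (assumed smoothing scheme)."""
    if window <= 1 or len(values) < window:
        return list(values)
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def gap_k_score(logits_per_pos, target_ids, k=0.2, window=2):
    """Hypothetical aggregation: negated mean of the largest k fraction of
    smoothed gaps. Under this assumed convention, a higher score suggests
    the text is more likely to have been seen during pretraining."""
    smoothed = sliding_mean(token_gaps(logits_per_pos, target_ids), window)
    n = max(1, int(len(smoothed) * k))
    worst = sorted(smoothed, reverse=True)[:n]
    return -sum(worst) / n

# Toy example: three positions over a vocabulary of three tokens.
toy_logits = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]
score_match = gap_k_score(toy_logits, [0, 1, 2])     # targets are the top-1 tokens
score_mismatch = gap_k_score(toy_logits, [1, 2, 0])  # targets are never top-1
```

In the toy example, the sequence whose targets coincide with the model's top-1 predictions has zero gap everywhere and therefore receives the higher score, while the mismatched sequence is penalized by its large gaps.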
Anthology ID:
2026.acl-long.1072
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
23391–23405
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1072/
Cite (ACL):
Minseo Kwak and Jaehyung Kim. 2026. Gap-K%: Measuring Top-1 Prediction Gap for Detecting Pretraining Data. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23391–23405, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Gap-K%: Measuring Top-1 Prediction Gap for Detecting Pretraining Data (Kwak & Kim, ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1072.pdf
Checklist:
 2026.acl-long.1072.checklist.pdf