Hey, That’s My Data! Token-Only Dataset Inference in Large Language Models
Chen Xiong, Zihao Wang, Rui Zhu, Tsung-Yi Ho, Pin-Yu Chen, Jingwei Xiong, Haixu Tang
Abstract
Large Language Models (LLMs) rely on massive training datasets, often including proprietary data, which raises concerns about unauthorized usage and copyright infringement. Existing dataset inference methods typically require access to log probabilities or other internal signals, but many modern LLMs restrict such access, motivating token-only inference approaches. We propose CatShift, a token-only dataset inference framework based on catastrophic forgetting, where models overwrite prior knowledge when trained on new data. Fine-tuning an LLM on a subset of its training data induces larger output shifts than fine-tuning on unseen data. CatShift compares these shifts against those from a known non-member validation set to infer whether a dataset was included in training. Experiments on both open-source and API-based LLMs show that CatShift remains effective without logit access, enabling practical protection of proprietary datasets.
- Anthology ID: 2026.findings-acl.353
- Volume: Findings of the Association for Computational Linguistics: ACL 2026
- Month: July
- Year: 2026
- Address: San Diego, California, United States
- Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 7105–7120
- URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.353/
- Cite (ACL): Chen Xiong, Zihao Wang, Rui Zhu, Tsung-Yi Ho, Pin-Yu Chen, Jingwei Xiong, and Haixu Tang. 2026. Hey, That’s My Data! Token-Only Dataset Inference in Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2026, pages 7105–7120, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal): Hey, That’s My Data! Token-Only Dataset Inference in Large Language Models (Xiong et al., Findings 2026)
- PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.353.pdf
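
The comparison described in the abstract (output shifts on a candidate dataset versus shifts on a known non-member validation set) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the similarity metric or the statistical test, so the Jaccard token overlap in `token_overlap`, the shift definition in `output_shifts`, and the one-sided permutation test in `permutation_p_value` are all assumptions chosen for illustration. The sketch also assumes the completions before and after fine-tuning have already been collected as plain text, which is all a token-only setting provides.

```python
import random
from statistics import mean


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between whitespace-token sets.

    Assumed similarity metric; the abstract does not name one.
    """
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def output_shifts(before: list[str], after: list[str]) -> list[float]:
    """Per-prompt shift: 1 - similarity of completions before vs. after fine-tuning."""
    return [1.0 - token_overlap(b, a) for b, a in zip(before, after)]


def permutation_p_value(candidate, validation, n_perm=2000, seed=0):
    """One-sided permutation test (an assumed choice of test).

    Null hypothesis: candidate shifts are not larger on average than the
    shifts measured on the known non-member validation set.
    """
    rng = random.Random(seed)
    observed = mean(candidate) - mean(validation)
    pooled = list(candidate) + list(validation)
    k = len(candidate)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mean(pooled[:k]) - mean(pooled[k:]) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical shift values, for illustration only (not results from the paper).
    candidate_shifts = [0.80, 0.90, 0.85, 0.90, 0.95, 0.80]
    validation_shifts = [0.10, 0.20, 0.15, 0.10, 0.20, 0.10]
    p = permutation_p_value(candidate_shifts, validation_shifts)
    print("suspected member" if p < 0.05 else "no evidence of membership")
```

Under these assumptions, a small p-value means the candidate dataset produced larger shifts than the non-member baseline, which the abstract's reasoning would read as evidence that the dataset was part of training. The 0.05 threshold is likewise an illustrative choice.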