Proving membership in LLM pretraining data via data watermarks

Johnny Wei, Ryan Wang, Robin Jia


Abstract
Detecting whether copyright holders’ works were used in LLM pretraining is poised to be an important problem. This work proposes using data watermarks to enable principled detection with only black-box model access, provided that the rightholder contributed multiple training documents and watermarked them before public release. By applying a randomly sampled data watermark, detection can be framed as hypothesis testing, which provides guarantees on the false detection rate. We study two watermarks: one that inserts random sequences, and another that randomly substitutes characters with Unicode lookalikes. We first show how three aspects of watermark design (watermark length, number of duplications, and interference) affect the power of the hypothesis test. Next, we study how a watermark’s detection strength changes under model and dataset scaling: while increasing the dataset size decreases the strength of the watermark, watermarks remain strong if the model size also increases. Finally, we view SHA hashes as natural watermarks and show that we can robustly detect hashes from BLOOM-176B’s training data, as long as they occurred at least 90 times. Together, our results point towards a promising future for data watermarks in real-world use.
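The abstract describes the detection recipe only at a high level. The sketch below illustrates the general idea under stated assumptions: `model_loss` is a hypothetical stand-in for black-box access to the LLM's per-token loss, and the small `LOOKALIKES` map is an illustrative example of Unicode confusables, not the authors' implementation.

```python
import random
import statistics

# Hypothetical stand-in for black-box model access: average per-token
# loss (negative log-likelihood) of `text` under the LLM. Not part of the paper.
def model_loss(text: str) -> float:
    raise NotImplementedError("query your LLM here")

# Watermark type 1: a random character sequence inserted into each document.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "
def sample_random_sequence(length: int, rng: random.Random) -> str:
    return "".join(rng.choice(ALPHABET) for _ in range(length))

# Watermark type 2: randomly substitute characters with Unicode lookalikes.
# Tiny illustrative map of Latin -> Cyrillic confusables (assumption).
LOOKALIKES = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "p": "\u0440", "c": "\u0441"}
def unicode_watermark(text: str, rate: float, rng: random.Random) -> str:
    return "".join(
        LOOKALIKES[ch] if ch in LOOKALIKES and rng.random() < rate else ch
        for ch in text
    )

# Detection framed as hypothesis testing: compare the model's loss on the
# watermark that was actually inserted into the rightholder's documents
# against a null distribution of losses on freshly sampled, never-inserted
# watermarks. A markedly low loss (negative z-score) suggests the model
# memorized the watermark, i.e. the documents were in the training data.
def detect(inserted_watermark: str, length: int, n_null: int = 1000, seed: int = 0) -> float:
    rng = random.Random(seed)
    observed = model_loss(inserted_watermark)
    null_losses = [model_loss(sample_random_sequence(length, rng)) for _ in range(n_null)]
    mu, sigma = statistics.mean(null_losses), statistics.stdev(null_losses)
    return (observed - mu) / sigma  # threshold chosen for the desired false detection rate
```

Because the inserted watermark was itself drawn uniformly at random, its loss is exchangeable with the null samples whenever the documents were not trained on; this is what gives the test its guarantee on the false detection rate.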
Anthology ID: 2024.findings-acl.788
Volume: Findings of the Association for Computational Linguistics: ACL 2024
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 13306–13320
URL: https://aclanthology.org/2024.findings-acl.788
DOI: 10.18653/v1/2024.findings-acl.788
Cite (ACL): Johnny Wei, Ryan Wang, and Robin Jia. 2024. Proving membership in LLM pretraining data via data watermarks. In Findings of the Association for Computational Linguistics: ACL 2024, pages 13306–13320, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): Proving membership in LLM pretraining data via data watermarks (Wei et al., Findings 2024)
PDF: https://preview.aclanthology.org/autopr/2024.findings-acl.788.pdf