Understanding Secret Leakage Risks in Code LLMs: A Tokenization Perspective

Meifang Chen; Zhe Yang; Huang Nianchen; Yizhan Huang; Yichen Li; Zihan Li; Michael R. Lyu

Abstract

Code secrets are sensitive assets for software developers, and their leakage poses significant cybersecurity risks. Despite the rapid development of AI code assistants powered by Code Large Language Models (CLLMs), CLLMs have been shown to inadvertently leak such secrets due to the well-known memorization phenomenon. We first reveal that Byte-Pair Encoding (BPE) tokenization leads to unexpected secret memorization behavior, which we term gibberish bias. Specifically, we find that some secrets are among the easiest for CLLMs to memorize: they have high character-level entropy but low token-level entropy. We then support this claim of bias with numerical data and identify its root cause as the token distribution shift between the CLLM training data and the secret data. We further discuss how gibberish bias manifests under the “larger vocabulary” trend. Finally, we discuss potential mitigation strategies and the broader implications for current tokenizer design.
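The contrast between character-level and token-level entropy can be pictured with a small sketch. It is an illustration only, not the paper's measurement procedure: it assumes empirical Shannon entropy as the entropy measure, substitutes a greedy longest-match segmenter for real BPE, and uses a hypothetical vocabulary (`TOY_VOCAB`) and a made-up secret string.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Empirical Shannon entropy, in bits per symbol, of a sequence."""
    counts = Counter(symbols)
    total = len(symbols)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def greedy_tokenize(text, vocab):
    """Greedy longest-match segmentation (a stand-in for BPE, not BPE itself).

    Pieces not found in the vocabulary fall back to single characters.
    """
    max_len = max(len(t) for t in vocab)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabulary and secret, chosen only to make the effect visible.
TOY_VOCAB = {"aB3xK9", "sk_", "live"}
secret = "aB3xK9aB3xK9aB3xK9"

tokens = greedy_tokenize(secret, TOY_VOCAB)
char_entropy = shannon_entropy(list(secret))
token_entropy = shannon_entropy(tokens)

print(f"characters: {len(secret)} symbols, {char_entropy:.3f} bits/symbol")
print(f"tokens:     {len(tokens)} symbols, {token_entropy:.3f} bits/symbol")
```

Under these assumptions the string looks random character by character, yet it collapses into a few repeated vocabulary entries once tokenized, which is the kind of mismatch the abstract describes.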

Anthology ID: 2026.findings-acl.6
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 108–119
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.6/
Cite (ACL): Meifang Chen, Zhe Yang, Huang Nianchen, Yizhan Huang, Yichen Li, Zihan Li, and Michael R. Lyu. 2026. Understanding Secret Leakage Risks in Code LLMs: A Tokenization Perspective. In Findings of the Association for Computational Linguistics: ACL 2026, pages 108–119, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Understanding Secret Leakage Risks in Code LLMs: A Tokenization Perspective (Chen et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.6.pdf
Checklist: 2026.findings-acl.6.checklist.pdf
