Loss Masking Under the Hood: Backdoor Concealment and Private Data Memorization in LLMs

Tagore Rao Kosireddy, Evan Lucas


Abstract
Loss masking has been proposed as a method for preventing language models from generating specific content by selectively zeroing the training loss on sensitive tokens, which allows a language model to learn protected content as context without learning to reproduce it (CITATION). Although the method is promising, many critical questions about its effects on a model remain unanswered. In this work, we investigate the impact of loss masking on internal model representations and context understanding using a small causal language model (GPT-2) at three scales (124M, 355M, and 774M parameters), applying mechanistic interpretability tools including causal tracing, attention analysis, and linear probing. We explore two use cases of loss masking: backdoor concealment and prevention of memorization of named entities. In both settings, we find that loss masking successfully blocks generation of the protected tokens. Through mechanistic analysis, we show that protected token identity remains fully encoded in hidden states regardless of loss masking, confirming that loss masking suppresses the output pathway but not the internal encoding. Code is available at https://github.com/Tagore-7/loss-masking-analysis
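The core operation described in the abstract, zeroing the training loss on protected tokens while leaving them visible as context, can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); the function name `masked_cross_entropy`, the boolean `protected` mask, and the choice to average over unprotected positions only are all assumptions made for illustration.

```python
import math


def masked_cross_entropy(logits, targets, protected):
    """Mean next-token cross-entropy over unprotected positions only.

    Hypothetical helper, not taken from the paper's code.
    logits:    list of per-position score lists (one score per vocabulary item)
    targets:   list of target token ids, one per position
    protected: list of bools; True means this position's loss is zeroed
    """
    total, kept = 0.0, 0
    for scores, target, is_protected in zip(logits, targets, protected):
        if is_protected:
            # Masked position: it adds nothing to the loss, so it would
            # produce no gradient pushing the model to emit this token.
            continue
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[target]
        kept += 1
    # Assumption: normalize by the number of unprotected positions.
    return total / kept if kept else 0.0


# Toy example: three positions over a three-token vocabulary.
# The third position is badly predicted, but it is protected.
logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
targets = [0, 1, 0]

masked_loss = masked_cross_entropy(logits, targets, [False, False, True])
unmasked_loss = masked_cross_entropy(logits, targets, [False, False, False])
print(masked_loss, unmasked_loss)
```

The protected token still appears in the input sequence, so later positions can condition on it; only its contribution to the objective is removed. In PyTorch-based training, a comparable effect is commonly obtained by setting the labels at masked positions to the `ignore_index` value of `torch.nn.functional.cross_entropy` (default `-100`); whether the paper does it this way is not stated in the abstract.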
Anthology ID:
2026.privatenlp-main.5
Volume:
Proceedings of the Seventh Workshop on Privacy in Natural Language Processing
Month:
July
Year:
2026
Address:
San Diego, California
Editors:
Ivan Habernal, Sepideh Ghanavati, Sara Haghighi, Krithika Ramesh, Timour Igamberdiev, Shomir Wilson
Venues:
PrivateNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
69–79
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.privatenlp-main.5/
Cite (ACL):
Tagore Rao Kosireddy and Evan Lucas. 2026. Loss Masking Under the Hood: Backdoor Concealment and Private Data Memorization in LLMs. In Proceedings of the Seventh Workshop on Privacy in Natural Language Processing, pages 69–79, San Diego, California. Association for Computational Linguistics.
Cite (Informal):
Loss Masking Under the Hood: Backdoor Concealment and Private Data Memorization in LLMs (Kosireddy & Lucas, PrivateNLP 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.privatenlp-main.5.pdf