On Localizing and Deleting Toxic Memories in Large Language Models

Anubrata Das, Manoj Kumar, Ninareh Mehrabi, Anil Ramakrishna, Anna Rumshisky, Kai-Wei Chang, Aram Galstyan, Morteza Ziyadi, Rahul Gupta


Abstract
Warning: This paper contains offensive language.
Ensuring that large language models (LLMs) do not generate harmful text is critical for their safe deployment. A common failure mode involves producing toxic responses to otherwise innocuous prompts. While various detoxification methods have been proposed, the underlying mechanisms that drive toxic generation in LLMs are not yet fully understood. Our work aims to provide a mechanistic understanding of toxic generation against innocuous-seeming adversarial prompts through the lens of memory localization. We find evidence that toxic memories are localized in the early multilayer perceptron (MLP) layers of GPT-2-XL. We further investigate the effects of editing and deleting these toxic memories in the MLP layers to reduce toxic generation. Editing significantly reduces toxic generation, from 62.86% to 28.61%. However, this reduction comes with a trade-off in generation quality: perplexity on the adversarial prompts increases from 78.18 for the unedited GPT-2-XL to 106.06 after editing. Localization-informed deletion achieves a better toxicity-perplexity tradeoff than editing random early layers, which reduces toxicity but leads to larger perplexity increases.
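The abstract uses perplexity as the generation-quality metric alongside toxicity rates. As a point of reference only (not the authors' code), the following is a minimal sketch of how perplexity could be computed for GPT-2-XL with the Hugging Face transformers library; the model name, prompt text, and function name are illustrative assumptions.

# Minimal sketch (assumption, not from the paper): average-token perplexity
# of a text under GPT-2-XL, the metric the abstract reports.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
model.eval()

def perplexity(text: str) -> float:
    # Passing labels=input_ids makes the model return the mean
    # cross-entropy loss over the tokens of `text`.
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

print(perplexity("An innocuous-seeming prompt and its continuation."))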
Anthology ID:
2025.findings-naacl.129
Volume:
Findings of the Association for Computational Linguistics: NAACL 2025
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2415–2423
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.129/
Cite (ACL):
Anubrata Das, Manoj Kumar, Ninareh Mehrabi, Anil Ramakrishna, Anna Rumshisky, Kai-Wei Chang, Aram Galstyan, Morteza Ziyadi, and Rahul Gupta. 2025. On Localizing and Deleting Toxic Memories in Large Language Models. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 2415–2423, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
On Localizing and Deleting Toxic Memories in Large Language Models (Das et al., Findings 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.129.pdf