From Evasion to Concealment: Stealthy Knowledge Unlearning for LLMs

Tianle Gu, Kexin Huang, Ruilin Luo, Yuanqi Yao, Xiuying Chen, Yujiu Yang, Yan Teng, Yingchun Wang


Abstract
LLM Unlearning plays a crucial role in removing sensitive information from language models to mitigate potential misuse. However, previous approaches often treat nonsensical responses or template-based refusals (e.g., “Sorry, I cannot answer.”) as the unlearning target, which can give the impression of deliberate information suppression, making the process even more vulnerable to attacks and jailbreaks. Moreover, most methods rely on auxiliary models or retaining datasets, which adds complexity to the unlearning process. To address these challenges, we propose MEOW, a streamlined and stealthy unlearning method that eliminates the need for auxiliary models or retaining data while avoiding leakage through its innovative use of inverted facts. These inverted facts are generated by an offline LLM and serve as fine-tuning labels. In addition, we introduce MEMO, a novel metric that measures the model’s memorization and is used to select optimal fine-tuning targets. The use of inverted facts not only keeps the unlearning process covert but also ensures that sensitive information is effectively forgotten without revealing the target data. Evaluated on the ToFU Knowledge Unlearning dataset using Llama2-7B-Chat and Phi-1.5, MEOW outperforms baselines in forgetting quality while preserving model utility. MEOW also maintains strong performance across NLU and NLG tasks and demonstrates superior resilience to attacks, validated via the Min-K% membership inference method.
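
The abstract notes that resilience to attacks is validated with the Min-K% membership-inference method. As a rough illustration only (not code from the paper), the Python sketch below scores a passage by averaging the log-probabilities of its k% least likely tokens under a causal LM; a higher score suggests the passage is more likely memorized. The function name, default k, and the commented-out checkpoint are illustrative assumptions.

    # Hedged sketch of a Min-K% probability membership-inference score,
    # used only to illustrate the kind of attack the paper evaluates against.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def min_k_percent_score(text, model, tokenizer, k=0.2):
        """Average log-probability of the k% least likely tokens in `text`.
        Higher scores suggest the text is more likely memorized by the model."""
        enc = tokenizer(text, return_tensors="pt")
        input_ids = enc["input_ids"]
        with torch.no_grad():
            logits = model(input_ids).logits  # (1, seq_len, vocab)
        # Log-probability assigned to each actual next token
        log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
        token_log_probs = log_probs.gather(
            2, input_ids[:, 1:].unsqueeze(-1)
        ).squeeze(-1).squeeze(0)
        # Keep only the k% lowest-probability tokens and average them
        n_keep = max(1, int(k * token_log_probs.numel()))
        lowest = torch.topk(token_log_probs, n_keep, largest=False).values
        return lowest.mean().item()

    # Usage (hypothetical checkpoint; any causal LM works):
    # model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
    # tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
    # score = min_k_percent_score("Some sensitive fact ...", model, tokenizer)

Under this reading, an unlearning method is more resilient when the score for forgotten passages drops toward the score of passages the model never saw.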
Anthology ID:
2025.findings-acl.535
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
10261–10279
URL:
https://preview.aclanthology.org/landing_page/2025.findings-acl.535/
Cite (ACL):
Tianle Gu, Kexin Huang, Ruilin Luo, Yuanqi Yao, Xiuying Chen, Yujiu Yang, Yan Teng, and Yingchun Wang. 2025. From Evasion to Concealment: Stealthy Knowledge Unlearning for LLMs. In Findings of the Association for Computational Linguistics: ACL 2025, pages 10261–10279, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
From Evasion to Concealment: Stealthy Knowledge Unlearning for LLMs (Gu et al., Findings 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.findings-acl.535.pdf