Soft Token Attacks Cannot Reliably Audit Unlearning in Large Language Models

Haokun Chen, Sebastian Szyller, Weilin Xu, Nageen Himayat


Abstract
Large language models (LLMs) are trained on massive datasets. However, these datasets often contain undesirable content, e.g., harmful text, personal information, and copyrighted material. To address this, machine unlearning aims to remove such information from trained models. Recent work has shown that soft token attacks (STAs) can successfully extract unlearned information from LLMs. In this work, we show that STAs can be an inadequate tool for auditing unlearning. Using common unlearning benchmarks, i.e., Who Is Harry Potter? and TOFU, we demonstrate that, in a strong auditor setting, such attacks can elicit any information from the LLM, regardless of (1) the deployed unlearning algorithm, and (2) whether the queried content was originally present in the training corpus. We also show that just a few soft tokens (1-10) can elicit random strings over 400 characters long, demonstrating that STAs must be used carefully to effectively audit unlearning. Example code can be found at https://github.com/IntelLabs/LLMart/tree/main/examples/unlearning
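To illustrate the kind of attack the abstract refers to, the following is a minimal sketch of a soft token attack against a frozen causal LM: a handful of learnable embedding vectors are prepended to the prompt and optimized so the model emits a chosen target string. This is an assumption-laden illustration, not the authors' implementation (which lives in the LLMart example linked above); the model name, number of soft tokens, learning rate, and step count are placeholders.

```python
# Illustrative soft token attack sketch (not the paper's exact code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
for p in model.parameters():
    p.requires_grad_(False)  # the (unlearned) model stays frozen

prompt = "Who is Harry Potter?"
target = "Harry Potter is a fictional wizard created by J. K. Rowling."

embed = model.get_input_embeddings()
prompt_emb = embed(tok(prompt, return_tensors="pt").input_ids)
target_ids = tok(target, return_tensors="pt").input_ids
target_emb = embed(target_ids)

# A few (1-10) learnable soft tokens prepended to the prompt embeddings,
# initialized at the mean prompt embedding.
n_soft = 5
init = prompt_emb.mean(dim=1, keepdim=True).repeat(1, n_soft, 1)
soft = torch.nn.Parameter(init.detach().clone())
opt = torch.optim.Adam([soft], lr=1e-3)

for step in range(500):
    # Feed [soft tokens | prompt | target] as embeddings and score the
    # target tokens with a standard next-token cross-entropy loss.
    inputs = torch.cat([soft, prompt_emb, target_emb], dim=1)
    logits = model(inputs_embeds=inputs).logits
    n_tgt = target_ids.shape[1]
    pred = logits[:, -n_tgt - 1:-1, :]  # positions that predict each target token
    loss = torch.nn.functional.cross_entropy(
        pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1)
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because only the soft embeddings are trained while the model weights stay fixed, a strong auditor with this capability can drive the model toward essentially arbitrary target strings, which is the core reason the paper argues such attacks are unreliable as an unlearning audit.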
Anthology ID:
2025.findings-emnlp.117
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2183–2192
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.117/
DOI:
10.18653/v1/2025.findings-emnlp.117
Cite (ACL):
Haokun Chen, Sebastian Szyller, Weilin Xu, and Nageen Himayat. 2025. Soft Token Attacks Cannot Reliably Audit Unlearning in Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 2183–2192, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Soft Token Attacks Cannot Reliably Audit Unlearning in Large Language Models (Chen et al., Findings 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.117.pdf
Checklist:
2025.findings-emnlp.117.checklist.pdf