Albert No

2026

Machine unlearning aims to remove “forget” data while preserving knowledge from the “retain” data, yet a fundamental question arises when the two share content. By definition, an unlearned model should be indistinguishable from a model retrained solely on the retain set, which implies that shared knowledge must remain while only forget-specific content is removed. To evaluate this requirement, we introduce DUSK, the first benchmark for unlearning under realistic knowledge overlap. DUSK constructs documents containing both shared and unique knowledge and defines seven metrics to test whether methods erase forget-specific expressions without discarding shared facts. Evaluating nine recent approaches, we find that although surface text is often removed, current methods struggle to distinguish shared from unique knowledge, either erasing information that should be retained or failing to fully forget target content. DUSK provides a controlled, reproducible testbed for diagnosing these failures and guiding precise unlearning algorithms.

2025

Decomposing weight matrices into quantization and low-rank components (W ≈ Q + LR) is a widely used technique for compressing large language models (LLMs). Existing joint optimization methods iteratively alternate between quantization and low-rank approximation. However, these methods tend to prioritize one component at the expense of the other, resulting in suboptimal decompositions that fail to leverage each component’s unique strengths. In this work, we introduce Outlier-Driven Low-Rank Initialization (ODLRI), which assigns low-rank components the specific role of capturing activation-sensitive weights. This structured decomposition mitigates outliers’ negative impact on quantization, enabling more effective balance between quantization and low-rank approximation. Experiments on Llama2 (7B, 13B, 70B), Llama3-8B, and Mistral-7B demonstrate that incorporating ODLRI into the joint optimization framework consistently reduces activation-aware error, minimizes quantization scale, and improves perplexity and zero-shot accuracy in low-bit settings.
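The generic alternating scheme mentioned in the abstract above (not ODLRI itself) might be sketched as follows. This is a minimal illustration under stated assumptions: the symmetric per-tensor round-to-nearest quantizer, the plain Frobenius (weight-only, not activation-aware) error, and the names `quantize_uniform`, `low_rank`, and `alternate_decompose` are all hypothetical choices made for this example and are not taken from the paper.

```python
import numpy as np


def quantize_uniform(x, bits=4):
    """Symmetric round-to-nearest quantization, one scale per tensor (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.abs(x).max()
    if max_abs == 0:
        return np.zeros_like(x)
    scale = max_abs / qmax
    # Return the dequantized values so they can be compared with W directly.
    return np.clip(np.round(x / scale), -qmax, qmax) * scale


def low_rank(x, rank):
    """Best rank-`rank` approximation in Frobenius norm via truncated SVD."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    L = u[:, :rank] * s[:rank]  # fold singular values into the left factor
    R = vt[:rank, :]
    return L, R


def alternate_decompose(W, rank=4, bits=4, iters=5):
    """Alternate between quantizing W - LR and fitting LR to W - Q."""
    L = np.zeros((W.shape[0], rank))
    R = np.zeros((rank, W.shape[1]))
    for _ in range(iters):
        Q = quantize_uniform(W - L @ R, bits)  # quantization step
        L, R = low_rank(W - Q, rank)           # low-rank step on the residual
    return Q, L, R


rng = np.random.default_rng(0)
W = rng.standard_normal((32, 16))
Q, L, R = alternate_decompose(W)

err_joint = np.linalg.norm(W - Q - L @ R)  # error of Q + LR
err_without_lr = np.linalg.norm(W - Q)     # error if the low-rank term is dropped
```

Because the loop ends on the low-rank step, the truncated SVD guarantees that `err_joint` is no larger than `err_without_lr`. The sketch starts from a zero low-rank term; the abstract describes ODLRI as an initialization that instead assigns the low-rank term to activation-sensitive weights, which this example does not attempt to reproduce.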

R-TOFU: Unlearning in Large Reasoning Models
Sangyeon Yoon | Wonje Jeung | Albert No
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large Reasoning Models (LRMs) embed private or copyrighted information not only in their final answers but also throughout multi-step chain-of-thought (CoT) traces, making reliable unlearning far more demanding than in standard LLMs. We introduce Reasoning-TOFU (R-TOFU), the first benchmark tailored to this setting. R-TOFU augments existing unlearning tasks with realistic CoT annotations and provides step-wise metrics that expose residual knowledge invisible to answer-level checks. Using R-TOFU, we carry out a comprehensive comparison of gradient-based and preference-optimization baselines and show that conventional answer-only objectives leave substantial forget traces in reasoning. We further propose Reasoned IDK, a preference-optimization variant that preserves coherent yet inconclusive reasoning, achieving a stronger balance between forgetting efficacy and model utility than earlier refusal styles. Finally, we identify a failure mode: decoding variants such as ZeroThink and LessThink can still reveal forgotten content despite seemingly successful unlearning, emphasizing the need to evaluate models under diverse decoding settings. Together, the benchmark, analysis, and new baseline establish a systematic foundation for studying and improving unlearning in LRMs while preserving their reasoning capabilities.

SEPS: A Separability Measure for Robust Unlearning in LLMs
Wonje Jeung | Sangyeon Yoon | Albert No
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Machine unlearning aims to selectively remove targeted knowledge from Large Language Models (LLMs), ensuring they forget specified content while retaining essential information. Existing unlearning metrics assess whether a model correctly answers retain queries and rejects forget queries, but they fail to capture real-world scenarios where forget queries rarely appear in isolation. In fact, forget and retain queries often coexist within the same prompt, making mixed-query evaluation crucial. We introduce SEPS, an evaluation framework that explicitly measures a model’s ability to both forget and retain information within a single prompt. Through extensive experiments across three benchmarks, we identify two key failure modes in existing unlearning methods: (1) untargeted unlearning indiscriminately erases both forget and retain content once a forget query appears, and (2) targeted unlearning overfits to single-query scenarios, leading to catastrophic failures when handling multiple queries. To address these issues, we propose Mixed Prompt (MP) unlearning, a strategy that integrates both forget and retain queries into a unified training objective. Our approach significantly improves unlearning effectiveness, demonstrating robustness even in complex settings with up to eight mixed forget and retain queries in a single prompt.

Venues

EMNLP (2)
Findings (2)
