Qingqing Ye
2026
From Domains to Instances: Dual-Granularity Data Synthesis for LLM Unlearning
Xiaoyu Xu | Minxin Du | Zitong LI | Zi Liang | Zhibiao Guo | Zhang Shiyu | Peizhao Hu | Qingqing Ye | Haibo Hu
Findings of the Association for Computational Linguistics: ACL 2026
Although machine unlearning is essential for removing private, harmful, or copyrighted content from LLMs, current benchmarks often fail to faithfully represent the true “forgetting scope” learned by the model. We formalize two distinct unlearning granularities, domain-level and instance-level, and propose an automated framework for synthesizing high-quality forget sets. Unlike prior work relying on external generators, our framework uses the target model itself to elicit data that matches its internal knowledge distribution through seed-guided and adversarial prompting. Our experiments across diverse benchmarks show that it achieves a superior balance of relevance, diversity, and efficiency. Quantitatively, in the Harry Potter domain, it improves relevance by ∼20 and diversity by ∼0.05 while halving the total data size compared to state-of-the-art methods. Ultimately, it facilitates more robust forgetting and better utility preservation, providing a more rigorous foundation for evaluating LLM unlearning.
2025
“Yes, My LoRD.” Guiding Language Model Extraction with Locality Reinforced Distillation
Zi Liang | Qingqing Ye | Yanyun Wang | Sen Zhang | Yaxin Xiao | RongHua Li | Jianliang Xu | Haibo Hu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Model extraction attacks (MEAs) on large language models (LLMs) have received increasing attention in recent research. However, existing attack methods typically adapt extraction strategies originally developed for deep neural networks (DNNs). They neglect the underlying inconsistency between the training tasks of MEA and LLM alignment, leading to suboptimal attack performance. To tackle this issue, we propose Locality Reinforced Distillation (LoRD), a novel model extraction algorithm specifically designed for LLMs. In particular, LoRD employs a newly defined policy-gradient-style training task that uses the responses of the victim model as the signal to guide the crafting of preferences for the local model. Theoretical analyses demonstrate that (I) the convergence procedure of LoRD in model extraction is consistent with the alignment procedure of LLMs, and (II) LoRD can reduce query complexity while mitigating watermark protection through our exploration-based stealing. Extensive experiments validate the superiority of our method in extracting various state-of-the-art commercial LLMs. Our code is available at: https://github.com/liangzid/LoRD-MEA.
OBLIVIATE: Robust and Practical Machine Unlearning for Large Language Models
Xiaoyu Xu | Minxin Du | Qingqing Ye | Haibo Hu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) trained over extensive corpora risk memorizing sensitive, copyrighted, or toxic content. To address this, we propose OBLIVIATE, a robust unlearning framework that removes targeted data while preserving model utility. The framework follows a structured process: extracting target tokens, building retain sets, and fine-tuning with a tailored loss function comprising three components: masking, distillation, and world fact. Using low-rank adapters (LoRA) ensures efficiency without compromising unlearning quality. We conduct experiments on multiple datasets, including the Harry Potter series, WMDP, and TOFU, using a comprehensive suite of metrics: forget quality (via a new document-level memorization score), model utility, and fluency. Results demonstrate its effectiveness in resisting membership inference attacks, minimizing the impact on retained data, and maintaining robustness across diverse scenarios.