Disentangling Biased Knowledge from Reasoning in Large Language Models via Machine Unlearning

Zheyuan Liu; Suraj Maharjan; Fanyou Wu; Rahil Parikh; Belhassen Bayar; Srinivasan H. Sengamedu; Meng Jiang

Abstract

Large Language Models (LLMs) have been widely adopted across many domains because of their vast pre-training knowledge and strong generalization capabilities. However, these models often inherit biased knowledge, which leads to unfair decisions in sensitive applications. Removing this biased knowledge without compromising reasoning ability is difficult because the knowledge learned by an LLM is entangled. Existing approaches attempt to mitigate bias with techniques such as fine-tuning on unbiased datasets, model merging, and gradient ascent. Although these methods have proven effective experimentally, they can still be suboptimal at fully disentangling biases from reasoning. To address this gap, we propose Selective Disentanglement Unlearning (SDU), a novel unlearning framework that selectively removes biased knowledge while preserving reasoning capabilities. SDU operates in three stages: identifying biased parameters using a shadow LLM, fine-tuning with unbiased data, and performing selective parameter updates based on weight saliency. Experimental results across multiple LLMs show that SDU improves fairness accuracy by 14.7% and enhances reasoning performance by 62.6% compared to existing baselines.
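The third stage named in the abstract, a selective parameter update driven by weight saliency, can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how saliency is computed, so the sketch assumes it is the absolute weight difference between a base model and a shadow model. The function names `saliency_mask` and `selective_update`, the `top_fraction` parameter, and all numeric values are hypothetical and chosen only for illustration.

```python
from typing import List


def saliency_mask(base: List[float], shadow: List[float],
                  top_fraction: float) -> List[bool]:
    """Flag the parameters that differ most between base and shadow weights.

    Assumption: saliency = |shadow - base|. The paper may define it differently.
    """
    scores = [abs(s - b) for b, s in zip(base, shadow)]
    k = max(1, round(top_fraction * len(scores)))
    # Indices of the k highest-scoring parameters (ties go to the lower index).
    top = sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]
    chosen = set(top)
    return [i in chosen for i in range(len(scores))]


def selective_update(base: List[float], finetuned: List[float],
                     mask: List[bool]) -> List[float]:
    """Take the fine-tuned value where the mask is set, else keep the base value."""
    return [f if m else b for b, f, m in zip(base, finetuned, mask)]


# Toy flattened weight vectors (hypothetical values).
base = [0.10, -0.40, 0.25, 0.80]
shadow = [0.12, 0.30, 0.24, 0.10]      # stand-in for the shadow LLM's weights
finetuned = [0.11, -0.10, 0.30, 0.60]  # stand-in for weights after unbiased fine-tuning

mask = saliency_mask(base, shadow, top_fraction=0.5)
updated = selective_update(base, finetuned, mask)
print(mask)
print(updated)
```

In this sketch only the two parameters that moved most in the shadow model receive the fine-tuned values, and the rest stay at their base values. That restriction is the general idea behind limiting an update to a salient subset so that unrelated capabilities are left untouched.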

Anthology ID: 2025.acl-long.305
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 6105–6123
URL: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.305/
Cite (ACL): Zheyuan Liu, Suraj Maharjan, Fanyou Wu, Rahil Parikh, Belhassen Bayar, Srinivasan H. Sengamedu, and Meng Jiang. 2025. Disentangling Biased Knowledge from Reasoning in Large Language Models via Machine Unlearning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6105–6123, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Disentangling Biased Knowledge from Reasoning in Large Language Models via Machine Unlearning (Liu et al., ACL 2025)
PDF: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.305.pdf