Mitigating Biases in Language Models via Bias Unlearning

Dianqing Liu, Yi Liu, Guoqing Jin, Zhendong Mao


Abstract
Many studies have shown that language models exhibit biases against various demographic groups, amplifying discrimination and harming fairness. Recent parameter-modification debiasing approaches significantly degrade core capabilities such as text coherence and task accuracy, while prompt-based debiasing methods, effective only for predefined trigger words, fail to address stereotypical associations deeply embedded in model parameters. In this paper, we propose BiasUnlearn, a novel model-debiasing framework that achieves targeted debiasing via a dual-pathway unlearning mechanism coordinating stereotype forgetting with anti-stereotype retention, while preventing bias polarity reversal through an adversarial forget set and dynamic dataset swapping. We conducted extensive experiments with multiple language models across various evaluation benchmarks. The results show that BiasUnlearn outperforms existing methods in mitigating bias in language models while retaining their language-modeling capabilities. Further experiments reveal that debiasing weights are transferable across model variants, confirming that bias representations become entrenched during pre-training and persist through fine-tuning.
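The abstract does not spell out the training objective, but the described dual-pathway mechanism can be illustrated with a minimal, hypothetical sketch: plain gradient ascent on stereotyped text stands in for "stereotype forgetting", ordinary cross-entropy descent on anti-stereotyped counterparts stands in for "anti-stereotype retention", and the forget/retain roles are periodically exchanged to mimic the dynamic dataset swapping that guards against bias polarity reversal. All names (unlearn_step, swap_every, alpha) and the specific losses are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a dual-pathway unlearning step in the spirit of
# BiasUnlearn, assuming PyTorch and Hugging Face Transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def lm_loss(texts: list[str]) -> torch.Tensor:
    """Next-token cross-entropy over a batch of strings, ignoring padding."""
    enc = tokenizer(texts, return_tensors="pt", padding=True)
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100  # mask pad positions
    return model(**enc, labels=labels).loss

def unlearn_step(forget_texts, retain_texts, alpha=1.0):
    # Forgetting pathway: ascend the LM loss (negated term) on stereotyped
    # continuations, pushing probability mass away from them.
    # Retention pathway: descend on anti-stereotyped counterparts to keep
    # fluency and general language-modeling ability intact.
    loss = -lm_loss(forget_texts) + alpha * lm_loss(retain_texts)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dynamic dataset swapping, per the abstract: periodically exchange the
# forget/retain roles so optimization does not overshoot into the opposite
# (anti-stereotype-dominant) polarity.
def train(stereo, anti, steps=1000, swap_every=200, batch=8):
    forget, retain = stereo, anti
    for step in range(steps):
        if step > 0 and step % swap_every == 0:
            forget, retain = retain, forget
        i = (step * batch) % max(len(forget) - batch, 1)
        unlearn_step(forget[i:i + batch], retain[i:i + batch])
```

The key design choice mirrored here is that forgetting and retention are optimized jointly rather than sequentially, so the retention term continuously counteracts the degradation of core capabilities that pure gradient-ascent unlearning would cause.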
Anthology ID:
2025.emnlp-main.208
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
4160–4178
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.208/
Cite (ACL):
Dianqing Liu, Yi Liu, Guoqing Jin, and Zhendong Mao. 2025. Mitigating Biases in Language Models via Bias Unlearning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 4160–4178, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Mitigating Biases in Language Models via Bias Unlearning (Liu et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.208.pdf
Checklist:
2025.emnlp-main.208.checklist.pdf