Abstract
Large Language Models (LLMs) have become pivotal in advancing natural language processing, yet their potential to perpetuate biases poses significant concerns. This paper introduces a new framework employing Direct Preference Optimization (DPO) to mitigate gender, racial, and religious biases in LLM-generated English text. By developing a loss function that favors less biased over biased completions, our approach cultivates a preference for respectful and non-discriminatory language in LLMs. We also contribute a manually designed dataset for training LLMs to recognize and correct biases. This dataset encompasses a diverse range of prompts paired with both biased and unbiased completions. Implementing this approach on the Microsoft Phi-2 model, we demonstrate substantial reductions in biased outputs as our model outperforms the baseline model on almost all bias benchmarks. Our model also achieves better performance compared to other open-source models on most benchmarks. By reducing biases in the language generated by the model, our study marks a significant step towards developing more ethical and socially responsible LLMs. We publicly release BiasDPO dataset on HuggingFace.- Anthology ID:
- 2024.acl-srw.7
- Volume:
- Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
- Month:
- August
- Year:
- 2024
- Address:
- Bangkok, Thailand
- Editors:
- Xiyan Fu, Eve Fleisig
- Venue:
- ACL
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 42–50
- Language:
- URL:
- https://aclanthology.org/2024.acl-srw.7
- DOI:
- 10.18653/v1/2024.acl-srw.7
- Cite (ACL):
- Ahmed Allam. 2024. BiasDPO: Mitigating Bias in Language Models through Direct Preference Optimization. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 42–50, Bangkok, Thailand. Association for Computational Linguistics.
- Cite (Informal):
- BiasDPO: Mitigating Bias in Language Models through Direct Preference Optimization (Allam, ACL 2024)
- PDF:
- https://preview.aclanthology.org/dois-2013-emnlp/2024.acl-srw.7.pdf