BiasGuard: A Reasoning-Enhanced Bias Detection Tool for Large Language Models

Zhiting Fan, Ruizhe Chen, Zuozhu Liu


Abstract
Identifying bias in LLM-generated content is a crucial prerequisite for ensuring fairness in LLMs. Existing methods, such as fairness classifiers and LLM-based judges, struggle to understand the underlying intentions of a text and lack explicit criteria for fairness judgments. In this paper, we introduce BiasGuard, a novel bias detection tool that explicitly analyzes inputs and reasons through fairness specifications to provide accurate judgments. BiasGuard is implemented through a two-stage approach: the first stage initializes the model to reason explicitly based on fairness specifications, and the second stage leverages reinforcement learning to enhance its reasoning and judgment capabilities. Experiments across five datasets demonstrate that BiasGuard outperforms existing tools, improving accuracy and reducing over-fairness misjudgments. We also highlight the importance of reasoning-enhanced decision-making and provide evidence for the effectiveness of our two-stage optimization pipeline.
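
As a rough illustration of the two-stage pipeline described in the abstract, the sketch below pairs a supervised initialization pass (explicit reasoning over fairness specifications) with a simple reinforcement-style update that penalizes over-fairness misjudgments. The detector class, prompt format, and reward shaping are illustrative assumptions for exposition, not the authors' released implementation.

# Illustrative sketch only: StubDetector, the prompt format, and the reward
# values are hypothetical placeholders, not BiasGuard's actual code.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Example:
    text: str        # candidate LLM output to audit
    reasoning: str   # reference reasoning over the fairness specifications
    label: str       # "biased" or "fair"


class StubDetector:
    """Placeholder standing in for an LLM-based bias detector."""

    def fit(self, prompt: str, completion: str) -> None:
        pass  # a supervised fine-tuning step would go here

    def generate(self, prompt: str) -> Tuple[str, str]:
        return "reasoning trace", "fair"  # (reasoning, judgment)

    def update(self, prompt: str, reasoning: str, judgment: str, reward: float) -> None:
        pass  # a policy-gradient (RL) step would go here


def judgment_reward(predicted: str, gold: str) -> float:
    """Reward correct judgments; penalize over-fairness (flagging fair text as biased)."""
    if predicted == gold:
        return 1.0
    if predicted == "biased" and gold == "fair":
        return -1.0  # over-fairness misjudgment
    return -0.5


def train(model: StubDetector, data: List[Example]) -> StubDetector:
    # Stage 1: initialize explicit reasoning grounded in fairness specifications.
    for ex in data:
        model.fit(ex.text, f"Reasoning: {ex.reasoning}\nJudgment: {ex.label}")
    # Stage 2: reinforcement learning to sharpen reasoning and judgment.
    for ex in data:
        reasoning, judgment = model.generate(ex.text)
        model.update(ex.text, reasoning, judgment, judgment_reward(judgment, ex.label))
    return model


if __name__ == "__main__":
    demo = [Example("Women can't be engineers.", "The claim generalizes by gender...", "biased")]
    train(StubDetector(), demo)
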
Anthology ID:
2025.findings-acl.506
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venues:
Findings | WS
Publisher:
Association for Computational Linguistics
Pages:
9753–9764
URL:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.findings-acl.506/
Cite (ACL):
Zhiting Fan, Ruizhe Chen, and Zuozhu Liu. 2025. BiasGuard: A Reasoning-Enhanced Bias Detection Tool for Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 9753–9764, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
BiasGuard: A Reasoning-Enhanced Bias Detection Tool for Large Language Models (Fan et al., Findings 2025)
PDF:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.findings-acl.506.pdf