C2PO: Diagnosing and Disentangling Bias Shortcuts in LLMs

Xuan Feng; Bo An; Tianlong Gu; Liang Chang; Fengrui Hao; Peipeng Yu; Shuai Zhao

Abstract

Bias in Large Language Models (LLMs) poses significant risks to trustworthiness. It manifests primarily as stereotypical biases (e.g., gender or racial stereotypes) and structural biases (e.g., lexical overlap or position preferences). Prior approaches typically address the two in isolation, often mitigating one while exacerbating the other. We systematically examine these reasoning failures and identify a primary cause: latent spurious feature correlations in the input that drive erroneous reasoning shortcuts. Building on this finding, we introduce Causal-Contrastive Preference Optimization (C2PO), a unified alignment framework that discovers and suppresses these correlations simultaneously, directly within the optimization process. Specifically, C2PO leverages causal counterfactual signals to isolate bias-inducing features from valid reasoning paths, and employs a fairness-sensitive preference update mechanism to dynamically evaluate logit-level contributions and suppress shortcut features. Extensive experiments across benchmarks covering stereotypical bias (BBQ, Unqover), structural bias (MNLI, HANS, Chatbot, MT-Bench), out-of-domain fairness (StereoSet, WinoBias), and general utility (MMLU, GSM8K) demonstrate that C2PO effectively mitigates both stereotypical and structural biases while preserving robust general reasoning capabilities.
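The abstract does not give the C2PO objective, so the following is only a rough sketch of the two generic ingredients it names: a counterfactual input and a preference-style update. It is not the authors' method. The helper `make_counterfactual`, the swap table, and the log-probability values are hypothetical, and the loss shown is the standard Direct Preference Optimization (DPO) loss, swapped in as a familiar stand-in for a preference objective.

```python
import math


def make_counterfactual(prompt: str, swaps: dict) -> str:
    """Hypothetical helper: swap attribute tokens (e.g. gendered pronouns)
    to build a counterfactual version of the prompt."""
    return " ".join(swaps.get(tok, tok) for tok in prompt.split())


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin)).

    The margin compares how much more the policy prefers the chosen response
    over the rejected one, relative to a frozen reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # Numerically stable softplus(-x), which equals -log(sigmoid(x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    original = "The nurse said she was tired"
    counterfactual = make_counterfactual(original, {"she": "he", "he": "she"})
    print(counterfactual)

    # Assumed setup: the "chosen" answer is the one that stays the same under
    # the swap, the "rejected" answer is the one that changes with it.
    # The log-probabilities below are made-up illustration values.
    print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

In this sketch, a response whose content changes when only the swapped attribute changes would be treated as the rejected side of the pair; how C2PO actually constructs its signals and weights logit-level contributions is described in the paper itself, not here.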

Anthology ID: 2026.findings-acl.1226
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 24496–24515
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1226/
Cite (ACL): Xuan Feng, Bo An, Tianlong Gu, Liang Chang, Fengrui Hao, Peipeng Yu, and Shuai Zhao. 2026. C2PO: Diagnosing and Disentangling Bias Shortcuts in LLMs. In Findings of the Association for Computational Linguistics: ACL 2026, pages 24496–24515, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): C2PO: Diagnosing and Disentangling Bias Shortcuts in LLMs (Feng et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1226.pdf
Checklist: 2026.findings-acl.1226.checklist.pdf
