When Debiasing Backfires: Counterintuitive Side Effects of Preprocessing-Based Stereotype Mitigation
Yahan Zheng, John J. Guerrerio, Soroush Vosoughi, Weicheng Ma
Abstract
Preprocessing-based methods for stereotype mitigation, such as pre-/post-training on debiased corpora, are widely used in NLP. While these approaches reduce measurable stereotypes for targeted groups, we find they often induce unintended shifts: stereotyping or counter-stereotyping can increase for other demographics, including across unrelated categories. We demonstrate these side effects across two model families (encoder-only and decoder-only), multiple preprocessing strategies (removing stereotypical sentences, removing group mentions, and swapping references), and both pre- and post-training at different data scales on Wikipedia. Standard benchmarks frequently miss these shifts. Using attention-rollout analysis, we observe that such side effects are not accompanied by large changes in attention flow, complicating mechanistic explanations. We discuss implications for evaluation, provide actionable diagnostics, and argue for side-effect-aware, transparent mitigation practices that make claims calibrated to uncertainty.
- Anthology ID:
- 2026.findings-acl.486
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 10001–10021
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.486/
- Cite (ACL):
- Yahan Zheng, John J. Guerrerio, Soroush Vosoughi, and Weicheng Ma. 2026. When Debiasing Backfires: Counterintuitive Side Effects of Preprocessing-Based Stereotype Mitigation. In Findings of the Association for Computational Linguistics: ACL 2026, pages 10001–10021, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- When Debiasing Backfires: Counterintuitive Side Effects of Preprocessing-Based Stereotype Mitigation (Zheng et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.486.pdf