John J. Guerrerio

2026

When Debiasing Backfires: Counterintuitive Side Effects of Preprocessing-Based Stereotype Mitigation
Yahan Zheng | John J. Guerrerio | Soroush Vosoughi | Weicheng Ma
Findings of the Association for Computational Linguistics: ACL 2026

Preprocessing-based methods for stereotype mitigation, such as pre-/post-training on debiased corpora, are widely used in NLP. While these approaches reduce measurable stereotypes for targeted groups, we find they often induce unintended shifts: stereotyping or counter-stereotyping can increase for other demographics, including across unrelated categories. We demonstrate these side effects across two model families (encoder-only and decoder-only), multiple preprocessing strategies (removing stereotypical sentences, removing group mentions, and swapping references), and both pre- and post-training at different data scales on Wikipedia. Standard benchmarks frequently miss these shifts. Using attention-rollout analysis, we observe that such side effects are not accompanied by large changes in attention flow, complicating mechanistic explanations. We discuss implications for evaluation, provide actionable diagnostics, and argue for side-effect-aware, transparent mitigation practices that make claims calibrated to uncertainty.

2025

pdf bib abs

Scalable and Culturally Specific Stereotype Dataset Construction via Human-LLM Collaboration
Weicheng Ma | John J. Guerrerio | Soroush Vosoughi
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Research on stereotypes in large language models (LLMs) has largely focused on English-speaking contexts, due to the lack of datasets in other languages and the high cost of manual annotation in underrepresented cultures. To address this gap, we introduce a cost-efficient human-LLM collaborative annotation framework and apply it to construct EspanStereo, a Spanish-language stereotype dataset spanning multiple Spanish-speaking countries across Europe and Latin America. EspanStereo captures both well-documented stereotypes from prior literature and culturally specific biases absent from English-centric resources. Using LLMs to generate candidate stereotypes and in-culture annotators to validate them, we demonstrate the framework’s effectiveness in identifying nuanced, region-specific biases. Our evaluation of Spanish-supporting LLMs using EspanStereo reveals significant variation in stereotypical behavior across countries, highlighting the need for more culturally grounded assessments. Beyond Spanish, our framework is adaptable to other languages and regions, offering a scalable path toward multilingual stereotype benchmarks. This work broadens the scope of stereotype analysis in LLMs and lays the groundwork for comprehensive cross-cultural bias evaluation.

Co-authors

Venues

EMNLP1
Findings1

Fix author