Medha Hira
2026
Mind the Gap: Multilingual Divide in LLM Bias Detection and Reasoning
Medha Hira | Prachi Goyal | Raj Maheshwari | Arnav Goel
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Large Language Models (LLMs) are increasingly deployed in multilingual settings, yet most bias evaluation remains English-centric and overlooks how bias manifests within reasoning. We present a systematic study of social bias in both predictions and chain-of-thought reasoning across English, Dutch, Spanish, and Turkish using the MBBQ benchmark. We evaluate instruction-tuned, CoT-prompted, and reasoning-native models under supervised fine-tuning and preference optimization, using accuracy, F1, bias metrics, and a novel reasoning-level language drift measure. We find that (1) bias varies substantially across languages, with consistent degradation in non-English settings, (2) reasoning traces often introduce additional stereotype-driven signals beyond final outputs, and (3) English-trained debiasing methods fail to generalize reliably, with preference optimization introducing cross-lingual trade-offs. We further show that performance gains in multilingual settings are frequently driven by implicit reliance on English-centric reasoning, revealed through increased language drift. Together, our results demonstrate that multilingual fairness cannot be inferred from English performance and requires reasoning-aware, language-specific evaluation and alignment.
MORPHOGEN: A Multilingual Benchmark for Evaluating Gender-Aware Morphological Generation
Mehul Agarwal | Aditya Aggarwal | Arnav Goel | Medha Hira | Anubha Gupta
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While multilingual large language models (LLMs) perform well on high-level tasks like translation and question answering, their ability to handle grammatical gender and morphological agreement remains underexplored. In morphologically rich languages, gender influences verb conjugation, pronouns, and even first-person constructions with explicit and implicit mentions of gender. We introduce MORPHOGEN, a morphologically grounded large-scale benchmark dataset for evaluating gender-aware generation in three typologically diverse grammatically gendered languages: French, Arabic, and Hindi. The core task, GENFORM, requires models to rewrite a first-person sentence in the opposite gender while preserving its meaning and structure. We construct a high-quality synthetic dataset spanning these three languages and benchmark 15 popular multilingual LLMs (2B–70B) on their ability to perform this transformation. Our results reveal significant gaps and interesting insights into how current models handle morphological gender. MORPHOGEN provides a focused diagnostic lens for gender-aware language modeling and lays the groundwork for future research on inclusive and morphology-sensitive NLP.