Surfacing Subtle Stereotypes: A Multilingual, Debate-Oriented Evaluation of Modern LLMs

Muhammed Yahia Gaffar Saeed; Muhammad Abdul-Mageed; Shady Shehata

Abstract

Large language models (LLMs) are widely deployed for open-ended communication, yet most bias evaluations still rely on English-language, classification-style tasks. We introduce a new multilingual, debate-style benchmark designed to reveal how narrative bias appears in realistic generative settings. Our dataset includes 8,400 structured debate prompts spanning four sensitive domains (Women’s Rights, Backwardness, Terrorism, and Religion) across seven languages ranging from high-resource (English, Chinese) to low-resource (Swahili, Nigerian Pidgin). Using four flagship models (GPT-4o, Claude 3.5 Haiku, DeepSeek-Chat, and LLaMA-3-70B), we generate over 100,000 debate responses and automatically classify which demographic groups are assigned stereotyped versus modern roles. Results show that all models reproduce entrenched stereotypes despite safety alignment: Arabs are overwhelmingly linked to Terrorism and Religion (≥89%), Africans are linked to socioeconomic “backwardness” (up to 77%), and Western groups are consistently framed as modern or progressive. Biases grow sharply in lower-resource languages, indicating that alignment trained primarily on English does not generalize globally. Our findings highlight a persistent divide in multilingual fairness: current alignment methods reduce explicit toxicity but fail to prevent biased outputs in open-ended contexts. We release our benchmark and analysis framework to support the next generation of multilingual bias evaluation and safer, culturally inclusive model alignment.

Anthology ID: 2026.lrec-main.643
Volume: Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month: May
Year: 2026
Address: Palma de Mallorca, Spain
Editors: Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue: LREC
Publisher: ELRA Language Resource Association
Pages: 8106–8121
URL: https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.643/
Cite (ACL): Muhammed Yahia Gaffar Saeed, Muhammad Abdul-Mageed, and Shady Shehata. 2026. Surfacing Subtle Stereotypes: A Multilingual, Debate-Oriented Evaluation of Modern LLMs. International Conference on Language Resources and Evaluation, main:8106–8121.
Cite (Informal): Surfacing Subtle Stereotypes: A Multilingual, Debate-Oriented Evaluation of Modern LLMs (Saeed et al., LREC 2026)
PDF: https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.643.pdf
