Samar A. Assem


2026

Stereotype detection benchmarks assume that stereotyping occurs through what is said, that is, through lexical co-occurrence between demographic terms and stereotypical attributes. We argue that stereotyping is often conveyed by what is meant: through presupposition, implicature, and speech-act framing that leave surface content unchanged while embedding prejudice in the pragmatic layer. We call this phenomenon pragmatic stereotyping. Evaluating GPT-4 and Claude 3.5 Sonnet on a stratified sample of 500 Egyptian Arabic social media comments annotated with a seven-tag sentiment/(im)politeness taxonomy, we find that cultural grounding is the critical bottleneck in detecting pragmatic stereotyping in non-English discourse. About 35% of LLM errors result from cultural grounding gaps, leading to a 15-percentage-point F1 difference between explicit tags (0.81) and implicit tags (0.66). These failures are bidirectional: on the author side, LLMs under-detect prejudice encoded through concessive presupposition and backhanded compliments; on the model side, LLMs apply English-based pragmatic assumptions, misinterpreting genuine polite criticism as sarcasm and positively intended impoliteness as conflictive. Our five-layer Chain-of-Thought diagnostic framework localizes these failures to the culture-dependent inference layers. These results extend stereotype evaluation beyond lexical benchmarks and have direct implications for content moderation pipelines serving Arabic-speaking communities.