Gilad Gressel


2026

Cross-cultural psychology has shown that moral judgments about failures to help vary systematically across cultures. In a landmark study, Miller, Bersoff, and Harwood (1990) found that while Indian and American participants agreed that failures to help are undesirable, they differed in whether they considered helping a moral obligation subject to social sanction or a personal decision. We adapt Miller et al.’s paradigm to a cross-lingual LLM setting: nine scenarios crossing need severity (life-threatening, moderate, minor) with role relationship (parent, friend, stranger), together with the original probe questions, are presented to four LLMs (GPT-5.4, Claude-Opus-4.6, DeepSeek-V3.1, Qwen3-235B) across ten languages. We find that language significantly shapes how LLMs categorize failures to help as moral violations, social conventions, personal-moral concerns, or personal decisions (χ²(27) = 116.14, p < .001, Cramér’s V = 0.147). Models agree across languages that failures to help are undesirable, but diverge substantially in how they classify them, with the primary divergence falling between moral violations and personal decisions. The proportion of responses classifying failures as moral violations decreases as need severity decreases and as the role relationship becomes more distant. The degree of cross-lingual variation differs substantially across models, with open-weight models showing significantly stronger variation than closed-weight models. These findings indicate that users consulting LLMs in different languages may receive substantively different moral guidance, underscoring the need for cross-lingual normative auditing as a component of multilingual LLM evaluation.
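As a rough illustration of how a statistic of this form is computed, the sketch below runs a chi-square test of independence on a contingency table and derives Cramér’s V from it. It assumes the reported test is run on a language-by-category table; under that assumption, ten languages and four categories give (10 − 1)(4 − 1) = 27 degrees of freedom, which matches the reported value. The function name `chi_square_and_cramers_v` and the counts in `toy_table` are hypothetical and invented for illustration; they are not the study's code or data.

```python
import math


def chi_square_and_cramers_v(table):
    """Return (chi2, dof, V) for an r x c contingency table of counts.

    Assumes every row and column has a non-zero total, so that all
    expected counts are positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Pearson chi-square: sum over cells of (observed - expected)^2 / expected
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    r, c = len(table), len(table[0])
    dof = (r - 1) * (c - 1)
    # Cramér's V rescales chi2 by sample size and the smaller table dimension
    v = math.sqrt(chi2 / (n * min(r - 1, c - 1)))
    return chi2, dof, v


# Hypothetical counts (NOT the study's data):
# rows = 3 languages, columns = 4 classification categories
toy_table = [
    [30, 5, 10, 15],
    [20, 10, 10, 20],
    [10, 10, 15, 25],
]

chi2, dof, v = chi_square_and_cramers_v(toy_table)
print(f"chi2({dof}) = {chi2:.2f}, Cramér's V = {v:.3f}")
```

Because V divides the chi-square statistic by the sample size, it can be compared across tables with different numbers of responses, which is what makes it usable for per-model or per-domain comparisons.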
Large language models (LLMs) are increasingly deployed in multilingual settings, yet little is known about whether their moral and social judgments remain consistent across languages. In particular, when faced with moral and social dilemmas, LLMs must often implicitly or explicitly assign responsibility to an individual, to broader social forces, or across multiple parties, a process known as responsibility attribution. This study investigates whether responsibility attributions vary across languages, whether any observed variation persists across thematic domains, and whether the degree of variation differs across LLMs. We evaluate three models (GPT-5.2, Gemini-2.5-Pro, and LLaMA-3.3-70B) across 12 scenarios spanning six thematic domains (marriage, career, authority, gender, elder care, and family). Each model is prompted to attribute responsibility for each scenario by selecting from four options: the primary individual, a secondary interpersonal actor, a broader societal factor, or distributed responsibility shared across multiple parties. Results reveal a significant overall association between language and responsibility attribution (Cramér’s V = 0.24) that persists within every thematic domain (V = 0.26–0.53). The magnitude of cross-language variation is strongly model-dependent: GPT-5.2 and Gemini-2.5-Pro show modest shifts (V ≈ 0.19), while LLaMA-3.3-70B exhibits substantially stronger divergence (V = 0.52). These findings suggest that normative consistency across languages cannot be assumed and should be treated as a distinct dimension of model evaluation.