Proceedings of the 4th Workshop on Cross-Cultural Considerations in NLP (C3NLP 2026)

Vinodkumar Prabhakaran, Sunipa Dev, Luciana Benotti, Daniel Hershcovich, Yong Cao, Li Zhou, Bolei Ma, Ife Adebara (Editors)


Anthology ID:
2026.c3nlp-1
Month:
July
Year:
2026
Address:
San Diego, California, United States
Venues:
C3NLP | WS
Events:
Annual Meeting of the Association for Computational Linguistics (2026) | Cross-Cultural Considerations in NLP (2026) | Other Workshops and Events (2026)
SIG:
Publisher:
Association for Computational Linguistics
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.c3nlp-1/
DOI:
ISBN:
979-8-89176-420-0

Human annotation is a foundational component of modern natural language processing (NLP). Labeled datasets underpin widely used benchmarks for sentiment analysis, toxicity detection, hate speech classification, and stance detection. Within standard NLP workflows, annotation is generally treated as a technical process aimed at recovering an objective ground truth according to predefined guidelines. This paper argues that such a view overlooks the inherently interpretive nature of annotation. Drawing on insights from sociolinguistics, discourse analysis, and cultural theory, and on a growing empirical literature on annotator subjectivity, we propose that annotation should be understood as a culturally situated interpretive practice. Annotators rely on culturally shaped norms, values, and communicative expectations when interpreting linguistic meaning, and labels in NLP datasets often reflect culturally specific interpretations rather than universal truths. We position this argument relative to recent work on perspectivism, annotator-aware modeling, and cross-cultural annotation, and we use published findings from large-scale cross-cultural annotation studies to illustrate the concrete consequences of treating annotation as objective. We close with a research agenda for culturally informed annotation practice that includes operational recommendations on documentation, modeling, and evaluation.
Large language models (LLMs) are increasingly used for mental health applications, raising questions about whether they reflect established clinical knowledge. Clinical psychology has documented systematic cultural differences in the presentation of depression symptoms, with Western populations emphasizing emotional symptoms and many East Asian populations reporting more somatic symptoms. We evaluate whether general-purpose LLMs reproduce these clinically established cross-cultural patterns using prompts grounded in clinical descriptions of depression. We examine model responses under different cultural personas and languages. We find that LLMs struggle to reproduce expected cultural patterns when prompted in English. Prompting in major Eastern languages improves alignment in some configurations, suggesting that language cues partially activate cultural knowledge. However, model behavior remains dominated by a strong, culture-invariant hierarchy of depression symptoms that often overrides cultural cues, highlighting limitations in current LLMs for mental health applications.
Code-switching is often modeled in NLP as a structural or token-level phenomenon, overlooking its role as a discourse practice shaped by social and cultural context. In this work, we propose topic-based annotation as a framework for analyzing cultural and subcultural variation in bilingual discourse. Using large language models, we annotate 3,691 code-switched sentences from Spanish-English (Miami) and Spanish-Guaraní (Paraguay) corpora with topic and discourse-level information, integrating sociolinguistic metadata. Our analysis reveals systematic relationships between discourse topics, language choice, and social variables such as gender and language dominance. We observe subcultural variation within the Miami community and a clear diglossic distribution in Paraguay, where Guaraní is associated with formal domains and Spanish with informal communication. These findings suggest that modeling code-switching through discourse-level categories provides a more complete representation of multilingual communication and enables both cross-cultural and intra-cultural comparison at scale.
Adapting large language models (LLMs) to extremely low-resource languages remains challenging due to severe data scarcity and the lack of structured linguistic supervision. We introduce GCCLA, a graph-conditioned cross-lingual adaptation framework that integrates multilingual knowledge graphs into parameter-efficient LLM adaptation. GCCLA conditions a frozen multilingual LLM on structured semantic and typological relations encoded in a multilingual graph, providing a strong inductive bias for data-efficient transfer. We instantiate and evaluate the framework through a focused case study on English-to-Amharic-to-Tigrinya transfer, where labeled data is extremely limited. By separating knowledge representation from language modeling, GCCLA stabilizes learning and improves sample efficiency in few-shot regimes. We evaluate the approach on five tasks (sentiment analysis, named entity recognition, natural language inference, question answering, and extractive summarization) under extreme data scarcity, with 0–1000 labeled Tigrinya examples. Experimental results show that GCCLA consistently outperforms multilingual, translation-based, and parameter-efficient baselines, achieves competitive performance with as few as 100 labeled examples, and degrades gracefully under partial graph coverage. These findings demonstrate that graph conditioning is an effective principle for data-efficient cross-lingual adaptation of LLMs, advancing equitable NLP.
We evaluate whether open-source LLMs can produce proficiency-graded English adaptations of entries from the Diccionario de colombianismos (DiCol), a Colombian Spanish lexicographic resource used in language teaching. Three 7–8B instruction-tuned models (Llama 3.1, Qwen2.5, and Mistral) generate Beginner, Intermediate, and Advanced translations for all 8,252 definitions using structured zero-shot prompts identical across levels except for the target CEFR band. Automated metrics show that Intermediate targeting collapses (73–83% classified as Advanced by vocabulary, 𝜒² > 705, p < .001) and that Advanced outputs expand 4.9–8.2× relative to the source. Expert annotation of a 360-entry stratified sample (𝜅 = 0.61–0.68) identifies hallucination in 19% of entries (Fleiss’ 𝜅 = 0.77 for cultural preservation categories, 97% unanimity among flagged cases). Hallucination concentrates in the Advanced condition (81%, 𝜒² = 86.6, p < .001) and is associated with higher expansion (U = 16,662, p < .001, r = 0.68), manifesting primarily as generic elaboration and, in a smaller proportion, as Colombia-stereotyping and pragmatic polarity inversion. We discuss these findings through the lens of (CITATION)’s domestication framework and describe the observed pattern as algorithmic domestication.
Large Language Models (LLMs) show unbalanced knowledge of cultures across the globe, favoring high-resource cultures over low-resource ones. A possible way to tackle this issue is to fine-tune LLMs on culturally specific data. However, fine-tuning recent LLMs requires high computational resources as well as memory storage, which triggered the development of parameter-efficient fine-tuning (PEFT) approaches, the most widespread being LoRA. In this article, we investigate the use of another class of PEFT approaches, namely soft prompt methods (prompt-tuning and prefix-tuning), to improve LLMs’ cultural knowledge across diverse cultures. We focus on cultural alignment on Multiple-Choice Questions of cultural commonsense knowledge. On this task with limited fine-tuning data, we show that soft-prompt-based methods outperform LoRA in comparable settings. Moreover, the trained soft prompts are interpretable and capture similarities between cultures.
Large Language Models are widely used to generate and adapt cultural texts, yet the depth of their cultural representation remains poorly quantified. Intuitively, as a narrative text expands in length, the diversity of cultural words should scale proportionately. To formally test this, we evaluate the FairyTaleQA dataset as adapted by three models, and we introduce our primary contribution: the Contextual Stereotype Amplification Index (CSAI), an evaluation framework combining LLM-as-a-judge extraction, embedding-based cliché anchoring, and Natural Language Inference (NLI) congruence validation. By mapping the frequency of extracted Culture Specific Items (CSIs) against narrative length using Heaps’ Law (V = k ⋅ T^𝛽), we present empirical evidence of a systematic limitation in current systems: they struggle to scale cultural diversity even under explicit cultural prompting. Models rapidly hit a "Cultural Vocabulary Ceiling," constrained to a fixed set of hyper-stereotypical terms. Furthermore, we demonstrate that merely optimizing for higher CSI frequency, as done in prior work, rewards logically broken tokenism. Our CSAI formulation actively penalizes such gratuitous stereotyping, offering a more principled approach to measuring and evaluating cultural homogenization in generative AI systems.
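As an illustration of the Heaps' Law relation V = k ⋅ T^𝛽 named in the abstract above, the following is a minimal sketch of how the two parameters could be estimated by ordinary least squares in log-log space. It is not the authors' implementation: the function name `fit_heaps` and the sample points are hypothetical, and the abstract does not say which fitting procedure the paper uses.

```python
import math


def fit_heaps(token_counts, vocab_counts):
    """Estimate (k, beta) in V = k * T**beta.

    Takes logs of both sides, log V = log k + beta * log T, and fits a
    straight line by ordinary least squares. All counts must be positive,
    and at least two distinct token counts are required.
    """
    xs = [math.log(t) for t in token_counts]
    ys = [math.log(v) for v in vocab_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                          # slope of the log-log line
    k = math.exp(mean_y - beta * mean_x)      # intercept, mapped back from log space
    return k, beta


# Hypothetical, synthetic points generated exactly from V = 2 * T**0.5.
# They are illustration data only, not results from the paper.
lengths = [100, 400, 1600]
distinct_items = [20, 40, 80]
k, beta = fit_heaps(lengths, distinct_items)
print(f"k = {k:.3f}, beta = {beta:.3f}")
```

Under this reading, a fitted 𝛽 close to zero would correspond to the "ceiling" behavior the abstract describes, where the number of distinct items stops growing as the text gets longer.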
Large Language Models are increasingly deployed as writing assistants for users in the Global South, yet rewriting prompts can suppress institutionalized postcolonial varieties. We quantify South Asian English (SAsE) dialect erasure in a state-of-the-art open-weight model using a 500-sentence diagnostic benchmark (320 lexical and 180 syntactic markers). On Llama 3.3 70B, standard grammar correction retains only 26.0% of markers (lexical 31.2%; syntactic 16.7%), while formalization is more destructive (14.0% overall retention). For lexical items, we observe Americanization in 56.2% (correction) and 59.4% (formalization) of cases, typically via Standard American paraphrases. A simple dialect-aware prompt raises retention to 92.0% and reduces lexical Americanization to 6.2%, although some function-word phenomena remain resistant. A stress test shows even stronger suppression (6.7% retention). We position dialect erasure within representational-harm and cultural-competence frameworks, and provide a replicable protocol for auditing writing-assistance systems.
Multilingual NLP is often treated as a route to global inclusion, but linguistic coverage and cultural competence frequently diverge. This paper synthesizes over 50 papers spanning multilingual performance inequality, cross-lingual transfer, culture-aware evaluation, cultural alignment, multimodal benchmarks, benchmark-design critique, and community-grounded data practices. Across this literature, training data coverage remains important, but outcomes are also shaped by tokenization, prompt language, translated benchmark design, culturally grounded supervision, modality, and who authors or validates evaluation data. We argue that culturally grounded NLP should move beyond treating languages as isolated rows in benchmark tables and instead model communicative ecologies: the institutions, scripts, domains, modalities, and communities through which language is used. We propose a layered evaluation and reporting agenda centered on representation audits, mixed elicitation, ecological validity, community validation, adaptation provenance, within-language variation, and maintenance of living cultural resources.
Half a million cuneiform clay tablets survive in museums worldwide, yet modern humans can neither read nor write in the world’s oldest writing system, creating a 4,000-year cultural barrier that existing NLP tools have only partially addressed. Prior work enables one-way, scholar-oriented translation from Akkadian to English, but offers no path in the reverse direction: ordinary people cannot express their own thoughts in cuneiform, and thus remain passive consumers of ancient culture rather than active participants. We present TabletCraft, the first open-source system that enables bidirectional interaction with Mesopotamian writing. Users can read ancient tablets (Akkadian to English) and write their own messages as cuneiform clay tablets (English to Akkadian to cuneiform to rendered tablet). The system integrates a ByT5-based translation model trained on 116K bidirectional samples, a cuneiform sign converter with 14,240 mappings (95.3% coverage), and a visual tablet renderer, packaged as a pip-installable toolkit with both a command-line interface and a web demo.
Large language models (LLMs) are increasingly deployed in multilingual settings, yet little is known about whether their moral and social judgments remain consistent across languages. In particular, when faced with moral and social dilemmas, LLMs must often implicitly or explicitly assign responsibility — to an individual, to broader social forces, or across multiple parties — a process known as responsibility attribution. This study investigates whether responsibility attributions vary across languages, whether any observed variation persists across thematic domains, and whether the degree of variation differs across LLMs. We evaluate three models (GPT-5.2, Gemini-2.5-Pro, and LLaMA-3.3-70B) across 12 scenarios spanning six thematic domains (marriage, career, authority, gender, elder care, and family). Each model was prompted to attribute responsibility for each scenario by selecting from four options: the primary individual, a secondary interpersonal actor, a broader societal factor, or distributed responsibility shared across multiple parties. Results reveal a significant overall association between language and responsibility attribution (Cramér’s V = 0.24) that persists within every thematic domain (V = 0.26–0.53). The magnitude of cross-language variation is strongly model-dependent: GPT-5.2 and Gemini-2.5-Pro show modest shifts (V ≈ 0.19), while LLaMA-3.3-70B exhibits substantially stronger divergence (V = 0.52). These findings suggest that normative consistency across languages cannot be assumed and should be treated as a distinct dimension of model evaluation.
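For readers unfamiliar with the effect size reported above, the following is a minimal sketch of how Cramér's V can be computed from a contingency table of languages by response options. It uses the standard uncorrected formula V = sqrt(𝜒² / (n ⋅ (min(r, c) − 1))). The function name `cramers_v` and the toy counts are hypothetical, and the abstract does not state whether the paper applies a bias correction.

```python
import math


def cramers_v(table):
    """Uncorrected Cramér's V for an r x c table of observed counts.

    `table` is a list of rows, for example one row per prompt language and
    one column per responsibility option. Assumes every row total and every
    column total is non-zero, and that the table is at least 2 x 2.
    """
    n_rows = len(table)
    n_cols = len(table[0])
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]
    n = sum(row_totals)

    # Pearson chi-squared statistic against the independence model.
    chi2 = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected

    return math.sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))


# Hypothetical toy counts (not data from the paper): two languages, two options.
same_distribution = [[5, 5], [5, 5]]     # both languages answer identically
fully_separated = [[10, 0], [0, 10]]     # each language always picks a different option
print(cramers_v(same_distribution), cramers_v(fully_separated))
```

V ranges from 0 (response distribution independent of language) to 1 (language fully determines the response), which is the scale on which the reported values are read.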
This paper proposes a semi-automatic lexico-semantic modeling framework for Chinese chéngyǔ containing body-part and animal lexemes. The framework combines manual semantic annotation, lightweight RDF/OWL formalization and semantic classification in order to investigate whether lexical mediators such as 心 xīn “heart/mind”, 口 kǒu “mouth” or 马 mǎ “horse” are sufficient to predict idiomatic semantic interpretation. Based on 440 annotated chéngyǔ normalized into 18 semantic categories, we compare three classification approaches: a rule-based keyword baseline, character n-gram TF-IDF with logistic regression, and BERT-base-chinese. The results show that lexical mediators cannot be directly equated with semantic categories and that TF-IDF achieves the best overall performance, suggesting that lightweight character-level representations remain robust for very short idioms in low-resource settings. The study contributes an interpretable RDF/OWL-compatible resource for culture-aware modeling of Chinese idioms.
Cross-cultural psychology has shown that moral judgments about failures to help vary systematically across cultures. In a landmark study, Miller, Bersoff, and Harwood (1990) found that while Indian and American participants agreed that failures to help are undesirable, they differed in whether they considered helping a moral obligation subject to social sanction or a personal decision. We adapt Miller et al.’s paradigm, consisting of nine scenarios crossing need severity (life-threatening, moderate, minor) with role relationship (parent, friend, stranger) together with their original probe questions, to a cross-lingual LLM setting, presenting them to four LLMs (GPT-5.4, Claude-Opus-4.6, DeepSeek-V3.1, Qwen3-235B) across ten languages. We find that language significantly shapes how LLMs categorize failures to help as moral violations, social conventions, personal-moral concerns, or personal decisions (𝜒²(27) = 116.14, p < .001, Cramér’s V = 0.147). Models agree across languages that failures to help are undesirable, but diverge substantially in how they classify them, with the primary divergence falling between moral violations and personal decisions. The proportion of responses classifying failures as moral violations decreases as need severity decreases and the role relationship becomes more distant. Cross-lingual variation differs substantially across models, with open-weight models showing significantly stronger variation than closed-weight models. These findings indicate that users consulting LLMs in different languages may receive substantively different moral guidance, underscoring the need for cross-lingual normative auditing as a component of multilingual LLM evaluation.
Verbal humor involves reasoning through complex conversational contexts. Although LLMs have achieved strong performance on English humor datasets, their ability to interpret humor in Hindi remains unexplored. In this paper, we evaluate Hindi humor understanding using dialogues extracted from humorous video clips. We use a pipeline that transforms video content into detailed textual streams, including dialogue transcripts and scene descriptions, allowing reasoning over inputs exceeding 2,000 words. We test various LLMs, from efficient edge models (Qwen-2.5-7B, Qwen-3-7B, Gemma-3-27B) to Indic-focused models (Sarvam-M-24B) and large frontier models (Llama-3.1-70B, Gemini-2.0-Flash). Our findings show a concave performance pattern in long-context understanding, with reasoning quality peaking at moderate lengths (250–750 words) and declining at higher context lengths. We also show that standard metrics overstate pragmatic competence. While increasing model size generally improves performance, we also observe distinct failures in smaller LLMs due to instructional and linguistic issues, necessitating diversity metrics to capture hallucinations. Smaller, Hindi-focused models can compete with much larger generalist models. Importantly, our evaluation reveals that conversational humor is a challenge even for specialized models, making HinS a valuable benchmark for advancing research in Hindi Long-Context Humor Reasoning.
Conversational AI systems trained on large-scale web corpora inevitably encode the cultural values and interactional norms embedded in their training data, yet our understanding of how deployed LLMs reflect or reinforce culture-specific social expectations remains limited. This study examined how supportive versus challenging chatbot interaction styles shape user experience and continuance intention, and whether people-pleasing tendency (PPT) moderates these effects across cultures. Taiwanese (N = 49) and Korean (N = 52) participants completed a collaborative tourism-planning task. Results showed that: (1) supportive chatbots consistently led to higher continuance intention, satisfaction, and trust; (2) PPT did not moderate these effects; and (3) cultural variation emerged only in perceived threat, where higher PPT was associated with greater baseline threat in the Taiwanese but not the Korean sample. These findings reveal how a general-purpose LLM style may differentially activate culturally situated social scripts, raising implications for culturally inclusive conversational AI design.
Culture shapes how people interpret language, especially in online reviews containing culture-specific items (CSIs). Yet, most existing evaluations treat culture as a monolithic construct, offering no insight into which cultural dimensions pose difficulty for readers, or how large language models (LLMs), which power AI reading assistants, perform across them. This gap limits our ability to obtain reliable, cross-cultural estimates of model performance. To address this, we analyze CSIs in English Goodreads reviews across Newmark’s cultural dimensions (e.g., material, ecology, customs, habits, social) and evaluate six LLMs of varying sizes on their ability to identify CSIs within each dimension. We find that readers struggle most with CSIs from the material, customs, and social dimensions, while models underperform on more localized ones (e.g., habits), revealing systematic cultural blind spots. To support further research on culturally representative benchmarking, we release an expert-annotated dataset of CSIs labeled by cultural dimension. Empirical analysis shows that our dataset is more challenging and of higher quality than existing cultural benchmarks, enabling finer-grained evaluation of cultural understanding in models.