Soumedhik Bharati

2026

The American Palimpsest: Quantifying South Asian English Dialect Erasure in LLMs
Soumedhik Bharati | Shibam Mandal | Swarup Kr Ghosh | Sayani Mondal
Proceedings of the 4th Workshop on Cross-Cultural Considerations in NLP (C3NLP 2026)

Large Language Models are increasingly deployed as writing assistants for users in the Global South, yet rewriting prompts can suppress institutionalized postcolonial varieties. We quantify South Asian English (SAsE) dialect erasure in a state-of-the-art open-weight model using a 500-sentence diagnostic benchmark (320 lexical and 180 syntactic markers). On Llama 3.3 70B, standard grammar correction retains only 26.0% of markers (lexical 31.2%; syntactic 16.7%), while formalization is more destructive (14.0% overall retention). For lexical items, we observe Americanization in 56.2% (correction) and 59.4% (formalization) of cases, typically via Standard American paraphrases. A simple dialect-aware prompt raises retention to 92.0% and reduces lexical Americanization to 6.2%, although some function-word phenomena remain resistant. A stress test shows even stronger suppression (6.7% retention). We position dialect erasure within representational-harm and cultural-competence frameworks, and provide a replicable protocol for auditing writing-assistance systems.


The Mirage of Diversity: Unmasking the Cultural Vocabulary Ceiling in LLMs
Soumedhik Bharati | Subhrajit Mukherjee | Shibam Mandal
Proceedings of the 4th Workshop on Cross-Cultural Considerations in NLP (C3NLP 2026)

Large Language Models are widely used to generate and adapt cultural texts, yet the depth of their cultural representation remains poorly quantified. Intuitively, as a narrative text expands in length, the diversity of cultural words should scale proportionately. To formally test this, we evaluate the FairyTaleQA dataset as adapted by three models and introduce our primary contribution: the Contextual Stereotype Amplification Index (CSAI), an evaluation framework combining LLM-as-a-judge extraction, embedding-based cliché anchoring, and Natural Language Inference (NLI) congruence validation. By mapping the frequency of extracted Culture Specific Items (CSIs) against narrative length using Heaps’ Law (V = k ⋅ T^β), we present empirical evidence of a systematic limitation in current systems: they struggle to scale cultural diversity even under explicit cultural prompting. Models rapidly hit a "Cultural Vocabulary Ceiling," constrained to a fixed set of hyper-stereotypical terms. Furthermore, we demonstrate that merely optimizing for higher CSI frequency, as done in prior work, rewards logically broken tokenism. Our CSAI formulation actively penalizes such gratuitous stereotyping, offering a more principled approach to measuring and evaluating cultural homogenization in generative AI systems.
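
As an illustration of the Heaps' Law relation mentioned in the abstract, the following is a minimal sketch of how k and β could be estimated from (text length, distinct-item count) pairs by least squares in log-log space. It is not the authors' pipeline: the function name `fit_heaps` and the synthetic data points are assumptions made for this example only.

```python
import math

def fit_heaps(lengths, vocab_sizes):
    """Fit V = k * T**beta by ordinary least squares on (log T, log V)."""
    xs = [math.log(t) for t in lengths]
    ys = [math.log(v) for v in vocab_sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                        # slope in log-log space
    k = math.exp(mean_y - beta * mean_x)    # intercept mapped back to linear scale
    return k, beta

# Synthetic points generated from V = 3 * T**0.5 (hypothetical, not data from the paper).
lengths = [100, 200, 400, 800, 1600]
vocab_sizes = [3.0 * t ** 0.5 for t in lengths]
k, beta = fit_heaps(lengths, vocab_sizes)
print(f"k = {k:.3f}, beta = {beta:.3f}")
```

Under this reading, a fitted β close to 0 would mean the count of distinct cultural items barely grows as the text gets longer, which is the kind of ceiling the abstract describes.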


FROST: Factual Reasoning via Optimized Stochastic Trajectories in Large Language Models during Inference
Soumedhik Bharati | Ebad Shabbir | Jiechao Gao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)

Large language models face a trade-off between factual consistency and reasoning diversity: deterministic decoding prioritizes reliability but may miss alternative solution paths, while high-temperature sampling increases exploration at the cost of accuracy. We present FROST (Factual Reasoning via Optimized Stochastic Trajectories), an inference-time framework that balances exploration and exploitation without additional training or context augmentation. FROST combines deterministic inference from a large model with targeted stochastic sampling from a smaller model, selecting outputs via multi-criteria validation over coherence, factual grounding, and semantic novelty. Across HotpotQA, CommonsenseQA, and MMLU, FROST achieves 2–5 percentage point improvements over standard chain-of-thought prompting and reduces unsupported outputs by 40% relative to Standard CoT. Compared to Self-Consistency ensembles, FROST delivers comparable accuracy at 28% lower inference cost through strategic delegation to smaller models. On an adversarial subset with unanswerable queries, FROST abstains on 34% of cases versus 8% for standard chain-of-thought, reducing false positives by 45%. Task-stratified evaluation shows that exploration benefits scale with problem ambiguity. Generalization to mathematical reasoning, code generation, and multimodal domains remains future work.


Morphological Feature Extraction for Fine-Grained Sorani Kurdish Dialect Identification: A Hybrid Transformer-Linguistic Approach
Soumedhik Bharati | Shibam Mandal | Subham Majumdar | Swarup Kr Ghosh | Sayani Mondal
Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script

Sorani Kurdish is reportedly spoken by approximately 6 million people in Iraq and Iran. It exhibits substantial regional variation but lacks computational resources for dialect identification. We present the first fine-grained sub-dialect classification system for six Sorani varieties: Sulaymaniyah, Erbil, Iranian Sorani, Ardalani, Babani, and Mukriani. Our approach combines cross-lingual contextual embeddings (XLM-RoBERTa) with morphological features derived from explicit linguistic rules, including 24 patterns capturing verb prefixes, pronominal clitics, and definite markers. The proposed morphology-augmented XLM-R model is trained on a unified dataset of 16,409 sentences without manual annotation and achieves 91.91% accuracy, outperforming pure transformers (91.79%) and traditional machine learning baselines (SVM 86.41%). Key ablation studies reveal that morphological features serve as effective regularizers for geographically proximate dialects.


AjamiMorph: Zero-Annotation Morphological Discovery for Hausa Ajami via Multi-Method Consensus
Soumedhik Bharati | Shibam Mandal | Prithwish Ghosh | Swarup Kr Ghosh | Sayani Mondal
Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script

Hausa Ajami (Hausa written in Arabic script) remains severely under-resourced for computational morphology. We present AjamiMorph, a zero-annotation framework that discovers morphemes through consensus among three unsupervised methods: Byte Pair Encoding (BPE), transition-based boundary detection using Pointwise Mutual Information (PMI), and computational-linguistics-based Distributional Affix Mining (DAM). Using a Hausa Ajami Bible corpus of 637,414 tokens, AjamiMorph identifies 1,611 high-confidence morphemes, achieving 99.9% coverage. The inventory exhibits a linguistically realistic distribution (66.0% stems, 22.6% suffixes, 11.4% prefixes) and recovers 77.8% of known Hausa affixes. A permutation test that shuffles method assignments (preserving per-method selection sizes) confirms that the observed agreement is above chance; a chi-square test is retained as a secondary check. A lightweight 5-gram LM comparison (characters vs. consensus morphemes) provides an extrinsic signal. We also report negative results for script-driven Arabic assumptions and LLM-first annotation. This work provides the first unsupervised morpheme inventory for Hausa Ajami and demonstrates consensus as a robust strategy for zero-resource morphology.
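
The permutation test described in the abstract could look roughly like the sketch below. This is one plausible reading, not the authors' implementation: it assumes each method selects a set of candidates from a shared pool, measures agreement as the number of candidates chosen by at least two methods, and builds the null distribution by redrawing each method's selection at random while keeping its size fixed. The names `consensus_count` and `permutation_p_value` and the toy data are hypothetical.

```python
import random

def consensus_count(selections, min_votes=2):
    """Number of candidates selected by at least `min_votes` methods."""
    votes = {}
    for selected in selections:
        for item in selected:
            votes[item] = votes.get(item, 0) + 1
    return sum(1 for v in votes.values() if v >= min_votes)

def permutation_p_value(universe, selections, n_perm=1000, min_votes=2, seed=0):
    """One-sided p-value for observed agreement under size-preserving random selection."""
    rng = random.Random(seed)
    pool = sorted(universe)
    observed = consensus_count(selections, min_votes)
    hits = 0
    for _ in range(n_perm):
        # Each method keeps its selection size but picks random candidates.
        shuffled = [set(rng.sample(pool, len(sel))) for sel in selections]
        if consensus_count(shuffled, min_votes) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Toy data (hypothetical): 200 candidates, three methods sharing 15 picks.
universe = range(200)
shared = set(range(15))
selections = [
    shared | set(range(15, 20)),
    shared | set(range(20, 25)),
    shared | set(range(25, 30)),
]
observed, p_value = permutation_p_value(universe, selections)
print(observed, p_value)
```

A small p-value here indicates that the three methods agree on more candidates than size-matched random selections would, which is the sense of "above chance" used in the abstract.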

Co-authors

Subham Majumdar 1

Subhrajit Mukherjee 1

Ebad Shabbir 1
