Badal Nyalang
2026
Stereotyped by Silence: How LLMs Erase Northeast Indian Languages Through Omission and Orthographic Corruption
Badal Nyalang
Proceedings of the 1st Workshop on Stereotypes Across Cultures in Language Technologies (StereACuLT 2026)
Large language models (LLMs) perpetuate cultural stereotypes not only through biased associations but through systematic omission and orthographic erasure of underrepresented languages. We present empirical evidence of two compounding failure modes affecting Northeast Indian languages: (1) entity-level invisibility, where state-of-the-art NER systems score F1=0.000 on culturally critical named entities such as Khasi surnames, Garo festivals, and tribal names; and (2) orthographic corruption, where LLM tokenizers corrupt semantically meaningful diacritics (ï, ñ) and the Garo morpheme boundary marker (U+00B7) at rates of 18.8–50% across four of five evaluated models. Drawing on NortheastNER (F1=0.964, six entity categories, XLM-RoBERTa-base) and a systematic tokenization study across Khasi and Garo, we argue that stereotype-by-omission constitutes a distinct and measurable harm to indigenous language communities. We further show that a custom multilingual tokenizer achieves 26–50% token reduction over five baseline LLMs, demonstrating that culturally grounded infrastructure can partially remediate these failures. Our findings call for cultural representation audits as a standard component of multilingual NLP evaluation.
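One way to picture the orthographic-corruption measurement is a round-trip check: encode a word, decode it, and see whether the diacritics and the U+00B7 marker survive. The sketch below is an assumption about how such a rate could be computed, not the paper's procedure. `corruption_rate` and `lossy_roundtrip` are hypothetical names, and `lossy_roundtrip` is a toy stand-in for a real tokenizer's encode/decode pair.

```python
import unicodedata
from typing import Callable, Iterable

# Characters the abstract names as semantically meaningful.
TARGET_CHARS = {"\u00ef", "\u00f1", "\u00b7"}  # ï, ñ, middle dot


def corruption_rate(roundtrip: Callable[[str], str], words: Iterable[str]) -> float:
    """Fraction of words containing a target character that change after a round trip."""
    # Normalize to NFC so precomposed ï / ñ are detected consistently.
    normalized = [unicodedata.normalize("NFC", w) for w in words]
    relevant = [w for w in normalized if any(c in TARGET_CHARS for c in w)]
    if not relevant:
        return 0.0
    corrupted = sum(1 for w in relevant if roundtrip(w) != w)
    return corrupted / len(relevant)


def lossy_roundtrip(text: str) -> str:
    """Hypothetical lossy tokenizer: decomposes characters and drops combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


if __name__ == "__main__":
    sample = ["\u00efing", "ka", "a\u00f1a"]  # illustrative strings only
    print(corruption_rate(lossy_roundtrip, sample))
    print(corruption_rate(lambda s: s, sample))
```

With a real model, `roundtrip` would wrap that tokenizer's own encode and decode calls. Only words containing a target character count toward the denominator.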
Beyond Multilinguality: Typological Limitations in Multilingual Models for Meitei Language
Badal Nyalang
Proceedings of the 8th Workshop on Research in Computational Linguistic Typology and Multilingual NLP
We present MeiteiRoBERTa, the first publicly available monolingual RoBERTa-based language model for Meitei (Manipuri), a low-resource language spoken by over 1.8 million people in Northeast India. Trained from scratch on 76 million words of Meitei text in Bengali script, our model achieves a perplexity of 65.89, representing a 5.2× improvement over the multilingual baselines mBERT (341.56) and MuRIL (355.65). Through comprehensive evaluation on perplexity, tokenization efficiency, and semantic representation quality, we demonstrate that domain-specific pre-training significantly outperforms general-purpose multilingual models for low-resource languages. Our model exhibits superior semantic understanding with 0.769 similarity separation compared to 0.035 for mBERT and near-zero for MuRIL, despite MuRIL’s better tokenization efficiency (fertility: 3.29 vs. 4.65). We publicly release the model, training code, and datasets to accelerate NLP research for Meitei and other underrepresented Northeast Indian languages.
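For readers unfamiliar with the two metrics quoted above, here is a minimal sketch. It assumes fertility means the average number of subword tokens per whitespace-delimited word, which is the usual definition, though the abstract does not spell it out. `toy_tokenize` is a hypothetical stand-in for a real tokenizer, and `perplexity_ratio` simply divides two reported perplexities.

```python
from typing import Callable, Iterable, List


def fertility(tokenize: Callable[[str], List[str]], sentences: Iterable[str]) -> float:
    """Average number of subword tokens per whitespace-delimited word."""
    n_tokens = 0
    n_words = 0
    for sentence in sentences:
        n_words += len(sentence.split())
        n_tokens += len(tokenize(sentence))
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


def toy_tokenize(text: str, piece_len: int = 3) -> List[str]:
    """Hypothetical tokenizer: cuts every word into chunks of at most piece_len characters."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + piece_len] for i in range(0, len(word), piece_len))
    return pieces


def perplexity_ratio(baseline_ppl: float, model_ppl: float) -> float:
    """How many times lower the model's perplexity is than the baseline's."""
    return baseline_ppl / model_ppl


if __name__ == "__main__":
    print(fertility(toy_tokenize, ["abcdef gh"]))    # 3 pieces over 2 words
    print(round(perplexity_ratio(341.56, 65.89), 1))  # mBERT vs. MeiteiRoBERTa
```

Lower fertility means fewer tokens per word. The abstract's point is that this alone does not guarantee better representations.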
NE-BERT: A Multilingual Language Model for Nine Northeast Indian Languages
Badal Nyalang
Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026)
Large pretrained language models have demonstrated remarkable capabilities across diverse languages, yet critically underrepresented low-resource languages remain marginalized. We present NE-BERT, a domain-specific multilingual encoder model trained on approximately 8.3 million sentences spanning 9 languages of Northeast India, a linguistically diverse region with minimal representation in existing multilingual models, and 2 anchor languages (Hindi, English). By employing weighted data sampling and a custom SentencePiece Unigram tokenizer, NE-BERT outperforms IndicBERT-V2 and MuRIL across all 9 Northeast Indian languages, achieving 15.97× and 7.64× lower average perplexity respectively, with 1.50× better tokenization fertility than mBERT. We address critical vocabulary fragmentation issues in extremely low-resource languages such as Pnar (1,002 sentences) and Kokborok (2,463 sentences) through aggressive upsampling strategies. Downstream evaluation on part-of-speech tagging validates practical utility on three Northeast Indian languages. We release NE-BERT, test sets, and training corpus under CC-BY-4.0 to support NLP research and digital inclusion for Northeast Indian communities.
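The abstract does not say which weighting scheme NE-BERT uses. As an illustration only, the sketch below shows exponent-smoothed sampling, a common way to upsample small corpora, where each language is drawn with probability proportional to its sentence count raised to a power `alpha` below 1. The function name, the `alpha` value, and the large-language count are assumptions. Only the Pnar and Kokborok counts come from the abstract.

```python
from typing import Dict


def sampling_weights(counts: Dict[str, int], alpha: float = 0.3) -> Dict[str, float]:
    """Sampling probability per language, proportional to count ** alpha.

    alpha = 1.0 reproduces the raw data proportions; smaller values
    shift probability mass toward low-resource languages.
    """
    scaled = {lang: n ** alpha for lang, n in counts.items()}
    total = sum(scaled.values())
    return {lang: value / total for lang, value in scaled.items()}


if __name__ == "__main__":
    counts = {
        "pnar": 1002,              # from the abstract
        "kokborok": 2463,          # from the abstract
        "large_language": 1_000_000,  # hypothetical count for contrast
    }
    raw = sampling_weights(counts, alpha=1.0)
    smoothed = sampling_weights(counts, alpha=0.3)
    for lang in counts:
        print(f"{lang}: raw={raw[lang]:.4f} smoothed={smoothed[lang]:.4f}")
```

Under this scheme the smallest corpus is sampled far more often than its raw share would allow, at the cost of repeating its sentences many times during training.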