Badal Nyalang


2026

Large language models (LLMs) perpetuate cultural stereotypes not only through biased associations but also through systematic omission and orthographic erasure of underrepresented languages. We present empirical evidence of two compounding failure modes affecting Northeast Indian languages: (1) entity-level invisibility, where state-of-the-art NER systems score F1=0.000 on culturally critical named entities such as Khasi surnames, Garo festivals, and tribal names; and (2) orthographic corruption, where LLM tokenizers corrupt semantically meaningful diacritics (ï, ñ) and the Garo morpheme boundary marker (U+00B7) at rates of 18.8–50% across four of five evaluated models. Drawing on NortheastNER (F1=0.964, six entity categories, XLM-RoBERTa-base) and a systematic tokenization study across Khasi and Garo, we argue that stereotype-by-omission constitutes a distinct and measurable harm to indigenous language communities. We further show that a custom multilingual tokenizer achieves 26–50% token reduction over five baseline LLMs, demonstrating that culturally grounded infrastructure can partially remediate these failures. Our findings call for cultural representation audits as a standard component of multilingual NLP evaluation.
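The orthographic corruption described in the abstract above can be illustrated with a round-trip check: pass each word through a text pipeline and count how many words containing ï, ñ, or U+00B7 come back changed. The sketch below is a minimal illustration under stated assumptions, not the study's measurement procedure. `corruption_rate`, `ascii_folding_round_trip`, and `SAMPLE` are hypothetical names, the lossy pipeline is a stand-in for a real tokenizer's encode/decode pair, and the sample strings are made-up placeholders rather than Khasi or Garo vocabulary.

```python
import unicodedata

# Characters the abstract identifies as semantically meaningful.
MARKED_CHARS = ("\u00ef", "\u00f1", "\u00b7")  # ï, ñ, middle dot (U+00B7)


def _nfc(text):
    """Normalize to NFC so composed and decomposed spellings compare equal."""
    return unicodedata.normalize("NFC", text)


def corruption_rate(words, round_trip):
    """Share of words containing a marked character that change after round_trip."""
    relevant = [w for w in map(_nfc, words) if any(ch in w for ch in MARKED_CHARS)]
    if not relevant:
        return 0.0
    damaged = sum(1 for w in relevant if _nfc(round_trip(w)) != w)
    return damaged / len(relevant)


def ascii_folding_round_trip(text):
    """Hypothetical lossy pipeline: decompose, then keep only ASCII characters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if ord(ch) < 128)


# Made-up placeholder strings, not real Khasi or Garo vocabulary.
SAMPLE = ["a\u00efb", "a\u00f1b", "a\u00b7b", "plain"]
```

To audit a real model, `round_trip` would be replaced by that model's own encode-then-decode call; the word without marked characters is ignored, so the rate reflects only the characters at risk.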
We present MeiteiRoBERTa, the first publicly available monolingual RoBERTa-based language model for Meitei (Manipuri), a low-resource language spoken by over 1.8 million people in Northeast India. Trained from scratch on 76 million words of Meitei text in Bengali script, our model achieves a perplexity of 65.89, representing a 5.2× improvement over the multilingual baselines mBERT (341.56) and MuRIL (355.65). Through comprehensive evaluation on perplexity, tokenization efficiency, and semantic representation quality, we demonstrate that domain-specific pre-training significantly outperforms general-purpose multilingual models for low-resource languages. Our model exhibits superior semantic understanding with 0.769 similarity separation compared to 0.035 for mBERT and near-zero for MuRIL, despite MuRIL’s better tokenization efficiency (fertility: 3.29 vs. 4.65). We publicly release the model, training code, and datasets to accelerate NLP research for Meitei and other underrepresented Northeast Indian languages.
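Two quantities in the abstract above lend themselves to a short worked example: the perplexity ratio and tokenization fertility. The sketch below assumes fertility means the average number of subword tokens per whitespace-delimited word, which is a common definition but may not match the paper's exactly. `perplexity_ratio`, `fertility`, and `two_char_tokenize` are hypothetical names, and the toy tokenizer is a placeholder for a real subword tokenizer.

```python
def perplexity_ratio(baseline_ppl, model_ppl):
    """How many times lower the model's perplexity is than the baseline's."""
    return baseline_ppl / model_ppl


def fertility(sentences, tokenize):
    """Average number of tokens produced per whitespace-delimited word."""
    total_words = sum(len(s.split()) for s in sentences)
    total_tokens = sum(len(tokenize(s)) for s in sentences)
    return total_tokens / total_words


def two_char_tokenize(sentence):
    """Toy stand-in tokenizer: cut every word into chunks of two characters."""
    tokens = []
    for word in sentence.split():
        tokens.extend(word[i:i + 2] for i in range(0, len(word), 2))
    return tokens


# Perplexities as reported in the abstract.
MEITEI_ROBERTA_PPL = 65.89
BASELINE_PPL = {"first_baseline": 341.56, "MuRIL": 355.65}

ratios = {name: perplexity_ratio(ppl, MEITEI_ROBERTA_PPL)
          for name, ppl in BASELINE_PPL.items()}
```

With the reported numbers, the ratio is about 5.2 against the first baseline and about 5.4 against MuRIL, so the quoted 5.2× is the more conservative of the two. Lower fertility means fewer tokens per word, which is why 3.29 counts as better than 4.65.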
Large pretrained language models have demonstrated remarkable capabilities across diverse languages, yet critically underrepresented low-resource languages remain marginalized. We present NE-BERT, a domain-specific multilingual encoder model trained on approximately 8.3 million sentences spanning 9 languages of Northeast India, a linguistically diverse region with minimal representation in existing multilingual models, and 2 anchor languages (Hindi, English). By employing weighted data sampling and a custom SentencePiece Unigram tokenizer, NE-BERT outperforms IndicBERT-V2 and MuRIL across all 9 Northeast Indian languages, achieving 15.97× and 7.64× lower average perplexity, respectively, with 1.50× better tokenization fertility than mBERT. We address critical vocabulary fragmentation issues in extremely low-resource languages such as Pnar (1,002 sentences) and Kokborok (2,463 sentences) through aggressive upsampling strategies. Downstream evaluation on part-of-speech tagging validates practical utility on three Northeast Indian languages. We release NE-BERT, test sets, and training corpus under CC-BY-4.0 to support NLP research and digital inclusion for Northeast Indian communities.
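The abstract above does not specify how the weighted sampling or upsampling is computed. One widely used scheme, shown here purely as an assumption, is exponent-smoothed sampling, where each language is drawn with probability proportional to its sentence count raised to a power below 1. `sampling_weights`, the `alpha` value, and the `high_resource_placeholder` count are hypothetical; only the Pnar and Kokborok counts come from the abstract.

```python
def sampling_weights(counts, alpha=0.3):
    """Sampling probability per language, proportional to count ** alpha.

    alpha = 1.0 reproduces the raw data proportions; smaller values shift
    probability mass toward the smallest corpora (i.e. upsample them).
    """
    scaled = {lang: n ** alpha for lang, n in counts.items()}
    total = sum(scaled.values())
    return {lang: value / total for lang, value in scaled.items()}


# Pnar and Kokborok counts are from the abstract; the third entry is a
# hypothetical stand-in for a much larger language in the corpus.
COUNTS = {
    "pnar": 1_002,
    "kokborok": 2_463,
    "high_resource_placeholder": 1_000_000,
}

raw = sampling_weights(COUNTS, alpha=1.0)       # proportional to corpus size
smoothed = sampling_weights(COUNTS, alpha=0.3)  # small corpora upsampled
```

Comparing `raw["pnar"]` with `smoothed["pnar"]` shows how far smoothing lifts the smallest corpus. Heavy upsampling also means the same few sentences are seen many times, which is a trade-off any such scheme has to manage.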