Stereotyped by Silence: How LLMs Erase Northeast Indian Languages Through Omission and Orthographic Corruption

Badal Nyalang


Abstract

Large language models (LLMs) perpetuate cultural stereotypes not only through biased associations but also through systematic omission and orthographic erasure of underrepresented languages. We present empirical evidence of two compounding failure modes affecting Northeast Indian languages: (1) entity-level invisibility, where state-of-the-art NER systems score F1=0.000 on culturally critical named entities such as Khasi surnames, Garo festivals, and tribal names; and (2) orthographic corruption, where LLM tokenizers corrupt semantically meaningful diacritics (ï, ñ) and the Garo morpheme boundary marker (U+00B7) at rates of 18.8–50% across four of five evaluated models. Drawing on NortheastNER (F1=0.964, six entity categories, XLM-RoBERTa-base) and a systematic tokenization study across Khasi and Garo, we argue that stereotype-by-omission constitutes a distinct and measurable harm to indigenous language communities. We further show that a custom multilingual tokenizer achieves 26–50% token reduction over five baseline LLMs, demonstrating that culturally grounded infrastructure can partially remediate these failures. Our findings call for cultural representation audits as a standard component of multilingual NLP evaluation.
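The two tokenization measurements named in the abstract can be illustrated with a minimal sketch. The abstract does not describe the paper's actual procedure, so everything below is an assumption: the round-trip definition of "corruption", the function names `corruption_rate` and `token_reduction`, the toy ASCII-folding tokenizer, and the placeholder strings (which are not real Khasi or Garo vocabulary).

```python
import unicodedata

# Characters the abstract names as semantically meaningful:
# ï (U+00EF), ñ (U+00F1), and the Garo morpheme boundary marker (U+00B7).
TARGET_CHARS = {"\u00ef", "\u00f1", "\u00b7"}


def corruption_rate(words, encode, decode):
    """Share of words containing a target character that do not survive
    an encode/decode round trip (an assumed definition of corruption)."""
    nfc = [unicodedata.normalize("NFC", w) for w in words]
    relevant = [w for w in nfc if TARGET_CHARS & set(w)]
    if not relevant:
        return 0.0
    corrupted = sum(
        1 for w in relevant
        if unicodedata.normalize("NFC", decode(encode(w))) != w
    )
    return corrupted / len(relevant)


def token_reduction(baseline_tokens, custom_tokens):
    """Fractional reduction in token count relative to a baseline tokenizer."""
    return 1.0 - custom_tokens / baseline_tokens


# Hypothetical tokenizers, standing in for real LLM tokenizers.
def lossy_encode(word):
    # Simulates a lossy pipeline: decompose, then drop all non-ASCII characters.
    folded = unicodedata.normalize("NFKD", word)
    return list(folded.encode("ascii", "ignore").decode("ascii"))


def lossless_encode(word):
    # Character-level tokens; preserves every code point.
    return list(word)


def decode(tokens):
    return "".join(tokens)


# Placeholder strings only; not real Khasi or Garo words.
SAMPLE = ["x\u00efx", "x\u00f1x", "x\u00b7x", "xyz"]

lossy_rate = corruption_rate(SAMPLE, lossy_encode, decode)
lossless_rate = corruption_rate(SAMPLE, lossless_encode, decode)
```

In practice the `encode` and `decode` arguments would wrap each evaluated model's tokenizer, and `token_reduction` would be fed token counts measured on the same text for the baseline and the custom tokenizer.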

Anthology ID: 2026.stereacult-1.6
Volume: Proceedings of the 1st Workshop on Stereotypes Across Cultures in Language Technologies (StereACuLT 2026)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Weicheng Ma, Soroush Vosoughi, Nabeel Gillani, Rolando Coto-Solano
Venues: StereACuLT | WS
Publisher: Association for Computational Linguistics
Pages: 62–68
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.stereacult-1.6/
Cite (ACL): Badal Nyalang. 2026. Stereotyped by Silence: How LLMs Erase Northeast Indian Languages Through Omission and Orthographic Corruption. In Proceedings of the 1st Workshop on Stereotypes Across Cultures in Language Technologies (StereACuLT 2026), pages 62–68, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Stereotyped by Silence: How LLMs Erase Northeast Indian Languages Through Omission and Orthographic Corruption (Nyalang, StereACuLT 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.stereacult-1.6.pdf
