Do Tokenizers Fail on Informal Hindi Expressions? Evidence from Static, Downstream, and Robustness Analyses
Manikandan Ravikiran, Tanmay Tiwari, Vibhu Gupta, Rakesh Prakash, Rohit Saluja, Shayan Mohanty
Abstract
We present, to our knowledge, the first systematic evaluation of tokenization quality for informal Hindi expressions, combining static, downstream, and robustness analyses. Our investigation centers on three questions: (RQ1) how well tokenizers preserve informal expression units, as measured by static boundary and integrity metrics; (RQ2) how tokenization choices affect downstream identification of informal expressions; and (RQ3) how robust tokenizers remain under orthographic variation, romanization, and noisy spelling. Across multilingual, Indic-focused, and byte-level tokenizers, we find that Indic-oriented models (e.g., MuRIL, IndicBERT) preserve expression boundaries better and achieve higher downstream F1 on clean text than generic multilingual models (e.g., mBERT, XLM-R). However, all tokenizers exhibit severe degradation under romanization, with phrase integrity rates approaching zero. These findings demonstrate that tokenization constitutes a hidden but critical bottleneck for informal Hindi NLP, particularly in cross-script settings, and motivate tokenization strategies that explicitly account for phrase-level semantics and orthographic variation.
- Anthology ID:
- 2026.loreslm-1.2
- Volume:
- Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026)
- Month:
- March
- Year:
- 2026
- Address:
- Rabat, Morocco
- Editors:
- Hansi Hettiarachchi, Tharindu Ranasinghe, Alistair Plum, Paul Rayson, Ruslan Mitkov, Mohamed Gaber, Damith Premasiri, Fiona Anting Tan, Lasitha Uyangodage
- Venue:
- LoResLM
- Publisher:
- Association for Computational Linguistics
- Pages:
- 13–28
- URL:
- https://preview.aclanthology.org/manual-author-scripts/2026.loreslm-1.2/
- Cite (ACL):
- Manikandan Ravikiran, Tanmay Tiwari, Vibhu Gupta, Rakesh Prakash, Rohit Saluja, and Shayan Mohanty. 2026. Do Tokenizers Fail on Informal Hindi Expressions? Evidence from Static, Downstream, and Robustness Analyses. In Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026), pages 13–28, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal):
- Do Tokenizers Fail on Informal Hindi Expressions? Evidence from Static, Downstream, and Robustness Analyses (Ravikiran et al., LoResLM 2026)
- PDF:
- https://preview.aclanthology.org/manual-author-scripts/2026.loreslm-1.2.pdf