Waleed Jamil


2026

Tokenization is central to modern language models, yet its effects on cross-script efficiency, input cost, and truncation behavior remain underexplored. We study this issue through aligned comparisons of Urdu and Roman Urdu, asking whether semantically equivalent content incurs systematically different tokenization costs across scripts. We introduce Token Cost Inequality (TCI), a metric for quantifying relative tokenization efficiency under semantic alignment, and propose a multi-axis framework spanning token cost, fragmentation, and fixed-budget retention. Across three tokenizer families (cl100k, mT5, and ByT5), we find that tokenization disparities are strongly tokenizer-dependent, with substantial differences in token cost and segmentation behavior across scripts. We further identify an efficiency-retention paradox: token cost alone does not fully explain truncation behavior. Under fixed token budgets, Roman Urdu preserves more character-level content than native Urdu, reflecting differences in character-per-token density and fragmentation. Lightweight normalization yields minimal gains, suggesting that the observed disparities arise primarily from tokenizer design rather than superficial orthographic variation. These findings provide controlled evidence that fixed token budgets can produce unequal surface-coverage conditions across scripts, with implications for input-side cost estimation, benchmark design, and multilingual evaluation under constrained token budgets.
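As a rough illustration of the token-cost and fixed-budget retention axes described above, the following sketch uses a byte-level tokenizer, in the spirit of ByT5, which operates on UTF-8 bytes. The abstract does not give the formula for TCI, so the plain token-count ratio below is an assumption, as are the function names and the example sentence pair; none of this should be read as the paper's implementation.

```python
from typing import Callable, List

Tokenizer = Callable[[str], List[int]]


def byte_tokens(text: str) -> List[int]:
    """Byte-level tokenization: one token per UTF-8 byte (special tokens ignored)."""
    return list(text.encode("utf-8"))


def token_cost_ratio(text_a: str, text_b: str, tokenize: Tokenizer = byte_tokens) -> float:
    """ASSUMED stand-in for TCI: tokens(text_a) / tokens(text_b) for aligned texts."""
    return len(tokenize(text_a)) / len(tokenize(text_b))


def chars_retained(text: str, budget: int) -> int:
    """Whole characters that survive truncation to `budget` byte-level tokens."""
    kept = bytes(byte_tokens(text)[:budget])
    # A character cut in half by the budget is dropped rather than counted.
    return len(kept.decode("utf-8", errors="ignore"))


# Hypothetical aligned pair meaning "I am fine".
# The escapes spell the Urdu-script sentence; each letter takes 2 UTF-8 bytes.
urdu = "\u0645\u06cc\u06ba \u0679\u06be\u06cc\u06a9 \u06c1\u0648\u06ba"
roman = "main theek hoon"

ratio = token_cost_ratio(urdu, roman)
urdu_kept = chars_retained(urdu, budget=10)
roman_kept = chars_retained(roman, budget=10)

print(f"token cost ratio (Urdu / Roman Urdu): {ratio:.2f}")
print(f"characters kept in 10 tokens: Urdu={urdu_kept}, Roman Urdu={roman_kept}")
```

With a byte-level tokenizer, the Urdu-script sentence costs more tokens than its Latin-script counterpart and retains fewer characters under the same budget. Subword tokenizers such as cl100k or mT5 would need their own tokenization function passed in and may behave differently.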
Large Language Models exhibit robust safety alignment when harmful intent is expressed in English, yet their resilience to code-switching and transliteration remains underexplored. This paper presents the first targeted investigation of code-switching as a safety failure mode, focusing on Roman Urdu, a widely used transliterated form common in informal and emotionally expressive communication. We introduce the Roman Urdu Adversarial Benchmark (RUAB), a semantically controlled evaluation benchmark designed to isolate linguistic variation from intent across four safety-critical categories: passive suicidal ideation, psychological distress, threat or intimidation, and coercion or emotional manipulation. Evaluating seven state-of-the-art models, we find that safety detection degrades consistently on code-switched and transliterated inputs, with the most pronounced failures occurring for passive suicidal ideation. Instruction-tuned and reasoning-capable models demonstrate greater robustness, suggesting that these failures reflect alignment gaps rather than inherent model limitations. Our findings highlight transliteration and code-switching as under-recognized safety risks and motivate the development of linguistically inclusive, transliteration-aware safety methods.