Token Cost Inequality: Measuring Tokenization Disparities Across Scripts in Roman Urdu and Urdu

Waleed Jamil, Saima Rafi, Yanchao Yu


Abstract
Tokenization is central to modern language models, yet its effects on cross-script efficiency, input cost, and truncation behavior remain underexplored. We study this issue through aligned comparisons of Urdu and Roman Urdu, asking whether semantically equivalent content incurs systematically different tokenization costs across scripts. We introduce Token Cost Inequality (TCI), a metric for quantifying relative tokenization efficiency under semantic alignment, and propose a multi-axis framework spanning token cost, fragmentation, and fixed-budget retention. Across three tokenizer families (cl100k, mT5, and ByT5), we find that tokenization disparities are strongly tokenizer-dependent, with substantial differences in token cost and segmentation behavior across scripts. We further identify an efficiency-retention paradox: token cost alone does not fully explain truncation behavior. Under fixed token budgets, Roman Urdu preserves more character-level content than native Urdu, reflecting differences in character-per-token density and fragmentation. Lightweight normalization yields minimal gains, suggesting that the observed disparities arise primarily from tokenizer design rather than superficial orthographic variation. These findings provide controlled evidence that fixed token budgets can produce unequal surface-coverage conditions across scripts, with implications for input-side cost estimation, benchmark design, and multilingual evaluation under constrained token budgets.
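The abstract contrasts two measurements: the relative token cost of aligned text in the two scripts, and the amount of character-level content that survives a fixed token budget. The exact TCI formula is not given on this page, so the sketch below does not implement TCI itself. It assumes a simple token-count ratio as a stand-in and uses a toy byte-level tokenizer (one token per UTF-8 byte, in the spirit of ByT5). The function names and the sentence pair are hypothetical illustrations, not taken from the paper.

```python
# Minimal sketch, NOT the paper's implementation.
# Assumptions: a toy byte-level tokenizer (one token per UTF-8 byte) and a
# plain token-count ratio standing in for the paper's TCI metric.

def byte_token_count(text: str) -> int:
    """Number of tokens under the toy tokenizer: one per UTF-8 byte."""
    return len(text.encode("utf-8"))


def token_cost_ratio(text_a: str, text_b: str) -> float:
    """Token count of text_a divided by token count of text_b.

    A value above 1.0 means text_a costs more tokens than its aligned
    counterpart text_b under this tokenizer.
    """
    return byte_token_count(text_a) / byte_token_count(text_b)


def chars_retained(text: str, budget: int) -> int:
    """Count whole characters that fit in the first `budget` byte tokens."""
    used = 0
    kept = 0
    for ch in text:
        cost = len(ch.encode("utf-8"))  # bytes needed for this character
        if used + cost > budget:
            break  # this character would be cut off by truncation
        used += cost
        kept += 1
    return kept


# Hypothetical aligned pair ("I am fine"); the Urdu string is written with
# escapes so the code points are unambiguous.
urdu = "\u0645\u06cc\u06ba \u0679\u06be\u06cc\u06a9 \u06c1\u0648\u06ba"
roman = "main theek hoon"

print(token_cost_ratio(urdu, roman))
print(chars_retained(urdu, 10), chars_retained(roman, 10))
```

Under this toy tokenizer each Arabic-script letter costs two tokens while each Latin letter costs one, so the same budget covers fewer Urdu characters than Roman Urdu characters. Subword tokenizers such as cl100k or mT5 would need their own token counts in place of `byte_token_count`, and the ratios could differ substantially.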
Anthology ID:
2026.gem-main.54
Volume:
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
Venues:
GEM | WS
Publisher:
Association for Computational Linguistics
Pages:
563–573
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.54/
Cite (ACL):
Waleed Jamil, Saima Rafi, and Yanchao Yu. 2026. Token Cost Inequality: Measuring Tokenization Disparities Across Scripts in Roman Urdu and Urdu. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 563–573, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
Token Cost Inequality: Measuring Tokenization Disparities Across Scripts in Roman Urdu and Urdu (Jamil et al., GEM 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.54.pdf