TokLens: A Multilingual Lens on Tokenizer Quality for LLMs

Guan-Ming Chiu


Abstract
We introduce TokLens, an open-source toolkit for evaluating tokenizer quality across languages using six intrinsic metrics: fertility, characters per token, compression ratio, normalized sequence length, single-token retention rate, and cross-lingual parity. We evaluate 24 tokenizers from major LLM families across 15 typologically diverse languages and correlate these metrics with downstream performance. Our analysis reveals stark disparities: GPT-2 produces 56x more tokens per word in Japanese than in English, while newer tokenizers like Qwen2.5 and Gemma-2 reduce this gap to under 4x. No intrinsic metric predicts English benchmark performance after controlling for model size. However, on multilingual benchmarks (MMLU-ProX), linear mixed-effects models show that tokenizer metrics significantly predict per-language performance (STRR: 𝛽 = +5.7, z = 18.5, p < 0.001). A controlled experiment on the Qwen2.5 family further shows that languages with higher single-token retention rate exhibit steeper scaling slopes (𝜌 = 0.91, p < 0.001). These results indicate that tokenizer quality is significantly associated with multilingual LLM performance, though the evidence remains correlational and partially confounded with pretraining data composition.
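As a rough illustration of three of the metrics named above, the sketch below computes fertility, characters per token, and single-token retention rate for whitespace-delimited text. The definitions are assumptions based on common usage (fertility as tokens per word, STRR as the share of words kept as one token) and may differ from those in the paper. `toy_tokenizer` and `intrinsic_metrics` are hypothetical names, not part of the TokLens API, and the fixed-width splitter only stands in for a real subword tokenizer.

```python
from typing import Callable, Dict, List

# A word-level tokenizer: maps one word to its list of subword pieces.
Tokenizer = Callable[[str], List[str]]


def toy_tokenizer(word: str, max_len: int = 4) -> List[str]:
    """Hypothetical stand-in for a subword tokenizer.

    Splits a word into consecutive chunks of at most max_len characters.
    """
    return [word[i:i + max_len] for i in range(0, len(word), max_len)]


def intrinsic_metrics(text: str, tokenize_word: Tokenizer) -> Dict[str, float]:
    """Compute three intrinsic metrics under the assumed definitions.

    fertility       = total tokens / total words
    chars_per_token = total characters (excluding spaces) / total tokens
    strr            = fraction of words encoded as exactly one token
    """
    words = text.split()  # assumes whitespace-delimited words
    if not words:
        raise ValueError("text must contain at least one word")

    pieces_per_word = [tokenize_word(w) for w in words]
    n_tokens = sum(len(pieces) for pieces in pieces_per_word)
    n_chars = sum(len(w) for w in words)
    n_single = sum(1 for pieces in pieces_per_word if len(pieces) == 1)

    return {
        "fertility": n_tokens / len(words),
        "chars_per_token": n_chars / n_tokens,
        "strr": n_single / len(words),
    }


metrics = intrinsic_metrics("the tokenizer splits words", toy_tokenizer)
print(metrics)
```

Whitespace splitting does not apply to languages such as Japanese, which the abstract discusses, so a real evaluation would need a language-appropriate word segmenter in place of `text.split()`.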
Anthology ID:
2026.acl-srw.18
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Santosh T.Y.S.S., Juan Diego Rodriguez, Ona de Gibert
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
188–205
URL:
https://preview.aclanthology.org/ingestion-form-platform/2026.acl-srw.18/
Cite (ACL):
Guan-Ming Chiu. 2026. TokLens: A Multilingual Lens on Tokenizer Quality for LLMs. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 188–205, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
TokLens: A Multilingual Lens on Tokenizer Quality for LLMs (Chiu, ACL 2026)
PDF:
https://preview.aclanthology.org/ingestion-form-platform/2026.acl-srw.18.pdf