Guan-Ming Chiu
2026
TokLens: A Multilingual Lens on Tokenizer Quality for LLMs
Guan-Ming Chiu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
We introduce TokLens, an open-source toolkit for evaluating tokenizer quality across languages using six intrinsic metrics: fertility, characters per token, compression ratio, normalized sequence length, single-token retention rate, and cross-lingual parity. We evaluate 24 tokenizers from major LLM families across 15 typologically diverse languages and correlate these metrics with downstream performance. Our analysis reveals stark disparities: GPT-2 produces 56x more tokens per word in Japanese than in English, while newer tokenizers like Qwen2.5 and Gemma-2 reduce this gap to under 4x. No intrinsic metric predicts English benchmark performance after controlling for model size. However, on multilingual benchmarks (MMLU-ProX), linear mixed-effects models show that tokenizer metrics significantly predict per-language performance (STRR: 𝛽 = +5.7, z = 18.5, p < 0.001). A controlled experiment on the Qwen2.5 family further shows that languages with higher single-token retention rate exhibit steeper scaling slopes (𝜌 = 0.91, p < 0.001). These results indicate that tokenizer quality is significantly associated with multilingual LLM performance, though the evidence remains correlational and partially confounded with pretraining data composition.
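As a rough illustration of three of the intrinsic metrics named in the abstract above, the following minimal sketch computes fertility, characters per token, and single-token retention rate for a whitespace-segmented text. It is not the TokLens implementation. The definitions used here are assumptions: fertility as tokens per word, characters per token as total characters over total tokens, and single-token retention rate as the share of words encoded as exactly one token. The `toy_tokenize` function is a hypothetical stand-in for a real subword tokenizer, and whitespace word splitting would not work for languages such as Japanese.

```python
from typing import Callable, Dict, List


def toy_tokenize(word: str, chunk: int = 4) -> List[str]:
    """Hypothetical stand-in tokenizer: cut a word into fixed-size character chunks."""
    return [word[i:i + chunk] for i in range(0, len(word), chunk)]


def intrinsic_metrics(text: str, tokenize_word: Callable[[str], List[str]]) -> Dict[str, float]:
    """Compute assumed definitions of three intrinsic tokenizer metrics."""
    words = text.split()  # assumption: words are whitespace-delimited
    if not words:
        raise ValueError("text must contain at least one word")
    pieces_per_word = [len(tokenize_word(w)) for w in words]
    n_tokens = sum(pieces_per_word)
    n_chars = sum(len(w) for w in words)
    return {
        # average number of tokens produced per word
        "fertility": n_tokens / len(words),
        # average number of characters covered by one token
        "chars_per_token": n_chars / n_tokens,
        # share of words that survive as a single token
        "strr": sum(1 for n in pieces_per_word if n == 1) / len(words),
    }


metrics = intrinsic_metrics("the tokenizer splits words", toy_tokenize)
print(metrics)
```

With a real tokenizer, `tokenize_word` would wrap that tokenizer's own encode call; comparing the resulting fertility across languages on parallel text gives the kind of disparity ratio the abstract reports.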
Probing Functional Correctness in Diffusion Language Models
Guan-Ming Chiu | Jeng-Yue Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Diffusion language models (DLMs) generate text by iteratively denoising all tokens in parallel, but when and where their hidden states encode whether the output will be functionally correct remains unknown. We present the first probing study of DLM internals, training linear classifiers on hidden states to predict functional correctness. Across two models (LLaDA-8B, Dream-7B) and four tasks, we find that DLMs uniquely accumulate correctness signal across denoising steps (AUC gains of 0.08–0.11 on reasoning tasks), a buildup that is absent in single-pass autoregressive (AR) decoding. However, step-0 signal reflects prompt difficulty rather than diffusion-specific computation. Signal emergence is task-dependent: structural tasks show flat profiles while reasoning tasks show gradual buildup. The two models exhibit distinct layer dynamics, with LLaDA concentrating signal in upper layers while Dream redistributes toward lower layers. We further show that probe confidence can identify likely failures, enabling selective generation that avoids 36–98% of wasted compute.
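The selective-generation idea in the abstract above can be sketched as a simple threshold on probe confidence. This is a minimal sketch under stated assumptions, not the authors' procedure: the function name `selective_generation`, the fixed threshold, and the scores and labels below are all hypothetical and made up for illustration. It reports what fraction of failing generations would be skipped, and what fraction of correct generations would be skipped by mistake.

```python
from typing import List, Tuple


def selective_generation(
    probe_scores: List[float],
    is_correct: List[bool],
    threshold: float,
) -> Tuple[float, float]:
    """Return (share of failures skipped, share of correct runs wrongly skipped).

    A generation is skipped when its probe score falls below the threshold.
    """
    pairs = list(zip(probe_scores, is_correct))
    n_fail = sum(1 for _, ok in pairs if not ok)
    n_ok = len(pairs) - n_fail
    skipped_fail = sum(1 for s, ok in pairs if s < threshold and not ok)
    skipped_ok = sum(1 for s, ok in pairs if s < threshold and ok)
    avoided = skipped_fail / n_fail if n_fail else 0.0
    lost = skipped_ok / n_ok if n_ok else 0.0
    return avoided, lost


# Made-up probe confidences and correctness labels, for illustration only.
scores = [0.9, 0.2, 0.7, 0.1, 0.6, 0.8]
labels = [True, False, True, False, False, True]
avoided, lost = selective_generation(scores, labels, threshold=0.5)
print(f"failures skipped: {avoided:.2f}, correct runs lost: {lost:.2f}")
```

Raising the threshold skips more failures at the cost of discarding more correct generations, so in practice the threshold would be tuned on held-out data.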