Benchmarking Byte-Pair Encoding Tokenizers on Different Languages with Bits per Byte

Soham Chowdhury; Warren Woolf

Abstract

Tokenization significantly affects the cross-lingual performance of language models, yet recent tokenizer variants such as SuperBPE and MorphBPE have not been systematically evaluated across typologically diverse languages. We conduct the first extrinsic cross-language comparison of BPE, SuperBPE, and MorphBPE tokenizers on English, Mandarin, and Hungarian, using bits per byte (BPB) normalized perplexity as our metric, with vocabulary sizes of 8K, 16K, and 32K. We find that SuperBPE matches BPE for English but underperforms by 0.01–0.06 BPB for Hungarian and Mandarin, suggesting that cross-whitespace merging is counterproductive for non-English languages. MorphBPE performs worse than BPE across all settings, with gaps of 0.02–0.04 BPB at the 32K vocabulary size. These results suggest that linguistic theory alone does not guarantee practical improvements in tokenizer design, and that standard BPE remains a surprisingly effective baseline across typologically diverse languages.
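
For readers unfamiliar with the metric, the sketch below shows one common way to compute bits per byte: sum the model's token-level negative log-likelihoods, convert from nats to bits, and divide by the UTF-8 byte length of the evaluated text. Because the denominator is bytes rather than tokens, the score can be compared across tokenizers that segment the same text differently. This is an illustration of the usual definition, not the authors' evaluation code; the function name `bits_per_byte` and its inputs are hypothetical, and it assumes the likelihoods are given in nats.

```python
import math


def bits_per_byte(token_nll_nats, text):
    """Return bits per UTF-8 byte for `text`.

    token_nll_nats: per-token negative log-likelihoods in nats (assumed unit).
    text: the exact string that was tokenized and scored.
    """
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must be non-empty")
    total_nats = sum(token_nll_nats)
    # Dividing by ln(2) converts nats to bits; dividing by the byte count
    # removes the dependence on how many tokens the tokenizer produced.
    return total_nats / (math.log(2) * n_bytes)


# Hypothetical usage: two tokens covering a 6-byte string.
example_text = "héllo"  # 'é' takes 2 bytes in UTF-8, so 6 bytes in total
example_nll = [3 * math.log(2), 3 * math.log(2)]  # 3 bits per token
example_bpb = bits_per_byte(example_nll, example_text)
```

In this toy case the two tokens carry 6 bits in total over 6 bytes. A tokenizer that split the same string into more or fewer tokens would be scored against the same byte denominator.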

Anthology ID: 2026.mellm-1.27
Volume: Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM 2026)
Month: July
Year: 2026
Address: San Diego, United States
Editors: Kaiyu Huang, Fengran Mo, Pinzhen Chen, Meng Jiang
Venues: MeLLM | WS
Publisher: Association for Computational Linguistics
Pages: 275–283
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.mellm-1.27/
Cite (ACL): Soham Chowdhury and Warren Woolf. 2026. Benchmarking Byte-Pair Encoding Tokenizers on Different Languages with Bits per Byte. In Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM 2026), pages 275–283, San Diego, United States. Association for Computational Linguistics.
Cite (Informal): Benchmarking Byte-Pair Encoding Tokenizers on Different Languages with Bits per Byte (Chowdhury & Woolf, MeLLM 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.mellm-1.27.pdf