MorphBPE: Morphology-Aware Tokenization for Efficient LLM Training
Ehsaneddin Asgari, Yassine El Kheir, MohammadAli SadraeiJavaheri, Ali Nazari
Abstract
Tokenization fundamentally shapes NLP performance, affecting both efficiency and linguistic fidelity. While Byte Pair Encoding (BPE) underpins most Large Language Models (LLMs), its frequency-driven merges often disregard morpheme boundaries, yielding inconsistent and semantically opaque segmentations in morphologically rich languages. We introduce MorphBPE, a simple extension of BPE that constrains merge operations during tokenizer training to respect morpheme boundaries, while leaving inference unchanged and fully compatible with existing LLM pipelines. We evaluate tokenization quality using two intrinsic metrics: Morphological Consistency F1, which measures whether shared morphemes are assigned consistent token representations, and Morphological Edit Distance, which quantifies alignment with morpheme boundaries. We then train 300M and 1B parameter decoder-only LMs from scratch across four typologically diverse languages (English, Russian, Hungarian, and Arabic) under identical vocabulary sizes and training settings. Across all languages, MorphBPE consistently improves intrinsic morphological coherence and reduces language model cross-entropy. Moreover, token length statistics indicate that these gains are not attributable to materially shorter tokens. Finally, on the Belebele multilingual reading comprehension benchmark, MorphBPE yields significant improvements in morphologically rich languages such as Russian and Arabic.
- Anthology ID:
- 2026.findings-acl.2068
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 41610–41621
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2068/
- Cite (ACL):
- Ehsaneddin Asgari, Yassine El Kheir, MohammadAli SadraeiJavaheri, and Ali Nazari. 2026. MorphBPE: Morphology-Aware Tokenization for Efficient LLM Training. In Findings of the Association for Computational Linguistics: ACL 2026, pages 41610–41621, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- MorphBPE: Morphology-Aware Tokenization for Efficient LLM Training (Asgari et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2068.pdf
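
The central idea in the abstract, constraining BPE merges during training so that they never span a morpheme boundary while leaving inference unchanged, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`train_constrained_bpe`, `merge_pair`, `apply_merges`), the pre-segmented input format, and the tie-breaking rule are all assumptions made for illustration, and the paper's actual constraint may differ in detail.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_constrained_bpe(segmented_words, num_merges):
    """Learn BPE merges from words that are already split into morphemes.

    `segmented_words` maps a tuple of morphemes to a frequency (hypothetical
    input format). Pairs are counted only inside a morpheme, so no learned
    merge can cross a morpheme boundary.
    """
    # Each word is a list of morphemes; each morpheme starts as a list of characters.
    corpus = [([list(m) for m in morphemes], freq)
              for morphemes, freq in segmented_words.items()]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for morphemes, freq in corpus:
            for symbols in morphemes:
                for a, b in zip(symbols, symbols[1:]):
                    pair_counts[(a, b)] += freq
        if not pair_counts:
            break  # every morpheme is already a single symbol
        # Assumed tie-break: highest count first, then the lexicographically smallest pair.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        for morphemes, _ in corpus:
            for i, symbols in enumerate(morphemes):
                morphemes[i] = merge_pair(symbols, best)
    return merges


def apply_merges(word, merges):
    """Standard BPE inference: apply the learned merges in order to an unsegmented word."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Toy data (invented for illustration): morpheme tuples with frequencies.
toy = {("un", "do"): 5, ("un", "tie"): 3, ("re", "do"): 2}
merges = train_constrained_bpe(toy, num_merges=10)
print(merges)
print(apply_merges("undo", merges))
```

In this toy run the pair `("n", "d")` is never counted, because `n` and `d` sit on opposite sides of a morpheme boundary in `un` + `do`, so it can never be learned as a merge. Inference needs no morphological analyzer: `apply_merges` is ordinary BPE applied to the raw word.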