MorphBPE: Morphology-Aware Tokenization for Efficient LLM Training

Ehsaneddin Asgari; Yassine El Kheir; MohammadAli SadraeiJavaheri; Ali Nazari

Abstract

Tokenization fundamentally shapes NLP performance, affecting both efficiency and linguistic fidelity. While Byte Pair Encoding (BPE) underpins most Large Language Models (LLMs), its frequency-driven merges often disregard morpheme boundaries, yielding inconsistent and semantically opaque segmentations in morphologically rich languages. We introduce MorphBPE, a simple extension of BPE that constrains merge operations during tokenizer training to respect morpheme boundaries, while leaving inference unchanged and fully compatible with existing LLM pipelines. We evaluate tokenization quality using two intrinsic metrics: Morphological Consistency F1, which measures whether shared morphemes are assigned consistent token representations, and Morphological Edit Distance, which quantifies alignment with morpheme boundaries. We then train 300M- and 1B-parameter decoder-only LMs from scratch across four typologically diverse languages (English, Russian, Hungarian, and Arabic) under identical vocabulary sizes and training settings. Across all languages, MorphBPE consistently improves intrinsic morphological coherence and reduces language model cross-entropy. Moreover, token length statistics indicate that these gains are not attributable to materially shorter tokens. Finally, on the Belebele multilingual reading comprehension benchmark, MorphBPE yields significant improvements in morphologically rich languages such as Russian and Arabic.
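
The core idea stated above, constraining merges at training time while leaving inference untouched, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the training corpus is already segmented into morphemes by some external analyzer, and it enforces the constraint in the simplest possible way, by counting candidate pairs only inside a morpheme. The function names `train_morph_bpe`, `merge_pair`, and `encode`, and the toy corpus, are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_morph_bpe(segmented_words, num_merges):
    """Learn BPE merges that never cross a morpheme boundary.

    segmented_words: dict mapping a tuple of morphemes to a corpus frequency
    (ASSUMPTION: the segmentation comes from an external morphological analyzer).
    """
    # Each word is a tuple of morphemes; each morpheme is a tuple of symbols.
    corpus = {
        tuple(tuple(m) for m in morphemes): freq
        for morphemes, freq in segmented_words.items()
    }
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in corpus.items():
            for morpheme in word:
                # Pairs are counted only within a morpheme, so no candidate
                # merge can span a boundary.
                for a, b in zip(morpheme, morpheme[1:]):
                    pairs[(a, b)] += freq
        if not pairs:
            break  # every morpheme is already a single symbol
        # Most frequent pair; ties broken deterministically by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_corpus = {}
        for word, freq in corpus.items():
            new_word = tuple(merge_pair(m, best) for m in word)
            new_corpus[new_word] = new_corpus.get(new_word, 0) + freq
        corpus = new_corpus
    return merges


def encode(word, merges):
    """Ordinary BPE inference: apply merges in learned order, no morphology needed."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Hypothetical toy corpus: morpheme tuples with frequencies.
toy_corpus = {("un", "lock", "ed"): 5, ("lock", "s"): 3, ("un", "do"): 2}
toy_merges = train_morph_bpe(toy_corpus, num_merges=10)
toy_tokens = encode("unlocked", toy_merges)
```

In this sketch the morpheme segmentation is consulted only in `train_morph_bpe`; `encode` takes a raw string and a merge list, exactly as a standard BPE tokenizer would, which mirrors the claim that inference is unchanged. On the toy corpus, `encode("unlocked", toy_merges)` segments the word as `un`, `lock`, `ed`.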

Anthology ID: 2026.findings-acl.2068
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 41610–41621
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2068/
Cite (ACL): Ehsaneddin Asgari, Yassine El Kheir, MohammadAli SadraeiJavaheri, and Ali Nazari. 2026. MorphBPE: Morphology-Aware Tokenization for Efficient LLM Training. In Findings of the Association for Computational Linguistics: ACL 2026, pages 41610–41621, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): MorphBPE: Morphology-Aware Tokenization for Efficient LLM Training (Asgari et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2068.pdf
Checklist: 2026.findings-acl.2068.checklist.pdf
