Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment
Saketh Reddy Vemula, Sandipan Dandapat, Dipti Sharma, Parameswari Krishnamurthy
Abstract
The relationship between tokenizer algorithm (e.g., Byte-Pair Encoding (BPE), Unigram), morphological alignment, tokenization quality (e.g., compression efficiency), and downstream performance remains largely unclear, particularly for languages with complex morphology. In this paper, we conduct a comprehensive evaluation of tokenizers using small-sized BERT models—from pre-training through fine-tuning—for Telugu (agglutinative), along with preliminary evaluation in Hindi (primarily fusional with some agglutination) and English (fusional). To evaluate morphological alignment of tokenizers in Telugu, we create a dataset containing gold morpheme segmentations of 600 derivational and 7000 inflectional word forms.Our experiments reveal two key findings for Telugu. First, the choice of tokenizer algorithm is the most significant factor influencing performance, with Unigram-based tokenizers consistently outperforming BPE across most settings. Second, while better morphological alignment shows a moderate, positive correlation with performance on text classification and structure prediction tasks, its impact is secondary to the tokenizer algorithm. Notably, hybrid approaches that use morphological information for pre-segmentation significantly boost the performance of BPE, though not Unigram. Our results further showcase the need for comprehensive intrinsic evaluation metrics for tokenizers that could explain downstream performance trends consistently.- Anthology ID:
- 2025.ijcnlp-srw.20
- Volume:
- The 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
- Month:
- December
- Year:
- 2025
- Address:
- Mumbai, India
- Editors:
- Santosh T.y.s.s, Shuichiro Shimizu, Yifan Gong
- Venue:
- IJCNLP
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 232–252
- Language:
- URL:
- https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-srw.20/
- DOI:
- Cite (ACL):
- Saketh Reddy Vemula, Sandipan Dandapat, Dipti Sharma, and Parameswari Krishnamurthy. 2025. Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment. In The 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 232–252, Mumbai, India. Association for Computational Linguistics.
- Cite (Informal):
- Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment (Vemula et al., IJCNLP 2025)
- PDF:
- https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-srw.20.pdf