Surendhar Muthukumar

2026

TamilTok: Morphologically-Informed Tokenization for Tamil
Surendhar Muthukumar | Aaricia Herygers | Lisa Beinborn
Proceedings of the Sixth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Tokenization is fundamental to neural language modeling, yet Tamil tokenizers are still largely adapted from general-purpose multilingual models without systematic consideration of the language's rich agglutinative morphology. We introduce TamilMorph, a large-scale dataset of more than 480,000 morphologically segmented Tamil word forms. Building on this resource, we develop TamilTok, a morphology-aware tokenization framework that incorporates explicit morpheme structure into tokenizer training. We benchmark Tamil tokenization quality across multiple tokenization algorithms and vocabulary configurations and find that our approach improves both morphological alignment and downstream performance compared to previous approaches. Our morphological resource for Tamil and our systematic empirical analyses can guide future development of tokenization for morphologically rich languages.
