TamilTok: Morphologically-Informed Tokenization for Tamil

Surendhar Muthukumar, Aaricia Herygers, Lisa Beinborn


Abstract
Tokenization is fundamental to neural language modeling, yet for Tamil it remains largely adapted from general-purpose multilingual models without systematic consideration of the language's rich agglutinative morphology. We introduce TamilMorph, a large-scale dataset of more than 480,000 morphologically segmented Tamil word forms. Building on this new resource, we develop TamilTok, a morphology-aware tokenization framework that incorporates explicit morpheme structure into tokenizer training. We benchmark Tamil tokenization quality across multiple tokenization algorithms and vocabulary configurations and find that our approach improves both morphological alignment and downstream performance compared to previous approaches. Our morphological resource for Tamil and our systematic empirical analyses can guide future developments of tokenization for morphologically rich languages.
Anthology ID:
2026.dravidianlangtech-1.7
Volume:
Proceedings of the Sixth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages
Month:
July
Year:
2026
Address:
Underline (Virtual)
Editors:
Bharathi Raja Chakravarthi, Ruba Priyadharshini, Anand Kumar Madasamy, Sajeetha Thavareesan, Saranya Rajiakodi, Subalalitha Navaneethakrishnan, Dhivya Chinnappa, Balasubramanian Palani, Malliga Subramanian, Kogilavani Shanmugavadivel, Ratnavel Rajalakshmi
Venues:
DravidianLangTech | WS
Publisher:
Association for Computational Linguistics
Pages:
52–61
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.dravidianlangtech-1.7/
Cite (ACL):
Surendhar Muthukumar, Aaricia Herygers, and Lisa Beinborn. 2026. TamilTok: Morphologically-Informed Tokenization for Tamil. In Proceedings of the Sixth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages, pages 52–61, Underline (Virtual). Association for Computational Linguistics.
Cite (Informal):
TamilTok: Morphologically-Informed Tokenization for Tamil (Muthukumar et al., DravidianLangTech 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.dravidianlangtech-1.7.pdf