Aaricia Herygers
2026
TamilTok: Morphologically-Informed Tokenization for Tamil
Surendhar Muthukumar | Aaricia Herygers | Lisa Beinborn
Proceedings of the Sixth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages
Tokenization is fundamental to neural language modeling, yet for Tamil it remains largely adapted from general-purpose multilingual models without systematic consideration of the rich agglutinative morphology. We introduce TamilMorph, a large-scale dataset of more than 480,000 morphologically segmented Tamil word forms. Building on this new resource, we develop TamilTok, a morphology-aware tokenization framework that incorporates explicit morpheme structure into tokenizer training. We benchmark Tamil tokenization quality across multiple tokenization algorithms and vocabulary configurations and find that our approach improves both morphological alignment and downstream performance compared to previous approaches. Our morphological resource for Tamil and our systematic empirical analyses can guide future developments of tokenization for morphologically rich languages.
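One common way to quantify the morphological alignment mentioned in the abstract is to compare the split points a tokenizer produces with the gold morpheme boundaries. The sketch below is an illustration only, not the paper's evaluation code: the boundary-F1 metric, the function names, and the transliterated toy word are all assumptions introduced here.

```python
def boundaries(segments):
    """Return the internal character offsets where a segmentation splits a word."""
    offsets, pos = set(), 0
    for seg in segments[:-1]:  # the end of the last segment is not an internal split
        pos += len(seg)
        offsets.add(pos)
    return offsets


def boundary_f1(token_segments, morph_segments):
    """F1 between tokenizer split points and gold morpheme split points (assumed metric)."""
    pred = boundaries(token_segments)
    gold = boundaries(morph_segments)
    if not pred and not gold:
        return 1.0  # both leave the word unsplit, so they agree trivially
    true_pos = len(pred & gold)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy example in Latin transliteration (stem + plural + locative).
gold = ["viitu", "kal", "il"]
subword = ["vii", "tukal", "il"]
score = boundary_f1(subword, gold)
```

In this toy case the tokenizer recovers one of the two morpheme boundaries and proposes one spurious split, so precision and recall are both one half.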
BoostedCats at BEA 2026 Shared Task 1: What Makes a Word Hard to Learn? Modeling L1 Influence on English Vocabulary Difficulty
Jonas Mayer Martins | Zhuojing Huang | Aaricia Herygers | Lisa Beinborn
Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)
What makes a word difficult to learn, and how does the difficulty depend on the learner’s native language? We computationally model vocabulary difficulty for English learners whose first language is Spanish, German, or Chinese with gradient-boosted models trained on features related to a word’s familiarity (e.g., frequency), meaning, surface form, and cross-linguistic transfer. Using Shapley values, we determine the importance of each feature group. Word familiarity is the dominant feature group shared by all three languages. However, predictions for Spanish- and German-speaking learners rely additionally on orthographic transfer. This transfer mechanism is unavailable to Chinese learners, whose difficulty is shaped by a combination of familiarity and surface features alone. Our models provide interpretable, L1-tailored difficulty estimates that can be used to design vocabulary curricula.
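The feature-group importance described in the abstract can be illustrated by aggregating per-feature Shapley values into groups. The sketch below assumes the Shapley values have already been computed for each word, and it sums the mean absolute value of each feature within its group. That aggregation rule, the feature names, and the numbers are assumptions made for illustration, not the authors' procedure or data.

```python
def group_importance(shap_rows, feature_groups):
    """Sum the mean absolute Shapley value of each feature within its group.

    shap_rows: one dict per example, mapping feature name to Shapley value.
    feature_groups: dict mapping feature name to group name.
    """
    n = len(shap_rows)
    totals = {}
    for row in shap_rows:
        for feature, value in row.items():
            group = feature_groups[feature]
            totals[group] = totals.get(group, 0.0) + abs(value) / n
    return totals


# Hypothetical features and made-up Shapley values for two words.
feature_groups = {
    "frequency": "familiarity",
    "length": "surface",
    "cognate": "transfer",
}
shap_rows = [
    {"frequency": 0.5, "length": -0.25, "cognate": 0.25},
    {"frequency": -0.5, "length": 0.25, "cognate": 0.0},
]
importance = group_importance(shap_rows, feature_groups)
```

Running the same aggregation separately for each L1 would give one importance profile per learner group, which is the kind of comparison the abstract reports.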