Contextual morphologically-guided tokenization for Latin encoder models

Marisa Hudspeth, Patrick J. Burns, Brendan O'Connor


Abstract
Tokenization is a critical component of language model pretraining, yet standard tokenization methods often prioritize information-theoretic goals like high compression and low fertility rather than linguistic goals like morphological alignment. In fact, they have been shown to be suboptimal for morphologically rich languages, where tokenization quality directly impacts downstream performance. In this work, we investigate morphologically-aware tokenization for Latin, a morphologically rich language that is medium-resource in terms of pretraining data, but high-resource in terms of curated lexical resources – a distinction that is often overlooked but critical in discussions of low-resource language modeling. We find that morphologically-guided tokenization improves overall performance on four downstream tasks. Performance gains are most pronounced for out-of-domain texts, highlighting our models’ improved generalization ability. Our findings demonstrate the utility of linguistic resources for improving language modeling for morphologically complex languages. For low-resource languages that lack large-scale pretraining data, the development and incorporation of linguistic resources can serve as a feasible alternative for improving LM performance.
Anthology ID:
2026.eacl-long.270
Volume:
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Vera Demberg, Kentaro Inui, Lluís Marquez
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
5761–5775
URL:
https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.270/
Cite (ACL):
Marisa Hudspeth, Patrick J. Burns, and Brendan O'Connor. 2026. Contextual morphologically-guided tokenization for Latin encoder models. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5761–5775, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Contextual morphologically-guided tokenization for Latin encoder models (Hudspeth et al., EACL 2026)
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.270.pdf