Vorm: Translations and a constrained hypothesis space support unsupervised morphological segmentation across languages

Barend Beekhuizen


Abstract
This paper introduces Vorm, an unsupervised morphological segmentation system that leverages translation data to infer highly accurate morphological transformations, including less frequently modeled processes such as infixation and reduplication. The system is evaluated on standard benchmark data and on a novel, typologically diverse dataset of 37 languages. Model performance is competitive and sometimes superior on canonical segmentation, but more limited on surface segmentation.
Anthology ID:
2025.conll-1.39
Volume:
Proceedings of the 29th Conference on Computational Natural Language Learning
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Gemma Boleda, Michael Roth
Venues:
CoNLL | WS
Publisher:
Association for Computational Linguistics
Pages:
602–626
URL:
https://preview.aclanthology.org/landing_page/2025.conll-1.39/
Cite (ACL):
Barend Beekhuizen. 2025. Vorm: Translations and a constrained hypothesis space support unsupervised morphological segmentation across languages. In Proceedings of the 29th Conference on Computational Natural Language Learning, pages 602–626, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Vorm: Translations and a constrained hypothesis space support unsupervised morphological segmentation across languages (Beekhuizen, CoNLL 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.conll-1.39.pdf