Vorm: Translations and a constrained hypothesis space support unsupervised morphological segmentation across languages

Barend Beekhuizen

Abstract

This paper introduces Vorm, an unsupervised morphological segmentation system that leverages translation data to infer highly accurate morphological transformations, including less frequently modeled processes such as infixation and reduplication. The system is evaluated on standard benchmark data and on a novel, typologically diverse dataset of 37 languages. Model performance is competitive and sometimes superior on canonical segmentation, but more limited on surface segmentation.

Anthology ID: 2025.conll-1.39
Volume: Proceedings of the 29th Conference on Computational Natural Language Learning
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Gemma Boleda, Michael Roth
Venues: CoNLL | WS
Publisher: Association for Computational Linguistics
Pages: 602–626
URL: https://preview.aclanthology.org/landing_page/2025.conll-1.39/
Cite (ACL): Barend Beekhuizen. 2025. Vorm: Translations and a constrained hypothesis space support unsupervised morphological segmentation across languages. In Proceedings of the 29th Conference on Computational Natural Language Learning, pages 602–626, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Vorm: Translations and a constrained hypothesis space support unsupervised morphological segmentation across languages (Beekhuizen, CoNLL 2025)
PDF: https://preview.aclanthology.org/landing_page/2025.conll-1.39.pdf
