How to encode arbitrarily complex morphology in word embeddings, no corpus needed

Lane Schwartz, Coleman Haley, Francis Tyers


Abstract
In this paper, we present a straightforward technique for constructing interpretable word embeddings from morphologically analyzed examples (such as interlinear glosses) for all of the world’s languages. Currently, fewer than 300 to 400 languages out of approximately 7000 have more than a trivial amount of digitized text; of those, between 100 and 200 languages (most in the Indo-European language family) have enough text data for BERT embeddings of reasonable quality to be trained. The word embeddings in this paper are explicitly designed to be both linguistically interpretable and fully capable of handling the broad variety found in the world’s diverse set of 7000 languages, regardless of corpus size or morphological characteristics. We demonstrate the applicability of our representation through examples drawn from a typologically diverse set of languages whose morphology includes prefixes, suffixes, infixes, circumfixes, templatic morphemes, derivational morphemes, inflectional morphemes, and reduplication.
Anthology ID:
2022.fieldmatters-1.8
Volume:
Proceedings of the first workshop on NLP applications to field linguistics
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Editors:
Oleg Serikov, Ekaterina Voloshina, Anna Postnikova, Elena Klyachko, Ekaterina Neminova, Ekaterina Vylomova, Tatiana Shavrina, Eric Le Ferrand, Valentin Malykh, Francis Tyers, Timofey Arkhangelskiy, Vladislav Mikhailov, Alena Fenogenova
Venue:
FieldMatters
Publisher:
International Conference on Computational Linguistics
Pages:
64–76
URL:
https://aclanthology.org/2022.fieldmatters-1.8
Cite (ACL):
Lane Schwartz, Coleman Haley, and Francis Tyers. 2022. How to encode arbitrarily complex morphology in word embeddings, no corpus needed. In Proceedings of the first workshop on NLP applications to field linguistics, pages 64–76, Gyeongju, Republic of Korea. International Conference on Computational Linguistics.
Cite (Informal):
How to encode arbitrarily complex morphology in word embeddings, no corpus needed (Schwartz et al., FieldMatters 2022)
PDF:
https://preview.aclanthology.org/naacl-24-ws-corrections/2022.fieldmatters-1.8.pdf