Finding words that aren’t there: Using word embeddings to improve dictionary search for low-resource languages

Antti Arppe, Andrew Neitsch, Daniel Dacanay, Jolene Poulin, Daniel Hieber, Atticus Harrigan


Abstract
Modern machine learning techniques have produced many impressive results in language technology, but these techniques generally require an amount of training data that is many orders of magnitude greater than what exists for low-resource languages in general, and endangered ones in particular. However, dictionary definitions in a comparatively much more well-resourced majority language can provide a link between low-resource languages and machine learning models trained on massive amounts of majority-language data. By leveraging a pre-trained English word embedding to compute sentence embeddings for definitions in bilingual dictionaries for four Indigenous languages spoken in North America, Plains Cree (nêhiyawêwin), Arapaho (Hinónoʼeitíít), Northern Haida (X̲aad Kíl), and Tsuut’ina (Tsúùt’ínà), we have obtained promising results for dictionary search. Not only are the search results in the majority language of the definitions more relevant, but they can be semantically relevant in ways not achievable with classic information retrieval techniques: users can perform successful searches for words that do not occur at all in the dictionary. These techniques are directly applicable to any bilingual dictionary providing translations between a high- and low-resource language.
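The idea in the abstract can be illustrated with a minimal sketch. It assumes that a definition's sentence embedding is the average of its word vectors and that results are ranked by cosine similarity; the abstract does not spell out either choice, so both are assumptions here. The vectors in `TOY_VECTORS`, the headwords in `DEFINITIONS`, and the function names are all hypothetical stand-ins; a real system would load a pre-trained English embedding and a real bilingual dictionary.

```python
import math

# Hand-made 3-dimensional vectors, invented purely for illustration.
# A real system would load a pre-trained English word embedding instead.
TOY_VECTORS = {
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "young": [0.3, 0.3, 0.2],
    "water": [0.0, 0.1, 0.9],
    "river": [0.1, 0.0, 0.8],
    "flowing": [0.1, 0.2, 0.6],
}

# Hypothetical dictionary: placeholder headwords with English definitions.
DEFINITIONS = {
    "headword_1": "young dog",
    "headword_2": "flowing water",
}


def sentence_vector(text):
    """Average the vectors of the known words; None if no word is known."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return None
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query, definitions):
    """Return headwords ordered by similarity of their definition to the query."""
    query_vec = sentence_vector(query)
    if query_vec is None:
        return []
    scored = []
    for headword, definition in definitions.items():
        definition_vec = sentence_vector(definition)
        if definition_vec is not None:
            scored.append((cosine(query_vec, definition_vec), headword))
    return [headword for _, headword in sorted(scored, reverse=True)]


# "puppy" appears in no definition, yet the "young dog" entry ranks first.
print(search("puppy", DEFINITIONS))
```

In this toy setup the query word never occurs in the dictionary, which is the situation the title refers to: a keyword search would return nothing, while the embedding-based ranking still surfaces the semantically closest definition.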
Anthology ID:
2023.americasnlp-1.15
Volume:
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)
Month:
July
Year:
2023
Address:
Toronto, Canada
Venue:
AmericasNLP
Publisher:
Association for Computational Linguistics
Pages:
144–155
URL:
https://aclanthology.org/2023.americasnlp-1.15
Cite (ACL):
Antti Arppe, Andrew Neitsch, Daniel Dacanay, Jolene Poulin, Daniel Hieber, and Atticus Harrigan. 2023. Finding words that aren’t there: Using word embeddings to improve dictionary search for low-resource languages. In Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP), pages 144–155, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Finding words that aren’t there: Using word embeddings to improve dictionary search for low-resource languages (Arppe et al., AmericasNLP 2023)
PDF:
https://preview.aclanthology.org/nodalida-main-page/2023.americasnlp-1.15.pdf