Jolene Poulin


2023

pdf
Finding words that aren’t there: Using word embeddings to improve dictionary search for low-resource languages
Antti Arppe | Andrew Neitsch | Daniel Dacanay | Jolene Poulin | Daniel Hieber | Atticus Harrigan
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)

Modern machine learning techniques have produced many impressive results in language technology, but these techniques generally require an amount of training data that is many orders of magnitude greater than what exists for low-resource languages in general, and endangered ones in particular. However, dictionary definitions in a comparatively much more well-resourced majority language can provide a link between low-resource languages and machine learning models trained on massive amounts of majority-language data. By leveraging a pre-trained English word embedding to compute sentence embeddings for definitions in bilingual dictionaries for four Indigenous languages spoken in North America, Plains Cree (nhiyawwin), Arapaho (Hinno’itit), Northern Haida (Xaad Kl), and Tsuut’ina (Tst’n), we have obtained promising results for dictionary search. Not only are the search results in the majority language of the definitions more relevant, but they can be semantically relevant in ways not achievable with classic information retrieval techniques: users can perform successful searches for words that do not occur at all in the dictionary. These techniques are directly applicable to any bilingual dictionary providing translations between a high- and low-resource language.

pdf
Speech Database (Speech-DB) – An on-line platform for storing, validating, searching, and recording spoken language data
Jolene Poulin | Daniel Dacanay | Antti Arppe
Proceedings of the Second Workshop on NLP Applications to Field Linguistics

The Speech Database (Speech-DB: URL: https://speech-db.altlab.app) is an on-line platform for language documentation, written and spoken language validation, and speech exploration; its code-base is available as open source. In its current state, Speech-DB has expanded to contain content for several Indigenous languages spoken in Western Canada, having started with audio for the dialect of Plains Cree spoken in Maskwacîs, Alberta, Canada. Currently, it is used primarily for validation and storage. It can be accessed by anyone with an internet connection in six levels of access rights. What follows is the rationale for the development of speech-DB, an exploration of its features, and a description of usage scenarios, as well as initial user feedback on the application.