Olga Kriukova


2025

AI for Interlinearization and POS-tagging: Teaching Linguists to Fish
Olga Kriukova | Katherine Schmirler | Sarah Moeller | Olga Lovick | Inge Genee | Antti Arppe | Alexandra Smith
Proceedings of the Eighth Workshop on the Use of Computational Methods in the Study of Endangered Languages

This paper describes the process and learning outcomes of a three-day workshop on machine learning basics for documentary linguists. During this workshop, two groups of linguists working with two Indigenous languages of North America, Blackfoot and Dënë Sųłıné, became acquainted with machine learning principles, explored how machine learning can be used in data processing for under-resourced languages, and then applied different machine learning methods for automatic morphological interlinearization and parts-of-speech tagging. As a result, participants discovered paths to greater collaboration between computer science and documentary linguistics and reflected on how linguists might be enabled to apply machine learning with less dependence on experts.
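For readers curious what such a hands-on exercise might look like, below is a minimal Python sketch of a word-level POS tagger trained from interlinearized data with scikit-learn. The feature set, tag labels, and toy tokens are invented placeholders for illustration, not the workshop's actual materials or the methods the participants used.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def features(tokens, i):
    # Simple word-shape and context features for position i.
    tok = tokens[i]
    return {
        "word": tok.lower(),
        "prefix2": tok[:2],
        "suffix3": tok[-3:],
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "<EOS>",
    }

# Hypothetical training pairs of (tokenized sentence, POS tags),
# as might be exported from interlinear glosses.
train = [
    (["oma", "ninaawa", "itsinoyiiwa"], ["DEM", "N", "VTA"]),
    (["anna", "aakiiwa", "ayookaawa"], ["DEM", "N", "VAI"]),
]

X = [features(toks, i) for toks, _ in train for i in range(len(toks))]
y = [tag for _, tags in train for tag in tags]

tagger = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
tagger.fit(X, y)

# Tag the second word of an unseen sentence.
print(tagger.predict([features(["oma", "aakiiwa"], 1)]))

With realistic amounts of glossed data, the same pipeline scales to full sentences; the point of the sketch is only that a usable first tagger needs little more than a feature function and a standard classifier.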

2024

Word-level prediction in Plains Cree: First steps
Olga Kriukova | Antti Arppe
Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)

Plains Cree (nêhiyawêwin) is a morphologically complex and predominantly prefixing language. The combinatory potential of inflectional and derivational/lexical prefixes and verb stems makes Plains Cree challenging for traditional auto-completion (word suggestion) approaches, and the lack of a large Plains Cree corpus further complicates the task. This study investigates how well a BiLSTM model trained on a small Cree corpus can handle a word suggestion task, and evaluates whether semantically and morphosyntactically refined Word2Vec embeddings improve the overall accuracy and quality of the BiLSTM's suggestions. The results show that some models trained with the refined vectors provide semantically and morphosyntactically better suggestions and are more accurate in predicting content words. The model trained with the non-refined vectors, in contrast, was better at predicting conjunctions, particles, and other non-inflecting words. Depending on the combination of refined vectors used, the models rank the expected next word among their top-10 predictions in 36.73% to 37.88% of cases.
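As a rough illustration of the setup the abstract describes, here is a minimal PyTorch sketch of a BiLSTM next-word suggester whose embedding layer is initialized from pretrained (e.g., refined) Word2Vec vectors and which returns top-10 candidates. The class and function names (CreeWordPredictor, suggest) and all architectural details are assumptions for illustration, not the paper's actual implementation.

import torch
import torch.nn as nn

class CreeWordPredictor(nn.Module):
    # Hypothetical BiLSTM that scores the next word given the left context.

    def __init__(self, embeddings: torch.Tensor, hidden: int = 256):
        super().__init__()
        vocab_size, dim = embeddings.shape
        # Initialize from pretrained (refined or non-refined) Word2Vec vectors.
        self.embed = nn.Embedding.from_pretrained(embeddings, freeze=False)
        self.lstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, context_ids: torch.Tensor) -> torch.Tensor:
        # context_ids: (batch, seq_len) indices of the preceding words.
        encoded, _ = self.lstm(self.embed(context_ids))
        # Score every vocabulary item from the final context position.
        return self.out(encoded[:, -1, :])

def suggest(model: nn.Module, context_ids: torch.Tensor, k: int = 10) -> torch.Tensor:
    # Return the top-k candidate next words, mirroring the paper's
    # top-10 evaluation setting.
    with torch.no_grad():
        return torch.topk(model(context_ids), k, dim=-1).indices

# Usage with stand-in data:
emb = torch.randn(5000, 300)            # placeholder for Word2Vec vectors
model = CreeWordPredictor(emb)
context = torch.tensor([[12, 7, 431]])  # indices of three preceding words
top10 = suggest(model, context)         # shape (1, 10)

Initializing the embedding layer from Word2Vec vectors, rather than training embeddings from scratch, is one plausible way to make the most of a small corpus, which is the constraint the paper highlights for Plains Cree.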