Anna Stacey
2026
Glossed Data in Northern Interior Salish
Anna Stacey
Proceedings of the Fifteenth Language Resources and Evaluation Conference
The Northern Interior subgroup of the Salish language family, spoken in the Pacific Northwest of North America, comprises three languages: St’át’imcets, nɬeʔkepmxcín, and Secwepemctsín. Each has only a small number of first-language (L1) speakers remaining due to the effects of colonization, though language revitalization efforts are ongoing. This work introduces the first compiled and cleaned datasets in these languages that are usable in natural language processing (NLP) projects. The data is in glossed format, with transcriptions in the language, translations into English, and linguistic segmentations and glosses that provide a detailed breakdown of meaning. To achieve consistently formatted data within and across the languages, extensive data cleaning was conducted. This paper provides the glossed data standards that were developed and recounts the cleaning process. Scripts that help to automate parts of the data preparation process are included. Finally, this work strives to keep the interconnectedness of language and community as a central consideration.
2023
Findings of the SIGMORPHON 2023 Shared Task on Interlinear Glossing
Michael Ginn | Sarah Moeller | Alexis Palmer | Anna Stacey | Garrett Nicolai | Mans Hulden | Miikka Silfverberg
Proceedings of the 20th SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology
This paper presents the findings of the SIGMORPHON 2023 Shared Task on Interlinear Glossing. This first iteration of the shared task explores glossing of a set of six typologically diverse languages: Arapaho, Gitksan, Lezgi, Natügu, Tsez and Uspanteko. The shared task encompasses two tracks: a resource-scarce closed track and an open track, where participants are allowed to utilize external data resources. Five teams participated in the shared task. The winning team, Tü-CL, achieved a 23.99 percentage-point improvement over a baseline RoBERTa system in the closed track and a 17.42 percentage-point improvement in the open track.