Christopher Cox


2025

In this paper, we discuss the development of a long-term partnership between community and university-based language workers to create supportive language technologies for Tsuutina, a critically endangered Dene language spoken in southern Alberta, Canada. Initial development activities in this partnership sought to rapidly integrate existing language materials, with the aim of arriving at tools that would be effective and impactful for community use by virtue of their extensive lexical coverage. We describe how, as this partnership developed, this approach was gradually superseded by one that involved a more targeted, lexical-item-by-lexical-item review process that was directly informed by other community language priorities and connected to the work a local language authority. We describe how this shift in processes correlated with other changes in local language programs and priorities, noting how ongoing communication allowed this partnership to adapt to the evolving needs of local organizations.

2024

2023

2022

This paper describes the motivation and implementation details for a rule-based, index-preserving grapheme-to-phoneme engine ‘Gi2Pi' implemented in pure Python and released under the open source MIT license. The engine and interface have been designed to prioritize the developer experience of potential contributors without requiring a high level of programming knowledge. ‘Gi2Pi' already provides mappings for 30 (mostly Indigenous) languages, and the package is accompanied by a web-based interactive development environment, a RESTful API, and extensive documentation to encourage the addition of more mappings in the future. We also present three downstream applications of ‘Gi2Pi' and show results of a preliminary evaluation.
This paper describes the expansion of a finite state transducer (FST) for the transitive verb system of Tsuut’ina (ISO 639-3: srs), a Dene (Athabaskan) language spoken in Alberta, Canada. Dene languages have unique templatic morphology, in which lexical, inflectional and derivational tiers are interlaced. Drawing on data from close to 9,000 verbal forms, the expanded model can handle a great range of common and rare argument structure types, including ditransitive and uniquely Dene object experiencer verbs. While challenges of speed remain, this expansion shows the ability of FST modelling to handle morphology of this type, and the expnded FST shows great promise for community language applications such as a morphologically informed online dictionary and word predictor, and for further FST development. This paper describes the expansion of a finite state transducer (FST) for the transitive verb system of Tsuut’ina (ISO 639-3: srs), a Dene (Athabaskan) language spoken in Alberta, Canada. Dene languages have unique templatic morphology, in which lexical, inflectional and derivational tiers are interlaced. Drawing on data from over 12,000 verbs forms, the expanded model can handle a great range of common and rare argument structure types, including ditransitive and uniquely Dene object experiencer verbs. While challenges of speed remain, this expansion shows the ability of FST modelling to handle morphology of this type, and the expnded FST shows great promise for community language applications such as a morphologically informed online dictionary and word predictor, and for further FST development.

2021

2020

This paper surveys the first, three-year phase of a project at the National Research Council of Canada that is developing software to assist Indigenous communities in Canada in preserving their languages and extending their use. The project aimed to work within the empowerment paradigm, where collaboration with communities and fulfillment of their goals is central. Since many of the technologies we developed were in response to community needs, the project ended up as a collection of diverse subprojects, including the creation of a sophisticated framework for building verb conjugators for highly inflectional polysynthetic languages (such as Kanyen’kéha, in the Iroquoian language family), release of what is probably the largest available corpus of sentences in a polysynthetic language (Inuktut) aligned with English sentences and experiments with machine translation (MT) systems trained on this corpus, free online services based on automatic speech recognition (ASR) for easing the transcription bottleneck for recordings of speech in Indigenous languages (and other languages), software for implementing text prediction and read-along audiobooks for Indigenous languages, and several other subprojects.
This tutorial will focus on NLP for endangered languages documentation and revitalization. First, we will acquaint the attendees with the process and the challenges of language documentation, showing how the needs of the language communities and the documentary linguists map to specific NLP tasks. We will then present the state-of-the-art in NLP applied in this particularly challenging setting (extremely low-resource datasets, noisy transcriptions, limited annotations, non-standard orthographies). In doing so, we will also analyze the challenges of working in this domain and expand on both the capabilities and the limitations of current NLP approaches. Our ultimate goal is to motivate more NLP practitioners to work towards this very important direction, and also provide them with the tools and understanding of the limitations/challenges, both of which are needed in order to have an impact.

2018

In this article, we discuss which text, speech, and image technologies have been developed, and would be feasible to develop, for the approximately 60 Indigenous languages spoken in Canada. In particular, we concentrate on technologies that may be feasible to develop for most or all of these languages, not just those that may be feasible for the few most-resourced of these. We assess past achievements and consider future horizons for Indigenous language transliteration, text prediction, spell-checking, approximate search, machine translation, speech recognition, speaker diarization, speech synthesis, optical character recognition, and computer-aided language learning.

2004

2002