2024
The Ethical Question – Use of Indigenous Corpora for Large Language Models
Linda Wiechetek | Flammie A. Pirinen | Børre Gaup | Trond Trosterud | Maja Lisa Kappfjell | Sjur Moshagen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Building language technology from language data has become very popular with the recent advances in large language models and neural network technologies. This makes language resources very valuable, and for indigenous languages the scarce resources are even more precious. Given the good results obtained for English by simply fetching everything available from the internet and feeding it to neural networks, there has been more work on doing the same for all languages. However, indigenous language resources as found on the web are not comparable, because they do not encode the most recent normativised language in all its aspects. The problem is compounded when the people who work on the models do not understand the texts fed into the models or produced by them. Corpora are also subject to intellectual property rights and copyrights, which are not respected. Furthermore, the web is filled with texts generated by language models. In this article we describe an ethical and sustainable way to work with indigenous languages.
2023
GiellaLT — a stable infrastructure for Nordic minority languages and beyond
Flammie Pirinen | Sjur Moshagen | Katri Hiovain-Asikainen
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Long-term language technology infrastructures are critical for the continued maintenance of language-technology-based software that supports the use of languages in the digital world. The Nordic area has languages ranging from well-resourced national majority languages like Norwegian, Swedish and Finnish to minoritised, unresourced and indigenous languages like the Sámi languages. We present an infrastructure that has been built over more than 20 years and that supports building language technology and tools for most of the Nordic languages as well as many other languages around the world, with a focus on Sámi and other indigenous, minoritised and unresourced languages. We show that one common infrastructure can be used to build tools ranging from keyboards and spell-checkers to machine translators, grammar checkers, text-to-speech and automatic speech recognition.
2022
Unmasking the Myth of Effortless Big Data - Making an Open Source Multi-lingual Infrastructure and Building Language Resources from Scratch
Linda Wiechetek | Katri Hiovain-Asikainen | Inga Lill Sigga Mikkelsen | Sjur Moshagen | Flammie Pirinen | Trond Trosterud | Børre Gaup
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Machine learning (ML) approaches have dominated NLP during the last two decades. Beyond machine translation and speech technology, ML tools are now also used for spellchecking and grammar checking, with a blurry distinction between the two. We unmask the myth of effortless big data by illuminating the effort and time that lie behind collecting, marking up and building a multi-purpose corpus from scratch. We also discuss what kind of language technology minority languages actually need, and to what extent the dominant paradigm has been able to deliver these tools. In this context we present our alternative to corpus-based language technology, namely knowledge-based language technology, and we show how this approach can provide language technology solutions for languages outside the reach of machine learning procedures. We present a stable and mature infrastructure (GiellaLT) that contains more than a hundred languages and builds a number of language technology tools that are useful for language communities.
Building Open-source Speech Technology for Low-resource Minority Languages with Sámi as an Example – Tools, Methods and Experiments
Katri Hiovain-Asikainen | Sjur Moshagen
Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages
This paper presents a work-in-progress report on an open-source speech technology project for the indigenous Sami languages. A less detailed description of this work has been presented in a more general paper about the whole GiellaLT language infrastructure, submitted to the LREC 2022 main conference. At this stage, we have designed and collected a text corpus specifically for developing speech technology applications, namely text-to-speech (TTS) and automatic speech recognition (ASR), for the Lule and North Sami languages. We have also piloted and experimented with different speech synthesis technologies using a miniature speech corpus, and developed tools for effective processing of large spoken corpora. Additionally, we discuss effective and mindful use of the speech corpus as well as the possibility of using found or archive materials to train an ASR model for these languages.
2019
Is this the end? Two-step tokenization of sentence boundaries
Linda Wiechetek | Sjur Nørstebø Moshagen | Thomas Omma
Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages
Seeing more than whitespace — Tokenisation and disambiguation in a North Sámi grammar checker
Linda Wiechetek | Sjur Nørstebø Moshagen | Kevin Brubeck Unhammer
Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)
2018
Modeling Northern Haida Verb Morphology
Jordan Lachler | Lene Antonsen | Trond Trosterud | Sjur Moshagen | Antti Arppe
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2017
A Morphological Parser for Odawa
Dustin Bowers | Antti Arppe | Jordan Lachler | Sjur Moshagen | Trond Trosterud
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages
2014
Modeling the Noun Morphology of Plains Cree
Conor Snoek | Dorothy Thunder | Kaidi Lõo | Antti Arppe | Jordan Lachler | Sjur Moshagen | Trond Trosterud
Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages
2013
Building an Open-Source Development Infrastructure for Language Technology Projects
Sjur N. Moshagen | Tommi Pirinen | Trond Trosterud
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)
2007
Usage of XSL Stylesheets for the Annotation of the Sámi Language Corpora.
Saara Huhmarniemi | Sjur N. Moshagen | Trond Trosterud
Proceedings of the Linguistic Annotation Workshop
1996
A Sign Expansion Approach to Dynamic, Multi-Purpose Lexicons
Jon Atle Gulla | Sjur Nørstebø Moshagen
COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics