Linda Wiechetek


Unmasking the Myth of Effortless Big Data - Making an Open Source Multi-lingual Infrastructure and Building Language Resources from Scratch
Linda Wiechetek | Katri Hiovain-Asikainen | Inga Lill Sigga Mikkelsen | Sjur Moshagen | Flammie Pirinen | Trond Trosterud | Børre Gaup
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Machine learning (ML) approaches have dominated NLP during the last two decades. Starting with machine translation and speech technology, ML tools are now also in use for spelling and grammar checking, with a blurry distinction between the two. We unmask the myth of effortless big data by illuminating the effort and time that lie behind building a multi-purpose corpus: collecting the material, marking it up, and building it from scratch. We also discuss what kind of language technology minority languages actually need, and to what extent the dominant paradigm has been able to deliver these tools. In this context we present our alternative to corpus-based language technology, namely knowledge-based language technology, and we show how this approach can provide language technology solutions for languages that are outside the reach of machine learning procedures. We present a stable and mature infrastructure (GiellaLT) covering more than a hundred languages and supporting a number of language technology tools that are useful for language communities.

Reusing a Multi-lingual Setup to Bootstrap a Grammar Checker for a Very Low Resource Language without Data
Inga Lill Sigga Mikkelsen | Linda Wiechetek | Flammie A Pirinen
Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages

Grammar checkers are needed for digital language survival. Very low resource languages like Lule Sámi, with fewer than 3,000 speakers, need these tools urgently, but do not have the big corpus data required to build machine-learning tools. We present a rule-based tool and a workflow in which work done for a related language speeds up the process. We use an existing grammar to infer rules for the new language; instead of a large gold corpus of annotated grammar errors, a smaller corpus of regression tests is built while developing the tool. We present a test case for Lule Sámi reusing resources from North Sámi, show how we achieve a categorisation of the most frequent errors, and present a preliminary evaluation of the system. We hope this serves as an inspiration for small languages that need advanced tools in a limited amount of time, but do not have big data.

Building an Extremely Low Resource Language to High Resource Language Machine Translation System from Scratch
Flammie Pirinen | Linda Wiechetek
Proceedings of the 18th Conference on Natural Language Processing (KONVENS 2022)


No more fumbling in the dark - Quality assurance of high-level NLP tools in a multi-lingual infrastructure
Linda Wiechetek | Flammie A Pirinen | Børre Gaup | Thomas Omma
Proceedings of the Seventh International Workshop on Computational Linguistics of Uralic Languages

Rules Ruling Neural Networks - Neural vs. Rule-Based Grammar Checking for a Low Resource Language
Linda Wiechetek | Flammie Pirinen | Mika Hämäläinen | Chiara Argese
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

We investigate both rule-based and machine learning methods for the task of compound error correction and evaluate their efficiency for North Sámi, a low resource language. The lack of error-free data needed for a neural approach is a challenge to the development of these tools, one that bigger languages do not share. To compensate for that, we used a rule-based grammar checker to remove erroneous sentences and inserted compound errors by splitting correct compounds. We describe how we set up the error detection rules, and how we train a bi-RNN-based neural network. The precision of the rule-based model tested on a corpus with real errors (81.0%) is slightly better than that of the neural model (79.4%). The rule-based model is also more flexible with regard to fixing specific errors requested by the user community. However, the neural model has a better recall (98%). The results suggest that an approach combining the advantages of both models would be desirable in the future. Our tools and data sets are open-source and freely available on GitHub and Zenodo.
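The synthetic error insertion described in the abstract (splitting correct compounds to create training errors) can be sketched roughly as follows. This is an illustrative Python sketch, not the authors' actual pipeline: the lexicon of compounds and the splitting probability are assumptions, and the example words are English placeholders rather than North Sámi.

```python
import random

def split_compounds(sentence, compound_parts, p=0.5):
    """Insert synthetic compound errors: with probability p, replace a
    known compound by its constituent parts (an erroneous split)."""
    out = []
    for token in sentence.split():
        parts = compound_parts.get(token)
        if parts and random.random() < p:
            out.extend(parts)   # e.g. "sunglasses" -> "sun glasses"
        else:
            out.append(token)
    return " ".join(out)

# Hypothetical lexicon mapping compounds to their constituents; in the
# paper's setting this information would come from a morphological analyser.
compound_parts = {"sunglasses": ("sun", "glasses")}

print(split_compounds("he wore sunglasses today", compound_parts, p=1.0))
# with p=1.0 every listed compound is split: "he wore sun glasses today"
```

Pairing each corrupted sentence with its original yields the error-annotated parallel data that a neural correction model can be trained on.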


Morphological Disambiguation of South Sámi with FSTs and Neural Networks
Mika Hämäläinen | Linda Wiechetek
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

We present a method for conducting morphological disambiguation for South Sámi, which is an endangered language. Our method uses an FST-based morphological analyzer to produce an ambiguous set of morphological readings for each word in a sentence. These readings are disambiguated with a Bi-RNN model trained on the related North Sámi UD Treebank and some synthetically generated South Sámi data. The disambiguation is done on the level of morphological tags, ignoring word forms and lemmas; this makes it possible to use North Sámi training data for South Sámi without the need for a bilingual dictionary or aligned word embeddings. Our approach requires only minimal resources for South Sámi, which makes it usable and applicable in the context of other endangered languages as well.
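The core idea of tag-level disambiguation can be illustrated with a small sketch: each token comes with a set of candidate tag readings from the analyser, and a sentence-level scorer picks one reading per token. In the paper the scorer is a Bi-RNN trained on North Sámi; here it is a hand-written toy function, and the tag strings and scores below are purely illustrative assumptions.

```python
from itertools import product

def disambiguate(readings, score):
    """Pick one morphological reading per token by maximising a
    sentence-level score over full tag sequences (brute force for
    clarity; a real system would decode more efficiently).
    readings: per-token tuples of candidate tag strings.
    score:    maps a full tag sequence to a number."""
    best = max(product(*readings), key=score)
    return list(best)

# Toy example: two ambiguous tokens, scored by a hand-written preference
# for a noun reading followed by a finite-verb reading.
readings = [("N+Sg+Nom", "V+Inf"), ("V+Prs+Sg3", "N+Pl+Nom")]
prefer = {("N+Sg+Nom", "V+Prs+Sg3"): 2.0}
print(disambiguate(readings, lambda seq: prefer.get(seq, 0.0)))
# → ['N+Sg+Nom', 'V+Prs+Sg3']
```

Because only tag strings enter the scorer, the same scoring model can be applied to any language whose analyser emits the same tag inventory, which is what makes the cross-lingual transfer in the paper possible.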


Is this the end? Two-step tokenization of sentence boundaries
Linda Wiechetek | Sjur Nørstebø Moshagen | Thomas Omma
Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages

Seeing more than whitespace — Tokenisation and disambiguation in a North Sámi grammar checker
Linda Wiechetek | Sjur Nørstebø Moshagen | Kevin Brubeck Unhammer
Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)


Reusing Grammatical Resources for New Languages
Lene Antonsen | Trond Trosterud | Linda Wiechetek
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Grammatical approaches to language technology are often considered less optimal than statistical approaches in multilingual settings, where large-scale portability becomes an important issue. The present paper argues that there is a notable gain in reusing grammatical resources when porting technology to new languages. The pivot language is North Sámi, and the paper discusses portability with respect to the closely related Lule and South Sámi, and to the unrelated Faroese and Greenlandic languages.


Developing Prototypes for Machine Translation between Two Sami Languages
Francis M. Tyers | Linda Wiechetek | Trond Trosterud
Proceedings of the 13th Annual conference of the European Association for Machine Translation