Flammie A. Pirinen

Also published as: Flammie Pirinen, Flammie A Pirinen, Tommi Pirinen, Tommi A Pirinen, Tommi A. Pirinen


2025

Proceedings of the 9th Workshop on Constraint Grammar and Finite State NLP
Trond Trosterud | Linda Wiechetek | Flammie Pirinen

Divvunspell—Finite-State Spell-Checking and Correction on Modern Platforms
Flammie A Pirinen | Sjur Nørstebø Moshagen
Proceedings of the 9th Workshop on Constraint Grammar and Finite State NLP

Spell-checking and correction is one of the key applications of natural language support. Historically, for the biggest, less morphologically complex languages, spell-checking and correction could be implemented by relatively simple means; however, for morphologically complex and low-resource languages, the solutions were often suboptimal. Finite-state methods are the state of the art in rule-based natural language processing, and they have also been used effectively for spell-checking and correction. In this article, we show some recent developments of a finite-state spell-checker implementation that works with modern operating systems and platforms.
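
As a rough illustration of the detect-then-correct task the abstract describes, the sketch below checks a word against a lexicon and proposes corrections within one edit. It is a plain set-based stand-in, not a finite-state implementation and not Divvunspell's API; the names `edits1`, `suggest` and `LEXICON` and the toy English word list are assumptions made for this example.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """Return all strings one edit (delete, swap, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + swaps + replaces + inserts)


def suggest(word, lexicon):
    """Accept a known word as-is; otherwise list lexicon words one edit away."""
    if word in lexicon:
        return [word]
    return sorted(edits1(word) & lexicon)


# Hypothetical toy lexicon; a real system would use a full language model.
LEXICON = {"cat", "cart", "care", "car"}

print(suggest("caat", LEXICON))
```

A finite-state system would typically encode both the lexicon and the error model as automata rather than enumerating candidate strings, which matters for morphologically complex languages whose word lists cannot be listed exhaustively.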

Exploring Limitations and Risks of LLM-Based Grammatical Error Correction for Indigenous Languages
Flammie A Pirinen | Linda Wiechetek
Proceedings of the Eight Workshop on the Use of Computational Methods in the Study of Endangered Languages

Rule-based grammatical error correction has long been seen as the most effective way to create user-friendly end-user systems for grammatical error correction (GEC). However, in recent years large language models and generative AI systems based on that technology have progressed fast enough to challenge the traditional GEC approach. In this article we show which possibilities and limitations this approach bears for Indigenous languages that have a more limited digital presence in large language model data and a different literacy background than English. We show experiments in North Sámi, an Indigenous language of Northern Europe.

Proceedings of the Eighth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2025)
Atul Kr. Ojha | Chao-hong Liu | Ekaterina Vylomova | Flammie Pirinen | Jonathan Washington | Nathaniel Oco | Xiaobing Zhao

How to Create Treebanks without Human Annotators – An Indigenous Language Grammar Checker for Treebank Construction
Linda Wiechetek | Flammie A Pirinen | Maja Lisa Kappfjell
Proceedings of the 23rd International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2025)

Creating treebanks for low resource languages is an important task. However, low resource Indigenous language contexts have not only limited resources in terms of text data, but also limited human resources that are available for linguistic annotation. We suggest a work-around by applying a Constraint Grammar operated rule-based dependency parser to do the work of creating a marked-up treebank. However, due to a lot of noise, meaning spelling and grammatical errors in South Sámi written texts, this tool often fails to create complete and correct trees. As a fix to this, we created a grammar checking tool for the most common South Sámi grammatical error types, which improves the quality of the dependency parser significantly. As both literacy and normative standards for most Indigenous languages are much more recent than for majority languages, spelling and grammatical variation and errors are a common source of noise, and the application of a correction tool like ours can be useful in the construction of treebanks for these languages.

2024

Proceedings of the 9th International Workshop on Computational Linguistics for Uralic Languages
Mika Hämäläinen | Flammie Pirinen | Melany Macias | Mario Crespo Avila

Keeping Up Appearances—or how to get all Uralic languages included into bleeding edge research and software: generate, convert, and LLM your way into multilingual datasets
Flammie A Pirinen
Proceedings of the 9th International Workshop on Computational Linguistics for Uralic Languages

The current trends in natural language processing strongly favor large language models and generative AIs as the basis for everything. For Uralic languages that are not widely present in publicly available data on the Internet, this can be problematic. In the current computational linguistic scene, it is very important to have representation of your language in popular datasets. Languages that are included in well-known datasets are also included in shared tasks, products by large technology corporations, and so forth. This inclusion will become especially important for under-resourced, under-studied minority, and Indigenous languages, which will otherwise be easily forgotten. In this article, we present the resources that are often deemed necessary for the digital presence of a language in today's large-language-model-obsessed world. We show that there are methods and tricks available to alleviate the problems with a lack of data and a lack of creators and annotators of the data, some more successful than others.

Proceedings of the Seventh Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2024)
Atul Kr. Ojha | Chao-hong Liu | Ekaterina Vylomova | Flammie Pirinen | Jade Abbott | Jonathan Washington | Nathaniel Oco | Valentin Malykh | Varvara Logacheva | Xiaobing Zhao

The Ethical Question – Use of Indigenous Corpora for Large Language Models
Linda Wiechetek | Flammie Pirinen | Maja Lisa Kappfjell | Trond Trosterud | Børre Gaup | Sjur Nørstebø Moshagen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Creating language technology based on language data has become very popular with the recent advances of large language models and neural network technologies. This makes language resources very valuable, and especially in the case of indigenous languages, the scarce resources are even more precious. Given the good results of simply fetching everything you can from the internet and feeding it to neural networks in English, there has been more work on doing the same for all languages. However, indigenous language resources as they are on the web are not comparable in that they would encode the most recent normativised language in all its aspects. The problem is compounded when the people who work on the models do not understand the texts that are input to or output by them. Corpora also have intellectual property rights and copyrights that are not respected. Furthermore, the web is filled with language-model-generated texts. In this article we describe an ethical and sustainable way to work with indigenous languages.

2023

A Manual Evaluation Method of Neural MT for Indigenous Languages
Linda Wiechetek | Flammie A. Pirinen | Per E Kummervold
Proceedings of the 3rd Workshop on Human Evaluation of NLP Systems

Indigenous language expertise is not encoded in written text in the same way as it is for languages that have a long literary tradition. In many cases it is, on the contrary, mostly conserved orally. Therefore the evaluation of neural MT systems solely based on an algorithm learning from written texts is not adequate to measure the quality of a system that is used by the language community. Extensive use of tools based on a large amount of non-native language can even contribute to language change in a way that is not desired by the language community. It can also pollute the internet with automatically created texts that outweigh native texts. We propose a manual evaluation method focusing on flow and content separately, and additionally we use existing rule-based NLP to evaluate other factors such as spelling, grammar and grammatical richness. Our main conclusion is that language expertise of a native speaker is necessary to properly evaluate a given system. We test the method by manually evaluating two neural MT tools for an indigenous low resource language. We present an experiment on two different neural translations to and from North Sámi, an indigenous language of Northern Europe.

Proceedings of the Sixth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2023)
Atul Kr. Ojha | Chao-hong Liu | Ekaterina Vylomova | Flammie Pirinen | Jade Abbott | Jonathan Washington | Nathaniel Oco | Valentin Malykh | Varvara Logacheva | Xiaobing Zhao

Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages
Mika Hämäläinen | Emily Öhman | Flammie Pirinen | Khalid Alnajjar | So Miyagawa | Yuri Bizzoni | Niko Partanen | Jack Rueter

GiellaLT — a stable infrastructure for Nordic minority languages and beyond
Flammie A Pirinen | Sjur N. Moshagen | Katri Hiovain-Asikainen
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

Long-term language technology infrastructures are critical for continued maintenance of language-technology-based software that is used to support the use of languages in the digital world. In the Nordic area we have languages ranging from well-resourced national majority languages like Norwegian, Swedish and Finnish to minoritised, unresourced and indigenous languages like the Sámi languages. We present an infrastructure that has been built over more than 20 years and that supports building language technology and tools for most of the Nordic languages as well as many languages all over the world, with a focus on Sámi and other indigenous, minoritised and unresourced languages. We show that one common infrastructure can be used to build tools from keyboards and spell-checkers to machine translators, grammar checkers and text-to-speech as well as automatic speech recognition.

2022

Reusing a Multi-lingual Setup to Bootstrap a Grammar Checker for a Very Low Resource Language without Data
Inga Lill Sigga Mikkelsen | Linda Wiechetek | Flammie A Pirinen
Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages

Grammar checkers (GEC) are needed for digital language survival. Very low resource languages like Lule Sámi, with fewer than 3,000 speakers, need to hurry to build these tools, but do not have the big corpus data that are required for the construction of machine learning tools. We present a rule-based tool and a workflow where the work done for a related language can speed up the process. We use an existing grammar to infer rules for the new language, and we do not need a large gold corpus of annotated grammar errors, but a smaller corpus of regression tests is built while developing the tool. We present a test case for Lule Sámi reusing resources from North Sámi, show how we achieve a categorisation of the most frequent errors, and present a preliminary evaluation of the system. We hope this serves as an inspiration for small languages that need advanced tools in a limited amount of time, but do not have big data.

Building an Extremely Low Resource Language to High Resource Language Machine Translation System from Scratch
Flammie A Pirinen | Linda Wiechetek
Proceedings of the 18th Conference on Natural Language Processing (KONVENS 2022)

Proceedings of the Fifth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2022)
Atul Kr. Ojha | Chao-Hong Liu | Ekaterina Vylomova | Jade Abbott | Jonathan Washington | Nathaniel Oco | Tommi A Pirinen | Valentin Malykh | Varvara Logacheva | Xiaobing Zhao

Unmasking the Myth of Effortless Big Data - Making an Open Source Multi-lingual Infrastructure and Building Language Resources from Scratch
Linda Wiechetek | Katri Hiovain-Asikainen | Inga Lill Sigga Mikkelsen | Sjur N. Moshagen | Flammie A. Pirinen | Trond Trosterud | Børre Gaup
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Machine learning (ML) approaches have dominated NLP during the last two decades. From machine translation and speech technology, ML tools are now also in use for spellchecking and grammar checking, with a blurry distinction between the two. We unmask the myth of effortless big data by illuminating the efforts and time that lie behind building a multi-purpose corpus with regard to collecting, mark-up and building from scratch. We also discuss what kind of language technology minority languages actually need, and to what extent the dominating paradigm has been able to deliver these tools. In this context we present our alternative to corpus-based language technology, which is knowledge-based language technology, and we show how this approach can provide language technology solutions for languages that are outside the reach of machine learning procedures. We present a stable and mature infrastructure (GiellaLT) containing more than a hundred languages and building a number of language technology tools that are useful for language communities.

2021

Proceedings of the Seventh International Workshop on Computational Linguistics of Uralic Languages
Flammie A Pirinen | Timofey Arhangelskiy | Trond Trosterud | Michael Rießler

No more fumbling in the dark - Quality assurance of high-level NLP tools in a multi-lingual infrastructure
Linda Wiechetek | Flammie A Pirinen | Børre Gaup | Thomas Omma
Proceedings of the Seventh International Workshop on Computational Linguistics of Uralic Languages

Rules Ruling Neural Networks - Neural vs. Rule-Based Grammar Checking for a Low Resource Language
Linda Wiechetek | Flammie A Pirinen | Mika Hämäläinen | Chiara Argese
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

We investigate both rule-based and machine learning methods for the task of compound error correction and evaluate their efficiency for North Sámi, a low resource language. The lack of error-free data needed for a neural approach is a challenge to the development of these tools, which is not shared by bigger languages. In order to compensate for that, we used a rule-based grammar checker to remove erroneous sentences and insert compound errors by splitting correct compounds. We describe how we set up the error detection rules, and how we train a bi-RNN based neural network. The precision of the rule-based model tested on a corpus with real errors (81.0%) is slightly better than the neural model (79.4%). The rule-based model is also more flexible with regard to fixing specific errors requested by the user community. However, the neural model has a better recall (98%). The results suggest that an approach that combines the advantages of both models would be desirable in the future. Our tools and data sets are open-source and freely available on GitHub and Zenodo.
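
The data-generation step mentioned in the abstract, inserting compound errors by splitting correct compounds, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the `SEGMENTATIONS` table and the English placeholder word are hypothetical, and a real setup would take compound boundaries from a morphological analyser.

```python
import random


def split_compound(word, parts, rng):
    """Insert one erroneous space at a randomly chosen compound boundary."""
    # Leave the word untouched if it is not a compound or the parts do not match.
    if len(parts) < 2 or "".join(parts) != word:
        return word
    cut = rng.randrange(1, len(parts))
    return "".join(parts[:cut]) + " " + "".join(parts[cut:])


def make_pair(tokens, segmentations, rng):
    """Return (noisy, clean) sentences for training an error-correction model."""
    noisy = [split_compound(t, segmentations.get(t, [t]), rng) for t in tokens]
    return " ".join(noisy), " ".join(tokens)


# Hypothetical segmentation table with an English placeholder compound.
SEGMENTATIONS = {"sunflower": ["sun", "flower"]}

rng = random.Random(0)
noisy, clean = make_pair(["the", "sunflower", "grows"], SEGMENTATIONS, rng)
print(noisy)
print(clean)
```

Pairs like these give a neural model erroneous input with a known correct target, which is how synthetic errors can compensate for the lack of annotated error data.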

Vowel Harmony Viewed as Error-Correcting Code
Yvo Meeres | Tommi A Pirinen
Proceedings of the Society for Computation in Linguistics 2021

Numerals and what counts
Jack Rueter | Niko Partanen | Flammie A. Pirinen
Proceedings of the Fifth Workshop on Universal Dependencies (UDW, SyntaxFest 2021)

2020

Suoidne-varra-bleahkka-mála-bihkka-senet-dielku ‘hay-blood-ink-paint-tar-mustard-stain’ - Should compounds be lexicalized in NLP?
Linda Wiechetek | Chiara Argese | Tommi A Pirinen | Trond Trosterud
Proceedings of the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020)

Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages
Tommi A. Pirinen | Francis M. Tyers | Michael Rießler

Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages
Alina Karakanta | Atul Kr. Ojha | Chao-Hong Liu | Jade Abbott | John Ortega | Jonathan Washington | Nathaniel Oco | Surafel Melaku Lakew | Tommi A Pirinen | Valentin Malykh | Varvara Logacheva | Xiaobing Zhao

An Unsupervised Method for Weighting Finite-state Morphological Analyzers
Amr Keleg | Francis M. Tyers | Nicholas Howell | Tommi A. Pirinen
Proceedings of the Twelfth Language Resources and Evaluation Conference

Morphological analysis is one of the tasks that have been studied for years. Different techniques have been used to develop models for performing morphological analysis. Models based on finite state transducers have proved to be more suitable for languages with low available resources. In this paper, we have developed a method for weighting a morphological analyzer built using finite state transducers in order to disambiguate its results. The method is based on a word2vec model that is trained in a completely unsupervised way using raw untagged corpora and is able to capture the semantic meaning of the words. Most of the methods used for disambiguating the results of a morphological analyzer have relied on tagged corpora that need to be manually built. Additionally, the method developed uses information about the token irrespective of its context, unlike most of the other techniques, which rely heavily on the word’s context to disambiguate its set of candidate analyses.

2019

Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages
Tommi A. Pirinen | Heiki-Jaan Kaalep | Francis M. Tyers

Neural and rule-based Finnish NLP models—expectations, experiments and experiences
Tommi A Pirinen
Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages

Apertium-fin-eng–Rule-based Shallow Machine Translation for WMT 2019 Shared Task
Tommi A Pirinen
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

In this paper we describe a rule-based, bi-directional machine translation system for the Finnish-English language pair. The baseline system was based on the existing data of FinnWordNet, omorfi and apertium-eng. We have built the disambiguation, lexical selection and translation rules by hand. The dictionaries and rules have been developed based on the shared task data. We describe in this article the use of the shared task data as a kind of test-driven development workflow in RBMT development, and show that it fits well into a modern software engineering continuous integration workflow for RBMT and yields big increases in BLEU scores with minimal effort.

Workflows for kickstarting RBMT in virtually No-Resource Situation
Tommi A Pirinen
Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages

Building minority dependency treebanks, dictionaries and computational grammars at the same time—an experiment in Karelian treebanking
Tommi A Pirinen
Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019)

2018

Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages
Tommi A. Pirinen | Michael Rießler | Jack Rueter | Trond Trosterud | Francis M. Tyers

2017

North-Sámi to Finnish rule-based machine translation system
Ryan Johnson | Tommi A Pirinen | Tiina Puolakainen | Francis Tyers | Trond Trosterud | Kevin Unhammer
Proceedings of the 21st Nordic Conference on Computational Linguistics

Proceedings of the Third Workshop on Computational Linguistics for Uralic Languages
Francis M. Tyers | Michael Rießler | Tommi A. Pirinen | Trond Trosterud

2015

Abu-MaTran: Automatic building of Machine Translation
Antonio Toral | Tommi A Pirinen | Andy Way | Gema Ramírez-Sánchez | Sergio Ortiz Rojas | Raphael Rubino | Miquel Esplà | Mikel Forcada | Vassilis Papavassiliou | Prokopis Prokopidis | Nikola Ljubešić
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

Omorfi — Free and open source morphological lexical database for Finnish
Tommi A Pirinen
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

Abu-MaTran at WMT 2015 Translation Task: Morphological Segmentation and Web Crawling
Raphael Rubino | Tommi Pirinen | Miquel Esplà-Gomis | Nikola Ljubešić | Sergio Ortiz-Rojas | Vassilis Papavassiliou | Prokopis Prokopidis | Antonio Toral
Proceedings of the Tenth Workshop on Statistical Machine Translation

Abu-MaTran: Automatic building of Machine Translation
Antonio Toral | Tommi A. Pirinen | Andy Way | Gema Ramírez-Sánchez | Sergio Ortiz Rojas | Raphael Rubino | Miquel Esplà | Mikel L. Forcada | Vassilis Papavassiliou | Prokopis Prokopidis | Nikola Ljubešić
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

2014

Extrinsic evaluation of web-crawlers in machine translation: a study on Croatian-English for the tourism domain
Antonio Toral | Raphael Rubino | Miquel Esplà-Gomis | Tommi Pirinen | Andy Way | Gema Ramírez-Sánchez
Proceedings of the 17th Annual Conference of the European Association for Machine Translation

Heuristic Hyper-minimization of Finite State Lexicons
Senka Drobac | Krister Lindén | Tommi A Pirinen | Miikka Silfverberg
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Flag diacritics, which are special multi-character symbols executed at runtime, enable optimising finite-state networks by combining identical sub-graphs of their transition graphs. Traditionally, the feature has required linguists to devise the optimisations to the graph by hand alongside the morphological description. In this paper, we present a novel method for discovering flag positions in morphological lexicons automatically, based on the morpheme structure implicit in the language description. With this approach, we have achieved a significant decrease in the size of finite-state networks while maintaining reasonable application speed. The algorithm can be applied to any language description, and the biggest gains are expected in large and complex morphologies. We obtained the most noticeable reduction in size with a morphological transducer for Greenlandic, whose original size is on average about 15 times larger than other morphologies. With the presented hyper-minimization method, the transducer is reduced to 10.1% of the original size, with lookup speed decreased only by 9.5%.

2013

Building an Open-Source Development Infrastructure for Language Technology Projects
Sjur N. Moshagen | Tommi A Pirinen | Trond Trosterud
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

2012

Effect of Language and Error Models on Efficiency of Finite-State Spell-Checking and Correction
Tommi A Pirinen | Sam Hardwick
Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing

2011

Modularisation of Finnish Finite-State Language Description – Towards Wide Collaboration in Open Source Development of a Morphological Analyser
Tommi A Pirinen
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)

2009

Weighted Finite-State Morphological Analysis of Finnish Compounding with HFST-LEXC
Krister Lindén | Tommi Pirinen
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)