Niko Partanen


2023

pdf bib
Working Towards Digital Documentation of Uralic Languages With Open-Source Tools and Modern NLP Methods
Mika Hämäläinen | Jack Rueter | Khalid Alnajjar | Niko Partanen
Proceedings of the Big Picture Workshop

We present our work towards building an infrastructure for documenting endangered languages with the focus on Uralic languages in particular. Our infrastructure consists of tools to write dictionaries so that entries are structured in XML format. These dictionaries are the foundation for rule-based NLP tools such as FSTs. We also work actively towards enhancing these dictionaries and tools by using the latest state-of-the-art neural models by generating training data through rules and lexica

pdf bib
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages
Mika Hämäläinen | Emily Öhman | Flammie Pirinen | Khalid Alnajjar | So Miyagawa | Yuri Bizzoni | Niko Partanen | Jack Rueter
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages

2022

pdf bib
Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities
Mika Hämäläinen | Khalid Alnajjar | Niko Partanen | Jack Rueter
Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities

pdf
Semiautomatic Speech Alignment for Under-Resourced Languages
Juho Leinonen | Niko Partanen | Sami Virpioja | Mikko Kurimo
Proceedings of the Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia within the 13th Language Resources and Evaluation Conference

Cross-language forced alignment is a solution for linguists who create speech corpora for very low-resource languages. However, cross-language is an additional challenge making a complex task, forced alignment, even more difficult. We study how linguists can impart domain expertise to the tasks to increase the performance of automatic forced aligners while keeping the time effort still lower than with manual forced alignment. First, we show that speech recognizers have a clear bias in starting the word later than a human annotator, which results in micro-pauses in the results that do not exist in manual alignments, and study which is the best way to automatically remove these silences. Second, we ask the linguists to simplify the task by splitting long interview audios into shorter lengths by providing some manually aligned segments and evaluating the results of this process. We also study how correlated source language performance is to target language performance, since often it is an easier task to find a better source model than to adapt to the target language.

2021

pdf bib
Keyword spotting for audiovisual archival search in Uralic languages
Nils Hjortnaes | Niko Partanen | Francis M. Tyers
Proceedings of the Seventh International Workshop on Computational Linguistics of Uralic Languages

pdf
Overview of Open-Source Morphology Development for the Komi-Zyrian Language: Past and future
Jack Rueter | Niko Partanen | Mika Hämäläinen | Trond Trosterud
Proceedings of the Seventh International Workshop on Computational Linguistics of Uralic Languages

pdf
Linguistic change and historical periodization of Old Literary Finnish
Niko Partanen | Khalid Alnajjar | Mika Hämäläinen | Jack Rueter
Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021

In this study, we have normalized and lemmatized an Old Literary Finnish corpus using a lemmatization model trained on texts from Agricola. We analyse the error types that occur and appear in different decades, and use word error rate (WER) and different error types as a proxy for measuring linguistic innovation and change. We show that the proposed approach works, and the errors are connected to accumulating changes and innovations, which also results in a continuous decrease in the accuracy of the model. The described error types also guide further work in improving these models, and document the currently observed issues. We also have trained word embeddings for four centuries of lemmatized Old Literary Finnish, which are available on Zenodo.

pdf
Finnish Dialect Identification: The Effect of Audio and Text
Mika Hämäläinen | Khalid Alnajjar | Niko Partanen | Jack Rueter
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Finnish is a language with multiple dialects that not only differ from each other in terms of accent (pronunciation) but also in terms of morphological forms and lexical choice. We present the first approach to automatically detect the dialect of a speaker based on a dialect transcript and transcript with audio recording in a dataset consisting of 23 different dialects. Our results show that the best accuracy is received by combining both of the modalities, as text only reaches to an overall accuracy of 57%, where as text and audio reach to 85%. Our code, models and data have been released openly on Github and Zenodo.

pdf
Never guess what I heard... Rumor Detection in Finnish News: a Dataset and a Baseline
Mika Hämäläinen | Khalid Alnajjar | Niko Partanen | Jack Rueter
Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda

This study presents a new dataset on rumor detection in Finnish language news headlines. We have evaluated two different LSTM based models and two different BERT models, and have found very significant differences in the results. A fine-tuned FinBERT reaches the best overall accuracy of 94.3% and rumor label accuracy of 96.0% of the time. However, a model fine-tuned on Multilingual BERT reaches the best factual label accuracy of 97.2%. Our results suggest that the performance difference is due to a difference in the original training data. Furthermore, we find that a regular LSTM model works better than one trained with a pretrained word2vec model. These findings suggest that more work needs to be done for pretrained models in Finnish language as they have been trained on small and biased corpora.

pdf bib
Proceedings of the Workshop on Natural Language Processing for Digital Humanities
Mika Hämäläinen | Khalid Alnajjar | Niko Partanen | Jack Rueter
Proceedings of the Workshop on Natural Language Processing for Digital Humanities

pdf
Processing M.A. Castrén’s Materials: Multilingual Historical Typed and Handwritten Manuscripts
Niko Partanen | Jack Rueter | Khalid Alnajjar | Mika Hämäläinen
Proceedings of the Workshop on Natural Language Processing for Digital Humanities

The study forms a technical report of various tasks that have been performed on the materials collected and published by Finnish ethnographer and linguist, Matthias Alexander Castrén (1813–1852). The Finno-Ugrian Society is publishing Castrén’s manuscripts as new critical and digital editions, and at the same time different research groups have also paid attention to these materials. We discuss the workflows and technical infrastructure used, and consider how datasets that benefit different computational tasks could be created to further improve the usability of these materials, and also to aid the further processing of similar archived collections. We specifically focus on the parts of the collections that are processed in a way that improves their usability in more technical applications, complementing the earlier work on the cultural and linguistic aspects of these materials. Most of these datasets are openly available in Zenodo. The study points to specific areas where further research is needed, and provides benchmarks for text recognition tasks.

pdf
Neural Morphology Dataset and Models for Multiple Languages, from the Large to the Endangered
Mika Hämäläinen | Niko Partanen | Jack Rueter | Khalid Alnajjar
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

We train neural models for morphological analysis, generation and lemmatization for morphologically rich languages. We present a method for automatically extracting substantially large amount of training data from FSTs for 22 languages, out of which 17 are endangered. The neural models follow the same tagset as the FSTs in order to make it possible to use them as fallback systems together with the FSTs. The source code, models and datasets have been released on Zenodo.

pdf
Detecting Depression in Thai Blog Posts: a Dataset and a Baseline
Mika Hämäläinen | Pattama Patpong | Khalid Alnajjar | Niko Partanen | Jack Rueter
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

We present the first openly available corpus for detecting depression in Thai. Our corpus is compiled by expert verified cases of depression in several online blogs. We experiment with two different LSTM based models and two different BERT based models. We achieve a 77.53% accuracy with a Thai BERT model in detecting depression. This establishes a good baseline for future researcher on the same corpus. Furthermore, we identify a need for Thai embeddings that have been trained on a more varied corpus than Wikipedia. Our corpus, code and trained models have been released openly on Zenodo.

pdf
Lemmatization of Historical Old Literary Finnish Texts in Modern Orthography
Mika Hämäläinen | Niko Partanen | Khalid Alnajjar
Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Texts written in Old Literary Finnish represent the first literary work ever written in Finnish starting from the 16th century. There have been several projects in Finland that have digitized old publications and made them available for research use. However, using modern NLP methods in such data poses great challenges. In this paper we propose an approach for simultaneously normalizing and lemmatizing Old Literary Finnish into modern spelling. Our best model reaches to 96.3% accuracy in texts written by Agricola and 87.7% accuracy in other contemporary out-of-domain text. Our method has been made freely available on Zenodo and Github.

pdf
Numerals and what counts
Jack Rueter | Niko Partanen | Flammie A. Pirinen
Proceedings of the Fifth Workshop on Universal Dependencies (UDW, SyntaxFest 2021)

pdf
Apurinã Universal Dependencies Treebank
Jack Rueter | Marília Fernanda Pereira de Freitas | Sidney Da Silva Facundes | Mika Hämäläinen | Niko Partanen
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

This paper presents and discusses the first Universal Dependencies treebank for the Apurinã language. The treebank contains 76 fully annotated sentences, applies 14 parts-of-speech, as well as seven augmented or new features — some of which are unique to Apurinã. The construction of the treebank has also served as an opportunity to develop finite-state description of the language and facilitate the transfer of open-source infrastructure possibilities to an endangered language of the Amazon. The source materials used in the initial treebank represent fieldwork practices where not all tokens of all sentences are equally annotated. For this reason, establishing regular annotation practices for the entire Apurinã treebank is an ongoing project.

pdf bib
Findings of the VarDial Evaluation Campaign 2021
Bharathi Raja Chakravarthi | Gaman Mihaela | Radu Tudor Ionescu | Heidi Jauhiainen | Tommi Jauhiainen | Krister Lindén | Nikola Ljubešić | Niko Partanen | Ruba Priyadharshini | Christoph Purschke | Eswari Rajagopal | Yves Scherrer | Marcos Zampieri
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects

This paper describes the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2021. The campaign was part of the eighth workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with EACL 2021. Four separate shared tasks were included this year: Dravidian Language Identification (DLI), Romanian Dialect Identification (RDI), Social Media Variety Geolocation (SMG), and Uralic Language Identification (ULI). DLI was organized for the first time and the other three continued a series of tasks from previous evaluation campaigns.

pdf
The Relevance of the Source Language in Transfer Learning for ASR
Nils Hjortnaes | Niko Partanen | Michael Rießler | Francis M. Tyers
Proceedings of the 4th Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

2020

pdf
Speech Recognition for Endangered and Extinct Samoyedic languages
Niko Partanen | Mika Hämäläinen | Tiina Klooster
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation

pdf bib
Ve’rdd. Narrowing the Gap between Paper Dictionaries, Low-Resource NLP and Community Involvement
Khalid Alnajjar | Mika Hämäläinen | Jack Rueter | Niko Partanen
Proceedings of the 28th International Conference on Computational Linguistics: System Demonstrations

We present an open-source online dictionary editing system, Ve′rdd, that offers a chance to re-evaluate and edit grassroots dictionaries that have been exposed to multiple amateur editors. The idea is to incorporate community activities into a state-of-the-art finite-state language description of a seriously endangered minority language, Skolt Sami. Problems involve getting the community to take part in things above the pencil-and-paper level. At times, it seems that the native speakers and the dictionary oriented are lacking technical understanding to utilize the infrastructures which might make their work more meaningful in the future, i.e. multiple reuse of all of their input. Therefore, our system integrates with the existing tools and infrastructures for Uralic language masking the technical complexities behind a user-friendly UI.

pdf
Open-Source Morphology for Endangered Mordvinic Languages
Jack Rueter | Mika Hämäläinen | Niko Partanen
Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)

This document describes shared development of finite-state description of two closely related but endangered minority languages, Erzya and Moksha. It touches upon morpholexical unity and diversity of the two languages and how this provides a motivation for shared open-source FST development. We describe how we have designed the transducers so that they can benefit from existing open-source infrastructures and are as reusable as possible.

pdf bib
A pseudonymisation method for language documentation corpora: An experiment with spoken Komi
Rogier Blokland | Niko Partanen | Michael Rießler
Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages

pdf
On the questions in developing computational infrastructure for Komi-Permyak
Jack Rueter | Niko Partanen | Larisa Ponomareva
Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages

pdf
Towards a Speech Recognizer for Komi, an Endangered and Low-Resource Uralic Language
Nils Hjortnaes | Niko Partanen | Michael Rießler | Francis M. Tyers
Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages

pdf
Improving the Language Model for Low-Resource ASR with Online Text Corpora
Nils Hjortnaes | Timofey Arkhangelskiy | Niko Partanen | Michael Rießler | Francis Tyers
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

In this paper, we expand on previous work on automatic speech recognition in a low-resource scenario typical of data collected by field linguists. We train DeepSpeech models on 35 hours of dialectal Komi speech recordings and correct the output using language models constructed from various sources. Previous experiments showed that transfer learning using DeepSpeech can improve the accuracy of a speech recognizer for Komi, though the error rate remained very high. In this paper we present further experiments with language models created using KenLM from text materials available online. These are constructed from two corpora, one containing literary texts, one for social media content, and another combining the two. We then trained the model using each language model to explore the impact of the language model data source on the speech recognition model. Our results show significant improvements of over 25% in character error rate and nearly 20% in word error rate. This offers important methodological insight into how ASR results can be improved under low-resource conditions: transfer learning can be used to compensate the lack of training data in the target language, and online texts are a very useful resource when developing language models in this context.

pdf bib
A Report on the VarDial Evaluation Campaign 2020
Mihaela Gaman | Dirk Hovy | Radu Tudor Ionescu | Heidi Jauhiainen | Tommi Jauhiainen | Krister Lindén | Nikola Ljubešić | Niko Partanen | Christoph Purschke | Yves Scherrer | Marcos Zampieri
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

This paper presents the results of the VarDial Evaluation Campaign 2020 organized as part of the seventh workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with COLING 2020. The campaign included three shared tasks each focusing on a different challenge of language and dialect identification: Romanian Dialect Identification (RDI), Social Media Variety Geolocation (SMG), and Uralic Language Identification (ULI). The campaign attracted 30 teams who enrolled to participate in one or multiple shared tasks and 14 of them submitted runs across the three shared tasks. Finally, 11 papers describing participating systems are published in the VarDial proceedings and referred to in this report.

pdf
Uralic Language Identification (ULI) 2020 shared task dataset and the Wanca 2017 corpora
Tommi Jauhiainen | Heidi Jauhiainen | Niko Partanen | Krister Lindén
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

This article introduces the Wanca 2017 web corpora from which the sentences written in minor Uralic languages were collected for the test set of the Uralic Language Identification (ULI) 2020 shared task. We describe the ULI shared task and how the test set was constructed using the Wanca 2017 corpora and texts in different languages from the Leipzig corpora collection. We also provide the results of a baseline language identification experiment conducted using the ULI 2020 dataset.

2019

pdf
An OCR system for the Unified Northern Alphabet
Niko Partanen | Michael Rießler
Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages

pdf
Survey of Uralic Universal Dependencies development
Niko Partanen | Jack Rueter
Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019)

pdf
Dialect Text Normalization to Normative Standard Finnish
Niko Partanen | Mika Hämäläinen | Khalid Alnajjar
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

We compare different LSTMs and transformer models in terms of their effectiveness in normalizing dialectal Finnish into the normative standard Finnish. As dialect is the common way of communication for people online in Finnish, such a normalization is a necessary step to improve the accuracy of the existing Finnish NLP tools that are tailored for normative Finnish text. We work on a corpus consisting of dialectal data of 23 distinct Finnish dialects. The best functioning BRNN approach lowers the initial word error rate of the corpus from 52.89 to 5.73.

2018

pdf bib
Dependency Parsing of Code-Switching Data with Cross-Lingual Feature Representations
Niko Partanen | Kyungtae Lim | Michael Rießler | Thierry Poibeau
Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages

pdf
The First Komi-Zyrian Universal Dependencies Treebanks
Niko Partanen | Rogier Blokland | KyungTae Lim | Thierry Poibeau | Michael Rießler
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

Two Komi-Zyrian treebanks were included in the Universal Dependencies 2.2 release. This article contextualizes the treebanks, discusses the process through which they were created, and outlines the future plans and timeline for the next improvements. Special attention is paid to the possibilities of using UD in the documentation and description of endangered languages.

pdf
Analyse syntaxique de langues faiblement dotées à partir de plongements de mots multilingues [Syntactic analysis of under-resourced languages from multilingual word embeddings]
KyungTae Lim | Niko Partanen | Thierry Poibeau
Traitement Automatique des Langues, Volume 59, Numéro 3 : Traitement automatique des langues peu dotées [NLP for Under-Resourced Languages]

pdf
Multilingual Dependency Parsing for Low-Resource Languages: Case Studies on North Saami and Komi-Zyrian
KyungTae Lim | Niko Partanen | Thierry Poibeau
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf
Instant annotations in ELAN corpora of spoken and written Komi, an endangered language of the Barents Sea region
Ciprian Gerstenberger | Niko Partanen | Michael Rießler
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages

pdf
Instant Annotations – Applying NLP Methods to the Annotation of Spoken Language Documentation Corpora
Ciprian Gerstenberger | Niko Partanen | Michael Rießler | Joshua Wilbur
Proceedings of the Third Workshop on Computational Linguistics for Uralic Languages