Sami Virpioja


Semiautomatic Speech Alignment for Under-Resourced Languages
Juho Leinonen | Niko Partanen | Sami Virpioja | Mikko Kurimo
Proceedings of the Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia within the 13th Language Resources and Evaluation Conference

Cross-language forced alignment is a solution for linguists who create speech corpora for very low-resource languages. However, the cross-language setting is an additional challenge that makes an already complex task, forced alignment, even more difficult. We study how linguists can impart domain expertise to the task to increase the performance of automatic forced aligners while keeping the time effort lower than with manual alignment. First, we show that speech recognizers have a clear bias toward starting a word later than a human annotator does, which results in micro-pauses that do not exist in manual alignments, and we study the best way to remove these silences automatically. Second, we ask the linguists to simplify the task by providing some manually aligned segments, which split long interview recordings into shorter pieces, and we evaluate the results of this process. We also study how strongly source language performance correlates with target language performance, since it is often easier to find a better source model than to adapt to the target language.

Morfessor-enriched features and multilingual training for canonical morphological segmentation
Aku Rouhe | Stig-Arne Grönroos | Sami Virpioja | Mathias Creutz | Mikko Kurimo
Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

In our submission to the SIGMORPHON 2022 Shared Task on Morpheme Segmentation, we study whether an unsupervised morphological segmentation method, Morfessor, can help in a supervised setting. Previous research has shown the effectiveness of the approach in semi-supervised settings with small amounts of labeled data. The current tasks vary in data size: the amount of word-level annotated training data is much larger, but the amount of sentence-level annotated training data remains small. Our approach is to pre-segment the input data for a neural sequence-to-sequence model with the unsupervised method. As the unsupervised method can be trained with raw text data, we use Wikipedia to increase the amount of training data. In addition, we train multilingual models for the sentence-level task. The results for the Morfessor-enriched features are mixed, showing benefit for all three sentence-level tasks but only some of the word-level tasks. The multilingual training yields considerable improvements over the monolingual sentence-level models, but it negates the effect of the enriched features.


Grapheme-Based Cross-Language Forced Alignment: Results with Uralic Languages
Juho Leinonen | Sami Virpioja | Mikko Kurimo
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

Forced alignment is an effective process to speed up linguistic research. However, most forced aligners are language-dependent, and under-resourced languages rarely have enough resources to train an acoustic model for an aligner. We present a new Finnish grapheme-based forced aligner and demonstrate its performance by aligning multiple Uralic languages and English as an unrelated language. We show that even a simple non-expert created grapheme-to-phoneme mapping can result in useful word alignments.

Boosting Neural Machine Translation from Finnish to Northern Sámi with Rule-Based Backtranslation
Mikko Aulamo | Sami Virpioja | Yves Scherrer | Jörg Tiedemann
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

We consider a low-resource translation task from Finnish into Northern Sámi. Collecting all available parallel data between the languages, we obtain around 30,000 sentence pairs. However, there exists a significantly larger monolingual Northern Sámi corpus, as well as a rule-based machine translation (RBMT) system between the languages. To make the best use of the monolingual data in a neural machine translation (NMT) system, we use the backtranslation approach to create synthetic parallel data from it using both NMT and RBMT systems. Evaluating the results on an in-domain test set and a small out-of-domain set, we find that the RBMT backtranslation outperforms NMT backtranslation clearly for the out-of-domain test set, but also slightly for the in-domain data, for which the NMT backtranslation model provided clearly better BLEU scores than the RBMT. In addition, combining both backtranslated data sets improves the RBMT approach only for the in-domain test set. This suggests that the RBMT system provides general-domain knowledge that cannot be found in the relatively small parallel training data.

The Helsinki submission to the AmericasNLP shared task
Raúl Vázquez | Yves Scherrer | Sami Virpioja | Jörg Tiedemann
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

The University of Helsinki participated in the AmericasNLP shared task for all ten language pairs. Our multilingual NMT models reached the first rank on all language pairs in track 1, and first rank on nine out of ten language pairs in track 2. We focused our efforts on three aspects: (1) the collection of additional data from various sources such as Bibles and political constitutions, (2) the cleaning and filtering of training data with the OpusFilter toolkit, and (3) different multilingual training techniques enabled by the latest version of the OpenNMT-py toolkit to make the most efficient use of the scarce data. This paper describes our efforts in detail.


Controlling the Imprint of Passivization and Negation in Contextualized Representations
Hande Celikkanat | Sami Virpioja | Jörg Tiedemann | Marianna Apidianaki
Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Contextualized word representations encode rich information about syntax and semantics, alongside specificities of each context of use. While contextual variation does not always reflect actual meaning shifts, it can still reduce the similarity of embeddings for word instances having the same meaning. We explore the imprint of two specific linguistic alternations, namely passivization and negation, on the representations generated by neural models trained with two different objectives: masked language modeling and translation. Our exploration methodology is inspired by an approach previously proposed for removing societal biases from word vectors. We show that passivization and negation leave their traces on the representations, and that neutralizing this information leads to more similar embeddings for words that should preserve their meaning in the transformation. We also find clear differences in how the respective features generalize across datasets.

Effects of Language Relatedness for Cross-lingual Transfer Learning in Character-Based Language Models
Mittul Singh | Peter Smit | Sami Virpioja | Mikko Kurimo
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

Character-based Neural Network Language Models (NNLM) have the advantage of smaller vocabulary and thus faster training times in comparison to NNLMs based on multi-character units. However, in low-resource scenarios, both the character and multi-character NNLMs suffer from data sparsity. In such scenarios, cross-lingual transfer has improved multi-character NNLM performance by allowing information transfer from a source to the target language. In the same vein, we propose to use cross-lingual transfer for character NNLMs applied to low-resource Automatic Speech Recognition (ASR). However, applying cross-lingual transfer to character NNLMs is not as straightforward. We observe that relatedness of the source language plays an important role in cross-lingual pretraining of character NNLMs. We evaluate this aspect on ASR tasks for two target languages: Finnish (with English and Estonian as source) and Swedish (with Danish, Norwegian, and English as source). Prior work has observed no difference between using the related or unrelated language for multi-character NNLMs. We, however, show that for character-based NNLMs, only pretraining with a related language improves the ASR performance, and using an unrelated language may deteriorate it. We also observe that the benefits are larger when there is much less target data than source data.

OpusTools and Parallel Corpus Diagnostics
Mikko Aulamo | Umut Sulubacak | Sami Virpioja | Jörg Tiedemann
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper introduces OpusTools, a package for downloading and processing parallel corpora included in the OPUS corpus collection. The package implements tools for accessing compressed data in their archived release format and makes it possible to easily convert between common formats. OpusTools also includes tools for language identification and data filtering as well as tools for importing data from various sources into the OPUS format. We show the use of these tools in parallel corpus creation and data diagnostics. The latter is especially useful for the identification of potential problems and errors in the extensive data set. Using these tools, we can now monitor the validity of data sets and improve the overall quality and consistency of the data collection.

Morfessor EM+Prune: Improved Subword Segmentation with Expectation Maximization and Pruning
Stig-Arne Grönroos | Sami Virpioja | Mikko Kurimo
Proceedings of the Twelfth Language Resources and Evaluation Conference

Data-driven segmentation of words into subword units has been used in various natural language processing applications such as automatic speech recognition and statistical machine translation for almost 20 years. Recently it has become more widely adopted, as models based on deep neural networks often benefit from subword units even for morphologically simpler languages. In this paper, we discuss and compare training algorithms for a unigram subword model, based on the Expectation Maximization algorithm and lexicon pruning. Using English, Finnish, North Sami, and Turkish data sets, we show that this approach is able to find better solutions to the optimization problem defined by the Morfessor Baseline model than its original recursive training algorithm. The improved optimization also leads to higher morphological segmentation accuracy when compared to a linguistic gold standard. We publish implementations of the new algorithms in the widely-used Morfessor software package.
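
As a rough illustration of what a unigram subword model does at segmentation time, the sketch below finds the most probable split of a word given a fixed lexicon of subword probabilities. It is not the EM+Prune training algorithm from the paper and does not use the Morfessor package; the function name `viterbi_segment`, the `max_len` parameter, and the toy lexicon in the usage note are all hypothetical.

```python
import math


def viterbi_segment(word, logprob, max_len=10):
    """Return the highest-probability segmentation of `word` under a
    unigram subword model, or None if no segmentation exists.

    `logprob` maps each subword in the (assumed, already trained)
    lexicon to its log-probability.
    """
    n = len(word)
    # best[i] = best log-probability of any segmentation of word[:i]
    best = [0.0] + [-math.inf] * n
    # back[i] = start index of the last subword in that segmentation
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = word[start:end]
            if piece not in logprob:
                continue
            score = best[start] + logprob[piece]
            if score > best[end]:
                best[end] = score
                back[end] = start
    if best[n] == -math.inf:
        return None
    # Follow the back-pointers from the end of the word to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        pieces.append(word[back[end]:end])
        end = back[end]
    return pieces[::-1]
```

With a toy lexicon such as `{"talo": 0.3, "ssa": 0.3, "ta": 0.1, "lo": 0.1, "s": 0.1, "a": 0.1}` converted to log-probabilities, the Finnish word "talossa" is split into "talo" and "ssa", because that path has a higher product of subword probabilities than any finer split.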

The University of Helsinki and Aalto University submissions to the WMT 2020 news and low-resource translation tasks
Yves Scherrer | Stig-Arne Grönroos | Sami Virpioja
Proceedings of the Fifth Conference on Machine Translation

This paper describes the joint participation of the University of Helsinki and Aalto University in two shared tasks of WMT 2020: the news translation between Inuktitut and English and the low-resource translation between German and Upper Sorbian. For both tasks, our efforts concentrate on efficient use of monolingual and related bilingual corpora with scheduled multi-task learning as well as an optimized subword segmentation with sampling. Our submission obtained the highest score for Upper Sorbian -> German and was ranked second for German -> Upper Sorbian according to BLEU scores. For English–Inuktitut, we reached ranks 8 and 10 out of 11 according to BLEU scores.

OpusFilter: A Configurable Parallel Corpus Filtering Toolbox
Mikko Aulamo | Sami Virpioja | Jörg Tiedemann
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

This paper introduces OpusFilter, a flexible and modular toolbox for filtering parallel corpora. It implements a number of components based on heuristic filters, language identification libraries, character-based language models, and word alignment tools, and it can easily be extended with custom filters. Bitext segments can be ranked according to their quality or domain match using single features or a logistic regression model that can be trained without manually labeled training data. We demonstrate the effectiveness of OpusFilter on the example of a Finnish-English news translation task based on noisy web-crawled training data. Applying our tool leads to improved translation quality while significantly reducing the size of the training data, also clearly outperforming an alternative ranking given in the crawled data set. Furthermore, we show the ability of OpusFilter to perform data selection for domain adaptation.


North Sámi morphological segmentation with low-resource semi-supervised sequence labeling
Stig-Arne Grönroos | Sami Virpioja | Mikko Kurimo
Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages

The University of Helsinki Submissions to the WMT19 News Translation Task
Aarne Talman | Umut Sulubacak | Raúl Vázquez | Yves Scherrer | Sami Virpioja | Alessandro Raganato | Arvi Hurskainen | Jörg Tiedemann
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

In this paper we present the University of Helsinki submissions to the WMT 2019 shared news translation task in three language pairs: English-German, English-Finnish and Finnish-English. This year we focused first on cleaning and filtering the training data using multiple data-filtering approaches, resulting in much smaller and cleaner training sets. For English-German we trained both sentence-level transformer models as well as compared different document-level translation approaches. For Finnish-English and English-Finnish we focused on different segmentation approaches and we also included a rule-based system for English-Finnish.

The University of Helsinki Submissions to the WMT19 Similar Language Translation Task
Yves Scherrer | Raúl Vázquez | Sami Virpioja
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

This paper describes the University of Helsinki Language Technology group’s participation in the WMT 2019 similar language translation task. We trained neural machine translation models for the language pairs Czech <-> Polish and Spanish <-> Portuguese. Our experiments focused on different subword segmentation methods, and in particular on the comparison of a cognate-aware segmentation method, Cognate Morfessor, with character segmentation and unsupervised segmentation methods for which the data from different languages were simply concatenated. We did not observe major benefits from cognate-aware segmentation methods, but further research may be needed to explore larger parts of the parameter space. Character-level models proved to be competitive for translation between Spanish and Portuguese, but they are slower in training and decoding.


New Baseline in Automatic Speech Recognition for Northern Sámi
Juho Leinonen | Peter Smit | Sami Virpioja | Mikko Kurimo
Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages

Cognate-aware morphological segmentation for multilingual neural translation
Stig-Arne Grönroos | Sami Virpioja | Mikko Kurimo
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This article describes the Aalto University entry to the WMT18 News Translation Shared Task. We participate in the multilingual subtrack with a system trained under the constrained condition to translate from English to both Finnish and Estonian. The system is based on the Transformer model. We focus on improving the consistency of morphological segmentation for words that are similar orthographically, semantically, and distributionally; such words include etymological cognates, loan words, and proper names. For this, we introduce Cognate Morfessor, a multilingual variant of the Morfessor method. We show that our approach improves the translation quality particularly for Estonian, which has fewer resources for training the translation model.


Extending hybrid word-character neural machine translation with multi-task learning of morphological analysis
Stig-Arne Grönroos | Sami Virpioja | Mikko Kurimo
Proceedings of the Second Conference on Machine Translation


A Comparative Study of Minimally Supervised Morphological Segmentation
Teemu Ruokolainen | Oskar Kohonen | Kairit Sirts | Stig-Arne Grönroos | Mikko Kurimo | Sami Virpioja
Computational Linguistics, Volume 42, Issue 1 - March 2016

Hybrid Morphological Segmentation for Phrase-Based Machine Translation
Stig-Arne Grönroos | Sami Virpioja | Mikko Kurimo
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers


Tuning Phrase-Based Segmented Translation for a Morphologically Complex Target Language
Stig-Arne Grönroos | Sami Virpioja | Mikko Kurimo
Proceedings of the Tenth Workshop on Statistical Machine Translation

LeBLEU: N-gram-based Translation Evaluation Score for Morphologically Complex Languages
Sami Virpioja | Stig-Arne Grönroos
Proceedings of the Tenth Workshop on Statistical Machine Translation


Morfessor FlatCat: An HMM-Based Method for Unsupervised and Semi-Supervised Learning of Morphology
Stig-Arne Grönroos | Sami Virpioja | Peter Smit | Mikko Kurimo
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Morfessor 2.0: Toolkit for statistical morphological segmentation
Peter Smit | Sami Virpioja | Stig-Arne Grönroos | Mikko Kurimo
Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics

Painless Semi-Supervised Morphological Segmentation using Conditional Random Fields
Teemu Ruokolainen | Oskar Kohonen | Sami Virpioja | Mikko Kurimo
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers


Supervised Morphological Segmentation in a Low-Resource Learning Setting using Conditional Random Fields
Teemu Ruokolainen | Oskar Kohonen | Sami Virpioja | Mikko Kurimo
Proceedings of the Seventeenth Conference on Computational Natural Language Learning


Empirical Comparison of Evaluation Methods for Unsupervised Learning of Morphology
Sami Virpioja | Ville T. Turunen | Sebastian Spiegler | Oskar Kohonen | Mikko Kurimo
Traitement Automatique des Langues, Volume 52, Numéro 2 : Vers la morphologie et au-delà [Toward Morphology and beyond]

Evaluating the effect of word frequencies in a probabilistic generative model of morphology
Sami Virpioja | Oskar Kohonen | Krista Lagus
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)


Applying Morphological Decompositions to Statistical Machine Translation
Sami Virpioja | Jaakko Väyrynen | André Mansikkaniemi | Mikko Kurimo
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

Semi-Supervised Learning of Concatenative Morphology
Oskar Kohonen | Sami Virpioja | Krista Lagus
Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology

Morpho Challenge 2005-2010: Evaluations and Results
Mikko Kurimo | Sami Virpioja | Ville Turunen | Krista Lagus
Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology

Language Identification of Short Text Segments with N-gram Models
Tommi Vatanen | Jaakko J. Väyrynen | Sami Virpioja
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

There are many accurate methods for language identification of long text samples, but identification of very short strings still presents a challenge. This paper studies a language identification task, in which the test samples have only 5-21 characters. We compare two distinct methods that are well suited for this task: a naive Bayes classifier based on character n-gram models, and the ranking method by Cavnar and Trenkle (1994). For the n-gram models, we test several standard smoothing techniques, including the current state-of-the-art, the modified Kneser-Ney interpolation. Experiments are conducted with 281 languages using the Universal Declaration of Human Rights. Advanced language model smoothing techniques improve the identification accuracy and the respective classifiers outperform the ranking method. The higher accuracy is obtained at the cost of larger models and slower classification speed. However, there are several methods to reduce the size of an n-gram model, and our experiments with model pruning show that it provides an easy way to balance the size and the identification accuracy. We also compare the results to the language identifier in Google AJAX Language API, using a subset of 50 languages.
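
The idea of classifying short strings by their character n-grams can be sketched as follows. This is a simplified stand-in, not the paper's setup: it treats character trigrams as independent multinomial features with add-one smoothing and a uniform language prior, instead of the n-gram language models with Kneser-Ney smoothing that the abstract describes. The class name `NgramNaiveBayes` and the helper `char_ngrams` are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return all character n-grams of `text`, padded with spaces."""
    padded = " " * (n - 1) + text + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial naive Bayes over character n-grams (add-one smoothing)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}    # language -> Counter of n-gram counts
        self.totals = {}    # language -> total number of n-grams seen
        self.vocab = set()  # all n-grams seen in any language

    def train(self, lang, texts):
        counter = self.counts.setdefault(lang, Counter())
        for text in texts:
            counter.update(char_ngrams(text.lower(), self.n))
        self.totals[lang] = sum(counter.values())
        self.vocab.update(counter)

    def score(self, lang, text):
        """Log-likelihood of `text` under the model for `lang`."""
        counter = self.counts[lang]
        denom = self.totals[lang] + len(self.vocab)
        return sum(math.log((counter[g] + 1) / denom)
                   for g in char_ngrams(text.lower(), self.n))

    def identify(self, text):
        """Return the language with the highest log-likelihood."""
        return max(self.counts, key=lambda lang: self.score(lang, text))
```

A model is trained by calling `train` once per language with a list of sample sentences, after which `identify` returns the best-scoring language label for a short test string. Add-one smoothing is used here only because it fits in a few lines; the abstract reports that more advanced smoothing improves accuracy.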


Web Augmentation of Language Models for Continuous Speech Recognition of SMS Text Messages
Mathias Creutz | Sami Virpioja | Anna Kovaleva
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

Minimum Bayes Risk Combination of Translation Hypotheses from Alternative Morphological Decompositions
Adrià de Gispert | Sami Virpioja | Mikko Kurimo | William Byrne
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers

Morpho Challenge - Evaluation of algorithms for unsupervised learning of morphology in various tasks and languages
Mikko Kurimo | Sami Virpioja | Ville Turunen | Teemu Hirsimäki
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Demonstration Session


Morphology-aware statistical machine translation based on morphs induced in an unsupervised manner
Sami Virpioja | Jaakko J. Väyrynen | Mathias Creutz | Markus Sadeniemi
Proceedings of Machine Translation Summit XI: Papers