Kairit Sirts


2021

EstBERT: A Pretrained Language-Specific BERT for Estonian
Hasan Tanvir | Claudia Kittask | Sandra Eiche | Kairit Sirts
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

This paper presents EstBERT, a large pretrained transformer-based language-specific BERT model for Estonian. Recent work has evaluated multilingual BERT models on Estonian tasks and found them to outperform the baselines. Still, based on existing studies on other languages, a language-specific BERT model is expected to improve over the multilingual ones. We first describe the EstBERT pretraining process and then present the results of models based on the finetuned EstBERT for multiple NLP tasks, including POS and morphological tagging, dependency parsing, named entity recognition and text classification. The evaluation results show that the models based on EstBERT outperform multilingual BERT models on five tasks out of seven, providing further evidence for the view that training language-specific BERT models is still useful, even when multilingual models are available.

Enhancing Sequence-to-Sequence Neural Lemmatization with External Resources
Kirill Milintsevich | Kairit Sirts
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

We propose a novel hybrid approach to lemmatization that enhances the seq2seq neural model with additional lemmas extracted from an external lexicon or a rule-based system. During training, the enhanced lemmatizer learns both to generate lemmas via a sequential decoder and to copy the lemma characters from the external candidates supplied at run time. Our lemmatizer enhanced with candidates extracted from the Apertium morphological analyzer achieves statistically significant improvements over baseline models that do not use additional lemma information. It achieves an average accuracy of 97.25% on a set of 23 UD languages, which is 0.55% higher than that obtained with the Stanford Stanza model on the same set of languages. We also compare with other methods of integrating external data into lemmatization and show that our enhanced system performs considerably better than a simple lexicon extension method based on the Stanza system, and that it achieves complementary improvements w.r.t. the data augmentation method.

2018

Modeling Composite Labels for Neural Morphological Tagging
Alexander Tkachenko | Kairit Sirts
Proceedings of the 22nd Conference on Computational Natural Language Learning

Neural morphological tagging has been regarded as an extension of the POS tagging task, treating each morphological tag as a monolithic label and ignoring its internal structure. We propose to view morphological tags as composite labels and explicitly model their internal structure in a neural sequence tagger. For this, we explore three different neural architectures and compare their performance with both CRF and simple neural multiclass baselines. We evaluate our models on 49 languages and show that the neural architecture that models the morphological labels as sequences of morphological category values performs significantly better than both baselines, establishing state-of-the-art results in morphological tagging for most languages.

2017

Linear Ensembles of Word Embedding Models
Avo Muromägi | Kairit Sirts | Sven Laur
Proceedings of the 21st Nordic Conference on Computational Linguistics

Idea density for predicting Alzheimer’s disease from transcribed speech
Kairit Sirts | Olivier Piguet | Mark Johnson
Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017)

Idea Density (ID) measures the rate at which ideas or elementary predications are expressed in an utterance or in a text. Lower ID is found to be associated with an increased risk of developing Alzheimer’s disease (AD) (Snowdon et al., 1996; Engelman et al., 2010). ID has been used in two different versions: propositional idea density (PID) counts the expressed ideas and can be applied to any text, while semantic idea density (SID) counts pre-defined information content units and is naturally more applicable to normative domains, such as picture description tasks. In this paper, we develop DEPID, a novel dependency-based method for computing PID, and its version DEPID-R, which makes it possible to exclude repeated ideas, a feature characteristic of AD speech. We conduct the first comparison of automatically extracted PID and SID in the diagnostic classification task on two different AD datasets covering both closed-topic and free-recall domains. While SID performs better on the normative dataset, adding PID leads to a small but significant improvement (+1.7 F-score). On the free-topic dataset, PID performs better than SID, as expected (77.6 vs 72.3 in F-score), but adding the features derived from the word embedding clustering underlying the automatic SID increases the results considerably, leading to an F-score of 84.8.

2016

STransE: a novel embedding model of entities and relationships in knowledge bases
Dat Quoc Nguyen | Kairit Sirts | Lizhen Qu | Mark Johnson
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Neighborhood Mixture Model for Knowledge Base Completion
Dat Quoc Nguyen | Kairit Sirts | Lizhen Qu | Mark Johnson
Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning

A Comparative Study of Minimally Supervised Morphological Segmentation
Teemu Ruokolainen | Oskar Kohonen | Kairit Sirts | Stig-Arne Grönroos | Mikko Kurimo | Sami Virpioja
Computational Linguistics, Volume 42, Issue 1 - March 2016

2015

Query-Based Single Document Summarization Using an Ensemble Noisy Auto-Encoder
Mahmood Yousefi Azar | Kairit Sirts | Diego Mollá Aliod | Len Hamey
Proceedings of the Australasian Language Technology Association Workshop 2015

Do POS Tags Help to Learn Better Morphological Segmentations?
Kairit Sirts | Mark Johnson
Proceedings of the Australasian Language Technology Association Workshop 2015

Improving Topic Coherence with Latent Feature Word Representations in MAP Estimation for Topic Modeling
Dat Quoc Nguyen | Kairit Sirts | Mark Johnson
Proceedings of the Australasian Language Technology Association Workshop 2015

2014

POS induction with distributional and morphological information using a distance-dependent Chinese restaurant process
Kairit Sirts | Jacob Eisenstein | Micha Elsner | Sharon Goldwater
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2013

Minimally-Supervised Morphological Segmentation using Adaptor Grammars
Kairit Sirts | Sharon Goldwater
Transactions of the Association for Computational Linguistics, Volume 1

This paper explores the use of Adaptor Grammars, a nonparametric Bayesian modelling framework, for minimally supervised morphological segmentation. We compare three training methods: unsupervised training, semi-supervised training, and a novel model selection method. In the model selection method, we train unsupervised Adaptor Grammars using an over-articulated metagrammar, then use a small labelled data set to select which potential morph boundaries identified by the metagrammar should be returned in the final output. We evaluate on five languages and show that semi-supervised training provides a boost over unsupervised training, while the model selection method yields the best average results over all languages and is competitive with state-of-the-art semi-supervised systems. Moreover, this method provides the potential to tune performance according to different evaluation metrics or downstream tasks.

2012

A Hierarchical Dirichlet Process Model for Joint Part-of-Speech and Morphology Induction
Kairit Sirts | Tanel Alumäe
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies