Siim Orasmaa

2020

pdf bib abs
EstNLTK 1.6: Remastered Estonian NLP Pipeline
Sven Laur | Siim Orasmaa | Dage Särg | Paul Tammo
Proceedings of the 12th Language Resources and Evaluation Conference

The goal of the EstNLTK Python library is to provide a unified programming interface for natural language processing in Estonian. As such, previous versions of the library have been immensely successful both in academic and industrial circles. However, they also contained serious structural limitations – it was hard to add new components and there was a lack of fine-grained control needed for back-end programming. These issues have been explicitly addressed in the EstNLTK library while preserving the intuitive interface for novices. We have remastered the basic NLP pipeline by adding many data cleaning steps that are necessary for analyzing real-life texts, and state of the art components for morphological analysis and fact extraction. Our evaluation on unlabelled data shows that the remastered basic NLP pipeline outperforms both the previous version of the toolkit, as well as neural models of StanfordNLP. In addition, EstNLTK contains a new interface for storing, processing and querying text objects in Postgres database which greatly simplifies processing of large text collections. EstNLTK is freely available under the GNU GPL version 2 license, which is standard for academic software.

2017

pdf bib
Can We Create a Tool for General Domain Event Analysis?
Siim Orasmaa | Heiki-Jaan Kaalep
Proceedings of the 21st Nordic Conference on Computational Linguistics

2016

pdf bib abs
EstNLTK - NLP Toolkit for Estonian
Siim Orasmaa | Timo Petmanson | Alexander Tkachenko | Sven Laur | Heiki-Jaan Kaalep
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Although there are many tools for natural language processing tasks in Estonian, these tools are very loosely interoperable, and it is not easy to build practical applications on top of them. In this paper, we introduce a new Python library for natural language processing in Estonian, which provides unified programming interface for various NLP components. The EstNLTK toolkit provides utilities for basic NLP tasks including tokenization, morphological analysis, lemmatisation and named entity recognition as well as offers more advanced features such as a clause segmentation, temporal expression extraction and normalization, verb chain detection, Estonian Wordnet integration and rule-based information extraction. Accompanied by a detailed API documentation and comprehensive tutorials, EstNLTK is suitable for a wide range of audience. We believe EstNLTK is mature enough to be used for developing NLP-backed systems both in industry and research. EstNLTK is freely available under the GNU GPL version 2+ license, which is standard for academic software.

2014

pdf bib abs
Towards an Integration of Syntactic and Temporal Annotations in Estonian
Siim Orasmaa
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We investigate the question how manually created syntactic annotations can be used to analyse and improve consistency in manually created temporal annotations. Our work introduces an annotation project for Estonian, where temporal annotations in TimeML framework were manually added to a corpus containing gold standard morphological and dependency syntactic annotations. In the first part of our work, we evaluate the consistency of manual temporal annotations, focusing on event annotations. We use syntactic annotations to distinguish different event annotation models, and we observe highest inter-annotator agreements on models representing prototypical events (event verbs and events being part of the syntactic predicate of clause). In the second part of our work, we investigate how to improve consistency between syntactic and temporal annotations. We test on whether syntactic annotations can be used to validate temporal annotations: to find missing or partial annotations. Although the initial results indicate that such validation is promising, we also note that a better bridging between temporal (semantic) and syntactic annotations is needed for a complete automatic validation.

2010

pdf bib abs
Information Retrieval of Word Form Variants in Spoken Language Corpora Using Generalized Edit Distance
Siim Orasmaa | Reina Käärik | Jaak Vilo | Tiit Hennoste
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

An important feature of spoken language corpora is existence of different spelling variants of words in transcription. So there is an important problem for linguist who works with large spoken corpora: how to find all variants of the word without annotating them manually? Our work describes a search engine that enables finding different spelling variants (true positives) from corpus of spoken language, and reduces efficiently the amount of false positives returned during the search. Our search engine uses a generalized variant of the edit distance algorithm that allows defining text-specific string to string transformations in addition to the default edit operations defined in edit distance. We have extended our algorithm with capability to block transformations in specific substrings of search words. User can mark certain regions (blocked regions) of the search word where edit operations are not allowed. Our material comes from the Corpus of Spoken Estonian of the University of Tartu which consists of about 2000 dialogues and texts, about 1.4 million running text units in total.

Co-authors

Timo Petmanson 1

Alexander Tkachenko 1

Dage Särg 1

Paul Tammo 1

Venues

LREC4
WS1