Kaja Dobrovoljc


2020

Gigafida 2.0: The Reference Corpus of Written Standard Slovene
Simon Krek | Špela Arhar Holdt | Tomaž Erjavec | Jaka Čibej | Andraz Repar | Polona Gantar | Nikola Ljubešić | Iztok Kosem | Kaja Dobrovoljc
Proceedings of the 12th Language Resources and Evaluation Conference

We describe a new version of the Gigafida reference corpus of Slovene. In addition to updating the corpus with new material and annotating it with better tools, the upgrade focused on transforming it from a general reference corpus, which contains all language variants including non-standard language, into a corpus of standard (written) Slovene. This decision became feasible because new corpora dedicated specifically to non-standard language have recently emerged. In the new version, the whole Gigafida corpus was deduplicated for the first time, which facilitates automatic extraction of data for compiling new lexicographic resources such as the collocations dictionary and the thesaurus of Slovene.

2019

What does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian
Nikola Ljubešić | Kaja Dobrovoljc
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing

We present experiments on Slovenian, Croatian and Serbian morphosyntactic annotation and lemmatisation, comparing the former state of the art for these three languages with one of the best-performing systems at the CoNLL 2018 shared task, the Stanford NLP neural pipeline. Our experiments show significant improvements in morphosyntactic annotation, especially on categories that require either semantic knowledge, which is available through word embeddings, or the modelling of long-range dependencies. On the other hand, the neural solution yields no improvements in lemmatisation, mostly because the task depends heavily on lookup in an external lexicon, but also because the Stanford NLP pipeline’s lemmatisation leaves obvious room for improvement.

Annotating formulaic sequences in spoken Slovenian: structure, function and relevance
Kaja Dobrovoljc
Proceedings of the 13th Linguistic Annotation Workshop

This paper presents the identification of formulaic sequences in the reference corpus of spoken Slovenian and their annotation in terms of syntactic structure, pragmatic function and lexicographic relevance. The annotation campaign, specific in terms of setting, subjectivity and the multifunctionality of items under investigation, resulted in a preliminary lexicon of formulaic sequences in spoken Slovenian with immediate potential for future explorations in formulaic language research. This is especially relevant for the notable number of identified multi-word expressions with discourse-structuring and stance-marking functions, which have often been overlooked by traditional phraseology research.

Improving UD processing via satellite resources for morphology
Kaja Dobrovoljc | Tomaž Erjavec | Nikola Ljubešić
Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019)

2018

Er ... well, it matters, right? On the role of data representations in spoken language dependency parsing
Kaja Dobrovoljc | Matej Martinc
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

Despite the significant improvement of data-driven dependency parsing systems in recent years, they still perform considerably worse on spoken language data than on written data. Taking the Spoken Slovenian Treebank, the first spoken data treebank using the UD annotation scheme, as a case study, we investigate which speech-specific phenomena undermine parsing performance through a series of training data and treebank modification experiments with two distinct state-of-the-art parsing systems. Our results show that utterance segmentation is the most prominent cause of low parsing performance, both in parsing raw and pre-segmented transcriptions. Besides performing better on shorter utterances, both parsers perform better on normalized transcriptions that include basic markers of prosody and exclude disfluencies, discourse markers and fillers. On the other hand, the effects of adding written training data and of speech-specific dependency representations largely depend on the parsing system selected.

2017

The Universal Dependencies Treebank for Slovenian
Kaja Dobrovoljc | Tomaž Erjavec | Simon Krek
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

This paper introduces the Universal Dependencies Treebank for Slovenian. We give an overview of the existing dependency treebanks for Slovenian and then detail the conversion of the ssj200k treebank to the framework of Universal Dependencies version 2. We explain the mapping of part-of-speech categories, morphosyntactic features, and dependency relations, focusing on the more problematic language-specific issues. We conclude with a quantitative overview of the treebank and directions for further work.

2016

The Universal Dependencies Treebank of Spoken Slovenian
Kaja Dobrovoljc | Joakim Nivre
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents the construction of an open-source dependency treebank of spoken Slovenian, the first syntactically annotated collection of spontaneous speech in Slovenian. The treebank has been manually annotated using the Universal Dependencies annotation scheme, a one-layer syntactic annotation scheme with a high degree of cross-modality, cross-framework and cross-language interoperability. In this original application of the scheme to spoken language transcripts, we address a wide spectrum of syntactic particularities in speech, either by extending the scope of application of existing universal labels or by proposing new speech-specific extensions. The initial analysis of the resulting treebank and its comparison with the written Slovenian UD treebank confirm significant syntactic differences between the two language modalities, with spoken data consisting of shorter and more elliptic sentences, fewer and simpler nominal phrases, and more relations marking disfluencies, interaction, deixis and modality.

2014

Cross-lingual Dependency Parsing of Related Languages with Rich Morphosyntactic Tagsets
Željko Agić | Jörg Tiedemann | Danijela Merkler | Simon Krek | Kaja Dobrovoljc | Sara Može
Proceedings of the EMNLP’2014 Workshop on Language Technology for Closely Related Languages and Language Variants