Kaja Dobrovoljc
Also published as: Kaja Dobrovoljc Zor
2026
DELTA: A Toolkit for Measuring Linguistic Diversity in Dependency-Parsed Corpora
Louis Estève | Kaja Dobrovoljc
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Despite growing interest in measuring linguistic diversity on the one hand and the increasing availability of cross-linguistically comparable parsed corpora on the other, tools for systematically measuring the diversity of specific linguistic phenomena on such data remain limited. To address this gap, we present DELTA, an open-source framework that integrates dependency tree querying with diversity computation, enabling systematic measurement across multiple linguistic levels (e.g., lexis, morphology, syntax) and multiple diversity dimensions (variety, balance, disparity). The pipeline processes CoNLL-U formatted corpora through configurable workflows, treating the format as a general-purpose tabular structure independent of specific annotation conventions. We validate DELTA on the Parallel Universal Dependencies multilingual dataset, demonstrating its capacity for corpus profiling and cross-corpus diversity comparison.
ROG: A Multi-Layer Manually Annotated Corpus of Spoken Slovenian
Kaja Dobrovoljc Zor | Darinka Verdonik | Jaka Čibej | Peter Rupnik | Nikola Ljubešić
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We present ROG, the first manually annotated spoken corpus of Slovenian to integrate morphosyntactic, prosodic, and interactional layers in a unified framework. Building on the pre-existing Spoken Slovenian Treebank (SST) and newly available recordings from the GOS 2 reference corpus, the resource combines over 75,000 words (10 hours) of annotated speech. The entire corpus features lemmatization, MULTEXT-East morphosyntax, and Universal Dependencies annotations, while approximately half includes additional layers for prosodic units, disfluencies, and dialogue acts. All annotation layers are systematically aligned and cross-referenced, enabling detailed multi-dimensional analyses of spoken language. We describe the corpus design, annotation workflow, data release, and baseline modeling results, showcasing the resource’s value for both linguistic analysis and speech-aware NLP model development. All ROG transcriptions and annotations, along with half of the audio recordings, are freely available under CC-BY via (anonymized) repository.
Universal NER v2: Towards a Massively Multilingual Named Entity Recognition Benchmark
Terra Blevins | Stephen Mayhew | Marek Suppa | Hila Gonen | Shachar Mirkin | Vasile Pais | Kaja Dobrovoljc Zor | Voula Giouli | Jun Kevin | Eugene Jang | Eungseo Kim | Jeongyeon Seo | Xenophon Gialis | Yuval Pinter
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We present Universal NER (UNER) v2, a significant extension of the initial version released in 2024. UNER is a collaborative dataset for multilingual named-entity annotations, built to support research on NER methods in a cross-linguistic setting. UNER v2 adds 11 new datasets in 10 typologically varied languages to the resource, including multiple parallel evaluation benchmarks aligned with each other and other datasets in UNER v1, while maintaining the same annotation guidelines and high standards for inter-annotator agreement. We report detailed statistics for the dataset and benchmark UNER v2 using both encoder-based model architectures and LLMs.
A Parallel Cross-Lingual Benchmark for Multimodal Idiomaticity Understanding
Dilara Torunoğlu-Selamet | Doğukan Arslan | Rodrigo Wilkens | Wei He | Doruk Eryiğit | Thomas Pickard | Adriana S. Pagano | Aline Villavicencio | Gülşen Eryiğit | Ágnes Abuczki | Aida Cardoso | Alesia Lazarenka | Dina Almassova | Amália Mendes | Anna Kanellopoulou | Antoni Brosa-Rodriguez | Baiba Valkovska | Beata Wojtowicz | Bolette Pedersen | Carlos Manuel Hidalgo-Ternero | Chaya Liebeskind | Danka Jokić | Diego Alves | Eleni Triantafyllidi | Erik Velldal | Fred Philippy | Giedre Valunaite Oleskeviciene | Ieva Rizgeliene | Inguna Skadina | Irina Lobzhanidze | Isabell Stinessen Haugen | Jauza Akbar Krito | Jelena M. Marković | Johanna Monti | Josue Alejandro Sauca | Kaja Dobrovoljc Zor | Kingsley O. Ugwuanyi | Laura Rituma | Lilja Øvrelid | Maha Tufail Agro | Manzura Abjalova | Maria Chatzigrigoriou | María del Mar Sánchez Ramos | Marija Pendevska | Masoumeh Seyyedrezaei | Mehrnoush Shamsfard | Momina Ahsan | Muhammad Ahsan Riaz Khan | Nathalie Carmen Hau Norman | Nilay Erdem Ayyıldız | Nina Hosseini-Kivanani | Noémi Ligeti-Nagy | Numaan Naeem | Olha Kanishcheva | Olha Yatsyshyna | Daniil Orel | Petra Giommarelli | Petya Osenova | Radovan Garabik | Regina E. Semou | Rozane Rebechi | Salsabila Zahirah Pranida | Samia Touileb | Sanni Nimb | Sarfraz Ahmad | Sarvinoz Sharipova | Shahar Golan | Shaoxiong Ji | Sopuruchi Christian Aboh | Srdjan Sucur | Stella Markantonatou | Sussi Olsen | Vahide Tajalli | Veronika Lipp | Voula Giouli | Yelda Yeşildal Eraydın | Zahra Saaberi | Zhuohan Xie
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Potentially idiomatic expressions (PIEs) carry meanings inherently tied to the everyday experience of a given language community. As such, they constitute an interesting challenge for assessing the linguistic (and to some extent cultural) capabilities of NLP systems. In this paper, we present XMPIE, a parallel multilingual and multimodal dataset of potentially idiomatic expressions. The dataset, containing 34 languages and over ten thousand items, allows comparative analyses of idiomatic patterns among language-specific realisations and preferences in order to gather insights about shared cultural aspects. This parallel dataset allows evaluation of language model performance for a given PIE in different languages and of whether idiomatic understanding in one language can be transferred to another. Moreover, the dataset supports the study of PIEs across textual and visual modalities, to measure to what extent PIE understanding in one modality transfers to or implies understanding in another modality (text vs. image). The data was created by language experts, with both textual and visual components crafted under multilingual guidelines, and each PIE is accompanied by five images representing a spectrum from idiomatic to literal meanings, including semantically related and random distractors. The result is a high-quality benchmark for evaluating multilingual and multimodal idiomatic language understanding.
Survey of Tools for Manual Linguistic Annotation: Supporting Diversity through Interactive Exploration
Ludovica Pannitto | Kaja Dobrovoljc Zor | Bruno Guillaume
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Manual annotation tools are core infrastructure for corpus creation, enabling the development of linguistically informed language resources relevant for both linguistic discovery and computational applications. We present a comprehensive survey of 21 tools supporting morphosyntactic and multi-word expression annotation, systematically documenting more than 50 features relevant for annotation workflows—from software architecture and usability to linguistic coverage and annotation scope. The survey results are published as an open dataset and made accessible through an interactive online platform that allows users to filter and compare tools according to their specific needs. Our initial analysis highlights a robust and open ecosystem of annotation tools, but advanced needs for complex and language-independent annotation are inconsistently addressed.
2025
ComparaTree: A Multi-Level Comparative Treebank Analysis Tool
Luka Terčon | Kaja Dobrovoljc
Proceedings of the 23rd International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2025)
ComparaTree is a tool for comparative treebank analysis that combines various methods of quantitative linguistic analysis to provide a general overview of the differences and similarities between two treebanks. The comparison tool covers a range of subfields of linguistic analysis, providing a summary of the differences and similarities in terms of the lexical diversity, n-gram diversity, part-of-speech and dependency relation proportions, syntactic complexity, and syntactic diversity. We explain the various quantitative analyses performed on every level along with the generation of graphical visualizations, which add value by enabling user-friendly comparisons at a glance. We exemplify the comparison process by presenting the results produced by the tool when comparing two treebanks from the Universal Dependencies collection.
STARK: A Toolkit for Dependency (Sub)Tree Extraction and Analysis
Luka Krsnik | Kaja Dobrovoljc
Proceedings of the 23rd International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2025)
We present STARK, a lightweight and flexible Python toolkit for extracting and analyzing syntactic (sub)trees from dependency-parsed corpora. By systematically slicing each sentence into interpretable syntactic units based on configurable parameters, STARK enables bottom-up, data-driven exploration of syntactic patterns at multiple levels of abstraction—from fully lexicalized constructions to general structural templates. It supports any CoNLL-U-formatted corpus and is available as a command-line tool, Python library, and interactive online demo, ensuring seamless integration into both exploratory and large-scale corpus workflows. We illustrate its functionality through case studies in noun phrase analysis, multiword expression identification, and syntactic variation across corpora, demonstrating its utility for a wide range of corpus-driven syntactic investigations.
Word Order Variation in Spoken and Written Corpora: A Cross-Linguistic Study of SVO and Alternative Orders
Nives Hüll | Kaja Dobrovoljc
Proceedings of the Eighth International Conference on Dependency Linguistics (Depling, SyntaxFest 2025)
This study investigates word order variation in spoken and written corpora across five Indo-European languages: English, French, Norwegian (Nynorsk), Slovenian, and Spanish. Using Universal Dependencies treebanks, we analyze the distribution of six canonical word orders (SVO, SOV, VSO, VOS, OSV, OVS). Our results reveal that spoken language consistently exhibits greater word order flexibility than written language. This increased flexibility manifests as a decrease in the dominant SVO pattern and a rise in alternative orders, though the extent of this variation differs across languages. Morphologically rich languages such as Slovenian and Spanish show the most pronounced shifts, while English remains syntactically rigid across modalities. These findings support the claim that modality significantly affects syntactic realizations and highlight the need for typological studies to account for spoken data.
2024
UniDive: A COST Action on Universality, Diversity and Idiosyncrasy in Language Technology
Agata Savary | Daniel Zeman | Verginica Barbu Mititelu | Anabela Barreiro | Olesea Caftanatov | Marie-Catherine de Marneffe | Kaja Dobrovoljc | Gülşen Eryiğit | Voula Giouli | Bruno Guillaume | Stella Markantonatou | Nurit Melnik | Joakim Nivre | Atul Kr. Ojha | Carlos Ramisch | Abigail Walsh | Beata Wójtowicz | Alina Wróblewska
Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024
This paper presents the objectives, organization and activities of the UniDive COST Action, a scientific network dedicated to universality, diversity and idiosyncrasy in language technology. We describe the objectives and organization of this initiative, the people involved, the working groups and the ongoing tasks and activities. This paper is also an open call for participation addressed to new members and countries.
SUK 1.0: A New Training Corpus for Linguistic Annotation of Modern Standard Slovene
Špela Arhar Holdt | Jaka Čibej | Kaja Dobrovoljc | Tomaž Erjavec | Polona Gantar | Simon Krek | Tina Munda | Nejc Robida | Luka Terčon | Slavko Žitnik
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
This paper introduces the upgrade of a training corpus for linguistic annotation of modern standard Slovene. The enhancement spans both the size of the corpus and the depth of annotation layers. The revised SUK 1.0 corpus, building on its predecessor ssj500k 2.3, has doubled in size, containing over a million tokens. This expansion integrates three preexisting open-access datasets, all of which have undergone automatic tagging and meticulous manual review across multiple annotation layers, each represented in varying proportions. These layers span tokenization, segmentation, lemmatization, MULTEXT-East morphology, Universal Dependencies, JOS-SYN syntax, semantic role labeling, named entity recognition, and the newly incorporated coreferences. The paper illustrates the annotation processes for each layer while also presenting the results of the new CLASSLA-Stanza annotation tool, trained on the SUK corpus data. As one of the fundamental language resources of modern Slovene, the SUK corpus calls for constant development, as outlined in the concluding section.
Gos 2: A New Reference Corpus of Spoken Slovenian
Darinka Verdonik | Kaja Dobrovoljc | Tomaž Erjavec | Nikola Ljubešić
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
This paper introduces a new version of the Gos reference corpus of spoken Slovenian, which was recently extended to more than double the original size (300 hours, 2.4 million words) by adding speech recordings and transcriptions from two related initiatives, the Gos VideoLectures corpus of public academic speech, and the Artur speech recognition database. We describe this process by first presenting the criteria guiding the balanced selection of the newly added data and the challenges encountered when merging language resources with divergent designs, followed by the presentation of other major enhancements of the new Gos corpus, such as improvements in lemmatization and morphosyntactic annotation, word-level speech alignment, a new XML schema and the development of a specialized online concordancer.
2022
Extending the SSJ Universal Dependencies Treebank for Slovenian: Was It Worth It?
Kaja Dobrovoljc | Nikola Ljubešić
Proceedings of the 16th Linguistic Annotation Workshop (LAW-XVI) within LREC2022
This paper presents the creation and evaluation of a new version of the reference SSJ Universal Dependencies Treebank for Slovenian, which has been substantially improved and extended to almost double the original size. The process was based on the initial revision and documentation of the language-specific UD annotation guidelines for Slovenian and the corresponding modification of the original SSJ annotations, followed by a two-stage annotation campaign, in which two new subsets have been added: the previously unreleased sentences from the ssj500k corpus and the Slovenian subset of the ELEXIS parallel corpus. The annotation campaign resulted in an extended version of the SSJ UD treebank with 5,435 newly added sentences comprising 126,427 tokens. To evaluate the potential benefits of this data increase for Slovenian dependency parsing, we compared the performance of the classla-stanza dependency parser trained on the old and the new SSJ data when evaluated on the new SSJ test set and its subsets. Our results show an increase of LAS performance in general, especially for previously under-represented syntactic phenomena, such as lists, elliptical constructions and appositions, but also confirm the distinct nature of the two newly added subsets and the diversification of the SSJ treebank as a whole.
Spoken Language Treebanks in Universal Dependencies: an Overview
Kaja Dobrovoljc
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Given the benefits of syntactically annotated collections of transcribed speech in spoken language research and applications, many spoken language treebanks have been developed in the last decades, with divergent annotation schemes posing important limitations to cross-resource explorations, such as comparing data across languages, grammatical frameworks, and language domains. As a consequence, there has been a growing number of spoken language treebanks adopting the Universal Dependencies (UD) annotation scheme, aimed at cross-linguistically consistent morphosyntactic annotation. In view of the non-central role of spoken language data within the scheme and with little in-domain consolidation to date, this paper presents a comparative overview of spoken language treebanks in UD to support cross-treebank data explorations on the one hand, and encourage further treebank harmonization on the other. Our results show that the spoken language treebanks differ considerably with respect to the inventory and the format of transcribed phenomena, as well as the principles adopted in their morphosyntactic annotation. This is particularly true for the dependency annotation of speech disfluencies, where conflicting data annotations suggest an underspecification of the guidelines pertaining to speech repairs in general and the reparandum dependency relation in particular.
2020
Gigafida 2.0: The Reference Corpus of Written Standard Slovene
Simon Krek | Špela Arhar Holdt | Tomaž Erjavec | Jaka Čibej | Andraž Repar | Polona Gantar | Nikola Ljubešić | Iztok Kosem | Kaja Dobrovoljc
Proceedings of the Twelfth Language Resources and Evaluation Conference
We describe a new version of the Gigafida reference corpus of Slovene. In addition to updating the corpus with new material and annotating it with better tools, the focus of the upgrade was also on its transformation from a general reference corpus, which contains all language variants including non-standard language, to the corpus of standard (written) Slovene. This decision could be implemented as new corpora dedicated specifically to non-standard language emerged recently. In the new version, the whole Gigafida corpus was deduplicated for the first time, which facilitates automatic extraction of data for the purposes of compilation of new lexicographic resources such as the collocations dictionary and the thesaurus of Slovene.
2019
What does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian
Nikola Ljubešić | Kaja Dobrovoljc
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing
We present experiments on Slovenian, Croatian and Serbian morphosyntactic annotation and lemmatisation, comparing the former state-of-the-art for these three languages with one of the best performing systems at the CoNLL 2018 shared task, the Stanford NLP neural pipeline. Our experiments show significant improvements in morphosyntactic annotation, especially on categories where either semantic knowledge is needed, available through word embeddings, or where long-range dependencies have to be modelled. On the other hand, on the task of lemmatisation no improvements are obtained with the neural solution, mostly due to the heavy dependence of the task on the lookup in an external lexicon, but also due to obvious room for improvements in the Stanford NLP pipeline’s lemmatisation.
Annotating formulaic sequences in spoken Slovenian: structure, function and relevance
Kaja Dobrovoljc
Proceedings of the 13th Linguistic Annotation Workshop
This paper presents the identification of formulaic sequences in the reference corpus of spoken Slovenian and their annotation in terms of syntactic structure, pragmatic function and lexicographic relevance. The annotation campaign, specific in terms of setting, subjectivity and the multifunctionality of items under investigation, resulted in a preliminary lexicon of formulaic sequences in spoken Slovenian with immediate potential for future explorations in formulaic language research. This is especially relevant for the notable number of identified multi-word expressions with discourse-structuring and stance-marking functions, which have often been overlooked by traditional phraseology research.
Improving UD processing via satellite resources for morphology
Kaja Dobrovoljc | Tomaž Erjavec | Nikola Ljubešić
Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019)
2018
Er ... well, it matters, right? On the role of data representations in spoken language dependency parsing
Kaja Dobrovoljc | Matej Martinc
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)
Despite the significant improvement of data-driven dependency parsing systems in recent years, they still achieve a considerably lower performance in parsing spoken language data in comparison to written data. Using the example of the Spoken Slovenian Treebank, the first spoken data treebank using the UD annotation scheme, we investigate which speech-specific phenomena undermine parsing performance, through a series of training data and treebank modification experiments using two distinct state-of-the-art parsing systems. Our results show that utterance segmentation is the most prominent cause of low parsing performance, both in parsing raw and pre-segmented transcriptions. In addition to shorter utterances, both parsers perform better on normalized transcriptions including basic markers of prosody and excluding disfluencies, discourse markers and fillers. On the other hand, the effects of written training data addition and speech-specific dependency representations largely depend on the parsing system selected.
2017
The Universal Dependencies Treebank for Slovenian
Kaja Dobrovoljc | Tomaž Erjavec | Simon Krek
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing
This paper introduces the Universal Dependencies Treebank for Slovenian. We overview the existing dependency treebanks for Slovenian and then detail the conversion of the ssj200k treebank to the framework of Universal Dependencies version 2. We explain the mapping of part-of-speech categories, morphosyntactic features, and the dependency relations, focusing on the more problematic language-specific issues. We conclude with a quantitative overview of the treebank and directions for further work.
2016
The Universal Dependencies Treebank of Spoken Slovenian
Kaja Dobrovoljc | Joakim Nivre
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
This paper presents the construction of an open-source dependency treebank of spoken Slovenian, the first syntactically annotated collection of spontaneous speech in Slovenian. The treebank has been manually annotated using the Universal Dependencies annotation scheme, a one-layer syntactic annotation scheme with a high degree of cross-modality, cross-framework and cross-language interoperability. In this original application of the scheme to spoken language transcripts, we address a wide spectrum of syntactic particularities in speech, either by extending the scope of application of existing universal labels or by proposing new speech-specific extensions. The initial analysis of the resulting treebank and its comparison with the written Slovenian UD treebank confirms significant syntactic differences between the two language modalities, with spoken data consisting of shorter and more elliptic sentences, fewer and simpler nominal phrases, and more relations marking disfluencies, interaction, deixis and modality.
2014
Co-authors
- Nikola Ljubešić 6
- Tomaž Erjavec 5
- Simon Krek 4
- Voula Giouli 3
- Jaka Čibej 3
- Gülşen Eryiğit 2
- Polona Gantar 2
- Bruno Guillaume 2
- Špela Arhar Holdt 2
- Stella Markantonatou 2
- Joakim Nivre 2
- Luka Terčon 2
- Darinka Verdonik 2
- Beata Wójtowicz 2
- Manzura Abjalova 1
- Sopuruchi Christian Aboh 1
- Ágnes Abuczki 1
- Željko Agić 1
- Maha Tufail Agro 1
- Sarfraz Ahmad 1
- Momina Ahsan 1
- Dina Almassova 1
- Diego Alves 1
- Doğukan Arslan 1
- Verginica Barbu Mititelu 1
- Anabela Barreiro 1
- Terra Blevins 1
- Olesea Caftanatov 1
- Aida Cardoso 1
- Maria Chatzigrigoriou 1
- Nilay Erdem Ayyıldız 1
- Doruk Eryiğit 1
- Louis Estève 1
- Radovan Garabik 1
- Xenophon Gialis 1
- Petra Giommarelli 1
- Shahar Golan 1
- Hila Gonen 1
- Isabell Stinessen Haugen 1
- Wei He 1
- Carlos Manuel Hidalgo-Ternero 1
- Nina Hosseini-Kivanani 1
- Nives Hüll 1
- Eugene Jang 1
- Shaoxiong Ji 1
- Danka Jokić 1
- Anna Kanellopoulou 1
- Olha Kanishcheva 1
- Jun Kevin 1
- Muhammad Ahsan Riaz Khan 1
- Eungseo Kim 1
- Iztok Kosem 1
- Jauza Akbar Krito 1
- Luka Krsnik 1
- Alesia Lazarenka 1
- Chaya Liebeskind 1
- Noémi Ligeti-Nagy 1
- Veronika Lipp 1
- Irina Lobzhanidze 1
- Jelena M. Marković 1
- Matej Martinc 1
- Stephen Mayhew 1
- Nurit Melnik 1
- Amália Mendes 1
- Danijela Merkler 1
- Shachar Mirkin 1
- Johanna Monti 1
- Sara Može 1
- Tina Munda 1
- Numaan Naeem 1
- Sanni Nimb 1
- Nathalie Carmen Hau Norman 1
- Atul Kr. Ojha 1
- Sussi Olsen 1
- Daniil Orel 1
- Petya Osenova 1
- Adriana Silvina Pagano 1
- Vasile Pais 1
- Ludovica Pannitto 1
- Bolette Sandford Pedersen 1
- Marija Pendevska 1
- Fred Philippy 1
- Thomas Pickard 1
- Yuval Pinter 1
- Salsabila Zahirah Pranida 1
- Carlos Ramisch 1
- María Del Mar Sánchez Ramos 1
- Rozane Rebechi 1
- Andraž Repar 1
- Laura Rituma 1
- Ieva Rizgeliene 1
- Nejc Robida 1
- Antoni Brosa Rodríguez 1
- Peter Rupnik 1
- Zahra Saaberi 1
- Josue Alejandro Sauca 1
- Agata Savary 1
- Regina E. Semou 1
- Jeongyeon Seo 1
- Masoumeh Seyyedrezaei 1
- Mehrnoush Shamsfard 1
- Sarvinoz Sharipova 1
- Inguna Skadina 1
- Srdjan Sucur 1
- Marek Suppa 1
- Vahide Tajalli 1
- Jörg Tiedemann 1
- Dilara Torunoğlu-Selamet 1
- Samia Touileb 1
- Eleni Triantafyllidi 1
- Kingsley O. Ugwuanyi 1
- Baiba Valkovska 1
- Giedre Valunaite Oleskeviciene 1
- Erik Velldal 1
- Aline Villavicencio 1
- Abigail Walsh 1
- Rodrigo Wilkens 1
- Alina Wróblewska 1
- Zhuohan Xie 1
- Olha Yatsyshyna 1
- Yelda Yeşildal Eraydın 1
- Daniel Zeman 1
- Marie-Catherine de Marneffe 1
- Lilja Øvrelid 1
- Slavko Žitnik 1