Jaka Čibej
2026
ROG: A Multi-Layer Manually Annotated Corpus of Spoken Slovenian
Kaja Dobrovoljc Zor | Darinka Verdonik | Jaka Čibej | Peter Rupnik | Nikola Ljubešić
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We present ROG, the first manually annotated spoken corpus of Slovenian to integrate morphosyntactic, prosodic, and interactional layers in a unified framework. Building on the pre-existing Spoken Slovenian Treebank (SST) and newly available recordings from the GOS 2 reference corpus, the resource combines over 75,000 words (10 hours) of annotated speech. The entire corpus features lemmatization, MULTEXT-East morphosyntax, and Universal Dependencies annotations, while approximately half includes additional layers for prosodic units, disfluencies, and dialogue acts. All annotation layers are systematically aligned and cross-referenced, enabling detailed multi-dimensional analyses of spoken language. We describe the corpus design, annotation workflow, data release, and baseline modeling results, showcasing the resource’s value for both linguistic analysis and speech-aware NLP model development. All ROG transcriptions and annotations, along with half of the audio recordings, are freely available under CC-BY via an (anonymized) repository.
PARSEME 2.0 Multilingual Corpus of Multiword Expressions
Agata Savary | Manon Scholivet | Carlos Ramisch | Takuya Nakamura | Eric Bilinski | Sara Stymne | Voula Giouli | Stella Markantonatou | Vasile Pais | Maria Mitrofan | Louis Estève | Bruno Guillaume | Verginica Barbu Mititelu | Jaka Čibej | Roberto Díaz Hernández | Victoria Fendel | Polona Gantar | Olha Kanishcheva | Cvetana Krstev | Chaya Liebeskind | Irina Lobzhanidze | Aleksandra M. Marković | Gunta Nešpore-Bērzkalne | Adriana S. Pagano | Mehrnoush Shamsfard | Ranka Stankovic | Vahide Tajalli | Carole Tiberius | Aakanksha Padhye
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We present edition 2.0 of the PARSEME multilingual corpus annotated for multiword expressions (MWEs), resulting from efforts of the PARSEME community towards universality-driven modeling of idiomaticity. With respect to previous editions, we extend the annotation scope to all syntactic MWE categories: verbal, nominal, adjectival, adverbial and functional. We cover 17 languages, of which 7 are new. The annotation process is based on cross-lingually unified guidelines, phrased as decision diagrams over linguistic tests, and a typology of 18 MWE categories. The corpus contains almost 5 million tokens, over 250,000 sentences and 140,000 MWE annotations. The applicability of the corpus is tested in baseline experiments with a prompt-based MWE identification system. Results show that generic large language models do not encode sufficient knowledge to solve the MWE identification task.
2025
A Computational Method for Analyzing Syntactic Profiles: The Case of the ELEXIS-WSD Parallel Sense-Annotated Corpus
Jaka Čibej
Proceedings of the Third Workshop on Quantitative Syntax (QUASY, SyntaxFest 2025)
In the paper, we present an approach to comparing corpora annotated with dependency relations. The method relies on the compilation of syntactic profiles – numeric vectors representing the relative frequencies of different syntactic (sub)trees extracted automatically with the STARK 3.0 open-access dependency tree extraction tool. We perform the extraction on the ELEXIS-WSD Parallel Sense-Annotated Corpus, which has recently been published as version 1.2 with UD dependency relation annotations for 10 European languages. The corpus provides an additional resource for contrastive studies in quantitative syntax. In addition to presenting the corpus and conducting some proof-of-concept analyses, we discuss several other potential uses and improvements to the proposed approach.
2024
SI-NLI: A Slovene Natural Language Inference Dataset and Its Evaluation
Matej Klemen | Aleš Žagar | Jaka Čibej | Marko Robnik-Šikonja
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Natural language inference (NLI) is an important language understanding benchmark. Two deficiencies of this benchmark are: i) most existing NLI datasets exist for English and a few other well-resourced languages, and ii) most NLI datasets are formed with a narrow set of annotators’ instructions, allowing the prediction models to capture linguistic clues instead of measuring true reasoning capability. We address both issues and introduce SI-NLI, the first dataset for Slovene natural language inference. The dataset is constructed from scratch using knowledgeable annotators with carefully crafted guidelines aiming to avoid commonly encountered problems in existing NLI datasets. We also manually translate the SI-NLI to English to enable cross-lingual model training and evaluation. Using the newly created dataset and its translation, we train and evaluate a variety of large transformer language models in a monolingual and cross-lingual setting. The results indicate that larger models, in general, achieve better performance. The qualitative analysis shows that the SI-NLI dataset is diverse and that there remains plenty of room for improvement even for the largest models.
Annotation of Multiword Expressions in the SUK 1.0 Training Corpus of Slovene: Lessons Learned and Future Steps
Jaka Čibej | Polona Gantar | Mija Bon
Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024
Recent progress within the UniDive COST Action on the compilation of universal guidelines for the annotation of non-verbal multiword expressions (MWEs) has provided an opportunity to improve and expand the work previously done within the PARSEME COST Action on the annotation of verbal multiword expressions in the SUK 1.0 Training Corpus of Slovene. A segment of the training corpus had already been annotated with verbal MWEs during PARSEME. As a follow-up and part of the New Grammar of Modern Standard Slovene (NSSSS) project, the same segment was annotated with non-verbal MWEs, resulting in approximately 6,500 sentences annotated by at least three annotators (described in Gantar et al., 2019). Since then, the entire SUK 1.0 was also manually annotated with UD part-of-speech tags. In the paper, we present an analysis of the MWE annotations exported from the corpus along with their part-of-speech structures through the lens of Universal Dependencies. We discuss the usefulness of the data in terms of potential insight for the further compilation and fine-tuning of guidelines, particularly for non-verbal MWEs, and conclude with our plans for future work.
SUK 1.0: A New Training Corpus for Linguistic Annotation of Modern Standard Slovene
Špela Arhar Holdt | Jaka Čibej | Kaja Dobrovoljc | Tomaž Erjavec | Polona Gantar | Simon Krek | Tina Munda | Nejc Robida | Luka Terčon | Slavko Zitnik
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
This paper introduces the upgrade of a training corpus for linguistic annotation of modern standard Slovene. The enhancement spans both the size of the corpus and the depth of annotation layers. The revised SUK 1.0 corpus, building on its predecessor ssj500k 2.3, has doubled in size, containing over a million tokens. This expansion integrates three preexisting open-access datasets, all of which have undergone automatic tagging and meticulous manual review across multiple annotation layers, each represented in varying proportions. These layers span tokenization, segmentation, lemmatization, MULTEXT-East morphology, Universal Dependencies, JOS-SYN syntax, semantic role labeling, named entity recognition, and the newly incorporated coreferences. The paper illustrates the annotation processes for each layer while also presenting the results of the new CLASSLA-Stanza annotation tool, trained on the SUK corpus data. As one of the fundamental language resources of modern Slovene, the SUK corpus calls for constant development, as outlined in the concluding section.
DIALECT-COPA: Extending the Standard Translations of the COPA Causal Commonsense Reasoning Dataset to South Slavic Dialects
Nikola Ljubešić | Nada Galant | Sonja Benčina | Jaka Čibej | Stefan Milosavljević | Peter Rupnik | Taja Kuzman
Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024)
The paper presents new causal commonsense reasoning datasets for South Slavic dialects, based on the Choice of Plausible Alternatives (COPA) dataset. The dialectal datasets are built by native dialect speakers, who translate from the English original and the corresponding standard translation. Three dialects are covered: the Cerkno dialect of Slovenian, the Chakavian dialect of Croatian and the Torlak dialect of Serbian. The datasets are the first resource for evaluation of large language models on South Slavic dialects, as well as among the first commonsense reasoning datasets on dialects overall. The paper describes specific challenges met during the translation process. A comparison of the dialectal datasets with their standard language counterparts shows a varying level of character-level, word-level and lexicon-level deviation of dialectal text from the standard datasets. The observed differences are well reproduced in initial zero-shot and 10-shot experiments, where the Slovenian Cerkno dialect and the Croatian Chakavian dialect show significantly lower results than the Torlak dialect. These results also show that the dialectal datasets are significantly more challenging than the standard datasets. Finally, in-context learning on just 10 examples improves the results dramatically, especially for the dialects with the lowest results.
2023
XL-WA: a Gold Evaluation Benchmark for Word Alignment in 14 Language Pairs
Federico Martelli | Andrei Stefan Bejgu | Cesare Campagnano | Jaka Čibej | Rute Costa | Apolonija Gantar | Jelena Kallas | Svetla Peneva Koeva | Kristina Koppel | Simon Krek | Margit Langemets | Veronika Lipp | Sanni Nimb | Sussi Olsen | Bolette Sanford Pedersen | Valeria Quochi | Ana Salgado | László Simon | Carole Tiberius | Rafael-J Ureña-Ruiz | Roberto Navigli
Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023)
2020
Creating Expert Knowledge by Relying on Language Learners: a Generic Approach for Mass-Producing Language Resources by Combining Implicit Crowdsourcing and Language Learning
Lionel Nicolas | Verena Lyding | Claudia Borg | Corina Forascu | Karën Fort | Katerina Zdravkova | Iztok Kosem | Jaka Čibej | Špela Arhar Holdt | Alice Millour | Alexander König | Christos Rodosthenous | Federico Sangati | Umair ul Hassan | Anisia Katinskaia | Anabela Barreiro | Lavinia Aparaschivei | Yaakov HaCohen-Kerner
Proceedings of the Twelfth Language Resources and Evaluation Conference
We introduce in this paper a generic approach to combine implicit crowdsourcing and language learning in order to mass-produce language resources (LRs) for any language for which a crowd of language learners can be involved. We present the approach by explaining its core paradigm that consists in pairing specific types of LRs with specific exercises, by detailing both its strengths and challenges, and by discussing how much these challenges have been addressed at present. Accordingly, we also report on on-going proof-of-concept efforts aiming at developing the first prototypical implementation of the approach in order to correct and extend an LR called ConceptNet based on the input crowdsourced from language learners. We then present an international network called the European Network for Combining Language Learning with Crowdsourcing Techniques (enetCollect) that provides the context to accelerate the implementation of this generic approach. Finally, we exemplify how it can be used in several language learning scenarios to produce a multitude of NLP resources and how it can therefore alleviate the long-standing NLP issue of the lack of LRs.
Gigafida 2.0: The Reference Corpus of Written Standard Slovene
Simon Krek | Špela Arhar Holdt | Tomaž Erjavec | Jaka Čibej | Andraz Repar | Polona Gantar | Nikola Ljubešić | Iztok Kosem | Kaja Dobrovoljc
Proceedings of the Twelfth Language Resources and Evaluation Conference
We describe a new version of the Gigafida reference corpus of Slovene. In addition to updating the corpus with new material and annotating it with better tools, the focus of the upgrade was also on its transformation from a general reference corpus, which contains all language variants including non-standard language, to the corpus of standard (written) Slovene. This decision could be implemented as new corpora dedicated specifically to non-standard language emerged recently. In the new version, the whole Gigafida corpus was deduplicated for the first time, which facilitates automatic extraction of data for the purposes of compilation of new lexicographic resources such as the collocations dictionary and the thesaurus of Slovene.
2015
Co-authors
- Polona Gantar 4
- Nikola Ljubešić 4
- Kaja Dobrovoljc 3
- Tomaž Erjavec 3
- Špela Arhar Holdt 3
- Simon Krek 3
- Iztok Kosem 2
- Peter Rupnik 2
- Carole Tiberius 2
- Lavinia Aparaschivei 1
- Verginica Barbu Mititelu 1
- Anabela Barreiro 1
- Andrei Stefan Bejgu 1
- Sonja Benčina 1
- Eric Bilinski 1
- Mija Bon 1
- Claudia Borg 1
- Cesare Campagnano 1
- Rute Costa 1
- Roberto Díaz Hernández 1
- Louis Estève 1
- Victoria Fendel 1
- Darja Fišer 1
- Karën Fort 1
- Corina Forăscu 1
- Nada Galant 1
- Apolonija Gantar 1
- Voula Giouli 1
- Bruno Guillaume 1
- Yaakov HaCohen-Kerner 1
- Jelena Kallas 1
- Olha Kanishcheva 1
- Anisia Katinskaia 1
- Matej Klemen 1
- Svetla Peneva Koeva 1
- Kristina Koppel 1
- Cvetana Krstev 1
- Taja Kuzman 1
- Alexander König 1
- Margit Langemets 1
- Chaya Liebeskind 1
- Veronika Lipp 1
- Irina Lobzhanidze 1
- Verena Lyding 1
- Stella Markantonatou 1
- Dafne Marko 1
- Aleksandra M. Marković 1
- Federico Martelli 1
- Alice Millour 1
- Stefan Milosavljević 1
- Maria Mitrofan 1
- Tina Munda 1
- Takuya Nakamura 1
- Roberto Navigli 1
- Gunta Nešpore-Bērzkalne 1
- Lionel Nicolas 1
- Sanni Nimb 1
- Sussi Olsen 1
- Aakanksha Padhye 1
- Adriana Silvina Pagano 1
- Vasile Pais 1
- Senja Pollak 1
- Valeria Quochi 1
- Carlos Ramisch 1
- Andraž Repar 1
- Nejc Robida 1
- Marko Robnik-Šikonja 1
- Christos Rodosthenous 1
- Ana Salgado 1
- Bolette Sanford Pedersen 1
- Federico Sangati 1
- Agata Savary 1
- Manon Scholivet 1
- Mehrnoush Shamsfard 1
- László Simon 1
- Ranka Stankovic 1
- Sara Stymne 1
- Vahide Tajalli 1
- Luka Terčon 1
- Rafael-J. Ureña-Ruiz 1
- Darinka Verdonik 1
- Katerina Zdravkova 1
- Umair ul Hassan 1
- Iza Škrjanec 1
- Aleš Žagar 1
- Slavko Žitnik 1