Luka Terčon
2025
ComparaTree: A Multi-Level Comparative Treebank Analysis Tool
Luka Terčon
|
Kaja Dobrovoljc
Proceedings of the 23rd International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2025)
ComparaTree is a tool for comparative treebank analysis that combines various methods of quantitative linguistic analysis to provide a general overview of the differences and similarities between two treebanks. The comparison tool covers a range of subfields of linguistic analysis, providing a summary of the differences and similarities in terms of the lexical diversity, n-gram diversity, part-of-speech and dependency relation proportions, syntactic complexity, and syntactic diversity. We explain the various quantitative analyses performed on every level along with the generation of graphical visualizations, which add value by enabling user-friendly comparisons at a glance. We exemplify the comparison process by presenting the results produced by the tool when comparing two treebanks from the Universal Dependencies collection.
2024
SUK 1.0: A New Training Corpus for Linguistic Annotation of Modern Standard Slovene
Špela Arhar Holdt
|
Jaka Čibej
|
Kaja Dobrovoljc
|
Tomaž Erjavec
|
Polona Gantar
|
Simon Krek
|
Tina Munda
|
Nejc Robida
|
Luka Terčon
|
Slavko Zitnik
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
This paper introduces the upgrade of a training corpus for linguistic annotation of modern standard Slovene. The enhancement spans both the size of the corpus and the depth of annotation layers. The revised SUK 1.0 corpus, building on its predecessor ssj500k 2.3, has doubled in size, containing over a million tokens. This expansion integrates three preexisting open-access datasets, all of which have undergone automatic tagging and meticulous manual review across multiple annotation layers, each represented in varying proportions. These layers span tokenization, segmentation, lemmatization, MULTEXT-East morphology, Universal Dependencies, JOS-SYN syntax, semantic role labeling, named entity recognition, and the newly incorporated coreferences. The paper illustrates the annotation processes for each layer while also presenting the results of the new CLASSLA-Stanza annotation tool, trained on the SUK corpus data. As one of the fundamental language resources of modern Slovene, the SUK corpus calls for constant development, as outlined in the concluding section.
Search
Fix author
Co-authors
- Kaja Dobrovoljc 2
- Tomaž Erjavec 1
- Polona Gantar 1
- Špela Arhar Holdt 1
- Simon Krek 1
- show all...