2022
Introducing the CURLICAT Corpora: Seven-language Domain Specific Annotated Corpora from Curated Sources
Tamás Váradi | Bence Nyéki | Svetla Koeva | Marko Tadić | Vanja Štefanec | Maciej Ogrodniczuk | Bartłomiej Nitoń | Piotr Pęzik | Verginica Barbu Mititelu | Elena Irimia | Maria Mitrofan | Dan Tufiș | Radovan Garabík | Simon Krek | Andraž Repar
Proceedings of the Thirteenth Language Resources and Evaluation Conference
This article presents the current outcomes of the CURLICAT CEF Telecom project, which aims to collect and deeply annotate a set of large corpora from selected domains. The CURLICAT corpus includes 7 monolingual corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing selected samples from the respective national corpora. These corpora are automatically tokenized, lemmatized, morphologically analysed and annotated for named entities. The annotations are provided uniformly for each language-specific corpus, and the common metadata schema is harmonised across the languages. Additionally, the corpora are annotated for IATE terms in all languages. The corpora are distributed in the CoNLL-U Plus format, which contains the ten columns of the CoNLL-U format and three extra columns specific to these corpora, as defined by Váradi et al. (2020). The CURLICAT corpora represent a rich and valuable source not just for training NMT models, but also for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.
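As an illustration of the CoNLL-U Plus layout mentioned in the abstract above, the following is a minimal sketch of a reader that takes its column names from the `# global.columns` header line, which the CoNLL-U Plus convention uses to declare columns. The three extra column names (`X:NE`, `X:NP`, `X:TERM`) and the sample sentence are hypothetical placeholders, not the actual CURLICAT column names or data; the function name `read_conllu_plus` is likewise an assumption made for this example.

```python
from typing import Dict, Iterable, Iterator, List

# The ten standard CoNLL-U columns, used as a fallback when no header is given.
CONLLU_COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                  "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def read_conllu_plus(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of {column name: value} dictionaries."""
    columns = list(CONLLU_COLUMNS)
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns ="):
            # Header line declaring the column layout of the file.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            # Other comment lines (sentence metadata) are skipped in this sketch.
            continue
        elif not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        else:
            # Token line: tab-separated values, one per declared column.
            sentence.append(dict(zip(columns, line.split("\t"))))
    if sentence:
        yield sentence


# Hypothetical sample: ten CoNLL-U columns plus three invented extra columns.
SAMPLE = (
    "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC "
    "X:NE X:NP X:TERM\n"
    "# text = Zagreb\n"
    "1\tZagreb\tZagreb\tPROPN\t_\t_\t0\troot\t_\t_\tB-LOC\t_\t_\n"
    "\n"
)

sentences = list(read_conllu_plus(SAMPLE.splitlines()))
```

Reading the column names from the header, rather than hard-coding thirteen positions, keeps the sketch usable for any CoNLL-U Plus file regardless of which extra columns it declares.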
2019
Redesign of the Croatian derivational lexicon
Matea Filko | Krešimir Šojat | Vanja Štefanec
Proceedings of the Second International Workshop on Resources and Tools for Derivational Morphology
2016
Croatian Error-Annotated Corpus of Non-Professional Written Language
Vanja Štefanec | Nikola Ljubešić | Jelena Kuvač Kraljević
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
In this paper, the authors present the Croatian corpus of non-professional written language. The corpus consists of two subcorpora: the clinical subcorpus, made up of written texts produced by speakers with various types of language disorders, and the healthy speakers subcorpus. This composition, together with the levels of annotation, offers an opportunity for different lines of research. The authors present the corpus structure, describe the sampling methodology, explain the levels of annotation, and give some very basic statistics. On the basis of data from the corpus, existing language technologies for Croatian are adapted for use in a platform that facilitates text production for speakers with language disorders. In this respect, several analyses of the corpus data and a basic evaluation of the developed technologies are presented.