Ján Staš

Also published as: Jan Stas, Jan Staš

This work proposes an information retrieval evaluation set for the Slovak language. It provides a set of 80 natural-language queries together with the set of documents relevant to each. The document set contains 3980 newspaper articles sorted into 6 categories. Each document in the result set is manually annotated for relevance to its corresponding query. The evaluation set is mostly compatible with the Cranfield test collection, using the same methodology for queries and relevance annotation. In addition, it provides annotations for document title, author, publication date, and category, which can be used to evaluate automatic document clustering and categorization.

An Extension of the Slovak Broadcast News Corpus based on Semi-Automatic Annotation
Peter Viszlay | Ján Staš | Tomáš Koctúr | Martin Lojka | Jozef Juhár
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper, we introduce an extension of our previously released TUKE-BNews-SK corpus based on a semi-automatic annotation scheme. The scheme first relies on automatic transcription of the broadcast news (BN) data by our Slovak large vocabulary continuous speech recognition system. The generated hypotheses are then manually corrected and completed by trained human annotators. The corpus is composed of 25 hours of fully annotated spontaneous and prepared speech. In addition, we have acquired a further 900 hours of BN data, part of which we plan to annotate semi-automatically. We present a preliminary corpus evaluation that gives very promising results.

2014

The Slovak Categorized News Corpus
Daniel Hladek | Jan Stas | Jozef Juhar
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The presented corpus is intended as a first attempt to create a representative sample of the contemporary Slovak language from various domains that supports easy searching and automated processing. This first version of the corpus contains words, automatic morphological and named entity annotations, and transcriptions of abbreviations and numerals. An integral part of the paper is a word boundary and sentence boundary detection algorithm that utilizes characteristic features of the language.

Venues

LREC (3)
ROCLING (3)
