Martin Tamajka
2025
skLEP: A Slovak General Language Understanding Benchmark
Marek Suppa | Andrej Ridzik | Daniel Hládek | Tomáš Javůrek | Viktória Ondrejová | Kristína Sásiková | Martin Tamajka | Marian Simko
Findings of the Association for Computational Linguistics: ACL 2025
In this work, we introduce skLEP, the first comprehensive benchmark specifically designed for evaluating Slovak natural language understanding (NLU) models. skLEP encompasses nine diverse tasks that span token-level, sentence-pair, and document-level challenges, thereby offering a thorough assessment of model capabilities. To create this benchmark, we curated new, original datasets tailored for Slovak and meticulously translated established English NLU resources. We also present the first systematic and extensive evaluation of a wide array of Slovak-specific, multilingual, and English pre-trained language models on the skLEP tasks. Finally, we release the complete benchmark data, an open-source toolkit facilitating both fine-tuning and evaluation of models, and a public leaderboard at https://github.com/slovak-nlp/sklep in the hope of fostering reproducibility and driving future research in Slovak NLU.
2022
SlovakBERT: Slovak Masked Language Model
Matúš Pikuliak | Štefan Grivalský | Martin Konôpka | Miroslav Blšták | Martin Tamajka | Viktor Bachratý | Marian Simko | Pavol Balážik | Michal Trnka | Filip Uhlárik
Findings of the Association for Computational Linguistics: EMNLP 2022
We introduce a new Slovak masked language model called SlovakBERT. To the best of our knowledge, this is the first paper discussing Slovak transformer-based language models. We evaluate our model on several NLP tasks and achieve state-of-the-art results. This evaluation is also the first attempt to establish a benchmark for Slovak language models. We publish the masked language model, as well as the fine-tuned models for part-of-speech tagging, sentiment analysis, and semantic textual similarity.