Jozef Kubík

Also published as: Jozef Kubik


2025

Enhancing BERT Fine-Tuning for Sentiment Analysis in Lower-Resourced Languages
Jozef Kubík | Marek Suppa | Martin Takac
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Limited data for low-resource languages typically yields weaker language models (LMs). Since pre-training is compute-intensive, it is more pragmatic to target improvements during fine-tuning. In this work, we examine the use of Active Learning (AL) methods augmented by structured data selection strategies across epochs, which we term ‘Active Learning schedulers’, to boost the fine-tuning process with a limited amount of training data. We connect the AL process to data clustering and propose an integrated fine-tuning pipeline that systematically combines AL, data clustering, and dynamic data selection schedulers to enhance model performance. Several experiments on the Slovak, Maltese, Icelandic, and Turkish languages show that using clustering during the fine-tuning phase together with the novel AL scheduling can simultaneously yield annotation savings of up to 30% and performance improvements of up to four F1 points, while also providing better fine-tuning stability.
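
As a rough illustration of the approach, the Python sketch below combines least-confidence Active Learning with k-means clusters and a fixed per-round selection schedule. The synthetic embeddings, the logistic-regression stand-in for a fine-tuned BERT classifier, and the batch-size schedule are all assumptions made for the sake of a runnable example, not the paper's actual pipeline.

```python
# A minimal, self-contained sketch: least-confidence AL + cluster-balanced
# batches + a fixed per-round "schedule". Synthetic data and a logistic
# regression stand in for sentence embeddings and a BERT classifier.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(500, 32))        # stand-in for sentence embeddings
y_pool = (X_pool[:, 0] > 0).astype(int)    # simulated annotator (oracle) labels

n_clusters = 5
clusters = KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=0).fit_predict(X_pool)

labeled = list(rng.choice(len(X_pool), size=20, replace=False))  # seed set
schedule = [20, 40, 60]   # hypothetical scheduler: acquisition size per round

model = LogisticRegression(max_iter=1000)
for batch_size in schedule:
    model.fit(X_pool[labeled], y_pool[labeled])
    uncertainty = 1.0 - model.predict_proba(X_pool).max(axis=1)  # least confidence
    uncertainty[labeled] = -np.inf         # never re-select labeled points
    per_cluster = max(1, batch_size // n_clusters)
    for c in range(n_clusters):            # balance the batch across clusters
        members = np.where(clusters == c)[0]
        ranked = members[np.argsort(-uncertainty[members])]
        labeled.extend(int(i) for i in ranked[:per_cluster]
                       if np.isfinite(uncertainty[i]))

model.fit(X_pool[labeled], y_pool[labeled])
print(f"annotated {len(labeled)} of {len(X_pool)} pool examples")
```

Spreading each acquisition batch evenly across clusters is one simple way to keep the selected examples diverse; the paper's schedulers instead vary the selection strategy across epochs rather than fixing a single rule.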

2024

ChatGPT as Your n-th Annotator: Experiments in Leveraging Large Language Models for Social Science Text Annotation in Slovak Language
Endre Hamerlik | Marek Šuppa | Miroslav Blšták | Jozef Kubík | Martin Takáč | Marián Šimko | Andrej Findor
Proceedings of the 4th Workshop on Computational Linguistics for the Political and Social Sciences: Long and short papers

Large Language Models (LLMs) are increasingly influential in Computational Social Science, offering new methods for processing and analyzing data, particularly in lower-resourced language contexts. This study explores the use of OpenAI’s GPT-3.5 Turbo and GPT-4 for automating annotations of a unique news media dataset in a lower-resourced language, focusing on stance classification tasks. Our results reveal that prompting in the native language, explanation generation, and advanced prompting strategies such as Retrieval-Augmented Generation and Chain-of-Thought prompting enhance LLM performance, with GPT-4 showing particular strength in predicting stance. Further evaluation indicates that LLMs can serve as a useful tool for social science text annotation in lower-resourced languages, notably in identifying inconsistencies in annotation guidelines and annotated datasets.
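
To make the annotation setup concrete, here is a hedged Python sketch of stance labeling with the OpenAI chat API. The Slovak prompt wording, the label set, and the example sentence are illustrative assumptions, not the prompts or schema used in the study.

```python
# A hedged sketch of LLM stance annotation with the OpenAI Python client.
# The Slovak prompt wording and label set are illustrative assumptions,
# not the study's actual prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def annotate_stance(text: str, target: str) -> str:
    """Ask for a short rationale (chain of thought), then a one-word label."""
    prompt = (
        # Slovak: "Determine the stance of the following text toward '{target}'."
        f"Urči postoj nasledujúceho textu k téme '{target}'.\n"
        f"Text: {text}\n"
        # Slovak: "First briefly explain your reasoning, then on the last
        # line write exactly one label: AGREE, DISAGREE or NEUTRAL."
        "Najprv stručne vysvetli svoje uvažovanie a potom na poslednom "
        "riadku napíš presne jeden štítok: SÚHLAS, NESÚHLAS alebo NEUTRÁLNY."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().splitlines()[-1]

# Usage (hypothetical example sentence):
# label = annotate_stance("Očkovanie výrazne znížilo úmrtnosť.", "očkovanie")
```

Requesting the rationale first and the label on the final line keeps parsing trivial, and temperature 0 makes the labels easier to reproduce across runs.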

2023

WikiGoldSK: Annotated Dataset, Baselines and Few-Shot Learning Experiments for Slovak Named Entity Recognition
David Suba | Marek Suppa | Jozef Kubik | Endre Hamerlik | Martin Takac
Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023)

Named Entity Recognition (NER) is a fundamental NLP task with a wide range of practical applications. The performance of state-of-the-art NER methods depends on high-quality, manually annotated datasets, which still do not exist for some languages. In this work, we aim to remedy this situation in Slovak by introducing WikiGoldSK, the first sizable human-labelled Slovak NER dataset. We benchmark it by evaluating state-of-the-art multilingual Pretrained Language Models and comparing it to the existing silver-standard Slovak NER dataset. We also conduct few-shot experiments and show that training on a silver-standard dataset yields better results. To enable future work that can build on Slovak NER, we release the dataset, code, as well as the trained models publicly under permissive licensing terms at https://github.com/NaiveNeuron/WikiGoldSK
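
As a pointer for such follow-up work, the sketch below shows the standard subword/label alignment step used when fine-tuning a multilingual pretrained model on token-level NER data with Hugging Face Transformers. The example sentence, the BIO tag set, and the choice of xlm-roberta-base are assumptions, not drawn from WikiGoldSK itself.

```python
# A minimal sketch of the subword/label alignment step for token-level NER
# fine-tuning with Hugging Face Transformers. The sentence, BIO tags, and
# the choice of xlm-roberta-base are assumptions, not taken from WikiGoldSK.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

words = ["Peter", "pracuje", "v", "Bratislave", "."]  # "Peter works in Bratislava."
tags = ["B-PER", "O", "O", "B-LOC", "O"]              # word-level BIO labels

encoding = tokenizer(words, is_split_into_words=True, truncation=True)

# Label only the first subword of each word; mark special tokens and
# continuation subwords with -100 so the loss function skips them.
aligned, previous = [], None
for word_id in encoding.word_ids():
    if word_id is None or word_id == previous:
        aligned.append(-100)
    else:
        aligned.append(tags[word_id])  # in real training: map tag -> integer id
    previous = word_id

print(list(zip(tokenizer.convert_ids_to_tokens(encoding["input_ids"]), aligned)))
```

Masking continuation subwords and special tokens with -100 matches the default ignore_index of PyTorch's cross-entropy loss, so only one prediction per word contributes to training.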