Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025)
Jakub Piskorski | Pavel Přibáň | Preslav Nakov | Roman Yangarber | Michal Marcinczuk
Identifying Filled Pauses in Speech Across South and West Slavic Languages
Nikola Ljubešić | Ivan Porupski | Peter Rupnik | Taja Kuzman
Filled pauses are among the most common paralinguistic features of speech, yet they are typically omitted from transcripts. We propose a transformer-based approach for detecting filled pauses directly from the speech signal, fine-tuned on Slovenian and evaluated across South and West Slavic languages. Our results show that speech transformers achieve excellent performance in detecting filled pauses when evaluated in the in-language scenario. We further evaluate the cross-lingual capabilities of the model on two closely related South Slavic languages (Croatian and Serbian) and two less closely related West Slavic languages (Czech and Polish). Our results reveal strong cross-lingual generalization capabilities of the model, with only minor performance drops. Moreover, error analysis reveals that the model outperforms human annotators in recall and F1 score, while trailing slightly in precision. In addition to evaluating the capabilities of speech transformers for filled pause detection across Slavic languages, we release new multilingual test datasets and make our fine-tuned model publicly available to support further research and applications in spoken language processing.
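As a rough illustration of the frame-classification setup (a hypothetical sketch, not the released model: the model id and label mapping below are placeholders):

```python
# Hypothetical sketch of frame-level filled-pause detection with a
# speech transformer; the model id is a placeholder.
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForAudioFrameClassification

MODEL_ID = "your-org/filled-pause-detector"  # placeholder, not the released model
extractor = AutoFeatureExtractor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForAudioFrameClassification.from_pretrained(MODEL_ID, num_labels=2)

def filled_pause_mask(audio, sampling_rate=16000):
    """Boolean mask over output frames: True where a filled pause is predicted."""
    inputs = extractor(audio, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # (1, n_frames, 2)
    return logits.argmax(dim=-1).squeeze(0).bool()
```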
Few-Shot Prompting, Full-Scale Confusion: Evaluating Large Language Models for Humor Detection in Croatian Tweets
Petra Bago | Nikola Bakarić
Humor detection in low-resource languages is hampered by cultural nuance and subjective annotation. We test two large language models, GPT-4 and Gemini 2.5 Flash, on labeling humor in 6,000 Croatian tweets with expert gold labels generated through a rigorous annotation pipeline. LLM–human agreement (κ = 0.28) matches human–human agreement (κ = 0.27), while LLM–LLM agreement is substantially higher (κ = 0.63). Although concordance with expert adjudication is lower, additional metrics imply that the models equal a second human annotator while working far faster and at negligible cost. These findings suggest that, even with simple prompting, LLMs can efficiently bootstrap subjective datasets and serve as practical annotation assistants in linguistically under-represented settings.
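For reference, pairwise agreement figures like those above are computed with Cohen's kappa; a minimal illustration (toy labels, not the paper's data):

```python
# Cohen's kappa: agreement between two annotators, corrected for chance.
from sklearn.metrics import cohen_kappa_score

human_labels = [1, 0, 1, 1, 0, 0, 1, 0]  # toy humor annotations
llm_labels   = [1, 0, 0, 1, 0, 1, 1, 0]
print(cohen_kappa_score(human_labels, llm_labels))  # value in [-1, 1]
```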
GigaEmbeddings — Efficient Russian Language Embedding Model
Egor Kolodin | Anastasia Ianina
We introduce GigaEmbeddings, a novel framework for training high-performance Russian-focused text embeddings through hierarchical instruction tuning of a decoder-only LLM designed specifically for Russian (GigaChat-3B). Our three-stage pipeline, comprising large-scale contrastive pre-training on web-scale corpora, fine-tuning with hard negatives, and multitask generalization across retrieval, classification, and clustering tasks, addresses key limitations of existing methods by unifying diverse objectives and leveraging synthetic data generation. Architectural innovations include bidirectional attention for contextual modeling, latent attention pooling for robust sequence aggregation, and strategic pruning of 25% of transformer layers to enhance efficiency without compromising performance. Evaluated on the ruMTEB benchmark spanning 23 multilingual tasks, GigaEmbeddings achieves state-of-the-art results (69.1 avg. score), outperforming strong baselines with larger parameter counts.
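The abstract does not spell out the contrastive objective; a common choice for this stage, assumed here purely for illustration, is an InfoNCE loss with in-batch negatives:

```python
# Sketch of in-batch-negative contrastive pre-training (an assumption,
# not necessarily the exact GigaEmbeddings objective).
import torch
import torch.nn.functional as F

def info_nce(queries, docs, temperature=0.05):
    """queries[i] should match docs[i]; other rows act as negatives."""
    q = F.normalize(queries, dim=-1)
    d = F.normalize(docs, dim=-1)
    logits = q @ d.T / temperature    # (B, B) scaled cosine similarities
    labels = torch.arange(q.size(0))  # positives sit on the diagonal
    return F.cross_entropy(logits, labels)
```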
PL-Guard: Benchmarking Language Model Safety for Polish
Aleksandra Krasnodebska | Karolina Seweryn | Szymon Łukasik | Wojciech Kusa
We present a benchmark dataset for evaluating language model safety in Polish, addressing the underrepresentation of medium-resource languages in existing safety assessments. Our dataset includes both original and adversarially perturbed examples. We fine-tune and evaluate multiple models—LlamaGuard-3-8B, a HerBERT-based classifier, and PLLuM—and find that the HerBERT-based model outperforms others, especially under adversarial conditions.
Dialects, Topic Models, and Border Effects: The Rusyn Case
Achim Rabus | Yves Scherrer
In this contribution, we present, discuss, and apply a data-driven approach for analyzing varieties of the Slavic minority language Carpathian Rusyn spoken in different countries in the Carpathian region. Using topic modeling, a method originally developed for text mining, we show that the Rusyn varieties are subject to border effects, i.e., vertical convergence and horizontal divergence, due to language contact with their respective umbrella languages Polish, Slovak, and Standard Ukrainian. Additionally, we show that the method is suitable for uncovering fieldworker isoglosses, i.e., different transcription principles in an otherwise homogeneous dataset.
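A minimal sketch of the underlying idea (hypothetical loader and topic count; the paper's features and preprocessing will differ):

```python
# Topic modeling over dialect transcripts: each fieldwork site is one
# "document"; topic mixtures then reveal clusters of varieties.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = load_transcripts()  # hypothetical: one transcript string per site
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=5, random_state=0)
site_topic_mixtures = lda.fit_transform(counts)  # (n_sites, n_topics)
```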
Towards Open Foundation Language Model and Corpus for Macedonian: A Low-Resource Language
Stefan Krsteski | Borjan Sazdov | Matea Tashkovska | Branislav Gerazov | Hristijan Gjoreski
Growing technological adoption worldwide creates demand for novel tools usable by the general population. Large Language Models (LLMs) provide a great opportunity in this respect, but their capabilities remain limited for low-resource languages, restricting applications in countries where such languages are spoken. We create several resources to facilitate the adoption of LLMs and to support research advancements for Macedonian. We collect the largest Macedonian corpus to date, consisting of 40GB of textual data and totaling 3.5B words. To support conversational applications, we collect a 106k-instance instruction dataset, carefully built to be culturally grounded. For evaluation, we construct a Macedonian evaluation suite covering seven benchmarks. Finally, we train domestic-yak, a state-of-the-art 8B-parameter model, on our curated datasets and evaluate it against eight baseline models using the newly constructed benchmark suite. Our model outperforms all existing models in the 8B parameter range across all benchmarks, and achieves performance comparable to models up to 10× larger. Furthermore, a qualitative analysis with native speakers reveals that our model is preferred over larger counterparts, receiving higher ratings for grammatical correctness and cultural appropriateness. All datasets, code, and model weights are openly released, setting a foundation for advancing LLMs in similarly underrepresented languages. These resources are publicly available at https://github.com/LVSTCK for source code, and at https://huggingface.co/LVSTCK for pretrained model weights and data.
Towards compact and efficient Slovak summarization models
Sebastian Petrik | Giang Nguyen
Language models, especially LLMs, often face significant limitations due to their high resource demands. While various model compression methods have emerged, their application to smaller models in multilingual and low-resource settings remains understudied. Our work evaluates selected decoder and embedding pruning methods on T5-based models for abstractive summarization in English and Slovak using a parallel dataset. The results reveal differences in model performance degradation and expand the limited Slovak summarization resources and models.
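One of the evaluated strategies, decoder pruning, can be sketched roughly as follows (illustrative only; the layer choice and recovery fine-tuning follow the paper's own setup, not this snippet):

```python
# Sketch: drop decoder blocks from a T5 model, then fine-tune to recover.
import torch.nn as nn
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")
keep = [0, 2, 4]  # keep 3 of the 6 decoder blocks (arbitrary choice here)
model.decoder.block = nn.ModuleList([model.decoder.block[i] for i in keep])
model.config.num_decoder_layers = len(keep)
# ... fine-tune on the summarization data to recover lost quality ...
```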
Adapting Definition Modeling for New Languages: A Case Study on Belarusian
Daniela Kazakouskaya | Timothee Mickus | Janine Siewert
Definition modeling, the task of generating new definitions for words in context, holds great promise as a means to assist the work of lexicographers in documenting a broader variety of lects and languages, yet much remains to be done to assess how we can leverage pre-existing models for as-of-yet unsupported languages. In this work, we focus on adapting existing models to Belarusian, for which we propose a novel dataset of 43,150 definitions. Our experiments demonstrate that adapting a definition modeling system requires minimal amounts of data, but that there are currently gaps in what automatic metrics capture.
Bridging the Gap with RedSQL: A Russian Text-to-SQL Benchmark for Domain-Specific Applications
Irina Brodskaya | Elena Tutubalina | Oleg Somov
We present the first domain-specific text-to-SQL benchmark in Russian, targeting fields with high operational load where rapid decision-making is critical. The benchmark spans 9 domains, including healthcare, aviation, and others, and comprises 409 curated query pairs. It is designed to test model generalization under domain shift, introducing challenges such as specialized terminology and complex schema structures. Evaluation of state-of-the-art large language models (LLMs) reveals a significant performance drop in comparison to open-domain academic benchmarks, highlighting the need for domain-aware approaches in text-to-SQL. The benchmark is available at: https://github.com/BrodskaiaIrina/functional-text2sql-subsets
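Text-to-SQL systems are commonly scored by execution match; a minimal sketch of that metric (an assumption here, since the benchmark may ship its own evaluation script):

```python
# Execution match: do gold and predicted SQL return the same rows?
import sqlite3

def execution_match(db_path, gold_sql, pred_sql):
    con = sqlite3.connect(db_path)
    try:
        gold = con.execute(gold_sql).fetchall()
        try:
            pred = con.execute(pred_sql).fetchall()
        except sqlite3.Error:
            return False  # prediction does not even execute
        return sorted(gold) == sorted(pred)  # order-insensitive comparison
    finally:
        con.close()
```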
Can information theory unravel the subtext in a Chekhovian short story?
J. Nathanael Philipp | Olav Mueller-Reichau | Matthias Irmer | Michael Richter | Max Kölbl
In this study, we investigate whether information-theoretic measures such as surprisal can quantify the elusive notion of subtext in a Chekhovian short story. Specifically, we conduct a series of experiments for which we enrich the original text once with (different types of) meaningful glosses and once with fake glosses. For the texts thus created, we calculate surprisal values using two methods: a bag-of-words model and a large language model. We observe enrichment effects depending on the method, but no interpretable subtext effect.
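Surprisal of a token w given its context is -log2 P(w | context); a generic recipe for the language-model variant (not the paper's exact setup or model choice):

```python
# Per-token surprisal (in bits) from a causal language model.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model choice
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def surprisals(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_lp = logprobs[torch.arange(ids.size(1) - 1), ids[0, 1:]]
    return -token_lp / math.log(2)  # natural log -> bits
```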
When the Dictionary Strikes Back: A Case Study on Slovak Migration Location Term Extraction and NER via Rule-Based vs. LLM Methods
Miroslav Blšták | Jaroslav Kopčan | Marek Suppa | Samuel Havran | Andrej Findor | Martin Takac | Marian Simko
This study explores the task of automatically extracting migration-related locations (source and destination) from media articles, focusing on the challenges posed by Slovak, a low-resource and morphologically complex language. We present the first comparative analysis of rule-based dictionary approaches (NLP4SK) versus Large Language Models (LLMs, e.g., SlovakBERT, GPT-4o) for both geographical relevance classification (Slovakia-focused migration) and specific source/target location extraction. To facilitate this research and future work, we introduce the first manually annotated Slovak dataset tailored for migration-focused locality detection. Our results show that while a fine-tuned SlovakBERT model achieves high accuracy for classification, specialized rule-based methods can still outperform LLMs on specific extraction tasks, though improved LLM performance with few-shot examples suggests they will become increasingly competitive as research in this area evolves.
DIACU: A dataset for the DIAchronic analysis of Church Slavonic
Maria Cassese | Giovanni Puccetti | Marianna Napolitano | Andrea Esuli
The Church Slavonic language has evolved over time without being formalized into a precise grammar. Consequently, there is currently no clearly outlined history of this language tracing its evolution. In recent years, however, there has been a greater effort to digitize these resources, partly motivated by increased awareness of the need to preserve multilingual knowledge. To exploit them, we propose DIACU (DIAchronic Analysis of Church Slavonic), a comprehensive collection of several existing corpora in Church Slavonic. In this work, we thoroughly describe the collection of this novel dataset and test its effectiveness as a training set for attributing Slavonic texts to specific periods. The dataset and the code of the experiments are available at https://github.com/MariaCassese/DIACU.
Characterizing Linguistic Shifts in Croatian News via Diachronic Word Embeddings
David Dukić | Ana Barić | Marko Čuljak | Josip Jukić | Martin Tutek
Measuring how the semantics of words change over time improves our understanding of how cultures and perspectives change. Diachronic word embeddings help us quantify this shift, with previous studies leveraging substantial temporally annotated corpora. In this work, we use a corpus of 9.5 million Croatian news articles spanning the past 25 years and quantify semantic change using skip-gram word embeddings trained on five-year periods. Our analysis finds that word embeddings capture linguistic shifts of terms pertaining to major topics in this timespan (COVID-19, Croatia joining the European Union, technological advancements). We also find evidence that post-2020 embeddings encode increased positivity in sentiment analysis tasks, contrasting with studies reporting a decline in mental health over the same period.
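Comparing embeddings trained on different time slices requires aligning the spaces first; the standard recipe (assumed here, not necessarily the paper's exact procedure) is orthogonal Procrustes over the shared vocabulary:

```python
# Align a source embedding matrix onto a target space, rows = shared vocab.
import numpy as np

def procrustes_align(source, target):
    """Find the rotation R minimizing ||source @ R - target||_F."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return source @ (u @ vt)

# Semantic change of a word = cosine distance between its aligned vector
# in one five-year slice and its vector in the next slice.
```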
What Makes You CLIC: Detection of Croatian Clickbait Headlines
Marija Andelic | Dominik Sipek | Laura Majer | Jan Snajder
Online news outlets operate predominantly on an advertising-based revenue model, compelling journalists to create headlines that are often scandalous, intriguing, and provocative – commonly referred to as clickbait. Automatic detection of clickbait headlines is essential for preserving information quality and reader trust in digital media and requires both contextual understanding and world knowledge. For this task, particularly in less-resourced languages, it remains unclear whether fine-tuned methods or in-context learning (ICL) yield better results. In this paper, we compile CLIC, a novel dataset for clickbait detection of Croatian news headlines spanning a 20-year period and encompassing both mainstream and fringe outlets. Furthermore, we fine-tune the BERTić model on the task of clickbait detection for Croatian and compare its performance to LLM-based ICL methods with prompts in both Croatian and English. Finally, we analyze the linguistic properties of clickbait. We find that nearly half of the analyzed headlines contain clickbait, and that fine-tuned models deliver better results than general LLMs.
Gender Representation Bias Analysis in LLM-Generated Czech and Slovenian Texts
Erik Derner | Kristina Batistič
Large language models (LLMs) often reflect social biases present in their training data, including imbalances in how different genders are represented. While most prior work has focused on English, gender representation bias remains underexplored in morphologically rich languages where grammatical gender is pervasive. We present a method for detecting and quantifying such bias in Czech and Slovenian, using LLMs to classify gendered person references in LLM-generated narratives. Applying this method to outputs from a range of models, we find substantial variation in gender balance. While some models produce near-equal proportions of male and female references, others exhibit strong male overrepresentation. Our findings highlight the need for fine-grained bias evaluation in under-represented languages and demonstrate the potential of LLM-based annotation in this space. We make our code and data publicly available.
REPA: Russian Error Types Annotation for Evaluating Text Generation and Judgment Capabilities
Alexander Pugachev | Alena Fenogenova | Vladislav Mikhailov | Ekaterina Artemova
Recent advances in large language models (LLMs) have introduced the novel paradigm of using LLMs as judges, where an LLM evaluates and scores the outputs of another LLM, often correlating highly with human preferences. However, the use of LLM-as-a-judge has been primarily studied in English. In this paper, we evaluate this framework in Russian by introducing the Russian Error tyPes Annotation dataset (REPA; Russian for 'turnip'), a dataset of 1,000 user queries and 2,000 LLM-generated responses. Human annotators labeled each response pair, expressing their preferences across ten specific error types, as well as selecting an overall preference. We rank six generative LLMs across the error types using three rating systems based on human preferences. We also evaluate responses using eight LLM judges in zero-shot and few-shot settings, and analyze judge behavior, including position and length biases. Our findings reveal a notable gap between LLM judge performance in Russian and English. However, rankings based on human and LLM preferences show partial alignment, suggesting that while current LLM judges struggle with fine-grained evaluation in Russian, there is potential for improvement.
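The three rating systems are not detailed here; one standard option for turning pairwise preferences into a ranking, shown purely as an assumption, is an Elo-style update:

```python
# Elo update after one pairwise comparison between models A and B.
def elo_update(rating_a, rating_b, a_wins, k=32):
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_wins else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1 - score_a) - (1 - expected_a))
    return rating_a, rating_b
```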
Fine-Tuned Transformers for Detection and Classification of Persuasion Techniques in Slavic Languages
Ekaterina Loginova
This paper details a system developed for the SlavicNLP 2025 Shared Task on the Detection and Classification of Persuasion Techniques in Texts for Slavic Languages (Bulgarian, Croatian, Polish, Russian, and Slovene). The shared task comprises two subtasks: binary detection of persuasive content within text fragments and multi-class, multi-label identification of specific persuasion techniques at the token level. Our primary approach for both subtasks involved fine-tuning pre-trained multilingual Transformer models. For Subtask 1 (paragraph-level binary detection), we fine-tuned a multilingual Transformer sequence classifier, augmenting its training with additional labelled data. For Subtask 2 (token-level multi-label classification), we re-cast the problem as named-entity recognition. The resulting systems reached an F1 score of 0.92 in paragraph-level detection (ranked third on average). We present our system architecture, data handling, training procedures, and official results, alongside areas for future improvement.
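Re-casting token-level technique spans as named-entity recognition amounts to converting span annotations into BIO tags; a minimal sketch (assuming character-offset annotations, which may differ from the task's exact format):

```python
# Convert character-level technique spans to BIO tags over tokens.
def spans_to_bio(tokens, spans):
    """tokens: [(start, end, text)]; spans: [(start, end, technique)]."""
    tags = ["O"] * len(tokens)
    for s, e, technique in spans:
        covered = [i for i, (ts, te, _) in enumerate(tokens)
                   if ts >= s and te <= e]
        for rank, i in enumerate(covered):
            tags[i] = ("B-" if rank == 0 else "I-") + technique
    return tags
```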
Rubic2: Ensemble Model for Russian Lemmatization
Ilia Afanasev | Anna Glazkova | Olga Lyashevskaya | Dmitry Morozov | Ivan Smal | Natalia Vlasova
Pre-trained language models have significantly advanced natural language processing (NLP), particularly in analyzing languages with complex morphological structures. This study addresses lemmatization for the Russian language, errors in which can critically affect the performance of information retrieval, question answering, and other tasks. We present the results of experiments on generative lemmatization using pre-trained language models. Our findings demonstrate that combining generative models with existing solutions yields performance that surpasses current results for the lemmatization of Russian. This paper also introduces Rubic2, a new ensemble approach that combines the generative BART-base model, fine-tuned on a manually annotated dataset of 2.1 million tokens, with the neural model Rubic, which is currently used for morphological annotation and lemmatization in the Russian National Corpus. Extensive experiments show that Rubic2 outperforms current solutions for the lemmatization of Russian, offering superior results across various text domains and contributing to advancements in NLP applications.
Gradient Flush at Slavic NLP 2025 Task: Leveraging Slavic BERT and Translation for Persuasion Techniques Classification
Sergey Senichev | Aleksandr Boriskin | Nikita Krayko | Daria Galimzianova
The task of persuasion technique detection is limited by several challenges, such as insufficient training data and label ambiguity. In this paper, we describe a solution for the Slavic NLP 2025 Shared Task. It utilizes multilingual XLM-RoBERTa, trained on 100 languages, and Slavic BERT, a model fine-tuned on four Slavic languages. We augment the training dataset with related data from previous shared tasks, as well as automatic translations from English and German. The resulting solutions rank among the top 3 for Russian in Subtask 1 and for all languages in Subtask 2. We release the code for our solution at https://github.com/ssenichev/ACL_SlavicNLP2025.
Empowering Persuasion Detection in Slavic Texts through Two-Stage Generative Reasoning
Xin Zou | Chuhan Wang | Dailin Li | Yanan Wang | Jian Wang | Hongfei Lin
This paper presents our submission to Subtask 2 (multi-label classification of persuasion techniques) of the Shared Task on Detection and Classification of Persuasion Techniques in Slavic Languages at Slavic NLP 2025. Our method leverages a teacher–student framework based on large language models (LLMs): a Qwen3 32B teacher model generates natural language explanations for annotated persuasion techniques, and a Qwen2.5 32B student model is fine-tuned to replicate both the teacher’s rationales and the final label predictions. We train our models on the official shared task dataset, supplemented by annotated resources from SemEval 2023 Task 3 and CLEF 2024 Task 3 covering English, Russian, and Polish to improve cross-lingual robustness. Our final system ranks 4th on BG, SI, and HR, and 5th on PL in terms of micro-F1 score among all participating teams.
Hierarchical Classification of Propaganda Techniques in Slavic Texts in Hyperbolic Space
Christopher Brückner | Pavel Pecina
Classification problems can often be tackled by modeling label hierarchies with broader categories in a graph and solving the task via node classification. While recent advances have shown that hyperbolic space is more suitable than Euclidean space for learning graph representations, this concept has yet to be applied to text classification, where node features first need to be extracted from text embeddings. This contribution to the Slavic NLP 2025 shared task on the multi-label classification of persuasion techniques in parliamentary debates and social media posts is a prototype of such an architecture. We do not achieve state-of-the-art performance, but outline the benefits of this hierarchical node classification approach and the advantages of hyperbolic graph embeddings.
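For reference, the quantity such embeddings preserve is distance in the Poincaré ball; a reference implementation, independent of the paper's code:

```python
# Geodesic distance between two points inside the unit (Poincare) ball.
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    uu = np.clip(np.dot(u, u), 0.0, 1.0 - eps)
    vv = np.clip(np.dot(v, v), 0.0, 1.0 - eps)
    duv = np.sum((u - v) ** 2)
    return np.arccosh(1.0 + 2.0 * duv / ((1.0 - uu) * (1.0 - vv)))
```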
Team INSAntive at SlavicNLP-2025 Shared Task: Data Augmentation and Enhancement via Explanations for Persuasion Technique Classification
Yutong Wang | Diana Nurbakova | Sylvie Calabretto
This study investigates the automatic detection and classification of persuasion techniques across five Slavic languages (Bulgarian, Croatian, Polish, Russian, and Slovenian), addressing two subtasks: binary detection of persuasion techniques in text fragments (Subtask 1) and multi-label classification of specific technique types (Subtask 2). To overcome limited training resources, we implemented a multi-level cross-lingual augmentation strategy utilizing GPT-4o for non-Slavic to Slavic conversion and intra-Slavic language migration. We employ the XLM-RoBERTa architecture with two LLM-enhanced variants that use explanations to improve classification performance. The experimental results demonstrate varied performance across languages and tasks, with our approach achieving first place in Subtask 1 for Russian and second place in Subtask 2 for Bulgarian, confirming that larger-parameter models excel in complex classification tasks. These findings highlight the significant potential of LLMs for enhancing multilingual classification and the persistent difficulties in ensuring consistent cross-linguistic performance.
LLMs for Detection and Classification of Persuasion Techniques in Slavic Parliamentary Debates and Social Media Texts
Julia Jose | Rachel Greenstadt
We present an LLM-based method for the Slavic NLP 2025 shared task on detection and classification of persuasion techniques in parliamentary debates and social media. Our system uses OpenAI’s GPT models (gpt-4o-mini) and reasoning models (o4-mini) with chain-of-thought prompting, enforcing a ≥ 0.99 confidence threshold for verbatim span extraction. For Subtask 1, each paragraph in the text is labeled “true” if any of the 25 persuasion techniques is present. For Subtask 2, the model returns the full set of techniques used per paragraph. Across Bulgarian, Croatian, Polish, Russian, and Slovenian, we achieve Subtask 1 micro-F1 of 81.7%, 83.3%, 81.6%, 73.5%, and 62.0%, respectively, and Subtask 2 F1 of 41.0%, 44.4%, 41.9%, 29.3%, and 29.9%, respectively. Our system ranked in the top 2 for Subtask 2 and top 7 for Subtask 1.
Fine-Tuned Transformer-Based Weighted Soft Voting Ensemble for Persuasion Technique Classification in Slavic Languages
Mahshar Yahan | Sakib Sarker | Mohammad Islam
This paper explores detecting persuasion techniques in Slavic languages using both single transformer models and weighted soft voting ensemble methods. We focused on identifying the presence of persuasion in Bulgarian, Polish, Slovene, and Russian text fragments. We have applied various preprocessing steps to improve model performance. Our experiments show that weighted soft voting ensembles consistently outperform single models in most languages, achieving F1-scores of 0.867 for Bulgarian, 0.902 for Polish, and 0.804 for Russian. For Slovene, the single SlovakBERT model performed best with an F1-score of 0.823, just ahead of the ensemble. These results demonstrate that combining monolingual and multilingual transformer models is effective for robust persuasion detection in low-resource Slavic languages.
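Weighted soft voting combines the per-class probabilities of several models; a minimal sketch (the weights here are illustrative, e.g., per-model validation F1-scores):

```python
# Weighted soft voting over an ensemble of classifiers.
import numpy as np

def soft_vote(probs, weights):
    """probs: (n_models, n_samples, n_classes); weights: (n_models,)."""
    w = np.asarray(weights, dtype=float)
    w /= w.sum()                          # normalize model weights
    avg = np.tensordot(w, probs, axes=1)  # (n_samples, n_classes)
    return avg.argmax(axis=-1)            # predicted class per sample
```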
Robust Detection of Persuasion Techniques in Slavic Languages via Multitask Debiasing and Walking Embeddings
Ewelina Ksiezniak | Krzysztof Wecel | Marcin Sawinski
We present our solution to Subtask 1 of the Shared Task on the Detection and Classification of Persuasion Techniques in Texts for Slavic Languages. Our approach integrates fine-tuned multilingual transformer models with two complementary robustness-oriented strategies: Walking Embeddings and Content-Debiasing. The former probes how embeddings change when various manipulation techniques are applied; the latter leverages a supervised contrastive objective over semantically equivalent yet stylistically divergent text pairs, generated via GPT-4. We conduct extensive experiments, including 5-fold cross-validation and out-of-domain evaluation, and explore the impact of contrastive loss weighting.
Multilabel Classification of Persuasion Techniques with self-improving LLM agent: SlavicNLP 2025 Shared Task
Marcin Sawinski | Krzysztof Wecel | Ewelina Ksiezniak
We present a system for the SlavicNLP 2025 Shared Task on multilabel classification of 25 persuasion techniques across Slavic languages. We investigate the effectiveness of in-context learning with one-shot classification, automatic prompt refinement, and supervised fine-tuning using self-generated annotations. Our findings highlight the potential of LLM-based systems to generalize across languages and label sets with minimal supervision.
SlavicNLP 2025 Shared Task: Detection and Classification of Persuasion Techniques in Parliamentary Debates and Social Media
Jakub Piskorski | Dimitar Dimitrov | Filip Dobranić | Marina Ernst | Jacek Haneczok | Ivan Koychev | Nikola Ljubešić | Michal Marcinczuk | Arkadiusz Modzelewski | Ivo Moravski | Roman Yangarber
We present the SlavicNLP 2025 Shared Task on Detection and Classification of Persuasion Techniques in Parliamentary Debates and Social Media. The task is structured into two subtasks: (1) Detection, to determine whether a given text fragment contains persuasion techniques, and (2) Classification, to determine which persuasion techniques are present in a given text fragment, using a taxonomy of 25 persuasion techniques. The task focuses on two text genres, namely parliamentary debates revolving around widely discussed topics, and social media, in five languages: Bulgarian, Croatian, Polish, Russian, and Slovene. This task contributes to the broader effort of detecting and understanding manipulative attempts in various contexts. Fifteen teams registered to participate in the task, of which 9 submitted a total of circa 220 system responses and described their approaches in 9 system description papers.