Filip Dobranić


2025

SlavicNLP 2025 Shared Task: Detection and Classification of Persuasion Techniques in Parliamentary Debates and Social Media
Jakub Piskorski | Dimitar Dimitrov | Filip Dobranić | Marina Ernst | Jacek Haneczok | Ivan Koychev | Nikola Ljubešić | Michal Marcinczuk | Arkadiusz Modzelewski | Ivo Moravski | Roman Yangarber
Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025)

We present the SlavicNLP 2025 Shared Task on Detection and Classification of Persuasion Techniques in Parliamentary Debates and Social Media. The task is structured into two subtasks: (1) Detection, to determine whether a given text fragment contains persuasion techniques, and (2) Classification, to determine which persuasion techniques are present in a given text fragment, using a taxonomy of 25 persuasion techniques. The task focuses on two text genres, namely parliamentary debates revolving around widely discussed topics and social media, in five languages: Bulgarian, Croatian, Polish, Russian and Slovene. This task contributes to the broader effort of detecting and understanding manipulative attempts in various contexts. Of the 15 teams that registered to participate in the task, 9 submitted a total of circa 220 system responses and described their approaches in 9 system description papers.

2024

A Lightweight Approach to a Giga-Corpus of Historical Periodicals: The Story of a Slovenian Historical Newspaper Collection
Filip Dobranić | Bojan Evkoski | Nikola Ljubešić
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Preparing historical newspaper collections is a complicated endeavour, consisting of multiple steps that have to be carefully adapted to the specific content in question, including imaging, layout prediction, optical character recognition, and linguistic annotation. To address the high costs associated with the process, we present a lightweight approach to producing high-quality corpora and apply it to a massive collection of Slovenian historical newspapers from the 18th, 19th and 20th centuries, resulting in a billion-word giga-corpus. We start with noisy OCR-ed data produced by different technologies in varying periods by the National and University Library of Slovenia. To address the inherent variability in the quality of textual data, a challenge commonly encountered in digital libraries globally, we perform a targeted post-digitisation correction procedure, coupled with a robust curation mechanism for noisy texts via language model inference. Subsequently, we subject the corrected and filtered output to comprehensive linguistic annotation, enriching the corpus with part-of-speech tags, lemmas, and named entity labels. Finally, we perform an analysis through topic modeling at the noun lemma level, along with a frequency analysis of the named entities, to confirm the viability of our corpus preparation method.