Branislava Šandrih Todorović

2024

Abusive Speech Detection in Serbian using Machine Learning
Danka Jokić | Ranka Stanković | Branislava Šandrih Todorović
Proceedings of the First International Conference on Natural Language Processing and Artificial Intelligence for Cyber Security

The increase in the use of abusive language on social media and virtual platforms has emphasized the importance of developing efficient hate speech detection systems. While there have been considerable advancements in creating such systems for the English language, resources are scarce for other languages, such as Serbian. This research paper explores the use of machine learning and deep learning techniques to identify abusive language in Serbian text. The authors used AbCoSER, a dataset of Serbian tweets that have been labeled as abusive or non-abusive. They evaluated various algorithms to classify tweets, and the best-performing model is based on the deep learning transformer architecture. The model attained an F1 macro score of 0.827, a figure that is commensurate with the benchmarks established for offensive speech datasets of a similar magnitude in other languages.

2023

Three Approaches to Client Email Topic Classification
Branislava Šandrih Todorović | Katarina Josipović | Jurij Kodre
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

This paper describes a use case that was implemented and is currently running in production at Nova Ljubljanska Banka: classifying incoming client emails in the Slovenian language according to their topics and priorities. The proposed approach relies on only one language-dependent resource, a Named Entity Recogniser (NER) for personal names used for anonymisation, so that resource is the only prerequisite for applying the approach to any other language.

2022

Distant Reading in Digital Humanities: Case Study on the Serbian Part of the ELTeC Collection
Ranka Stanković | Cvetana Krstev | Branislava Šandrih Todorović | Duško Vitas | Mihailo Škorić | Milica Ikonić Nešić
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this paper we present the Serbian part of the ELTeC multilingual corpus of novels written in the period 1840-1920. The corpus is being built in order to test various distant reading methods and tools with the aim of rethinking European literary history. We present the various steps that led to the production of the Serbian sub-collection: novel selection and retrieval, text preparation, structural annotation, POS-tagging, lemmatization and named entity recognition. The Serbian sub-collection was published on different platforms in order to make it freely available to various users. Several usage examples show that this sub-collection is useful for both close and distant reading approaches.

2021

Serbian NER&Beyond: The Archaic and the Modern Intertwinned
Branislava Šandrih Todorović | Cvetana Krstev | Ranka Stanković | Milica Ikonić Nešić
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

In this work, we present a Serbian literary corpus that is being developed under the umbrella of the “Distant Reading for European Literary History” COST Action CA16204. Using this corpus of novels written more than a century ago, we have developed and made publicly available a Named Entity Recognizer (NER) trained to recognize 7 different named entity types. It uses a Convolutional Neural Network (CNN) architecture and achieves an F1 score of ≈91% on the test dataset. This model has been further assessed on a separate evaluation dataset. We wrap up with a comparison of the developed model with the existing one, followed by a discussion of the pros and cons of both models.

Co-authors

Jurij Kodre 1

Duško Vitas 1

Mihailo Škorić 1

Venues
