Ranka Stanković

Also published as: Ranka Stankovic, Ranka Stankoviæ


Distant Reading in Digital Humanities: Case Study on the Serbian Part of the ELTeC Collection
Ranka Stanković | Cvetana Krstev | Branislava Šandrih Todorović | Dusko Vitas | Mihailo Skoric | Milica Ikonić Nešić
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this paper we present the Serbian part of the ELTeC multilingual corpus of novels written in the time period 1840-1920. The corpus is being built in order to test various distant reading methods and tools with the aim of re-thinking the European literary history. We present the various steps that led to the production of the Serbian sub-collection: the novel selection and retrieval, text preparation, structural annotation, POS-tagging, lemmatization and named entity recognition. The Serbian sub-collection was published on different platforms in order to make it freely available to various users. Several use examples show that this sub-collection is useful for both close and distant reading approaches.

From ELTeC Text Collection Metadata and Named Entities to Linked-data (and Back)
Milica Ikonić Nešić | Ranka Stanković | Christof Schöch | Mihailo Skoric
Proceedings of the 8th Workshop on Linked Data in Linguistics within the 13th Language Resources and Evaluation Conference

In this paper we present the wikification of the ELTeC (European Literary Text Collection), developed within the COST Action “Distant Reading for European Literary History” (CA16204). ELTeC is a multilingual corpus of novels written in the time period 1840—1920, built to apply distant reading methods and tools to explore the European literary history. We present the pipeline that led to the production of the linked dataset, the novels’ metadata retrieval and named entity recognition, transformation, mapping and Wikidata population, followed by named entity linking and export to NIF (NLP Interchange Format). The speeding up of the process of data preparation and import to Wikidata is presented on the use case of seven sub-collections of ELTeC (English, Portuguese, French, Slovenian, German, Hungarian and Serbian). Our goal was to automate the process of preparing and importing information, so OpenRefine and QuickStatements were chosen as the best options. The paper also includes examples of SPARQL queries for retrieval of authors, novel titles, publication places and other metadata with different visualisation options as well as statistical overviews.

Sentiment Analysis of Serbian Old Novels
Ranka Stanković | Miloš Košprdić | Milica Ikonić Nešić | Tijana Radović
Proceedings of the 2nd Workshop on Sentiment Analysis and Linguistic Linked Data

In this paper we present the first study of Sentiment Analysis (SA) of Serbian novels from the 1840-1920 period. The preparation of the sentiment lexicon was based on three existing lexicons: NRC, AFINN and Bing, with additional extensive corrections. The first phase of dataset refinement included filtering out the words that are not found in the Serbian morphological dictionary, and in the second phase the automatically assigned POS tags and lemmas were manually corrected. The polarity lexicon was extracted, transformed into ontolex-lemon and published as an initial version. The complex inflection system of the Serbian language required expansion of the sentiment lexicon with inflected forms from Serbian morphological dictionaries. A set of sentences for SA was extracted from 120 novels of the Serbian part of the ELTeC collection, labelled for polarity and used for training several models. Several approaches to SA are compared, starting with four variations of the lexicon-based approach, followed by Logistic Regression, Naive Bayes, Decision Tree, Random Forest, SVM and k-NN. The comparison with models trained on a labelled movie reviews dataset indicates that such models cannot successfully be used for sentiment analysis of sentences in old novels.
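The lexicon-based approach mentioned in this abstract can be illustrated with a minimal sketch. It assumes a toy polarity lexicon and a toy form-to-lemma map standing in for the morphological dictionary; all entries and the function name `sentence_polarity` are hypothetical and are not taken from the published lexicon or the authors' code.

```python
# Toy polarity lexicon: lemma -> polarity score (hypothetical entries).
POLARITY = {"ljubav": 1, "sreća": 1, "tuga": -1, "strah": -1}

# Hypothetical inflected form -> lemma map, standing in for the expansion
# of the lexicon with forms from a morphological dictionary.
LEMMAS = {"ljubavi": "ljubav", "sreće": "sreća", "tuge": "tuga", "straha": "strah"}


def sentence_polarity(sentence: str) -> str:
    """Label a sentence by summing the polarity of its known words."""
    # Naive whitespace tokenization with punctuation stripping.
    tokens = [t.strip(".,;:!?\"'").lower() for t in sentence.split()]
    # Map each token to its lemma if known, then look up its polarity.
    score = sum(POLARITY.get(LEMMAS.get(t, t), 0) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    print(sentence_polarity("Puna ljubavi i sreće."))
    print(sentence_polarity("Bez straha."))
```

Mapping inflected forms to lemmas before the lookup is what lets a small lemma-level lexicon cover running text in a highly inflected language.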


Serbian NER&Beyond: The Archaic and the Modern Intertwinned
Branislava Šandrih Todorović | Cvetana Krstev | Ranka Stanković | Milica Ikonić Nešić
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

In this work, we present a Serbian literary corpus that is being developed under the umbrella of the “Distant Reading for European Literary History” COST Action CA16204. Using this corpus of novels written more than a century ago, we have developed and made publicly available a Named Entity Recognizer (NER) trained to recognize 7 different named entity types, with a Convolutional Neural Network (CNN) architecture, achieving an F1 score of ≈91% on the test dataset. This model has been further assessed on a separate evaluation dataset. We wrap up with a comparison of the developed model with the existing one, followed by a discussion of the pros and cons of both models.


A Multilingual Evaluation Dataset for Monolingual Word Sense Alignment
Sina Ahmadi | John Philip McCrae | Sanni Nimb | Fahad Khan | Monica Monachini | Bolette Pedersen | Thierry Declerck | Tanja Wissik | Andrea Bellandi | Irene Pisani | Thomas Troelsgård | Sussi Olsen | Simon Krek | Veronika Lipp | Tamás Váradi | László Simon | András Gyorffy | Carole Tiberius | Tanneke Schoonheim | Yifat Ben Moshe | Maya Rudich | Raya Abu Ahmad | Dorielle Lonke | Kira Kovalenko | Margit Langemets | Jelena Kallas | Oksana Dereza | Theodorus Fransen | David Cillessen | David Lindemann | Mikel Alonso | Ana Salgado | José Luis Sancho | Rafael-J. Ureña-Ruiz | Jordi Porta Zamorano | Kiril Simov | Petya Osenova | Zara Kancheva | Ivaylo Radev | Ranka Stanković | Andrej Perdih | Dejan Gabrovsek
Proceedings of the Twelfth Language Resources and Evaluation Conference

Aligning senses across resources and languages is a challenging task with beneficial applications in the field of natural language processing and electronic lexicography. In this paper, we describe our efforts in manually aligning monolingual dictionaries. The alignment is carried out at sense-level for various resources in 15 languages. Moreover, senses are annotated with possible semantic relationships such as broadness, narrowness, relatedness, and equivalence. In comparison to previous datasets for this task, this dataset covers a wide range of languages and resources and focuses on the more challenging task of linking general-purpose language. We believe that our data will pave the way for further advances in alignment and evaluation of word senses by creating new solutions, particularly those notoriously requiring data such as neural networks. Our resources are publicly available at https://github.com/elexis-eu/MWSA.

Machine Learning and Deep Neural Network-Based Lemmatization and Morphosyntactic Tagging for Serbian
Ranka Stankovic | Branislava Šandrih | Cvetana Krstev | Miloš Utvić | Mihailo Skoric
Proceedings of the Twelfth Language Resources and Evaluation Conference

The training of new tagger models for Serbian is primarily motivated by the enhancement of the existing tagset with the grammatical category of gender. The harmonization of resources that were manually annotated within different projects over a long period of time was an important task, enabled by the development of tools that support partial automation. The supporting tools take into account different taggers and tagsets. This paper focuses on TreeTagger and spaCy taggers, and the annotation schema alignment between Serbian morphological dictionaries, MULTEXT-East and Universal Part-of-Speech tagset. The trained models will be used to publish the new version of the Corpus of Contemporary Serbian as well as the Serbian literary corpus. The performance of the developed taggers was compared and the impact of training set size was investigated, which resulted in around 98% PoS-tagging precision per token for both new models. The sr_basic annotated dataset will also be published.

Multi-word Expressions for Abusive Speech Detection in Serbian
Ranka Stanković | Jelena Mitrović | Danka Jokić | Cvetana Krstev
Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons

This paper presents our work on the refinement and improvement of the Serbian language part of Hurtlex, a multilingual lexicon of words to hurt. We pay special attention to adding Multi-word expressions that can be seen as abusive, as such lexical entries are very important in obtaining good results in a plethora of abusive language detection tasks. We use Serbian morphological dictionaries as a basis for data cleaning and MWE dictionary creation. A connection to other lexical and semantic resources in Serbian is outlined and building of abusive language detection systems based on that connection is foreseen.


Development and Evaluation of Three Named Entity Recognition Systems for Serbian - The Case of Personal Names
Branislava Šandrih | Cvetana Krstev | Ranka Stankovic
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

In this paper we present a rule- and lexicon-based system for the recognition of Named Entities (NE) in Serbian newspaper texts that was used to prepare a gold standard annotated with personal names. It was further used to prepare training sets for four different levels of annotation, which were further used to train two Named Entity Recognition (NER) systems: Stanford and spaCy. All obtained models, together with the rule- and lexicon-based system, were evaluated on two sample texts: a part of the gold standard and an independent newspaper text of approximately the same size. The results show that the rule- and lexicon-based system outperforms the trained models in all four scenarios (measured by F1), while the Stanford models have the highest precision. All systems obtain the best results in recognizing full names, while the recognition of first names only is rather poor. The produced models are incorporated into a Web platform NER&Beyond that provides various NE-related functions.


Using English Baits to Catch Serbian Multi-Word Terminology
Cvetana Krstev | Branislava Šandrih | Ranka Stanković | Miljana Mladenović
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)


Rule-based Automatic Multi-word Term Extraction and Lemmatization
Ranka Stanković | Cvetana Krstev | Ivan Obradović | Biljana Lazić | Aleksandra Trtovac
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we present a rule-based method for multi-word term extraction that relies on extensive lexical resources in the form of electronic dictionaries and finite-state transducers for modelling various syntactic structures of multi-word terms. The same technology is used for lemmatization of extracted multi-word terms, which is unavoidable for highly inflected languages in order to pass extracted data to evaluators and subsequently to terminological e-dictionaries and databases. The approach is illustrated on a corpus of Serbian texts from the mining domain containing more than 600,000 simple word forms. Extracted and lemmatized multi-word terms are filtered in order to reject falsely offered lemmas and then ranked by introducing measures that combine linguistic and statistical information (C-Value, T-Score, LLR, and Keyness). Mean average precision for retrieval of MWU forms ranges from 0.789 to 0.804, while mean average precision of lemma production ranges from 0.956 to 0.960. The evaluation showed that 94% of distinct multi-word forms were evaluated as proper multi-word units, and among them 97% were associated with correct lemmas.
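One of the ranking measures named in this abstract, C-Value, can be illustrated with a short sketch. It follows the commonly cited definition by Frantzi et al.: log2 of the candidate's length in words, multiplied by its frequency, with the frequency reduced by the mean frequency of the longer candidates that contain it. The abstract does not say which variant the authors used, so this is an assumption, and the function and parameter names are hypothetical.

```python
import math


def c_value(length: int, freq: int, containing_freqs: tuple = ()) -> float:
    """C-Value of a multi-word candidate term.

    length           -- number of words in the candidate (assumed >= 2)
    freq             -- corpus frequency of the candidate
    containing_freqs -- frequencies of longer candidates that contain it
    """
    weight = math.log2(length)
    if not containing_freqs:
        # The candidate is not nested in any longer candidate.
        return weight * freq
    # Nested candidate: subtract the mean frequency of its containers.
    mean_nested = sum(containing_freqs) / len(containing_freqs)
    return weight * (freq - mean_nested)


if __name__ == "__main__":
    print(c_value(2, 10))
    print(c_value(2, 10, (4, 2)))
```

The subtraction penalizes candidates that occur mostly as parts of longer terms, so that a fragment is not ranked above the full term it belongs to.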


A tool for enhanced search of multilingual digital libraries of e-journals
Ranka Stanković | Cvetana Krstev | Ivan Obradović | Aleksandra Trtovac | Miloš Utvić
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper outlines the main features of Bibliša, a tool that offers various possibilities of enhancing queries submitted to large collections of TMX documents generated from aligned parallel articles residing in multilingual digital libraries of e-journals. The queries initiated by a simple or multiword keyword, in Serbian or English, can be expanded by Bibliša, both semantically and morphologically, using different supporting monolingual and multilingual resources, such as wordnets and electronic dictionaries. The tool operates within a complex system composed of several modules including a web application, which makes it readily accessible on the web. Its functionality has been tested on a collection of 44 TMX documents generated from articles published bilingually by the journal INFOtecha, yielding encouraging results. Further enhancements of the tool are underway, with the aim of transforming it from a powerful full-text and metadata search tool, to a useful translator's aid, which could be of assistance both in reviewing terminology used in context and in refining the multilingual resources used within the system.


GIS Application Improvement with Multilingual Lexical and Terminological Resources
Ranka Stanković | Ivan Obradović | Olivera Kitanović
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper introduces the results of integration of lexical and terminological resources, most of them developed within the Human Language Technology (HLT) Group at the University of Belgrade, with the Geological information system of Serbia (GeolISS) developed at the Faculty of Mining and Geology and funded by the Ministry of Environmental Protection. The approach to GeolISS development, which is aimed at the integration of existing geologic archives, data from published maps on different scales, newly acquired field data, and intranet and internet publishing of geologic information, is given, followed by the description of the geologic multilingual vocabulary and other lexical and terminological resources used. Two basic results are outlined: multilingual map annotation and improvement of queries for the GeolISS geodatabase. Multilingual labelling and annotation of maps for their graphic display and printing have been tested with Serbian, which describes regional information in the local language, and English, used for sharing geographic information with the world, although the geological vocabulary offers the possibility for integration of other languages as well. The resources also enable semantic and morphological expansion of queries, the latter being very important in highly inflective languages, such as Serbian.

A Description of Morphological Features of Serbian: a Revision using Feature System Declaration
Cvetana Krstev | Ranka Stanković | Duško Vitas
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper we discuss some well-known morphological descriptions used in various projects and applications (most notably MULTEXT-East and Unitex) and illustrate the encountered problems on Serbian. We have spotted four groups of problems: the lack of a value for an existing category, the lack of a category, the interdependence of values and categories lacking some description, and the lack of support for some types of categories. At the same time, various descriptions often describe exactly the same morphological property using different approaches. We propose a new morphological description for Serbian following the feature structure representation defined by the ISO standard. In this description we try to incorporate all characteristics of Serbian that need to be specified for various applications. We have developed several XSLT scripts that transform our description into descriptions needed for various applications. We have developed the first version of this new description, but we treat it as an ongoing project because for some properties we have not yet found a satisfactory solution.


E-Connecting Balkan Languages
Cvetana Krstev | Ranka Stanković | Duško Vitas | Svetla Koeva
Proceedings of the Workshop Multilingual resources, technologies and evaluation for central and Eastern European languages


The Usage of Various Lexical Resources and Tools to Improve the Performance of Web Search Engines
Cvetana Krstev | Ranka Stanković | Duško Vitas | Ivan Obradović
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we present how resources and tools developed within the Human Language Technology Group at the University of Belgrade can be used for tuning queries before submitting them to a web search engine. We argue that the selection of words chosen for a query, which are of paramount importance for the quality of results obtained by the query, can be substantially improved by using various lexical resources, such as morphological dictionaries and wordnets. These dictionaries enable semantic and morphological expansion of the query, the latter being very important in highly inflective languages, such as Serbian. Wordnets can also be used for adding another language to a query, if appropriate, thus making the query bilingual. Problems encountered in retrieving documents of interest are discussed and illustrated by examples. A brief description of resources is given, followed by an outline of the web tool which enables their integration. Finally, a set of examples is chosen in order to illustrate the use of the lexical resources and tool in question. Results obtained for these examples show that the number of documents obtained through a query by using our approach can double and even quadruple in some cases.
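The morphological, semantic and bilingual query expansion described in this abstract can be sketched as follows. The three small dictionaries are hypothetical stand-ins for a morphological dictionary and a wordnet, and the `OR`-joined output is only one plausible query syntax, not the format used by the authors' web tool.

```python
# Hypothetical stand-ins for a morphological dictionary and a wordnet.
FORMS = {"rudnik": ["rudnik", "rudnika", "rudniku", "rudnikom", "rudnici"]}
SYNONYMS = {"rudnik": ["kop"]}
TRANSLATIONS = {"rudnik": ["mine"]}


def expand_query(keyword: str, bilingual: bool = False) -> str:
    """Expand a keyword with synonyms, inflected forms and translations."""
    terms = []
    # Semantic expansion: the keyword plus its synonyms.
    for lemma in [keyword] + SYNONYMS.get(keyword, []):
        # Morphological expansion: all known inflected forms of each lemma.
        terms.extend(FORMS.get(lemma, [lemma]))
    if bilingual:
        # Optional expansion into the second language.
        terms.extend(TRANSLATIONS.get(keyword, []))
    # Remove duplicates while keeping the original order.
    unique = list(dict.fromkeys(terms))
    return " OR ".join(f'"{t}"' for t in unique)


if __name__ == "__main__":
    print(expand_query("rudnik", bilingual=True))
```

Because a search engine matches surface forms, listing the inflected forms explicitly is what allows documents containing, say, the genitive or instrumental form of the keyword to be retrieved.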


WS4LR: A Workstation for Lexical Resources
Cvetana Krstev | Ranka Stanković | Duško Vitas | Ivan Obradović
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper we describe WS4LR, the workstation for lexical resources, a software tool developed within the Human Language Technology Group at the Faculty of Mathematics, University of Belgrade. The tool is aimed at manipulating heterogeneous lexical resources, and the need for such a tool came from the large volume of resources the Group has developed in the course of many years and within different projects. The tool handles morphological dictionaries, wordnets, aligned texts and transducers equally and has already proved very useful for various tasks. Although it has so far been used mainly for Serbian, WS4LR is not language dependent and can be successfully used for resources in other languages provided that they follow the described formats and methodologies. The tool operates on the .NET platform and runs on a personal computer under Windows 2000/XP/2003 operating system with at least 256MB of internal memory.


Combining Heterogeneous Lexical Resources
Cvetana Krstev | Duško Vitas | Ranka Stanković | Ivan Obradović | Gordana Pavlović-Lažetić
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)