Darja Fišer


2021

Exploring Stylometric and Emotion-Based Features for Multilingual Cross-Domain Hate Speech Detection
Ilia Markov | Nikola Ljubešić | Darja Fišer | Walter Daelemans
Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

In this paper, we describe experiments designed to evaluate the impact of stylometric and emotion-based features on hate speech detection: the task of classifying textual content into hate or non-hate speech classes. Our experiments are conducted for three languages – English, Slovene, and Dutch – in both in-domain and cross-domain setups, and aim to investigate hate speech using features that model two linguistic phenomena: the writing style of hateful social media content, operationalized as function word usage, on the one hand, and emotion expression in hateful messages on the other. The results of experiments with features that model different combinations of these phenomena support our hypothesis that stylometric and emotion-based features are robust indicators of hate speech. Their contribution remains persistent with respect to domain and language variation. We show that the combination of features that model the targeted phenomena outperforms word and character n-gram features under cross-domain conditions, and provides a significant boost to deep learning models, which currently obtain the best results, when combined with them in an ensemble.
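The function-word usage that operationalizes writing style here can be illustrated with a minimal feature extractor. This is a sketch only: the tiny word list and plain whitespace tokenization are illustrative assumptions, not the paper's actual setup, which would use a full language-specific function-word inventory and proper tokenization.

```python
from collections import Counter

# Hypothetical mini function-word list; real experiments would use a
# language-specific list of hundreds of function words.
FUNCTION_WORDS = {"the", "a", "of", "to", "and", "in", "you", "they"}
VOCAB = sorted(FUNCTION_WORDS)

def function_word_vector(text):
    """Relative frequency of each function word in the text."""
    tokens = text.lower().split()
    counts = Counter(t for t in tokens if t in FUNCTION_WORDS)
    total = max(len(tokens), 1)
    return [counts[w] / total for w in VOCAB]
```

Vectors like these are content-independent, which is one reason such features can transfer across domains.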

2020

CLARIN: Distributed Language Resources and Technology in a European Infrastructure
Maria Eskevich | Franciska de Jong | Alexander König | Darja Fišer | Dieter Van Uytvanck | Tero Aalto | Lars Borin | Olga Gerassimenko | Jan Hajic | Henk van den Heuvel | Neeme Kahusk | Krista Liin | Martin Matthiesen | Stelios Piperidis | Kadri Vider
Proceedings of the 1st International Workshop on Language Technology Platforms

CLARIN is a European Research Infrastructure providing access to digital language resources and tools from across Europe and beyond to researchers in the humanities and social sciences. This paper focuses on CLARIN as a platform for the sharing of language resources. It zooms in on the service offer for the aggregation of language repositories and the value proposition for a number of communities that benefit from the enhanced visibility of their data and services as a result of integration in CLARIN. The enhanced findability of language resources serves the social sciences and humanities (SSH) community at large and supports research communities that aim to collaborate based on virtual collections for a specific domain. The paper also addresses the wider landscape of service platforms based on language technologies, which has the potential to become a powerful set of interoperable facilities for a variety of communities of use.

Proceedings of the Second ParlaCLARIN Workshop
Darja Fišer | Maria Eskevich | Franciska de Jong
Proceedings of the Second ParlaCLARIN Workshop

The LiLaH Emotion Lexicon of Croatian, Dutch and Slovene
Nikola Ljubešić | Ilia Markov | Darja Fišer | Walter Daelemans
Proceedings of the Third Workshop on Computational Modeling of People's Opinions, Personality, and Emotions in Social Media

In this paper, we present emotion lexicons of Croatian, Dutch and Slovene, based on manually corrected automatic translations of the English NRC Emotion lexicon. We evaluate the impact of the translation changes by measuring the change in supervised classification results of socially unacceptable utterances when lexicon information is used for feature construction. We further showcase the usage of the lexicons by calculating the difference in emotion distributions in texts containing and not containing socially unacceptable discourse, comparing them across four languages (English, Croatian, Dutch, Slovene) and two topics (migrants and LGBT). We show significant and consistent improvements in automatic classification across all languages and topics, as well as consistent (and expected) emotion distributions across all languages and topics, proving the manually corrected lexicons to be a useful addition to the severely under-resourced area of emotion lexicons, a crucial resource for emotive analysis of text.

Interoperability in an Infrastructure Enabling Multidisciplinary Research: The case of CLARIN
Franciska de Jong | Bente Maegaard | Darja Fišer | Dieter van Uytvanck | Andreas Witt
Proceedings of the 12th Language Resources and Evaluation Conference

CLARIN is a European Research Infrastructure providing access to language resources and technologies for researchers in the humanities and social sciences. It supports the use and study of language data in general and aims to increase the potential for comparative research of cultural and societal phenomena across the boundaries of languages and disciplines, all in line with the European agenda for Open Science. Data infrastructures such as CLARIN have recently embarked on the emerging frameworks for the federation of infrastructural services, such as the European Open Science Cloud and the integration of services resulting from multidisciplinary collaboration in federated services for the wider SSH domain. In this paper we describe the interoperability requirements that arise through the existing ambitions and the emerging frameworks. The interoperability theme will be addressed at several levels, including organisation and ecosystem, design of workflow services, data curation, performance measurement and collaboration.

2018

Predicting Concreteness and Imageability of Words Within and Across Languages via Word Embeddings
Nikola Ljubešić | Darja Fišer | Anita Peti-Stantić
Proceedings of The Third Workshop on Representation Learning for NLP

The notions of concreteness and imageability, traditionally important in psycholinguistics, are gaining significance in semantic-oriented natural language processing tasks. In this paper we investigate the predictability of these two concepts via supervised learning, using word embeddings as explanatory variables. We perform predictions both within and across languages by exploiting collections of cross-lingual embeddings aligned to a single vector space. We show that the notions of concreteness and imageability are highly predictable both within and across languages, with a moderate loss of up to 20% in correlation when predicting across languages. We further show that the cross-lingual transfer via word embeddings is more efficient than the simple transfer via bilingual dictionaries.
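A minimal sketch of the supervised setup described above, with toy random data standing in for real embeddings and human concreteness norms. Ridge regression is one plausible choice of learner here; the paper's exact model, embedding dimensionality, and data are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: 100 "words" with 8-dim embeddings and a concreteness
# score that is a noisy linear function of the embedding (assumption:
# the real setup uses cross-lingual embeddings and human rating norms).
X = rng.normal(size=(100, 8))
true_w = rng.normal(size=8)
y = X @ true_w + 0.1 * rng.normal(size=100)

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X^T X + lam I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w = ridge_fit(X, y)
corr = np.corrcoef(X @ w, y)[0, 1]
```

Evaluating by correlation between predicted and gold scores mirrors how such concreteness/imageability predictions are typically reported.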

Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)
Darja Fišer | Ruihong Huang | Vinodkumar Prabhakaran | Rob Voigt | Zeerak Waseem | Jacqueline Wernimont
Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)

Datasets of Slovene and Croatian Moderated News Comments
Nikola Ljubešić | Tomaž Erjavec | Darja Fišer
Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)

This paper presents two large, newly constructed datasets of moderated news comments from two highly popular online news portals in the respective countries: the Slovene RTV MCC and the Croatian 24sata. The datasets are analyzed by manually annotating the types of content deleted by moderators and by investigating deletion trends among users and threads. Next, initial experiments on automatically detecting the deleted content in the datasets are presented. Both datasets are published in encrypted form, to enable others to perform experiments on detecting content to be deleted without revealing potentially inappropriate content. Finally, the baseline classification models trained on the non-encrypted datasets are disseminated as well to enable real-world use.

CLARIN’s Key Resource Families
Darja Fišer | Jakob Lenardič | Tomaž Erjavec
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

CLARIN: Towards FAIR and Responsible Data Science Using Language Resources
Franciska de Jong | Bente Maegaard | Koenraad De Smedt | Darja Fišer | Dieter Van Uytvanck
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Adapting a State-of-the-Art Tagger for South Slavic Languages to Non-Standard Text
Nikola Ljubešić | Tomaž Erjavec | Darja Fišer
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

In this paper we present the adaptation of a state-of-the-art tagger for South Slavic languages to non-standard texts, using the Slovene language as an example. We investigate the impact of introducing in-domain training data as well as additional supervision through external resources or tools such as word clusters and word normalization. We remove more than half of the error of the standard tagger when applied to non-standard texts by training it on a combination of standard and non-standard training data, while enriching the data representation with external resources removes an additional 11 percent of the error. The final configuration achieves a tagging accuracy of 87.41% on the full morphosyntactic description, which is, nevertheless, still quite far from the accuracy of 94.27% achieved on standard text.

Language-independent Gender Prediction on Twitter
Nikola Ljubešić | Darja Fišer | Tomaž Erjavec
Proceedings of the Second Workshop on NLP and Computational Social Science

In this paper we present a set of experiments and analyses on predicting the gender of Twitter users based on language-independent features extracted either from the text or the metadata of users’ tweets. We perform our experiments on the TwiSty dataset containing manual gender annotations for users speaking six different languages. Our classification results show that, while the prediction model based on language-independent features performs worse than the bag-of-words model when training and testing on the same language, it regularly outperforms the bag-of-words model when applied to different languages, showing very stable results across various languages. Finally, we perform a comparative analysis of feature effect sizes across the six languages and show that differences in our features correspond to cultural distances.

Legal Framework, Dataset and Annotation Schema for Socially Unacceptable Online Discourse Practices in Slovene
Darja Fišer | Tomaž Erjavec | Nikola Ljubešić
Proceedings of the First Workshop on Abusive Language Online

In this paper we present the legal framework, dataset and annotation schema of socially unacceptable discourse practices on social networking platforms in Slovenia. On this basis we aim to train an automatic identification and classification system, with which we wish to contribute towards an improved methodology, understanding and treatment of such practices in the contemporary, increasingly multicultural information society.

2016

A Global Analysis of Emoji Usage
Nikola Ljubešić | Darja Fišer
Proceedings of the 10th Web as Corpus Workshop

Private or Corporate? Predicting User Types on Twitter
Nikola Ljubešić | Darja Fišer
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)

In this paper we present a series of experiments on discriminating between private and corporate accounts on Twitter. We define features based on Twitter metadata, morphosyntactic tags and surface forms, showing that the simple bag-of-words model achieves the single best results, which can, however, be improved by building a weighted soft ensemble of classifiers based on each feature type. Investigating the time and language dependence of each feature type delivers quite unexpected results, showing that features based on metadata are neither time- nor language-independent, as the way the two user groups use the social network varies heavily through time and space.
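The weighted soft ensemble over per-feature-type classifiers can be sketched as follows. Weighting each classifier by some per-classifier score (e.g. a development-set result) is an illustrative assumption; the paper's exact weighting scheme is not reproduced.

```python
def soft_vote(prob_by_feature_type, weights):
    """Weighted average of per-classifier class probabilities.

    prob_by_feature_type: dict name -> [p_private, p_corporate]
    weights: dict name -> float (e.g. each classifier's dev score;
    the weighting scheme here is an illustrative assumption).
    Returns (predicted class index, averaged probabilities).
    """
    total = sum(weights.values())
    n_classes = len(next(iter(prob_by_feature_type.values())))
    avg = [0.0] * n_classes
    for name, probs in prob_by_feature_type.items():
        for i, p in enumerate(probs):
            avg[i] += weights[name] * p / total
    return avg.index(max(avg)), avg
```

A soft (probability-averaging) vote lets a confident classifier outweigh several uncertain ones, which a hard majority vote cannot do.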

Corpus-Based Diacritic Restoration for South Slavic Languages
Nikola Ljubešić | Tomaž Erjavec | Darja Fišer
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In computer-mediated communication, users of Latin-based scripts often omit diacritics when writing. Such text is typically easily understandable to humans but very difficult for computational processing because many words become ambiguous or unknown. Letter-level approaches to diacritic restoration generalise better and do not require a lot of training data, but word-level approaches tend to yield better results. However, they typically rely on a lexicon, which is an expensive resource that does not cover non-standard forms and is often not available for less-resourced languages. In this paper we present diacritic restoration models that are trained on easy-to-acquire corpora. We test three different types of corpora (Wikipedia, general web, Twitter) for three South Slavic languages (Croatian, Serbian and Slovene) and evaluate them on two types of text: standard (Wikipedia) and non-standard (Twitter). The proposed approach considerably outperforms charlifter, so far the only open-source tool available for this task. We make the best performing systems freely available.
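The word-level side of such an approach can be illustrated with a simple corpus-derived restoration lexicon. This is a sketch only (lower-case, whitespace-tokenized input); the paper's best systems are trained on much larger corpora and also cover letter-level modelling, not shown here.

```python
from collections import Counter, defaultdict

# South Slavic diacritics relevant to the task (lower-case only for brevity).
STRIP = str.maketrans("čšžćđ", "cszcd")

def build_restoration_lexicon(corpus_tokens):
    """Map each de-diacritized form to its most frequent diacritized
    variant observed in the corpus."""
    variants = defaultdict(Counter)
    for tok in corpus_tokens:
        variants[tok.translate(STRIP)][tok] += 1
    return {plain: c.most_common(1)[0][0] for plain, c in variants.items()}

def restore(text, lexicon):
    """Replace each token by its restored form, if one is known."""
    return " ".join(lexicon.get(t, t) for t in text.split())
```

Because the lexicon is induced directly from a corpus, non-standard spellings that occur in the training data are handled for free, which is exactly what a hand-built dictionary would miss.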

2015

Predicting the Level of Text Standardness in User-generated Content
Nikola Ljubešić | Darja Fišer | Tomaž Erjavec | Jaka Čibej | Dafne Marko | Senja Pollak | Iza Škrjanec
Proceedings of the International Conference Recent Advances in Natural Language Processing

2014

sloWCrowd: A crowdsourcing tool for lexicographic tasks
Darja Fišer | Aleš Tavčar | Tomaž Erjavec
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The paper presents sloWCrowd, a simple tool developed to facilitate crowdsourcing lexicographic tasks, such as error correction in automatically generated wordnets and semantic annotation of corpora. The tool is open-source, language-independent and can be adapted to a broad range of crowdsourcing tasks. Since volunteers who participate in our crowdsourcing tasks are not trained lexicographers, the tool has been designed to obtain multiple answers to the same question and compute the majority vote, making sure individual unreliable answers are discarded. We also make sure unreliable volunteers, who systematically provide unreliable answers, are not taken into account. This is achieved by measuring their accuracy against a gold standard, questions from which are posed to the annotators on a regular basis in between the real questions. We tested the tool in an extensive crowdsourcing task, i.e. error correction of the Slovene wordnet, the results of which are encouraging, motivating us to use the tool in other annotation tasks in the future as well.
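The aggregation scheme described above (majority voting plus gold-standard filtering of unreliable annotators) can be sketched as follows. The 0.8 accuracy cut-off is an illustrative assumption, not the tool's actual setting.

```python
from collections import Counter

def reliable_majority(answers_by_user, gold_accuracy, min_accuracy=0.8):
    """Majority vote over answers, discarding users whose accuracy on
    interleaved gold-standard questions falls below a threshold
    (the 0.8 cut-off is an illustrative assumption)."""
    kept = [a for user, a in answers_by_user.items()
            if gold_accuracy.get(user, 0.0) >= min_accuracy]
    if not kept:
        return None  # no reliable annotators for this item
    answer, count = Counter(kept).most_common(1)[0]
    # Require an actual majority among the retained annotators.
    return answer if count > len(kept) / 2 else None
```

Returning `None` when no reliable majority exists lets the item be re-queued for more annotators rather than silently accepting a noisy label.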

TweetCaT: a tool for building Twitter corpora of smaller languages
Nikola Ljubešić | Darja Fišer | Tomaž Erjavec
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents TweetCaT, an open-source Python tool for building Twitter corpora that was designed for smaller languages. Using the Twitter search API and a set of seed terms, the tool identifies users tweeting in the language of interest together with their friends and followers. By running the tool for 235 days we tested it on the task of collecting two monitor corpora, one for Croatian and Serbian and the other for Slovene, thus also creating new and valuable resources for these languages. A post-processing step on the collected corpus is also described, which filters out users that tweet predominantly in a foreign language, thus further cleaning the collected corpora. Finally, an experiment on discriminating between Croatian and Serbian Twitter users is reported.
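The post-processing step that filters out users tweeting predominantly in a foreign language can be sketched as below. The 0.7 share threshold and the pluggable tweet-level language identifier are illustrative assumptions, not the tool's documented behaviour.

```python
def keep_target_language_users(user_tweets, detect_lang, target,
                               min_share=0.7):
    """Keep only users whose share of tweets in the target language
    meets a threshold (0.7 is an illustrative value).

    detect_lang: any tweet-level language identifier, tweet -> lang code.
    """
    kept = {}
    for user, tweets in user_tweets.items():
        in_target = sum(1 for t in tweets if detect_lang(t) == target)
        if tweets and in_target / len(tweets) >= min_share:
            kept[user] = tweets
    return kept
```

Filtering at the user level rather than the tweet level keeps whole accounts coherent, which matters when the corpus is later used for per-user analyses.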

2013

Identifying false friends between closely related languages
Nikola Ljubešić | Darja Fišer
Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing

Cross-lingual WSD for Translation Extraction from Comparable Corpora
Marianna Apidianaki | Nikola Ljubešić | Darja Fišer
Proceedings of the Sixth Workshop on Building and Using Comparable Corpora

2012

Were the clocks striking or surprising? Using WSD to improve MT performance
Špela Vintar | Darja Fišer | Aljoša Vrščaj
Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra)

Lexicon Construction and Corpus Annotation of Historical Language with the CoBaLT Editor
Tom Kenter | Tomaž Erjavec | Maja Žorga Dulmin | Darja Fišer
Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

Addressing polysemy in bilingual lexicon extraction from comparable corpora
Darja Fišer | Nikola Ljubešić | Ozren Kubelka
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper presents an approach to extract translation equivalents from comparable corpora for polysemous nouns. As opposed to the standard approaches that build a single context vector for all occurrences of a given headword, we first disambiguate the headword with third-party sense taggers and then build a separate context vector for each sense of the headword. Since state-of-the-art word sense disambiguation tools are still far from perfect, we also tried to improve the results by combining the sense assignments provided by two different sense taggers. Evaluation of the results shows that we outperform the baseline (0.473) in all the settings we experimented with, even when using only one sense tagger, and that the best-performing results are indeed obtained by taking into account the intersection of both sense taggers (0.720).

Cleaning noisy wordnets
Benoît Sagot | Darja Fišer
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Automatic approaches to creating and extending wordnets, which have become very popular in the past decade, inadvertently result in noisy synsets. This is why we propose an approach to detect synset outliers in order to eliminate the noise and improve the accuracy of the developed wordnets, so that they become more useful lexico-semantic resources for natural language applications. The approach compares the words that appear in the synset and its surroundings with the corpus contexts in which the literals in question are used, based on large monolingual corpora. By fine-tuning the outlier threshold we can influence how many outlier candidates will be eliminated. Although the proposed approach is language-independent, we test it on Slovene and French wordnets that were created automatically from bilingual resources and contain plenty of disambiguation errors. Manual evaluation of the results shows that by applying a threshold similar to the estimated error rate in the respective wordnets, 67% of the proposed outlier candidates are indeed incorrect for French and 64% for Slovene. This is a big improvement compared to the estimated overall error rates in the resources, which are 12% for French and 15% for Slovene.
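The outlier-detection idea, comparing a literal's corpus context with the combined context of the rest of its synset, can be sketched with simple count vectors. The similarity threshold here is an illustrative stand-in for the tuned value the paper describes, and plain cosine over raw counts simplifies whatever weighting the actual system uses.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two sparse count vectors (Counters)."""
    num = sum(a[k] * b[k] for k in a.keys() & b.keys())
    den = (math.sqrt(sum(v * v for v in a.values())) *
           math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def outlier_candidates(synset_contexts, threshold=0.1):
    """Flag literals whose corpus context vector is dissimilar to the
    combined context of the other synset members (the threshold is an
    illustrative assumption; the paper tunes it per wordnet)."""
    flagged = []
    for literal, ctx in synset_contexts.items():
        rest = Counter()
        for other, octx in synset_contexts.items():
            if other != literal:
                rest.update(octx)
        if cosine(ctx, rest) < threshold:
            flagged.append(literal)
    return flagged
```

A literal that genuinely belongs to the synset shares contexts with its co-members, so a low similarity is evidence of a disambiguation error.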

2011

Bilingual lexicon extraction from comparable corpora for closely related languages
Darja Fišer | Nikola Ljubešić
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

Building and Using Comparable Corpora for Domain-Specific Bilingual Lexicon Extraction
Darja Fišer | Nikola Ljubešić | Špela Vintar | Senja Pollak
Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web

2010

The JOS Linguistically Tagged Corpus of Slovene
Tomaž Erjavec | Darja Fišer | Simon Krek | Nina Ledinek
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The JOS language resources are meant to facilitate the development of HLT and corpus linguistics for the Slovene language and consist of the morphosyntactic specifications, defining the Slovene morphosyntactic features and tagset; two annotated corpora (jos100k and jos1M); and two web services (a concordancer and a text annotation tool). The paper introduces these components, and concentrates on jos100k, a 100,000-word sampled balanced monolingual Slovene corpus, manually annotated on three levels of linguistic description. On the morphosyntactic level, each word is annotated with its morphosyntactic description and lemma; on the syntactic level the sentences are annotated with dependency links; on the semantic level, all the occurrences of the 100 top nouns in the corpus are annotated with their wordnet synset from the Slovene semantic lexicon sloWNet. The JOS corpora and specifications have a standardised encoding (Text Encoding Initiative Guidelines TEI P5) and are available for research from http://nl.ijs.si/jos/ under the Creative Commons licence.

Learning to Mine Definitions from Slovene Structured and Unstructured Knowledge-Rich Resources
Darja Fišer | Senja Pollak | Špela Vintar
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The paper presents an innovative approach to extracting Slovene definition candidates from domain-specific corpora using morphosyntactic patterns, automatic terminology recognition and semantic tagging with wordnet senses. First, a classification model was trained on examples from Slovene Wikipedia; it was then used to find well-formed definitions among the extracted candidates. The results of the experiment are encouraging, with accuracy ranging from 67% to 71%. The paper also addresses some drawbacks of the approach and suggests ways to overcome them in future work.

2008

Harvesting Multi-Word Expressions from Parallel Corpora
Špela Vintar | Darja Fišer
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The paper presents a set of approaches to extend the automatically created Slovene wordnet with nominal multi-word expressions. In the first approach multi-word expressions from Princeton WordNet are translated with a technique that is based on word-alignment and lexico-syntactic patterns. This is followed by extracting new terms from a monolingual corpus using keywordness ranking and contextual patterns. Finally, the multi-word expressions are assigned a hypernym and added to our wordnet. Manual evaluation and comparison of the results shows that the translation approach is the most straightforward and accurate. However, it is successfully complemented by the two monolingual approaches which are able to identify more term candidates in the corpus that would otherwise go unnoticed. Some weaknesses of the proposed wordnet extension techniques are also addressed.

Construction d’un wordnet libre du français à partir de ressources multilingues
Benoît Sagot | Darja Fišer
Actes de la 15ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This paper describes the construction of a free French wordnet (WOLF) from the Princeton WordNet and various multilingual resources. Polysemous lexemes were handled with an approach based on the word alignment of a parallel corpus in five languages. The extracted multilingual lexicon was semantically disambiguated with the help of the wordnets of the languages involved. In addition, a bilingual approach was sufficient to build new entries from monosemous lexemes; for this, we extracted bilingual lexicons from Wikipedia and from thesauri. The resulting wordnet was evaluated against the French wordnet produced by the EuroWordNet project. The results are encouraging, and applications are already envisaged.

2006

Building Slovene WordNet
Tomaž Erjavec | Darja Fišer
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

A WordNet is a lexical database in which nouns, verbs, adjectives and adverbs are organized in a conceptual hierarchy, linking semantically and lexically related concepts. Such semantic lexicons have become one of the most valuable resources for a wide range of NLP research and applications, such as semantic tagging, automatic word-sense disambiguation, information retrieval and document summarisation. Following the WordNet design for the English language developed at Princeton, WordNets for a number of other languages have been developed in the past decade, taking the idea into the domain of multilingual processing. This paper reports on the prototype Slovene WordNet, which currently contains about 5,000 top-level concepts. The resource has been automatically translated from the Serbian WordNet with the help of a bilingual dictionary, with synset literals ranked according to the frequency of corpus occurrence, and the results manually corrected. The paper presents the results obtained, discusses some problems encountered along the way and points out some possibilities of automated acquisition and refinement of synsets in the future.