2024
AustroTox: A Dataset for Target-Based Austrian German Offensive Language Detection
Pia Pachinger | Janis Goldzycher | Anna Planitzer | Wojciech Kusa | Allan Hanbury | Julia Neidhardt
Findings of the Association for Computational Linguistics: ACL 2024
Model interpretability in toxicity detection greatly profits from token-level annotations. However, currently, such annotations are only available in English. We introduce a dataset annotated for offensive language detection sourced from a news forum, notable for its incorporation of the Austrian German dialect, comprising 4,562 user comments. In addition to binary offensiveness classification, we identify spans within each comment constituting vulgar language or representing targets of offensive statements. We evaluate fine-tuned Transformer models as well as large language models in a zero- and few-shot fashion. The results indicate that while fine-tuned models excel in detecting linguistic peculiarities such as vulgar dialect, large language models demonstrate superior performance in detecting offensiveness in AustroTox.
PopAut: An Annotated Corpus for Populism Detection in Austrian News Comments
Ahmadou Wagne | Julia Neidhardt | Thomas Elmar Kolb
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Populism has been noticeably present in the political landscape of various countries over the past decades. While populism expressed by politicians has been thoroughly examined in the literature, populism expressed by citizens is still under-researched, especially when it comes to its automated detection in text. This work presents the PopAut corpus, which is the first annotated corpus of news comments for populism in the German language. It features 1,200 comments collected between 2019 and 2021 that are annotated for the populist motives of anti-elitism, people-centrism and people-sovereignty. Following the definition of Cas Mudde, populism is seen as a thin ideology. This work shows that annotators reach high agreement when labeling news comments for these motives. The data set is collected to serve as the basis for automated populism detection using machine-learning methods. By using transformer-based models, we can outperform existing dictionaries tailored for automated populism detection in German social media content. Therefore, our work provides a rich resource for future work on the classification of populist user comments in the German language.
2023
Toward Disambiguating the Definitions of Abusive, Offensive, Toxic, and Uncivil Comments
Pia Pachinger | Allan Hanbury | Julia Neidhardt | Anna Planitzer
Proceedings of the First Workshop on Cross-Cultural Considerations in NLP (C3NLP)
The definitions of abusive, offensive, toxic and uncivil comments used for annotating corpora for automated content moderation overlap considerably, and researchers call for their disambiguation. We summarize the definitions of these terms as they appear in 23 papers across different fields. We compare examples given for uncivil, offensive, and toxic comments, attempting to foster more unified scientific resources. Additionally, we stress that the term incivility, which frequently appears in social science literature, is hardly mentioned in the computational linguistics and natural language processing literature we analyzed.
2022
The ALPIN Sentiment Dictionary: Austrian Language Polarity in Newspapers
Thomas Kolb | Sekanina Katharina | Bettina Manuela Johanna Kern | Julia Neidhardt | Tanja Wissik | Andreas Baumann
Proceedings of the Thirteenth Language Resources and Evaluation Conference
This paper introduces the Austrian German sentiment dictionary ALPIN to account for the lack of resources for dictionary-based sentiment analysis in this specific variety of German, which is characterized by lexical idiosyncrasies that also affect word sentiment. The proposed language resource is based on Austrian news media in the field of politics, an austriacism list based on different resources and a posting data set based on a popular Austrian news outlet. Different resources are used to increase the diversity of the resulting language resource. Extensive crowd-sourcing is performed, followed by evaluation and automatic conversion into sentiment scores. We show that crowd-sourcing enables the creation of a sentiment dictionary for the Austrian German domain. Additionally, the different parts of the sentiment dictionary are evaluated to show their impact on the resulting resource. Furthermore, the proposed dictionary is used in a web application and is freely available for future research.
Visualizing Parliamentary Speeches as Networks: the DYLEN Tool
Seung-bin Yim | Katharina Wünsche | Asil Cetin | Julia Neidhardt | Andreas Baumann | Tanja Wissik
Proceedings of the Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference
In this paper, we present a web-based interactive visualization tool for lexical networks based on the utterances of Austrian Members of Parliament. The tool is designed to compare two networks in parallel and is composed of interconnected graph visualization, node-metrics comparison and time-series comparison components.
2020
Comparing Lexical Usage in Political Discourse across Diachronic Corpora
Klaus Hofmann | Anna Marakasova | Andreas Baumann | Julia Neidhardt | Tanja Wissik
Proceedings of the Second ParlaCLARIN Workshop
Most diachronic studies on both lexico-semantic change and political language usage are based on individual or comparable corpora. In this paper, we explore ways of studying the stability (and changeability) of lexical usage in political discourse across two corpora which are substantially different in structure and size. We present a case study focusing on lexical items associated with political parties in two diachronic corpora of Austrian German, namely a diachronic media corpus (AMC) and a corpus of parliamentary records (ParlAT), and measure the cross-temporal stability of lexical usage over a period of 20 years. We conduct three sets of comparative analyses investigating a) the stability of sets of lexical items associated with the three major political parties over time, b) lexical similarity between parties, and c) the similarity between the lexical choices in parliamentary speeches by members of the parties vis-à-vis the media’s reporting on the parties. We employ time series modeling using generalized additive models (GAMs) to compare the lexical similarities and differences between parties within and across corpora. The results show that changes observed in these measures can be meaningfully related to political events during that time.
Short-term Semantic Shifts and their Relation to Frequency Change
Anna Marakasova | Julia Neidhardt
Proceedings of the Probability and Meaning Conference (PaM 2020)
We present ongoing research on the relationship between short-term semantic shifts and frequency change patterns by examining the case of the refugee crisis in Austria from 2015 to 2016. Our experiments are carried out on a diachronic corpus of Austrian German, namely a corpus of newspaper articles. We trace the evolution of the usage of words that represent concepts in the context of the refugee crisis by analyzing cosine similarities of word vectors over time as well as similarities based on the words’ nearest-neighbour sets. To investigate how exactly the contextual meanings have changed, we measure cosine similarity between the following pairs of words: words describing the refugee crisis, on the one hand, and words indicating the process of mediatization and politicization of the refugee crisis in Austria, proposed by a domain expert, on the other hand. We evaluate our approach against the expert knowledge. The paper presents the current findings and outlines directions for future work.
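The two measures named in this abstract, cosine similarity of word vectors over time and similarity of nearest-neighbour sets, can be illustrated with a minimal sketch. This is not the authors' implementation. The toy two-dimensional vectors, the slice names `slice_2015` and `slice_2016`, the use of Jaccard overlap to compare neighbour sets, and the assumption that both time slices share one aligned vector space are all illustrative choices not stated in the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbours(word, vectors, k):
    """Set of the k words most similar to `word` within one time slice."""
    scores = {
        other: cosine(vectors[word], vec)
        for other, vec in vectors.items()
        if other != word
    }
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def jaccard(a, b):
    """Jaccard overlap of two sets (assumed here as the set similarity)."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


# Made-up embeddings for two time slices; assumed to live in one aligned space.
slice_2015 = {
    "refugee": [1.0, 0.0],
    "asylum": [0.9, 0.1],
    "border": [0.2, 0.8],
    "crisis": [0.1, 0.9],
}
slice_2016 = {
    "refugee": [0.0, 1.0],
    "asylum": [0.9, 0.1],
    "border": [0.2, 0.8],
    "crisis": [0.1, 0.9],
}

# Measure 1: similarity of a word to itself across time slices.
self_similarity = cosine(slice_2015["refugee"], slice_2016["refugee"])

# Measure 2: overlap of the word's nearest-neighbour sets across time slices.
nn_2015 = nearest_neighbours("refugee", slice_2015, k=2)
nn_2016 = nearest_neighbours("refugee", slice_2016, k=2)
neighbour_overlap = jaccard(nn_2015, nn_2016)

print(self_similarity, nn_2015, nn_2016, neighbour_overlap)
```

A low self-similarity together with a low neighbour overlap would suggest that the word's usage shifted between the two slices. The neighbour-set measure has the practical advantage of not requiring the two vector spaces to be aligned, since each set is computed within its own slice.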