Resources and Techniques for User and Author Profiling in Abusive Language (2022)


Proceedings of the Second International Workshop on Resources and Techniques for User Information in Abusive Language Analysis
Johanna Monti | Valerio Basile | Maria Pia Di Buono | Raffaele Manna | Antonio Pascucci | Sara Tonelli

A First Attempt at Unreliable News Detection in Swedish
Ricardo Muñoz Sánchez | Eric Johansson | Shakila Tayefeh | Shreyash Kad

Throughout the COVID-19 pandemic, a parallel infodemic has been unfolding, with information spreading faster than the virus itself. During this time, every individual needs access to accurate news in order to take appropriate protective measures, regardless of their country of origin or the language they speak, since misinformation can cause significant harm not only to individuals but also to society. In this paper we train several machine learning models (ranging from traditional machine learning to deep learning) to determine whether news articles come from a reliable or an unreliable source, using only the body of the article. We use a previously introduced corpus of Swedish news related to the COVID-19 pandemic for the classification task. Given that our dataset is both unbalanced and small, we use subsampling and easy data augmentation (EDA) to mitigate these issues. In the end, we find that, due to the small size of our dataset, traditional machine learning combined with data augmentation yields results that rival those of transformer models such as BERT.
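
As an illustration of the kind of pipeline described in this abstract (not the authors' actual code), the sketch below applies two of the easy data augmentation (EDA) operations, random swap and random deletion, to expand a small labelled set and then trains a simple TF-IDF + logistic regression baseline; the sample texts, labels, and parameter values are placeholders.

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def random_swap(tokens, n=1):
    """Swap two randomly chosen tokens n times (one of the EDA operations)."""
    tokens = tokens[:]
    for _ in range(n):
        if len(tokens) < 2:
            break
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, p=0.1):
    """Drop each token with probability p, keeping at least one token."""
    kept = [t for t in tokens if random.random() > p]
    return kept or [random.choice(tokens)]


def augment(text, n_aug=4):
    """Produce n_aug perturbed copies of a training example."""
    tokens = text.split()
    return [" ".join(random.choice([random_swap, random_deletion])(tokens))
            for _ in range(n_aug)]


# Placeholder training data: article bodies with 0 = reliable, 1 = unreliable.
train_texts = ["exempel på en artikel från en pålitlig källa",
               "exempel på en artikel från en opålitlig källa"]
train_labels = [0, 1]

aug_texts, aug_labels = [], []
for text, label in zip(train_texts, train_labels):
    for variant in [text] + augment(text):
        aug_texts.append(variant)
        aug_labels.append(label)

# Traditional machine-learning baseline trained on the augmented data.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(aug_texts, aug_labels)
```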

BanglaHateBERT: BERT for Abusive Language Detection in Bengali
Md Saroar Jahan | Mainul Haque | Nabil Arhab | Mourad Oussalah

This paper introduces BanglaHateBERT, a retrained BERT model for abusive language detection in Bengali. The model was trained on a large-scale Bengali offensive, abusive, and hateful corpus that we collected from different sources and made available to the public. Furthermore, we collected and manually annotated a balanced dataset of 15K Bengali hate speech posts and made it publicly available to the research community. We took the existing pre-trained BanglaBERT model and retrained it on 1.5 million offensive posts. We present a detailed comparison between the generic pre-trained language model and its retrained, abuse-inclined counterpart. On all datasets, BanglaHateBERT outperformed the corresponding available BERT model.
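
The "retraining" step described above amounts to continued masked-language-model pretraining of a Bengali BERT checkpoint on an offensive-language corpus. The sketch below shows one way this could be done with the Hugging Face transformers API; the checkpoint name, file path, and hyperparameters are assumptions, not taken from the paper.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Assumed base checkpoint; the paper's exact BanglaBERT variant may differ.
base_checkpoint = "sagorsarker/bangla-bert-base"
tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(base_checkpoint)

# Hypothetical corpus file: one offensive/abusive post per line.
raw = load_dataset("text", data_files={"train": "bengali_offensive_posts.txt"})


def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)


tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Standard 15% token masking for continued MLM pretraining.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="banglahatebert-mlm",
                           per_device_train_batch_size=32,
                           num_train_epochs=3),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()

# The retrained encoder can then be fine-tuned for hate speech classification.
model.save_pretrained("banglahatebert-mlm")
tokenizer.save_pretrained("banglahatebert-mlm")
```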

A Comparison of Machine Learning Techniques for Turkish Profanity Detection
Levent Soykan | Cihan Karsak | Ilknur Durgar Elkahlout | Burak Aytan

Profanity detection has become an important task with the increase in social media usage. Most users prefer a clean, profanity-free environment in which to communicate with others. In order to provide such an environment, service providers use various profanity detection tools. In this paper, we study Turkish profanity detection in our search engine. We collected a dataset of search engine queries and labeled each query with one of two classes: profane or not profane. We experimented with several classical machine learning and deep learning methods and compared them in terms of speed and accuracy. We achieved our best results with a transformer-based Electra model, reaching an F1 score of 0.93. We also compared our models with the state-of-the-art Turkish profanity detection tool and observed that we outperform it in all respects.
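
As a rough illustration of the transformer setup mentioned above (not the authors' pipeline), the sketch below fine-tunes a Turkish ELECTRA checkpoint for binary profane / not-profane query classification with Hugging Face transformers; the checkpoint name, example queries, and hyperparameters are assumptions.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Assumed Turkish ELECTRA checkpoint; the paper's exact model may differ.
checkpoint = "dbmdz/electra-base-turkish-cased-discriminator"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint,
                                                           num_labels=2)

# Placeholder labelled queries: 0 = not profane, 1 = profane.
data = Dataset.from_dict({"text": ["örnek temiz sorgu", "örnek küfürlü sorgu"],
                          "label": [0, 1]})


def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=64)


data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tr-profanity-electra",
                           per_device_train_batch_size=16,
                           num_train_epochs=3),
    train_dataset=data,
)
trainer.train()
```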

Features and Categories of Hyperbole in Cyberbullying Discourse on Social Media
Simona Ignat | Carl Vogel

Cyberbullying discourse is achieved through multiple linguistic means. We study the hyperboles witnessed in a corpus of cyberbullying utterances, analyzing linguistic features of hyperbole on the basis of traditional grammatical indications of exaggeration. The method relies on data selected from a larger corpus of Twitter utterances identified and labelled as “bullying”, collected from October 2020 to March 2022. One outcome is a lexicon of 250 entries. A small number of lexical-level features have been isolated, and chi-squared contingency tests were applied to evaluate their information value in identifying hyperbole. Words or affixes indicating superlatives or extremes of scales, together with positive but not negative valency items, interact with hyperbole classification in this data set. All extracted utterances were considered exaggerations, and the stylistic status of “hyperbole” is discussed within the frame of new meanings arising in the context of social media.
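
The chi-squared contingency tests mentioned above can be illustrated with a short sketch (the counts below are invented, not taken from the paper): it tests whether the presence of a superlative or extreme-of-scale marker is associated with an utterance being labelled as hyperbole.

```python
from scipy.stats import chi2_contingency

# 2x2 contingency table with invented counts.
# Rows: superlative/extreme-of-scale marker present vs. absent.
# Columns: utterance labelled hyperbole vs. not hyperbole.
table = [[120, 40],
         [80, 210]]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4g}")
```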