2022
Finnish Hate-Speech Detection on Social Media Using CNN and FinBERT
Md Saroar Jahan | Mourad Oussalah | Nabil Arhab
Proceedings of the Thirteenth Language Resources and Evaluation Conference
There has been a lot of research in identifying hate posts from social media because of their detrimental effects on both individuals and society. The majority of this research has concentrated on English, although multilingual detection tools such as multilingual-BERT (mBERT) are emerging. However, hate speech datasets for other languages remain scarce compared to English, and a multilingual pre-trained model often contains fewer tokens for those languages. This paper attempts to contribute to hate speech identification in Finnish by constructing a new hate speech dataset collected from a popular forum (Suomi24). Furthermore, we have compared the performance of the pre-trained FinBERT model on Finnish hate speech detection against state-of-the-art mBERT and other practices. In addition, we tested the performance of FinBERT compared to fastText as an embedding, each employed with a Convolutional Neural Network (CNN). Our results showed that FinBERT yields 91.7% accuracy and a 90.8% F1 score, outperforming all state-of-the-art models, including multilingual-BERT and CNN.
Data Expansion Using WordNet-based Semantic Expansion and Word Disambiguation for Cyberbullying Detection
Md Saroar Jahan | Djamila Romaissa Beddiar | Mourad Oussalah | Muhidin Mohamed
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Automatic identification of cyberbullying from textual content is known to be a challenging task. The challenges arise from the inherent structure of cyberbullying and the lack of a large-scale labeled corpus that would enable efficient machine-learning-based tools, including neural networks. This paper advocates a data augmentation-based approach that could enhance the automatic detection of cyberbullying in social media texts. We use both word sense disambiguation and the synonymy relation in the WordNet lexical database to generate coherent equivalent utterances of cyberbullying input data. The disambiguation and semantic expansion are intended to overcome the inherent limitations of social media posts, such as an abundance of unstructured constructs and limited semantic content. In addition, to test feasibility, a novel protocol was employed to collect cyberbullying trace data from the AskFm forum, yielding a manually labeled dataset of size about 10K. Next, cyberbullying identification is treated as a binary classification problem using an elaborate data augmentation strategy and an appropriate classifier. For the latter, a Convolutional Neural Network (CNN) architecture with FastText and BERT was put forward, and its results were compared against the commonly employed Naïve Bayes (NB) and Logistic Regression (LR) classifiers, with and without data augmentation. The research outcomes were promising, yielding a classifier accuracy of almost 98.4%, an improvement of more than 4% over baseline results.
BanglaHateBERT: BERT for Abusive Language Detection in Bengali
Md Saroar Jahan | Mainul Haque | Nabil Arhab | Mourad Oussalah
Proceedings of the Second International Workshop on Resources and Techniques for User Information in Abusive Language Analysis
This paper introduces BanglaHateBERT, a retrained BERT model for abusive language detection in Bengali. The model was trained on a large-scale corpus of offensive, abusive, and hateful Bengali text that we collected from different sources and made available to the public. Furthermore, we have collected and manually annotated a balanced 15K Bengali hate speech dataset and made it publicly available to the research community. We used the existing pre-trained BanglaBERT model and retrained it with 1.5 million offensive posts. We present the results of a detailed comparison between the generic pre-trained language model and its retrained, abuse-inclined version. On all datasets, BanglaHateBERT outperformed the corresponding available BERT model.
2014
A Comparative Study of Conversion Aided Methods for WordNet Sentence Textual Similarity
Muhidin Mohamed | Mourad Oussalah
Proceedings of the First AHA!-Workshop on Information Discovery in Text