2024
pdf
abs
HausaHate: An Expert Annotated Corpus for Hausa Hate Speech Detection
Francielle Vargas
|
Samuel Guimarães
|
Shamsuddeen Hassan Muhammad
|
Diego Alves
|
Ibrahim Said Ahmad
|
Idris Abdulmumin
|
Diallo Mohamed
|
Thiago Pardo
|
Fabrício Benevenuto
Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024)
We introduce the first expert annotated corpus of Facebook comments for Hausa hate speech detection. The corpus titled HausaHate comprises 2,000 comments extracted from Western African Facebook pages and manually annotated by three Hausa native speakers, who are also NLP experts. Our corpus was annotated using two different layers. We first labeled each comment according to a binary classification: offensive versus non-offensive. Then, offensive comments were also labeled according to hate speech targets: race, gender and none. Lastly, a baseline model using fine-tuned LLM for Hausa hate speech detection is presented, highlighting the challenges of hate speech detection tasks for indigenous languages in Africa, as well as future advances.
2023
pdf
abs
NoHateBrazil: A Brazilian Portuguese Text Offensiveness Analysis System
Francielle Vargas
|
Isabelle Carvalho
|
Wolfgang Schmeisser-Nieto
|
Fabrício Benevenuto
|
Thiago Pardo
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing
Hate speech is a surely relevant problem in Brazil. Nevertheless, its regulation is not effective due to the difficulty to identify, quantify and classify offensive comments. Here, we introduce a novel system for offensive comment analysis in Brazilian Portuguese. The system titled “NoHateBrazil” recognizes explicit and implicit offensiveness in context at a fine-grained level. Specifically, we propose a framework for data collection, human annotation and machine learning models that were used to build the system. In addition, we assess the potential of our system to reflect stereotypical beliefs against marginalized groups by contrasting them with counter-stereotypes. As a result, a friendly web application was implemented, which besides presenting relevant performance, showed promising results towards mitigation of the risk of reinforcing social stereotypes. Lastly, new measures were proposed to improve the explainability of offensiveness classification and reliability of the model’s predictions.
pdf
abs
Socially Responsible Hate Speech Detection: Can Classifiers Reflect Social Stereotypes?
Francielle Vargas
|
Isabelle Carvalho
|
Ali Hürriyetoğlu
|
Thiago Pardo
|
Fabrício Benevenuto
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing
Recent studies have shown that hate speech technologies may propagate social stereotypes against marginalized groups. Nevertheless, there has been a lack of realistic approaches to assess and mitigate biased technologies. In this paper, we introduce a new approach to analyze the potential of hate-speech classifiers to reflect social stereotypes through the investigation of stereotypical beliefs by contrasting them with counter-stereotypes. We empirically measure the distribution of stereotypical beliefs by analyzing the distinctive classification of tuples containing stereotypes versus counter-stereotypes in machine learning models and datasets. Experiment results show that hate speech classifiers attribute unreal or negligent offensiveness to social identity groups by reflecting and reinforcing stereotypical beliefs regarding minorities. Furthermore, we also found that models that embed expert and context information from offensiveness markers present promising results to mitigate social stereotype bias towards socially responsible hate speech detection.
pdf
abs
Predicting Sentence-Level Factuality of News and Bias of Media Outlets
Francielle Vargas
|
Kokil Jaidka
|
Thiago Pardo
|
Fabrício Benevenuto
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing
Automated news credibility and fact-checking at scale require accurate prediction of news factuality and media bias. This paper introduces a large sentence-level dataset, titled “FactNews”, composed of 6,191 sentences expertly annotated according to factuality and media bias definitions proposed by AllSides. We use FactNews to assess the overall reliability of news sources by formulating two text classification problems for predicting sentence-level factuality of news reporting and bias of media outlets. Our experiments demonstrate that biased sentences present a higher number of words compared to factual sentences, besides having a predominance of emotions. Hence, the fine-grained analysis of subjectivity and impartiality of news articles showed promising results for predicting the reliability of entire media outlets. Finally, due to the severity of fake news and political polarization in Brazil, and the lack of research for Portuguese, both dataset and baseline were proposed for Brazilian Portuguese.
2022
pdf
abs
Rhetorical Structure Approach for Online Deception Detection: A Survey
Francielle Vargas
|
Jonas D‘Alessandro
|
Zohar Rabinovich
|
Fabrício Benevenuto
|
Thiago Pardo
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Most information is passed on in the form of language. Therefore, research on how people use language to inform and misinform, and how this knowledge may be automatically extracted from large amounts of text is surely relevant. This survey provides first-hand experiences and a comprehensive review of rhetorical-level structure analysis for online deception detection. We systematically analyze how discourse structure, aligned or not with other approaches, is applied to automatic fake news and fake reviews detection on the web and social media. Moreover, we categorize discourse-tagged corpora along with results, hence offering a summary and accessible introductions to new researchers.
pdf
abs
HateBR: A Large Expert Annotated Corpus of Brazilian Instagram Comments for Offensive Language and Hate Speech Detection
Francielle Vargas
|
Isabelle Carvalho
|
Fabiana Rodrigues de Góes
|
Thiago Pardo
|
Fabrício Benevenuto
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Due to the severity of the social media offensive and hateful comments in Brazil, and the lack of research in Portuguese, this paper provides the first large-scale expert annotated corpus of Brazilian Instagram comments for hate speech and offensive language detection. The HateBR corpus was collected from the comment section of Brazilian politicians’ accounts on Instagram and manually annotated by specialists, reaching a high inter-annotator agreement. The corpus consists of 7,000 documents annotated according to three different layers: a binary classification (offensive versus non-offensive comments), offensiveness-level classification (highly, moderately, and slightly offensive), and nine hate speech groups (xenophobia, racism, homophobia, sexism, religious intolerance, partyism, apology for the dictatorship, antisemitism, and fatphobia). We also implemented baseline experiments for offensive language and hate speech detection and compared them with a literature baseline. Results show that the baseline experiments on our corpus outperform the current state-of-the-art for the Portuguese language.
2021
pdf
abs
Contextual-Lexicon Approach for Abusive Language Detection
Francielle Vargas
|
Fabiana Rodrigues de Góes
|
Isabelle Carvalho
|
Fabrício Benevenuto
|
Thiago Pardo
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Since a lexicon-based approach is more elegant scientifically, explaining the solution components and being easier to generalize to other applications, this paper provides a new approach for offensive language and hate speech detection on social media, which embodies a lexicon of implicit and explicit offensive and swearing expressions annotated with contextual information. Due to the severity of the social media abusive comments in Brazil, and the lack of research in Portuguese, Brazilian Portuguese is the language used to validate the models. Nevertheless, our method may be applied to any other language. The conducted experiments show the effectiveness of the proposed approach, outperforming the current baseline methods for the Portuguese language.
pdf
abs
Toward Discourse-Aware Models for Multilingual Fake News Detection
Francielle Vargas
|
Fabrício Benevenuto
|
Thiago Pardo
Proceedings of the Student Research Workshop Associated with RANLP 2021
Statements that are intentionally misstated (or manipulated) are of considerable interest to researchers, government, security, and financial systems. According to deception literature, there are reliable cues for detecting deception and the belief that liars give off cues that may indicate their deception is near-universal. Therefore, given that deceiving actions require advanced cognitive development that honesty simply does not require, as well as people’s cognitive mechanisms have promising guidance for deception detection, in this Ph.D. ongoing research, we propose to examine discourse structure patterns in multilingual deceptive news corpora using the Rhetorical Structure Theory framework. Considering that our work is the first to exploit multilingual discourse-aware strategies for fake news detection, the research community currently lacks multilingual deceptive annotated corpora. Accordingly, this paper describes the current progress in this thesis, including (i) the construction of the first multilingual deceptive corpus, which was annotated by specialists according to the Rhetorical Structure Theory framework, and (ii) the introduction of two new proposed rhetorical relations: INTERJECTION and IMPERATIVE, which we assume to be relevant for the fake news detection task.
2011
pdf
Analyzing the Dynamic Evolution of Hashtags on Twitter: a Language-Based Approach
Evandro Cunha
|
Gabriel Magno
|
Giovanni Comarela
|
Virgilio Almeida
|
Marcos André Gonçalves
|
Fabrício Benevenuto
Proceedings of the Workshop on Language in Social Media (LSM 2011)