Sérgio Nunes


2019

A Hierarchically-Labeled Portuguese Hate Speech Dataset
Paula Fortuna | João Rocha da Silva | Juan Soler-Company | Leo Wanner | Sérgio Nunes
Proceedings of the Third Workshop on Abusive Language Online

Over the past years, the amount of online offensive speech has been growing steadily. To successfully cope with it, machine learning techniques are applied. However, ML-based techniques require sufficiently large annotated datasets. In recent years, different datasets were published, mainly for English. In this paper, we present a new dataset for Portuguese, which has not been in focus so far. The dataset is composed of 5,668 tweets. For its annotation, we defined two different schemes used by annotators with different levels of expertise. First, non-experts annotated the tweets with binary labels (‘hate’ vs. ‘no-hate’). Second, expert annotators classified the tweets following a fine-grained hierarchical multi-label scheme with 81 hate speech categories in total. The inter-annotator agreement varied from category to category, which reflects the insight that some types of hate speech are more subtle than others and that their detection depends on personal perception. This hierarchical annotation scheme is the main contribution of the presented work, as it facilitates the identification of different types of hate speech and their intersections. To demonstrate the usefulness of our dataset, we carried out a baseline classification experiment with pre-trained word embeddings and LSTM on the binary classified data, with a state-of-the-art outcome.

Stop PropagHate at SemEval-2019 Tasks 5 and 6: Are abusive language classification results reproducible?
Paula Fortuna | Juan Soler-Company | Sérgio Nunes
Proceedings of the 13th International Workshop on Semantic Evaluation

This paper summarizes the participation of the Stop PropagHate team at SemEval 2019. Our approach is based on replicating one of the most relevant works in the literature, using word embeddings and LSTM. After circumventing some of the problems of the original code, we found poor results when applying it to the HatEval contest (F1=0.45). We think this is due mainly to inconsistencies in the data of this contest. Finally, for OffensEval the classifier performed well (F1=0.74), proving to perform better for offense detection than for hate speech detection.

2018

Merging Datasets for Aggressive Text Identification
Paula Fortuna | José Ferreira | Luiz Pires | Guilherme Routar | Sérgio Nunes
Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)

This paper presents the approach of the team “groutar” to the shared task on Aggression Identification, considering the test sets in English, both from Facebook and general Social Media. This experiment aims to test the effect of merging new datasets on the performance of classification models. We followed a standard machine learning approach with training, validation, and testing phases, and considered features such as part-of-speech, frequencies of insults, punctuation, sentiment, and capitalization. In terms of algorithms, we experimented with Boosted Logistic Regression, Multi-Layer Perceptron, Parallel Random Forest and eXtreme Gradient Boosting. One question that arose was how to merge datasets using different classification systems (e.g. aggression vs. toxicity). Another issue concerns the possibility of generalizing models and applying them to data from different social networks. Regarding these, we merged two datasets, and the results showed that training with similar data is an advantage in the classification of social network data. However, adding data from different platforms allowed slightly better results in both Facebook and Social Media, indicating that more generalized models can be an advantage.

2004

Evaluation of Different Similarity Measures for the Extraction of Multiword Units in a Reinforcement Learning Environment
Gaël Dias | Sérgio Nunes
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)