Offensive Language Detection Explained

Julian Risch, Robin Ruff, Ralf Krestel


Abstract
Many online discussion platforms use a content moderation process, where human moderators check user comments for offensive language and other rule violations. It is the moderator’s decision which comments to remove from the platform because of violations and which ones to keep. Research so far has focused on automating this decision process in the form of supervised machine learning for a classification task. However, even with machine-learned models achieving better classification accuracy than human experts, there is still a reason why human moderators are preferred. In contrast to black-box models, such as neural networks, humans can give explanations for their decision to remove a comment. For example, they can point out which phrase in the comment is offensive or what subtype of offensiveness applies. In this paper, we analyze and compare four explanation methods for different offensive language classifiers: an interpretable machine learning model (naive Bayes), a model-agnostic explanation method (LIME), a model-based explanation method (LRP), and a self-explanatory model (LSTM with an attention mechanism). We evaluate these approaches with regard to their explanatory power and their ability to point out which words are most relevant for a classifier’s decision. We find that the more complex models achieve better classification accuracy while also providing better explanations than the simpler models.
Anthology ID:
2020.trac-1.22
Volume:
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Ritesh Kumar, Atul Kr. Ojha, Bornini Lahiri, Marcos Zampieri, Shervin Malmasi, Vanessa Murdock, Daniel Kadar
Venue:
TRAC
Publisher:
European Language Resources Association (ELRA)
Pages:
137–143
Language:
English
URL:
https://aclanthology.org/2020.trac-1.22
Cite (ACL):
Julian Risch, Robin Ruff, and Ralf Krestel. 2020. Offensive Language Detection Explained. In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying, pages 137–143, Marseille, France. European Language Resources Association (ELRA).
Cite (Informal):
Offensive Language Detection Explained (Risch et al., TRAC 2020)
PDF:
https://preview.aclanthology.org/emnlp22-frontmatter/2020.trac-1.22.pdf
Code:
julian-risch/TRAC-LREC2020