Abstract
Language toxicity identification presents a gray area in the ethical debate surrounding freedom of speech and censorship. Today’s social media landscape is littered with unfiltered content that can be anywhere from slightly abusive to hate-inducing. In response, we focused on training a multi-label classifier to detect both the type and level of toxicity in online content. This content is typically colloquial and conversational in style. Its classification therefore requires huge amounts of annotated data due to its variability and inconsistency. We compare standard methods of text classification in this task. A conventional one-vs-rest SVM classifier with character- and word-level frequency-based representations of text reaches a ROC AUC score of 0.9763. We demonstrate that leveraging more advanced techniques such as word embeddings, recurrent neural networks, attention mechanisms, stacking of classifiers and semi-supervised training can improve the ROC AUC score to 0.9862. We suggest that choosing the right model requires weighing model accuracy against inference complexity for the target application.

- Anthology ID: W18-5103
- Volume: Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)
- Month: October
- Year: 2018
- Address: Brussels, Belgium
- Editors: Darja Fišer, Ruihong Huang, Vinodkumar Prabhakaran, Rob Voigt, Zeerak Waseem, Jacqueline Wernimont
- Venue: ALW
- Publisher: Association for Computational Linguistics
- Pages: 21–25
- URL: https://aclanthology.org/W18-5103
- DOI: 10.18653/v1/W18-5103
- Cite (ACL): Isuru Gunasekara and Isar Nejadgholi. 2018. A Review of Standard Text Classification Practices for Multi-label Toxicity Identification of Online Content. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 21–25, Brussels, Belgium. Association for Computational Linguistics.
- Cite (Informal): A Review of Standard Text Classification Practices for Multi-label Toxicity Identification of Online Content (Gunasekara & Nejadgholi, ALW 2018)
- PDF: https://preview.aclanthology.org/improve-issue-templates/W18-5103.pdf
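
The abstract reports its results as ROC AUC scores for a multi-label classifier. The following is a minimal sketch of how such a score could be computed, assuming the per-label ROC AUC values are averaged across labels; the abstract does not state the averaging scheme, so that choice is an assumption. The function names, the two labels, and the toy scores are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC for one binary label.

    Computed as the fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_columnwise_roc_auc(
    y_true: List[List[int]], y_score: List[List[float]]
) -> float:
    """Average the per-label ROC AUC over all label columns (assumed scheme)."""
    n_labels = len(y_true[0])
    per_label = [
        roc_auc([row[j] for row in y_true], [row[j] for row in y_score])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels


# Toy data: four comments, two hypothetical labels (e.g. "toxic", "insult").
labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
predicted = [[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.7, 0.1]]
score = mean_columnwise_roc_auc(labels, predicted)
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration; a rank-based computation would be the usual choice for a full dataset.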