Abstract
Deep neural networks have been applied to hate speech detection with apparent success, but they have limited practical applicability without transparency into the predictions they make. In this paper, we perform several experiments to visualize and understand a state-of-the-art neural network classifier for hate speech (Zhang et al., 2018). We adapt techniques from computer vision to visualize sensitive regions of the input stimuli and identify the features learned by individual neurons. We also introduce a method to discover the keywords that are most predictive of hate speech. Our analyses explain the aspects of neural networks that work well and point out areas for further improvement.

- Anthology ID:
- W18-5111
- Volume:
- Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)
- Month:
- October
- Year:
- 2018
- Address:
- Brussels, Belgium
- Venue:
- ALW
- Publisher:
- Association for Computational Linguistics
- Pages:
- 86–92
- URL:
- https://aclanthology.org/W18-5111
- DOI:
- 10.18653/v1/W18-5111
- Cite (ACL):
- Cindy Wang. 2018. Interpreting Neural Network Hate Speech Classifiers. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 86–92, Brussels, Belgium. Association for Computational Linguistics.
- Cite (Informal):
- Interpreting Neural Network Hate Speech Classifiers (Wang, ALW 2018)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/W18-5111.pdf