@inproceedings{wang-2018-interpreting,
    title = "Interpreting Neural Network Hate Speech Classifiers",
    author = "Wang, Cindy",
    editor = "Fi{\v{s}}er, Darja  and
      Huang, Ruihong  and
      Prabhakaran, Vinodkumar  and
      Voigt, Rob  and
      Waseem, Zeerak  and
      Wernimont, Jacqueline",
    booktitle = "Proceedings of the 2nd Workshop on Abusive Language Online ({ALW}2)",
    month = oct,
    year = "2018",
    address = "Brussels, Belgium",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/iwcs-25-ingestion/W18-5111/",
    doi = "10.18653/v1/W18-5111",
    pages = "86--92",
    abstract = "Deep neural networks have been applied to hate speech detection with apparent success, but they have limited practical applicability without transparency into the predictions they make. In this paper, we perform several experiments to visualize and understand a state-of-the-art neural network classifier for hate speech (Zhang et al., 2018). We adapt techniques from computer vision to visualize sensitive regions of the input stimuli and identify the features learned by individual neurons. We also introduce a method to discover the keywords that are most predictive of hate speech. Our analyses explain the aspects of neural networks that work well and point out areas for further improvement."
}