Abstract
Transformer-based Natural Language Processing models have become the standard for hate speech detection. However, the unconscious use of these techniques for such a critical task comes with negative consequences. Various works have demonstrated that hate speech classifiers are biased. These findings have prompted efforts to explain classifiers, mainly using attribution methods. In this paper, we provide the first benchmark study of interpretability approaches for hate speech detection. We cover four post-hoc token attribution approaches to explain the predictions of Transformer-based misogyny classifiers in English and Italian. Further, we compare generated attributions to attention analysis. We find that only two algorithms provide faithful explanations aligned with human expectations. Gradient-based methods and attention, however, show inconsistent outputs, making their value for explanations questionable for hate speech detection tasks.

- Anthology ID:
- 2022.nlppower-1.11
- Volume:
- Proceedings of NLP Power! The First Workshop on Efficient Benchmarking in NLP
- Month:
- May
- Year:
- 2022
- Address:
- Dublin, Ireland
- Editors:
- Tatiana Shavrina, Vladislav Mikhailov, Valentin Malykh, Ekaterina Artemova, Oleg Serikov, Vitaly Protasov
- Venue:
- nlppower
- Publisher:
- Association for Computational Linguistics
- Pages:
- 100–112
- URL:
- https://aclanthology.org/2022.nlppower-1.11
- DOI:
- 10.18653/v1/2022.nlppower-1.11
- Cite (ACL):
- Giuseppe Attanasio, Debora Nozza, Eliana Pastor, and Dirk Hovy. 2022. Benchmarking Post-Hoc Interpretability Approaches for Transformer-based Misogyny Detection. In Proceedings of NLP Power! The First Workshop on Efficient Benchmarking in NLP, pages 100–112, Dublin, Ireland. Association for Computational Linguistics.
- Cite (Informal):
- Benchmarking Post-Hoc Interpretability Approaches for Transformer-based Misogyny Detection (Attanasio et al., nlppower 2022)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-3/2022.nlppower-1.11.pdf
- Code:
- milanlproc/benchmarking-xai-misogyny