Improving Generalizability in Implicitly Abusive Language Detection with Concept Activation Vectors

Isar Nejadgholi, Kathleen Fraser, Svetlana Kiritchenko


Abstract
Robustness of machine learning models on ever-changing real-world data is critical, especially for applications affecting human well-being such as content moderation. New kinds of abusive language continually emerge in online discussions in response to current events (e.g., COVID-19), so deployed abuse detection systems should be updated regularly to remain accurate. In this paper, we show that general abusive language classifiers tend to be fairly reliable in detecting out-of-domain explicitly abusive utterances but fail to detect new types of more subtle, implicit abuse. Next, we propose an interpretability technique, based on the Testing Concept Activation Vector (TCAV) method from computer vision, to quantify the sensitivity of a trained model to the human-defined concepts of explicit and implicit abusive language, and we use it to explain the generalizability of the model on new data, in this case, COVID-related anti-Asian hate speech. Extending this technique, we introduce a novel metric, Degree of Explicitness, computed for a single instance, and show that it is beneficial in suggesting out-of-domain unlabeled examples to effectively enrich the training data with informative, implicitly abusive texts.
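The TCAV idea the abstract builds on can be sketched as follows. This is a minimal illustration of the generic TCAV recipe, not the authors' implementation (see the linked code repository for that): a linear probe separates hidden-layer activations of concept examples (e.g., explicitly abusive texts) from activations of random examples, the probe's unit normal is the concept activation vector (CAV), and the TCAV score is the fraction of inputs whose class-logit gradient has a positive projection onto the CAV. The function names `compute_cav` and `tcav_score` and all arrays are hypothetical stand-ins for a real classifier's activations and gradients. The full method also repeats the probe against several random sets and applies a significance test, and the Degree of Explicitness metric is not reproduced here because its definition is not given on this page.

```python
import numpy as np

def compute_cav(concept_acts, random_acts, lr=0.1, steps=500):
    """Fit a logistic-regression probe (plain gradient descent) that separates
    concept activations (label 1) from random activations (label 0), and return
    the unit normal of its decision boundary, pointing toward the concept."""
    X = np.vstack([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(concept)
        w -= lr * (X.T @ (p - y)) / len(y)       # log-loss gradient w.r.t. w
        b -= lr * np.mean(p - y)                 # log-loss gradient w.r.t. b
    return w / np.linalg.norm(w)

def tcav_score(class_grads, cav):
    """Fraction of examples whose class-logit gradient (w.r.t. the layer's
    activations) has a positive directional derivative along the CAV."""
    sensitivities = class_grads @ cav
    return float(np.mean(sensitivities > 0))

# Hypothetical toy data: concept activations are shifted along dimension 0,
# random activations are shifted the opposite way.
rng = np.random.default_rng(0)
dim = 8
concept = rng.normal(size=(50, dim)) + 2.0 * np.eye(dim)[0]
random_ = rng.normal(size=(50, dim)) - 2.0 * np.eye(dim)[0]
cav = compute_cav(concept, random_)

# Hypothetical per-example gradients of the "abusive" logit; two of the three
# point along the concept direction, one points against it.
grads = np.zeros((3, dim))
grads[:, 0] = [1.0, -1.0, 0.5]
print(tcav_score(grads, cav))
```

A score near 1 would suggest the class prediction is consistently pushed up by moving activations toward the concept; a score near 0.5 against random baselines would suggest no systematic sensitivity.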
Anthology ID: 2022.acl-long.378
Volume: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: May
Year: 2022
Address: Dublin, Ireland
Editors: Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 5517–5529
URL: https://aclanthology.org/2022.acl-long.378
DOI: 10.18653/v1/2022.acl-long.378
Cite (ACL): Isar Nejadgholi, Kathleen Fraser, and Svetlana Kiritchenko. 2022. Improving Generalizability in Implicitly Abusive Language Detection with Concept Activation Vectors. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5517–5529, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal): Improving Generalizability in Implicitly Abusive Language Detection with Concept Activation Vectors (Nejadgholi et al., ACL 2022)
PDF: https://preview.aclanthology.org/naacl-24-ws-corrections/2022.acl-long.378.pdf
Software: 2022.acl-long.378.software.zip
Video: https://preview.aclanthology.org/naacl-24-ws-corrections/2022.acl-long.378.mp4
Code: isarnejad/tcav-for-text-classifiers