Hierarchical Adversarial Correction to Mitigate Identity Term Bias in Toxicity Detection

Johannes Schäfer, Ulrich Heid, Roman Klinger


Abstract
Corpora that form the foundation for toxicity detection contain expressions that are typically directed against a target individual or group, e.g., people of a specific gender or ethnicity. Prior work has shown that the target identity mention can constitute a confounding variable. As an example, a model might learn that Christians are always mentioned in the context of hate speech. This misguided focus can lead to limited generalization to newly emerging targets that are not found in the training data. In this paper, we hypothesize and subsequently show that this issue can be mitigated by considering targets on different levels of specificity. We distinguish levels of (1) the existence of a target, (2) a class (e.g., that the target is a religious group), and (3) a specific target group (e.g., Christians or Muslims). We define a target label hierarchy based on these three levels and then exploit this hierarchy in an adversarial correction for the lowest level (i.e., level (3)) while maintaining some basic target features. This approach does not lower the toxicity detection performance but increases the generalization to targets that are not available at training time.
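The three-level target label hierarchy described in the abstract can be sketched as a simple mapping from a specific target group (level 3) up to its class (level 2) and to target existence (level 1). This is a minimal illustrative sketch; the group names, class names, and function are hypothetical and not taken from the paper.

```python
# Hypothetical mapping from specific target groups (level 3) to target
# classes (level 2); the entries below are illustrative examples only.
GROUP_TO_CLASS = {
    "christians": "religion",
    "muslims": "religion",
    "women": "gender",
    "men": "gender",
}

def hierarchy_labels(specific_group):
    """Derive the three-level target labels for one mention.

    Returns a tuple (level1, level2, level3):
      level1 -- 1 if any target exists, else 0
      level2 -- the target class, e.g. "religion"
      level3 -- the specific target group, e.g. "christians"
    """
    if specific_group is None:
        return (0, "none", "none")
    return (1, GROUP_TO_CLASS[specific_group], specific_group)
```

Under this scheme, the adversarial correction would penalize the model for predicting the level-3 label from its internal representation, while the coarser levels (1) and (2) remain available as basic target features.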
Anthology ID:
2024.wassa-1.4
Volume:
Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Orphée De Clercq, Valentin Barriere, Jeremy Barnes, Roman Klinger, João Sedoc, Shabnam Tafreshi
Venues:
WASSA | WS
Publisher:
Association for Computational Linguistics
Pages:
35–51
URL:
https://aclanthology.org/2024.wassa-1.4
DOI:
10.18653/v1/2024.wassa-1.4
Bibkey:
Cite (ACL):
Johannes Schäfer, Ulrich Heid, and Roman Klinger. 2024. Hierarchical Adversarial Correction to Mitigate Identity Term Bias in Toxicity Detection. In Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis, pages 35–51, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Hierarchical Adversarial Correction to Mitigate Identity Term Bias in Toxicity Detection (Schäfer et al., WASSA-WS 2024)
PDF:
https://preview.aclanthology.org/dois-2013-emnlp/2024.wassa-1.4.pdf