Improving Counterfactual Generation for Fair Hate Speech Detection

Aida Mostafazadeh Davani, Ali Omrani, Brendan Kennedy, Mohammad Atari, Xiang Ren, Morteza Dehghani


Abstract
Bias mitigation approaches reduce models’ dependence on sensitive features of data, such as social group tokens (SGTs), resulting in equal predictions across the sensitive features. In hate speech detection, however, equalizing model predictions may ignore important differences among targeted social groups, as hate speech can contain stereotypical language specific to each SGT. Here, to take the specific language about each SGT into account, we rely on counterfactual fairness and equalize predictions among counterfactuals, generated by changing the SGTs. Our method evaluates the similarity in sentence likelihoods (via pre-trained language models) among counterfactuals, to treat SGTs equally only within interchangeable contexts. By applying logit pairing to equalize outcomes on the restricted set of counterfactuals for each instance, we improve fairness metrics while preserving model performance on hate speech detection.
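As a rough illustration of the pipeline the abstract describes, the following minimal sketch generates counterfactuals by swapping SGTs, keeps only those whose sentence log-likelihood is close to the original's, and computes a logit pairing penalty over the kept set. It is not the authors' implementation: the SGT list, the function names, the absolute-gap threshold `max_gap`, and the absolute-difference form of the penalty are all assumptions made for illustration, and the `log_likelihood` argument stands in for a pre-trained language model scorer.

```python
from typing import Callable, List, Sequence

# Hypothetical, tiny SGT list used only for illustration.
SGTS = ["muslim", "jewish", "christian"]


def generate_counterfactuals(sentence: str, sgts: Sequence[str] = SGTS) -> List[str]:
    """Replace the first SGT found in the sentence with each other SGT."""
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        if tok.lower() in sgts:
            return [
                " ".join(tokens[:i] + [other] + tokens[i + 1:])
                for other in sgts
                if other != tok.lower()
            ]
    return []  # no SGT present, so there is nothing to swap


def filter_by_likelihood(
    sentence: str,
    counterfactuals: Sequence[str],
    log_likelihood: Callable[[str], float],
    max_gap: float,
) -> List[str]:
    """Keep counterfactuals whose log-likelihood is within max_gap of the original.

    The fixed threshold is an assumption; it approximates the idea of treating
    SGTs equally only in contexts where they are interchangeable.
    """
    base = log_likelihood(sentence)
    return [cf for cf in counterfactuals if abs(log_likelihood(cf) - base) <= max_gap]


def logit_pairing_loss(original_logit: float, counterfactual_logits: Sequence[float]) -> float:
    """Mean absolute gap between the original logit and each counterfactual logit."""
    if not counterfactual_logits:
        return 0.0
    gaps = [abs(original_logit - z) for z in counterfactual_logits]
    return sum(gaps) / len(gaps)
```

In training, a penalty of this kind would typically be added to the classification loss with a weighting coefficient, so that the classifier is pushed toward equal outputs only on the restricted set of counterfactuals.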
Anthology ID:
2021.woah-1.10
Volume:
Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)
Month:
August
Year:
2021
Address:
Online
Venue:
WOAH
Publisher:
Association for Computational Linguistics
Pages:
92–101
URL:
https://aclanthology.org/2021.woah-1.10
DOI:
10.18653/v1/2021.woah-1.10
Cite (ACL):
Aida Mostafazadeh Davani, Ali Omrani, Brendan Kennedy, Mohammad Atari, Xiang Ren, and Morteza Dehghani. 2021. Improving Counterfactual Generation for Fair Hate Speech Detection. In Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), pages 92–101, Online. Association for Computational Linguistics.
Cite (Informal):
Improving Counterfactual Generation for Fair Hate Speech Detection (Mostafazadeh Davani et al., WOAH 2021)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2021.woah-1.10.pdf
Video:
https://preview.aclanthology.org/ingestion-script-update/2021.woah-1.10.mp4
Data
Hate Speech