Enhancing Hate Speech Classifiers through a Gradient-assisted Counterfactual Text Generation Strategy

Michael Van Supranes; Shaowen Peng; Shoko Wakamiya; Eiji Aramaki

doi:10.18653/v1/2025.findings-emnlp.189

Abstract

Counterfactual data augmentation (CDA) is a promising strategy for improving hate speech classification, but automating counterfactual text generation remains a challenge. Strong attribute control can distort meaning, while prioritizing semantic preservation may weaken attribute alignment. We propose **Gradient-assisted Energy-based Sampling (GENES)** for counterfactual text generation, which restricts accepted samples to text meeting a minimum BERTScore threshold and applies gradient-assisted proposal generation to improve attribute alignment. Compared with methods that rely solely on prompting, gradient-based steering, or energy-based sampling, GENES is more likely to jointly satisfy attribute alignment and semantic preservation under the same base model. When applied to data augmentation, GENES achieved the best macro F1-score on two of three test sets and improved robustness in detecting targeted abusive language. In some cases, GENES exceeded the performance of prompt-based methods using GPT-4o-mini, despite relying on a smaller model (Flan-T5-Large). In our cross-dataset evaluation, models aided by GENES achieved the best average performance among methods that rely on the smaller model (Flan-T5-Large). These results position GENES as a possible lightweight and open-source alternative.

Anthology ID: 2025.findings-emnlp.189
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 3529–3544
URL: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.189/
DOI: 10.18653/v1/2025.findings-emnlp.189
Cite (ACL): Michael Van Supranes, Shaowen Peng, Shoko Wakamiya, and Eiji Aramaki. 2025. Enhancing Hate Speech Classifiers through a Gradient-assisted Counterfactual Text Generation Strategy. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 3529–3544, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Enhancing Hate Speech Classifiers through a Gradient-assisted Counterfactual Text Generation Strategy (Van Supranes et al., Findings 2025)
PDF: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.189.pdf
Checklist: 2025.findings-emnlp.189.checklist.pdf
