TaeBench: Improving Quality of Toxic Adversarial Examples

Jennifer Zhu, Dmitriy Bespalov, Liwen You, Ninad Kulkarni, Yanjun Qi


Abstract
Toxicity text detectors can be vulnerable to adversarial examples: small perturbations to input text that fool the systems into making wrong detections. Existing attack algorithms are time-consuming and often produce invalid or ambiguous adversarial examples, making them less useful for evaluating or improving real-world toxicity content moderators. This paper proposes an annotation pipeline for quality control of generated toxic adversarial examples (TAE). We design model-based automated annotation and human-based quality verification to assess the quality requirements of TAE. A successful TAE should fool a target toxicity model into making benign predictions, be grammatically reasonable, appear natural like human-generated text, and exhibit semantic toxicity. When applying these requirements to more than 20 state-of-the-art (SOTA) TAE attack recipes, we find many invalid samples from a total of 940k raw TAE attack generations. We then utilize the proposed pipeline to filter and curate a high-quality TAE dataset we call TaeBench (of size 264k). Empirically, we demonstrate that TaeBench can effectively transfer-attack SOTA toxicity content moderation models and services. Our experiments also show that TaeBench with adversarial training achieves significant improvements in the robustness of two toxicity detectors.
Anthology ID:
2025.naacl-industry.21
Volume:
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Weizhu Chen, Yi Yang, Mohammad Kachuee, Xue-Yong Fu
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
251–265
URL:
https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.naacl-industry.21/
Cite (ACL):
Jennifer Zhu, Dmitriy Bespalov, Liwen You, Ninad Kulkarni, and Yanjun Qi. 2025. TaeBench: Improving Quality of Toxic Adversarial Examples. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track), pages 251–265, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
TaeBench: Improving Quality of Toxic Adversarial Examples (Zhu et al., NAACL 2025)
PDF:
https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.naacl-industry.21.pdf