TaeBench: Improving Quality of Toxic Adversarial Examples
Jennifer Zhu, Dmitriy Bespalov, Liwen You, Ninad Kulkarni, Yanjun Qi
Abstract
Toxicity text detectors can be vulnerable to adversarial examples: small perturbations to input text that fool the systems into wrong detection. Existing attack algorithms are time-consuming and often produce invalid or ambiguous adversarial examples, making them less useful for evaluating or improving real-world toxicity content moderators. This paper proposes an annotation pipeline for quality control of generated toxic adversarial examples (TAE). We design model-based automated annotation and human-based quality verification to assess the quality requirements of TAE. Successful TAE should fool a target toxicity model into making benign predictions, be grammatically reasonable, appear natural like human-generated text, and exhibit semantic toxicity. When applying these requirements to more than 20 state-of-the-art (SOTA) TAE attack recipes, we find many invalid samples from a total of 940k raw TAE attack generations. We then utilize the proposed pipeline to filter and curate a high-quality TAE dataset we call TaeBench (of size 264k). Empirically, we demonstrate that TaeBench can effectively transfer-attack SOTA toxicity content moderation models and services. Our experiments also show that adversarial training with TaeBench achieves significant improvements in the robustness of two toxicity detectors.
- Anthology ID: 2025.naacl-industry.21
- Volume: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)
- Month: April
- Year: 2025
- Address: Albuquerque, New Mexico
- Editors: Weizhu Chen, Yi Yang, Mohammad Kachuee, Xue-Yong Fu
- Venue: NAACL
- Publisher: Association for Computational Linguistics
- Pages: 251–265
- URL: https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.naacl-industry.21/
- Cite (ACL): Jennifer Zhu, Dmitriy Bespalov, Liwen You, Ninad Kulkarni, and Yanjun Qi. 2025. TaeBench: Improving Quality of Toxic Adversarial Examples. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track), pages 251–265, Albuquerque, New Mexico. Association for Computational Linguistics.
- Cite (Informal): TaeBench: Improving Quality of Toxic Adversarial Examples (Zhu et al., NAACL 2025)
- PDF: https://preview.aclanthology.org/Ingest-2025-COMPUTEL/2025.naacl-industry.21.pdf
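
The abstract lists four requirements a successful TAE should meet: it fools the target toxicity model, is grammatically reasonable, appears natural, and remains semantically toxic. A minimal sketch of that kind of filter is shown here. It is not the authors' pipeline: the paper uses model-based annotation and human verification, whereas every check below (`fools_target`, `is_grammatical`, `is_natural`, `is_toxic`) is a hypothetical toy rule chosen only so the example runs end to end.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Candidate:
    original: str   # toxic seed text
    perturbed: str  # output of an attack recipe


def _toy_target_flags_toxic(text: str) -> bool:
    # Placeholder for the target toxicity detector (assumption: keyword match).
    return "idiot" in text.lower()


def fools_target(c: Candidate) -> bool:
    # Requirement 1: the target model predicts "benign" on the perturbed text.
    return not _toy_target_flags_toxic(c.perturbed)


def is_grammatical(c: Candidate) -> bool:
    # Requirement 2 (toy rule): every token contains at least one letter.
    return all(any(ch.isalpha() for ch in tok) for tok in c.perturbed.split())


def is_natural(c: Candidate) -> bool:
    # Requirement 3 (toy rule): at least 80% of characters are alphanumeric or spaces.
    text = c.perturbed
    ok = sum(ch.isalnum() or ch.isspace() for ch in text)
    return bool(text) and ok / len(text) >= 0.8


def is_toxic(c: Candidate) -> bool:
    # Requirement 4 (toy rule): still toxic after undoing simple digit swaps.
    normalized = c.perturbed.lower().translate(str.maketrans("01", "oi"))
    return "idiot" in normalized


def filter_tae(candidates: Iterable[Candidate],
               checks: Dict[str, Callable[[Candidate], bool]]) -> List[Candidate]:
    # Keep only candidates that pass every quality check.
    return [c for c in candidates if all(check(c) for check in checks.values())]


CHECKS = {
    "fools_target": fools_target,
    "grammatical": is_grammatical,
    "natural": is_natural,
    "toxic": is_toxic,
}

raw = [
    Candidate("you are an idiot", "you are an idi0t"),       # passes all four
    Candidate("you are an idiot", "you are an id##!!ot@@"),  # unnatural
    Candidate("you are an idiot", "you are a friend"),       # toxicity lost
    Candidate("you are an idiot", "you are an idiot !"),     # target not fooled
]

kept = filter_tae(raw, CHECKS)
print([c.perturbed for c in kept])
```

In a real setting each toy rule would be replaced by an annotation model or a human judgment, and the failing check would likely be recorded per sample rather than discarded silently.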