ToxiGAN: Toxic Data Augmentation via LLM-Guided Directional Adversarial Generation

Peiran Li, Jan Fillies, Adrian Paschke


Abstract
Augmenting toxic language data in a controllable and class-specific manner is crucial for improving robustness in toxicity classification, yet remains challenging due to limited supervision and distributional skew. We propose ToxiGAN, a class-aware text augmentation framework that combines adversarial generation with semantic guidance from large language models (LLMs). To address common issues in GAN-based augmentation such as mode collapse and semantic drift, ToxiGAN introduces a two-step directional training strategy and leverages LLM-generated neutral texts as semantic ballast. Unlike prior work that treats LLMs as static generators, our approach dynamically selects neutral exemplars to provide balanced guidance. Toxic samples are explicitly optimized to diverge from these exemplars, reinforcing class-specific contrastive signals. Experiments on four hate speech benchmarks show that ToxiGAN achieves the strongest average performance in both macro-F1 and hate-F1, consistently outperforming traditional and LLM-based augmentation methods. Ablation and sensitivity analyses further confirm the benefits of semantic ballast and directional training in enhancing classifier robustness.
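The abstract describes toxic samples being explicitly optimized to diverge from LLM-selected neutral exemplars ("semantic ballast"). The paper's actual loss is not given on this page; as a purely illustrative sketch, one hedged way to express such a directional contrastive signal is a margin penalty on embedding similarity between a generated toxic sample and the neutral exemplars (all function names, the margin value, and the plain cosine formulation here are assumptions, not the authors' method):

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def directional_loss(toxic_emb, neutral_exemplars, margin=0.2):
    """Hypothetical 'semantic ballast' penalty: push a generated toxic
    embedding away from neutral exemplar embeddings.

    Each exemplar contributes max(0, cos(toxic, neutral) - margin), so the
    generator is penalized only while it remains more similar to the neutral
    ballast than the margin allows.
    """
    penalties = [max(0.0, cosine(toxic_emb, n) - margin)
                 for n in neutral_exemplars]
    return sum(penalties) / len(penalties)
```

For example, an embedding identical to a neutral exemplar incurs a large penalty, while one orthogonal to all exemplars incurs none, giving the class-separating pressure the abstract attributes to directional training.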
Anthology ID: 2026.eacl-long.188
Volume: Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Vera Demberg, Kentaro Inui, Lluís Màrquez
Venue: EACL
Publisher: Association for Computational Linguistics
Pages: 4029–4044
URL: https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.188/
Cite (ACL):
Peiran Li, Jan Fillies, and Adrian Paschke. 2026. ToxiGAN: Toxic Data Augmentation via LLM-Guided Directional Adversarial Generation. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4029–4044, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
ToxiGAN: Toxic Data Augmentation via LLM-Guided Directional Adversarial Generation (Li et al., EACL 2026)
PDF: https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.188.pdf