Evading Toxicity Detection with ASCII-art: A Benchmark of Spatial Attacks on Moderation Systems

Sergey Berezin, Reza Farahbakhsh, Noel Crespi


Abstract
We introduce a novel class of adversarial attacks on toxicity detection models that exploit language models’ failure to interpret spatially structured text in the form of ASCII art. To evaluate the effectiveness of these attacks, we propose ToxASCII, a benchmark designed to assess the robustness of toxicity detection systems against visually obfuscated inputs. Our attacks achieve a perfect Attack Success Rate (ASR) across a diverse set of state-of-the-art large language models and dedicated moderation tools, revealing a significant vulnerability in current text-only moderation systems.
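To illustrate the general idea of a spatial attack (this sketch is not the paper's actual method; the tiny block-letter font and helper function below are hypothetical, defined only for illustration), a word can be rendered as ASCII art so that its characters are spread across multiple lines and no longer appear as a contiguous token to a text-only classifier:

```python
# Minimal sketch of spatial text obfuscation via ASCII art.
# The 3x5 glyphs below are hypothetical, defined just for this example;
# real attacks could use any figlet-style font.
FONT = {
    "H": ["#.#", "#.#", "###", "#.#", "#.#"],
    "I": ["###", ".#.", ".#.", ".#.", "###"],
}

def to_ascii_art(word: str) -> str:
    """Render `word` as block letters, one glyph row per output line."""
    rows = []
    for r in range(5):
        # Join the r-th row of each letter's glyph, separated by spaces.
        rows.append("  ".join(FONT[ch][r] for ch in word))
    return "\n".join(rows)

art = to_ascii_art("HI")
print(art)
# The original string "HI" never occurs as a substring of the output,
# so a moderation model matching on surface tokens will not see it.
```

A reader (or a vision model) can still recover the word from the spatial arrangement, which is exactly the asymmetry the attack exploits.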
Anthology ID:
2025.woah-1.13
Volume:
Proceedings of the 9th Workshop on Online Abuse and Harms (WOAH)
Month:
August
Year:
2025
Address:
Vienna, Austria
Editors:
Agostina Calabrese, Christine de Kock, Debora Nozza, Flor Miriam Plaza-del-Arco, Zeerak Talat, Francielle Vargas
Venues:
WOAH | WS
Publisher:
Association for Computational Linguistics
Pages:
153–162
URL:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.woah-1.13/
Cite (ACL):
Sergey Berezin, Reza Farahbakhsh, and Noel Crespi. 2025. Evading Toxicity Detection with ASCII-art: A Benchmark of Spatial Attacks on Moderation Systems. In Proceedings of the 9th Workshop on Online Abuse and Harms (WOAH), pages 153–162, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Evading Toxicity Detection with ASCII-art: A Benchmark of Spatial Attacks on Moderation Systems (Berezin et al., WOAH 2025)
PDF:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.woah-1.13.pdf