Sergey Berezin
2025
Evading Toxicity Detection with ASCII-art: A Benchmark of Spatial Attacks on Moderation Systems
Sergey Berezin
|
Reza Farahbakhsh
|
Noel Crespi
Proceedings of the 9th Workshop on Online Abuse and Harms (WOAH)
We introduce a novel class of adversarial attacks on toxicity detection models that exploit language models’ failure to interpret spatially structured text in the form of ASCII art. To evaluate the effectiveness of these attacks, we propose ToxASCII, a benchmark designed to assess the robustness of toxicity detection systems against visually obfuscated inputs. Our attacks achieve a perfect Attack Success Rate (ASR) across a diverse set of state-of-the-art large language models and dedicated moderation tools, revealing a significant vulnerability in current text-only moderation systems.
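The kind of spatial obfuscation the attack relies on can be illustrated with a minimal sketch. The three-row toy font below is a hypothetical stand-in, not one of the fonts evaluated in ToxASCII; the point is only that the word stays legible to a human while the token sequence no longer contains it:

```python
# Minimal sketch of ASCII-art obfuscation (hypothetical 3-row toy font).
FONT = {
    "H": ["# #", "###", "# #"],
    "I": ["###", " # ", "###"],
}

def to_ascii_art(word: str) -> str:
    """Render a word as a 3-row ASCII-art banner.

    Characters missing from the toy font are rendered as blank space.
    """
    rows = ["", "", ""]
    for ch in word.upper():
        glyph = FONT.get(ch, ["   "] * 3)
        for i in range(3):
            rows[i] += glyph[i] + " "
    return "\n".join(rows)

print(to_ascii_art("HI"))
```

A text-only moderation model sees rows of `#` and spaces rather than the word itself, which is why token-level toxicity signals fail on such inputs.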
2023
No offence, Bert - I insult only humans! Multilingual sentence-level attack on toxicity detection networks
Sergey Berezin
|
Reza Farahbakhsh
|
Noel Crespi
Findings of the Association for Computational Linguistics: EMNLP 2023
We introduce a simple yet efficient sentence-level attack on black-box toxicity detector models. By appending several positive words or sentences to the end of a hateful message, we are able to change the prediction of a neural network and pass the toxicity detection check. This approach is shown to work on seven languages from three different language families. We also describe a defence mechanism against this attack and discuss its limitations.
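The suffix attack described above amounts to simple string concatenation; a minimal sketch follows. The particular suffix text is a hypothetical example, not one drawn from the paper:

```python
def positive_suffix_attack(message: str,
                           suffix: str = "Have a wonderful day! I love you all.") -> str:
    """Append benign, positive text to a message.

    Sequence-level toxicity classifiers often average evidence over the
    whole input, so padding with positive content can dilute the toxicity
    score below the decision threshold.
    """
    return message + " " + suffix
```

Because the attack only requires the classifier's final decision, it works in a black-box setting with no access to model weights or gradients.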
2022
Named Entity Inclusion in Abstractive Text Summarization
Sergey Berezin
|
Tatiana Batura
Proceedings of the Third Workshop on Scholarly Document Processing
We address named entity omission, a drawback of many current abstractive text summarizers. We suggest a custom pretraining objective to enhance the model's attention on the named entities in a text. First, a RoBERTa named entity recognition model is trained to identify named entities in the text. This model is then used to mask named entities in the text, and a BART model is trained to reconstruct them. Next, the BART model is fine-tuned on the summarization task. Our experiments show that this pretraining approach drastically improves named entity inclusion precision and recall.
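The entity-masking step of the pretraining pipeline can be sketched as follows. The function, token list, and span format here are illustrative assumptions; the paper's pipeline uses a trained RoBERTa NER model to produce the spans:

```python
def mask_entities(tokens: list[str],
                  entity_spans: list[tuple[int, int]],
                  mask_token: str = "<mask>") -> list[str]:
    """Replace tokens inside half-open (start, end) entity spans with a
    mask token, producing the corrupted input for the denoising objective:
    the model must reconstruct the masked named entities.
    """
    masked = list(tokens)
    for start, end in entity_spans:
        for i in range(start, end):
            masked[i] = mask_token
    return masked

# Spans as a NER model might output them (hypothetical example).
tokens = ["Obama", "visited", "Paris", "yesterday"]
spans = [(0, 1), (2, 3)]
print(mask_entities(tokens, spans))
```

Training the model to recover exactly the entity tokens, rather than random spans, is what biases its attention toward named entities before summarization fine-tuning.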