Sergey Berezin

2025

The TIP of the Iceberg: Revealing a Hidden Class of Task-in-Prompt Adversarial Attacks on LLMs
Sergey Berezin | Reza Farahbakhsh | Noel Crespi
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present a novel class of jailbreak adversarial attacks on LLMs, termed Task-in-Prompt (TIP) attacks. Our approach embeds sequence-to-sequence tasks (e.g., cipher decoding, riddles, code execution) into the model’s prompt to indirectly generate prohibited inputs. To systematically assess the effectiveness of these attacks, we introduce the PHRYGE benchmark. We demonstrate that our techniques successfully circumvent safeguards in six state-of-the-art language models, including GPT-4o and LLaMA 3.2. Our findings highlight critical weaknesses in current LLM safety alignment and underscore the urgent need for more sophisticated defence strategies.

Evading Toxicity Detection with ASCII-art: A Benchmark of Spatial Attacks on Moderation Systems
Sergey Berezin | Reza Farahbakhsh | Noel Crespi
Proceedings of the 9th Workshop on Online Abuse and Harms (WOAH)

We introduce a novel class of adversarial attacks on toxicity detection models that exploit language models’ failure to interpret spatially structured text in the form of ASCII art. To evaluate the effectiveness of these attacks, we propose ToxASCII, a benchmark designed to assess the robustness of toxicity detection systems against visually obfuscated inputs. Our attacks achieve a perfect Attack Success Rate (ASR) across a diverse set of state-of-the-art large language models and dedicated moderation tools, revealing a significant vulnerability in current text-only moderation systems.

2023

No offence, Bert - I insult only humans! Multilingual sentence-level attack on toxicity detection networks
Sergey Berezin | Reza Farahbakhsh | Noel Crespi
Findings of the Association for Computational Linguistics: EMNLP 2023

We introduce a simple yet efficient sentence-level attack on black-box toxicity detector models. By adding several positive words or sentences to the end of a hateful message, we are able to change the prediction of a neural network and pass the toxicity detection system check. This approach is shown to work on seven languages from three different language families. We also describe a defence mechanism against this attack and discuss its limitations.

2022

Named Entity Inclusion in Abstractive Text Summarization
Sergey Berezin | Tatiana Batura
Proceedings of the Third Workshop on Scholarly Document Processing

We address named entity omission, a drawback of many current abstractive text summarizers. We suggest a custom pretraining objective to enhance the model’s attention on the named entities in a text. First, the named entity recognition model RoBERTa is trained to identify named entities in the text. This model is then used to mask named entities in the text, and the BART model is trained to reconstruct them. Next, the BART model is fine-tuned on the summarization task. Our experiments showed that this pretraining approach drastically improves named entity inclusion precision and recall metrics.
