Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs

Bibek Upadhayay; Vahid Behzadan

Abstract

A significant challenge in the reliable deployment of Large Language Models (LLMs) is malicious manipulation via adversarial prompting techniques such as jailbreaks. Mechanisms such as safety training have proven useful in addressing this challenge. However, in multilingual LLMs, adversaries can exploit the imbalanced representation of low-resource languages in the datasets used for pretraining and safety training. In this paper, we introduce a new black-box attack vector called the Sandwich Attack: a multi-language mixture attack that manipulates state-of-the-art LLMs into generating harmful and misaligned responses. Our experiments with six different models, namely Bard, Gemini Pro, LLaMA-2-70-B-Chat, GPT-3.5-Turbo, GPT-4, and Claude-3-OPUS, show that adversaries can use this attack vector to elicit harmful responses from these models. By detailing both the mechanism and the impact of the Sandwich Attack, this paper aims to guide future research and development towards more secure and resilient LLMs, ensuring they serve the public good while minimizing the potential for misuse. Content Warning: This paper contains examples of harmful language.

Anthology ID: 2024.trustnlp-1.18
Volume: Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024)
Month: June
Year: 2024
Address: Mexico City, Mexico
Editors: Kai-Wei Chang, Anaelia Ovalle, Jieyu Zhao, Yang Trista Cao, Ninareh Mehrabi, Aram Galstyan, Jwala Dhamala, Anoop Kumar, Rahul Gupta
Venues: TrustNLP | WS
Publisher: Association for Computational Linguistics
Pages: 208–226
URL: https://aclanthology.org/2024.trustnlp-1.18
Cite (ACL): Bibek Upadhayay and Vahid Behzadan. 2024. Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs. In Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024), pages 208–226, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal): Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs (Upadhayay & Behzadan, TrustNLP-WS 2024)
PDF: https://preview.aclanthology.org/jeptaln-2024-ingestion/2024.trustnlp-1.18.pdf
