Bibek Upadhayay


2025

Tongue-Tied: Breaking LLMs Safety Through New Language Learning
Bibek Upadhayay | Vahid Behzadan
Proceedings of the 7th Workshop on Computational Approaches to Linguistic Code-Switching

The safety mechanisms of large language models (LLMs) have been shown to be fragile, as attackers can craft prompts that elicit harmful responses. Low-cost jailbreak attacks, such as those using low-resource languages and code-switching, demonstrate that LLM safety mechanisms generalize poorly to low-resource languages, indicating that safety training is particularly ineffective in those languages. Furthermore, research has shown that fine-tuning LLMs on a small number of adversarial samples can compromise their safety training, implying that safety objectives can be overridden by subsequent fine-tuning objectives. Based on these observations, we hypothesize that the safety training of LLMs is language-dependent and that LLMs can be compromised by fine-tuning them on new languages, even when using only harmless data. In this work, we used the low-resource language Newari and created two fake languages to LoRA-fine-tune LLMs on non-harmful data. Our results show that simply fine-tuning LLMs on new languages, even without any harmful data, jailbreaks them. Furthermore, we demonstrate that as we introduce English-to-new-language and new-language-to-English translation pairs into the training dataset, the attack success rate increases and the harmful responses become more coherent. Additionally, we show the transferability of the attack by jailbreaking GPT-4 through fine-tuning on only 4,000 data points, and demonstrate that higher-capability models such as Claude-3.5-Sonnet can be induced to write in the new languages through few-shot in-context examples and can then be jailbroken without any fine-tuning. We further investigate the fine-tuned LLMs' latent representations with the logit lens and find that new-language fine-tuning weakens safety mechanisms by prioritizing new-language fidelity over alignment, enabling jailbreaks via late-layer pivots to new-language tokens that bypass English-centric safeguards. We have publicly released our trained model weights, dataset, and artifacts at this URL: https://github.com/UNHSAILLab/tongue-tied-breaking-llms-safety-through-new-language-learning
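The logit-lens analysis mentioned above can be illustrated with a short sketch. The snippet below is a minimal, hypothetical Python example (not the authors' released code) of projecting each layer's hidden state through the model's final norm and unembedding to see which token each layer favors; the model name and prompt are placeholders, and the model.model.norm / model.lm_head attribute names assume a LLaMA-style causal LM loaded with Hugging Face transformers.

    # Minimal logit-lens sketch (assumed setup, not the paper's released artifacts):
    # project each layer's residual stream through the final norm and LM head.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder; any LLaMA-style causal LM
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto"
    )
    model.eval()

    prompt = "Translate to English: <new-language text here>"  # placeholder prompt
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)

    # out.hidden_states is a tuple of (num_layers + 1) tensors, each [batch, seq, hidden]
    for layer_idx, h in enumerate(out.hidden_states):
        # Apply the final RMSNorm and unembedding to the last token's hidden state.
        logits = model.lm_head(model.model.norm(h[:, -1, :]))
        top_id = logits.argmax(dim=-1)
        print(f"layer {layer_idx:2d} -> top token: {tokenizer.decode(top_id)!r}")

Running this over both the base and fine-tuned checkpoints would allow a layer-by-layer comparison of whether the fine-tuned model pivots toward new-language tokens in its late layers, which is the kind of evidence the abstract describes.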

2024

Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs
Bibek Upadhayay | Vahid Behzadan
Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024)

A significant challenge in the reliable deployment of Large Language Models (LLMs) is malicious manipulation via adversarial prompting techniques such as jailbreaks. Employing mechanisms such as safety training has proven useful in addressing this challenge. However, in multilingual LLMs, adversaries can exploit the imbalanced representation of low-resource languages in the datasets used for pretraining and safety training. In this paper, we introduce a new black-box attack vector called the Sandwich Attack: a multi-language mixture attack that manipulates state-of-the-art LLMs into generating harmful and misaligned responses. Our experiments with six models, namely Bard, Gemini Pro, LLaMA-2-70B-Chat, GPT-3.5-Turbo, GPT-4, and Claude-3-Opus, show that this attack vector can be used by adversaries to elicit harmful responses from these models. By detailing both the mechanism and impact of the Sandwich Attack, this paper aims to guide future research and development towards more secure and resilient LLMs, ensuring they serve the public good while minimizing the potential for misuse. Content Warning: This paper contains examples of harmful language.