Bibek Upadhayay
2025
Tongue-Tied: Breaking LLMs Safety Through New Language Learning
Bibek Upadhayay
|
Vahid Behzadan
Proceedings of the 7th Workshop on Computational Approaches to Linguistic Code-Switching
The safety mechanisms of large language models (LLMs) have been shown to be fragile, as attackers can exploit prompts to generate harmful responses. Low-cost jailbreak attacks, such as those utilizing low-resource languages and code-switching, demonstrate that LLM safety mechanisms are particularly ineffective in low-resource languages. Furthermore, research has shown that fine-tuning LLMs with a small number of adversarial samples can compromise their safety training, implying that safety objectives can be overridden by subsequent fine-tuning objectives. Based on these observations, we hypothesize that the safety training of LLMs is language-dependent and that LLMs can potentially be compromised by fine-tuning them with new languages, even when using only harmless data. In this work, we used the low-resource language Newari and created two fake languages to LoRA-fine-tune LLMs with non-harmful data. Our results show that simply fine-tuning LLMs with new languages, even without the presence of harmful data, will jailbreak LLMs. Furthermore, we demonstrate that as we introduce English-to-and-from new language translation pairs into the training dataset, the attack success rate increases and the harmful responses become more coherent. Additionally, we show the transferability of the attack by jailbreaking GPT-4 through fine-tuning with only 4,000 data points, and demonstrate that higher-capability models such as Claude-3.5-Sonnet can be compelled to learn to write in new languages through few-shot in-context examples and can be jailbroken with new languages without fine-tuning. We furthermore investigate the fine-tuned LLMs' latents with the logit lens and find that new-language fine-tuning weakens safety mechanisms by prioritizing new-language fidelity over alignment, enabling jailbreaks via late-layer pivots to new-language tokens that bypass English-centric safeguards. We have publicly released our trained model weights, dataset, and artifacts at this URL: https://github.com/UNHSAILLab/tongue-tied-breaking-llms-safety-through-new-language-learning
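The logit-lens analysis mentioned in the abstract can be illustrated with a short sketch: project each layer's residual stream through the model's final norm and unembedding to see which token it "would predict" at that depth. The snippet below is a minimal illustration, not the released analysis code; the model path is a placeholder, and the final-norm attribute name assumes a Llama-style architecture.

```python
# Minimal logit-lens sketch (assumption: a LoRA-fine-tuned Llama-style model is
# available locally; "path/to/finetuned-model" is a placeholder, not the released weights).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/finetuned-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Translate to the new language: How do I bake bread?"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states: tuple of (num_layers + 1) tensors, each [batch, seq_len, hidden_dim]
hidden_states = outputs.hidden_states
final_norm = model.model.norm  # final RMSNorm (Llama-style attribute name)
lm_head = model.lm_head

# Project each layer's last-token hidden state through the unembedding to see
# which token the model would emit if decoding stopped at that layer.
for layer_idx, h in enumerate(hidden_states):
    logits = lm_head(final_norm(h[:, -1, :]))
    top_token = tokenizer.decode(logits.argmax(dim=-1))
    print(f"layer {layer_idx:2d}: {top_token!r}")
```

A late-layer switch from English refusal-style tokens to new-language tokens in such a trace is the kind of "late-layer pivot" the abstract refers to.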
X-Guard: Multilingual Guard Agent for Content Moderation
Bibek Upadhayay
|
Vahid Behzadan
Proceedings of the First Workshop on LLM Security (LLMSEC)
Large Language Models (LLMs) have rapidly become integral to numerous applications in critical domains where reliability is paramount. Despite significant advances in safety frameworks and guardrails, current protective measures exhibit crucial vulnerabilities, particularly in multilingual contexts. Existing safety systems remain susceptible to adversarial attacks in low-resource languages and through code-switching techniques, primarily due to their English-centric design. Furthermore, the development of effective multilingual guardrails is constrained by the scarcity of diverse cross-lingual training data. Even recent solutions like Llama Guard-3, while offering multilingual support, lack transparency in their decision-making processes. We address these challenges by introducing the X-Guard agent, a transparent multilingual safety agent designed to provide content moderation across diverse linguistic contexts. X-Guard effectively defends against both conventional low-resource language attacks and sophisticated code-switching attacks. Our approach includes: curating and enhancing multiple open-source safety datasets with explicit evaluation rationales; employing a jury-of-judges methodology to mitigate the biases of individual judge LLM providers; creating a comprehensive multilingual safety dataset spanning 132 languages with 5 million data points; and developing a two-stage architecture that combines a custom-fine-tuned mBART-50 translation module with an X-Guard 3B evaluation model trained through supervised fine-tuning and GRPO. Our empirical evaluations demonstrate X-Guard's effectiveness in detecting unsafe content across multiple languages while maintaining transparency throughout the safety evaluation process. Our work represents a significant advancement in creating robust, transparent, and linguistically inclusive safety systems for LLMs and their integrated systems. We have publicly released our dataset and models at this URL: https://github.com/UNHSAILLab/X-Guard-Multilingual-Guard-Agent-for-Content-Moderation
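As a rough illustration of the two-stage architecture described above (a translation module feeding an evaluation model), the following sketch wires the public mBART-50 many-to-many checkpoint to a placeholder guard model. The guard model path and its prompt format are assumptions for illustration, not the released X-Guard interface.

```python
# Two-stage translate-then-evaluate sketch. Stage 1 uses the public mBART-50
# many-to-many checkpoint; Stage 2 uses a placeholder safety-evaluation model.
from transformers import (
    MBartForConditionalGeneration,
    MBart50TokenizerFast,
    AutoModelForCausalLM,
    AutoTokenizer,
)

# Stage 1: translate arbitrary-language input into English.
mt_name = "facebook/mbart-large-50-many-to-many-mmt"
mt_tokenizer = MBart50TokenizerFast.from_pretrained(mt_name)
mt_model = MBartForConditionalGeneration.from_pretrained(mt_name)

def translate_to_english(text: str, src_lang: str = "ne_NP") -> str:
    """Translate `text` from `src_lang` (mBART-50 language code) into English."""
    mt_tokenizer.src_lang = src_lang
    encoded = mt_tokenizer(text, return_tensors="pt")
    generated = mt_model.generate(
        **encoded,
        forced_bos_token_id=mt_tokenizer.lang_code_to_id["en_XX"],
        max_new_tokens=256,
    )
    return mt_tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

# Stage 2: pass the translation to a safety-evaluation LLM (placeholder weights/prompt).
guard_name = "path/to/x-guard-3b"  # placeholder, not the released checkpoint name
guard_tokenizer = AutoTokenizer.from_pretrained(guard_name)
guard_model = AutoModelForCausalLM.from_pretrained(guard_name)

def evaluate_safety(user_text: str, src_lang: str = "ne_NP") -> str:
    """Translate the input, then ask the guard model for a safety verdict with rationale."""
    english = translate_to_english(user_text, src_lang)
    prompt = f"Assess whether the following user message is safe and explain why:\n{english}\n"
    inputs = guard_tokenizer(prompt, return_tensors="pt")
    output = guard_model.generate(**inputs, max_new_tokens=128)
    return guard_tokenizer.decode(output[0], skip_special_tokens=True)
```

Routing every input through a dedicated translation stage is what lets a single English-trained evaluator cover many languages without retraining the evaluator itself.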
2024
Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs
Bibek Upadhayay
|
Vahid Behzadan
Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024)
A significant challenge in the reliable deployment of Large Language Models (LLMs) is malicious manipulation via adversarial prompting techniques such as jailbreaks. Employing mechanisms such as safety training has proven useful in addressing this challenge. However, in multilingual LLMs, adversaries can exploit the imbalanced representation of low-resource languages in the datasets used for pretraining and safety training. In this paper, we introduce a new black-box attack vector called the Sandwich Attack: a multi-language mixture attack that manipulates state-of-the-art LLMs into generating harmful and misaligned responses. Our experiments with six different models, namely Bard, Gemini Pro, LLaMA-2-70-B-Chat, GPT-3.5-Turbo, GPT-4, and Claude-3-OPUS, show that this attack vector can be used by adversaries to elicit harmful responses from these models. By detailing both the mechanism and the impact of the Sandwich Attack, this paper aims to guide future research and development toward more secure and resilient LLMs, ensuring they serve the public good while minimizing the potential for misuse. Content Warning: This paper contains examples of harmful language.
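To make the prompt construction concrete, the sketch below assembles a prompt in the general shape the abstract describes: benign questions in several languages surrounding a target question embedded near the middle. The helper function, question set, and instruction wording are illustrative assumptions, not the paper's prompt template, and the target question is left as a placeholder with no harmful content.

```python
# Illustrative sketch of assembling a multi-language "sandwich" prompt:
# benign questions in several languages surround a placeholder target question
# positioned near the middle. All content here is benign and for illustration only.
from typing import List

def build_sandwich_prompt(benign_questions: List[str], target_question: str) -> str:
    """Place the target question in the middle of a numbered multilingual question list."""
    mid = len(benign_questions) // 2
    ordered = benign_questions[:mid] + [target_question] + benign_questions[mid:]
    numbered = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(ordered))
    return (
        "Please answer each of the following questions in the language it is asked in:\n"
        + numbered
    )

benign_questions = [
    "¿Cuál es la capital de Francia?",            # Spanish
    "Quelle est la hauteur de la tour Eiffel ?",  # French
    "Wie viele Planeten hat das Sonnensystem?",   # German
    "地球の周りを回る衛星は何ですか？",              # Japanese
]
target_question = "<question in a low-resource language>"  # placeholder

print(build_sandwich_prompt(benign_questions, target_question))
```

Burying the target question among benign multilingual requests is what gives the attack its "sandwich" structure and exercises the model's weaker safety coverage outside English.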