Proceedings of the First Workshop on LLM Security (LLMSEC)
Jekaterina Novikova
UTF: Under-trained Tokens as Fingerprints — A Novel Approach to LLM Identification
Jiacheng Cai
|
Jiahao Yu
|
Yangguang Shao
|
Yuhang Wu
|
Xinyu Xing
Fingerprinting large language models (LLMs) is essential for verifying model ownership, ensuring authenticity, and preventing misuse. Traditional fingerprinting methods often require significant computational overhead or white-box verification access. In this paper, we introduce UTF, a novel and efficient approach to fingerprinting LLMs by leveraging under-trained tokens. Under-trained tokens are tokens that the model has not fully learned during its training phase. By utilizing these tokens, we perform supervised fine-tuning to embed specific input-output pairs into the model. This process allows the LLM to produce predetermined outputs when presented with certain inputs, effectively embedding a unique fingerprint. Our method has minimal overhead and impact on the model’s performance, and does not require white-box access to the target model for ownership identification. Compared to existing fingerprinting methods, UTF is also more effective and more robust to fine-tuning and random guessing.
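A minimal sketch of the general idea (assuming a Hugging Face causal LM and a low embedding-norm heuristic as a proxy for under-trained tokens; the paper’s actual selection criterion and fine-tuning recipe may differ):

```python
# Minimal sketch (assumptions: a Hugging Face causal LM; low embedding norm
# as a proxy for "under-trained" tokens; the paper's actual criterion and
# fine-tuning recipe may differ).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Rank tokens by input-embedding norm; unusually small norms often indicate
# tokens that were rarely (or never) updated during pre-training.
emb = model.get_input_embeddings().weight.detach()
norms = emb.norm(dim=-1)
candidate_ids = torch.argsort(norms)[:64].tolist()

# Pair candidate tokens into (trigger, response) fingerprint examples that a
# subsequent supervised fine-tuning pass would embed into the model.
fingerprints = [
    {"input": tok.decode([a]), "output": tok.decode([b])}
    for a, b in zip(candidate_ids[:32], candidate_ids[32:])
]
print(fingerprints[:3])
```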
RedHit: Adaptive Red-Teaming of Large Language Models via Search, Reasoning, and Preference Optimization
Mohsen Sorkhpour
|
Abbas Yazdinejad
|
Ali Dehghantanha
Red-teaming has become a critical component of Large Language Model (LLM) security amid increasingly sophisticated adversarial techniques. However, existing methods often depend on hard-coded strategies that quickly become obsolete against novel attack patterns, requiring constant updates. Moreover, current automated red-teaming approaches typically lack effective reasoning capabilities, leading to lower attack success rates and longer training times. In this paper, we propose RedHit, a multi-round, automated, and adaptive red-teaming framework that integrates Monte Carlo Tree Search (MCTS), Chain-of-Thought (CoT) reasoning, and Direct Preference Optimization (DPO) to enhance the adversarial capabilities of an Adversarial LLM (ALLM). RedHit formulates prompt injection as a tree search problem, where the goal is to discover adversarial prompts capable of bypassing target model defenses. Each search step is guided by an Evaluator module that dynamically scores model responses using multi-detector feedback, yielding fine-grained reward signals. MCTS is employed to explore the space of adversarial prompts, incrementally constructing a Prompt Search Tree (PST) in which each node stores an adversarial prompt, its response, a reward, and other control properties. Prompts are generated via a locally hosted IndirectPromptGenerator module, which uses CoT-enabled prompt transformation to create multi-perspective, semantically equivalent variants for deeper tree exploration. CoT reasoning improves MCTS exploration by injecting strategic insights derived from past interactions, enabling RedHit to adapt dynamically to the target LLM’s defenses. Furthermore, DPO fine-tunes the ALLM using preference data collected from previous attack rounds, progressively enhancing its ability to generate more effective prompts. RedHit leverages the Garak framework to evaluate each adversarial prompt and compute rewards, demonstrating robust and adaptive adversarial behavior across multiple attack rounds.
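A minimal sketch of the Prompt Search Tree bookkeeping described above, with UCT-style node selection; the class name, fields, and exploration constant are illustrative assumptions, and the evaluator, CoT prompt generator, and DPO stages are omitted:

```python
# Minimal sketch of a Prompt Search Tree (PST) node with UCT selection,
# assuming rewards in [0, 1] from an external evaluator; names are illustrative.
import math
from dataclasses import dataclass, field

@dataclass
class PSTNode:
    prompt: str                        # adversarial prompt candidate
    response: str = ""                 # target model's reply
    reward: float = 0.0                # evaluator score for this node
    visits: int = 0
    value: float = 0.0                 # accumulated reward over rollouts
    children: list["PSTNode"] = field(default_factory=list)

def uct_select(node: PSTNode, c: float = 1.4) -> PSTNode:
    """Pick the child balancing exploitation (mean reward) and exploration."""
    def score(child: PSTNode) -> float:
        if child.visits == 0:
            return float("inf")
        return child.value / child.visits + c * math.sqrt(
            math.log(node.visits + 1) / child.visits
        )
    return max(node.children, key=score)
```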
Using Humor to Bypass Safety Guardrails in Large Language Models
Pedro Cisneros-Velarde
In this paper, we show it is possible to bypass the safety guardrails of large language models (LLMs) through a humorous prompt that includes the unsafe request. In particular, our method does not edit the unsafe request and follows a fixed template—it is simple to implement and does not need additional LLMs to craft prompts. Extensive experiments show the effectiveness of our method across different LLMs. We also show that both removing and adding more humor to our method can reduce its effectiveness—excessive humor possibly distracts the LLM from fulfilling the unsafe request. Thus, we argue that LLM jailbreaking occurs when there is a proper balance between focus on the unsafe request and the presence of humor.
LongSafety: Enhance Safety for Long-Context LLMs
Mianqiu Huang
|
Xiaoran Liu
|
Shaojun Zhou
|
Mozhi Zhang
|
Qipeng Guo
|
Linyang Li
|
Pengyu Wang
|
Yang Gao
|
Chenkun Tan
|
Linlin Li
|
Qun Liu
|
Yaqian Zhou
|
Xipeng Qiu
|
Xuanjing Huang
Recent advancements in model architectures and length extrapolation techniques have significantly extended the context length of large language models (LLMs), paving the way for their application in increasingly complex tasks. However, despite the growing capabilities of long-context LLMs, safety issues in long-context scenarios remain underexplored. While safety alignment in short-context settings has been widely studied, the safety concerns of long-context LLMs have not been adequately addressed. In this work, we introduce LongSafety, a comprehensive safety alignment dataset for long-context LLMs, containing 10 tasks and 17k samples with an average length of 40.9k tokens. Our experiments demonstrate that training with LongSafety improves long-context safety performance while also improving short-context safety and preserving general capabilities. Furthermore, we demonstrate that long-context safety cannot be achieved simply through long-context alignment with short-context safety data, and that LongSafety generalizes across context lengths and long-context safety scenarios.
ArithmAttack: Evaluating Robustness of LLMs to Noisy Context in Math Problem Solving
Zain Ul Abedin
|
Shahzeb Qamar
|
Lucie Flek
|
Akbar Karimi
While Large Language Models (LLMs) have shown impressive capabilities in math problem-solving tasks, their robustness to noisy inputs is not well studied. We propose ArithmAttack to examine how robust LLMs are when they encounter noisy prompts that contain extra noise in the form of punctuation marks. While being easy to implement, ArithmAttack does not cause any information loss, since words are not added or deleted from the context. We evaluate the robustness of eight LLMs, including Llama3, Mistral, Mathstral, and DeepSeek, on noisy GSM8K and MultiArith datasets. Our experiments suggest that all the studied models are vulnerable to such noise, with more noise leading to poorer performance.
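A minimal sketch of this kind of punctuation-noise perturbation (the insertion positions and noise rate below are assumptions, not necessarily the paper’s exact settings):

```python
# Minimal sketch: inject random punctuation between words of a math word
# problem without adding or removing any words (assumed noise model; the
# paper's exact insertion strategy and rates may differ).
import random

PUNCT = list(".,;:!?")

def arithm_noise(question: str, p: float = 0.3, seed: int = 0) -> str:
    rng = random.Random(seed)
    words = question.split()
    noisy = [w + rng.choice(PUNCT) if rng.random() < p else w for w in words]
    return " ".join(noisy)

print(arithm_noise("Natalia sold 48 clips in April and half as many in May."))
```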
X-Guard: Multilingual Guard Agent for Content Moderation
Bibek Upadhayay
|
Vahid Behzadan
Large Language Models (LLMs) have rapidly become integral to numerous applications in critical domains where reliability is paramount. Despite significant advances in safety frameworks and guardrails, current protective measures exhibit crucial vulnerabilities, particularly in multilingual contexts. Existing safety systems remain susceptible to adversarial attacks in low-resource languages and through code-switching techniques, primarily due to their English-centric design. Furthermore, the development of effective multilingual guardrails is constrained by the scarcity of diverse cross-lingual training data. Even recent solutions like Llama Guard-3, while offering multilingual support, lack transparency in their decision-making processes. We address these challenges by introducing X-Guard, a transparent multilingual safety agent designed to provide content moderation across diverse linguistic contexts. X-Guard effectively defends against both conventional low-resource language attacks and sophisticated code-switching attacks. Our approach includes: curating and enhancing multiple open-source safety datasets with explicit evaluation rationales; employing a jury-of-judges methodology to mitigate individual judge LLM provider biases; creating a comprehensive multilingual safety dataset spanning 132 languages with 5 million data points; and developing a two-stage architecture combining a custom-finetuned mBART-50 translation module with an X-Guard 3B evaluation model trained through supervised fine-tuning and GRPO. Our empirical evaluations demonstrate X-Guard’s effectiveness in detecting unsafe content across multiple languages while maintaining transparency throughout the safety evaluation process. Our work represents a significant advancement in creating robust, transparent, and linguistically inclusive safety systems for LLMs and their integrated systems. We have publicly released our dataset and models at https://github.com/UNHSAILLab/X-Guard-Multilingual-Guard-Agent-for-Content-Moderation.
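A minimal sketch of the two-stage translate-then-moderate idea, assuming the Hugging Face mBART-50 checkpoint and a placeholder classifier standing in for the released X-Guard 3B model:

```python
# Minimal sketch of a translate-then-moderate pipeline (stage 1: mBART-50
# translation to English, stage 2: a safety classifier). The classifier below
# is a generic placeholder, not the released X-Guard weights.
from transformers import (
    MBart50TokenizerFast, MBartForConditionalGeneration, pipeline,
)

mt_name = "facebook/mbart-large-50-many-to-many-mmt"
mt_tok = MBart50TokenizerFast.from_pretrained(mt_name, src_lang="sl_SI")
mt_model = MBartForConditionalGeneration.from_pretrained(mt_name)

def to_english(text: str) -> str:
    batch = mt_tok(text, return_tensors="pt")
    out = mt_model.generate(
        **batch, forced_bos_token_id=mt_tok.lang_code_to_id["en_XX"]
    )
    return mt_tok.batch_decode(out, skip_special_tokens=True)[0]

# Placeholder classifier standing in for a fine-tuned safety evaluation model.
moderator = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

def moderate(text: str) -> dict:
    english = to_english(text)
    return {"translation": english, "verdict": moderator(english)[0]}
```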
RealHarm: A Collection of Real-World Language Model Application Failures
Pierre Le Jeune
|
Jiaen Liu
|
Luca Rossi
|
Matteo Dora
Language model deployments in consumer-facing applications introduce numerous risks. While existing research on harms and hazards of such applications follows top-down approaches derived from regulatory frameworks and theoretical analyses, empirical evidence of real-world failure modes remains underexplored. In this work, we introduce RealHarm, a dataset of annotated problematic interactions with AI agents built from a systematic review of publicly reported incidents. Analyzing harms, causes, and hazards specifically from the deployer’s perspective, we find that reputational damage constitutes the predominant organizational harm, while misinformation emerges as the most common hazard category. We empirically evaluate state-of-the-art guardrails and content moderation systems to probe whether such systems would have prevented the incidents, revealing a significant gap in the protection of AI applications.
Bypassing LLM Guardrails: An Empirical Analysis of Evasion Attacks against Prompt Injection and Jailbreak Detection Systems
William Hackett
|
Lewis Birch
|
Stefan Trawicki
|
Neeraj Suri
|
Peter Garraghan
Large Language Model (LLM) guardrail systems are designed to protect against prompt injection and jailbreak attacks; however, they remain vulnerable to evasion techniques. We demonstrate two approaches for bypassing LLM prompt injection and jailbreak detection systems: traditional character injection methods and algorithmic Adversarial Machine Learning (AML) evasion techniques. Through testing against six prominent protection systems, including Microsoft’s Azure Prompt Shield and Meta’s Prompt Guard, we show that both methods can be used to evade detection while maintaining adversarial utility, in some instances achieving up to 100% evasion success. Furthermore, we demonstrate that adversaries can enhance Attack Success Rates (ASR) against black-box targets by leveraging word importance rankings computed by offline white-box models. Our findings reveal vulnerabilities within current LLM protection mechanisms and highlight the need for more robust guardrail systems.
1-2-3 Check: Enhancing Contextual Privacy in LLM via Multi-Agent Reasoning
Wenkai Li
|
Liwen Sun
|
Zhenxiang Guan
|
Xuhui Zhou
|
Maarten Sap
Addressing contextual privacy concerns remains challenging in interactive settings where large language models (LLMs) process information from multiple sources. Building on the theory of contextual integrity, we introduce a multi-agent framework that decomposes privacy reasoning into specialized subtasks—extraction, classification—reducing the information load on any single agent while enabling iterative validation and more reliable adherence to contextual privacy norms. Experiments on the ConfAIde benchmark with two LLMs (GPT-4, Llama3) demonstrate that our multi-agent system substantially reduces private information leakage (36% reduction) while maintaining the fidelity of public content compared to a single-agent system, showing the promise of multi-agent frameworks towards contextual privacy with LLMs.
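A minimal sketch of decomposing the privacy check into extraction, classification, and a final validation step, assuming an OpenAI-style chat client; the prompts and agent roles are illustrative, not the paper’s exact agents:

```python
# Minimal sketch of multi-agent contextual-privacy reasoning: extraction,
# classification, then a check that drafts a reply using only shareable items.
# Assumes an OpenAI-style chat client; prompts and roles are illustrative.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder model

def ask(system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def check_before_sharing(conversation: str, recipient: str) -> str:
    # 1. Extraction agent: list every piece of personal information.
    items = ask("List each piece of private information, one per line.",
                conversation)
    # 2. Classification agent: decide per item whether the recipient may see it.
    labels = ask(f"For each item, answer SHARE or WITHHOLD for recipient "
                 f"'{recipient}', following contextual integrity norms.", items)
    # 3. Check agent: draft a reply that reveals only the SHARE items.
    return ask("Write a reply that reveals only the items marked SHARE.",
               f"Conversation:\n{conversation}\n\nDecisions:\n{labels}")
```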
Fine-Tuning Lowers Safety and Disrupts Evaluation Consistency
Kathleen C. Fraser
|
Hillary Dawkins
|
Isar Nejadgholi
|
Svetlana Kiritchenko
Fine-tuning a general-purpose large language model (LLM) for a specific domain or task has become a routine procedure for ordinary users. However, fine-tuning is known to remove the safety alignment features of the model, even when the fine-tuning data does not contain any harmful content. We consider this to be a critical failure mode of LLMs due to the widespread uptake of fine-tuning, combined with the benign nature of the “attack”. Most well-intentioned developers are likely unaware that they are deploying an LLM with reduced safety. On the other hand, this known vulnerability can be easily exploited by malicious actors intending to bypass safety guardrails. To make any meaningful progress in mitigating this issue, we first need reliable and reproducible safety evaluations. In this work, we investigate how robust a safety benchmark is to trivial variations in the experimental procedure, and the stochastic nature of LLMs. Our initial experiments expose surprising variance in the results of the safety evaluation, even when seemingly inconsequential changes are made to the fine-tuning setup. Our observations have serious implications for how researchers in this field should report results to enable meaningful comparisons in the future.
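A minimal sketch of the kind of repeated-run reporting this implies, i.e. a safety score given as mean and spread over several seeds; finetune_and_score is a stand-in for the reader’s own fine-tuning pipeline and benchmark:

```python
# Minimal sketch: report a safety score as mean and standard deviation over
# repeated fine-tuning/evaluation runs instead of a single number.
# `finetune_and_score` is a stand-in for the user's own pipeline and benchmark.
import statistics

def finetune_and_score(seed: int) -> float:
    """Placeholder: fine-tune with this seed and return a harmlessness score."""
    raise NotImplementedError

def report(seeds=range(5)) -> dict:
    scores = [finetune_and_score(s) for s in seeds]
    return {"mean": statistics.mean(scores),
            "std": statistics.stdev(scores),
            "runs": len(scores)}
```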
SPADE: Structured Prompting Augmentation for Dialogue Enhancement in Machine-Generated Text Detection
Haoyi Li
|
Angela Yuan
|
Soyeon Han
|
Christopher Leckie
The increasing capability of large language models (LLMs) to generate synthetic content has heightened concerns about their misuse, driving the development of Machine-Generated Text (MGT) detection models. However, these detectors face significant challenges due to the lack of high-quality synthetic datasets for training. To address this issue, we propose SPADE, a structured framework for detecting synthetic dialogues using prompt-based adversarial samples. Our proposed methods yield 14 new dialogue datasets, which we benchmark against eight MGT detection models. The results demonstrate improved generalization performance when utilizing a mixed dataset produced by the proposed augmentation frameworks, offering a practical approach to enhancing LLM application security. Considering that real-world agents lack knowledge of future opponent utterances, we simulate online dialogue detection and examine the relationship between chat history length and detection accuracy. Our open-source datasets are publicly available.
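A minimal sketch of the simulated online-detection setup, scoring growing chat-history prefixes; the detector interface and data layout are assumptions:

```python
# Minimal sketch of online MGT detection: score growing chat-history prefixes
# and record accuracy as a function of history length. `detector` is a
# stand-in for any of the benchmarked MGT detection models.
from typing import Callable, Sequence

def accuracy_by_history_length(
    dialogues: Sequence[tuple[list[str], int]],   # (utterances, label); 1 = machine-generated
    detector: Callable[[str], int],
    max_len: int = 10,
) -> dict[int, float]:
    results = {}
    for k in range(1, max_len + 1):
        preds = [(detector(" ".join(utts[:k])), label)
                 for utts, label in dialogues if len(utts) >= k]
        if preds:
            results[k] = sum(p == y for p, y in preds) / len(preds)
    return results
```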
Weakest Link in the Chain: Security Vulnerabilities in Advanced Reasoning Models
Arjun Krishna
|
Erick Galinkin
|
Aaditya Rastogi
The introduction of advanced reasoning capabilities has improved the problem-solving performance of large language models, particularly on math and coding benchmarks. However, it remains unclear whether these reasoning models are more or less vulnerable to adversarial prompt attacks than their non-reasoning counterparts. In this work, we present a systematic evaluation of weaknesses in advanced reasoning models compared to similar non-reasoning models across a diverse set of prompt-based attack categories. Using experimental data, we find that on average the reasoning-augmented models are slightly more robust than non-reasoning models (42.51% vs. 45.53% attack success rate, lower is better). However, this overall trend masks significant category-specific differences: for certain attack types the reasoning models are substantially more vulnerable (e.g., up to 32 percentage points worse on a tree-of-attacks prompt), while for others they are markedly more robust (e.g., 29.8 points better on cross-site scripting injection). Our findings highlight the nuanced security implications of advanced reasoning in language models and emphasize the importance of stress-testing safety across diverse adversarial techniques.
CAPTURE: Context-Aware Prompt Injection Testing and Robustness Enhancement
Gauri Kholkar
|
Ratinder Ahuja
Prompt injection remains a major security risk for large language models. However, the efficacy of existing guardrail models in context-aware settings remains underexplored, as they often rely on static attack benchmarks. Additionally, these models exhibit over-defense tendencies. We introduce CAPTURE, a novel context-aware benchmark assessing both attack detection and over-defense tendencies with minimal in-domain examples. Our experiments reveal that current prompt injection guardrail models suffer from high false negatives in adversarial cases and excessive false positives in benign scenarios, highlighting critical limitations. To demonstrate our framework’s utility, we train CAPTUREGUARD on our generated data. This new model drastically reduces both false negative and false positive rates on our context-aware datasets while also generalizing effectively to external benchmarks, establishing a path toward more robust and practical prompt injection defenses.
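A minimal sketch of the two error rates such a benchmark targets, i.e. false negatives on adversarial prompts and false positives (over-defense) on benign ones; the guard interface is an assumption:

```python
# Minimal sketch: measure a guardrail's false-negative rate on adversarial
# prompts and false-positive (over-defense) rate on benign prompts. The
# `guard` callable (True when it flags an attack) is an assumed interface.
from typing import Callable, Sequence

def guardrail_error_rates(
    guard: Callable[[str], bool],
    adversarial: Sequence[str],
    benign: Sequence[str],
) -> dict[str, float]:
    fn = sum(not guard(p) for p in adversarial) / len(adversarial)
    fp = sum(guard(p) for p in benign) / len(benign)
    return {"false_negative_rate": fn, "false_positive_rate": fp}
```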
Shortcut Learning in Safety: The Impact of Keyword Bias in Safeguards
Panuthep Tasawong
|
Napat Laosaengpha
|
Wuttikorn Ponwitayarat
|
Sitiporn Lim
|
Potsawee Manakul
|
Samuel Cahyawijaya
|
Can Udomcharoenchaikit
|
Peerat Limkonchotiwat
|
Ekapol Chuangsuwanich
|
Sarana Nutanong
This paper investigates the problem of shortcut learning in safety guardrails for large language models (LLMs). It reveals that current safeguard models often rely excessively on superficial cues, such as specific keywords that are spuriously correlated with training labels, rather than genuinely understanding the input’s semantics or intent. As a result, their performance degrades significantly when there is a shift in keyword distribution. The paper also examines the impact of reducing shortcut reliance, showing that merely minimizing shortcut influence is insufficient. To build robust safeguard models, it is equally crucial to promote the use of intended features.
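One simple way to surface the effect of a keyword-distribution shift is to compare a safeguard’s accuracy on inputs with and without suspected shortcut keywords; the keyword list and classifier interface below are illustrative assumptions:

```python
# Minimal sketch: compare a safeguard's accuracy on examples that contain
# suspected shortcut keywords vs. those that do not. Keyword list and the
# classifier interface are assumptions for illustration.
from typing import Callable, Sequence

SHORTCUT_KEYWORDS = {"bomb", "hack", "kill"}  # illustrative spurious cues

def accuracy_split(
    classifier: Callable[[str], int],        # 1 = unsafe, 0 = safe
    data: Sequence[tuple[str, int]],         # (text, gold label) pairs
) -> dict[str, float]:
    with_kw, without_kw = [], []
    for text, label in data:
        has_kw = any(k in text.lower() for k in SHORTCUT_KEYWORDS)
        (with_kw if has_kw else without_kw).append((text, label))

    def acc(subset):
        return (sum(classifier(t) == y for t, y in subset) / len(subset)
                if subset else float("nan"))

    return {"with_keywords": acc(with_kw), "without_keywords": acc(without_kw)}
```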
Beyond Words: Multilingual and Multimodal Red Teaming of MLLMs
Erik Derner
|
Kristina Batistič
Multimodal large language models (MLLMs) are increasingly deployed in real-world applications, yet their safety remains underexplored, particularly in multilingual and visual contexts. In this work, we present a systematic red teaming framework to evaluate MLLM safeguards using adversarial prompts translated into seven languages and delivered via four input modalities: plain text, jailbreak prompt + text, text rendered as an image, and jailbreak prompt + text rendered as an image. We find that rendering prompts as images increases attack success rates and reduces refusal rates, with the effect most pronounced in lower-resource languages such as Slovenian, Czech, and Valencian. Our results suggest that vision-based multilingual attacks expose a persistent gap in current alignment strategies, highlighting the need for robust multilingual and multimodal MLLM safety evaluation and mitigation of these risks. We make our code and data available.
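A minimal sketch of aggregating refusal rates over the language-by-modality grid described above; the record fields and the string-match refusal heuristic are assumptions:

```python
# Minimal sketch: tabulate refusal rates per (language, modality) cell from a
# list of red-teaming records. The record fields and the string-match refusal
# heuristic are illustrative assumptions.
from collections import defaultdict

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry")

def refusal_rates(records: list[dict]) -> dict[tuple[str, str], float]:
    """records: [{'language': 'cs', 'modality': 'image+jailbreak', 'response': '...'}, ...]"""
    counts = defaultdict(lambda: [0, 0])           # cell -> [refusals, total]
    for r in records:
        cell = (r["language"], r["modality"])
        counts[cell][1] += 1
        if any(m in r["response"].lower() for m in REFUSAL_MARKERS):
            counts[cell][0] += 1
    return {cell: ref / tot for cell, (ref, tot) in counts.items()}
```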