Tsung-Yi Ho

2026

Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets
Lei Hsiung | Tianyu Pang | Yung-Chen Tang | Linyue Song | Tsung-Yi Ho | Pin-Yu Chen | Yaoqing Yang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent advancements in large language models (LLMs) have underscored their vulnerability to safety alignment jailbreaks, particularly when subjected to downstream fine-tuning. However, existing mitigation strategies primarily focus on reactively addressing jailbreak incidents after safety guardrails have been compromised, removing harmful gradients during fine-tuning, or continuously reinforcing safety alignment throughout fine-tuning. As such, they tend to overlook a critical upstream factor: the role of the original safety-alignment data. This paper therefore investigates the degradation of safety guardrails through the lens of representation similarity between upstream alignment datasets and downstream fine-tuning tasks. Our experiments demonstrate that high similarity between these datasets significantly weakens safety guardrails, making models more susceptible to jailbreaks. Conversely, low similarity between these two types of datasets yields substantially more robust models and thus reduces harmfulness score by up to 10.33%. By highlighting the importance of upstream dataset design in the building of durable safety guardrails and reducing real-world vulnerability to jailbreak attacks, these findings offer actionable insights for fine-tuning service providers to prioritize upstream models with low jailbreak risk.

pdf bib abs

GRE Score: Generative Risk Evaluation for Large Language Models
Zaitang LI | Pin-Yu Chen | Tsung-Yi Ho
Findings of the Association for Computational Linguistics: ACL 2026

Large Language Models (LLMs) have revolutionized generative tasks, but concerns about their trustworthiness and vulnerability to adversarial attacks persist. This paper introduces the Generative Robustness Evaluation (GRE) Score, a novel metric designed to assess LLMs’ resilience against adversarial red teaming attempts that may compromise model compliance and elicit undesired responses. Our approach utilizes conditional generation for synthetic text creation, offering an attack-independent evaluation of LLM robustness. By calculating the margin in refusal scores, we quantify the robustness of LLMs in an attack-agnostic manner. We evaluate our method on five dimensions with specified datasets, encompassing ethical considerations, safety protocols, and potential misuse scenarios. We present four contributions: (1) The GRE Score framework, which establishes a textual robustness certificate for LLMs against adversarial red teaming attempts, providing a theoretical foundation for quantifying model resilience. (2) Comprehensive evaluations across five dimensions using eight prominent LLMs, validating GRE Scores with adversarial red teaming attacks. Our method demonstrates a consistent ranking of LLM robustness when compared to the attack-based model ranking on TrustLLM (CITATION) while achieving a significant 5-8x speedup compared to traditional evaluation techniques. (3) Insights into the non-linear relationship between model scaling and performance, revealing that larger models do not always perform better, and an analysis of how instruction-tuning impacts robustness across LLMs. (4) The discovery that all evaluated LLMs exhibit lower performance in robustness and privacy tasks compared to other areas, highlighting a critical gap in capabilities.

pdf bib abs

Large Language Models (LLMs) rely on massive training datasets, often including proprietary data, which raises concerns about unauthorized usage and copyright infringement. Existing dataset inference methods typically require access to log probabilities or other internal signals, but many modern LLMs restrict such access, motivating token-only inference approaches. We propose CatShift, a token-only dataset inference framework based on catastrophic forgetting, where models overwrite prior knowledge when trained on new data. Fine-tuning an LLM on a subset of its training data induces larger output shifts than fine-tuning on unseen data. CatShift compares these shifts against those from a known non-member validation set to infer whether a dataset was included in training. Experiments on both open-source and API-based LLMs show that CatShift remains effective without logit access, enabling practical protection of proprietary datasets.

2025

pdf bib abs

Defensive Prompt Patch: A Robust and Generalizable Defense of Large Language Models against Jailbreak Attacks
Chen Xiong | Xiangyu Qi | Pin-Yu Chen | Tsung-Yi Ho
Findings of the Association for Computational Linguistics: ACL 2025

Safety, security, and compliance are essential requirements when aligning large language models (LLMs). However, many seemingly aligned LLMs are soon shown to be susceptible to jailbreak attacks. These attacks aim to circumvent the models’ safety guardrails and security mechanisms by introducing jailbreak prompts into malicious queries. In response to these challenges, this paper introduces **Defensive Prompt Patch** (DPP), a novel prompt-based defense mechanism specifically designed to protect LLMs against such sophisticated jailbreak strategies. Unlike previous approaches, which have often compromised the utility of the model for the sake of safety, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs. Our method uses strategically designed suffix prompts that effectively thwart a wide range of standard and adaptive jailbreak techniques. Empirical results conducted on Llama-2-7B-Chat and Mistral-7B-Instruct-v0.2 demonstrate the robustness and adaptability of DPP, showing significant reductions in ASR with negligible impact on utility. Our approach not only outperforms existing defense strategies in balancing safety and functionality, but also provides a scalable and robust solution to various LLM platforms.

Co-authors

Rui Zhu 1

Venues

Findings3
ACL1

Fix author