Wenbo Jiang

2026

The deployment of large language models (LLMs) raises significant ethical and safety concerns. While LLM alignment techniques are adopted to improve model safety and trustworthiness, adversaries can exploit these techniques to undermine safety for malicious purposes, resulting in misalignment. Misaligned LLMs may be published on open platforms to magnify harm. To address this, additional safety alignment, referred to as realignment, is necessary before deploying untrusted third-party LLMs. This study explores the efficacy of fine-tuning methods in terms of misalignment, realignment, and the effects of their interplay. By evaluating four Supervised Fine-Tuning (SFT) and two Preference Fine-Tuning (PFT) methods across four popular safety-aligned LLMs, we reveal a mechanism asymmetry between attack and defense. While Odds Ratio Preference Optimization (ORPO) is most effective for misalignment, Direct Preference Optimization (DPO) excels in realignment, albeit at the expense of model utility. Additionally, we identify model-specific resistance, residual effects of multi-round adversarial dynamics, and other noteworthy findings. These findings highlight the need for robust safeguards and customized safety alignment strategies to mitigate potential risks in the deployment of LLMs.

2025

pdf bib abs

Large Audio-Language Models (LALMs) are increasingly deployed in real-world applications, yet their robustness against malicious audio injection remains underexplored. To address this gap, this study systematically evaluates five leading LALMs across four attack scenarios: Audio Interference Attack, Instruction Following Attack, Context Injection Attack, and Judgment Hijacking Attack. We quantitatively assess their vulnerabilities and resilience using metrics: the Defense Success Rate, Context Robustness Score, and Judgment Robustness Index. The experiments reveal significant performance disparities, with no single model demonstrating consistent robustness across all attack types. Attack effectiveness is significantly influenced by the position of the malicious content, particularly when injected at the beginning of a sequence. Furthermore, our analysis uncovers a negative correlation between a model’s instruction-following capability and its robustness: models that strictly adhere to instructions tend to be more susceptible, whereas safety-aligned models exhibit greater resistance. To facilitate future research, this work introduces a comprehensive benchmark framework. Our findings underscore the critical need for integrating robustness into training pipelines and developing multi-modal defenses, ultimately facilitating the secure deployment of LALMs. The dataset used in this work is available on Hugging Face.

Co-authors

Venues

EMNLP1
Findings1

Fix author