Xiaohao Luo

2026

Safety Guardrails of Large Language Models Are Vulnerable to Value-Driven Adversarial Prompting
Xiaohao Luo | Ying Wei | Zhijun Li
Findings of the Association for Computational Linguistics: ACL 2026

In the real world, the execution of a task often depends on the executor’s recognition of its value. Motivated by this observation, we propose the value-driven jailbreak attack (VDJA), a simple and effective black-box jailbreak method against large language models (LLMs). VDJA first exploits the phenomenon that LLMs tend to agree with humans to induce LLMs to affirm the moral value of harmful tasks. During autoregressive generation, these value-endorsement tokens function as an implicit value prior, making LLMs more likely to accept and generate harmful content. Extensive experiments on five state-of-the-art (SOTA) LLMs demonstrate the superiority of VDJA. Using only a single query and without concealing harmful instructions, VDJA achieves an average attack success rate (ASR) of 91.8% on JailbreakBench and 95.2% on the AdvBench subset, showcasing SOTA jailbreak success rates and attack efficiency. Most importantly, our work suggests a previously underexplored vulnerability in the safety guardrails of LLMs, which highlights the urgent need to enhance their robustness.

Detecting What Queries Seek: Steering LLM Safety with FFN Output Activation Monitoring
Xiaohao Luo | Ying Wei | Rui Zhao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recently, activation steering has attracted considerable attention as a low-cost approach to improving the safety of large language models (LLMs). However, most existing methods apply interventions indiscriminately, often causing excessive refusal of benign queries. Although recent works have begun to explore selective intervention, their intervention decisions typically rely on residual stream activations where information is highly entangled, resulting in limited discriminative power and unreliable interventions. To address this issue, we propose FFN-Guided activation steering (FGAS). Motivated by the observation that feed-forward networks (FFNs) in LLMs serve as core modules for knowledge storage, we propose leveraging FFN output activations as more discriminative signals for intervention, since these activations more explicitly reflect the intent of a query. For a given query, FGAS projects the corresponding FFN output activation into a low-dimensional subspace that effectively separates harmful and benign queries, and then makes precise intervention decisions by assessing its similarity to pre-constructed prototype activations representing harmful and benign classes. Extensive experiments demonstrate that FGAS achieves state-of-the-art defense performance against various jailbreak attacks, while nearly preserving the model’s original performance on benign tasks.
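
The decision step described in this abstract (project an activation into a low-dimensional subspace, then compare it with harmful and benign prototypes) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the abstract does not say how the subspace is built, so PCA over pooled activations stands in for it; cosine similarity stands in for the unspecified similarity measure; prototypes are taken to be class means; and the "activations" are synthetic vectors, not real FFN outputs. The names `fit_projection` and `should_intervene` are hypothetical.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def fit_projection(harmful, benign, k=2):
    """Assumed subspace construction: top-k PCA directions of pooled activations."""
    pooled = np.vstack([harmful, benign])
    mean = pooled.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(pooled - mean, full_matrices=False)
    return mean, vt[:k].T  # projection matrix of shape (d, k)


def should_intervene(act, mean, proj, proto_harm, proto_benign):
    """Flag a query when its projected activation is closer to the harmful prototype."""
    z = (act - mean) @ proj
    return cosine(z, proto_harm) > cosine(z, proto_benign)


# Synthetic stand-ins for FFN output activations (assumption: two separable clusters).
rng = np.random.default_rng(0)
d = 16
harmful = np.ones(d) + 0.1 * rng.normal(size=(50, d))
benign = -np.ones(d) + 0.1 * rng.normal(size=(50, d))

mean, proj = fit_projection(harmful, benign)

# Assumed prototypes: class means in the projected subspace.
proto_harm = ((harmful - mean) @ proj).mean(axis=0)
proto_benign = ((benign - mean) @ proj).mean(axis=0)

flag_harmful = should_intervene(harmful[0], mean, proj, proto_harm, proto_benign)
flag_benign = should_intervene(benign[0], mean, proj, proto_harm, proto_benign)
```

In this toy setup the steering intervention itself is omitted; the sketch only covers the gating decision, which is the part that lets benign queries pass through unmodified.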

Venues

ACL (1)
Findings (1)