Safety Guardrails of Large Language Models Are Vulnerable to Value-Driven Adversarial Prompting

Xiaohao Luo; Ying Wei; Zhijun Li

Abstract

In the real world, whether a task gets executed often depends on whether the executor recognizes its value. Motivated by this observation, we propose the value-driven jailbreak attack (VDJA), a simple and effective black-box jailbreak method against large language models (LLMs). VDJA first exploits the tendency of LLMs to agree with humans, inducing them to affirm the moral value of harmful tasks. During autoregressive generation, these value-endorsement tokens then function as an implicit value prior, making LLMs more likely to accept and generate harmful content. Extensive experiments on five state-of-the-art (SOTA) LLMs demonstrate the superiority of VDJA. Using only a single query and without concealing harmful instructions, VDJA achieves an average attack success rate (ASR) of 91.8% on JailbreakBench and 95.2% on the AdvBench subset, reaching SOTA jailbreak success rates and attack efficiency. Most importantly, our work suggests a previously underexplored vulnerability in the safety guardrails of LLMs, highlighting the urgent need to enhance their robustness.

Anthology ID: 2026.findings-acl.1357
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 27238–27255
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1357/
Cite (ACL): Xiaohao Luo, Ying Wei, and Zhijun Li. 2026. Safety Guardrails of Large Language Models Are Vulnerable to Value-Driven Adversarial Prompting. In Findings of the Association for Computational Linguistics: ACL 2026, pages 27238–27255, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Safety Guardrails of Large Language Models Are Vulnerable to Value-Driven Adversarial Prompting (Luo et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1357.pdf
Checklist: 2026.findings-acl.1357.checklist.pdf