LLM-VA: Resolving the Jailbreak-Overrefusal Trade-off via Vector Alignment

Haonan Zhang, Dongxia Wang, Yi Liu, Kexin Chen, Wenhai Wang


Abstract
Safety-aligned LLMs suffer from two failure modes: jailbreak (responding to harmful inputs) and over-refusal (declining benign queries). Existing vector steering methods adjust the magnitude of answer vectors, but this creates a fundamental trade-off: reducing jailbreak increases over-refusal, and vice versa. We identify the root cause: LLMs encode the decision to respond (answer vector v_a) and the judgment of input safety (benign vector v_b) as nearly orthogonal directions, treating them as independent processes. We propose LLM-VA, which aligns v_a with v_b through closed-form weight updates, making the model's willingness to respond causally dependent on its safety assessment, without fine-tuning or architectural changes. Our method identifies vectors at each layer using SVMs, selects safety-relevant layers, and iteratively aligns vectors via minimum-norm weight modifications. Experiments on 12 LLMs demonstrate that LLM-VA achieves 11.45% higher F1 than the best baseline while preserving 95.92% utility, and automatically adapts to each model's safety bias without manual tuning. Code and models are available at https://hotbento.github.io/LLM-VA-Web/.
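To make the phrase "closed-form, minimum-norm weight modification" concrete, here is a minimal sketch of one generic construction. It is not the authors' procedure. The abstract does not give the update rule, so everything here is an assumption made for illustration: the random matrix W standing in for a layer weight, the random vectors v_a and v_b, and the helper names cosine and min_norm_update. The sketch relies only on a standard linear-algebra fact: among all updates delta satisfying (W + delta) x = y, the rank-one matrix (y - W x) x^T / (x^T x) has the smallest Frobenius norm.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def min_norm_update(W, x, y_target):
    """Smallest Frobenius-norm delta such that (W + delta) @ x == y_target."""
    residual = y_target - W @ x
    return np.outer(residual, x) / (x @ x)


# Hypothetical stand-ins: a random "layer weight" and two random directions.
rng = np.random.default_rng(0)
d = 64
W = rng.standard_normal((d, d)) / np.sqrt(d)
v_a = rng.standard_normal(d)  # stand-in for an answer direction
v_b = rng.standard_normal(d)  # stand-in for a benign direction

# Random directions in high dimension are close to orthogonal.
before = cosine(W @ v_a, v_b)

# Ask the updated weight to send v_a onto the v_b direction, keeping its norm.
target = np.linalg.norm(W @ v_a) * v_b / np.linalg.norm(v_b)
delta = min_norm_update(W, v_a, target)
W_new = W + delta
after = cosine(W_new @ v_a, v_b)

print(f"cosine before: {before:.3f}, after: {after:.3f}")
```

In this toy setting a single rank-one step is enough to align the two directions exactly. The abstract describes an iterative, layer-selective procedure with vectors found by SVMs, which this sketch does not attempt to reproduce.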
Anthology ID:
2026.acl-long.260
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
5760–5776
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.260/
Cite (ACL):
Haonan Zhang, Dongxia Wang, Yi Liu, Kexin Chen, and Wenhai Wang. 2026. LLM-VA: Resolving the Jailbreak-Overrefusal Trade-off via Vector Alignment. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5760–5776, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
LLM-VA: Resolving the Jailbreak-Overrefusal Trade-off via Vector Alignment (Zhang et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.260.pdf
Checklist:
 2026.acl-long.260.checklist.pdf