LLM-VA: Resolving the Jailbreak-Overrefusal Trade-off via Vector Alignment

Haonan Zhang; Dongxia Wang; Yi Liu; Kexin Chen; Wenhai Wang

Abstract

Safety-aligned LLMs suffer from two failure modes: jailbreak (responding to harmful inputs) and over-refusal (declining benign queries). Existing vector steering methods adjust the magnitude of answer vectors, but this creates a fundamental trade-off: reducing jailbreak increases over-refusal and vice versa. We identify the root cause: LLMs encode the decision to respond (answer vector v_a) and the judgment of input safety (benign vector v_b) as nearly orthogonal directions, treating them as independent processes. We propose LLM-VA, which aligns v_a with v_b through closed-form weight updates, making the model’s willingness to respond causally dependent on its safety assessment, without fine-tuning or architectural changes. Our method identifies vectors at each layer using SVMs, selects safety-relevant layers, and iteratively aligns vectors via minimum-norm weight modifications. Experiments on 12 LLMs demonstrate that LLM-VA achieves 11.45% higher F1 than the best baseline while preserving 95.92% utility, and automatically adapts to each model’s safety bias without manual tuning. Code and models are available at https://hotbento.github.io/LLM-VA-Web/.
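
As a rough illustration of what a "closed-form, minimum-norm weight modification" can look like, the sketch below uses the standard rank-one least-change update: the smallest Frobenius-norm change dW to a matrix W such that (W + dW) x = y. This is a generic linear-algebra construction on toy data, not the update rule from the paper; the function names, the 8-dimensional toy vectors, and the choice to map the benign direction onto the answer direction are all assumptions made for illustration.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def min_norm_update(W, x, y):
    """Smallest Frobenius-norm dW such that (W + dW) @ x == y.

    Generic rank-one least-change update; assumed here for illustration,
    not taken from the paper.
    """
    residual = y - W @ x
    return np.outer(residual, x) / (x @ x)

rng = np.random.default_rng(0)
d = 8

# Hypothetical toy directions: v_a (answer) and v_b (benign), orthogonal
# by construction to mimic the "nearly orthogonal" observation.
v_a = np.zeros(d)
v_a[0] = 1.0
v_b = np.zeros(d)
v_b[1] = 1.0

# Hypothetical layer weights.
W = rng.normal(size=(d, d))

# Toy objective (an assumption): after the edit, the layer maps the
# benign direction onto the answer direction.
dW = min_norm_update(W, v_b, v_a)
W_new = W + dW

# Any input orthogonal to v_b is left untouched by the rank-one update.
u = np.zeros(d)
u[2] = 1.0

before = cosine(W @ v_b, v_a)
after = cosine(W_new @ v_b, v_a)
unchanged = np.allclose(W_new @ u, W @ u)
```

In this toy setting `after` is 1 up to floating-point error, while inputs orthogonal to `v_b` pass through the layer unchanged, which is the sense in which a minimum-norm edit is narrowly targeted.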

Anthology ID: 2026.acl-long.260
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 5760–5776
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.260/
Cite (ACL): Haonan Zhang, Dongxia Wang, Yi Liu, Kexin Chen, and Wenhai Wang. 2026. LLM-VA: Resolving the Jailbreak-Overrefusal Trade-off via Vector Alignment. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5760–5776, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): LLM-VA: Resolving the Jailbreak-Overrefusal Trade-off via Vector Alignment (Zhang et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.260.pdf
Checklist: 2026.acl-long.260.checklist.pdf
