You Only Need One Single Token to Refine Safety Alignment

Wenqian Yu, Shuo Chen, Zhijiang Li, Zhipeng Wang, Jindong Gu


Abstract
Large language models (LLMs) face a critical alignment challenge: balancing safety with helpfulness. Excessive safety can lead to over-refusal, where models reject harmful-looking yet benign queries, severely limiting utility. Existing training-free interventions offer an efficient way to mitigate over-refusal without re-training, but suffer from high inference overhead and architecture dependency. Our work explores a complementary direction: rather than applying post-hoc corrections to model outputs, our goal is to intrinsically reshape the distributions of harmful and benign samples within the model’s decision space. In this paper, we argue that a lightweight training-based approach can more effectively distinguish between harmful and benign samples. We propose Single Token Alignment (STA), which optimizes only a single-token prefix (e.g., 4,096 parameters) while keeping the base model frozen. To address the inherent challenge of achieving robust refinement through such a minimal parameter interface, STA employs a mixed weighting mechanism integrated with its optimization objective. This mechanism incorporates hard weighting via stringent data filtering to provide clear, unbiased learning signals, and soft weighting through a focal mechanism to prioritize challenging cases. Extensive experiments across 9 models and 10 datasets demonstrate that STA achieves a superior safety-helpfulness balance for LLMs, MLLMs, and reasoning models, offering a highly efficient and generalizable solution for refining safety alignment.
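The mixed weighting idea described in the abstract can be illustrated with a small sketch. This is not the authors' objective, which the abstract does not specify. It assumes the standard focal-loss form, (1 - p_t)^gamma * (-log p_t), for the soft weighting, and a 0/1 keep mask standing in for the hard weighting via data filtering. The names `focal_weight`, `mixed_weighted_loss`, `keep`, and `gamma` are hypothetical, and reading "4,096 parameters" as one prefix token times a hidden size of 4,096 is likewise an assumption.

```python
import math

# Assumption: one trainable prefix token whose width equals a hidden size of 4096,
# which would account for the "4,096 parameters" mentioned in the abstract.
PREFIX_TOKENS = 1
HIDDEN_SIZE = 4096
TRAINABLE_PARAMS = PREFIX_TOKENS * HIDDEN_SIZE


def focal_weight(p_correct, gamma=2.0):
    """Soft weight: near 0 for confident correct predictions, near 1 for hard cases."""
    return (1.0 - p_correct) ** gamma


def mixed_weighted_loss(probs, labels, keep, gamma=2.0):
    """Average focal-weighted cross-entropy over the samples that pass the filter.

    probs  : predicted probability of the "harmful" class for each sample
    labels : 1 for harmful, 0 for benign
    keep   : hard weights (1 = sample passed filtering, 0 = dropped); hypothetical
    """
    total, kept = 0.0, 0
    for p, y, k in zip(probs, labels, keep):
        if not k:
            continue  # hard weighting: filtered samples contribute nothing
        p_t = p if y == 1 else 1.0 - p      # probability assigned to the true class
        p_t = min(max(p_t, 1e-12), 1.0)     # guard against log(0)
        total += focal_weight(p_t, gamma) * -math.log(p_t)
        kept += 1
    return total / kept if kept else 0.0
```

In an actual training loop, only the prefix embedding would receive gradients from a loss of this kind while the base model stays frozen, as the abstract describes; the sketch covers the weighting arithmetic only.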
Anthology ID:
2026.findings-acl.662
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
13529–13545
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.662/
Cite (ACL):
Wenqian Yu, Shuo Chen, Zhijiang Li, Zhipeng Wang, and Jindong Gu. 2026. You Only Need One Single Token to Refine Safety Alignment. In Findings of the Association for Computational Linguistics: ACL 2026, pages 13529–13545, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
You Only Need One Single Token to Refine Safety Alignment (Yu et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.662.pdf
Checklist:
 2026.findings-acl.662.checklist.pdf