You Only Need One Single Token to Refine Safety Alignment
Wenqian Yu, Shuo Chen, Zhijiang Li, Zhipeng Wang, Jindong Gu
Abstract
Large language models (LLMs) face a critical alignment challenge: balancing safety with helpfulness. Excessive safety can lead to over-refusal, where models reject harmful-looking yet benign queries, severely limiting utility. Existing training-free interventions offer an efficient way to mitigate over-refusal without re-training, but suffer from high inference overhead and architecture dependency. Our work explores a complementary direction: rather than applying post-hoc corrections to model outputs, our goal is to intrinsically reshape the distributions of harmful and benign samples within the model’s decision space. In this paper, we argue that a lightweight training-based approach can more effectively distinguish between harmful and benign samples. We propose Single Token Alignment (STA), which optimizes only a single-token prefix (e.g., 4,096 parameters) while keeping the base model frozen. To address the inherent challenge of achieving robust refinement through such a minimal parameter interface, STA employs a mixed weighting mechanism integrated with its optimization objective. This mechanism incorporates hard weighting via stringent data filtering to provide clear, unbiased learning signals, and soft weighting through a focal mechanism to prioritize challenging cases. Extensive experiments across 9 models and 10 datasets demonstrate that STA achieves a superior safety-helpfulness balance for LLMs, MLLMs, and reasoning models, offering a highly efficient and generalizable solution for refining safety alignment.
- Anthology ID:
- 2026.findings-acl.662
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 13529–13545
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.662/
- Cite (ACL):
- Wenqian Yu, Shuo Chen, Zhijiang Li, Zhipeng Wang, and Jindong Gu. 2026. You Only Need One Single Token to Refine Safety Alignment. In Findings of the Association for Computational Linguistics: ACL 2026, pages 13529–13545, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- You Only Need One Single Token to Refine Safety Alignment (Yu et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.662.pdf
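
The abstract names three ingredients: a single-token trainable prefix, hard weighting by data filtering, and soft weighting through a focal mechanism. The following is a minimal sketch of how these pieces could fit together, not the authors' implementation. It assumes that the "4,096 parameters" example corresponds to one prefix embedding in a model with hidden size 4,096, and that the focal mechanism follows the standard focal-loss form, -(1 - p)^gamma * log(p). The function names, the `gamma` default, and the `keep` predicate are hypothetical.

```python
import math

def prefix_param_count(hidden_size, num_prefix_tokens=1):
    """Trainable parameters in a soft prefix: one embedding vector per token.
    Assumption: the base model is frozen, so nothing else is trained."""
    return hidden_size * num_prefix_tokens

def hard_filter(samples, keep):
    """Hard weighting: samples failing the (hypothetical) filter get weight 0,
    which is the same as dropping them from training."""
    return [s for s in samples if keep(s)]

def focal_weight(p_correct, gamma=2.0):
    """Soft weighting: down-weight samples the model already handles well."""
    return (1.0 - p_correct) ** gamma

def focal_loss(p_correct, gamma=2.0):
    """Standard focal loss for one sample, given the probability assigned
    to the correct decision (comply vs. refuse). Assumed form, not the
    paper's exact objective."""
    if not 0.0 < p_correct <= 1.0:
        raise ValueError("p_correct must be in (0, 1]")
    return -focal_weight(p_correct, gamma) * math.log(p_correct)

if __name__ == "__main__":
    print(prefix_param_count(4096))          # size of a one-token prefix
    for p in (0.95, 0.6, 0.2):               # easy, medium, hard sample
        print(p, round(focal_loss(p), 4))
```

With `gamma=0` the focal loss reduces to ordinary cross-entropy; larger values of `gamma` shift more of the training signal toward the challenging cases, which is the role the abstract assigns to soft weighting.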