You Only Need One Single Token to Refine Safety Alignment

Wenqian Yu; Shuo Chen; Zhijiang Li; Zhipeng Wang; Jindong Gu

Abstract

Large language models (LLMs) face a critical alignment challenge: balancing safety with helpfulness. Excessive safety can lead to over-refusal, where models reject harmful-looking yet benign queries, severely limiting utility. Existing training-free interventions offer an efficient way to mitigate over-refusal without re-training, but suffer from high inference overhead and architecture dependency. Our work explores a complementary direction: rather than applying post-hoc corrections to model outputs, our goal is to intrinsically reshape the distributions of harmful and benign samples within the model’s decision space. In this paper, we argue that a lightweight training-based approach can more effectively distinguish between harmful and benign samples. We propose Single Token Alignment (STA), which optimizes only a single-token prefix (e.g., 4,096 parameters) while keeping the base model frozen. To address the inherent challenge of achieving robust refinement through such a minimal parameter interface, STA employs a mixed weighting mechanism integrated with its optimization objective. This mechanism incorporates hard weighting via stringent data filtering to provide clear, unbiased learning signals, and soft weighting through a focal mechanism to prioritize challenging cases. Extensive experiments across 9 models and 10 datasets demonstrate that STA achieves a superior safety-helpfulness balance for LLMs, MLLMs, and reasoning models, offering a highly efficient and generalizable solution for refining safety alignment.
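
The abstract does not spell out the form of the focal mechanism used for soft weighting. As a rough illustration only, the sketch below uses the standard focal-loss form, in which the per-example negative log-likelihood is scaled by `(1 - p) ** gamma` so that confidently correct examples contribute little and hard examples dominate. The function names, the `gamma` default, and the choice of this particular form are assumptions for illustration, not the authors' objective.

```python
import math


def focal_weight(p_correct: float, gamma: float = 2.0) -> float:
    """Soft weight that shrinks toward 0 as the model grows confident.

    p_correct is the probability the model assigns to the correct label.
    gamma is a hypothetical focusing parameter (assumed, not from the paper).
    """
    return (1.0 - p_correct) ** gamma


def focal_loss(p_correct: float, gamma: float = 2.0) -> float:
    """Focal-weighted negative log-likelihood for a single example."""
    eps = 1e-12
    # Clamp to avoid log(0) on degenerate probabilities.
    p = min(max(p_correct, eps), 1.0 - eps)
    return focal_weight(p, gamma) * -math.log(p)


if __name__ == "__main__":
    # An easy example (p = 0.9) is down-weighted far more than a hard one (p = 0.3).
    for p in (0.9, 0.5, 0.3):
        print(p, focal_weight(p), focal_loss(p))
```

With `gamma = 0` the weight is 1 everywhere and the expression reduces to the plain negative log-likelihood; larger `gamma` shifts more of the training signal toward the challenging cases.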

Anthology ID: 2026.findings-acl.662
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 13529–13545
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.662/
Cite (ACL): Wenqian Yu, Shuo Chen, Zhijiang Li, Zhipeng Wang, and Jindong Gu. 2026. You Only Need One Single Token to Refine Safety Alignment. In Findings of the Association for Computational Linguistics: ACL 2026, pages 13529–13545, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): You Only Need One Single Token to Refine Safety Alignment (Yu et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.662.pdf
Checklist: 2026.findings-acl.662.checklist.pdf
