Constructing Your Model’s Value Distinction: Towards LLM Alignment with Anchor Words Tuning
Zhen Yang, Ping Jian, Chengzhi Li, Chenxu Wang, Xinyue Zhang, Wenpeng Lu
Abstract
With the widespread application of large language models (LLMs), aligning LLMs with human values has emerged as a critical challenge. For alignment, we expect LLMs to be honest, positive, harmless, and so on, and LLMs do appear capable of generating the desired outputs after alignment tuning, such as preference tuning via reinforcement learning from human feedback (RLHF). However, this raises a question: **after alignment, do LLMs genuinely obtain a value distinction between positives and negatives, beyond merely generating positive outputs?** In this work, we investigate this question from the token distribution perspective. Our findings reveal that, compared to their unaligned versions, aligned LLMs exhibit a larger logits gap between positive and negative tokens at each generation step, which suggests that LLMs do obtain a value distinction between positives and negatives after alignment. This also motivates us to achieve alignment by directly constructing such a value distinction, thereby alleviating the heavy reliance on computational resources required by training-time alignment. Specifically, we propose a representation editing method that intervenes on the last hidden representation to amplify the logits difference between positive and negative tokens (defined as anchor words). Experimental results demonstrate that the proposed method not only achieves effective alignment but also requires fewer computational resources than training-time alignment methods.
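The abstract describes a decoding-time intervention on the last hidden representation. The snippet below is a minimal sketch of that general idea, not the authors' exact procedure: it assumes GPT-2 as a stand-in model, a hand-picked (hypothetical) set of anchor words, and an arbitrarily chosen intervention strength, and it shifts the output of the final transformer block along the direction separating positive-anchor from negative-anchor unembedding rows, which widens their logits gap at each generation step.

```python
# Minimal sketch of a decoding-time "anchor word" intervention (not the paper's exact method).
# Assumptions: GPT-2 as a stand-in model, hand-picked anchor words, arbitrary strength alpha.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

POSITIVE_ANCHORS = ["helpful", "honest", "safe"]      # hypothetical positive anchor words
NEGATIVE_ANCHORS = ["harmful", "dishonest", "toxic"]  # hypothetical negative anchor words

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

W_U = model.get_output_embeddings().weight  # (vocab_size, hidden_size) unembedding matrix

def anchor_direction(words):
    # Average the unembedding rows of each word's first sub-token.
    ids = [tok(w, add_special_tokens=False)["input_ids"][0] for w in words]
    return W_U[ids].mean(dim=0)

# Direction that raises positive-anchor logits relative to negative-anchor logits.
delta = (anchor_direction(POSITIVE_ANCHORS) - anchor_direction(NEGATIVE_ANCHORS)).detach()
alpha = 4.0  # intervention strength; would need tuning in practice

def widen_value_gap(module, inputs, output):
    # Forward hook on the last transformer block: shift its hidden states along `delta`.
    # (The shift lands just before the final layer norm, so the logit effect is approximate.)
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * delta / delta.norm()
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[-1].register_forward_hook(widen_value_gap)

prompt = "How should I reply to an angry customer?"
ids = tok(prompt, return_tensors="pt")
out = model.generate(**ids, max_new_tokens=40, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # restore the unedited model
```

Because each token's logit is a dot product between its unembedding row and the final hidden state, adding the difference of anchor directions to that state increases the positive-versus-negative logits gap at every step without updating any model weights, which is what makes this kind of editing cheaper than training-time alignment.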
- Anthology ID: 2025.findings-emnlp.317
- Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
- Month: November
- Year: 2025
- Address: Suzhou, China
- Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 5932–5948
- URL: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.317/
- DOI: 10.18653/v1/2025.findings-emnlp.317
- Cite (ACL): Zhen Yang, Ping Jian, Chengzhi Li, Chenxu Wang, Xinyue Zhang, and Wenpeng Lu. 2025. Constructing Your Model’s Value Distinction: Towards LLM Alignment with Anchor Words Tuning. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5932–5948, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal): Constructing Your Model’s Value Distinction: Towards LLM Alignment with Anchor Words Tuning (Yang et al., Findings 2025)
- PDF: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.317.pdf