Constructing Your Model’s Value Distinction: Towards LLM Alignment with Anchor Words Tuning
Zhen Yang, Ping Jian, Chengzhi Li, Chenxu Wang, Xinyue Zhang, Wenpeng Lu
Abstract
With the widespread application of large language models (LLMs), aligning LLMs with human values has emerged as a critical challenge. For alignment, we expect LLMs to be honest, positive, harmless, and so on, and LLMs do appear capable of generating the desired outputs after alignment tuning, such as preference tuning via reinforcement learning from human feedback (RLHF). However, this raises a question: **after alignment, do LLMs genuinely obtain a value distinction between positives and negatives, beyond merely generating positive outputs?** In this work, we investigate this question from the token distribution perspective. Our findings reveal that, compared to their unaligned versions, aligned LLMs exhibit a larger logits gap between positive and negative tokens at each generation step, which suggests that LLMs do obtain a value distinction between positives and negatives after alignment. This also motivates us to achieve alignment by directly constructing such a value distinction, thereby alleviating the heavy reliance on computational resources required by training-time alignment. Specifically, we propose a representation editing method that intervenes on the last hidden representation to amplify the logits difference between positive and negative tokens (defined as anchor words). Experimental results demonstrate that the proposed method not only achieves effective alignment but also requires fewer computational resources than training-time alignment methods.
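The abstract describes a decoding-time intervention on the last hidden representation. The snippet below is a minimal sketch of that general idea, not the authors' exact procedure: it assumes GPT-2 as a stand-in model, a hand-picked (hypothetical) set of anchor words, and an arbitrarily chosen intervention strength, and it shifts the output of the final transformer block along the direction separating positive-anchor from negative-anchor unembedding rows, which widens their logits gap at each generation step.

```python
# Minimal sketch of a decoding-time "anchor word" intervention (not the paper's exact method).
# Assumptions: GPT-2 as a stand-in model, hand-picked anchor words, arbitrary strength alpha.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

POSITIVE_ANCHORS = ["helpful", "honest", "safe"]      # hypothetical positive anchor words
NEGATIVE_ANCHORS = ["harmful", "dishonest", "toxic"]  # hypothetical negative anchor words

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

W_U = model.get_output_embeddings().weight  # (vocab_size, hidden_size) unembedding matrix

def anchor_direction(words):
    # Average the unembedding rows of each word's first sub-token.
    ids = [tok(w, add_special_tokens=False)["input_ids"][0] for w in words]
    return W_U[ids].mean(dim=0)

# Direction that raises positive-anchor logits relative to negative-anchor logits.
delta = (anchor_direction(POSITIVE_ANCHORS) - anchor_direction(NEGATIVE_ANCHORS)).detach()
alpha = 4.0  # intervention strength; would need tuning in practice

def widen_value_gap(module, inputs, output):
    # Forward hook on the last transformer block: shift its hidden states along `delta`.
    # (The shift lands just before the final layer norm, so the logit effect is approximate.)
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * delta / delta.norm()
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[-1].register_forward_hook(widen_value_gap)

prompt = "How should I reply to an angry customer?"
ids = tok(prompt, return_tensors="pt")
out = model.generate(**ids, max_new_tokens=40, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # restore the unedited model
```

Because each token's logit is a dot product between its unembedding row and the final hidden state, adding the difference of anchor directions to that state increases the positive-versus-negative logits gap at every step without updating any model weights, which is what makes this kind of editing cheaper than training-time alignment.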
- Anthology ID: 2025.findings-emnlp.317
- Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
- Month: November
- Year: 2025
- Address: Suzhou, China
- Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 5932–5948
- URL: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.317/
- DOI: 10.18653/v1/2025.findings-emnlp.317
- Cite (ACL): Zhen Yang, Ping Jian, Chengzhi Li, Chenxu Wang, Xinyue Zhang, and Wenpeng Lu. 2025. Constructing Your Model’s Value Distinction: Towards LLM Alignment with Anchor Words Tuning. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5932–5948, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal): Constructing Your Model’s Value Distinction: Towards LLM Alignment with Anchor Words Tuning (Yang et al., Findings 2025)
- PDF: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.317.pdf