Self-Augmented Preference Alignment for Sycophancy Reduction in LLMs

Chien Hung Chen, Hen-Hsen Huang, Hsin-Hsi Chen


Abstract
Sycophancy causes models to produce answers that cater to user expectations rather than provide truthful responses. Sycophantic behavior in models can erode user trust by creating a perception of dishonesty or bias. This lack of authenticity may lead users to question the reliability and objectivity of the system’s responses. Although Reinforcement Learning from Human Feedback (RLHF) is effective in aligning models with human preferences, previous studies have observed that it can simultaneously amplify sycophantic behavior. However, these studies primarily focused on proprietary models and employed indirect analysis to demonstrate the influence of human feedback. Our study focuses on sycophancy in open-source models, which are more reproducible and transparent for research. We investigated the impact of human feedback on sycophancy by directly comparing models aligned with human feedback to those not aligned. To address sycophancy, we proposed assessing the user’s expected answer rather than ignoring it. Consequently, we developed the Sycophancy Answer Assessment (SAA) dataset and introduced Self-Augmented Preference Alignment, demonstrating that these methods effectively enhance the model’s assessment ability and significantly reduce sycophancy across tasks.
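The SAA dataset format and the Self-Augmented Preference Alignment training recipe are detailed in the paper itself. As a rough, hypothetical illustration of the idea stated in the abstract (preferring responses that assess the user's expected answer over responses that merely agree with it), the Python sketch below builds DPO-style chosen/rejected pairs. The function names, prompt templates, and pair format are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of anti-sycophancy preference-pair construction.
# All names and templates here are illustrative assumptions; the paper's
# actual SAA dataset and alignment procedure are not specified on this page.

from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str    # question plus the user's stated expected answer
    chosen: str    # response that assesses the expectation against a reference
    rejected: str  # response that simply caters to the expectation


def build_preference_pair(question: str, user_expected: str,
                          reference_answer: str) -> PreferencePair:
    """Prefer a response that evaluates the user's expected answer over one
    that blindly agrees with it (assumed DPO-style chosen/rejected format)."""
    prompt = f"{question}\nI think the answer is {user_expected}."
    if user_expected.strip().lower() == reference_answer.strip().lower():
        chosen = f"Your answer is correct: {reference_answer}."
    else:
        chosen = (f"Your suggested answer ({user_expected}) does not match the "
                  f"evidence; the answer is {reference_answer}.")
    # The sycophantic completion agrees with the user regardless of correctness.
    rejected = f"Yes, you are right, the answer is {user_expected}."
    return PreferencePair(prompt=prompt, chosen=chosen, rejected=rejected)


if __name__ == "__main__":
    pair = build_preference_pair(
        "What is the capital of Australia?", "Sydney", "Canberra")
    print(pair.chosen)
    print(pair.rejected)
```

Pairs in this form could then feed a standard preference-optimization objective; whether the paper uses DPO or another alignment method is not stated on this page.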
Anthology ID:
2025.emnlp-main.625
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
12390–12402
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.625/
Cite (ACL):
Chien Hung Chen, Hen-Hsen Huang, and Hsin-Hsi Chen. 2025. Self-Augmented Preference Alignment for Sycophancy Reduction in LLMs. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12390–12402, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Self-Augmented Preference Alignment for Sycophancy Reduction in LLMs (Chen et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.625.pdf
Checklist:
 2025.emnlp-main.625.checklist.pdf