Self-Augmented Preference Alignment for Sycophancy Reduction in LLMs
Chien Hung Chen | Hen-Hsen Huang | Hsin-Hsi Chen
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Sycophancy causes models to produce answers that cater to user expectations rather than truthful responses. Such behavior can erode user trust by creating a perception of dishonesty or bias, leading users to question the reliability and objectivity of the system’s responses. Although Reinforcement Learning from Human Feedback (RLHF) is effective in aligning models with human preferences, previous studies have observed that it can simultaneously amplify sycophantic behavior. However, these studies focused primarily on proprietary models and relied on indirect analysis to demonstrate the influence of human feedback. Our study focuses on sycophancy in open-source models, which are more reproducible and transparent for research. We investigate the impact of human feedback on sycophancy by directly comparing models aligned with human feedback against their unaligned counterparts. To address sycophancy, we propose assessing the user’s expected answer rather than ignoring it. To this end, we develop the Sycophancy Answer Assessment (SAA) dataset and introduce Self-Augmented Preference Alignment, and we show that these methods effectively enhance the model’s assessment ability and significantly reduce sycophancy across tasks.