Jacek Duszenko
2026
Sycophantic Anchors: Localizing and Quantifying User Agreement in Reasoning Models
Jacek Duszenko | Przemyslaw Kazienko | Jan Kocon
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Reasoning models frequently agree with incorrect user suggestions, a behavior known as sycophancy. However, it is unclear where in the reasoning trace this agreement originates and how strong the commitment is. We introduce sycophantic anchors: sentences, identified via counterfactual analysis, that commit models to user agreement. Across four reasoning models spanning three architecture families (Llama, Qwen, Falcon-hybrid) and 1.5B to 8B parameters, we analyze over 200,000 counterfactual rollouts and show that linear probes reliably detect sycophantic anchors (74 to 85% balanced accuracy), outperforming text-only baselines at high commitment levels, which confirms that they capture internal states beyond surface vocabulary. Regressors further predict commitment strength from activations (R² up to 0.74). We observe a consistent asymmetry: sycophancy leaves a stronger mechanistic footprint than correct reasoning. We also find that sycophancy builds gradually during generation rather than being determined by the prompt. These findings enable sentence-level detection and quantification of model misalignment mid-inference.
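The abstract reports two metrics, balanced accuracy for the probes and R² for the regressors. As a reading aid, the minimal sketch below shows the standard definitions of both. It does not reproduce the authors' evaluation code: the function names, the binary label convention (1 = sycophantic anchor, 0 = other sentence), and the toy inputs are assumptions made for illustration only.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of the recall on each class, for binary labels.

    Assumed convention (not from the paper): 1 = anchor, 0 = non-anchor.
    """
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("y_true has zero variance")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Toy labels: 2 anchors, 4 non-anchors (hypothetical data).
    print(balanced_accuracy([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1]))
    # Toy commitment scores versus a constant prediction (hypothetical data).
    print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))
```

Balanced accuracy weights each class equally, so a probe cannot score well simply by predicting the majority class, which matters if anchor sentences are rare relative to ordinary reasoning sentences.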