Sycophantic Anchors: Localizing and Quantifying User Agreement in Reasoning Models

Jacek Duszenko, Przemyslaw Kazienko, Jan Kocon


Abstract
Reasoning models frequently agree with incorrect user suggestions, a behavior known as sycophancy. However, it is unclear where in the reasoning trace this agreement originates and how strong the commitment is. We introduce sycophantic anchors: sentences, identified via counterfactual analysis, that commit models to user agreement. Across four reasoning models spanning three architecture families (Llama, Qwen, Falcon-hybrid) and 1.5B–8B parameters, we analyze over 200,000 counterfactual rollouts and show that linear probes reliably detect sycophantic anchors (74–85% balanced accuracy), outperforming text-only baselines at high commitment levels, confirming that the probes capture internal states beyond surface vocabulary. Regressors further predict commitment strength from activations (R² up to 0.74). We observe a consistent asymmetry: sycophancy leaves a stronger mechanistic footprint than correct reasoning. We also find that sycophancy builds gradually during generation rather than being determined by the prompt. These findings enable sentence-level detection and quantification of model misalignment mid-inference.
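For readers unfamiliar with the two metrics quoted in the abstract, the sketch below shows how balanced accuracy (reported for the probes) and R² (reported for the regressors) are conventionally computed. It is a minimal illustration using the standard definitions and hypothetical toy values, not the authors' evaluation code; the function names `balanced_accuracy` and `r_squared` are assumptions introduced here.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so a rare class counts as much as a common one."""
    hits = defaultdict(int)    # correct predictions per true class
    totals = defaultdict(int)  # number of examples per true class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy data: 1 = sentence is an anchor, 0 = it is not.
labels = [1, 1, 1, 1, 0, 0]
preds = [1, 1, 1, 0, 0, 1]
print(balanced_accuracy(labels, preds))  # recall 3/4 and 1/2, averaged

# Hypothetical commitment strengths in [0, 1] and a regressor's predictions.
strength = [0.0, 0.5, 1.0]
print(r_squared(strength, [0.1, 0.5, 0.9]))
```

Balanced accuracy is the natural choice when anchor sentences are much rarer than ordinary sentences, since plain accuracy would reward a classifier that always predicts the majority class.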
Anthology ID:
2026.acl-srw.20
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Santosh T.Y.S.S., Juan Diego Rodriguez, Ona de Gibert
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
225–239
URL:
https://preview.aclanthology.org/ingestion-form-platform/2026.acl-srw.20/
Cite (ACL):
Jacek Duszenko, Przemyslaw Kazienko, and Jan Kocon. 2026. Sycophantic Anchors: Localizing and Quantifying User Agreement in Reasoning Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 225–239, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Sycophantic Anchors: Localizing and Quantifying User Agreement in Reasoning Models (Duszenko et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingestion-form-platform/2026.acl-srw.20.pdf