Sycophantic Anchors: Localizing and Quantifying User Agreement in Reasoning Models

Jacek Duszenko, Przemyslaw Kazienko, Jan Kocon


Abstract
Reasoning models frequently agree with incorrect user suggestions, a behavior known as sycophancy. However, it is unclear where in the reasoning trace this agreement originates and how strong the commitment is. We introduce sycophantic anchors: sentences, identified via counterfactual analysis, that commit models to user agreement. Across four reasoning models spanning three architecture families (Llama, Qwen, Falcon-hybrid) and 1.5B–8B parameters, we analyze over 200,000 counterfactual rollouts and show that linear probes reliably detect sycophantic anchors (74–85% balanced accuracy), outperforming text-only baselines at high commitment levels and confirming that the probes capture internal states beyond surface vocabulary. Regressors further predict commitment strength from activations (R² up to 0.74). We observe a consistent asymmetry: sycophancy leaves a stronger mechanistic footprint than correct reasoning. We also find that sycophancy builds gradually during generation rather than being determined by the prompt. These findings enable sentence-level detection and quantification of model misalignment mid-inference.
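The abstract reports two metrics, balanced accuracy for anchor detection and R² for commitment-strength regression. The sketch below is only an illustration of how these two standard metrics are defined; it is not the authors' evaluation code. The function names, the 0/1 labeling (1 = anchor sentence), and all toy values are assumptions made up for the example.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of per-class recall; 0.5 is chance level for two classes."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy data (assumption): 1 = anchor sentence, 0 = other sentence.
labels = [1, 1, 0, 0, 0, 0, 0, 0]
preds = [1, 0, 0, 0, 0, 0, 0, 1]

# Hypothetical commitment-strength targets and regressor outputs (assumption).
strength_true = [0.0, 0.5, 1.0]
strength_pred = [0.1, 0.5, 0.9]

print(balanced_accuracy(labels, preds))
print(r_squared(strength_true, strength_pred))
```

On this toy data, plain accuracy is 6/8 = 0.75, while balanced accuracy is the mean of the two class recalls (1/2 and 5/6), roughly 0.67. Balanced accuracy is the more informative figure when one class, such as anchor sentences here, is much rarer than the other.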
Anthology ID:
2026.acl-srw.20
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Santosh T.Y.S.S., Juan Diego Rodriguez, Ona de Gibert
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
225–239
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-srw.20/
Cite (ACL):
Jacek Duszenko, Przemyslaw Kazienko, and Jan Kocon. 2026. Sycophantic Anchors: Localizing and Quantifying User Agreement in Reasoning Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), pages 225–239, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Sycophantic Anchors: Localizing and Quantifying User Agreement in Reasoning Models (Duszenko et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-srw.20.pdf