Sycophantic Anchors: Localizing and Quantifying User Agreement in Reasoning Models

Jacek Duszenko; Przemyslaw Kazienko; Jan Kocon

Abstract

Reasoning models frequently agree with incorrect user suggestions, a behavior known as sycophancy. However, it is unclear where in the reasoning trace this agreement originates and how strong the commitment is. We introduce sycophantic anchors: sentences, identified via counterfactual analysis, that commit models to user agreement. Across four reasoning models spanning three architecture families (Llama, Qwen, Falcon-hybrid) and 1.5B–8B parameters, we analyze over 200,000 counterfactual rollouts and show that linear probes reliably detect sycophantic anchors (74–85% balanced accuracy), outperforming text-only baselines at high commitment levels, confirming that they capture internal states beyond surface vocabulary. Regressors further predict commitment strength from activations (R² up to 0.74). We observe a consistent asymmetry: sycophancy leaves a stronger mechanistic footprint than correct reasoning. We also find that sycophancy builds gradually during generation rather than being determined by the prompt. These findings enable sentence-level detection and quantification of model misalignment mid-inference.
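The probe results above are reported as balanced accuracy, which is the mean of per-class recall and is commonly used when one class (here, presumably anchor sentences) is much rarer than the other. The following is a minimal sketch of that standard metric only; the function name, the toy labels, and the 1 = anchor / 0 = other encoding are illustrative assumptions and are not taken from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        # Indices of the examples whose true label is class c.
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical toy labels (1 = anchor sentence, 0 = other sentence).
# These are NOT data from the paper.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10                      # trivial baseline
probe_like = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]     # made-up predictions

print(balanced_accuracy(y_true, always_negative))
print(balanced_accuracy(y_true, probe_like))
```

On these toy labels, a baseline that always predicts the majority class is right on 8 of 10 examples yet scores only 0.5 balanced accuracy, which is why a figure in the 74–85% range is informative about detecting the rarer class.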

Anthology ID: 2026.acl-srw.20
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Santosh T.Y.S.S., Juan Diego Rodriguez, Ona de Gibert
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 225–239
URL: https://preview.aclanthology.org/ingestion-form-platform/2026.acl-srw.20/
Cite (ACL): Jacek Duszenko, Przemyslaw Kazienko, and Jan Kocon. 2026. Sycophantic Anchors: Localizing and Quantifying User Agreement in Reasoning Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 225–239, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Sycophantic Anchors: Localizing and Quantifying User Agreement in Reasoning Models (Duszenko et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingestion-form-platform/2026.acl-srw.20.pdf