Probing Bias Formation in Medical LLMs through Activation Steering

Bayram Ayadi, Annette Hautli-Janisz


Abstract
Large Language Models specialized for the medical domain achieve high performance on static benchmarks, but remain vulnerable to sycophantic confabulation, where the models generate medically spurious rationales to justify incorrect user hints. This robustness gap poses severe risks in clinical environments, as models may prioritize contextual faithfulness to a biased prompt over their internal parametric medical knowledge. This study introduces a mechanistic approach to identify and mitigate these failures in MedGemma-27B, isolating hint integration circuits using Sparse Autoencoders and geometric manifold analysis. Our findings reveal that sycophantic bias is a highly distributed and polymorphic concept, with biased reasoning routed through shifting dimensions across transformer layers. We identify the optimal layer for intervention and demonstrate that cluster-conditioned dynamic steering tailored to the geometric subspace of the prompt outperforms static global interventions, though it reveals a fundamental tension between bias resilience and the retention of internal parametric knowledge. This work proposes a principled framework toward clinical AI systems that are more robust and aligned with expert medical logic, demonstrating the potential of cluster-conditioned geometric interventions while characterizing the inherent trade-offs in clinical knowledge retention.
Anthology ID:
2026.acl-srw.54
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Santosh T.Y.S.S., Juan Diego Rodriguez, Ona de Gibert
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
609–620
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-srw.54/
Cite (ACL):
Bayram Ayadi and Annette Hautli-Janisz. 2026. Probing Bias Formation in Medical LLMs through Activation Steering. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), pages 609–620, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Probing Bias Formation in Medical LLMs through Activation Steering (Ayadi & Hautli-Janisz, ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-srw.54.pdf