Understanding How CodeLLMs (Mis)Predict Types with Activation Steering

Francesca Lucchetti, Arjun Guha


Abstract
Large Language Models (LLMs) are widely used by software engineers for programming tasks. However, research shows that LLMs often lack a deep understanding of program semantics: even minor syntactic changes, such as renaming variables, can significantly degrade performance across a variety of tasks. In this work, we examine the task of *type prediction*: given a partially typed program, can a model predict a missing type annotation such that the resulting program is more fully typed? We construct a dataset of adversarial examples where models initially predict the correct types but begin to fail after semantically irrelevant edits. This is problematic, as models should ideally generalize across different syntactic forms of semantically equivalent code; the lack of robustness suggests that models may have only a shallow understanding of code semantics. Despite this, we provide evidence that LLMs do, in fact, learn robust mechanisms for type prediction, though these mechanisms often fail to activate in adversarial scenarios. By using *activation steering*, a method that manipulates a model's internal activations to guide it toward using latent knowledge, we restore accurate predictions on adversarial inputs. We show that steering successfully activates a type prediction mechanism shared by both Python and TypeScript, and that it is more effective than prompting with in-context examples. Across five different models, our comprehensive evaluation demonstrates that LLMs can learn generalizable representations of code semantics that transfer across programming languages.
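For readers unfamiliar with the technique, the sketch below illustrates one common form of activation steering: a steering vector is computed as the mean difference between hidden-state activations on clean prompts and on adversarially renamed prompts, then added back at one layer via a forward hook while the model completes a type annotation. This is a minimal illustration under stated assumptions, not the paper's actual setup; the model name, layer index, module path, and example prompts are hypothetical.

```python
# Minimal sketch of activation steering for type prediction, assuming a
# GPT-2-style layer layout (model.transformer.h). Model name, layer index,
# and prompts are illustrative assumptions, not taken from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "bigcode/starcoderbase-1b"  # assumption: any causal CodeLLM with transformer.h
LAYER = 12                          # assumption: a middle layer of the model

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def last_token_hidden(prompt: str, layer: int) -> torch.Tensor:
    """Hidden-state activation of the final prompt token at the given layer."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[layer][0, -1, :]

# Steering vector: mean activation difference between prompts the model handles
# correctly and the same prompts after semantically irrelevant renamings.
clean = ["def add(x: int, y: int) -> "]
adversarial = ["def add(qx7: int, zw9: int) -> "]  # renamed parameters
steer = (torch.stack([last_token_hidden(p, LAYER) for p in clean]).mean(0)
         - torch.stack([last_token_hidden(p, LAYER) for p in adversarial]).mean(0))

def add_steer(module, inputs, output):
    # Decoder blocks return a tuple whose first element is the hidden states;
    # add the steering vector at every position and pass the rest through.
    hidden = output[0] + steer.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steer)

# Generate a type annotation for the adversarial prompt with steering active.
ids = tok(adversarial[0], return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=3, do_sample=False)
print(tok.decode(out[0][ids["input_ids"].shape[1]:]))
handle.remove()
```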
Anthology ID:
2025.blackboxnlp-1.22
Volume:
Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Yonatan Belinkov, Aaron Mueller, Najoung Kim, Hosein Mohebbi, Hanjie Chen, Dana Arad, Gabriele Sarti
Venues:
BlackboxNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
358–397
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.blackboxnlp-1.22/
Cite (ACL):
Francesca Lucchetti and Arjun Guha. 2025. Understanding How CodeLLMs (Mis)Predict Types with Activation Steering. In Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 358–397, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Understanding How CodeLLMs (Mis)Predict Types with Activation Steering (Lucchetti & Guha, BlackboxNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.blackboxnlp-1.22.pdf