Koyena Pal
2025
Internal states before wait modulate reasoning patterns
Dmitrii Troitskii | Koyena Pal | Chris Wendler | Callum Stuart McDougall
Findings of the Association for Computational Linguistics: EMNLP 2025
Prior work has shown that a significant driver of performance in reasoning models is their ability to reason and self-correct. A distinctive marker in these reasoning traces is the token wait, which often signals reasoning behavior such as backtracking. Despite being such a complex behavior, little is understood about exactly why models do or do not decide to reason in this particular manner, which limits our understanding of what makes a reasoning model so effective. In this work, we address the question of whether a model’s latents preceding wait tokens contain information relevant for modulating the subsequent reasoning process. We train crosscoders at multiple layers of DeepSeek-R1-Distill-Llama-8B and its base version, and introduce a latent attribution technique in the crosscoder setting. We locate a small set of features relevant for promoting or suppressing the probability of wait tokens. Finally, through a targeted series of experiments analyzing max-activating examples and causal interventions, we show that many of our identified features are indeed relevant for the reasoning process and give rise to different types of reasoning patterns such as restarting from the beginning, recalling prior knowledge, expressing uncertainty, and double-checking.
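A minimal sketch of the attribution idea described in the abstract, under assumed names and toy sizes (not the paper's exact method): score each crosscoder latent by how strongly its decoder direction pushes the residual stream toward the logit of the wait token, then rank latents as promoting or suppressing.

```python
# Hypothetical sketch of latent attribution toward a "wait" logit.
# Assumptions (not from the paper): a decoder matrix W_dec of latent directions,
# latent activations f at a few token positions, and the unembedding row u_wait
# for the wait token. All tensors below are random stand-ins for illustration.
import torch

torch.manual_seed(0)
n_latents, d_model = 4096, 64
W_dec = torch.randn(n_latents, d_model)        # crosscoder decoder directions (assumed)
f = torch.relu(torch.randn(8, n_latents))      # latent activations at 8 positions (assumed)
u_wait = torch.randn(d_model)                  # unembedding direction of the wait token (assumed)

# Direct-effect attribution: activation * (decoder direction . unembedding direction).
direct_effect = W_dec @ u_wait                 # (n_latents,)
attribution = f * direct_effect                # (positions, n_latents)
scores = attribution.mean(dim=0)               # average attribution per latent

top_promoting = torch.topk(scores, k=5).indices
top_suppressing = torch.topk(-scores, k=5).indices
print("latents promoting 'wait':", top_promoting.tolist())
print("latents suppressing 'wait':", top_suppressing.tolist())
```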
2023
Future Lens: Anticipating Subsequent Tokens from a Single Hidden State
Koyena Pal | Jiuding Sun | Andrew Yuan | Byron Wallace | David Bau
Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)
We conjecture that hidden state vectors corresponding to individual input tokens encode information sufficient to accurately predict several tokens ahead. More concretely, in this paper we ask: Given a hidden (internal) representation of a single token at position t in an input, can we reliably anticipate the tokens that will appear at positions ≥ t + 2? To test this, we apply linear approximation and causal intervention methods to GPT-J-6B to evaluate the degree to which individual hidden states in the network contain signal rich enough to predict future hidden states and, ultimately, token outputs. We find that, at some layers, we can approximate a model’s output with more than 48% accuracy with respect to its prediction of subsequent tokens through a single hidden state. Finally, we present a “Future Lens” visualization that uses these methods to create a new view of transformer states.
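A minimal sketch of the linear-approximation idea in the abstract, with assumed toy data rather than real GPT-J-6B activations: fit a linear probe that maps a single hidden state at position t to the token observed at position t + 2.

```python
# Hypothetical sketch, not the exact Future Lens pipeline.
# Assumptions: H holds hidden states at position t, targets hold the token ids
# observed at t + 2; in the actual experiments these would come from GPT-J-6B.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, vocab, n_examples = 64, 1000, 512     # toy sizes for illustration

H = torch.randn(n_examples, d_model)           # stand-in hidden states at position t
targets = torch.randint(0, vocab, (n_examples,))  # stand-in tokens at position t + 2

probe = nn.Linear(d_model, vocab)              # linear approximation of the future prediction
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):
    opt.zero_grad()
    loss = loss_fn(probe(H), targets)
    loss.backward()
    opt.step()

acc = (probe(H).argmax(dim=-1) == targets).float().mean()
print(f"train accuracy of the future-token probe: {acc:.2f}")
```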
Co-authors
- David Bau 1
- Callum Stuart McDougall 1
- Jiuding Sun 1
- Dmitrii Troitskii 1
- Byron C. Wallace 1