Patches of Nonlinearity: Instruction Vectors in Large Language Models

Irina Bigoulaeva; Jonas Rohweder; Subhabrata Dutta; Iryna Gurevych

Abstract

Despite the recent success of instruction-tuned language models and their ubiquitous usage, very little is known of how models process instructions internally. In this work, we address this gap from a mechanistic point of view by investigating how instruction-specific representations are constructed and utilized in different stages of post-training: Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Via causal mediation, we identify that instruction representation is fairly localized in models. These representations, which we call Instruction Vectors (IVs), demonstrate a curious juxtaposition of linear separability along with non-linear causal interaction, broadly questioning the scope of the linear representation hypothesis commonplace in mechanistic interpretability. To disentangle the non-linear causal interaction, we propose a novel method to localize information processing in language models that is free from the implicit linear assumptions of patching-based techniques. We find that, conditioned on the task representations formed in the early layers, different information pathways are selected in the later layers to solve that task, i.e., IVs act as circuit selectors.

Anthology ID: 2026.acl-long.559
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 12209–12262
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.559/
Cite (ACL): Irina Bigoulaeva, Jonas Rohweder, Subhabrata Dutta, and Iryna Gurevych. 2026. Patches of Nonlinearity: Instruction Vectors in Large Language Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12209–12262, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Patches of Nonlinearity: Instruction Vectors in Large Language Models (Bigoulaeva et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.559.pdf
Checklist: 2026.acl-long.559.checklist.pdf
