\documentclass[11pt]{article}

% Change "review" to "final" to generate the final (sometimes called camera-ready) version.
% Change to "preprint" to generate a non-anonymous version with page numbers.
\usepackage[final]{acl}
\usepackage{pgfplots}
\pgfplotsset{compat=1.18}
% Standard package includes
\usepackage{times}
\usepackage{latexsym}

% For proper rendering and hyphenation of words containing Latin characters (including in bib files)
\usepackage[T1]{fontenc}
% For Vietnamese characters
% \usepackage[T5]{fontenc}
% See https://www.latex-project.org/help/documentation/encguide.pdf for other character sets

% This assumes your files are encoded as UTF8
\usepackage[utf8]{inputenc}

% This is not strictly necessary, and may be commented out,
% but it will improve the layout of the manuscript,
% and will typically save some space.
\usepackage{microtype}

% This is also not strictly necessary, and may be commented out.
% However, it will improve the aesthetics of text in
% the typewriter font.
\usepackage{inconsolata}

%Including images in your LaTeX document requires adding
%additional package(s)
\usepackage{graphicx}
\usepackage{amsmath} 
% If the title and author information does not fit in the area allocated, uncomment the following
%
%\setlength\titlebox{<dim>}
%
% and set <dim> to something 5cm or larger.

\title{SteerForce at SemEval-2026 Task 11: Reducing Content Effects Using Layered Activation Steering}


\author{
  \textbf{Noah Tratzsch}$^1$, 
  \textbf{Asmaa Al-Raian}$^1$, 
  \textbf{Mounika Marreddy}$^1$, 
  \textbf{Alexander Mehler}$^1$ \\
  $^1$Goethe University, Frankfurt am Main, Germany \\
  \texttt{\small ntratzsch@stud.uni-frankfurt.de, s6380199@rz.uni-frankfurt.de} \\
  \texttt{\small mmarredd@em.uni-frankfurt.de, mehler@em.uni-frankfurt.de}
}


\begin{document}
\maketitle

%We study the mitigation of content-driven bias in neural reasoning models using inference-time activation steering. Focusing on SemEval-2026 Task~11, we model bias as semantic interference between plausibility heuristics and formal logical validity. We propose a sequential steering pipeline that combines a stabilizing activation transport step (K-ACT) with a targeted, input-adaptive correction (K-CAST) applied to mid-to-late layers identified via layer sensitivity analysis. On Bert-base-uncased, this approach improves validity accuracy by 5.21\% while reducing directional content bias by 75\%, without retraining the model. Our results reveal a clear architectural difference: encoder models benefit from distributed, multi-layer steering, while Qwen1.5B model LLMs respond best to localized interventions in late layers. This suggests that bias mitigation strategies must be adapted to the representational dynamics of each model family.


\begin{abstract}
Large language models exhibit content effects, where surface plausibility interferes with formal logical reasoning. In SemEval-2026 Task 11, this appears as a performance gap between plausibility-aligned and plausibility-conflicting syllogisms, reflecting directional content bias. We address this issue using inference-time activation steering, modeling bias as a geometric deviation between plausibility-driven and validity-driven representations. We introduce a layered steering framework that combines Activation Transport (ACT) with input-adaptive contrastive steering (K-CAST), applied to layers identified through sensitivity analysis. This architecture-aware strategy enables targeted interventions without retraining.

On BERT, sequential multi-layer steering improves validity accuracy from 77.1\% to 82.3\% while reducing bias by 75\%. In contrast, for the decoder-only Qwen2.5-1.5B-Instruct, a single mid-to-late layer intervention reduces bias from 0.26 to 0.04 with modest accuracy gains, whereas multi-layer steering offers no additional benefit. These results reveal a fundamental architectural distinction: encoder-based models benefit from distributed low-intensity corrections, while decoder-only instruction-tuned models concentrate reasoning signals within a narrow late-layer band. Our findings demonstrate that effective bias mitigation requires architecture-aware activation steering.
\end{abstract}


\iffalse
\section{Introduction}

Large Language Models (LLMs), including bidirectional architectures like BERT \citep{devlin-etal-2019-bert}, often fall prey to "content effects," where pre-existing biases or heuristics interfere with formal logical reasoning. This results in a performance gap between logical tasks that align with training-set heuristics and those that conflict with them \citep{valentino2025mitigatingcontenteffectsreasoning}.\\
While standard approaches to classification involve extensive fine-tuning \citep{sun2020finetuneberttextclassification}, recent research suggests that model behavior can be altered at inference time by manipulating internal latent-subspaces \citep{sharma2025steeringconceptualbiastransformer}. Building on the concept of contrastive activation addition \citep{panickssery2024steeringllama2contrastive}, we propose a steering pipeline that can be applied either sequentially across layers or locally at a single sensitive layer, depending on the model architecture.
Our method differentiates itself by employing a multi-layer strategy that combines activation transport \citep{rodriguez2024controllinglanguagediffusionmodels} with semantics-adaptive interventions \citep{wang2025semanticsadaptiveactivationinterventionllms} to neutralize bias without sacrificing classification accuracy.
\fi

%%%%%%%% INTRODUCTION %%%%%%%%%%
\section{Introduction}

Large Language Models (LLMs), including bidirectional architectures such as BERT~\cite{devlin-etal-2019-bert}, achieve strong performance across language understanding and reasoning tasks. However, they often exhibit \emph{content effects}, where surface plausibility or learned heuristics interfere with formal logical reasoning~\cite{valentino2025mitigatingcontenteffectsreasoning}. This results in systematic performance gaps between belief-consistent and belief-conflicting problems.

This limitation is central to SemEval-2026 Task 11~\cite{valentino-etal-2026-semeval}, which requires predicting the logical validity of natural-language syllogisms independently of their real-world plausibility. The task exposes how models may rely on semantic shortcuts rather than abstract logical structure.

While fine-tuning with additional supervision can partially mitigate such effects~\cite{sun2020finetuneberttextclassification}, it is computationally costly and does not directly control internal reasoning representations. In contrast, recent work shows that model behavior can be modified at inference time through activation steering, which manipulates latent representations within transformer layers~\cite{rimsky2024steering}.

We frame content bias as a geometric deviation in hidden representation space and propose a layered steering framework that combines Activation Transport (ACT)~\cite{rodriguez2024controllinglanguagediffusionmodels} with semantics-adaptive contrastive steering (K-CAST)~\cite{wang2025semanticsadaptiveactivationinterventionllms}. This sequential approach stabilizes global representations while applying input-specific corrections.

Crucially, we show that steering effectiveness depends on model architecture. Encoder-based models benefit from low-intensity sequential multi-layer interventions, reflecting distributed reasoning signals. In contrast, decoder-only instruction-tuned models concentrate bias-sensitive representations within a narrow late-layer band, where a single well-placed intervention suffices. These results highlight the need for architecture-aware activation steering to improve logical robustness without additional fine-tuning.

\iffalse
\section{Related Work}
Recent work has shown that large language models often rely on surface plausibility rather than formal logical structure when solving reasoning tasks. This phenomenon, known as content effects, leads to systematic errors when logical validity conflicts with common beliefs~\cite{valentino2025mitigating}. Instead of addressing this problem only through additional fine-tuning \cite{sun2019fine}, several studies suggest that reasoning behavior can be analyzed and modified directly in the model’s hidden representations \cite{sharma2025steering}. These findings motivate approaches that treat bias as a property of the model’s internal geometry rather than solely as a training limitation.

A growing line of research explores inference-time activation steering as a way to control model behavior without retraining. Contrastive Activation Addition (CAA) demonstrated that behavioral traits can be shifted by adding directions computed from contrasting activation clusters~\cite{rimsky2024steering}. Later work introduced more adaptive interventions that adjust steering based on input semantics~\cite{wang2024semantics}. In parallel, activation transport methods proposed smoother geometric transformations that align hidden states with reference distributions using controlled scaling and shifting operations~\cite{lopez2024controlling}. These approaches show that model outputs can be influenced by structured manipulation of latent space representations.

Our work builds on these ideas but differs in two important ways. First, we explicitly frame content bias in logical reasoning as a geometric deviation between plausibility-driven and validity-driven representations, and we combine activation transport with adaptive contrastive steering into a unified layered framework. Second, we show that the optimal steering strategy depends strongly on model architecture: encoder-based models benefit from distributed multi-layer interventions, whereas decoder-only instruction-tuned models respond best to a single well-placed intervention. This architecture-aware analysis extends prior activation steering research and provides practical guidance for mitigating reasoning bias in different transformer families.

%\textcolor{red}{more information about them is here: https://www.notion.so/Paper-Review-2d1b6bfca0fc8008979fd064b7cfc387}
\fi

\iffalse
\section{Methodology}
Our methodology follows a progressive analysis-and-intervention pipeline. We first establish encoder- and decoder-based baselines, extract internal representations, identify layer-specific sensitivity to content bias, and then apply increasingly structured activation steering interventions. Our interventions build on prior work on contrastive activation steering and activation transport~\cite{valentino2025mitigatingcontenteffectsreasoning,rodriguez2024controllinglanguagediffusionmodels,panickssery2024steeringllama2contrastive}.

\subsection{BERT-based Method}
For encoder-based experiments, we fine-tune \texttt{bert-base-uncased} with a dual-head classification setup for validity and plausibility prediction~\cite{devlin-etal-2019-bert,sun2020finetuneberttextclassification}. The shared encoder is intentionally exposed to both logical and heuristic signals, creating a controlled setting in which content bias can be observed. Plausibility predictions are used only during training and analysis, not at inference time.

\subsection{LLM-based Method}
For decoder-only models, we frame validity prediction as an instruction-following task and evaluate instruction-tuned models (TinyLlama and Qwen)~\cite{zhang2024tinyllamaopensourcesmalllanguage,qwen2025qwen25technicalreport}. No task-specific fine-tuning is performed; instead, all interventions operate directly on internal activations, allowing us to study and modify reasoning behavior without retraining.

\subsection{Hidden State Extraction}
To enable layer-wise analysis and intervention, we extract hidden states from all transformer layers. For each input, we record the hidden representation at a fixed token position (the [CLS] token for BERT and the final token for decoder models), which serves as the basis for both sensitivity analysis and activation steering.


\subsection{Layer Sensitivity Analysis}
We conduct a layer sensitivity analysis by applying a lightweight test intervention independently to each layer and measuring the resulting changes in accuracy and directional bias. This analysis consistently identifies a narrow set of upper-middle layers where steering is effective, while earlier layers show negligible effects and later layers become increasingly brittle to intervention, in line with prior findings~\cite{sharma2025steeringconceptualbiastransformer,valentino2025mitigatingcontenteffectsreasoning}.

\subsection{Global Contrastive Steering}
As an initial baseline, we apply global contrastive steering. We compute a contrastive direction in activation space as the difference between mean activations of agreement and conflict samples:
\begin{equation}
\Delta\phi = \mu^{+} - \mu^{-},
\label{eq:contrastive}
\end{equation}
where $\mu^{+}$ and $\mu^{-}$ denote the average hidden representations of plausibility-aligned and plausibility-conflicting examples, respectively. This direction is added uniformly to all samples with a fixed steering strength $\lambda$. While effective at reducing bias, this method lacks input adaptivity and often requires relatively large $\lambda$, which can degrade accuracy~\cite{panickssery2024steeringllama2contrastive}.

\subsection{K-CAST and ACT-based Steering}
\textbf{K-CAST} performs contrastive activation steering only when a test sample is classified via a k-nearest-neighbor (kNN) lookup in activation space, thereby adapting the steering direction on a per-input basis~\cite{valentino2025mitigatingcontenteffectsreasoning}. The kNN lookup is performed over a memory bank constructed from hidden-state activations of the \emph{training set}, which serves as a fixed key-value store during inference.\\
In our setup, $K$ denotes the number of nearest neighbors used for regime classification; low $K$ values (e.g., $K=3$--$5$) yield highly local, input-specific corrections, while larger $K$ values (e.g., $K\geq10$) produce smoother but less discriminative decisions and may dilute the steering signal.
\\
\textbf{ACT} (Activation Transport) modifies hidden states by gradually moving them toward a reference activation distribution using a linear scaling and shift. For a hidden activation $a$, the transported activation is computed as
\begin{equation}
T(a;\lambda) = (1-\lambda)a + \lambda(\omega a + \beta),
\label{eq:act}
\end{equation}
where $\lambda \in [0,1]$ controls the strength of the intervention. The parameters $\omega$ (scale) and $\beta$ (shift) are fixed, layer-specific values computed from training-set activation statistics to align bias-prone representations with agreement representations. This interpolation avoids abrupt changes to the hidden state and allows controlled, low-intensity activation updates without retraining the model~\cite{rodriguez2024controllinglanguagediffusionmodels}.


\subsection{K-ACT: Combined Steering}
\textbf{K-ACT} combines ACT and K-CAST in a two-stage update. First, ACT defines a stabilized transport target for the hidden state. The resulting transport delta $(T(a;\lambda)-a)$ is then applied only to samples selected by the K-CAST gating rule. By combining smooth transport with kNN-based adaptivity, K-ACT achieves effective bias mitigation at lower steering intensities than either method alone ~\cite{valentino2025mitigatingcontenteffectsreasoning,rodriguez2024controllinglanguagediffusionmodels}.


\subsection{Single-Layer Interventions}
All steering methods are first evaluated in isolation by applying them individually to each layer identified by the sensitivity analysis. This allows us to compare methods under identical conditions and to characterize how different interventions behave when constrained to a single representational depth.

\subsection{Sequential Steering}
Finally, we evaluate sequential steering strategies in which interventions are applied across multiple layers. We test different orderings, including repeated application of the same method and heterogeneous sequences such as ACT followed by K-CAST or their combination (K-ACT). A steering configuration is considered \emph{aggressive} when it relies on large steering intensities or late-layer interventions; empirically, such settings tend to reduce accuracy. In encoder-based models, distributing low-intensity interventions across multiple sensitive layers consistently yields better trade-offs between accuracy and bias reduction. 
However, for decoder-only instruction-tuned models, we observe that single-layer interventions already capture most of the achievable bias reduction, and sequential steering often provides no additional benefit.


\fi

\section{Methodology}

Our approach follows a structured analysis-and-intervention pipeline. 
We first establish encoder-based and decoder-based baselines, extract internal representations, identify layers that are sensitive to content bias, and then apply targeted activation steering interventions. 
Our methods build on prior work on contrastive activation steering and activation transport~\cite{valentino2025mitigatingcontenteffectsreasoning,rodriguez2024controllinglanguagediffusionmodels,panickssery2024steeringllama2contrastive}.

\subsection{Encoder-Based Model}

For encoder-based experiments, we fine-tune \texttt{bert-base-uncased} using a dual-head classification architecture for validity and plausibility prediction~\cite{devlin-etal-2019-bert,sun2020finetuneberttextclassification}. 
Both heads operate on the shared encoder representation of the \texttt{[CLS]} token. 

This setup intentionally exposes the encoder to both formal logical signals and heuristic plausibility cues. 
As a result, the latent space captures both reasoning-relevant and plausibility-driven information, providing a controlled setting in which content bias can be analyzed. 
At inference time, only the validity head is used.

\subsection{Decoder-Based Models}

For decoder-only models, we frame validity prediction as an instruction-following task. 
We evaluate instruction-tuned models, namely TinyLlama and Qwen~\cite{zhang2024tinyllamaopensourcesmalllanguage,qwen2025qwen25technicalreport}. 

No task-specific fine-tuning is performed. 
Instead, all modifications are applied at inference time by manipulating internal activations. 
This allows us to study and influence reasoning behavior without updating model parameters.

For TinyLlama, ACT is implemented as a projection-based geometric correction along the steering direction rather than full activation transport with learned scale and shift parameters. This formulation corresponds to removing the bias-aligned component of the hidden representation scaled by $\lambda$. The overall steering framework remains unchanged. In TinyLlama, the steering direction and kNN gating are computed using validity-labeled training activations.
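
One way to write this projection-based variant, as a sketch: let $d$ denote the steering direction and $\hat{d} = d / \lVert d \rVert$ its unit-norm version (the normalization and the symbols $d$, $\hat{d}$, and $a'$ are assumptions introduced here for illustration and are not specified above). Removing the component of a hidden state $a$ along $\hat{d}$, scaled by $\lambda$, would then give
\begin{equation*}
a' = a - \lambda\,(a^{\top}\hat{d})\,\hat{d}.
\end{equation*}
Under this reading, $a'^{\top}\hat{d} = (1-\lambda)\,a^{\top}\hat{d}$, so $\lambda = 0$ leaves $a$ unchanged and $\lambda = 1$ removes the component along $\hat{d}$ entirely, while all directions orthogonal to $\hat{d}$ are untouched.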

\subsection{Hidden State Extraction}

For layer-wise analysis and intervention, we extract hidden representations from all transformer layers. 
For each input, we record a single representation per layer: the hidden state of the \texttt{[CLS]} token for BERT and of the final token for decoder-only models. 
These representations form the basis for both sensitivity analysis and activation steering.

\subsection{Layer Sensitivity Analysis}

To identify layers that contribute most to directional content bias, we perform a layer sensitivity analysis. 
We apply a lightweight test intervention independently at each layer and measure the resulting change in accuracy and bias. 

Consistent with prior activation-steering findings~\cite{valentino2025mitigatingcontenteffectsreasoning}, we observe that steering effects are negligible in early layers, peak in upper-middle layers, and become unstable in the final layers. 
Subsequent interventions are therefore restricted to the most sensitive layers.

\subsection{Global Contrastive Steering}

As a baseline, we implement global contrastive steering following contrastive activation addition~\cite{panickssery2024steeringllama2contrastive}. 
We compute a steering direction in activation space as:

\begin{equation}
\Delta\phi = \mu^{+} - \mu^{-},
\label{eq:contrastive}
\end{equation}

where $\mu^{+}$ and $\mu^{-}$ denote the mean hidden representations of plausibility-aligned and plausibility-conflicting samples, respectively. 
The direction $\Delta\phi$ is added to all test representations with a fixed steering strength $\lambda$.

Although this approach can reduce bias, it applies the same correction to every input and often requires a large $\lambda$, which may degrade accuracy.
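
Written out as a sketch, and assuming the correction is added to the extracted hidden state $a$ at the steered layer (the symbol $a'$ for the steered state is notation introduced here for illustration), the update would read
\begin{equation*}
a' = a + \lambda\,\Delta\phi = a + \lambda\,(\mu^{+} - \mu^{-}).
\end{equation*}
Because neither $\lambda$ nor $\Delta\phi$ depends on $a$, every input is shifted by the same vector $\lambda\,\Delta\phi$, which makes the lack of input adaptivity explicit.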

\subsection{K-CAST and ACT}

\textbf{K-CAST.}  
K-CAST introduces input adaptivity by applying contrastive steering only when a $K$-nearest-neighbor (kNN) lookup in activation space assigns the test sample to a specific regime~\cite{valentino2025mitigatingcontenteffectsreasoning}. 
The memory bank consists of hidden representations from the training set. 
The parameter $K$ controls the locality of the intervention: smaller values yield highly input-specific corrections, while larger values produce smoother but less discriminative adjustments.
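
As an illustrative sketch of such a gating rule (the majority-vote form and the symbols $\mathcal{N}_K$, $r_i$, and $g$ are assumptions made here, not details given above): let $\mathcal{N}_K(a)$ be the indices of the $K$ memory-bank activations closest to the test activation $a$, and let $r_i \in \{0,1\}$ mark whether stored example $i$ belongs to the regime that triggers steering. A majority-vote gate would then be
\begin{equation*}
g(a) = \mathbf{1}\!\left[\frac{1}{K}\sum_{i \in \mathcal{N}_K(a)} r_i > \frac{1}{2}\right],
\end{equation*}
with contrastive steering applied only when $g(a) = 1$. Under this form, a small $K$ makes the vote depend on a few nearby examples, whereas as $K$ approaches the size of the memory bank the vote tends toward the global regime frequency and depends less and less on $a$.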

\textbf{ACT.}  
Activation Transport (ACT) performs a smooth geometric transformation of hidden states toward a reference activation distribution~\cite{rodriguez2024controllinglanguagediffusionmodels}. 
For a hidden activation $a$, the transported representation is defined as:

\begin{equation}
T(a;\lambda) = (1-\lambda)a + \lambda(\omega a + \beta),
\label{eq:act}
\end{equation}

where $\lambda \in [0,1]$ controls intervention strength, and $\omega$ (scale) and $\beta$ (shift) are layer-specific parameters computed from training activation statistics. 
This interpolation enables controlled, low-intensity updates without abrupt changes to the representation.

The steering parameters $\lambda$ and $K$ are selected empirically based on validation performance, with layer-wise tuning guided by the sensitivity analysis. We observe that larger values of $K$ tend to dilute the corrective signal, while excessively large values of $\lambda$ may suppress useful reasoning features.
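
As a sketch of what Equation~\ref{eq:act} implies, collecting the terms in $a$ gives
\begin{equation*}
T(a;\lambda) = a + \lambda\bigl((\omega - 1)\,a + \beta\bigr),
\end{equation*}
which holds as written for a scalar $\omega$; if $\omega$ acts element-wise (the text does not specify which), the $1$ should be read as a vector of ones. The limiting cases are $T(a;0) = a$ (no intervention) and $T(a;1) = \omega a + \beta$ (full transport), so $\lambda$ linearly scales the displacement $T(a;\lambda) - a$.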

\subsection{K-ACT: Combined Steering}

We combine ACT and K-CAST into a two-stage procedure termed \textbf{K-ACT}. 
First, ACT defines a transported target representation. 
Second, the resulting transport delta $(T(a;\lambda) - a)$ is applied only to samples selected by the K-CAST gating mechanism. 

This combination integrates global stabilization with input-specific correction and allows effective bias mitigation at lower steering strengths~\cite{valentino2025mitigatingcontenteffectsreasoning,rodriguez2024controllinglanguagediffusionmodels}.
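
The two stages can be summarized in a single expression, as a sketch: writing $g(a) \in \{0,1\}$ for the K-CAST gating decision (the symbols $g$ and $a'$ are notation introduced here for illustration), the update is
\begin{equation*}
a' = a + g(a)\,\bigl(T(a;\lambda) - a\bigr),
\end{equation*}
so that $a' = a$ when the sample is not selected and $a' = T(a;\lambda)$ when it is.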

\subsection{Single-Layer and Sequential Interventions}

We first evaluate all steering methods independently at each sensitive layer to assess their isolated effects. 
We then explore sequential steering across multiple layers. 

In encoder-based models, distributing low-intensity interventions across adjacent sensitive layers provides the best trade-off between accuracy and bias reduction. 
In contrast, for decoder-only instruction-tuned models, a single well-placed intervention typically captures most of the achievable bias reduction, and additional layers offer limited benefit.


\section{Experimental Setup}

We evaluate activation steering on both encoder- and decoder-based architectures. For the encoder setting, we fine-tune \texttt{bert-base-uncased}~\cite{devlin-etal-2019-bert} using a dual-head classification architecture. For decoder-only instruction-tuned models (TinyLlama and Qwen), no task-specific fine-tuning is performed. Instead, validity prediction is formulated as an instruction-following task, and all steering interventions are applied exclusively at inference time.

\subsection{Dataset}

We use the official dataset released for SemEval-2026 Task~11, Subtask~1~\cite{semeval2026_task11_dataset}. The dataset consists of English natural-language syllogisms annotated with two binary labels: \textit{validity} (formal logical correctness) and \textit{plausibility} (real-world believability). The objective is to predict logical validity independently of plausibility effects. 

The training set contains approximately 800 instances. Table~\ref{tab:dataset_example} shows a representative example illustrating the potential conflict between logical validity and surface plausibility.

\begin{table}[h]
\centering
\small
\begin{tabular}{p{7cm}}
\hline
\textbf{Syllogism} \\
\hline
Not all canines are aquatic creatures known as fish. It is certain that no fish belong to the class of mammals. Therefore, every canine falls under the category of mammals. \\
\hline
\textbf{Validity:} false \\
\textbf{Plausibility:} true \\
\hline
\end{tabular}
\caption{Example illustrating the distinction between logical validity and plausibility.}
\label{tab:dataset_example}
\end{table}

\subsection{Evaluation Metrics}

Systems are evaluated using the official SemEval metric, which measures accuracy over binary validity labels. Since classes are balanced, accuracy serves as the primary evaluation metric. Plausibility labels are not used at test time and function only as auxiliary supervision for the encoder-based model.

To quantify susceptibility to plausibility heuristics, we additionally report \emph{Directional Content Bias}, defined as the absolute difference in validity accuracy between plausibility-aligned and plausibility-conflicting subsets:

\begin{equation}
\text{Bias} = \left| \mathrm{Acc}_{\text{plaus}} - \mathrm{Acc}_{\text{implaus}} \right|.
\label{eq:content_bias}
\end{equation}

Lower values indicate greater robustness to content effects.

\subsection{Experimental Environment}

Experiments are conducted using \texttt{bert-base-uncased}~\cite{devlin-etal-2019-bert}, \texttt{TinyLlama-1.1B-Chat-v1.0}~\cite{zhang2024tinyllamaopensourcesmalllanguage}, and \texttt{Qwen2.5-1.5B-Instruct}~\cite{qwen2025qwen25technicalreport}. All models are implemented in PyTorch with HuggingFace Transformers. We fix random seeds for reproducibility.

\subsection{Multi-Head Training (Encoder Model)}

Following~\citet{sun2020finetuneberttextclassification}, we attach two independent classification heads to the shared \texttt{[CLS]} representation of BERT: a \textbf{validity head}, which predicts formal logical correctness, and a \textbf{plausibility head}, which predicts empirical believability.


This dual-head design encourages the encoder to encode both logical and heuristic signals in its latent space, intentionally inducing representational overlap. This controlled semantic interference provides the setting in which activation steering aims to disentangle plausibility-driven and validity-driven representations~\cite{valentino2025mitigatingcontenteffectsreasoning}.


\section{Results}

\subsection{Layer Sensitivity Analysis}
Across architectures, steering effects are negligible in early layers, increase toward intermediate-to-late layers, and either peak sharply (decoder models) or gradually accumulate (encoder models).

For BERT, sensitivity increases steadily from lower to higher layers, reaching its maximum in the final layers (8--11). This smooth upward trend suggests that logical and plausibility-related signals remain distributed across multiple upper layers. No single layer dominates the bias effect; instead, the influence of plausibility appears to accumulate progressively throughout the encoder stack.

In contrast, Qwen exhibits a highly non-monotonic profile. Sensitivity remains low in early layers, then rises sharply around layers 15--17, forming a narrow peak, before fluctuating in later layers. This concentrated spike indicates that bias-relevant reasoning signals are localized within a restricted band of upper-middle layers. The sharper peak compared to BERT suggests a more centralized representation of decision-critical features in decoder-only architectures.

This architectural contrast already anticipates the downstream steering behavior: distributed sensitivity in BERT should favor multi-layer interventions, whereas localized sensitivity in Qwen should benefit from single-layer correction.


\begin{table*}[!t]
\centering
\begin{tabular}{llccc}
\hline
\textbf{Model} & \textbf{Configuration} & \textbf{Layers} & \textbf{Acc} & \textbf{Bias} \\ \hline
BERT & Baseline & None & 0.7708 & 0.0833 \\
BERT & Single-Layer & 11 & 0.7604 & \textbf{0.0104} \\
BERT & \textbf{Sequential} & \textbf{8 + 9} & \textbf{0.8229} & 0.0208 \\
BERT & High-K Value & 8 + 9 & 0.7812 & 0.0625 \\ \hline
TinyLlama & Baseline & None & 0.6250 & 0.1469 \\
TinyLlama & Single-Layer (K-ACT) & 17 & \textbf{0.6667} & \textbf{0.0559} \\
TinyLlama & Sequential (K-CAST$\rightarrow$ACT) & 21 + 14 & 0.5104 & 0.1066 \\
TinyLlama & High-K Value & 21 & 0.5417 & 0.0769 \\ \hline
Qwen & Baseline & None & 0.6771 & 0.2611 \\
Qwen & Single-Layer (K-CAST) & 21 & \textbf{0.6875} & \textbf{0.0409} \\
Qwen & Sequential (Hybrid) & 19 + 21 & 0.6833 & 0.2599 \\
Qwen & High-K / ACT & 23 & 0.6792 & 0.2674 \\
\hline
\end{tabular}
\caption{Comparison of steering outcomes across models; the best accuracy and bias for each model are shown in bold. Sequential steering provides the best trade-off between accuracy and bias for the encoder-based BERT model, 
whereas single-layer steering is more effective for the decoder-only TinyLlama and Qwen models.
Overly large neighborhoods (High-K) lead to signal dilution.}
\label{tab:results}
\end{table*}


\subsection{Content Bias and Accuracy}

\paragraph{BERT (Encoder-Based).}
The baseline BERT model achieves 0.7708 accuracy with a directional bias of 0.0833.  
Single-layer steering at layer 11 drastically reduces bias (0.0104) but slightly decreases accuracy (0.7604), suggesting that aggressive correction at a single late layer may partially suppress useful reasoning signals.

Sequential steering across layers 8 and 9 yields the strongest overall performance: accuracy improves to 0.8229 (from 0.7708) while bias falls to 0.0208. This result indicates that distributing low-intensity corrections across adjacent sensitive layers preserves logical representations while attenuating plausibility-driven deviations. In contrast, increasing the neighborhood size on the same layers (High-K) lowers accuracy to 0.7812 and raises bias to 0.0625 relative to the sequential configuration, suggesting that overly broad contrastive neighborhoods dilute the corrective signal.

Overall, BERT benefits from coordinated multi-layer steering, consistent with its gradually distributed sensitivity profile.
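
To make this trade-off concrete, the BERT bias values in Table~\ref{tab:results} can be re-expressed as relative reductions with respect to the baseline. The quantity $\rho$ below is an illustrative derived measure introduced here, not one of the reported metrics, and it assumes that the reported bias values are magnitudes for which lower is better:
\begin{align*}
\rho &= \frac{b_{\text{base}} - b_{\text{steer}}}{b_{\text{base}}}, \\
\rho_{\text{single}} &= \frac{0.0833 - 0.0104}{0.0833} \approx 0.875, \\
\rho_{\text{seq}} &= \frac{0.0833 - 0.0208}{0.0833} \approx 0.750.
\end{align*}
Under this reading, single-layer steering removes roughly 88\% of the baseline bias at a cost of $0.7708 - 0.7604 = 0.0104$ in accuracy, whereas sequential steering removes roughly 75\% of the baseline bias while gaining $0.8229 - 0.7708 = 0.0521$ in accuracy.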

\paragraph{Qwen (Decoder-Only).}
The baseline Qwen model exhibits a substantially higher bias (0.2611), indicating strong susceptibility to plausibility heuristics. However, a single K-CAST intervention at layer 21 reduces bias dramatically to 0.0409, while slightly improving accuracy (0.6875 vs.\ 0.6771).

Importantly, sequential steering (layers 19 + 21) fails to produce additional gains and instead restores bias to near-baseline levels (0.2599). Similarly, ACT-based smoothing at layer 23 provides no meaningful improvement. These findings suggest that once the bias-relevant representation in the dominant layer is corrected, further interventions interfere with stabilized decision signals. In decoder-only models, late-layer activations appear to play a decisive role in classification, and redundant corrections may destabilize the final representation.

\paragraph{TinyLlama.}
TinyLlama follows a pattern similar to Qwen but with lower baseline accuracy. The model starts at 0.6250 accuracy and 0.1469 bias. A single-layer K-ACT intervention at layer 17 improves both metrics (accuracy 0.6667, bias 0.0559), which is consistent with bias-sensitive signals being concentrated in a narrow upper-layer band.

Sequential steering across layers 21 and 14, however, degrades accuracy sharply (0.5104) and only partially reduces bias (0.1066). The combination of a large accuracy drop and an incomplete bias reduction suggests overcorrection and interference between interventions applied at different depths. This behavior is consistent with later-layer corrections dominating earlier adjustments, and it indicates that distributed interventions can distort the learned decision boundary in decoder-only architectures.
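
The contrast between single-layer and sequential steering in the two decoder-only models can also be read off as relative bias reductions. As an illustrative derived quantity (not a reported metric, and assuming that the bias values in Table~\ref{tab:results} are magnitudes for which lower is better), let $\rho = (b_{\text{base}} - b_{\text{steer}})/b_{\text{base}}$. Then:
\begin{align*}
\text{Qwen, single-layer:} \quad \rho &= \frac{0.2611 - 0.0409}{0.2611} \approx 0.843, \\
\text{Qwen, sequential:} \quad \rho &= \frac{0.2611 - 0.2599}{0.2611} \approx 0.005, \\
\text{TinyLlama, single-layer:} \quad \rho &= \frac{0.1469 - 0.0559}{0.1469} \approx 0.619, \\
\text{TinyLlama, sequential:} \quad \rho &= \frac{0.1469 - 0.1066}{0.1469} \approx 0.274.
\end{align*}
In both models the single-layer intervention accounts for most of the achievable reduction, while the sequential variant recovers less than half of it for TinyLlama and almost none of it for Qwen. For TinyLlama the sequential setting additionally costs $0.6250 - 0.5104 = 0.1146$ in accuracy.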

\paragraph{Cross-Architectural Insights.}
Taken together, the results reveal a clear architectural distinction:

\begin{itemize}
    \item \textbf{Encoder-based models (BERT)} encode reasoning and plausibility signals in a distributed manner across upper layers. Consequently, sequential low-intensity steering across adjacent layers provides the best trade-off between accuracy and bias reduction.
    \item \textbf{Decoder-only instruction-tuned models (Qwen, TinyLlama)} concentrate bias-sensitive reasoning signals within a narrow band of late layers. In these models, a single well-placed intervention is sufficient, while multi-layer steering can introduce instability.
\end{itemize}


These findings demonstrate that activation steering is not architecture-agnostic. Its effectiveness depends critically on the representational geometry and the layer-wise concentration of reasoning signals within the underlying transformer. Aligning the steering strategy with these architectural dynamics enables substantial bias reduction without sacrificing logical accuracy and, in some cases, even leads to performance improvements.

Table~\ref{tab:results} confirms these architecture-dependent dynamics.

Steering can become unstable in the late layers of decoder-only models, and sequential multi-layer interventions may introduce interference. Larger values of $K$ further reduce effectiveness by diluting the corrective signal.

\section{Conclusion}

We show that content bias in logical reasoning can be mitigated at inference time without sacrificing performance. By modeling bias as a geometric deviation in representation space, activation steering improves logical robustness through targeted latent interventions.

Crucially, the optimal strategy depends on model architecture. Encoder-based models such as BERT benefit from low-intensity sequential steering across multiple upper layers, where reasoning signals are distributed. In contrast, decoder-only instruction-tuned models such as TinyLlama and Qwen concentrate bias-sensitive representations within a narrow late-layer band, where a single well-placed intervention is sufficient. These findings demonstrate that effective bias mitigation requires architecture-aware steering aligned with the model’s internal representational structure.

\section{Limitations}

Our study is limited to relatively small models and a modest dataset, which may restrict generalization to larger architectures or broader reasoning tasks. Steering layers and hyperparameters are selected empirically and may require adaptation across settings. Moreover, we evaluate only a small set of encoder and decoder models, leaving the generality of the observed architectural differences open.

\section{Potential Improvements}

Future work could extend this approach to larger models and broader datasets to evaluate scalability. In particular, studying a wider range of decoder-only and encoder--decoder architectures would help determine whether the observed architectural differences generalize. Automating the selection of steering layers, steering strengths, and neighborhood sizes could further reduce the need for manual tuning. Additionally, exploring adaptive or learned combinations of ACT and K-CAST may yield more stable improvements in decoder-based models.


\bibliography{custom}

\appendix


\end{document}