Amr Alanwar

2026

Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs
Amr Hegazy | Mostafa Elhoushi | Amr Alanwar
Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026)

Controlling undesirable LLM behaviors typically requires costly fine-tuning, while existing inference-time steering methods lack fine-grained adaptivity. We introduce a lightweight, trainable controller network for adaptive inference-time control. The controller observes intermediate LLM activations to predict a global scaling factor and layer-specific weights, which dynamically modulate a pre-computed “refusal direction” vector. Trained on harmful and benign prompts, the controller learns to apply nuanced, layer-aware steering selectively. Experiments on Llama and Mistral models show our method significantly increases refusal rates on safety benchmarks like ToxicChat, outperforming existing approaches without altering the original model parameters.
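As a rough illustration of the mechanism the abstract describes, the sketch below shows how a small controller might map an observed activation to a global scaling factor and per-layer weights, which then modulate a fixed refusal direction. This is not the authors' implementation. The two-layer MLP, the softplus and softmax parameterizations, the additive update, and all names (`controller`, `steer`, `params`) are assumptions made for illustration, and random arrays stand in for real LLM activations.

```python
import numpy as np

def controller(h, params):
    """Hypothetical tiny MLP: one activation vector -> (global scale, per-layer weights)."""
    z = np.tanh(h @ params["W1"] + params["b1"])
    out = z @ params["W2"] + params["b2"]
    scale = np.log1p(np.exp(out[0]))          # softplus keeps the scale non-negative (assumption)
    logits = out[1:]
    weights = np.exp(logits - logits.max())   # softmax over layers (assumption)
    weights /= weights.sum()
    return scale, weights

def steer(activations, direction, scale, weights):
    """Add scale * w_l * direction to each layer's activation (assumed additive steering)."""
    return [a + scale * w * direction for a, w in zip(activations, weights)]

# Toy setup: random stand-ins for per-layer activations and the refusal direction.
rng = np.random.default_rng(0)
d_model, d_hidden, n_layers = 8, 16, 4
params = {
    "W1": rng.normal(size=(d_model, d_hidden)), "b1": np.zeros(d_hidden),
    "W2": rng.normal(size=(d_hidden, n_layers + 1)), "b2": np.zeros(n_layers + 1),
}
acts = [rng.normal(size=d_model) for _ in range(n_layers)]
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)        # unit-norm steering direction

scale, weights = controller(acts[-1], params)
steered = steer(acts, direction, scale, weights)
```

In a real setting the base model's parameters would stay frozen and only `params` would be trained on harmful and benign prompts, so that the predicted scale is large for the former and near zero for the latter.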

Co-authors

Mostafa Elhoushi 1
Amr Hegazy 1

Venues

TrustNLP 1
WS 1
