Amr Alanwar
2026
Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs
Amr Hegazy | Mostafa Elhoushi | Amr Alanwar
Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026)
Controlling undesirable LLM behaviors typically requires costly fine-tuning, while existing inference-time steering methods lack fine-grained adaptivity. We introduce a lightweight, trainable controller network for adaptive inference-time control. The controller observes intermediate LLM activations to predict a global scaling factor and layer-specific weights, which dynamically modulate a pre-computed “refusal direction” vector. Trained on harmful and benign prompts, the controller learns to apply nuanced, layer-aware steering selectively. Experiments on Llama and Mistral models show our method significantly increases refusal rates on safety benchmarks like ToxicChat, outperforming existing approaches without altering the original model parameters.
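The mechanism described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the controller architecture or the exact steering rule. The sketch assumes a single linear layer as the controller, a sigmoid for the global scaling factor, a softmax for the layer-specific weights, and additive steering of the form h_l + s * w_l * r, where r is the pre-computed refusal direction. The names `controller` and `steer` and all parameters are hypothetical.

```python
import math


def controller(features, weight_rows, biases):
    """Hypothetical linear controller (an assumption, not the paper's design).

    Maps a feature vector taken from intermediate activations to
    (global scale, per-layer weights). The first output becomes the
    global scale; the remaining outputs become one weight per layer.
    """
    outputs = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weight_rows, biases)
    ]
    # Assumed squashing: sigmoid keeps the global scale in (0, 1).
    scale = 1.0 / (1.0 + math.exp(-outputs[0]))
    # Assumed normalization: softmax makes the layer weights sum to 1.
    logits = outputs[1:]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    layer_weights = [e / total for e in exps]
    return scale, layer_weights


def steer(hidden_states, direction, scale, layer_weights):
    """Add the weighted refusal direction to each layer's hidden state.

    Assumed rule: h_l <- h_l + scale * w_l * direction.
    The base model's parameters are never modified.
    """
    return [
        [h + scale * w * d for h, d in zip(layer, direction)]
        for layer, w in zip(hidden_states, layer_weights)
    ]


# Toy usage: 3 layers, hidden size 2, made-up controller parameters.
features = [0.2, -0.1]
weight_rows = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [0.3, 0.3]]
biases = [0.0, 0.0, 0.0, 0.0]
scale, layer_weights = controller(features, weight_rows, biases)
hidden_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
steered = steer(hidden_states, [1.0, 0.0], scale, layer_weights)
```

In this sketch only the controller parameters (`weight_rows`, `biases`) would be trained on harmful and benign prompts. A controller that outputs a scale near zero for benign prompts leaves the hidden states almost unchanged, which is one way to read the "selective" steering the abstract describes.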