Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs

Amr Hegazy, Mostafa Elhoushi, Amr Alanwar


Abstract
Controlling undesirable LLM behaviors typically requires costly fine-tuning, while existing inference-time steering methods lack fine-grained adaptivity. We introduce a lightweight, trainable controller network for adaptive inference-time control. The controller observes intermediate LLM activations to predict a global scaling factor and layer-specific weights, which dynamically modulate a pre-computed “refusal direction” vector. Trained on harmful and benign prompts, the controller learns to apply nuanced, layer-aware steering selectively. Experiments on Llama and Mistral models show our method significantly increases refusal rates on safety benchmarks like ToxicChat, outperforming existing approaches without altering the original model parameters.
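The steering mechanism described in the abstract can be illustrated with a minimal pure-Python sketch. The abstract does not give the controller architecture or the exact update rule, so everything here is an assumption made for illustration: a single linear controller, a sigmoid to bound the global scale, a softmax over layer weights, and additive steering of the form h_l + s * w_l * d. The names `controller`, `steer`, `w_scale`, and `w_layers` are hypothetical and not taken from the paper.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def controller(activation, w_scale, w_layers):
    """Hypothetical linear controller (an assumption, not the paper's design).

    Maps one observed activation vector to a global scale in (0, 1)
    and one weight per layer, with the weights summing to 1.
    """
    scale_logit = sum(a * w for a, w in zip(activation, w_scale))
    scale = 1.0 / (1.0 + math.exp(-scale_logit))
    layer_logits = [
        sum(a * w for a, w in zip(activation, row)) for row in w_layers
    ]
    return scale, softmax(layer_logits)


def steer(hidden_by_layer, direction, scale, layer_weights):
    """Assumed additive steering: add scale * layer_weight * direction
    to each layer's hidden vector. Base model weights are never touched."""
    return [
        [h + scale * w * d for h, d in zip(hidden, direction)]
        for hidden, w in zip(hidden_by_layer, layer_weights)
    ]


# Toy example: 2 layers, hidden size 3, a pre-computed unit direction.
direction = [1.0, 0.0, 0.0]
hidden_by_layer = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
observed = [0.0, 0.0, 0.0]          # activation the controller observes
w_scale = [0.5, 0.5, 0.5]           # illustrative controller parameters
w_layers = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

scale, layer_weights = controller(observed, w_scale, w_layers)
steered = steer(hidden_by_layer, direction, scale, layer_weights)
```

In this sketch only the controller parameters (`w_scale`, `w_layers`) would be trained; the refusal direction and the LLM itself stay fixed, which matches the abstract's claim that the original model parameters are not altered.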
Anthology ID:
2026.trustnlp-main.46
Volume:
Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026)
Month:
July
Year:
2026
Address:
San Diego, California
Editors:
Kai-Wei Chang, Ninareh Mehrabi, Satyapriya Krishna, Anubrata Das, Jwala Dhamala, Yang Trista Cao, Tharindu Kumarage, Anil Ramakrishna, Christos Christodoulopoulos, Yixin Wan, Aram Galystan, Anoop Kumar, Rahul Gupta
Venues:
TrustNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
584–599
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.trustnlp-main.46/
Cite (ACL):
Amr Hegazy, Mostafa Elhoushi, and Amr Alanwar. 2026. Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs. In Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026), pages 584–599, San Diego, California. Association for Computational Linguistics.
Cite (Informal):
Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs (Hegazy et al., TrustNLP 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.trustnlp-main.46.pdf