Detecting What Queries Seek: Steering LLM Safety with FFN Output Activation Monitoring

Xiaohao Luo, Ying Wei, Rui Zhao


Abstract
Activation steering has recently attracted considerable attention as a low-cost approach to improving the safety of large language models (LLMs). However, most existing methods apply interventions indiscriminately, often causing excessive refusal of benign queries. Although recent works have begun to explore selective intervention, their intervention decisions typically rely on residual stream activations, where information is highly entangled, resulting in limited discriminative power and unreliable interventions. To address this issue, we propose FFN-Guided activation steering (FGAS). Motivated by the observation that feed-forward networks (FFNs) in LLMs serve as core modules for knowledge storage, FGAS uses FFN output activations as more discriminative signals for intervention, since these activations more explicitly reflect the intent of a query. For a given query, FGAS projects the corresponding FFN output activation into a low-dimensional subspace that separates harmful and benign queries, and then decides whether to intervene by assessing its similarity to pre-constructed prototype activations representing the harmful and benign classes. Extensive experiments demonstrate that FGAS achieves state-of-the-art defense performance against various jailbreak attacks while largely preserving the model’s original performance on benign tasks.
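The decision rule described in the abstract (project an FFN output activation into a low-dimensional subspace, compare it with harmful and benign prototypes, and intervene only when the harmful prototype is closer) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the PCA-style subspace, the cosine similarity measure, the additive steering step, and all function names (fit_subspace, should_steer, maybe_steer) are hypothetical, and the activations are synthetic rather than taken from a real model.

```python
import numpy as np

def fit_subspace(harmful, benign, k=2):
    """Fit a k-dimensional PCA-style basis on labelled activations.
    Assumption: a stand-in for whatever subspace the paper actually constructs."""
    data = np.vstack([harmful, benign])
    mean = data.mean(axis=0)
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:k].T  # basis has shape (d, k)

def project(acts, mean, basis):
    """Center activations and project them into the subspace."""
    return (acts - mean) @ basis

def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def should_steer(act, mean, basis, proto_harm, proto_benign):
    """Intervene only if the projected activation is closer to the harmful prototype."""
    z = project(act, mean, basis)
    return cosine(z, proto_harm) > cosine(z, proto_benign)

def maybe_steer(hidden, act, mean, basis, proto_harm, proto_benign,
                direction, alpha=1.0):
    """Hypothetical additive steering, applied only when the monitor fires."""
    if should_steer(act, mean, basis, proto_harm, proto_benign):
        return hidden + alpha * direction
    return hidden

# Synthetic stand-ins for FFN output activations (not real model data).
rng = np.random.default_rng(0)
d = 16
mu = np.full(d, 3.0)
harmful = mu + 0.1 * rng.standard_normal((50, d))
benign = -mu + 0.1 * rng.standard_normal((50, d))

mean, basis = fit_subspace(harmful, benign)
# Prototypes are the class means in the projected subspace (an assumption).
proto_harm = project(harmful, mean, basis).mean(axis=0)
proto_benign = project(benign, mean, basis).mean(axis=0)
```

In this toy setting the two classes are separated along a single direction, so the first principal component is enough to tell them apart. With a real model, the activations would have to be captured from a chosen FFN layer, and the choice of layer, subspace, and threshold would follow the paper rather than this sketch.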
Anthology ID:
2026.acl-long.1360
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
29500–29514
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1360/
Cite (ACL):
Xiaohao Luo, Ying Wei, and Rui Zhao. 2026. Detecting What Queries Seek: Steering LLM Safety with FFN Output Activation Monitoring. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 29500–29514, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Detecting What Queries Seek: Steering LLM Safety with FFN Output Activation Monitoring (Luo et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1360.pdf
Checklist:
 2026.acl-long.1360.checklist.pdf