@inproceedings{lee-etal-2025-small,
    % ACL 2025 long paper; metadata follows ACL Anthology conventions
    % (address = conference location, per Anthology house style).
    title     = {Small Changes, Big Impact: How Manipulating a Few Neurons Can Drastically Alter {LLM} Aggression},
    author    = {Lee, Jaewook and
                 Jang, Junseo and
                 Kwon, Oh-Woog and
                 Kim, Harksoo},
    editor    = {Che, Wanxiang and
                 Nabende, Joyce and
                 Shutova, Ekaterina and
                 Pilehvar, Mohammad Taher},
    booktitle = {Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    month     = jul,
    year      = {2025},
    address   = {Vienna, Austria},
    publisher = {Association for Computational Linguistics},
    % Canonical Anthology URL; the original pointed at the temporary
    % "preview...ingestion-acl-25" host, which is retired after ingestion.
    url       = {https://aclanthology.org/2025.acl-long.1144/},
    pages     = {23478--23505},
    isbn      = {979-8-89176-251-0},
    abstract  = {Recent remarkable advances in Large Language Models (LLMs) have led to innovations in various domains such as education, healthcare, and finance, while also raising serious concerns that they can be easily misused for malicious purposes. Most previous research has focused primarily on observing how jailbreak attack techniques bypass safety mechanisms like Reinforcement Learning through Human Feedback (RLHF). However, whether there are neurons within LLMs that directly govern aggression has not been sufficiently investigated. To fill this gap, this study identifies specific neurons ({``}aggression neurons'') closely related to the expression of aggression and systematically analyzes how manipulating them affects the model{'}s overall aggression. Specifically, using a large-scale synthetic text corpus (aggressive and non-aggressive), we measure the activation frequency of each neuron, then apply masking and activation techniques to quantitatively evaluate changes in aggression by layer and by manipulation ratio. Experimental results show that, in all models, manipulating only a small number of neurons can increase aggression by up to 33{\%}, and the effect is even more extreme when aggression neurons are concentrated in certain layers. Moreover, even models of the same scale exhibit nonlinear changes in aggression patterns, suggesting that simple external safety measures alone may not be sufficient for complete defense.},
}
Markdown (Informal)
[Small Changes, Big Impact: How Manipulating a Few Neurons Can Drastically Alter LLM Aggression](https://aclanthology.org/2025.acl-long.1144/) (Lee et al., ACL 2025)
ACL