Abstract
Prior work suggests that language models manage the limited bandwidth of the residual stream through a “memory management” mechanism, in which certain attention heads and MLP layers clear residual stream directions set by earlier layers. Our study provides concrete evidence for this erasure phenomenon in a 4-layer transformer, identifying heads that consistently remove the output of earlier heads. We further demonstrate that direct logit attribution (DLA), a common technique for interpreting the output of intermediate transformer layers, can yield misleading results because it does not account for erasure.
- Anthology ID:
- 2024.blackboxnlp-1.15
- Volume:
- Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, US
- Editors:
- Yonatan Belinkov, Najoung Kim, Jaap Jumelet, Hosein Mohebbi, Aaron Mueller, Hanjie Chen
- Venue:
- BlackboxNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 232–237
- URL:
- https://aclanthology.org/2024.blackboxnlp-1.15
- DOI:
- 10.18653/v1/2024.blackboxnlp-1.15
- Cite (ACL):
- Jett Janiak, Can Rager, James Dao, and Yeu-Tong Lau. 2024. An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 232–237, Miami, Florida, US. Association for Computational Linguistics.
- Cite (Informal):
- An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L (Janiak et al., BlackboxNLP 2024)
- PDF:
- https://preview.aclanthology.org/dois-2013-emnlp/2024.blackboxnlp-1.15.pdf
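The abstract describes direct logit attribution (DLA) — projecting an individual component's residual-stream output onto a token's unembedding direction — and argues that erasure by later heads can make per-component DLA misleading. The toy sketch below (hypothetical dimensions and values, not taken from the paper) illustrates that failure mode: an early head appears to strongly promote a token under DLA, while a later "memory management" head writes the negation, so the pair's net effect on the logit is zero.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # hypothetical residual-stream width

# Hypothetical unembedding direction for one vocabulary token.
u = rng.standard_normal(d_model)

# An early head writes a direction aligned with u into the residual stream...
head_early = 3.0 * u / np.linalg.norm(u)
# ...and a later "memory management" head erases it by writing the negation.
head_late = -head_early

def dla(component_out, unembed):
    """Direct logit attribution: dot the component's residual-stream
    output with the unembedding direction of a token of interest."""
    return float(component_out @ unembed)

print(dla(head_early, u))              # positive: early head seems to promote the token
print(dla(head_late, u))               # equally negative: the erasing head cancels it
print(dla(head_early + head_late, u))  # ~0: net contribution to the logit is nil
```

Read in isolation, the early head's DLA score suggests a strong positive contribution to the token's logit, yet because the later head erases that direction, the model's actual output is unchanged — the adversarial case the paper constructs.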