Abstract
Prior work suggests that language models manage the limited bandwidth of the residual stream through a “memory management” mechanism, in which certain attention heads and MLP layers clear residual stream directions set by earlier layers. Our study provides concrete evidence for this erasure phenomenon in a 4-layer transformer, identifying heads that consistently remove the output of earlier heads. We further demonstrate that direct logit attribution (DLA), a common technique for interpreting the output of intermediate transformer layers, can yield misleading results because it does not account for erasure.
- Anthology ID:
- 2024.blackboxnlp-1.15
- Volume:
- Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, US
- Editors:
- Yonatan Belinkov, Najoung Kim, Jaap Jumelet, Hosein Mohebbi, Aaron Mueller, Hanjie Chen
- Venue:
- BlackboxNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 232–237
- URL:
- https://aclanthology.org/2024.blackboxnlp-1.15
- DOI:
- 10.18653/v1/2024.blackboxnlp-1.15
- Cite (ACL):
- Jett Janiak, Can Rager, James Dao, and Yeu-Tong Lau. 2024. An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 232–237, Miami, Florida, US. Association for Computational Linguistics.
- Cite (Informal):
- An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L (Janiak et al., BlackboxNLP 2024)
- PDF:
- https://preview.aclanthology.org/dois-2013-emnlp/2024.blackboxnlp-1.15.pdf
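The abstract describes direct logit attribution (DLA) — projecting an individual component's residual-stream output onto a token's unembedding direction — and argues that erasure by later heads can make per-component DLA misleading. The toy sketch below (hypothetical dimensions and values, not taken from the paper) illustrates that failure mode: an early head appears to strongly promote a token under DLA, while a later "memory management" head writes the negation, so the pair's net effect on the logit is zero.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # hypothetical residual-stream width

# Hypothetical unembedding direction for one vocabulary token.
u = rng.standard_normal(d_model)

# An early head writes a direction aligned with u into the residual stream...
head_early = 3.0 * u / np.linalg.norm(u)
# ...and a later "memory management" head erases it by writing the negation.
head_late = -head_early

def dla(component_out, unembed):
    """Direct logit attribution: dot the component's residual-stream
    output with the unembedding direction of a token of interest."""
    return float(component_out @ unembed)

print(dla(head_early, u))              # positive: early head seems to promote the token
print(dla(head_late, u))               # equally negative: the erasing head cancels it
print(dla(head_early + head_late, u))  # ~0: net contribution to the logit is nil
```

Read in isolation, the early head's DLA score suggests a strong positive contribution to the token's logit, yet because the later head erases that direction, the model's actual output is unchanged — the adversarial case the paper constructs.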