Transformer-specific Interpretability
Hosein Mohebbi, Jaap Jumelet, Michael Hanna, Afra Alishahi, Willem Zuidema
Abstract
Transformers have emerged as dominant players in various scientific fields, especially NLP. However, their inner workings, like those of many other neural networks, remain opaque. Despite the widespread use of model-agnostic interpretability techniques, such as gradient-based and occlusion-based methods, their shortcomings are becoming increasingly apparent when applied to Transformers, making the field of interpretability more demanding than ever. In this tutorial, we present Transformer-specific interpretability methods, a new and increasingly popular family of approaches that make use of specific features of the Transformer architecture and are deemed more promising for understanding Transformer-based models. We start by discussing the potential pitfalls and misleading results that model-agnostic approaches may produce when interpreting Transformers. Next, we discuss Transformer-specific methods, including those designed to quantify context-mixing interactions among all input pairs (the fundamental property of the Transformer architecture) and those that combine causal methods with low-level Transformer analysis to identify the particular subnetworks within a model that are responsible for specific tasks. By the end of the tutorial, we hope participants will understand the advantages (as well as the current limitations) of Transformer-specific interpretability methods, and how these can be applied to their own research.
- Anthology ID: 2024.eacl-tutorials.4
- Volume: Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts
- Month: March
- Year: 2024
- Address: St. Julian’s, Malta
- Editors: Mohsen Mesgar, Sharid Loáiciga
- Venue: EACL
- Publisher: Association for Computational Linguistics
- Pages: 21–26
- URL: https://aclanthology.org/2024.eacl-tutorials.4
- Cite (ACL): Hosein Mohebbi, Jaap Jumelet, Michael Hanna, Afra Alishahi, and Willem Zuidema. 2024. Transformer-specific Interpretability. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts, pages 21–26, St. Julian’s, Malta. Association for Computational Linguistics.
- Cite (Informal): Transformer-specific Interpretability (Mohebbi et al., EACL 2024)
- PDF: https://preview.aclanthology.org/dois-2013-emnlp/2024.eacl-tutorials.4.pdf
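The context-mixing methods mentioned in the abstract quantify how much each input token contributes to each output position. As an illustration of this family, below is a minimal sketch of attention rollout (Abnar & Zuidema, 2020), which propagates head-averaged attention maps through the layers while accounting for residual connections. This sketch is not part of the tutorial materials: the choice of `bert-base-uncased`, the example sentence, and the printing logic are illustrative assumptions; the 0.5/0.5 residual weighting follows the original rollout formulation.

```python
# Minimal sketch of attention rollout for quantifying context mixing.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # assumption: any encoder exposing attentions works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)
model.eval()

sentence = "Transformers mix information across all token pairs."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one (batch, heads, seq, seq) tensor per layer
seq_len = inputs["input_ids"].shape[1]
rollout = torch.eye(seq_len)

for layer_attn in outputs.attentions:
    attn = layer_attn[0].mean(dim=0)              # average over heads -> (seq, seq)
    attn = 0.5 * attn + 0.5 * torch.eye(seq_len)  # account for the residual connection
    attn = attn / attn.sum(dim=-1, keepdim=True)  # re-normalize rows
    rollout = attn @ rollout                      # accumulate mixing across layers

# rollout[i, j] approximates how much input token j contributes to position i
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for tok, scores in zip(tokens, rollout):
    top = scores.topk(3)
    pairs = [(tokens[int(j)], round(float(s), 3)) for s, j in zip(top.values, top.indices)]
    print(f"{tok:>14} <- {pairs}")
```

Each row of `rollout` is a distribution over input tokens, so comparing rows across positions gives a layer-aggregated picture of context mixing; the tutorial covers both methods of this kind and their known limitations.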