Transformer-specific Interpretability
Hosein Mohebbi, Jaap Jumelet, Michael Hanna, Afra Alishahi, Willem Zuidema
Abstract
Transformers have emerged as dominant players in various scientific fields, especially NLP. However, their inner workings, like those of many other neural networks, remain opaque. Despite the widespread use of model-agnostic interpretability techniques, such as gradient-based and occlusion-based methods, their shortcomings are becoming increasingly apparent when applied to Transformers, making the field of interpretability more demanding than ever. In this tutorial, we present Transformer-specific interpretability methods, a new and increasingly popular family of approaches that make use of specific features of the Transformer architecture and are deemed more promising for understanding Transformer-based models. We start by discussing the potential pitfalls and misleading results that model-agnostic approaches may produce when interpreting Transformers. Next, we discuss Transformer-specific methods, including those designed to quantify context-mixing interactions among all input pairs (the fundamental property of the Transformer architecture) and those that combine causal methods with low-level Transformer analysis to identify the particular subnetworks within a model that are responsible for specific tasks. By the end of the tutorial, we hope participants will understand the advantages (as well as the current limitations) of Transformer-specific interpretability methods, and how these can be applied to their own research.
- Anthology ID: 2024.eacl-tutorials.4
- Volume: Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts
- Month: March
- Year: 2024
- Address: St. Julian’s, Malta
- Editors: Mohsen Mesgar, Sharid Loáiciga
- Venue: EACL
- Publisher: Association for Computational Linguistics
- Pages: 21–26
- URL: https://aclanthology.org/2024.eacl-tutorials.4
- Cite (ACL): Hosein Mohebbi, Jaap Jumelet, Michael Hanna, Afra Alishahi, and Willem Zuidema. 2024. Transformer-specific Interpretability. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts, pages 21–26, St. Julian’s, Malta. Association for Computational Linguistics.
- Cite (Informal): Transformer-specific Interpretability (Mohebbi et al., EACL 2024)
- PDF: https://preview.aclanthology.org/dois-2013-emnlp/2024.eacl-tutorials.4.pdf
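The context-mixing methods mentioned in the abstract quantify how much each input token contributes to each output position. As an illustration of this family, below is a minimal sketch of attention rollout (Abnar & Zuidema, 2020), which propagates head-averaged attention maps through the layers while accounting for residual connections. This sketch is not part of the tutorial materials: the choice of `bert-base-uncased`, the example sentence, and the printing logic are illustrative assumptions; the 0.5/0.5 residual weighting follows the original rollout formulation.

```python
# Minimal sketch of attention rollout for quantifying context mixing.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # assumption: any encoder exposing attentions works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)
model.eval()

sentence = "Transformers mix information across all token pairs."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one (batch, heads, seq, seq) tensor per layer
seq_len = inputs["input_ids"].shape[1]
rollout = torch.eye(seq_len)

for layer_attn in outputs.attentions:
    attn = layer_attn[0].mean(dim=0)              # average over heads -> (seq, seq)
    attn = 0.5 * attn + 0.5 * torch.eye(seq_len)  # account for the residual connection
    attn = attn / attn.sum(dim=-1, keepdim=True)  # re-normalize rows
    rollout = attn @ rollout                      # accumulate mixing across layers

# rollout[i, j] approximates how much input token j contributes to position i
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for tok, scores in zip(tokens, rollout):
    top = scores.topk(3)
    pairs = [(tokens[int(j)], round(float(s), 3)) for s, j in zip(top.values, top.indices)]
    print(f"{tok:>14} <- {pairs}")
```

Each row of `rollout` is a distribution over input tokens, so comparing rows across positions gives a layer-aggregated picture of context mixing; the tutorial covers both methods of this kind and their known limitations.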