Locally Distributed Activation Vectors for Guided Feature Attribution

Housam K. B. Bashier, Mi-Young Kim, Randy Goebel


Abstract
Explaining the predictions of a deep neural network (DNN) is a challenging problem. Many attempts at interpreting those predictions have focused on attribution-based methods, which assess the contribution of individual input features to each model prediction. However, attribution-based explanations are not always faithful to the target model; for example, noisy gradients can yield unfaithful feature attributions for back-propagation methods. We present a method that learns explanation-specific representations while constructing deep network models for text classification. These representations can be used to faithfully interpret black-box predictions, i.e., to highlight the most important input features and their role in any particular prediction. We show that learning such representations improves model interpretability across a variety of tasks, in both qualitative and quantitative evaluations, while preserving predictive performance.
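For context, the back-propagation attribution family the abstract critiques can be sketched as vanilla gradient saliency over input embeddings. The code below is a minimal, hypothetical PyTorch illustration of that baseline, not the paper's method; the toy classifier, shapes, and L2-norm scoring are assumptions for demonstration only.

```python
import torch
import torch.nn as nn

class TinyTextClassifier(nn.Module):
    """Toy embedding -> mean-pool -> linear text classifier (illustrative only)."""
    def __init__(self, vocab_size=10000, embed_dim=64, num_classes=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids):
        return self.fc(self.embed(token_ids).mean(dim=1))

def saliency(model, token_ids, target_class):
    """Score each token by the gradient of the target logit w.r.t. its embedding."""
    model.eval()
    emb = model.embed(token_ids).detach().requires_grad_(True)  # leaf tensor
    logits = model.fc(emb.mean(dim=1))
    logits[0, target_class].backward()
    # Per-token L2 norm of the embedding gradient is a common importance score;
    # as the abstract notes, such gradients can be noisy and hence unfaithful.
    return emb.grad.norm(dim=-1).squeeze(0)

model = TinyTextClassifier()
tokens = torch.randint(0, 10000, (1, 12))  # one toy sequence of 12 token ids
print(saliency(model, tokens, target_class=0))
```

Higher scores indicate tokens with larger attributed importance; the instability of these raw gradient scores is the kind of unfaithfulness the paper's learned representations aim to address.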
Anthology ID:
2022.coling-1.83
Volume:
Proceedings of the 29th International Conference on Computational Linguistics
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
994–1005
URL:
https://aclanthology.org/2022.coling-1.83
Cite (ACL):
Housam K. B. Bashier, Mi-Young Kim, and Randy Goebel. 2022. Locally Distributed Activation Vectors for Guided Feature Attribution. In Proceedings of the 29th International Conference on Computational Linguistics, pages 994–1005, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal):
Locally Distributed Activation Vectors for Guided Feature Attribution (Bashier et al., COLING 2022)
PDF:
https://preview.aclanthology.org/auto-file-uploads/2022.coling-1.83.pdf
Data
AG News, IMDb Movie Reviews, SNLI