Abstract
Explaining the predictions of a deep neural network (DNN) is a challenging problem. Many attempts at interpreting those predictions have focused on attribution-based methods, which assess the contribution of individual features to each model prediction. However, attribution-based explanations are not always faithful to the target model; e.g., noisy gradients can result in unfaithful feature attribution for back-propagation methods. We present a method to learn explanation-specific representations while constructing deep network models for text classification. These representations can be used to faithfully interpret black-box predictions, i.e., to highlight the most important input features and their role in any particular prediction. We show that learning these explanation-specific representations improves model interpretability across various tasks, in both qualitative and quantitative evaluations, while preserving predictive performance.
- Anthology ID:
- 2022.coling-1.83
- Volume:
- Proceedings of the 29th International Conference on Computational Linguistics
- Month:
- October
- Year:
- 2022
- Address:
- Gyeongju, Republic of Korea
- Venue:
- COLING
- Publisher:
- International Committee on Computational Linguistics
- Pages:
- 994–1005
- URL:
- https://aclanthology.org/2022.coling-1.83
- Cite (ACL):
- Housam K. B. Bashier, Mi-Young Kim, and Randy Goebel. 2022. Locally Distributed Activation Vectors for Guided Feature Attribution. In Proceedings of the 29th International Conference on Computational Linguistics, pages 994–1005, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
- Cite (Informal):
- Locally Distributed Activation Vectors for Guided Feature Attribution (Bashier et al., COLING 2022)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2022.coling-1.83.pdf
- Data
- AG News, IMDb Movie Reviews, SNLI