Evaluating Attribution Methods using White-Box LSTMs
2020, pp. 300–313
Abstract
Interpretability methods for neural networks are difficult to evaluate because we do not understand the black-box models typically used to test them. This paper proposes a framework in which interpretability methods are evaluated using manually constructed networks, which we call white-box networks, whose behavior is understood a priori. We evaluate five methods for producing attribution heatmaps by applying them to white-box LSTM classifiers for tasks based on formal languages. Although our white-box classifiers solve their tasks perfectly and transparently, we find that all five attribution methods fail to produce the expected model explanations.
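To make the framework concrete, here is a minimal illustrative sketch of the white-box idea in PyTorch. This is an assumed construction, not one of the paper's actual networks, and gradient × input is used only as a stand-in attribution baseline (whether it is among the paper's five methods is not stated in the abstract): a one-unit LSTM is hand-wired, with no training, to report whether the symbol 'a' occurs in a string over {a, b}, so the expected explanation is known a priori.

```python
# Assumed illustrative white-box LSTM, not the paper's construction:
# gate biases are set large so the input/forget/output gates saturate at 1,
# and the cell state simply counts occurrences of the symbol 'a'.
import torch

lstm = torch.nn.LSTM(input_size=2, hidden_size=1, batch_first=True)
with torch.no_grad():
    # PyTorch packs the gates as [input, forget, cell, output] along dim 0.
    lstm.weight_ih_l0.zero_()
    lstm.weight_hh_l0.zero_()
    lstm.bias_hh_l0.zero_()
    lstm.bias_ih_l0.copy_(torch.tensor([10.0, 10.0, 0.0, 10.0]))  # open i, f, o
    lstm.weight_ih_l0[2, 1] = 10.0  # cell candidate ~ 1 iff the token is 'a'

def logit(one_hot):
    h, _ = lstm(one_hot)       # h[t] ~ tanh(number of 'a's seen so far)
    return 10.0 * h[:, -1, 0]  # strongly positive iff the string contains 'a'

# Input "babb": the expected explanation puts all the credit on position 1.
tokens = "babb"
x = torch.eye(2)[[1 if t == "a" else 0 for t in tokens]].unsqueeze(0)
x.requires_grad_(True)
out = logit(x)
out.backward()

heatmap = (x.grad * x).sum(-1).squeeze(0)  # gradient x input, per token
print("logit:", out.item())
print({f"{t}{i}": a.item() for i, (t, a) in enumerate(zip(tokens, heatmap))})
```

Even in this fully transparent counter, the saturated gates nearly annihilate the gradient at the decisive 'a' token: the heatmap is correct in location but close to zero in magnitude. That is the flavor of mismatch between known model behavior and produced explanations that the abstract reports, though the paper's own constructions and methods are more involved than this sketch.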