Exploring Interpretable LSTM Neural Networks over MultiVariable Data

These are my reading notes on the paper "Exploring Interpretable LSTM Neural Networks over MultiVariable Data" [1]. The paper proposes a new LSTM structure whose predictions can be interpreted with the help of mixture attention, which provides both variable importance and temporal importance.

Background

RNNs trained over multi-variable data capture nonlinear correlations between the historical values of the target and exogenous variables and the future target values. However, current RNNs fall short on interpretability for multi-variable data because their hidden states are opaque. Existing work on the interpretability of recurrent neural networks rarely touches the internal structure of RNNs to overcome the opacity of hidden states on multi-variable data. This paper aims for a unified framework that delivers both accurate forecasting and importance interpretation.


Proposed Model

The model does two things:
  1. It first explores the internal structure of the LSTM so that hidden states encode individual variables.
  2. It then uses mixture attention to summarize these variable-wise hidden states for prediction.
Let's first define some mathematical symbols.

The IMV-LSTM Structure

The idea of IMV-LSTM is to use a hidden state matrix, together with an associated update scheme, such that each element (e.g., row) of the hidden matrix encapsulates information exclusively from one variable of the input.



The hidden state update function is constructed in a manner similar to the regular LSTM.
The authors propose two sets of update equations:

The difference between the two approaches is that equation set 1 first flattens the matrices into vectors and then restores them to matrices, whereas equation set 2 extends the regular LSTM with tensor operations and operates on the matrices directly. The goal of both approaches is the same: keep the variables independent during propagation.
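To make the "variables stay independent" idea concrete, here is a minimal NumPy sketch in the spirit of the second (tensor-based) approach. It is my own illustration, not the authors' code: the function names, the parameter layout (`W`, `U`, `b` per gate), and the choice to compute every gate variable-wise are assumptions. Each variable `n` owns its own slice of the parameters, so row `n` of the hidden matrix only ever sees `x_t[n]` and its own previous row.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(N, d, seed=0):
    """Hypothetical parameter layout: one (W, U, b) triple per gate.
    W: (N, d, d) per-variable recurrent weights, U: (N, d) per-variable
    input weights, b: (N, d) per-variable biases."""
    rng = np.random.default_rng(seed)
    return {
        gate: (rng.normal(0.0, 0.3, (N, d, d)),
               rng.normal(0.0, 0.3, (N, d)),
               np.zeros((N, d)))
        for gate in "jifo"
    }

def variable_wise_lstm_step(x_t, H_prev, C_prev, params):
    """One LSTM-like step on a hidden state matrix.
    x_t: (N,) input values, H_prev and C_prev: (N, d) matrices whose
    row n belongs to variable n."""
    def pre_activation(gate):
        W, U, b = params[gate]
        # Row n of H_prev is multiplied only by W[n]; x_t[n] only scales U[n].
        return np.einsum("nde,ne->nd", W, H_prev) + U * x_t[:, None] + b

    J = np.tanh(pre_activation("j"))       # candidate update
    I_gate = sigmoid(pre_activation("i"))  # input gate
    F = sigmoid(pre_activation("f"))       # forget gate
    O = sigmoid(pre_activation("o"))       # output gate
    C = F * C_prev + I_gate * J            # element-wise, so rows never mix
    H = O * np.tanh(C)
    return H, C

def run_sequence(X, params, d):
    """Unroll the step over X of shape (T, N); returns the final H of shape (N, d)."""
    N = X.shape[1]
    H, C = np.zeros((N, d)), np.zeros((N, d))
    for x_t in X:
        H, C = variable_wise_lstm_step(x_t, H, C, params)
    return H
```

A quick way to check the independence property is to perturb one input variable and confirm that only its own row of the final hidden matrix changes.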


Mixture Attention

Mixture attention is what makes the IMV-LSTM model interpretable. The mixture attention is formulated as

The notations are defined by

The loss function is defined by
Lemma 3.3 of the paper ensures that, during the EM procedure, the loss function above upper-bounds the negative log-likelihood.

Therefore, minimizing Eq. (9) makes it possible to learn the network parameters and the importance vectors simultaneously, without any post-processing of the trained network.
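The following NumPy sketch shows how I picture the mixture attention and the upper-bound property. It is a simplified illustration under my own assumptions, not the paper's exact formulation: the scoring functions are plain linear maps (`w_time`, `w_var`, `w_mu` are hypothetical parameters), each variable's component is a Gaussian with a fixed `sigma`, and the importance-vector term of the loss is left out.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mixture_attention(H_seq, w_time, w_var, w_mu):
    """H_seq: (T, N, d) variable-wise hidden states over T steps.
    w_time: (N, d), w_var: (2d,), w_mu: (N, 2d) are assumed linear scorers.
    Returns temporal attention alpha (N, T), variable weights pi (N,),
    and per-variable predicted means mu (N,)."""
    scores = np.einsum("tnd,nd->nt", H_seq, w_time)
    alpha = softmax(scores, axis=1)                    # attention over time, per variable
    g = np.einsum("nt,tnd->nd", alpha, H_seq)          # per-variable context vector
    summary = np.concatenate([H_seq[-1], g], axis=1)   # last state joined with context
    pi = softmax(summary @ w_var)                      # attention over variables
    mu = np.sum(summary * w_mu, axis=1)                # one mean per variable
    return alpha, pi, mu

def log_joint(y, pi, mu, sigma=1.0):
    """log of pi_n * Normal(y | mu_n, sigma) for each variable n."""
    log_lik = -0.5 * ((y - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))
    return np.log(pi) + log_lik

def nll_and_posterior(y, pi, mu, sigma=1.0):
    """Negative log-likelihood of the mixture and the posterior over variables."""
    lj = log_joint(y, pi, mu, sigma)
    m = lj.max()
    log_evidence = m + np.log(np.exp(lj - m).sum())    # stable log-sum-exp
    q = np.exp(lj - log_evidence)
    return -log_evidence, q

def surrogate_loss(y, pi, mu, q, sigma=1.0):
    """Expected negative log joint under q (the EM-style objective, simplified)."""
    return -np.sum(q * log_joint(y, pi, mu, sigma))
```

In this simplified setting, `surrogate_loss` is never smaller than the negative log-likelihood returned by `nll_and_posterior`, which is the kind of bound the lemma refers to.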

Interpretation

After training, a simple closed-form solution of the variable importance vector $I$ can be derived

where
And the temporal importance vector can also be derived

where



Prediction

In the prediction phase, the prediction of $y_{T+1}$ is obtained as the weighted sum of means:

where $\mu_n$ is the mean predicted for the $n$-th variable, weighted by the corresponding variable-level attention weight.
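As a small worked example, the prediction step and the importance vectors from the Interpretation section can be sketched as below. This reflects my reading rather than the paper's exact equations: I assume `mu` holds the per-variable means, `pi` the variable-level attention weights, and that the importance vectors are obtained by averaging per-sample weights (`q` for variables, `alpha` for time steps) over the training samples. All function and argument names are hypothetical.

```python
import numpy as np

def predict(pi, mu):
    """Weighted sum of per-variable means.
    pi, mu: (M, N) for M samples and N variables -> predictions of shape (M,)."""
    return np.sum(pi * mu, axis=-1)

def variable_importance(q):
    """Average per-sample variable weights q of shape (M, N) into one (N,) vector."""
    return q.mean(axis=0)

def temporal_importance(alpha):
    """Average per-sample temporal attention alpha of shape (M, N, T) into (N, T)."""
    return alpha.mean(axis=0)

# Two samples, two variables.
pi = np.array([[0.2, 0.8], [0.5, 0.5]])
mu = np.array([[1.0, 2.0], [4.0, 0.0]])
y_hat = predict(pi, mu)  # 0.2*1 + 0.8*2 = 1.8 and 0.5*4 + 0.5*0 = 2.0
```

Because each row of `q` and each `alpha[m, n]` sums to one, the averaged importance vectors are themselves normalized, which makes them easy to compare across variables and time steps.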

[1] Guo, Tian, Tao Lin, and Nino Antulov-Fantulin. "Exploring Interpretable LSTM Neural Networks over MultiVariable Data." International Conference on Machine Learning (ICML), 2019.
