Exploring Interpretable LSTM Neural Networks over MultiVariable Data

These are my reading notes on the paper of the same title [1]. The paper proposes a new LSTM structure whose predictions can be interpreted with the help of mixture attention, which provides both variable importance and temporal importance.

Background

RNNs trained over multi-variable data capture the nonlinear correlation between historical values of the target and exogenous variables and the future target values. However, current RNNs fall short on interpretability for multi-variable data because their hidden states are opaque. Existing work on making recurrent neural networks more interpretable rarely touches the internal structure of the RNN, so the opacity of hidden states on multi-variable data remains. This paper aims for a unified framework that delivers both accurate forecasting and importance interpretation.

Proposed Model

The model basically does two things:

First, it explores the internal structure of the LSTM so that the hidden states encode individual variables.
Then, mixture attention is designed to summarize these variable-wise hidden states for prediction.

Let's first define some mathematical symbols

The IMV-LSTM Structure

The idea of IMV-LSTM is to use a hidden state matrix and to develop an associated update scheme such that each element (e.g. row) of the hidden matrix encapsulates information exclusively from one variable of the input.

The hidden state update function is constructed in a similar manner as the regular LSTM

The authors proposed two sets of approaches

The difference between the two approaches is that equation set 1 first flattens the matrices into vectors and then restores them to matrices, whereas equation set 2 extends the regular LSTM with tensor operations and works on the matrices directly. The goal of both approaches is the same: keep the variables independent during propagation.
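To make the "one row per variable" idea concrete, here is a minimal sketch of a single recurrent step in the spirit of equation set 2. It is not the authors' implementation: the function names, the `params` layout, and the assumption that each variable is a scalar at every time step are mine, and I use plain Python lists instead of a tensor library so that the per-variable loop stays visible.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def pre_activation(W, u, b, h_row, x):
    """Compute W @ h_row + u * x + b for one variable (W is d x d)."""
    return [sum(w * h for w, h in zip(W_row, h_row)) + u_k * x + b_k
            for W_row, u_k, b_k in zip(W, u, b)]

def variable_wise_step(x_t, h_prev, c_prev, params):
    """One recurrent step in which row n only ever sees variable n.

    x_t:    N scalars, one per input variable
    h_prev: N x d hidden state matrix (list of rows)
    c_prev: N x d cell state matrix (list of rows)
    params: params[gate][n] = (W, u, b) for gate in "i", "f", "o", "j"
            (hypothetical layout chosen for this sketch)
    """
    h_new, c_new = [], []
    for n, x in enumerate(x_t):
        gates = {g: pre_activation(*params[g][n], h_prev[n], x) for g in "ifoj"}
        i = [sigmoid(a) for a in gates["i"]]      # input gate
        f = [sigmoid(a) for a in gates["f"]]      # forget gate
        o = [sigmoid(a) for a in gates["o"]]      # output gate
        j = [math.tanh(a) for a in gates["j"]]    # candidate update
        c = [fk * ck + ik * jk for fk, ck, ik, jk in zip(f, c_prev[n], i, j)]
        c_new.append(c)
        h_new.append([ok * math.tanh(ck) for ok, ck in zip(o, c)])
    return h_new, c_new

def random_params(N, d, seed=0):
    """Small random weights, only to make the sketch runnable."""
    rng = random.Random(seed)
    def one():
        W = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(d)]
        u = [rng.uniform(-0.5, 0.5) for _ in range(d)]
        b = [0.0] * d
        return W, u, b
    return {g: [one() for _ in range(N)] for g in "ifoj"}

N, d = 3, 2
params = random_params(N, d)
h0 = [[0.0] * d for _ in range(N)]
c0 = [[0.0] * d for _ in range(N)]
h_a, _ = variable_wise_step([1.0, 2.0, 3.0], h0, c0, params)
h_b, _ = variable_wise_step([9.0, 2.0, 3.0], h0, c0, params)  # only variable 0 changed
print(h_a[1] == h_b[1], h_a[2] == h_b[2])  # rows 1 and 2 are unaffected
```

Because every row has its own weights and only reads its own previous row and its own input, changing one variable cannot leak into the hidden state of another. That is the property the later attention step relies on.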

Mixture Attention

Mixture attention is what makes the IMV-LSTM model interpretable. It is formulated as
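As I read the paper, the predictive density is a mixture over a latent variable $z_{T+1}$ that selects which input variable drives the prediction. The notation below is my own transcription and may differ in detail from the original:

$$
p(y_{T+1} \mid X_T) = \sum_{n=1}^{N} p\left(y_{T+1} \mid z_{T+1} = n,\ h_T^n \oplus g^n\right) \cdot \Pr\left(z_{T+1} = n \mid h_T^1 \oplus g^1, \dots, h_T^N \oplus g^N\right)
$$

Here $h_T^n \oplus g^n$ is the per-variable summary: the last hidden state of variable $n$ concatenated with its temporally attended context vector $g^n$.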

The notations are defined by
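To see how the pieces fit together, here is a small sketch of the forward pass for one sample. It assumes linear scoring functions throughout, which is a simplification of mine; the names `w_att`, `w_mean` and `w_var` are hypothetical and do not come from the paper. For each variable, temporal attention over its hidden states gives a context vector, the concatenation of the last hidden state and that context gives a per-variable mean, and a softmax over variables gives the mixture weights.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def temporal_attention(h_seq, w):
    """h_seq: T hidden vectors of one variable; w: scoring vector (assumed linear scorer).

    Returns the attention weights over time and the attended context vector.
    """
    alpha = softmax([dot(w, h) for h in h_seq])
    d = len(h_seq[0])
    g = [sum(a * h[k] for a, h in zip(alpha, h_seq)) for k in range(d)]
    return alpha, g

def mixture_attention(h_seqs, w_att, w_mean, w_var):
    """h_seqs[n] is the hidden sequence of variable n (T vectors of length d).

    Returns per-variable means, variable-level mixture weights, and
    per-variable temporal attention weights.
    """
    mus, scores, alphas = [], [], []
    for n, h_seq in enumerate(h_seqs):
        alpha, g = temporal_attention(h_seq, w_att[n])
        summary = h_seq[-1] + g               # list concatenation: last state joined with context
        mus.append(dot(w_mean[n], summary))   # mean of the n-th component
        scores.append(dot(w_var, summary))    # unnormalized variable score
        alphas.append(alpha)
    return mus, softmax(scores), alphas

h_seqs = [
    [[0.1, 0.2], [0.3, 0.1], [0.5, 0.4]],   # variable 0, T = 3, d = 2
    [[0.0, 0.1], [0.2, 0.2], [0.1, 0.3]],   # variable 1
]
w_att = [[1.0, -1.0], [0.5, 0.5]]
w_mean = [[0.2, 0.1, 0.3, 0.4], [0.1, 0.1, 0.2, 0.2]]
w_var = [0.3, -0.2, 0.1, 0.5]
mus, pis, alphas = mixture_attention(h_seqs, w_att, w_mean, w_var)
print(mus, pis, alphas)
```

The two softmaxes are what later get read as importance: `alphas[n]` says which time steps matter within variable `n`, and `pis` says which variable matters for this sample.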

The loss function is defined by

Lemma 3.3 of the paper ensures that, during the EM procedure, this loss function upper-bounds the negative log-likelihood.

Therefore, minimizing the loss in Eq. (9) learns the network parameters and the importance vectors simultaneously, with no post-processing of the trained network.
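The upper-bound claim can be checked numerically on a toy example. The sketch below assumes Gaussian components and my reading of the loss as three expected negative log terms under the posterior $q$: the component likelihood, the attention-derived mixture weight, and the prior given by $I$. All numbers and function names are made up for illustration.

```python
import math

def gaussian_logpdf(y, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2 * sigma ** 2)

def neg_log_likelihood(y, mus, sigmas, pis):
    """Negative log of the mixture density at y."""
    lik = sum(p * math.exp(gaussian_logpdf(y, m, s))
              for m, s, p in zip(mus, sigmas, pis))
    return -math.log(lik)

def posterior(y, mus, sigmas, pis):
    """E-step: probability that component n generated y."""
    joint = [p * math.exp(gaussian_logpdf(y, m, s))
             for m, s, p in zip(mus, sigmas, pis)]
    total = sum(joint)
    return [j / total for j in joint]

def em_loss(y, mus, sigmas, pis, q, importance):
    """Expected negative log terms under q (assumed form of the loss for one sample)."""
    return -sum(qn * (gaussian_logpdf(y, m, s) + math.log(p) + math.log(i_n))
                for qn, m, s, p, i_n in zip(q, mus, sigmas, pis, importance))

y = 1.0
mus, sigmas = [0.8, 1.5, 0.2], [0.5, 0.5, 0.5]
pis = [0.5, 0.3, 0.2]            # mixture weights from attention
importance = [0.4, 0.4, 0.2]     # current estimate of I

q = posterior(y, mus, sigmas, pis)
print(em_loss(y, mus, sigmas, pis, q, importance) >= neg_log_likelihood(y, mus, sigmas, pis))
```

The inequality follows from Jensen's inequality plus the fact that the entropy of $q$ and the term $-\log \Pr(z = n \mid I)$ are both non-negative, so it holds for any valid $q$, not only the posterior.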

Interpretation

After training, a simple closed-form solution of the variable importance vector $I$ can be derived

where

And the temporal importance vector can also be derived

where
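As I understand the closed-form solutions, both importance vectors are averages over the training samples: the variable importance averages the posterior over variables, and the temporal importance of a variable averages its temporal attention weights. The sketch below assumes exactly that reading; the function names and the toy inputs are mine.

```python
def variable_importance(posteriors):
    """posteriors[m][n]: posterior probability that variable n explains sample m.

    Returns a length-N vector averaged over the M samples.
    """
    M = len(posteriors)
    N = len(posteriors[0])
    return [sum(q[n] for q in posteriors) / M for n in range(N)]

def temporal_importance(attentions):
    """attentions[m][n][t]: attention weight of time step t for variable n in sample m.

    Returns an N x T matrix averaged over the M samples.
    """
    M = len(attentions)
    N = len(attentions[0])
    T = len(attentions[0][0])
    return [[sum(a[n][t] for a in attentions) / M for t in range(T)]
            for n in range(N)]

posteriors = [[0.7, 0.3], [0.5, 0.5]]                 # M = 2 samples, N = 2 variables
attentions = [
    [[0.2, 0.8], [0.5, 0.5]],                         # sample 0, T = 2
    [[0.4, 0.6], [0.1, 0.9]],                         # sample 1
]
print(variable_importance(posteriors))
print(temporal_importance(attentions))
```

Since each posterior and each attention vector sums to one, the averages do too, so both results can be read directly as relative importance.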

Prediction

In the prediction phase, the prediction of $y_{T+1}$ is obtained as the weighted sum of the means:

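where $\mu_n$ is the mean of the $n$-th mixture component, i.e. the prediction made from variable $n$ alone, and the weights are the variable-level mixture probabilities.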
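As a tiny worked example of the weighted sum, assume two variables whose components predict means 2.0 and 4.0, with mixture weights 0.25 and 0.75. The function name and the numbers are made up for illustration.

```python
def predict(mus, weights):
    """Point prediction as the weight-averaged component means."""
    return sum(m * w for m, w in zip(mus, weights))

print(predict([2.0, 4.0], [0.25, 0.75]))  # 3.5
```

The prediction is pulled toward the second variable's mean because that variable carries most of the weight for this sample.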

[1] Guo, Tian, Tao Lin, and Nino Antulov-Fantulin. "Exploring Interpretable LSTM Neural Networks over MultiVariable Data." International Conference on Machine Learning (ICML), 2019.

PandaCid's Blog
