
Study notes: Multi-Pointer Co-Attention Networks for Recommendation

Traditional collaborative filtering methods usually rely only on user-item rating pairs for recommendation and ignore the large amount of available metadata. With the recent rapid development of deep learning techniques, neural recommendation methods are emerging, and most of them use this metadata to improve personalized recommendation significantly. This paper is one example: a neural architecture for recommendation that uses user reviews.

In this post I only explain the model itself. For the detailed experiments and background, please refer to the original paper.

Problem Formulation

Inputs: user ID $a$, item ID $b$, user $a$'s review set $\boldsymbol{d}_a= \{d_{a1},..., d_{al_a}\}$ and item $b$'s review set $\boldsymbol{d}_b= \{d_{b1},..., d_{bl_b}\}$.
Note that $\boldsymbol{d}_a$ contains all reviews written by user $a$; similarly, $\boldsymbol{d}_b$ contains all reviews received by item $b$.

Outputs: the predicted rating $\hat{r}_{ab}$ for user $a$ and item $b$.

The Model

The model is called multi-pointer co-attention network (MPCN). The full network is shown in Figure 1.

Figure 1: Multi-Pointer Co-attention Network, credit

It is a little complex, so let's decompose it and study it step by step, from the inputs to the outputs.

Input Encoding

Figure 2: Input encoding

As shown in Figure 2, input encoding turns the input list of reviews into an array of vectors. The maximum number of reviews for a user $a$ or an item $b$ is set to $l_r$, and each review consists of $l_w$ words represented by one-hot encoding: the $i$th word of the review is represented by the one-hot vector $w_i$, so a review can be represented by $(w_1, ..., w_{l_w})$.

To encode reviews as vectors, since each word is a one-hot vector, a review's vector can be calculated as the sum of its constituent word embeddings, i.e.
\[x = \sum f_i w_i\]
where $f_i$ represents the frequency of the corresponding word in the review.
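As a rough illustration of this step, the sketch below sums word embeddings over a review and checks that this equals a frequency-weighted sum over the vocabulary. The embedding matrix `E`, the sizes, and the helper name `embed_review` are assumptions made for the example, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, d = 6, 4                   # hypothetical vocabulary and embedding sizes
E = rng.normal(size=(vocab_size, d))   # hypothetical embedding matrix, one row per word


def embed_review(word_ids, E):
    """Sum the embeddings of every word occurrence in the review."""
    return E[np.asarray(word_ids)].sum(axis=0)


review = [2, 5, 2, 0]                  # word indices; word 2 occurs twice
x = embed_review(review, E)

# Equivalent view: weight each vocabulary embedding by the word's frequency.
counts = np.bincount(review, minlength=vocab_size)
x_freq = counts @ E
```

Both `x` and `x_freq` are the same $d$-dimensional review vector; the second form makes the role of the frequencies $f_i$ explicit.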

The authors also note that not all reviews for a product are important. They apply a review gating mechanism to weigh the importance of the different reviews written by one user or received by one product, i.e.
\[\bar{x}_i = \sigma(\boldsymbol{W}_gx_i) + \boldsymbol{b}_g \odot \tanh(\boldsymbol{W}_u x_i + \boldsymbol{b}_u)\]
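A minimal sketch of the gating step follows. It assumes the usual gate form, with the bias $\boldsymbol{b}_g$ inside the sigmoid, i.e. $\sigma(\boldsymbol{W}_g x_i + \boldsymbol{b}_g) \odot \tanh(\boldsymbol{W}_u x_i + \boldsymbol{b}_u)$; this is my reading of the equation rather than something it states literally. The sizes and random parameters are placeholders.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gate_reviews(X, W_g, b_g, W_u, b_u):
    """Gate each review embedding (one per row of X).

    ASSUMPTION: b_g sits inside the sigmoid, giving
    sigmoid(W_g x + b_g) * tanh(W_u x + b_u).
    """
    gate = sigmoid(X @ W_g.T + b_g)   # in (0, 1): how much of each dimension passes
    cand = np.tanh(X @ W_u.T + b_u)   # in (-1, 1): transformed review content
    return gate * cand


rng = np.random.default_rng(0)
l_r, d = 5, 4                         # hypothetical number of reviews and dimension
X = rng.normal(size=(l_r, d))
W_g, W_u = rng.normal(size=(d, d)), rng.normal(size=(d, d))
b_g, b_u = np.zeros(d), np.zeros(d)

X_bar = gate_reviews(X, W_g, b_g, W_u, b_u)
```

Under this reading, a gate value close to 0 suppresses a review's contribution before it reaches the co-attention layers.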

This concludes the encoding.

Review-level Co-Attention

Figure 3: Review-level co-attention

From the last section, assume the dimension of each review encoding is $d$. Because the maximum number of reviews for a user or an item is $l_r$, we have the review embeddings for the user ($a\in R^{l_r \times d}$) and the item ($b\in R^{l_r \times d}$). The co-attention matrix can then be calculated by
\[ s_{ij} = F(a_i)^T \boldsymbol{M} F(b_j)\]

where $\boldsymbol{M}$ is a $d\times d$ parameter matrix, $s_{ij}$ is the $(i,j)$ entry of the co-attention matrix, and $F(\cdot)$ is a feed-forward neural network.
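The bilinear score can be computed for all review pairs at once. The sketch below assumes a single ReLU layer for $F(\cdot)$ shared by both sides; the depth, the activation, and the parameter names `W` and `b` are assumptions for illustration only.

```python
import numpy as np


def feed_forward(X, W, b):
    """Hypothetical one-layer F(.) with ReLU, applied to each row of X."""
    return np.maximum(0.0, X @ W + b)


def co_attention(A, B, W, b, M):
    """Return S with S[i, j] = F(a_i)^T M F(b_j)."""
    FA = feed_forward(A, W, b)        # (l_r, d)
    FB = feed_forward(B, W, b)        # (l_r, d)
    return FA @ M @ FB.T              # (l_r, l_r)


rng = np.random.default_rng(0)
l_r, d = 5, 4                         # hypothetical sizes
A = rng.normal(size=(l_r, d))         # user review embeddings
B = rng.normal(size=(l_r, d))         # item review embeddings
W, b = rng.normal(size=(d, d)), np.zeros(d)
M = rng.normal(size=(d, d))

S = co_attention(A, B, W, b, M)
```

Row $i$ of `S` holds the affinity of user review $i$ with every item review, and column $j$ does the same for item review $j$.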

Pooling and Review Pointers

Figure 4: Pooling and Review Pointers

The idea here is to apply a pooling operation to the co-attention matrix and obtain a one-hot vector that serves as a review pointer, which points to the most important review for the user/item.

One may try softmax followed by argmax (to produce a one-hot vector). However, this approach does not work here because argmax is not differentiable and blocks backpropagation. Instead, the authors propose to use the Gumbel-softmax function in place of softmax.

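Gumbel-softmax enables discrete random variables (e.g. one-hot vectors) to be used within an end-to-end neural network architecture. In Gumbel-softmax, the argmax function is replaced by the differentiable softmax function:
\[y_i = \frac{\exp\left((\log \pi_i + g_i)/\tau\right)}{\sum_{j=1}^{k} \exp\left((\log \pi_j + g_j)/\tau\right)}\]
where $\pi_1, ..., \pi_k$ are the class probabilities, $\tau$, the temperature parameter, controls the extent to which the output approaches a one-hot vector, and $g_i$ are Gumbel noise samples.

In the forward pass, the one-hot vector is obtained via argmax:
\[y_i = \begin{cases} 1, & i = \arg\max_j (\log \pi_j + g_j) \\ 0, & \text{otherwise} \end{cases}\]

However, the backward pass maintains the flow of continuous gradients using Gumbel-softmax.
Hence the review vectors retrieved by the review pointers are calculated by
\[a' = G\big(\max_{j} s_{ij}\big)^\top a, \qquad b' = G\big(\max_{i} s_{ij}\big)^\top b\]
where $a'$ and $b'$ are the selected review vectors, $G$ is the Gumbel-softmax, and $s$ is the co-attention matrix. The max over $j$ gives one score per user review, and the max over $i$ gives one score per item review.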
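The forward pass of the pointer step can be sketched as follows: max-pool the co-attention matrix, draw a Gumbel-softmax sample, and harden it to a one-hot pointer that selects one review. Treating the pooled scores directly as logits, the temperature value, and the random stand-in matrices are assumptions of this sketch. NumPy has no automatic differentiation, so the gradient path through the soft sample is not shown.

```python
import numpy as np


def gumbel_softmax(logits, tau, rng):
    """Soft sample: softmax((logits + g) / tau) with Gumbel(0, 1) noise g."""
    g = rng.gumbel(size=logits.shape)
    z = (logits + g) / tau
    z = z - z.max()                    # subtract the max for numerical stability
    y = np.exp(z)
    return y / y.sum()


def hard_pointer(y_soft):
    """One-hot vector used in the forward pass: argmax of the soft sample."""
    one_hot = np.zeros_like(y_soft)
    one_hot[np.argmax(y_soft)] = 1.0
    return one_hot


rng = np.random.default_rng(0)
l_r, d = 5, 4                          # hypothetical sizes
A = rng.normal(size=(l_r, d))          # user review embeddings
B = rng.normal(size=(l_r, d))          # item review embeddings
S = rng.normal(size=(l_r, l_r))        # stand-in co-attention matrix

# Max over columns gives one score per user review; max over rows, per item review.
p_a = hard_pointer(gumbel_softmax(S.max(axis=1), tau=0.5, rng=rng))
p_b = hard_pointer(gumbel_softmax(S.max(axis=0), tau=0.5, rng=rng))

a_sel = p_a @ A                        # the single user review the pointer picks
b_sel = p_b @ B                        # the single item review the pointer picks
```

Because the pointer is one-hot, the matrix product simply copies one row, so `a_sel` is an actual review embedding rather than a weighted average.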

Word-level Co-Attention

The review-level co-attention smooths over word information because it compresses each review into a single embedding. However, the design of the model allows the most informative reviews to be extracted through the pointers. These reviews can then be compared and modeled at the word level.

Figure 5: Word-level co-attention

Let $\bar{a}$ and $\bar{b}$ be the reviews selected by the pointer learning scheme. Similar to the review-level co-attention, the co-attention matrix is computed word by word (not review by review):
\[w_{ij} = F(\bar{a}_i)^T \boldsymbol{M}_w F(\bar{b}_j)\]

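Next, apply mean pooling (instead of the max pooling used in review-level co-attention) to get the co-attentional representations of reviews $\bar{a}$ and $\bar{b}$:
\[\hat{a}' = S\big(\operatorname{avg}_{j} w_{ij}\big)^\top \bar{a}, \qquad \hat{b}' = S\big(\operatorname{avg}_{i} w_{ij}\big)^\top \bar{b}\]
where $S$ is the standard softmax function.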
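A small sketch of the word-level pooling, assuming that the mean-pooled scores are normalized with a plain softmax and used as attention weights over the words of each selected review. The matrix `W_att` is a random stand-in for the word-level co-attention matrix, and all sizes are hypothetical.

```python
import numpy as np


def softmax(z):
    z = z - z.max()                    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


rng = np.random.default_rng(0)
l_w, d = 6, 4                          # hypothetical words per review and dimension
A_words = rng.normal(size=(l_w, d))    # word embeddings of the selected user review
B_words = rng.normal(size=(l_w, d))    # word embeddings of the selected item review
W_att = rng.normal(size=(l_w, l_w))    # stand-in word-level co-attention matrix

alpha = softmax(W_att.mean(axis=1))    # one weight per word of the user review
beta = softmax(W_att.mean(axis=0))     # one weight per word of the item review

a_rep = alpha @ A_words                # attention-weighted user review representation
b_rep = beta @ B_words                 # attention-weighted item review representation
```

Unlike the one-hot review pointer, the weights here are soft, so every word contributes in proportion to its average affinity with the other review.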

Multi-Pointer Learning

All of the above discussion involves only a single pointer. Multi-pointer learning, as its name indicates, generates multiple pointers instead of one. With different Gumbel noise initializations, we can generate $n_p$ pointers for the user and for the item. In the word-level co-attention case, the results are $\{\hat{a}_1',...,\hat{a}_{n_p}'\}$ and $\{\hat{b}_1',...,\hat{b}_{n_p}'\}$.

Different methods can be used to aggregate the multiple pointers; the authors of this paper choose summation.

Figure 6: Aggregating multiple pointers by summation
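The multi-pointer idea can be sketched by drawing several one-hot pointers with independent Gumbel noise and summing what they select. To keep the example short, it sums the selected review embeddings directly, standing in for the per-pointer word-level representations; that simplification, the score vector, and the sizes are assumptions of this sketch.

```python
import numpy as np


def sample_one_hot(scores, tau, rng):
    """Forward-pass pointer: argmax of the Gumbel-perturbed scores."""
    g = rng.gumbel(size=scores.shape)
    one_hot = np.zeros_like(scores)
    one_hot[np.argmax((scores + g) / tau)] = 1.0
    return one_hot


rng = np.random.default_rng(0)
l_r, d, n_p = 5, 4, 3                  # hypothetical reviews, dimension, pointers
A = rng.normal(size=(l_r, d))          # user review embeddings
scores = rng.normal(size=l_r)          # stand-in for max-pooled co-attention scores

# Each pointer sees fresh Gumbel noise, so different pointers may pick different reviews.
pointers = np.stack([sample_one_hot(scores, 0.5, rng) for _ in range(n_p)])
selected = pointers @ A                # (n_p, d): one representation per pointer
aggregated = selected.sum(axis=0)      # (d,): aggregation by summation
```

Summation keeps the output dimension fixed at $d$ regardless of $n_p$, whereas concatenation would grow it to $n_p \cdot d$.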

Prediction Layer

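The prediction layer is a factorization machine (FM). Assuming $a_f$ and $b_f$ are the outputs of the previous layers, the FM function is defined as
\[F(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j\]
where $x = [a_f; b_f]$ is the concatenation of the two outputs, $n$ is its dimension, $w_0$ is the global bias, $w_i$ are the first-order weights, and $\langle v_i, v_j \rangle$ is the inner product of the factor vectors of features $i$ and $j$.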
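As a sketch of the prediction step, the code below implements the standard second-order factorization machine, $w_0 + \sum_i w_i x_i + \sum_{i<j} \langle v_i, v_j \rangle x_i x_j$, using the well-known linear-time rewrite of the pairwise term. Feeding it the concatenation of $a_f$ and $b_f$, the factor size `k`, and the random parameters are assumptions for illustration.

```python
import numpy as np


def factorization_machine(x, w0, w, V):
    """Second-order FM: w0 + <w, x> + sum_{i<j} <v_i, v_j> x_i x_j."""
    linear = w0 + w @ x
    # Pairwise term via the identity
    # 0.5 * sum_f [ (sum_i V[i, f] x_i)^2 - sum_i V[i, f]^2 x_i^2 ].
    xv = x @ V
    pairwise = 0.5 * np.sum(xv ** 2 - (x ** 2) @ (V ** 2))
    return linear + pairwise


rng = np.random.default_rng(0)
d, k = 4, 3                            # hypothetical feature and factor sizes
a_f, b_f = rng.normal(size=d), rng.normal(size=d)
x = np.concatenate([a_f, b_f])         # ASSUMPTION: FM input is [a_f; b_f]

w0 = 0.1
w = rng.normal(size=2 * d)
V = rng.normal(size=(2 * d, k))        # one k-dimensional factor vector per feature

r_hat = factorization_machine(x, w0, w, V)
```

The rewrite avoids the explicit double loop over feature pairs, which matters once the concatenated vector becomes large.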

Reference: Tay, Yi, Anh Tuan Luu, and Siu Cheung Hui. "Multi-pointer co-attention networks for recommendation." Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018.
