This paper is a direct extension of the multi-pointer co-attention paper. It adds an explanation generation objective and jointly trains the rating prediction and explanation tasks. Please refer to my other post for multi-pointer co-attention learning.

Figure 1: Co-Attentive Multi-Task Learning for Explainable Recommendation

From Figure 1, one major difference from the previous post is Task 2. Task 2 itself is a GRU network used to generate textual explanations. Let $\boldsymbol{o}_t$ denote its output at step $t$, the distribution over candidate words, and let $Y=(y_1, \ldots, y_T)$ be the generated text. Task 2 contributes two additional losses:

1. Concept relevance loss $\mathcal{L}_c$. During training, $\mathcal{L}_c$ increases the probability that the selected concepts appear in $Y$. It is computed from the per-step word distributions $\boldsymbol{o}_t$ (see the sketch after this list).
2. Negative log-likelihood loss $\mathcal{L}_n$, which ensures that the generated words stay close to the ground-truth ones (see the sketch after this list).

Plus, there is the original rating prediction loss $\mathcal{L}_r$ from Task 1. The model jointly trains both tasks by minimizing a combination of the three losses.
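As a rough sketch (my own formulation, not the paper's exact equations): writing $\boldsymbol{o}_t[w]$ for the probability of word $w$ at step $t$ and $C$ for the set of selected concepts, the two losses can be expressed as

$$\mathcal{L}_n = -\frac{1}{T} \sum_{t=1}^{T} \log \boldsymbol{o}_t[y_t], \qquad \mathcal{L}_c = -\frac{1}{|C|} \sum_{c \in C} \log \max_{1 \le t \le T} \boldsymbol{o}_t[c],$$

and the joint objective as a weighted sum

$$\mathcal{L} = \mathcal{L}_r + \lambda_c \mathcal{L}_c + \lambda_n \mathcal{L}_n,$$

where $\lambda_c$ and $\lambda_n$ are task-weighting hyperparameters (the weighting scheme here is an assumption on my part).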
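To make Task 2 concrete, here is a minimal PyTorch sketch, assuming teacher forcing and an initial GRU state derived from the rating model. `ExplanationDecoder`, `explanation_losses`, and `concept_ids` are hypothetical names of my own; the real model conditions the decoder on the co-attention outputs of Task 1.

```python
# A minimal sketch of Task 2 (not the authors' code): a GRU decoder that
# emits a word distribution o_t at each step, plus the two explanation
# losses sketched above.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExplanationDecoder(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens: torch.Tensor, h0: torch.Tensor) -> torch.Tensor:
        """tokens: (B, T) ground-truth words (teacher forcing).
        h0: (1, B, H) initial state, e.g. derived from the rating model.
        Returns per-step log-probabilities o_t of shape (B, T, V)."""
        emb = self.embed(tokens)
        hidden, _ = self.gru(emb, h0)
        return F.log_softmax(self.out(hidden), dim=-1)


def explanation_losses(log_probs, targets, concept_ids):
    """log_probs: (B, T, V); targets: (B, T); concept_ids: (B, K) selected concepts.
    Returns (L_n, L_c) as scalars."""
    B, T, V = log_probs.shape
    # L_n: standard negative log-likelihood of the ground-truth words.
    l_n = F.nll_loss(log_probs.reshape(B * T, V), targets.reshape(B * T))
    # L_c (assumed form): push up the best per-step probability of each
    # selected concept, so it is likely to appear somewhere in Y.
    idx = concept_ids.unsqueeze(1).expand(B, T, -1)          # (B, T, K)
    concept_logp = log_probs.gather(2, idx)                  # (B, T, K)
    l_c = -concept_logp.max(dim=1).values.mean()
    return l_n, l_c
```

A training step would then combine these with the rating loss, e.g. `loss = l_r + lambda_c * l_c + lambda_n * l_n`, mirroring the weighted joint objective above.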