
Study notes: Representation Learning on Graphs with Jumping Knowledge Networks

Source: Xu, Keyulu, et al. "Representation learning on graphs with jumping knowledge networks." arXiv preprint arXiv:1806.03536 (2018).

Problems to be addressed

There are two major problems to be addressed: degradation as the network goes deeper, and bias arising from the graph structure.

Degradation as network goes deeper

Although recent developments in graph neural networks have achieved state-of-the-art results on several tasks over graph-structured data, there is concern that most of them face the same challenge CNNs once did: performance degrades as the network grows deeper.

Many of the recent approaches broadly follow a neighborhood aggregation (or "message passing") scheme, which has been shown to generalize the Weisfeiler-Lehman graph isomorphism test.

Yet, such aggregation schemes sometimes lead to surprises. For example, it has been observed that the best performance with one of the state-of-the-art models, Graph Convolutional Networks (GCN), is achieved with a 2-layer model. Deeper versions of the model that, in principle, have access to more information, perform worse [1].

In CNNs this problem is mitigated by residual connections, but simply adding residual connections to GCNs does not resolve it.

Graph structure imbalance

In real applications such as citation networks and social networks, the majority of nodes have few connections, whereas some nodes (hubs) are connected to many others. Social and web networks usually consist of an expander-like core part and an almost-tree (bounded-treewidth) part, which represent the well-connected entities and the small communities respectively.

The subgraph structure has a great impact on the result of neighborhood aggregation. For example, a 4-step random walk (its equivalence to neighborhood aggregation is shown later) spreads over completely different neighborhoods depending on the starting point (core or tree). See Figure 1.

Aggregation schemes

Given the following notations:

  • $G=(V, E)$, a simple graph
  • $X_v \in \mathbb{R}^{d_i}$, the input features of node $v \in V$
  • $\tilde{G}$, the graph obtained by adding a self-loop to every $v \in V$
  • $h_v^{(l)} \in \mathbb{R}^{d_h}$, the hidden feature of node $v$ learned by the $l$-th layer of the model
  • $N(v)=\{u \in V \mid (v, u) \in E\}$, the neighborhood of $v$
  • $\widetilde{N}(v)=\{v\} \cup \{u \in V \mid (v, u) \in E\}$, the analogous neighborhood in $\tilde{G}$, i.e. including $v$ itself
An aggregation scheme can be represented by
\[h_{v}^{(l)}=\sigma\left(W_{l} \cdot \operatorname{AGGREGATE}\left(\left\{h_{u}^{(l-1)}, \forall u \in \widetilde{N}(v)\right\}\right)\right)\]
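
As a minimal NumPy sketch (my own, not code from the paper), one such layer can be implemented node by node, here with AGGREGATE chosen as a plain sum over $\widetilde{N}(v)$:

```python
import numpy as np

def aggregation_layer(H, adj_list, W, sigma=lambda x: np.maximum(x, 0.0)):
    """One generic neighborhood-aggregation layer.

    H        : (n, d_in) array, rows are h_u^{(l-1)}
    adj_list : adj_list[v] is the list of neighbors N(v)
    W        : (d_in, d_out) weight matrix W_l
    sigma    : elementwise nonlinearity (ReLU here)
    """
    n = H.shape[0]
    H_next = np.zeros((n, W.shape[1]))
    for v in range(n):
        nbhd = [v] + list(adj_list[v])     # \tilde{N}(v) = {v} ∪ N(v)
        aggregated = H[nbhd].sum(axis=0)   # AGGREGATE chosen as a sum (one possible choice)
        H_next[v] = sigma(aggregated @ W)  # h_v^{(l)} = σ(W_l · AGGREGATE(...))
    return H_next
```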

Graph Convolutional Networks (GCN)

The aggregation scheme for GCN can be written as
\[h_{v}^{(l)}=\operatorname{ReLU}\left(W_{l} \cdot \sum_{u \in \tilde{N}(v)}(\operatorname{deg}(v) \operatorname{deg}(u))^{-1 / 2} h_{u}^{(l-1)}\right)\]
With a different normalization it can also be written as
\[h_{v}^{(l)}=\operatorname{ReLU}\left(W_{l} \cdot \frac{1}{\operatorname{deg}(v)} \sum_{u \in \tilde{N}(v)} h_{u}^{(l-1)}\right)\]
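
For illustration, here is a small NumPy sketch (my own, not the authors' code) of the symmetric-normalization variant written in matrix form:

```python
import numpy as np

def gcn_layer(H, A, W):
    """One GCN layer with symmetric normalization (deg(v) deg(u))^{-1/2}.

    H : (n, d_in) node features from the previous layer
    A : (n, n) adjacency matrix of the simple graph G
    W : (d_in, d_out) layer weights W_l
    """
    A_hat = A + np.eye(A.shape[0])            # add self-loops (the graph G~)
    deg = A_hat.sum(axis=1)                   # degrees in G~
    D_inv_sqrt = np.diag(deg ** -0.5)
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # entries (deg(v) deg(u))^{-1/2}
    return np.maximum(A_norm @ H @ W, 0.0)    # ReLU of the normalized aggregation
```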

Connection between aggregation schemes and random walk

The paper defines the influence score of node $x$ by node $y$ as the sum of the absolute values of the entries of the Jacobian of $x$'s final representation with respect to $y$'s input features,
\[I(x, y)=\sum_{i, j}\left|\left[\partial h_{x}^{(k)} / \partial x_{y}\right]_{i j}\right|,\]
and the influence distribution as the normalized scores $I_{x}(y)=I(x, y) / \sum_{z \in V} I(x, z)$. With these definitions, it can be proved that a $k$-layer GCN is equivalent, in expectation, to a $k$-step random walk in terms of the influence distribution.

Figure 2 demonstrates this equivalence, where node colors represent different probabilities of being visited.
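
To make the random-walk side of the equivalence concrete, here is a short sketch (my own, in NumPy) that computes the $k$-step random-walk distribution on $\tilde{G}$ from a given starting node; the colors in Figure 2 correspond to these visiting probabilities:

```python
import numpy as np

def random_walk_distribution(A, start, k):
    """Distribution of a k-step random walk on the graph with self-loops.

    A     : (n, n) adjacency matrix of G
    start : index of the starting node
    k     : number of steps (plays the role of the number of GCN layers)
    """
    A_hat = A + np.eye(A.shape[0])                # self-loops, as in G~
    P = A_hat / A_hat.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
    dist = np.zeros(A.shape[0])
    dist[start] = 1.0                             # the walk starts at `start`
    for _ in range(k):
        dist = dist @ P                           # one step of the walk
    return dist
```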

Jumping Knowledge Networks

The Jumping Knowledge (JK) network is shown in Figure 4. As the name suggests, the final aggregation layer receives the representations from all previous aggregation layers, so it can leverage information from every neighborhood range (1-hop, 2-hop, ...).
Quote from paper: "as in common neighborhood aggregation networks, each layer increases the size of the influence distribution by aggregating neighborhoods from the previous layer. At the last layer, for each node, we carefully select from all of those intermediate representations (which "jump" to the last layer), potentially combining a few. If this is done independently for each node, then the model can adapt the effective neighborhood size for each node as needed, resulting in exactly the desired adaptivity."
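
The paper proposes concatenation, element-wise max-pooling, and an LSTM-attention variant as layer aggregators. A minimal NumPy sketch (my own, not the authors' code) of the first two, given the per-layer representations for all nodes:

```python
import numpy as np

def jk_concat(layer_outputs):
    """Concatenate [h_v^{(1)}, ..., h_v^{(L)}] for every node.

    layer_outputs : list of L arrays, each of shape (n, d_h)
    returns       : array of shape (n, L * d_h)
    """
    return np.concatenate(layer_outputs, axis=1)

def jk_max_pool(layer_outputs):
    """Element-wise max over layers: each node selects, per feature,
    the layer (i.e. neighborhood range) that is most informative.

    layer_outputs : list of L arrays, each of shape (n, d_h)
    returns       : array of shape (n, d_h)
    """
    return np.max(np.stack(layer_outputs, axis=0), axis=0)
```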

Degradation addressed

Because the last layer integrates information from all previous layers, degradation is less likely: the model can always learn to rely only on the appropriate layers. (The idea of these skip connections is similar to ResNet.)

Graph structure issue addressed

For graphs with different structures, Figure 5 shows that the network adapts its effective neighborhood size to the structure accordingly.

[1] Kipf, Thomas N., and Max Welling. "Semi-supervised classification with graph convolutional networks." arXiv preprint arXiv:1609.02907 (2016).
