
Study notes: Representation Learning on Graphs with Jumping Knowledge Networks

Source: Xu, Keyulu, et al. "Representation learning on graphs with jumping knowledge networks." arXiv preprint arXiv:1806.03536 (2018).

Problems to be addressed

There are two major problems to be addressed: degradation as the network goes deeper, and bias arising from the graph structure.

Degradation as network goes deeper

Although recent developments in graph neural networks have achieved state-of-the-art results on several tasks over graph-structured data, there is concern that most of them face the same challenge CNNs once did: performance degrades as the network grows deeper.

Many of the recent approaches broadly follow a neighborhood aggregation (or "message passing") scheme, which has been shown to generalize the Weisfeiler-Lehman graph isomorphism test.

Yet, such aggregation schemes sometimes lead to surprises. For example, it has been observed that the best performance with one of the state-of-the-art models, Graph Convolutional Networks (GCN), is achieved with a 2-layer model. Deeper versions of the model that, in principle, have access to more information, perform worse [1].

In CNNs this problem is mitigated by residual connections, but simply adding residual connections to GCNs does not resolve it.

Graph structure imbalance

In real applications such as citation networks and social networks, the majority of nodes have few connections, whereas some nodes (hubs) are connected to many others. Social and web networks usually consist of an expander-like core part and an almost-tree (bounded-treewidth) part, which represent the well-connected entities and the small communities respectively.

The subgraph structure has a great impact on the result of neighborhood aggregation. For example, a 4-step random walk (its equivalence to neighborhood aggregation is shown later) spreads over completely different neighborhoods depending on the starting point (core or tree). See Figure 1.

Aggregation schemes

Given the following notations:

  • $G=(V, E)$, a simple graph
  • $X_v \in \mathbb{R}^{d_i}$, the input features of node $v \in V$
  • $\tilde{G}$, the graph obtained by adding a self-loop to every $v \in V$
  • $h_v^{(l)} \in \mathbb{R}^{d_h}$, the hidden feature of node $v$ learned by the $l$-th layer of the model
  • $N(v)=\{u \in V \mid (v, u) \in E\}$, the neighborhood of $v$
  • $\widetilde{N}(v)=\{v\} \cup \{u \in V \mid (v, u) \in E\}$, the analogous neighborhood in $\tilde{G}$, i.e. including $v$ itself
An aggregation scheme can be represented by
\[h_{v}^{(l)}=\sigma\left(W_{l} \cdot \operatorname{AGGREGATE}\left(\left\{h_{u}^{(l-1)}, \forall u \in \widetilde{N}(v)\right\}\right)\right)\]
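
As a minimal NumPy sketch (my own, not code from the paper), one such layer can be implemented node by node, here with AGGREGATE chosen as a plain sum over $\widetilde{N}(v)$:

```python
import numpy as np

def aggregation_layer(H, adj_list, W, sigma=lambda x: np.maximum(x, 0.0)):
    """One generic neighborhood-aggregation layer.

    H        : (n, d_in) array, rows are h_u^{(l-1)}
    adj_list : adj_list[v] is the list of neighbors N(v)
    W        : (d_in, d_out) weight matrix W_l
    sigma    : elementwise nonlinearity (ReLU here)
    """
    n = H.shape[0]
    H_next = np.zeros((n, W.shape[1]))
    for v in range(n):
        nbhd = [v] + list(adj_list[v])     # \tilde{N}(v) = {v} ∪ N(v)
        aggregated = H[nbhd].sum(axis=0)   # AGGREGATE chosen as a sum (one possible choice)
        H_next[v] = sigma(aggregated @ W)  # h_v^{(l)} = σ(W_l · AGGREGATE(...))
    return H_next
```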

Graph Convolutional Networks (GCN)

The aggregation scheme for GCN can be written as
\[h_{v}^{(l)}=\operatorname{ReLU}\left(W_{l} \cdot \sum_{u \in \tilde{N}(v)}(\operatorname{deg}(v) \operatorname{deg}(u))^{-1 / 2} h_{u}^{(l-1)}\right)\]
With a different normalization it can also be written as
\[h_{v}^{(l)}=\operatorname{ReLU}\left(W_{l} \cdot \frac{1}{\operatorname{deg}(v)} \sum_{u \in \tilde{N}(v)} h_{u}^{(l-1)}\right)\]
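
For illustration, here is a small NumPy sketch (my own, not the authors' code) of the symmetric-normalization variant written in matrix form:

```python
import numpy as np

def gcn_layer(H, A, W):
    """One GCN layer with symmetric normalization (deg(v) deg(u))^{-1/2}.

    H : (n, d_in) node features from the previous layer
    A : (n, n) adjacency matrix of the simple graph G
    W : (d_in, d_out) layer weights W_l
    """
    A_hat = A + np.eye(A.shape[0])            # add self-loops (the graph G~)
    deg = A_hat.sum(axis=1)                   # degrees in G~
    D_inv_sqrt = np.diag(deg ** -0.5)
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # entries (deg(v) deg(u))^{-1/2}
    return np.maximum(A_norm @ H @ W, 0.0)    # ReLU of the normalized aggregation
```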

Connection between aggregation schemes and random walk

The paper defines the influence score of node $x$ by node $y$ as the sum of the absolute values of the entries of the Jacobian of $x$'s final representation with respect to $y$'s input features,
\[I(x, y)=\sum_{i, j}\left|\left[\partial h_{x}^{(k)} / \partial x_{y}\right]_{i j}\right|,\]
and the influence distribution as the normalized scores $I_{x}(y)=I(x, y) / \sum_{z \in V} I(x, z)$. With these definitions, it can be proved that a $k$-layer GCN is equivalent, in expectation, to a $k$-step random walk in terms of the influence distribution.

Figure 2 demonstrates this equivalence, where node colors represent different probabilities of being visited.
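
To make the random-walk side of the equivalence concrete, here is a short sketch (my own, in NumPy) that computes the $k$-step random-walk distribution on $\tilde{G}$ from a given starting node; the colors in Figure 2 correspond to these visiting probabilities:

```python
import numpy as np

def random_walk_distribution(A, start, k):
    """Distribution of a k-step random walk on the graph with self-loops.

    A     : (n, n) adjacency matrix of G
    start : index of the starting node
    k     : number of steps (plays the role of the number of GCN layers)
    """
    A_hat = A + np.eye(A.shape[0])                # self-loops, as in G~
    P = A_hat / A_hat.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
    dist = np.zeros(A.shape[0])
    dist[start] = 1.0                             # the walk starts at `start`
    for _ in range(k):
        dist = dist @ P                           # one step of the walk
    return dist
```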

Jumping Knowledge Networks

The Jumping Knowledge (JK) network is shown in Figure 4. As the name suggests, the final aggregation layer receives the representations from all previous aggregation layers, so it can leverage information from every neighborhood range (1-hop, 2-hop, ...).
Quote from paper: "as in common neighborhood aggregation networks, each layer increases the size of the influence distribution by aggregating neighborhoods from the previous layer. At the last layer, for each node, we carefully select from all of those intermediate representations (which "jump" to the last layer), potentially combining a few. If this is done independently for each node, then the model can adapt the effective neighborhood size for each node as needed, resulting in exactly the desired adaptivity."
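
The paper proposes concatenation, element-wise max-pooling, and an LSTM-attention variant as layer aggregators. A minimal NumPy sketch (my own, not the authors' code) of the first two, given the per-layer representations for all nodes:

```python
import numpy as np

def jk_concat(layer_outputs):
    """Concatenate [h_v^{(1)}, ..., h_v^{(L)}] for every node.

    layer_outputs : list of L arrays, each of shape (n, d_h)
    returns       : array of shape (n, L * d_h)
    """
    return np.concatenate(layer_outputs, axis=1)

def jk_max_pool(layer_outputs):
    """Element-wise max over layers: each node selects, per feature,
    the layer (i.e. neighborhood range) that is most informative.

    layer_outputs : list of L arrays, each of shape (n, d_h)
    returns       : array of shape (n, d_h)
    """
    return np.max(np.stack(layer_outputs, axis=0), axis=0)
```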

Degradation addressed

Because the last layer integrates information from all previous layers, degradation is less likely: the model can always learn to rely only on the appropriate layers. (The idea of these skip connections is similar to ResNet.)

Graph structure issue addressed

For graphs with different structures, Figure 5 shows that the network adapts its effective neighborhood size to the structure accordingly.

[1] Kipf, Thomas N., and Max Welling. "Semi-supervised classification with graph convolutional networks." arXiv preprint arXiv:1609.02907 (2016).
