
NN-lecture-12-GNN-2023-final.pdf


Full Transcript


Neural Networks – Lecture 12: Graph Neural Networks

Based on:
- slides from Stanford CS224W Fall 2021
- the EEML 2020 talk on GNNs by Petar Veličković
- slides from AIMS 2022 Lectures 5 and 6 by Petar Veličković

Outline
● Graph Neural Networks
  – Intro and basics
  – General framework for building and training GNNs
● Foundational GNN models
  – Graph Convolutional Networks (GCN)
  – Graph Attention Networks (GAT)
  – Message Passing Neural Networks (MPNN)
● Applications of GNNs

Intro
● Structured data is ever present: knowledge graphs, protein graphs, recommender systems (user preference/consumption), social network graphs.
● Graph learning is a recent and hot topic in machine learning research
  – one of the fastest growing areas at ICLR (International Conference on Learning Representations) in recent years
  – it has broken into the real world, e.g.
    ● drug discovery (including for COVID-19)
    ● trip / wait time prediction on Google Maps
    ● fake news detection on social media (e.g. tweets)

Intro: Challenge
● Structured data is ever present
  – How can we apply deep learning techniques to graph-based information representations?
  – How can we obtain embeddings of the information contained in graphs, so that we can use it in further deep learning pipelines?

Intro: General Challenge
● Deep learning for graph data – intuition (image source: Stanford CS224W)
  – Map nodes to d-dimensional embeddings such that similar nodes in the graph are embedded close together.
  – Challenge: how do we learn the mapping function f?

Intro: Analogies
● Graphs can be seen as a generalization of images
  – any image can be seen as a "grid graph": each node corresponds to a pixel, adjacent to its four neighbors
● CNNs leverage the convolution operator to extract spatial regularity information from images.
● Research in GNNs seeks to generalize such an operation to work on arbitrary graphs.

Intro: GNN – CNN similarity
● (Figures: a CNN convolutional layer vs. a graph convolutional layer.)
● Desirable properties for a graph convolutional layer:
  – computational and storage efficiency (~O(|V| + |E|))
  – a fixed number of parameters (independent of input size)
  – localisation (it acts on a local neighborhood of a node)
  – the ability to specify different importances for different neighbors
  – applicability to inductive problems (i.e. it generalizes to graphs with similar, but not identical, structure)

Intro: Notation and math setup
● Graph: G = (V, E)
● Node features: X = {x_1, x_2, …, x_N}, x_i ∈ ℝ^F
● Adjacency matrix: A ∈ ℝ^(N×N)
● Neighborhoods: N_i = {j | i = j ∨ A_ij ≠ 0}
● Edge features: e_ij ∈ ℝ^(F′), (i, j) ∈ E
● Latent node features: H = {h_1, h_2, …, h_N}, h_i ∈ ℝ^(F_h)

General Framework for building and training GNNs (switch to CS224W slides)
Node, subgraph and graph encoders; tasks to solve with GNNs

● Goal: learn an encoder into a d-dimensional embedding space such that similarity(u, v) ≈ z_uᵀ z_v; both the encoder and the similarity function need to be defined.
● Encoder: maps each node v of the input graph to a low-dimensional vector, ENC(v) = z_v.
● Similarity function: specifies how relationships in the embedding space map to relationships in the original network; similarity(u, v) in the original network ≈ z_uᵀ z_v, where the dot product between node embeddings acts as the decoder.
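As an illustration (not from the slides), here is a minimal NumPy sketch of the encoder/decoder view above. It assumes the simplest possible "shallow" encoder, a plain embedding-matrix lookup, and a dot-product decoder as the similarity score; the names `enc` and `similarity` are hypothetical.

```python
# A minimal sketch of the encoder/decoder view of node embeddings.
# Assumption (not from the lecture): a "shallow" encoder that is just an
# embedding-matrix lookup, with a dot-product decoder as similarity.
import numpy as np

num_nodes, dim = 5, 8
rng = np.random.default_rng(0)

Z = rng.normal(size=(num_nodes, dim))   # one d-dimensional embedding z_v per node

def enc(v):
    """ENC(v) = z_v : look up the embedding of node v."""
    return Z[v]

def similarity(u, v):
    """Decoder: similarity(u, v) ~ z_u^T z_v (dot product of embeddings)."""
    return float(enc(u) @ enc(v))

print(similarity(0, 1))
```

A GNN replaces the lookup table with a learned, graph-structured encoder, which is what the rest of the lecture builds up to.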
● Deep graph encoders
  – Today: we discuss deep learning methods based on graph neural networks (GNNs): ENC(v) = multiple layers of non-linear transformations based on the graph structure.
  – Note: all of these deep encoders can be combined with the node similarity functions defined in Lecture 3.
  – Output: node embeddings; we can also embed subgraphs and entire graphs.
● Tasks we will be able to solve:
  – Node classification: predict the type of a given node
  – Link prediction: predict whether two nodes are linked
  – Community detection: identify densely linked clusters of nodes
  – Network similarity: how similar are two (sub)networks?
● The modern deep learning toolbox is designed for simple sequences (text/speech) and grids (images), but networks are far more complex:
  – arbitrary size and complex topological structure (i.e., no spatial locality as in grids)
  – no fixed node ordering or reference point
  – often dynamic, with multimodal features.

General Framework for building and training GNNs
Node embedding properties: permutation invariance and permutation equivariance

Where do we begin?
● We will first look at graphs without connections (sets):
  – much simpler to analyse
  – most conclusions will naturally carry over to graphs
  – still very relevant! (point clouds / LiDAR)

Learning on sets: setup
● For now, assume our graph has no edges (i.e. the domain Ω = V, the set of nodes).
● Let x_i ∈ ℝ^k be the features of node i (i.e. our feature space is C = ℝ^k).
● We can stack these features into a node feature matrix of shape |V| × k:
  X = (x_1, …, x_n)ᵀ
  that is, the i-th row of X corresponds to x_i.
● Note that, by doing so, we have specified a node ordering! We would like the result of any neural network not to depend on this.
● What do we want? Symmetry group 𝔊: the n-element permutation group Σ_n; group elements 𝔤 ∈ 𝔊: permutations.

Permutations and permutation matrices
● It is useful to think about operators that change the node order. Such operations are known as permutations (there are n! of them).
  – e.g. the permutation (2, 4, 1, 3) maps x_1 ← x_2, x_2 ← x_4, x_3 ← x_1, x_4 ← x_3.
● Within linear algebra, each permutation defines a |V| × |V| matrix:
  – such matrices are called permutation matrices (the group action ρ(𝔤)!)
  – they have exactly one 1 in every row and column, and zeroes elsewhere
  – their effect when left-multiplied is to permute the rows of X, like so:
    P_(2,4,1,3) X = [0 1 0 0; 0 0 0 1; 1 0 0 0; 0 0 1 0] · [— x_1 —; — x_2 —; — x_3 —; — x_4 —] = [— x_2 —; — x_4 —; — x_1 —; — x_3 —]
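A small NumPy sketch (my own illustration, not from the slides) of this permutation-matrix action: left-multiplying the node feature matrix X by P reorders its rows, exactly as in the example above.

```python
# Sketch: a permutation matrix acting on a node feature matrix X.
# Left-multiplying by P reorders the rows of X.
import numpy as np

X = np.arange(12).reshape(4, 3)          # 4 nodes, 3 features each; row i is x_i

perm = [1, 3, 0, 2]                      # the permutation (2, 4, 1, 3) in 0-based indexing
P = np.eye(4)[perm]                      # permutation matrix: exactly one 1 per row/column

print(P @ X)                             # rows reordered to x_2, x_4, x_1, x_3
assert np.array_equal(P @ X, X[perm])    # same effect as fancy-indexing the rows
```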
Permutation invariance
● Want: functions f(X) over sets that do not depend on the order. Equivalently: applying a permutation matrix should not modify the result.
● We arrive at a very useful notion of permutation invariance: f(X) is permutation invariant if, for all permutation matrices P,
  f(PX) = f(X)
● Compare with the GDL blueprint – 𝔊-invariant layer (global pooling):
  A : X(Ω, C) → Y, satisfying A(𝔤.x) = A(x) for all 𝔤 ∈ 𝔊 and x ∈ X(Ω, C).

Deep Sets
● A very generic form is the Deep Sets model (Zaheer et al., NeurIPS'17):
  f(X) = φ( ⨁_{i∈V} ψ(x_i) )
  where ψ and φ are (learnable) functions, e.g. MLPs.
● The sum aggregation is critical! (other choices are possible, e.g. max or avg)
● We will use ⨁ to denote any permutation-invariant operator.

Permutation equivariance
● Permutation-invariant models are good for set-level outputs. What if we would like answers at the node level? We still want to be able to identify node outputs, which a permutation-invariant aggregator would destroy!
● We may instead seek functions that don't change the node order, i.e. if we permute the nodes, it doesn't matter whether we do it before or after applying the function.
● Accordingly, we say that F(X) is permutation equivariant if, for all permutation matrices P:
  F(PX) = P F(X)
● Compare with the GDL blueprint – linear 𝔊-equivariant layer:
  B : X(Ω, C) → X(Ω′, C′), satisfying B(𝔤.x) = 𝔤.B(x) for all 𝔤 ∈ 𝔊 and x ∈ X(Ω, C).

Important constraint: locality
● Recall: we want the signal to be stable under slight deformations of the domain.
● It is highly beneficial to compose local operations to model larger-scale ones, as local ops won't globally propagate errors (e.g. CNNs with 3 × 3 kernels, but very deep).
● Accordingly, we would like to support locality in our layers (cf. Fourier transform vs. wavelets). What does this mean for sets?

General blueprint for learning on sets
● One way we can enforce locality in equivariant set functions is through a shared function ψ applied to every node in isolation: h_i = ψ(x_i). Stacking the h_i into a matrix yields H = F(X).
● Deep Sets as a blueprint: (stacked) equivariant function(s), potentially with an invariant tail f(X) = φ( ⨁_{i∈V} ψ(x_i) ), yields (m)any useful set neural nets!
● (Remark: this is typically as far as we can get with sets, without assuming or inferring additional structure.)
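To make the Deep Sets blueprint concrete, here is a minimal NumPy sketch (an assumption of this note, not code from the slides): ψ and φ are stand-ins for small MLPs (single random linear maps with a ReLU), ⨁ is a sum, and the output is therefore invariant to row permutations of X.

```python
# Minimal Deep Sets sketch: f(X) = phi( sum_i psi(x_i) ).
# psi and phi stand in for small MLPs (here: one random linear map + ReLU each).
import numpy as np

rng = np.random.default_rng(0)
k, hidden, out = 3, 16, 4
W_psi = rng.normal(size=(k, hidden))
W_phi = rng.normal(size=(hidden, out))

def psi(x):                      # per-node transformation
    return np.maximum(W_psi.T @ x, 0.0)

def phi(z):                      # applied to the aggregated representation
    return W_phi.T @ z

def deep_sets(X):
    return phi(sum(psi(x) for x in X))   # sum = permutation-invariant aggregator

X = rng.normal(size=(5, k))              # 5 nodes with k features each
perm = rng.permutation(5)
# Permuting the rows of X does not change the output (up to float rounding):
assert np.allclose(deep_sets(X), deep_sets(X[perm]))
```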
Learning on graphs
● Now we can augment the set of nodes with edges between them: we consider graphs G = (V, E), where E ⊆ V × V.
● We can represent these edges in an adjacency matrix A, such that a_ij = 1 if (i, j) ∈ E and a_ij = 0 otherwise.
● Note that the edges are now part of the domain! Further additions, e.g. edge features, are possible but ignored for now.
● Our main desiderata (permutation invariance and equivariance) still hold. What's changed?

Permutation invariance and equivariance on graphs
● Main difference: permutations now also act on the edges accordingly. We need to permute both the rows and the columns of A; when applying a permutation matrix P, this amounts to P A Pᵀ.
● We arrive at updated definitions of suitable functions over graphs:
  – Invariance:    f(PX, P A Pᵀ) = f(X, A)
  – Equivariance:  F(PX, P A Pᵀ) = P F(X, A)

Locality on graphs: neighbourhoods
● On sets, we enforced locality by transforming every node in isolation. Graphs give us a broader context: a node's neighbourhood.
● For a node i, its (1-hop) neighbourhood N_i is commonly defined as:
  N_i = { j : (i, j) ∈ E ∨ (j, i) ∈ E }
● Accordingly, we can extract the neighbourhood features X_{N_i} = {{ x_j : j ∈ N_i }} and define a local function φ(x_i, X_{N_i}) operating over them.

Recipe for graph neural networks
● We can now construct permutation-equivariant functions F(X, A) by applying the local function φ over all neighbourhoods:
  F(X, A) = [ φ(x_1, X_{N_1}); φ(x_2, X_{N_2}); … ; φ(x_n, X_{N_n}) ]   (one row per node)
● To ensure equivariance, it is sufficient that φ does not depend on the order of the nodes in X_{N_i} (i.e. that it is permutation invariant). Exercise: prove this! (A numerical check follows below.)

What's in a GNN layer?
● We build permutation-equivariant functions F(X, A) on graphs by shared application of a local, permutation-invariant φ(x_i, X_{N_i}).
● Common lingo: F is a "GNN layer"; φ is "diffusion" / "propagation" / "message passing".
● But how do we implement φ? This is a very intense area of research! Fortunately, almost all proposals can be classified into three "flavours":
  – Convolutional:     h_i = φ( x_i, ⨁_{j∈N_i} c_ij ψ(x_j) )
  – Attentional:       h_i = φ( x_i, ⨁_{j∈N_i} a(x_i, x_j) ψ(x_j) )
  – Message-passing:   h_i = φ( x_i, ⨁_{j∈N_i} ψ(x_i, x_j) )
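A small NumPy sketch (my own illustration) of the recipe above: the local function sums neighbour features, i.e. φ(x_i, X_{N_i}) = x_i + Σ_{j∈N_i} x_j, and we check numerically that the resulting F(X, A) is permutation equivariant, F(PX, P A Pᵀ) = P F(X, A).

```python
# Sketch of the GNN recipe: F(X, A) applies a local, permutation-invariant
# phi(x_i, X_{N_i}) at every node. Here phi(x_i, X_{N_i}) = x_i + sum of neighbours.
import numpy as np

def gnn_layer(X, A):
    # Row i of the output: x_i plus the (unweighted) sum of its neighbours' features.
    return X + A @ X

rng = np.random.default_rng(0)
n, k = 6, 4
X = rng.normal(size=(n, k))
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.maximum(A, A.T)            # make the graph undirected
np.fill_diagonal(A, 0.0)

P = np.eye(n)[rng.permutation(n)]  # a random permutation matrix

# Permutation equivariance: F(PX, P A P^T) == P F(X, A)
lhs = gnn_layer(P @ X, P @ A @ P.T)
rhs = P @ gnn_layer(X, A)
assert np.allclose(lhs, rhs)
```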
Convolutional GNN
● Features of neighbours are aggregated with fixed weights c_ij:
  h_i = φ( x_i, ⨁_{j∈N_i} c_ij ψ(x_j) )
● Usually, the weights depend directly on A.
● Examples: ChebyNet (Defferrard et al., NeurIPS'16), GCN (Kipf & Welling, ICLR'17), SGC (Wu et al., ICML'19).
● Useful for homophilous graphs (when edges encode label similarity); highly scalable; most industrial GNN applications currently live here.

Attentional GNN
● Features of neighbours are aggregated with implicit weights (attention):
  h_i = φ( x_i, ⨁_{j∈N_i} a(x_i, x_j) ψ(x_j) )
● Attention weights are computed as α_ij = a(x_i, x_j).
● Examples: MoNet (Monti et al., CVPR'17), GAT (Veličković et al., ICLR'18), GATv2 (Brody et al., ICLR'22).
● Useful as a "middle ground" w.r.t. capacity, scale and interpretability; edges need not encode homophily; but still computes only a scalar per edge.

Message-passing GNN
● Computes arbitrary vectors (messages) to be sent across edges:
  h_i = φ( x_i, ⨁_{j∈N_i} ψ(x_i, x_j) )
● Messages are computed as m_ij = ψ(x_i, x_j).
● Examples: Interaction Nets (Battaglia et al., NeurIPS'16), MPNN (Gilmer et al., ICML'17), GraphNets (Battaglia et al., 2018).
● The most generic GNN layer: edges give a "recipe" for passing data; may have scalability or learnability issues; ideal for computational chemistry, reasoning and simulation tasks.

General Framework for building and training GNNs
Model design – overview of principles
● (1) Define a neighborhood aggregation function.
● (2) Define a loss function on the embeddings.
● (3) Train on a set of nodes, i.e., a batch of compute graphs.
● (4) Generate embeddings for nodes as needed – even for nodes we never trained on!
● The same aggregation parameters are shared for all nodes:
  – the number of model parameters is sublinear in |V|, and we can generalize to unseen nodes.
● Inductive node embedding: train on one graph, generalize to a new graph.
  – We can generalize to entirely unseen graphs, e.g. train on a protein interaction graph from model organism A and generate embeddings on newly collected data about organism B.
● Many application settings constantly encounter previously unseen nodes (e.g., Reddit, YouTube, Google Scholar): train on a snapshot, then generate embeddings for new nodes "on the fly" as they arrive.

Foundational GNN models: GCN, GAT, MPNN
● Putting things together:
  – (1) Message: each node computes a message
    m_u^(l) = MSG^(l)( h_u^(l−1) ),  u ∈ N(v) ∪ {v}
  – (2) Aggregation: aggregate the messages from the neighbors
    h_v^(l) = AGG^(l)( { m_u^(l) : u ∈ N(v) }, m_v^(l) )
  – Nonlinearity (activation) adds expressiveness; it is often written as σ(·) – ReLU(·), Sigmoid(·), … – and can be added to the message or to the aggregation.

Foundational GNN models: Graph Convolutional Networks (GCN) [Kipf & Welling, ICLR 2017]
T. Kipf, M. Welling. Semi-Supervised Classification with Graph Convolutional Networks, ICLR 2017
● (1) Graph Convolutional Networks (GCN):
  h_v^(l) = σ( Σ_{u∈N(v)} W^(l) h_u^(l−1) / |N(v)| )
● How to write this as Message + Aggregation?
  – Message: each neighbor computes m_u^(l) = (1 / |N(v)|) W^(l) h_u^(l−1), i.e. normalized by the node degree (the GCN paper uses a slightly different normalization).
  – Aggregation: sum over the messages from the neighbors, then apply the activation:
    h_v^(l) = σ( Sum( { m_u^(l) : u ∈ N(v) } ) )
● In GCN the graph is assumed to have self-edges, which are included in the summation.
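A minimal NumPy sketch of this GCN update in the message + aggregation form of the slide, with self-edges added as noted; σ is taken to be ReLU here, which is an assumption for illustration (the actual GCN paper uses the symmetric normalization D̃^(−1/2) Ã D̃^(−1/2) rather than the plain degree normalization shown).

```python
# GCN layer sketch, in the "message + aggregation" form of the slide:
#   message:     m_u = (1 / |N(v)|) * W h_u          (for u in N(v), incl. self-edge)
#   aggregation: h_v = sigma( sum_u m_u ),  with sigma = ReLU here
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gcn_layer(H, A, W):
    A_tilde = A + np.eye(A.shape[0])          # add self-edges, as assumed by GCN
    deg = A_tilde.sum(axis=1, keepdims=True)  # |N(v)| including the node itself
    messages = (A_tilde / deg) @ (H @ W)      # degree-normalized sum of W h_u
    return relu(messages)                     # nonlinearity

rng = np.random.default_rng(0)
n, f_in, f_out = 5, 8, 16
H = rng.normal(size=(n, f_in))
A = (rng.random((n, n)) < 0.5).astype(float)
A = np.maximum(A, A.T); np.fill_diagonal(A, 0.0)
W = rng.normal(size=(f_in, f_out)) * 0.1

H_next = gcn_layer(H, A, W)                   # shape (n, f_out)
print(H_next.shape)
```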
Foundational GNN models: Graph Attention Networks (GAT)
● (3) Graph Attention Networks:
  h_v^(l) = σ( Σ_{u∈N(v)} α_vu W^(l) h_u^(l−1) )
  where α_vu are attention weights.
● In GCN / GraphSAGE:
  – α_vu = 1 / |N(v)| is the weighting factor (importance) of node u's message to node v
  – ⟹ α_vu is defined explicitly, based on the structural properties of the graph (node degree)
  – ⟹ all neighbors u ∈ N(v) are equally important to node v.
● Not all of a node's neighbors are equally important:
  – attention is inspired by cognitive attention: the attention weight α_vu focuses on the important parts of the input data and fades out the rest
  – idea: the network should devote more computing power to that small but important part of the data
  – which part of the data is more important depends on the context and is learned through training.

[Veličković et al., ICLR 2018; Vaswani et al., NIPS 2017]
● Can we do better than simple neighborhood aggregation? Can we let the weighting factors α_vu be learned?
● Goal: specify arbitrary importance to the different neighbors of each node in the graph.
● Idea: compute the embedding h_v^(l) of each node following an attention strategy:
  – nodes attend over their neighborhoods' messages
  – implicitly specifying different weights to different nodes in a neighborhood.
● Let α_vu be computed as a byproduct of an attention mechanism a:
  – (1) Let a compute attention coefficients e_vu across pairs of nodes u, v, based on their messages:
    e_vu = a( W^(l) h_u^(l−1), W^(l) h_v^(l−1) )
    e_vu indicates the importance of u's message to node v, e.g. e_AB = a( W^(l) h_A^(l−1), W^(l) h_B^(l−1) ).
  – (2) Normalize e_vu into the final attention weight α_vu using the softmax function, so that Σ_{u∈N(v)} α_vu = 1:
    α_vu = exp(e_vu) / Σ_{k∈N(v)} exp(e_vk)
  – (3) Compute the weighted sum based on the final attention weights α_vu:
    h_v^(l) = σ( Σ_{u∈N(v)} α_vu W^(l) h_u^(l−1) )
    e.g. a weighted sum over N(A) = {B, C, D} using α_AB, α_AC, α_AD:
    h_A^(l) = σ( α_AB W^(l) h_B^(l−1) + α_AC W^(l) h_C^(l−1) + α_AD W^(l) h_D^(l−1) )
● What is the form of the attention mechanism a?
  – The approach is agnostic to the choice of a; e.g., use a simple single-layer neural network, where a has trainable parameters (the weights in the Linear layer):
    e_AB = a( W^(l) h_A^(l−1), W^(l) h_B^(l−1) ) = Linear( Concat( W^(l) h_A^(l−1), W^(l) h_B^(l−1) ) )
  – The parameters of a are trained jointly with the weight matrices (i.e., the other parameters of the neural net, W^(l)) in an end-to-end fashion.
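A minimal NumPy sketch of single-head GAT-style attention as described above: e_vu = Linear(Concat(W h_v, W h_u)), a softmax over each neighborhood, then the weighted sum. The loop-based implementation and σ = ReLU are simplifications of mine; the published GAT additionally applies a LeakyReLU to e_vu before the softmax, which this sketch omits to keep only the form shown on the slide.

```python
# Single-head GAT-style attention sketch (slide form):
#   e_vu   = a(W h_v, W h_u) = Linear(Concat(W h_v, W h_u))
#   alpha  = softmax of e_vu over u in N(v)
#   h'_v   = sigma( sum_u alpha_vu * W h_u ),  sigma = ReLU here
import numpy as np

def gat_layer(H, A, W, a):
    n = H.shape[0]
    Z = H @ W                                     # W h_u for every node
    H_out = np.zeros_like(Z)
    for v in range(n):
        neigh = np.flatnonzero(A[v])              # N(v)
        # attention logits e_vu = a . concat(W h_v, W h_u)
        e = np.array([a @ np.concatenate([Z[v], Z[u]]) for u in neigh])
        alpha = np.exp(e - e.max()); alpha /= alpha.sum()   # softmax over N(v)
        H_out[v] = np.maximum((alpha[:, None] * Z[neigh]).sum(axis=0), 0.0)
    return H_out

rng = np.random.default_rng(0)
n, f_in, f_out = 5, 8, 16
H = rng.normal(size=(n, f_in))
A = np.array([[0, 1, 1, 0, 0],                    # small undirected graph
              [1, 0, 1, 1, 0],
              [1, 1, 0, 0, 1],
              [0, 1, 0, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)
W = rng.normal(size=(f_in, f_out)) * 0.1
a = rng.normal(size=(2 * f_out,)) * 0.1           # parameters of the attention "Linear"

print(gat_layer(H, A, W, a).shape)                # (n, f_out)
```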
● Multi-head attention: stabilizes the learning process of the attention mechanism.
  – Create multiple attention scores (each replica with a different set of parameters):
    h_v^(l)[k] = σ( Σ_{u∈N(v)} α_vu^k W_k^(l) h_u^(l−1) ),  k = 1, 2, 3
  – The outputs are then aggregated, by concatenation or summation:
    h_v^(l) = AGG( h_v^(l)[1], h_v^(l)[2], h_v^(l)[3] )
● Key benefit of GAT: allows (implicitly) specifying different importance values α_vu for different neighbors.
● Computationally efficient:
  – the computation of the attention coefficients can be parallelized across all edges of the graph
  – the aggregation may be parallelized across all nodes.
● Storage efficient:
  – sparse matrix operations do not require more than O(V + E) entries to be stored
  – fixed number of parameters, irrespective of graph size.
● Localized: only attends over local network neighborhoods.
● Inductive capability: it is a shared edge-wise mechanism that does not depend on the global graph structure.

Foundational GNN models: Message Passing Neural Networks (MPNN) [Gilmer et al., ICML 2017]
● The GCN model supports edge features only indirectly (by averaging features from a node's neighbors).
● MPNN attempts to correct this by focusing on edge-wise mechanisms:
  – in the general case, nodes send messages (vectors) along graph edges
  – ⟹ these messages can be conditioned on the edge features
  – a node aggregates all messages sent to it using a permutation-invariant function.
● m_ij is the message sent along edge i → j, computed using a message function f_e:
  m_ij = f_e( h_i, h_j, e_ij )
● Aggregate all messages entering a node and determine the updated node features using a readout function f_v:
  h'_i = f_v( h_i, Σ_{j∈N_i} m_ji )
● The above is the general form of MPNN; f_e and f_v are usually (small) MLPs.
● MPNN advantage: the most potent (in terms of representational power) GNN layer.
● Issues of MPNN:
  – requires more memory to store the edge messages
  – requires more learnable parameters
  – in practice it can be applied only to small graphs (up to the order of hundreds of nodes).
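A minimal NumPy sketch of the general MPNN form above: a message function f_e(h_i, h_j, e_ij) computed per edge, a permutation-invariant sum over incoming messages, and an update/readout function f_v. Here f_e and f_v are stand-ins for the small MLPs the slide mentions (single random linear maps with a ReLU), which is an assumption made for brevity.

```python
# General MPNN sketch:
#   m_ij  = f_e(h_i, h_j, e_ij)                  (message along edge i -> j)
#   h'_i  = f_v(h_i, sum_{j in N_i} m_ji)        (aggregate incoming messages, update)
import numpy as np

rng = np.random.default_rng(0)
f_h, f_edge, f_msg = 8, 4, 16

W_msg = rng.normal(size=(2 * f_h + f_edge, f_msg)) * 0.1    # parameters of f_e
W_upd = rng.normal(size=(f_h + f_msg, f_h)) * 0.1           # parameters of f_v

def f_e(h_i, h_j, e_ij):
    return np.maximum(np.concatenate([h_i, h_j, e_ij]) @ W_msg, 0.0)

def f_v(h_i, agg):
    return np.maximum(np.concatenate([h_i, agg]) @ W_upd, 0.0)

def mpnn_layer(H, edges, E_feat):
    n = H.shape[0]
    agg = np.zeros((n, f_msg))
    for (i, j), e_ij in zip(edges, E_feat):      # messages flow along i -> j
        agg[j] += f_e(H[i], H[j], e_ij)          # permutation-invariant sum at node j
    return np.stack([f_v(H[i], agg[i]) for i in range(n)])

H = rng.normal(size=(5, f_h))
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (3, 4), (4, 3)]    # directed edge list
E_feat = rng.normal(size=(len(edges), f_edge))              # one feature vector per edge
print(mpnn_layer(H, edges, E_feat).shape)        # (5, 8)
```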
Foundational GNN models: GNN layers in practice
J. You, R. Ying, J. Leskovec. Design Space of Graph Neural Networks, NeurIPS 2020
● In practice, these classic GNN layers are a great starting point.
  – We can often get better performance by considering a general GNN layer design.
  – Concretely, we can include modern deep learning modules that have proved useful in many domains:
    ● Batch normalization: stabilize neural network training
    ● Dropout: prevent overfitting
    ● Attention / gating: control the importance of a message
    ● More: any other useful deep learning modules.

How to connect GNN layers into a GNN? (layer connectivity)
● Stack layers sequentially, or add skip connections.
● The standard way: stack GNN layers sequentially.
  – Input: the initial raw node features, h_v^(0) = x_v.
  – Output: the node embeddings h_v^(L) after L GNN layers.
● The issue with stacking many GNN layers: GNNs suffer from the over-smoothing problem.
  – Over-smoothing: all the node embeddings converge to the same value.
  – This is bad, because we want to use node embeddings to differentiate nodes.
● Why does over-smoothing happen?
  – Receptive field: the set of nodes that determine the embedding of a node of interest. In a K-layer GNN, each node has a receptive field equal to its K-hop neighborhood.
  – The receptive-field overlap of two nodes grows quickly as we increase the number of hops (i.e., the number of GNN layers); in the slides' example: 1-hop neighbor overlap – only 1 node; 2-hop overlap – about 20 nodes; 3-hop overlap – almost all the nodes!
  – The embedding of a node is determined by its receptive field; if two nodes have highly overlapping receptive fields, their embeddings are highly similar.
  – Stack many GNN layers → nodes have highly overlapping receptive fields → node embeddings become highly similar → we suffer from over-smoothing.
● Next: how do we overcome the over-smoothing problem? How can we make a shallow GNN more expressive?
● Solution 1: increase the expressive power within each GNN layer.
  – In the previous examples, each transformation or aggregation function included only one linear layer.
  – We can make the aggregation / transformation a deep neural network: if needed, each block (transformation, aggregation) could include e.g. a 3-layer MLP.
● Solution 2: add layers that do not pass messages (see the sketch after this list).
  – A GNN does not necessarily contain only GNN layers; e.g., we can add MLP layers (applied to each node) before and after the GNN layers, as pre-processing and post-processing layers.
  – Pre-processing layers: important when encoding node features is necessary, e.g. when nodes represent images or text.
  – Post-processing layers: important when reasoning / transformation over the node embeddings is needed, e.g. graph classification, knowledge graphs.
  – In practice, adding these layers works great!
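A small sketch (my own composition, not from the slides) of Solution 2: a couple of GCN-style message-passing layers sandwiched between a per-node pre-processing MLP and a per-node post-processing MLP, so the message-passing depth – and hence the receptive field – stays small.

```python
# "Shallow GNN, deeper network": per-node pre-processing MLP -> a few
# message-passing layers -> per-node post-processing MLP.
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

def mlp(X, weights):                      # applied to each node independently
    for W in weights:
        X = relu(X @ W)
    return X

def gnn_layer(H, A, W):                   # simple mean-over-neighbourhood layer
    A_tilde = A + np.eye(A.shape[0])      # include self-edges
    deg = A_tilde.sum(axis=1, keepdims=True)
    return relu((A_tilde / deg) @ (H @ W))

def model(X, A, pre_w, gnn_w, post_w):
    H = mlp(X, pre_w)                     # pre-processing (e.g. encode raw features)
    for W in gnn_w:                       # only 2 message-passing layers -> 2-hop receptive field
        H = gnn_layer(H, A, W)
    return mlp(H, post_w)                 # post-processing (e.g. task head)

n, f_raw, d = 6, 10, 16
X = rng.normal(size=(n, f_raw))
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.maximum(A, A.T); np.fill_diagonal(A, 0.0)

pre_w  = [rng.normal(size=(f_raw, d)) * 0.1, rng.normal(size=(d, d)) * 0.1]
gnn_w  = [rng.normal(size=(d, d)) * 0.1 for _ in range(2)]
post_w = [rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, 3)) * 0.1]

print(model(X, A, pre_w, gnn_w, post_w).shape)   # (n, 3) per-node outputs
```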
What if my problem still requires many GNN layers?
He et al. Deep Residual Learning for Image Recognition, CVPR 2015
● Lesson 2: add skip connections in GNNs.
  – Observation from over-smoothing: node embeddings in earlier GNN layers can sometimes better differentiate nodes.
  – Solution: increase the impact of the earlier layers on the final node embeddings by adding shortcuts in the GNN.
  – Idea of skip connections: duplicate the input into two branches and sum the two branches; before adding shortcuts: F(x); after adding shortcuts: F(x) + x.
● Other options (Xu et al. Representation Learning on Graphs with Jumping Knowledge Networks, ICML 2018): directly skip to the last layer.
  – The final layer directly aggregates all the node embeddings from the previous layers: input h_v^(0), intermediate h_v^(1), h_v^(2), h_v^(3), output h_v^(final).
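A small sketch (my illustration) of the two skip-connection options above: a residual shortcut per GNN layer, F(x) + x, and a jumping-knowledge-style final step that aggregates (here: concatenates) the embeddings from all layers.

```python
# Skip connections in a GNN stack:
#   (a) residual shortcut per layer:  H <- gnn_layer(H, A, W) + H
#   (b) "jumping knowledge" style:    final embedding aggregates all layers' outputs
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

def gnn_layer(H, A, W):
    A_tilde = A + np.eye(A.shape[0])
    deg = A_tilde.sum(axis=1, keepdims=True)
    return relu((A_tilde / deg) @ (H @ W))

n, d, L = 6, 16, 4
H = rng.normal(size=(n, d))
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.maximum(A, A.T); np.fill_diagonal(A, 0.0)
Ws = [rng.normal(size=(d, d)) * 0.1 for _ in range(L)]

layer_outputs = [H]
for W in Ws:
    H = gnn_layer(H, A, W) + H                     # (a) residual shortcut: F(x) + x
    layer_outputs.append(H)

H_jk = np.concatenate(layer_outputs, axis=1)       # (b) skip to the last layer: use all layers
print(H.shape, H_jk.shape)                         # (6, 16) (6, 16 * (L + 1))
```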
Applications of GNNs
(Extracts from the EEML 2020 GNN lecture by Petar Veličković.)
● Computational biochemistry and medicine
● Traffic simulations and vehicle perception
● Recommendation systems and social networks

References
[Kipf & Welling, ICLR 2017] Kipf, Thomas N., and Max Welling. "Semi-supervised classification with graph convolutional networks." ICLR 2017 (arXiv:1609.02907).
[Hamilton et al., NeurIPS 2017] Hamilton, William L., Rex Ying, and Jure Leskovec. "Inductive representation learning on large graphs." NeurIPS 2017, pp. 1025–1035.
[Gilmer et al., ICML 2017] Gilmer, Justin, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. "Neural message passing for quantum chemistry." ICML 2017, pp. 1263–1272. PMLR.
[Veličković et al., ICLR 2018] Veličković, Petar, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. "Graph attention networks." ICLR 2018 (arXiv:1710.10903).
[Stokes et al., Cell 2020] Stokes, Jonathan M., Kevin Yang, Kyle Swanson, Wengong Jin, Andres Cubillos-Ruiz, Nina M. Donghia, Craig R. MacNair, et al. "A deep learning approach to antibiotic discovery." Cell 180, no. 4 (2020): 688–702.
