Stanford CS224W: Machine Learning with Graphs PDF
Document Details
Stanford University
Jure Leskovec
Summary
These notes discuss graph representation learning, a way to learn node and graph embeddings for downstream tasks without feature engineering. Topics include the encoder-decoder framework, node similarity measures based on random walks (as in DeepWalk and node2vec), and approaches for embedding entire graphs for tasks such as graph classification. The lecture notes are from Stanford University's CS224W course.
Full Transcript
CS224W: Machine Learning with Graphs. Jure Leskovec, Stanford University. http://cs224w.stanford.edu

Traditional ML pipeline: given an input graph, extract node-, link-, and graph-level features, then learn a model (SVM, neural network, etc.) that maps features to labels. The pipeline is: input graph → structured features → learning algorithm → prediction, where feature engineering (node-level, edge-level, graph-level features) feeds the downstream prediction task.

Graph representation learning alleviates the need to do feature engineering every single time: instead of hand-engineering features, representation learning automatically learns the features used by the downstream prediction task.

Goal: efficient, task-independent feature learning for machine learning with graphs. We learn a function f: u → ℝ^d that maps each node u to a d-dimensional vector — its feature representation, or embedding.

Task: map nodes into an embedding space.
- Similarity of embeddings between nodes indicates their similarity in the network. For example: both nodes are close to each other (connected by an edge).
- Encode network information.
- Potentially used for many downstream prediction tasks: node classification, link prediction, graph classification, anomalous node detection, clustering, ...

Example: a 2D embedding of the nodes of Zachary's Karate Club network (image from Perozzi et al., DeepWalk: Online Learning of Social Representations, KDD 2014).

Setup. Assume we have a graph G:
- V is the vertex set.
- A is the adjacency matrix (assume binary).
- For simplicity, no node features or extra information is used.
Example: V = {1, 2, 3, 4} with edges 1-2, 1-4, 2-4, 3-4, i.e.,

    A = 0 1 0 1
        1 0 0 1
        0 0 0 1
        1 1 1 0

The goal is to encode nodes so that similarity in the embedding space (e.g., the dot product) approximates similarity in the graph: similarity(u, v) ≈ z_v^T z_u, where the left-hand side is similarity in the original network (which we need to define) and the right-hand side is the similarity of the embeddings.

Encoder-decoder framework:
1. The encoder maps from nodes to embeddings.
2. Define a node similarity function (i.e., a measure of similarity in the original network).
3. The decoder DEC maps from embeddings to a similarity score.
4. Optimize the parameters of the encoder so that similarity(u, v) in the original network ≈ DEC(z_v, z_u), the similarity of the embeddings (here the dot product z_v^T z_u).
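To make the shallow encoder-decoder setup concrete, here is a minimal NumPy sketch (not from the lecture) built on the 4-node example graph above: the encoder is a learnable matrix Z with one column per node, and the decoder is the dot product between two embeddings. The embedding dimension and the random initial values are illustrative placeholders.

```python
import numpy as np

# 4-node example graph from the slides: edges 1-2, 1-4, 2-4, 3-4
A = np.array([[0, 1, 0, 1],
              [1, 0, 0, 1],
              [0, 0, 0, 1],
              [1, 1, 1, 0]])

d = 2                      # embedding dimension (hypothetical choice)
num_nodes = A.shape[0]

# Shallow encoder: an embedding matrix Z with one d-dimensional column per node.
# These values would normally be learned; here they are random placeholders.
rng = np.random.default_rng(0)
Z = rng.normal(size=(d, num_nodes))

def encode(v):
    """ENC(v) = Z . one_hot(v): look up the column of node v."""
    return Z[:, v]

def decode(z_u, z_v):
    """DEC: dot-product similarity between two embeddings."""
    return float(z_u @ z_v)

# Training (next slides) adjusts Z so that decode(encode(u), encode(v))
# approximates similarity(u, v) in the graph.
print(decode(encode(0), encode(3)))  # embedding similarity of nodes 1 and 4
```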
Encoder: maps each node to a low-dimensional vector: ENC(v) = z_v, the d-dimensional embedding of node v in the input graph.

Similarity function: specifies how the relationships in vector space map to the relationships in the original network: similarity(u, v) ≈ z_v^T z_u, where the left-hand side is the similarity of u and v in the original network and the right-hand side is the decoder — the dot product between node embeddings.

Simplest ("shallow") encoding approach: the encoder is just an embedding lookup,
    ENC(v) = z_v = Z · v,
where Z ∈ ℝ^{d×|V|} is a matrix with one column per node — each column is a node embedding (this is what we learn/optimize), d is the dimension/size of the embeddings — and v is an indicator vector, all zeroes except a one in the column indicating node v. Each node is assigned a unique embedding vector (i.e., we directly optimize the embedding of each node). Many methods use this approach: DeepWalk, node2vec.

Encoder + decoder framework:
- Shallow encoder: embedding lookup.
- Parameters to optimize: Z, which contains the node embeddings z_u for all nodes u ∈ V.
- We will cover deep encoders (GNNs) in Lecture 6.
- Decoder: based on node similarity.
- Objective: maximize z_v^T z_u for node pairs (u, v) that are similar.

The key choice of methods is how they define node similarity. Should two nodes have a similar embedding if they...
- are linked?
- share neighbors?
- have similar "structural roles"?
We will now learn a node similarity definition that uses random walks, and how to optimize embeddings for such a similarity measure.

This is an unsupervised/self-supervised way of learning node embeddings:
- We are not utilizing node labels.
- We are not utilizing node features.
- The goal is to directly estimate a set of coordinates (i.e., the embedding) of a node so that some aspect of the network structure (captured by DEC) is preserved.
These embeddings are task independent: they are not trained for a specific task but can be used for any task.

Random-walk approaches to node embeddings. Notation:
- Vector z_u: the embedding of node u (what we aim to find).
- Probability P(v | z_u): our model prediction based on z_u — the (predicted) probability of visiting node v on random walks starting from node u.
Non-linear functions used to produce predicted probabilities:
- Softmax function: turns a vector of K real values (model predictions) z into K probabilities that sum to 1: σ(z)[i] = e^{z[i]} / Σ_{j=1}^{K} e^{z[j]}.
- Sigmoid function: an S-shaped function that turns real values into the range (0, 1), written as S(x) = 1 / (1 + e^{−x}).
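A minimal NumPy sketch of the two functions just defined (not part of the slides). The max-subtraction in the softmax is a standard numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(z):
    """sigma(z)[i] = exp(z[i]) / sum_j exp(z[j]); shift by max(z) for stability."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(x):
    """S(x) = 1 / (1 + exp(-x)): maps real values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))   # three probabilities summing to 1
print(sigmoid(0.5))      # a single value in (0, 1)
```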
Random walks. Given a graph and a starting point, we select a neighbor of it at random and move to this neighbor; then we select a neighbor of this point at random, and move to it, etc. The (random) sequence of points visited this way is a random walk on the graph. [Figure: a 5-step random walk on a small example graph.]

Random-walk embeddings: z_u^T z_v ≈ the probability that u and v co-occur on a random walk over the graph.
1. Estimate the probability of visiting node v on a random walk starting from node u, using some random walk strategy R.
2. Optimize embeddings to encode these random walk statistics: similarity in embedding space (here: the dot product, i.e., cos(θ) for normalized embeddings) encodes random walk "similarity".

Why random walks?
1. Expressivity: a flexible stochastic definition of node similarity that incorporates both local and higher-order neighborhood information. Idea: if a random walk starting from node u visits v with high probability, then u and v are similar (higher-order, multi-hop information).
2. Efficiency: we do not need to consider all node pairs when training; we only need to consider pairs that co-occur on random walks.

Feature learning as optimization. Intuition: find an embedding of nodes in d-dimensional space that preserves similarity. Idea: learn node embeddings such that nodes that are nearby in the network are close together in the embedding space. Given a node u, how do we define nearby nodes? N_R(u) ... the neighbourhood of u obtained by some random walk strategy R.

Given G = (V, E), our goal is to learn a mapping f: u → ℝ^d, f(u) = z_u. Log-likelihood objective:
    max_f Σ_{u∈V} log P(N_R(u) | z_u),
where N_R(u) is the neighborhood of node u obtained by strategy R. Given node u, we want to learn feature representations that are predictive of the nodes in its random walk neighborhood N_R(u).

Random-walk optimization:
1. Run short fixed-length random walks starting from each node u in the graph, using some random walk strategy R.
2. For each node u, collect N_R(u), the multiset* of nodes visited on random walks starting from u.
3. Optimize embeddings according to: given node u, predict its neighbors N_R(u) — the maximum likelihood objective max_f Σ_{u∈V} log P(N_R(u) | z_u).
*N_R(u) can have repeat elements, since nodes can be visited multiple times on random walks.
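A minimal Python sketch of steps 1-2 above, i.e., unbiased fixed-length random walks and the multisets N_R(u) (DeepWalk-style). The adjacency-list graph, the number of walks, and the walk length are illustrative choices, not values from the lecture.

```python
import random
from collections import defaultdict

def random_walk(adj, start, length):
    """One unbiased fixed-length random walk: repeatedly jump to a uniform random neighbor."""
    walk = [start]
    for _ in range(length):
        walk.append(random.choice(adj[walk[-1]]))
    return walk

def collect_neighborhoods(adj, num_walks=10, length=5):
    """N_R(u): multiset of nodes visited on walks starting from u (repeats kept)."""
    N_R = defaultdict(list)
    for u in adj:
        for _ in range(num_walks):
            N_R[u].extend(random_walk(adj, u, length)[1:])  # exclude u itself
    return N_R

# Toy adjacency lists for the 4-node example graph used earlier (0-indexed).
adj = {0: [1, 3], 1: [0, 3], 2: [3], 3: [0, 1, 2]}
print(collect_neighborhoods(adj)[0][:10])
```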
Equivalently,
    ℒ = Σ_{u∈V} Σ_{v∈N_R(u)} −log P(v | z_u).
Intuition: optimize embeddings z_u to maximize the likelihood of random walk co-occurrences. Parameterize P(v | z_u) using the softmax:
    P(v | z_u) = exp(z_u^T z_v) / Σ_{n∈V} exp(z_u^T z_n).
Why the softmax? We want node v to be the most similar to node u (out of all nodes n). Intuition: Σ_i exp(x_i) ≈ max_i exp(x_i).

Putting it all together:
    ℒ = Σ_{u∈V} Σ_{v∈N_R(u)} −log( exp(z_u^T z_v) / Σ_{n∈V} exp(z_u^T z_n) ),
where the outer sum runs over all nodes u, the inner sum runs over the nodes v seen on random walks starting from u, and the term inside the log is the predicted probability of u and v co-occurring on a random walk. Optimizing random walk embeddings = finding embeddings z_u that minimize ℒ.

But doing this naively is too expensive: the nested sum over all nodes gives O(|V|²) complexity. The normalization term from the softmax is the culprit... can we approximate it?

Solution: negative sampling. Negative sampling is a form of Noise Contrastive Estimation (NCE), which approximately maximizes the log probability of the softmax. Why is the approximation valid? Technically this is a different objective, but the new formulation corresponds to using logistic regression (the sigmoid function) to distinguish the target node v from nodes n_i sampled from a background distribution P_V. More at https://arxiv.org/pdf/1402.3722.pdf.
    log( exp(z_u^T z_v) / Σ_{n∈V} exp(z_u^T z_n) ) ≈ log σ(z_u^T z_v) − Σ_{i=1}^{k} log σ(z_u^T z_{n_i}),  n_i ~ P_V,
where σ is the sigmoid function (it makes each term a "probability" between 0 and 1) and P_V is a random distribution over nodes. Instead of normalizing with respect to all nodes, we just normalize against k random "negative samples" n_i. Negative sampling allows for quick likelihood calculation.

The negative sampling distribution P_V: sample k negative nodes, each with probability proportional to its degree. Two considerations for k (the number of negative samples):
1. Higher k gives more robust estimates.
2. Higher k corresponds to higher bias on negative events.
In practice, k = 5-20. Can a negative sample be any node, or only nodes not on the walk? People often use any node (for efficiency); however, the most "correct" way is to use nodes not on the walk.

After we obtain the objective function, how do we optimize (minimize) it?
    ℒ = Σ_{u∈V} Σ_{v∈N_R(u)} −log P(v | z_u)
Gradient descent, a simple way to minimize ℒ:
- Initialize z_u at some random value for all nodes u.
- Iterate until convergence: for all u, compute the derivative ∂ℒ/∂z_u; then, for all u, make a step in the reverse direction of the derivative: z_u ← z_u − η ∂ℒ/∂z_u, where η is the learning rate.

Stochastic gradient descent: instead of evaluating gradients over all examples, evaluate them for each individual training example.
- Initialize z_u at some random value for all nodes u.
- Iterate until convergence: sample a node u with loss ℒ^(u) = Σ_{v∈N_R(u)} −log P(v | z_u); for all v, calculate the derivative ∂ℒ^(u)/∂z_v and update z_v ← z_v − η ∂ℒ^(u)/∂z_v.

Summary of random-walk embeddings:
1. Run short fixed-length random walks starting from each node on the graph.
2. For each node u, collect N_R(u), the multiset of nodes visited on random walks starting from u.
3. Optimize embeddings using stochastic gradient descent: ℒ = Σ_{u∈V} Σ_{v∈N_R(u)} −log P(v | z_u). We can efficiently approximate this using negative sampling!
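A minimal NumPy sketch of one stochastic-gradient step with negative sampling (not from the lecture). It uses the standard word2vec-style form of the loss, −log σ(z_u·z_v) − Σ_i log σ(−z_u·z_{n_i}), rather than the exact expression written on the slide; k and degree-proportional negative sampling follow the slide's description, while the learning rate, toy graph, and initialization are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_step(Z, u, v, degrees, k=5, lr=0.025):
    """One SGD step for a co-occurring pair (u, v) with k negative samples.

    Word2vec-style loss: -log sigma(z_u . z_v) - sum_i log sigma(-z_u . z_{n_i}),
    with negatives drawn proportionally to node degree. Rows of Z are embeddings.
    """
    probs = degrees / degrees.sum()
    negatives = rng.choice(len(degrees), size=k, p=probs)

    # Positive pair: pull z_u and z_v together.
    g = 1.0 - sigmoid(Z[u] @ Z[v])   # gradient scale for the positive term
    grad_u = -g * Z[v]
    Z[v] += lr * g * Z[u]

    # Negative samples: push z_u away from each z_{n_i}.
    for n in negatives:
        gn = sigmoid(Z[u] @ Z[n])
        grad_u += gn * Z[n]
        Z[n] -= lr * gn * Z[u]

    Z[u] -= lr * grad_u

# Toy usage: 4 nodes, 2-dimensional embeddings, degrees of the example graph.
Z = rng.normal(scale=0.1, size=(4, 2))
degrees = np.array([2.0, 2.0, 1.0, 3.0])
neg_sampling_step(Z, u=0, v=3, degrees=degrees)
```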
How should we randomly walk? So far we have described how to optimize embeddings given a random walk strategy R. What strategies should we use to run these random walks?
- Simplest idea: just run fixed-length, unbiased random walks starting from each node (i.e., DeepWalk from Perozzi et al., 2014).
- The issue is that such a notion of similarity is too constrained. How can we generalize it?
Reference: Perozzi et al. 2014. DeepWalk: Online Learning of Social Representations. KDD.

Overview of node2vec. Goal: embed nodes with similar network neighborhoods close in the feature space. We frame this goal as a maximum likelihood optimization problem, independent of the downstream prediction task. Key observation: a flexible notion of the network neighborhood N_R(u) of node u leads to rich node embeddings. node2vec develops a biased 2nd-order random walk R to generate the network neighborhood N_R(u) of node u.
Reference: Grover and Leskovec. 2016. node2vec: Scalable Feature Learning for Networks. KDD.

Idea: use flexible, biased random walks that can trade off between local and global views of the network (Grover and Leskovec, 2016). [Figure 1: BFS and DFS search strategies from node u (k = 3).]

Two classic strategies to define a neighborhood N_R(u) of a given node u, for a walk of length 3 (N_R(u) of size 3):
- N_BFS(u) = {s1, s2, s3}: local, microscopic view of the neighbourhood.
- N_DFS(u) = {s4, s5, s6}: global, macroscopic view of the neighbourhood.

Interpolating BFS and DFS: a biased fixed-length random walk R that, given a node u, generates a neighborhood N_R(u). Two parameters:
- Return parameter p: governs returning back to the previous node.
- In-out parameter q: governs moving outwards (DFS) vs. inwards (BFS); intuitively, q is the "ratio" of BFS vs. DFS.

Biased 2nd-order random walks explore network neighborhoods. Suppose the random walk just traversed edge (s1, w) and is now at w. Insight: the neighbors of w can only be at the same distance from s1, farther from s1, or back at s1. Idea: remember where the walk came from.

The walker came over edge (s1, w) and is at w. Where does it go next? 1/p, 1/q, and 1 are unnormalized transition probabilities, segmented by the distance of the target t from s1:

    Target t | Unnormalized prob. | dist(s1, t)
    s1       | 1/p                | 0
    s2       | 1                  | 1
    s3       | 1/q                | 2
    s4       | 1/q                | 2

Here p is the return parameter and q is the "walk away" parameter; p and q model the transition probabilities. A low value of p gives a BFS-like walk; a low value of q gives a DFS-like walk. N_R(u) are the nodes visited by the biased walk.
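A minimal Python sketch of the biased second-order walk just described (not the reference node2vec implementation): given the previous node and the current node, each candidate neighbor receives unnormalized weight 1/p (go back), 1 (stay at distance 1 from the previous node), or 1/q (move farther away). The toy graph and the p, q values are illustrative.

```python
import random

def node2vec_step(adj, prev, cur, p, q):
    """Choose the next node after traversing edge (prev, cur), node2vec-style."""
    neighbors = adj[cur]
    weights = []
    for x in neighbors:
        if x == prev:                 # return to the previous node
            weights.append(1.0 / p)
        elif x in adj[prev]:          # same distance from prev (distance 1)
            weights.append(1.0)
        else:                         # farther from prev (distance 2)
            weights.append(1.0 / q)
    return random.choices(neighbors, weights=weights, k=1)[0]

def node2vec_walk(adj, start, length, p, q):
    """Biased fixed-length walk; the first step is a uniform random neighbor."""
    walk = [start, random.choice(adj[start])]
    while len(walk) < length + 1:
        walk.append(node2vec_step(adj, walk[-2], walk[-1], p, q))
    return walk

adj = {0: [1, 3], 1: [0, 3], 2: [3], 3: [0, 1, 2]}
print(node2vec_walk(adj, start=0, length=5, p=1.0, q=0.5))  # low q -> DFS-like
```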
The node2vec algorithm:
1) Compute the random walk transition probabilities.
2) Simulate r random walks of length l starting from each node u.
3) Optimize the node2vec objective using stochastic gradient descent.
This has linear-time complexity, and all 3 steps are individually parallelizable.

Other random walk ideas:
- Different kinds of biased random walks: based on node attributes (Dong et al., 2017); based on learned weights (Abu-El-Haija et al., 2017).
- Alternative optimization schemes: directly optimize based on 1-hop and 2-hop random walk probabilities (as in LINE from Tang et al., 2015).
- Network preprocessing techniques: run random walks on modified versions of the original network (e.g., struc2vec from Ribeiro et al., 2017; HARP from Chen et al., 2016).

Summary so far. Core idea: embed nodes so that distances in embedding space reflect node similarities in the original network. Different notions of node similarity:
- Naive: two nodes are similar if they are connected.
- Neighborhood overlap (covered in Lecture 2).
- Random walk approaches (covered today).
So what method should I use? No one method wins in all cases; e.g., node2vec performs better on node classification, while alternative methods perform better on link prediction (Goyal and Ferrara, 2017 survey). Random walk approaches are generally more efficient. In general, you must choose a definition of node similarity that matches your application.

Embedding entire graphs. Goal: we want to embed a subgraph or an entire graph G; the graph embedding is z_G. Tasks: classifying toxic vs. non-toxic molecules, identifying anomalous graphs.

Simple (but effective) approach 1: run a standard graph embedding technique on the (sub)graph G, then just sum (or average) the node embeddings in the (sub)graph: z_G = Σ_{v∈G} z_v. Used by Duvenaud et al., 2016 to classify molecules based on their graph structure.
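A minimal NumPy sketch of approach 1; the node embeddings here are placeholder values, whereas in practice they would come from a method such as DeepWalk or node2vec.

```python
import numpy as np

def graph_embedding(node_embeddings, mode="sum"):
    """z_G = sum (or average) of the node embeddings z_v of the (sub)graph."""
    Z = np.asarray(node_embeddings)
    return Z.sum(axis=0) if mode == "sum" else Z.mean(axis=0)

# Toy: 4 nodes with 2-dimensional embeddings (placeholder values, not learned).
Z_nodes = np.array([[0.1, 0.3],
                    [0.2, 0.1],
                    [0.4, 0.4],
                    [0.0, 0.2]])
print(graph_embedding(Z_nodes, mode="mean"))
```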
Approach 2: introduce a "virtual node" to represent the (sub)graph and run a standard graph embedding technique. Proposed by Li et al., 2016 as a general technique for subgraph embedding.

Approach 3: anonymous walk embeddings. States in anonymous walks correspond to the index of the first time we visited the node in a random walk (Anonymous Walk Embeddings, ICML 2018, https://arxiv.org/pdf/1805.11921.pdf).

Anonymous walks are agnostic to the identity of the nodes visited (hence "anonymous"). Example, for random walk w1:
- Step 1: node A → state 1
- Step 2: node B → state 2 (different from node 1)
- Step 3: node C → state 3 (different from nodes 1 and 2)
- Step 4: node B → state 2 (same as the node in step 2)
- Step 5: node C → state 3 (same as the node in step 3)
Note: a different random walk w2 over other nodes gives the same anonymous walk 1-2-3-2-3.

The number of anonymous walks grows exponentially with length. For example, there are 5 anonymous walks w_i of length 3: w1 = 111, w2 = 112, w3 = 121, w4 = 122, w5 = 123.

Simple use of anonymous walks: simulate anonymous walks w_i of l steps and record their counts; represent the graph as a probability distribution over these walks. For example, set l = 3. Then we can represent the graph as a 5-dimensional vector, since there are 5 anonymous walks w_i of length 3 (111, 112, 121, 122, 123), and z_G[i] = probability of anonymous walk w_i in graph G.

Sampling anonymous walks: generate a set of m random walks independently and represent the graph as a probability distribution over their anonymous walks. How many random walks m do we need? We want the distribution to have error of more than ε with probability less than δ:
    m = ⌈ (2/ε²) (log(2^η − 2) − log δ) ⌉,
where η is the total number of anonymous walks of length l. For example, there are η = 877 anonymous walks of length l = 7; if we set ε = 0.1 and δ = 0.01, then we need to generate m = 122,500 random walks.

Learning walk embeddings: rather than simply representing each walk by the fraction of times it occurs, we learn an embedding z_i of each anonymous walk w_i. We learn a graph embedding z_G together with all the anonymous walk embeddings z_i, Z = {z_i : i = 1 ... η}, where η is the number of sampled anonymous walks. How to embed walks? Idea: embed walks so that the next walk can be predicted.

Output: a vector z_G for the input graph G — the embedding of the entire graph, to be learned. Starting from a node (say node 1), sample anonymous random walks w1, w2, w3, w4, ... and learn to predict walks that co-occur in a Δ-size window (e.g., predict w3 given w1, w2 if Δ = 2). Objective: maximize Σ_{t=Δ+1}^{T} log P(w_t | w_{t−Δ}, ..., w_{t−1}, z_G), where T is the total number of walks.

Concretely: run T different random walks from u, each of length l: N_R(u) = {w1^u, w2^u, ..., wT^u}. Learn to predict walks that co-occur in a Δ-size window, estimating an embedding z_i for each anonymous walk w_i. Let η be the number of all possible walk embeddings.
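A minimal Python sketch of the two ingredients above: turning a random walk into its anonymous walk, and estimating the distribution over anonymous walks from m sampled walks. The sample-size bound quoted on the slide (from the Anonymous Walk Embeddings paper) is reproduced in a comment; the toy graph and the choice of m in the usage example are illustrative.

```python
import math
import random
from collections import Counter

def anonymize(walk):
    """Map each node to the index of its first visit, e.g. A,B,C,B,C -> (1,2,3,2,3)."""
    first_seen = {}
    out = []
    for node in walk:
        if node not in first_seen:
            first_seen[node] = len(first_seen) + 1
        out.append(first_seen[node])
    return tuple(out)

def anon_walk_distribution(adj, l, m):
    """Sample m random walks of l nodes each (slide convention: 'length' counts states)
    and return the empirical distribution over their anonymous forms."""
    counts = Counter()
    nodes = list(adj)
    for _ in range(m):
        walk = [random.choice(nodes)]
        while len(walk) < l:
            walk.append(random.choice(adj[walk[-1]]))
        counts[anonymize(walk)] += 1
    return {w: c / m for w, c in counts.items()}

# Sample-size bound from the Anonymous Walk Embeddings paper:
# m = ceil((2/eps^2) * (log(2**eta - 2) - log(delta)))
eta, eps, delta = 877, 0.1, 0.01   # eta = number of anonymous walks of length 7
m = math.ceil((2 / eps**2) * (math.log(2**eta - 2) - math.log(delta)))
print(m)                           # 122500, the number quoted on the slide

adj = {0: [1, 3], 1: [0, 3], 2: [3], 3: [0, 1, 2]}
print(anon_walk_distribution(adj, l=3, m=1000))
```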
The objective is parameterized as:
    P(w_t | {w_{t−Δ}, ..., w_{t−1}}, z_G) = exp(y(w_t)) / Σ_{i=1}^{η} exp(y(w_i)),
where the normalization runs over all possible anonymous walks (this requires negative sampling), and
    y(w_t) = b + U · cat( (1/Δ) Σ_{i=1}^{Δ} z_i , z_G ).
Here cat( (1/Δ) Σ_{i=1}^{Δ} z_i , z_G ) means the average of the anonymous walk embeddings z_i in the window, concatenated with the graph embedding z_G, and b ∈ ℝ, U ∈ ℝ^D are learnable parameters; this represents a linear layer. (Anonymous Walk Embeddings, ICML 2018, https://arxiv.org/pdf/1805.11921.pdf)

We obtain the graph embedding z_G (a learnable parameter) after the optimization. Is z_G simply the sum over the walk embeddings z_i, or is z_G a residual embedding alongside the z_i? According to the paper, z_G is a separately optimized vector parameter, just like the other z_i. Use z_G to make predictions (e.g., graph classification):
- Option 1: an inner product kernel z_{G1}^T z_{G2} (Lecture 2).
- Option 2: use a neural network that takes z_G as input to classify G.

We discussed 3 ideas for graph embeddings:
- Approach 1: embed nodes and sum/average them.
- Approach 2: create a super-node that spans the (sub)graph and then embed that node.
- Approach 3: anonymous walk embeddings.
  - Idea 1: sample the anonymous walks and represent the graph as the fraction of times each anonymous walk occurs.
  - Idea 2: learn the graph embedding together with the anonymous walk embeddings.
We will discuss more advanced ways to obtain graph embeddings in Lecture 8: we can hierarchically cluster the nodes of a graph and sum/average the node embeddings according to these clusters.

How to use embeddings z_i of nodes:
- Clustering/community detection: cluster the points z_i.
- Node classification: predict the label of node i based on z_i.
- Link prediction: predict edge (i, j) based on (z_i, z_j), where we can concatenate, average, take the product of, or take the difference between the embeddings:
  - Concatenate: f(z_i, z_j) = g([z_i, z_j])
  - Hadamard: f(z_i, z_j) = g(z_i ∗ z_j) (per-coordinate product)
  - Sum/Avg: f(z_i, z_j) = g(z_i + z_j)
  - Distance: f(z_i, z_j) = g(‖z_i − z_j‖₂)
- Graph classification: compute the graph embedding z_G by aggregating node embeddings or via anonymous random walks, then predict the label based on z_G.

Today's summary: we discussed graph representation learning, a way to learn node and graph embeddings for downstream tasks without feature engineering.
- Encoder-decoder framework. Encoder: embedding lookup. Decoder: predict a score based on the embeddings to match node similarity.
- Node similarity measure: (biased) random walks; examples: DeepWalk, node2vec.
- Extension to graph embeddings: node embedding aggregation and anonymous walk embeddings.
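To make the link-prediction operators listed above concrete, here is a minimal NumPy sketch (not from the lecture); the downstream classifier g is left abstract, and logistic regression is only a typical, hypothetical choice for it.

```python
import numpy as np

def edge_features(z_i, z_j, op="hadamard"):
    """Combine two node embeddings into one edge feature vector for link prediction."""
    if op == "concat":
        return np.concatenate([z_i, z_j])
    if op == "hadamard":
        return z_i * z_j                                  # per-coordinate product
    if op == "avg":
        return (z_i + z_j) / 2.0
    if op == "distance":
        return np.array([np.linalg.norm(z_i - z_j)])      # L2 distance as a 1-d feature
    raise ValueError(op)

z_i = np.array([0.1, 0.4, -0.2])
z_j = np.array([0.3, 0.0, 0.5])
for op in ["concat", "hadamard", "avg", "distance"]:
    print(op, edge_features(z_i, z_j, op))
# Each feature vector would then be fed to a classifier g (e.g., logistic regression)
# that predicts whether edge (i, j) exists.
```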