CS224W Machine Learning with Graphs PDF
Document Details
Stanford University
Jure Leskovec
Summary
This document discusses machine learning with graphs, focusing on the structure of the web as a directed graph. It explores concepts like strongly connected components and directed acyclic graphs (DAGs). The document provides an overview of how the web is linked and the computational issues associated with analyzing large web graph datasets.
Full Transcript
CS224W: Machine Learning with Graphs
Jure Leskovec, Stanford University
http://cs224w.stanford.edu

Today we will talk about what the Web graph looks like:
1) We will take a real system: the Web
2) We will represent it as a directed graph
3) We will use the language of graph theory: Strongly Connected Components
4) We will design a computational experiment: find the In- and Out-components of a given node v
5) We will learn something about the structure of the Web: BOWTIE!

Q: What does the Web "look like" at a global level?
- Web as a graph:
  - Nodes = web pages
  - Edges = hyperlinks
- Side issue: what is a node?
  - Dynamic pages created on the fly
  - "Dark matter": inaccessible, database-generated pages

Example of pages and hyperlinks: "I teach a class on Networks" links to CS224W; "Classes are in the Gates building" links to the Computer Science Department at Stanford, which links to Stanford University.
- In the early days of the Web, links were navigational.
- Today many links are transactional (used not to navigate from page to page, but to post, comment, like, buy, ...).

Other examples of directed graphs: citations, references in an encyclopedia.

How is the Web linked? What is the "map" of the Web?
Web as a directed graph [Broder et al. 2000]:
- Given node v, what nodes can v reach?
- What other nodes can reach v?
For example, in a small directed graph on nodes A-G:
- In(v) = {w | w can reach v}; here In(A) = {A, B, C, E, G}
- Out(v) = {w | v can reach w}; here Out(A) = {A, B, C, D, F}

Two types of directed graphs:
- Strongly connected: any node can reach any node via a directed path. Example: In(A) = Out(A) = {A, B, C, D, E}.
- Directed Acyclic Graph (DAG): has no cycles; if u can reach v, then v cannot reach u.
Any directed graph (the Web) can be expressed in terms of these two types!
- Is the Web a big strongly connected graph or a DAG?
A Strongly Connected Component (SCC) is a set of nodes S such that:
- Every pair of nodes in S can reach each other
- There is no larger set containing S with this property
Example: the strongly connected components of the graph on nodes A-G are {A, B, C, G}, {D}, {E}, {F}.

Fact: Every directed graph is a DAG on its SCCs:
- (1) The SCCs partition the nodes of G; that is, each node is in exactly one SCC.
- (2) If we build a graph G' whose nodes are the SCCs, with an edge between nodes of G' if there is an edge between the corresponding SCCs in G, then G' is a DAG.
In the example: (1) the SCCs of G are {A, B, C, G}, {D}, {E}, {F}; (2) the graph G' on these four SCCs is a DAG.

Broder et al.: AltaVista web crawl (Oct '99)
- The web crawl is based on a large set of starting points accumulated over time from various sources, including voluntary submissions.
- 203 million URLs and 1.5 billion links.
Goal: take a large snapshot of the Web and try to understand how its SCCs "fit together" as a DAG.

Computational issue: how do we find the SCC containing node v?
- Observation: Out(v) is the set of nodes that can be reached from v. The SCC containing v is Out(v) ∩ In(v) = Out(v, G) ∩ Out(v, G'), where G' is G with all edge directions flipped.

Example (graph on nodes A-H):
- Out(A) = {A, B, D, E, F, G, H}
- In(A) = {A, B, C, D, E}
- So SCC(A) = Out(A) ∩ In(A) = {A, B, D, E}

Directed version of the Web graph:
- AltaVista crawl from October 1999: 203 million URLs, 1.5 billion links.
Computation:
- Compute In(v) and Out(v) by starting BFS at random nodes.
- Observation: the BFS either visits many nodes or very few.
[Excerpt from Broder et al. 2000 shown on the slide: from most starting nodes, both the forward and the backward BFS runs would "explode", each covering about 100 million nodes (though not the same 100 million in the two runs); these are the starting points that lie in the SCC.]
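The Out(v) ∩ In(v) recipe above is straightforward to implement with two breadth-first searches, one on G and one on G with edges flipped. Below is a minimal Python sketch; the edge list is hypothetical, chosen only so that it reproduces the slide's sets Out(A), In(A), and SCC(A).

```python
from collections import deque

def bfs_reachable(adj, source):
    """Return the set of nodes reachable from `source` by BFS."""
    seen, queue = {source}, deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def scc_of(edges, v):
    """SCC(v) = Out(v) & In(v): forward BFS on G, backward BFS on flipped G."""
    fwd, bwd = {}, {}
    for a, b in edges:
        fwd.setdefault(a, []).append(b)
        bwd.setdefault(b, []).append(a)
    return bfs_reachable(fwd, v) & bfs_reachable(bwd, v)

# Hypothetical graph consistent with the slide's example sets:
edges = [("A","B"), ("B","D"), ("D","E"), ("E","A"),
         ("C","A"), ("D","F"), ("F","G"), ("G","H")]
print(scc_of(edges, "A"))   # {'A', 'B', 'D', 'E'}
```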
[Fig. 7 of Broder et al. 2000: cumulative distribution of the number of nodes reached when BFS is started from a random node (x-axis: rank; y-axis: number of reached nodes), following (a) in-links, (b) out-links, and (c) both. There are two distinct regions of growth: one at the beginning, and an "explosion" for 50% of the start nodes in the case of in- and out-links, and for 90% of the nodes in the undirected case. These experiments form the basis of the structural analysis.]

Result: based on In and Out of a random node v:
- Out(v) ≈ 100 million (50% of nodes)
- In(v) ≈ 100 million (50% of nodes)
- Largest SCC: 56 million (28% of nodes)
What does this tell us about the conceptual picture of the Web graph? The true structure of the Web graph must be somewhat subtler than a "small world" phenomenon in which a browser can pass from any Web page to any other with a few clicks.

The BOWTIE picture of the Web (203 million pages, 1.5 billion links) [Broder et al. 2000]:
- One can pass from any node of IN through the SCC to any node of OUT.
- Hanging off IN and OUT are TENDRILS, containing nodes that are reachable from portions of IN, or that can reach portions of OUT, without passing through the SCC.
- A TENDRIL hanging off IN can be hooked into a TENDRIL leading into OUT, forming a TUBE: a passage from a portion of IN to a portion of OUT without touching the SCC.

All web pages are not equally "important":
- www.joe-schmoe.com vs. www.stanford.edu
- There is large diversity in web-graph node connectivity.
- So, let's rank the pages using the web graph link structure!
We will cover the following Link Analysis approaches to compute the importance of nodes in a graph:
- PageRank
- Personalized PageRank
- Random Walk with Restarts

Idea: links as votes
- A page is more important if it has more links. In-coming links? Out-going links?
- Think of in-links as votes:
  - www.stanford.edu has 23,400 in-links
  - www.joe-schmoe.com has 1 in-link
- Are all in-links equal?
  - Links from important pages count more
  - Recursive question!

A "vote" from an important page is worth more:
- Each link's vote is proportional to the importance of its source page.
- If page $i$ with importance $r_i$ has $d_i$ out-links, each link gets $r_i / d_i$ votes.
- Page $j$'s own importance $r_j$ is the sum of the votes on its in-links; for example, $r_j = r_i/3 + r_k/4$ when page $i$ has 3 out-links and page $k$ has 4.

A page is important if it is pointed to by other important pages. ("The web in 1839": a toy three-page web with pages y, a, m.)
- Define a "rank" $r_j$ for node $j$:
  $$r_j = \sum_{i \to j} \frac{r_i}{d_i}$$
  where $d_i$ is the out-degree of node $i$.
- "Flow" equations for the example:
  $r_y = r_y/2 + r_a/2$
  $r_a = r_y/2 + r_m$
  $r_m = r_a/2$
- You might wonder: let's just use Gaussian elimination to solve this system of linear equations. Bad idea: it does not scale!

Stochastic adjacency matrix $M$:
- Let page $j$ have $d_j$ out-links. If $j \to i$, then $M_{ij} = 1/d_j$.
- $M$ is a column stochastic matrix: columns sum to 1.
Rank vector $r$: an entry per page
- $r_i$ is the importance score of page $i$, with $\sum_i r_i = 1$.
- The flow equations $r_j = \sum_{i \to j} r_i / d_i$ can be written as $r = M \cdot r$.

For the y/a/m example:

        y    a    m
  y    1/2  1/2   0
  a    1/2   0    1
  m     0   1/2   0

and $r = M \cdot r$ encodes exactly the flow equations above.

The flow equations can be written $r = M \cdot r$, so the rank vector $r$ is an eigenvector of the stochastic web matrix $M$. (Recall: $x$ is an eigenvector with corresponding eigenvalue $\lambda$ if $Ax = \lambda x$.)
- Starting from any vector $u$, the limit $M(M(\ldots M(Mu)))$ is the long-term distribution of the surfers.
- The math: limiting distribution = principal eigenvector of $M$ = PageRank.
- Note: if $r$ is the limit of $MM\ldots Mu$, then $r$ satisfies $r = Mr$, so $r$ is an eigenvector of $M$ with eigenvalue 1.
We can now efficiently solve for $r$! The method is called power iteration.

Given a web graph with N nodes, where the nodes are pages and the edges are hyperlinks.
Power iteration: a simple iterative scheme
- Initialize: $r^{(0)} = [1/N, \ldots, 1/N]^T$
- Iterate: $r^{(t+1)} = M \cdot r^{(t)}$, i.e. $r_j^{(t+1)} = \sum_{i \to j} r_i^{(t)}/d_i$, where $d_i$ is the out-degree of node $i$
- Stop when $|r^{(t+1)} - r^{(t)}|_1 < \varepsilon$
Here $|x|_1 = \sum_{1 \le i \le N} |x_i|$ is the L1 norm (any other vector norm, e.g. Euclidean, also works).
About 50 iterations is sufficient to estimate the limiting solution.
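A minimal NumPy sketch of power iteration on the y/a/m example above; M is the column-stochastic matrix from the slide.

```python
import numpy as np

def power_iteration(M, eps=1e-10, max_iter=100):
    """Iterate r <- M r from the uniform vector until the L1 change is < eps."""
    N = M.shape[0]
    r = np.full(N, 1.0 / N)
    for _ in range(max_iter):
        r_new = M @ r
        if np.abs(r_new - r).sum() < eps:
            return r_new
        r = r_new
    return r

# Column-stochastic M for the y/a/m example (columns sum to 1):
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(power_iteration(M))   # approx. [0.4, 0.4, 0.2] = [6/15, 6/15, 3/15]
```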
Random surfer interpretation. Imagine a random web surfer:
- At any time $t$, the surfer is on some page $i$.
- At time $t+1$, the surfer follows an out-link from $i$ uniformly at random.
- The surfer ends up on some page $j$ linked from $i$, and the process repeats indefinitely.
This matches the flow equation $r_j = \sum_{i \to j} r_i / d_{out}(i)$.
Let $p(t)$ be the vector whose $i$-th coordinate is the probability that the surfer is at page $i$ at time $t$; so $p(t)$ is a probability distribution over pages.

Where is the surfer at time $t+1$? Following a link uniformly at random gives
$$p(t+1) = M \cdot p(t)$$
Suppose the random walk reaches a state where $p(t+1) = M \cdot p(t) = p(t)$; then $p(t)$ is a stationary distribution of the random walk.
Our original rank vector $r$ satisfies $r = M \cdot r$, so $r$ is a stationary distribution for the random walk.

PageRank summary. Given a web graph with N nodes (pages) and edges (hyperlinks):
- Assign each node an initial page rank.
- Repeat until convergence ($\sum_i |r_i^{(t+1)} - r_i^{(t)}| < \varepsilon$): calculate the page rank of each node as $r_j^{(t+1)} = \sum_{i \to j} r_i^{(t)}/d_i$, where $d_i$ is the out-degree of node $i$.

Power iteration on the y/a/m example:
- Set $r_j \leftarrow 1/N$
- Repeat: compute $r'_j \leftarrow \sum_{i \to j} r_i/d_i$; if $|r - r'| > \varepsilon$, set $r \leftarrow r'$ and repeat; otherwise stop.
Example (iterations 0, 1, 2, 3, ... and the limit):

  r_y:  1/3   1/3   5/12   9/24  ...  6/15
  r_a:  1/3   3/6   1/3   11/24  ...  6/15
  r_m:  1/3   1/6   3/12   1/6   ...  3/15

Three questions about $r_j^{(t+1)} = \sum_{i \to j} r_i^{(t)}/d_i$, or equivalently $r = Mr$:
- Does this converge?
- Does it converge to what we want?
- Are the results reasonable?

Two problems:
- (1) Some pages are dead ends (have no out-links); such pages cause importance to "leak out".
- (2) Spider traps (all out-links are within the group); eventually spider traps absorb all importance.

The "spider trap" problem. Example: $a \to b$ and $b \to b$. Iterations 0, 1, 2, 3, ...:

  r_a:  1  0  0  0
  r_b:  0  1  1  1

The "dead end" problem. Example: $a \to b$ where $b$ has no out-links. Iterations 0, 1, 2, 3, ...:

  r_a:  1  0  0  0
  r_b:  0  1  0  0
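A short NumPy sketch that reproduces the two iteration tables above and makes the failure modes concrete: in the spider trap all mass ends up on b, while with a dead end the total mass leaks away to 0.

```python
import numpy as np

# Spider trap: a -> b, b -> b.  All importance ends up on b.
M_trap = np.array([[0.0, 0.0],
                   [1.0, 1.0]])
r = np.array([1.0, 0.0])
for _ in range(3):
    r = M_trap @ r
    print(r)            # [0, 1] forever: the trap absorbed everything

# Dead end: a -> b, and b has no out-links (its column is all zeros,
# so M is no longer column stochastic).
M_dead = np.array([[0.0, 0.0],
                   [1.0, 0.0]])
r = np.array([1.0, 0.0])
for _ in range(3):
    r = M_dead @ r
    print(r, r.sum())   # importance "leaks out": the total drops to 0
```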
The Google solution for spider traps: at each time step, the random surfer has two options:
- With probability $\beta$, follow a link at random.
- With probability $1-\beta$, jump to a random page.
Common values for $\beta$ are in the range 0.8 to 0.9. The surfer will teleport out of a spider trap within a few time steps.

Teleports for dead ends: follow random teleport links with total probability 1.0 from dead ends, and adjust the matrix accordingly. For the y/a/m example where m is a dead end, m's column changes from all zeros to 1/3 in every entry:

        y    a    m               y    a    m
  y    1/2  1/2   0         y    1/2  1/2  1/3
  a    1/2   0    0   -->   a    1/2   0   1/3
  m     0   1/2   0         m     0   1/2  1/3

Why are dead ends and spider traps a problem, and why do teleports solve it?
- Spider traps are not a problem for convergence, but with traps the PageRank scores are not what we want.
  - Solution: the surfer never gets stuck in a spider trap, because it teleports out within a finite number of steps.
- Dead ends are a problem: the matrix is not column stochastic, so our initial assumptions are not met.
  - Solution: make the matrix column stochastic by always teleporting when there is nowhere else to go.

Google's solution that does it all: at each step, the random surfer has two options:
- With probability $\beta$, follow a link at random.
- With probability $1-\beta$, jump to some random page.
PageRank equation [Brin-Page, '98]:
$$r_j = \sum_{i \to j} \beta \frac{r_i}{d_i} + (1-\beta)\frac{1}{N}$$
where $d_i$ is the out-degree of node $i$. This formulation assumes that $M$ has no dead ends. We can either preprocess matrix $M$ to remove all dead ends, or explicitly follow random teleport links with probability 1.0 from dead ends.

The Google Matrix A:
$$A = \beta M + (1-\beta)\left[\frac{1}{N}\right]_{N \times N}$$
where $[1/N]_{N \times N}$ is the N-by-N matrix with all entries 1/N.
- We have a recursive problem $r = A \cdot r$, and the power method still works!
- What is $\beta$? In practice $\beta = 0.8$ to $0.9$ (the surfer makes about 5 steps on average, then jumps).

Example with $\beta = 0.8$ on the y/a/m graph (now with the spider trap m -> m):

        1/2 1/2  0          1/3 1/3 1/3        7/15 7/15  1/15
  0.8 * 1/2  0   0  + 0.2 * 1/3 1/3 1/3   =    7/15 1/15  1/15
         0  1/2  1          1/3 1/3 1/3        1/15 7/15 13/15

Power iteration on A:

  r_y:  1/3  0.33  0.24  0.26  ...   7/33
  r_a:  1/3  0.20  0.20  0.18  ...   5/33
  r_m:  1/3  0.46  0.52  0.56  ...  21/33

Computing PageRank at scale: the key step is the matrix-vector multiplication r_new = A · r_old.
- Easy if we have enough main memory to hold A, r_old, r_new.
- But say N = 1 billion pages, with 4 bytes per entry:
  - 2 billion entries for the two vectors, approx 8 GB.
  - Matrix A has $N^2$ entries, and $10^{18}$ is a large number!

We can rearrange the PageRank equation:
$$r = \beta M \cdot r + \left[\frac{1-\beta}{N}\right]_N$$
where $[(1-\beta)/N]_N$ is a vector with all N entries equal to $(1-\beta)/N$.
- $M$ is a sparse matrix (with no dead ends): about 10 links per node, approx $10N$ entries.
- So in each iteration we need to: compute $r^{new} = \beta M \cdot r^{old}$, then add the constant value $(1-\beta)/N$ to each entry in $r^{new}$.
- Note: if $M$ contains dead ends, then $\sum_j r_j^{new} < 1$ and we also have to renormalize $r^{new}$ so that it sums to 1.

The complete algorithm:
- Input: directed graph $G$ (can have spider traps and dead ends) and parameter $\beta$.
- Output: PageRank vector $r^{new}$.
- Set $r_j^{old} = 1/N$.
- Repeat until convergence ($\sum_j |r_j^{new} - r_j^{old}| < \varepsilon$):
  - $\forall j$: $r'^{new}_j = \sum_{i \to j} \beta \, r_i^{old}/d_i$, and $r'^{new}_j = 0$ if the in-degree of $j$ is 0.
  - Re-insert the leaked PageRank: $\forall j$: $r_j^{new} = r'^{new}_j + \frac{1-S}{N}$, where $S = \sum_j r'^{new}_j$.
  - $r^{old} = r^{new}$.
If the graph has no dead ends, the amount of leaked PageRank is $1-\beta$. But since we have dead ends, the amount of leaked PageRank may be larger; we have to explicitly account for it by computing $S$.
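A plain-Python sketch of the complete algorithm above. The edge list encodes the y/a/m example with the spider trap m -> m, so with beta = 0.8 the result should approach (7/33, 5/33, 21/33); dead ends are handled implicitly, since they contribute nothing to S and their mass is re-inserted uniformly.

```python
def pagerank(edges, beta=0.8, eps=1e-10):
    """PageRank that handles dead ends and spider traps by computing the
    leaked mass S each round and re-inserting (1 - S)/N at every node."""
    nodes = {n for e in edges for n in e}
    out = {n: [] for n in nodes}
    for a, b in edges:
        out[a].append(b)
    N = len(nodes)
    r = {n: 1.0 / N for n in nodes}
    while True:
        r_new = {n: 0.0 for n in nodes}
        for i in nodes:
            for j in out[i]:                     # dead ends contribute nothing
                r_new[j] += beta * r[i] / len(out[i])
        S = sum(r_new.values())                  # leaked mass is 1 - S >= 1 - beta
        for n in nodes:
            r_new[n] += (1.0 - S) / N
        if sum(abs(r_new[n] - r[n]) for n in nodes) < eps:
            return r_new
        r = r_new

edges = [("y","y"), ("y","a"), ("a","y"), ("a","m"), ("m","m")]
print(pagerank(edges))   # approx. y: 7/33, a: 5/33, m: 21/33
```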
Example application: proximity on graphs.
Given a bipartite conferences-to-authors graph (IJCAI, KDD, ICDM, SDM, AAAI, NIPS, ... on one side; Philip S. Yu, Ning Zhong, R. Ramakrishnan, M. Jordan, ... on the other):
- What is the most related conference to ICDM?
- Which conferences should we recommend to M. Jordan to attend?

Bipartite Pin-and-Board graph (e.g., Pinterest). Which pair of pins is more related: A and A', B and B', C and C', D and D'?
- Shortest path distinguishes some of these pairs.
- Common neighbors distinguishes more.
- Personalized PageRank / Random Walk with Restarts captures all of them.

Graphs and web search:
- PageRank ranks nodes by "importance".
- Personalized PageRank ranks the proximity of nodes to the teleport nodes $S$.
- Proximity on graphs (e.g., what is the most related conference to ICDM?): Random Walks with Restarts, which teleport back to the starting node, i.e., S = { single node }.

Idea (Pixie, which is based on a graph of relationships between objects):
- Every node has some importance.
- Importance gets evenly split among all edges and pushed to the neighbors.
Given a set of QUERY_NODES, we simulate a random walk:
- Make a step to a random neighbor and record the visit (visit count).
- With probability ALPHA, restart the walk at one of the QUERY_NODES.
- The nodes with the highest visit counts have the highest proximity to the QUERY_NODES.

[Figure: Pixie random walk on a bipartite pin-and-board graph starting from query pin Q; visit counts (e.g., 16, 14, 9, 8, ...) accumulate on pins such as "Yummm Strawberries" and "Smoothie Madness!", and the most-visited pins are the most similar to Q.]
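A Monte Carlo sketch of the walk described above. The function and argument names (adj, query_nodes, alpha) are illustrative, not Pixie's actual API; adj maps each node to a non-empty list of neighbors.

```python
import random

def pixie_random_walk(adj, query_nodes, alpha=0.5, num_steps=100_000):
    """Visit counts from a random walk that restarts at query_nodes with
    probability alpha after each step (a sketch of Random Walk with Restarts)."""
    counts = {}
    node = random.choice(query_nodes)
    for _ in range(num_steps):
        node = random.choice(adj[node])      # step to a random neighbor
        counts[node] = counts.get(node, 0) + 1
        if random.random() < alpha:          # restart the walk
            node = random.choice(query_nodes)
    # nodes with the highest visit counts are most similar to the query nodes
    return sorted(counts.items(), key=lambda kv: -kv[1])

# e.g. adj = {"Q": ["B1"], "B1": ["Q", "P1"], "P1": ["B1"]}
# pixie_random_walk(adj, query_nodes=["Q"])
```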
Why is this great? It considers:
- Multiple connections
- Multiple paths
- Direct and indirect connections
- Degree of the node

Q: Which conferences are closest to KDD & ICDM?
A: Personalized PageRank on the graph of CS conferences with teleport set S = {KDD, ICDM}.
[Figure: PPR scores on the conference graph, e.g. ICML 0.011, PAKDD 0.009, PKDD 0.008, SDM 0.007, KDD 0.005, ICDM 0.005, with CIKM, ICDE, ECML, SIGMOD, and DMKD receiving scores around 0.004-0.005.]

"Normal" PageRank:
- Teleports uniformly at random to any node.
- All nodes have the same probability of the surfer landing there: S = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
Topic-Specific PageRank, also known as Personalized PageRank:
- Teleports to a topic-specific set of pages.
- Nodes can have different probabilities of the surfer landing there: S = [0.1, 0, 0, 0.2, 0, 0, 0.5, 0, 0, 0.2]
Random Walk with Restarts:
- Topic-Specific PageRank where the teleport is always to the same single node: S = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
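All three variants differ only in the teleport vector S, which suggests a single implementation. A NumPy sketch (assuming M has no dead ends), reusing the y/a/m matrix from earlier:

```python
import numpy as np

def personalized_pagerank(M, S, beta=0.8, eps=1e-10):
    """Power iteration for r = beta*M*r + (1-beta)*S, where S is the teleport
    distribution.  Uniform S gives ordinary PageRank; S concentrated on a few
    nodes gives Personalized PageRank; a one-hot S gives Random Walk with
    Restarts.  Assumes M is column stochastic (no dead ends)."""
    N = M.shape[0]
    r = np.full(N, 1.0 / N)
    while True:
        r_new = beta * (M @ r) + (1 - beta) * S
        if np.abs(r_new - r).sum() < eps:
            return r_new
        r = r_new

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(personalized_pagerank(M, np.full(3, 1/3)))        # ordinary PageRank
print(personalized_pagerank(M, np.array([1., 0., 0.]))) # RWR: restart at node y
```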
CS224W: Machine Learning with Graphs
Jure Leskovec, Stanford University
http://cs224w.stanford.edu

Traditional feature-based methods. Tasks:
- Node-level prediction
- Link-level prediction
- Graph-level prediction

Traditional ML pipeline:
- Design features for nodes/links/graphs.
- Obtain features for all training data (node features, link features, graph features, each a vector in $\mathbb{R}^D$).
- Train an ML model (random forest, SVM, neural network, etc.) on pairs $(x_1, y_1), \ldots, (x_N, y_N)$.
- Apply the model: given a new node/link/graph, obtain its features and make a prediction.

- Using effective features over graphs is the key to achieving good model performance.
- The traditional ML pipeline uses hand-designed features.
- In this lecture, we overview the traditional features for node-level, link-level, and graph-level prediction.
- For simplicity, we focus on undirected graphs.

Goal: make predictions for a set of objects.
Design choices:
- Features: d-dimensional vectors
- Objects: nodes, edges, sets of nodes, entire graphs
- Objective function: what task are we aiming to solve?
Example: node-level prediction. Given some labeled nodes, learn a function that maps a node to its label. How do we learn the function? Node classification: ML needs features.

Node-level features. Goal: characterize the structure and position of a node in the network:
- Node degree
- Node centrality
- Clustering coefficient
- Graphlets

Node degree:
- The degree $k_v$ of node $v$ is the number of edges (neighboring nodes) the node has.
- It treats all neighboring nodes equally. [Figure: example graph with labeled degrees, e.g., nodes of degree 1, 2, 3, and 4.]

Node centrality:
- Node degree counts the neighboring nodes without capturing their importance.
- Node centrality $c_v$ takes the node's importance in the graph into account.
- Different ways to model importance: eigenvector centrality, betweenness centrality, closeness centrality, and many others.

Eigenvector centrality: a node $v$ is important if it is surrounded by important neighboring nodes $u \in N(v)$. We model the centrality of node $v$ as the sum of the centralities of its neighbors:
$$c_v = \frac{1}{\lambda} \sum_{u \in N(v)} c_u$$
where $\lambda$ is a normalization constant (it will turn out to be the largest eigenvalue of $A$). Notice that the above equation models centrality in a recursive manner. How do we solve it? Rewrite the recursive equation in matrix form:
$$\lambda c = A c$$
where $A$ is the adjacency matrix ($A_{uv} = 1$ if $u \in N(v)$), $c$ is the centrality vector, and $\lambda$ is an eigenvalue. We see that centrality $c$ is an eigenvector of $A$! The largest eigenvalue $\lambda_{max}$ is always positive and unique (by the Perron-Frobenius theorem), and the eigenvector $c_{max}$ corresponding to $\lambda_{max}$ is used for centrality.

Betweenness centrality: a node is important if it lies on many shortest paths between other nodes.
Example (on a 5-node graph A-E): $c_A = c_B = c_E = 0$; $c_C = 3$ (on the shortest paths A-C-B, A-C-D, A-C-D-E); $c_D = 3$ (on A-C-D-E, B-D-E, C-D-E).

Closeness centrality: a node is important if it has small shortest-path lengths to all other nodes.
$$c_v = \frac{1}{\sum_{u \neq v} \text{shortest-path length between } u \text{ and } v}$$
Example: $c_A = 1/(2+1+2+3) = 1/8$ (via A-C-B, A-C, A-C-D, A-C-D-E); $c_D = 1/(2+1+1+1) = 1/5$ (via D-C-A, D-B, D-C, D-E).

Clustering coefficient: measures how connected $v$'s neighboring nodes are:
$$e_v = \frac{\#(\text{edges among neighboring nodes})}{\binom{k_v}{2}} \in [0, 1]$$
where $\binom{k_v}{2}$ is the number of node pairs among the $k_v$ neighboring nodes.
Examples: for a node with 4 neighbors the denominator is 6 (4 choose 2), and $e_v = 1$, $e_v = 0.5$, or $e_v = 0$ depending on how many neighbor pairs are connected.

Observation: the clustering coefficient counts the number of triangles in the ego-network of $v$. In the example, $e_v = 0.5$ because there are 3 triangles out of 6 node triplets. We can generalize this by counting the number of pre-specified subgraphs, i.e., graphlets.
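All of these node-level features are available off the shelf. A short NetworkX sketch; the edge list is assumed here (A-C, B-C, B-D, C-D, D-E), chosen so that it reproduces the betweenness and closeness examples above.

```python
import networkx as nx

# Assumed 5-node example graph matching the centrality examples.
G = nx.Graph([("A","C"), ("B","C"), ("B","D"), ("C","D"), ("D","E")])

degree      = dict(G.degree())                               # k_v
eigenvector = nx.eigenvector_centrality(G)                   # c = (1/lambda) A c
betweenness = nx.betweenness_centrality(G, normalized=False) # C: 3.0, D: 3.0
closeness   = {v: 1.0 / sum(nx.shortest_path_length(G, v).values())
               for v in G}                                   # A: 1/8, D: 1/5
clustering  = nx.clustering(G)                               # fraction of closed pairs

print(betweenness["C"], closeness["A"], clustering["C"])     # 3.0 0.125 0.333...
```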
Graphlets. Goal: describe the network structure around node $u$. Graphlets are small subgraphs that describe the structure of node $u$'s network neighborhood.
Analogy:
- Degree counts the #(edges) that a node touches.
- Clustering coefficient counts the #(triangles) that a node touches.
- The Graphlet Degree Vector (GDV), a graphlet-based feature for nodes, counts the #(graphlets) that a node touches.

Considering graphlets of size 2 to 5 nodes, we get a vector of 73 coordinates: a signature of a node that describes the topology of the node's neighborhood.
- The graphlet degree vector provides a measure of a node's local network topology: comparing the vectors of two nodes provides a more detailed measure of local topological similarity than node degrees or the clustering coefficient.

Definitions:
- Induced subgraph: another graph, formed from a subset of vertices and all of the edges connecting the vertices in that subset.
- Graph isomorphism: two graphs that contain the same number of nodes connected in the same way are said to be isomorphic. (Example: two isomorphic graphs related by the node mapping (e2,c2), (e1,c5), (e3,c4), (e5,c3), (e4,c1); and a non-isomorphic pair where the right graph has cycles of length 3 but the left graph does not, so the graphs cannot be isomorphic. Source: Mathoverflow.)

[Przulj et al., Bioinformatics 2004]
Graphlets: rooted, connected, induced, non-isomorphic subgraphs.
- "Induced": take some nodes and all the edges between them. The graphlet id records the root, i.e., the "position" of node $u$.
- There are 73 different graphlets on up to 5 nodes; in the example, node $u$ touches graphlets 0, 1, 2, 3, 5, 10, 11, ...

Graphlet Degree Vector (GDV): a count vector of graphlets rooted at a given node.
Example with the possible graphlets up to size 3, labeled $a, b, c, d$: the GDV of node $u$ is $[a, b, c, d] = [2, 1, 0, 2]$.

We have introduced different ways to obtain node features. They can be categorized as:
- Importance-based features: node degree, different node centrality measures.
- Structure-based features: node degree, clustering coefficient, graphlet count vector.

Importance-based features capture the importance of a node in a graph:
- Node degree simply counts the number of neighboring nodes.
- Node centrality models the importance of neighboring nodes in a graph (different modeling choices: eigenvector centrality, betweenness centrality, closeness centrality).
- Useful for predicting influential nodes in a graph, e.g., predicting celebrity users in a social network.

Structure-based features capture topological properties of the local neighborhood around a node:
- Node degree counts the number of neighboring nodes.
- Clustering coefficient measures how connected the neighboring nodes are.
- Graphlet degree vector counts the occurrences of different graphlets.
- Useful for predicting a particular role a node plays in a graph, e.g., predicting protein functionality in a protein-protein interaction network.
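The size-3 case of the graphlet degree vector is small enough to compute directly. A sketch, assuming the ordering (edge, path with u at the end, path with u in the middle, triangle); the adjacency dict is my own tiny example, not the slide's figure.

```python
from itertools import combinations

def gdv_up_to_size3(adj, u):
    """Graphlet degree vector [a, b, c, d] for the four rooted graphlets of
    size <= 3 (ordering assumed: edge, path-end, path-middle, triangle).
    adj maps node -> set of neighbors in an undirected graph."""
    a = len(adj[u])                                 # edges touching u
    b = sum(1 for v in adj[u] for w in adj[v]
            if w != u and w not in adj[u])          # induced path u-v-w, u at end
    c = sum(1 for v, w in combinations(adj[u], 2)
            if w not in adj[v])                     # induced path v-u-w, u in middle
    d = sum(1 for v, w in combinations(adj[u], 2)
            if w in adj[v])                         # triangles containing u
    return [a, b, c, d]

adj = {"u": {"v","w"}, "v": {"u","w","x"}, "w": {"u","v"}, "x": {"v"}}
print(gdv_up_to_size3(adj, "u"))   # [2, 1, 0, 1] for this example graph
```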
Different ways to label the nodes of a network: the node features defined so far would allow us to distinguish the nodes in one example labelling, but they would not allow distinguishing another labelling in which nodes in structurally identical positions carry different labels.

CS224W: Machine Learning with Graphs
Jure Leskovec, Stanford University
http://cs224w.stanford.edu

Link-level prediction:
- The task is to predict new links based on the existing links.
- At test time, node pairs (with no existing links) are ranked, and the top $K$ node pairs are predicted.
- The key is to design features for a pair of nodes.

Two formulations of the link prediction task:
- 1) Links missing at random: remove a random set of links and then aim to predict them.
- 2) Links over time: given $G[t_0, t_0']$, a graph defined by edges up to time $t_0'$, output a ranked list L of edges (not in $G[t_0, t_0']$) that are predicted to appear in $G[t_1, t_1']$.
  - Evaluation: let $n = |E_{new}|$ be the number of new edges that appear during the test period $[t_1, t_1']$; take the top n elements of L and count the correct edges.

Methodology:
- For each pair of nodes (x, y), compute a score c(x, y); for example, c(x, y) could be the number of common neighbors of x and y.
- Sort pairs (x, y) by decreasing score c(x, y).
- Predict the top n pairs as new links.
- See which of these links actually appear in $G[t_1, t_1']$.

Link-level features:
- Distance-based features
- Local neighborhood overlap
- Global neighborhood overlap

Distance-based features: the shortest-path distance between two nodes.
Example: $S_{BH} = S_{BE} = S_{AB} = 2$ and $S_{BG} = S_{BF} = 3$.
However, this does not capture the degree of neighborhood overlap: node pair (B, H) has 2 shared neighboring nodes, while pairs (B, E) and (A, B) only have 1 such node.

Local neighborhood overlap: captures the number of neighboring nodes shared between two nodes $v_1$ and $v_2$:
- Common neighbors: $|N(v_1) \cap N(v_2)|$. Example: $|N(A) \cap N(B)| = |\{C\}| = 1$.
- Jaccard's coefficient: $\frac{|N(v_1) \cap N(v_2)|}{|N(v_1) \cup N(v_2)|}$. Example: $\frac{|N(A) \cap N(B)|}{|N(A) \cup N(B)|} = \frac{|\{C\}|}{|\{C, D\}|} = \frac{1}{2}$.
- Adamic-Adar index: $\sum_{u \in N(v_1) \cap N(v_2)} \frac{1}{\log(k_u)}$. Example: $\frac{1}{\log(k_C)} = \frac{1}{\log 4}$.

Limitation of local neighborhood features: the metric is always zero if the two nodes do not have any neighbors in common ($N_A \cap N_E = \emptyset$, so $|N_A \cap N_E| = 0$). However, the two nodes may still potentially be connected in the future. Global neighborhood overlap metrics resolve the limitation by considering the entire graph.

Katz index: count the number of walks of all lengths between a given pair of nodes.
- Q: How do we compute the number of walks between two nodes?
- Use powers of the graph adjacency matrix!
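The three local-overlap scores fit in a few lines of Python. A sketch; the adjacency dict is a hypothetical graph chosen to match the slide's numbers (N(A) ∩ N(B) = {C}, Jaccard = 1/2, and C of degree 4 so Adamic-Adar = 1/log 4), and log is the natural logarithm here since the slide leaves the base unspecified.

```python
import math

def local_overlap_features(adj, x, y):
    """Common neighbors, Jaccard coefficient, and Adamic-Adar index for a
    candidate link (x, y); adj maps node -> set of neighbors."""
    common = adj[x] & adj[y]
    return {
        "common_neighbors": len(common),
        "jaccard": len(common) / len(adj[x] | adj[y]),
        "adamic_adar": sum(1.0 / math.log(len(adj[u])) for u in common),
    }

adj = {"A": {"C"}, "B": {"C","D"}, "C": {"A","B","E","F"},
       "D": {"B"}, "E": {"C"}, "F": {"C"}}
print(local_overlap_features(adj, "A", "B"))
# {'common_neighbors': 1, 'jaccard': 0.5, 'adamic_adar': 0.7213...}
```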
Computing the number of walks between two nodes. Recall that $A_{uv} = 1$ if $u \in N(v)$.
- Let $P^{(K)}_{uv}$ be the number of walks of length $K$ between $u$ and $v$. We will show that $P^{(K)} = A^K$.
- $P^{(1)}_{uv}$ = the number of walks of length 1 (direct neighborhood) between $u$ and $v$ = $A_{uv}$.
- How do we compute $P^{(2)}_{uv}$?
  - Step 1: compute the number of walks of length 1 between each of $u$'s neighbors and $v$.
  - Step 2: sum up these counts across $u$'s neighbors:
$$P^{(2)}_{uv} = \sum_i A_{ui} \cdot P^{(1)}_{iv} = \sum_i A_{ui} \cdot A_{iv} = A^2_{uv}$$
- In general: $A_{uv}$ specifies the number of walks of length 1 between $u$ and $v$; $A^2_{uv}$ the number of walks of length 2 (neighbor of a neighbor); and $A^l_{uv}$ the number of walks of length $l$.

The Katz index between $v_1$ and $v_2$, summed over all walk lengths with a discount factor $0 < \beta < 1$:
$$S_{v_1 v_2} = \sum_{l=1}^{\infty} \beta^l A^l_{v_1 v_2}$$
The Katz index matrix is computed in closed form by the geometric series of matrices:
$$S = \sum_{i=1}^{\infty} \beta^i A^i = (I - \beta A)^{-1} - I$$

Link-level features summary:
- Distance-based features: use the shortest path length between two nodes, but do not capture how the neighborhoods overlap.
- Local neighborhood overlap: captures how many neighboring nodes are shared by two nodes; becomes zero when no neighbor nodes are shared.
- Global neighborhood overlap: uses the global graph structure to score two nodes; the Katz index counts walks of all lengths between two nodes.
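The closed form above makes the Katz index matrix a one-liner with NumPy. A sketch; the 4-cycle adjacency matrix is an arbitrary example, and $\beta$ must be smaller than $1/\lambda_{max}(A)$ for the geometric series to converge.

```python
import numpy as np

def katz_index(A, beta=0.1):
    """Closed-form Katz index: S = sum_{l>=1} beta^l A^l = (I - beta*A)^{-1} - I.
    Requires beta < 1/lambda_max(A) so that the series converges."""
    I = np.eye(A.shape[0])
    return np.linalg.inv(I - beta * A) - I

# Adjacency matrix of a 4-cycle (lambda_max = 2, so beta = 0.1 is safe):
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
S = katz_index(A, beta=0.1)
print(S[0, 2])   # discounted count of walks of all lengths between nodes 0 and 2
```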
CS224W: Machine Learning with Graphs
Jure Leskovec, Stanford University
http://cs224w.stanford.edu

Graph-level features. Goal: we want features that characterize the structure of an entire graph.

Kernel methods are widely used in traditional ML for graph-level prediction.
- Idea: design kernels instead of feature vectors.
- A quick introduction to kernels:
  - A kernel $K(G, G') \in \mathbb{R}$ measures similarity between data points.
  - The kernel matrix $K = \big(K(G, G')\big)_{G, G'}$ must always be positive semidefinite (i.e., have positive eigenvalues).
  - There exists a feature representation $\phi(\cdot)$ such that $K(G, G') = \phi(G)^T \phi(G')$.
  - Once the kernel is defined, an off-the-shelf ML model, such as a kernel SVM, can be used to make predictions.

Graph kernels measure similarity between two graphs:
- Graphlet Kernel
- Weisfeiler-Lehman Kernel
- Other kernels are also proposed in the literature (beyond the scope of this lecture): the random-walk kernel, the shortest-path graph kernel, and many more.
[Shervashidze, Nino, et al. "Efficient graphlet kernels for large graph comparison." Artificial Intelligence and Statistics, 2009. Shervashidze, Nino, et al. "Weisfeiler-Lehman graph kernels." Journal of Machine Learning Research 12.9 (2011).]

Goal: design a graph feature vector $\phi(G)$.
- Key idea: Bag-of-Words (BoW) for a graph. Recall that BoW simply uses word counts as features for documents (no ordering considered). A naive extension to a graph: regard nodes as words. But since two different graphs can both have, say, 4 red nodes, we would get the same feature vector for two different graphs: $\phi(G_1) = \phi(G_2)$.
- What if we use a bag of node degrees? Counting nodes of degree 1, 2, and 3 gives, e.g., $\phi(G_1) = [1, 2, 1]$ but $\phi(G_2) = [0, 2, 2]$: we obtain different features for different graphs!
- Both the Graphlet Kernel and the Weisfeiler-Lehman (WL) Kernel use a Bag-of-* representation of a graph, where * is more sophisticated than node degrees.

Graphlet kernel. Key idea: count the number of different graphlets in a graph.
- Note: the definition of graphlets here is slightly different from the node-level features. The two differences are:
  - Nodes in graphlets here do not need to be connected (isolated nodes are allowed).
  - The graphlets here are not rooted.
- Let $\mathcal{G}_k = (g_1, g_2, \ldots, g_{n_k})$ be the list of graphlets of size $k$. For $k = 3$, there are 4 graphlets ($g_1, \ldots, g_4$); for $k = 4$, there are 11 graphlets. [Shervashidze et al., AISTATS 2011]
- Given a graph $G$ and a graphlet list $\mathcal{G}_k = (g_1, \ldots, g_{n_k})$, define the graphlet count vector $f_G \in \mathbb{R}^{n_k}$ as $(f_G)_i = \#(g_i \subseteq G)$ for $i = 1, 2, \ldots, n_k$.
- Example for $k = 3$: $f_G = (1, 3, 6, 0)$.
- Given two graphs $G$ and $G'$, the graphlet kernel is computed as $K(G, G') = f_G^T f_{G'}$.
- Problem: if $G$ and $G'$ have different sizes, that will greatly skew the value.
- Solution: normalize each feature vector: $h_G = f_G / \mathrm{sum}(f_G)$, and $K(G, G') = h_G^T h_{G'}$.

Limitations: counting graphlets is expensive!
- Counting size-$k$ graphlets for a graph of size $n$ by enumeration takes $O(n^k)$.
- This is unavoidable in the worst case, since the subgraph isomorphism test (judging whether a graph is a subgraph of another graph) is NP-hard.
- If a graph's node degree is bounded by $d$, an $O(n d^{k-1})$ algorithm exists to count all the graphlets of size $k$.
Can we design a more efficient graph kernel?
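A brute-force sketch of the $k = 3$ graphlet kernel by direct enumeration; the $O(n^3)$ loop over all node triples is exactly the cost the slide warns about. The ordering of the four graphlets (triangle, path, single edge, no edge) and the "triangle plus pendant node" example graph are my own assumptions.

```python
from itertools import combinations
import numpy as np

def graphlet3_counts(n, edges):
    """f_G for k = 3: counts of (triangle, path, single edge, no edge)
    over all induced 3-node subgraphs, by O(n^3) enumeration."""
    E = {frozenset(e) for e in edges}
    f = np.zeros(4)
    for trio in combinations(range(n), 3):
        m = sum(frozenset(p) in E for p in combinations(trio, 2))
        f[3 - m] += 1          # 3 edges -> triangle, ..., 0 edges -> empty
    return f

def graphlet_kernel(fa, fb):
    """Normalized graphlet kernel K(G, G') = h_G . h_G'."""
    return (fa / fa.sum()) @ (fb / fb.sum())

# Example: a triangle with one pendant node (n = 4).
f = graphlet3_counts(4, [(0, 1), (1, 2), (0, 2), (2, 3)])
print(f)                         # [1. 2. 1. 0.]
print(graphlet_kernel(f, f))     # kernel of the graph with itself
```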
Weisfeiler-Lehman kernel. Goal: design an efficient graph feature descriptor $\phi(G)$.
- Idea: use neighborhood structure to iteratively enrich the node vocabulary. This is a generalized version of the bag of node degrees, since node degrees are one-hop neighborhood information.
- The algorithm that achieves this is called color refinement.

Color refinement. Given a graph $G$ with a set of nodes $V$:
- Assign an initial color $c^{(0)}(v)$ to each node $v$.
- Iteratively refine node colors by
$$c^{(k+1)}(v) = \mathrm{HASH}\Big(c^{(k)}(v), \{c^{(k)}(u)\}_{u \in N(v)}\Big)$$
  where HASH maps different inputs to different colors.
- After $K$ steps of color refinement, $c^{(K)}(v)$ summarizes the structure of the $K$-hop neighborhood.

Example of color refinement on two graphs $G_1$ and $G_2$:
- Assign initial colors: every node gets color 1.
- Aggregate neighboring colors: a node of color 1 with two neighbors of color 1 gets the signature (1,11), with three neighbors (1,111), and so on.
- Hash the aggregated colors with the table: (1,1) -> 2; (1,11) -> 3; (1,111) -> 4; (1,1111) -> 5.
- Aggregate neighboring colors again, yielding signatures such as (3,45), (4,345), (3,44), (5,2244), (5,2344), (4,245), (2,5), (2,4).
- Hash again with the table: (2,4) -> 6; (2,5) -> 7; (3,44) -> 8; (3,45) -> 9; (4,245) -> 10; (4,345) -> 11; (5,2244) -> 12; (5,2344) -> 13.

After color refinement, the WL kernel counts the number of nodes with a given color:
- Counts for colors 1-13 in $G_1$: [6, 2, 1, 2, 1, 0, 2, 1, 0, 0, 0, 2, 1]
- Counts for colors 1-13 in $G_2$: [6, 2, 1, 2, 1, 1, 1, 0, 1, 1, 1, 0, 1]
The WL kernel value is computed as the inner product of the color count vectors: $K(G_1, G_2) = 49$.

The WL kernel is computationally efficient:
- The time complexity of color refinement at each step is linear in #(edges), since it involves aggregating neighboring colors.
- When computing a kernel value, only the colors appearing in the two graphs need to be tracked; thus #(colors) is at most the total number of nodes.
- Counting colors takes time linear in #(nodes).
- In total, the time complexity is linear in #(edges).

Graph-level features summary:
- Graphlet Kernel: the graph is represented as a bag of graphlets; computationally expensive.
- Weisfeiler-Lehman Kernel: apply the $K$-step color refinement algorithm to enrich node colors; different colors capture different $K$-hop neighborhood structures; the graph is represented as a bag of colors; computationally efficient; closely related to Graph Neural Networks (as we will see!).
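A compact sketch of color refinement and the WL kernel. HASH is implemented as a lookup table that assigns a fresh color id to each new (color, sorted neighbor colors) signature, and the table is shared across the graphs so their colors align, as in the slide's example.

```python
def wl_histograms(graphs, num_iters=2):
    """Run K-step color refinement jointly on several graphs (shared hash
    table) and return one bag-of-colors histogram per graph, over all rounds.
    Each graph is an adjacency dict: node -> list of neighbors."""
    table = {}
    def hash_color(sig):
        if sig not in table:
            table[sig] = len(table) + 2   # fresh color id; 1 is the initial color
        return table[sig]

    hists = []
    for adj in graphs:
        colors = {v: 1 for v in adj}      # every node starts with color 1
        hist = {1: len(adj)}
        for _ in range(num_iters):
            colors = {v: hash_color((colors[v],
                                     tuple(sorted(colors[u] for u in adj[v]))))
                      for v in adj}
            for c in colors.values():
                hist[c] = hist.get(c, 0) + 1
        hists.append(hist)
    return hists

def wl_kernel(adj1, adj2, num_iters=2):
    """Inner product of the two color-count vectors."""
    h1, h2 = wl_histograms([adj1, adj2], num_iters)
    return sum(n * h2.get(c, 0) for c, n in h1.items())

tri = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
print(wl_kernel(tri, tri))   # 27: three nodes per color over three rounds
```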
Lecture summary:
- Traditional ML pipeline: hand-crafted features + an ML model.
- Hand-crafted features for graph data:
  - Node-level: node degree, centrality, clustering coefficient, graphlets
  - Link-level: distance-based features, local/global neighborhood overlap
  - Graph-level: graphlet kernel, WL kernel

CS224W: Machine Learning with Graphs
Jure Leskovec, Stanford University
http://cs224w.stanford.edu

Given an input graph, extract node-, link- and graph-level features, and learn a model (SVM, neural network, etc.) that maps features to labels.
- Pipeline: input graph -> structured features -> learning algorithm -> prediction.
- Feature engineering (node-level, edge-level, graph-level features) feeds the downstream prediction task.
Graph Representation Learning alleviates the need to do feature engineering every single time:
- Pipeline: input graph -> structured features -> learning algorithm -> prediction.
- Representation learning automatically learns the features for the downstream prediction task.