Graph Data - CSMODEL PDF

Uploaded by PatientPermutation
Summary
This presentation provides an overview of graph data and related concepts such as graph embeddings. The discussion includes algorithms like DeepWalk and Node2Vec, and their applications in various fields. It also covers representations such as adjacency lists and matrices.
Full Transcript
Graph Data
CSMODEL

Graphs

Graphs are a general language for describing systems of interacting entities. A graph is a collection of objects where some pairs of objects are connected by links.

Networks (Natural Graphs):
Communication systems link electronic devices.
Interactions between genes/proteins regulate life.
Thoughts are connected through neurons in our brain.

Information Graphs:
Information/knowledge is organized and linked.
Scene graphs: how objects in a scene relate.
Similarity networks: take data, connect similar points.

Components of a Graph
Objects: nodes, vertices
Interactions: links, edges
System: network, graph

A graph is connected if every two vertices have a path between them. G1 is a connected graph, while G2 is not.

Networks or Graphs
"Network" often refers to real systems: the Web, social networks, metabolic networks. Jargon: network, node, link.
"Graph" is a mathematical representation of a network: web graph, social graph, knowledge graph. Jargon: graph, vertex, edge.

Undirected Graph

Undirected graphs are composed of edges which do not have any specific direction. Each edge instead represents a two-way relationship between the vertices. (Graph G1)

The undirected graph G can be represented as G = (V, E), where V is the set of vertices and E is the set of edges.
G1 = (V1, E1)
V1 = {A, B, C, D, E}
E1 = {(A, C), (A, D), (B, D), (B, E), (C, E)}

Degree refers to the number of edges at a vertex. In G1, all vertices have degree 2. In G2, all vertices have degree 3.

The maximum number of edges in any n-vertex undirected graph is n(n - 1)/2. An n-vertex undirected graph with exactly n(n - 1)/2 edges is a complete undirected graph.
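As a minimal sketch (not from the slides), the undirected graph G1 defined above can be written down directly from its vertex and edge sets, and the degree and complete-graph formulas checked against it:

```python
# Undirected graph G1 from this section.
V1 = {"A", "B", "C", "D", "E"}
E1 = {("A", "C"), ("A", "D"), ("B", "D"), ("B", "E"), ("C", "E")}

def degree(v, edges):
    """Number of edges incident to vertex v."""
    return sum(v in e for e in edges)

n = len(V1)
max_edges = n * (n - 1) // 2  # maximum edges in an n-vertex undirected graph

print(sorted(degree(v, E1) for v in V1))  # every vertex of G1 has degree 2
print(max_edges)  # 10, so G1 (with 5 edges) is not complete
```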
Graph G1 is a complete undirected graph.

Directed Graph

Directed graphs are composed of edges which have a specified direction. Thus, at most 2 edges of different directions may be used to connect 2 different vertices. (Graph G1)

The directed graph G can be represented as G = (V, E), where V is the set of vertices and E is the set of edges.
G1 = (V1, E1)
V1 = {A, B, C, D, E, F, G, H}
E1 = {<A, C>, <B, E>, <C, A>, <C, F>, <C, G>, <D, F>, <D, H>, <E, B>, <E, G>, <E, H>, <F, D>, <G, F>, <G, H>, <H, D>}

Out-degree: the number of arrows originating from a vertex.
Out-degree of vertex A is 1.
Out-degree of vertex D is 2.
Out-degree of vertex C is 3.

In-degree: the number of arrows pointing to a vertex.
In-degree of vertex B is 1.
In-degree of vertex G is 2.
In-degree of vertex F is 3.

The maximum number of edges in any n-vertex directed graph is n(n - 1). An n-vertex directed graph with exactly n(n - 1) edges is a complete directed graph. Graph G1 is a complete directed graph.

Weighted Graphs

Graphs for which each edge may have an associated weight, typically given by a weight function w: E → R.

Graph Representations

Collection of Adjacency Lists: the adjacency list representation is usually preferred since it provides a compact way to represent sparse graphs (i.e., |E| much less than |V|^2).
Adjacency Matrix: the adjacency matrix representation is preferred if the graph is dense (i.e., |E| close to |V|^2).
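A small sketch (not from the slides) of the out-degree and in-degree definitions, applied to the directed graph G1 in this section, with edges stored as ordered pairs:

```python
# Directed graph G1 from this section; each tuple (u, v) is the edge <u, v>.
E1 = {("A", "C"), ("B", "E"), ("C", "A"), ("C", "F"), ("C", "G"),
      ("D", "F"), ("D", "H"), ("E", "B"), ("E", "G"), ("E", "H"),
      ("F", "D"), ("G", "F"), ("G", "H"), ("H", "D")}

def out_degree(v, edges):
    """Arrows originating from v."""
    return sum(1 for (u, _) in edges if u == v)

def in_degree(v, edges):
    """Arrows pointing to v."""
    return sum(1 for (_, w) in edges if w == v)

print(out_degree("A", E1), out_degree("D", E1), out_degree("C", E1))  # 1 2 3
print(in_degree("B", E1), in_degree("G", E1), in_degree("F", E1))     # 1 2 3
```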
Adjacency List: The adjacency list representation of a graph G = (V, E) consists of an array A of |V| lists, one for each vertex in V. For each vertex u in V, the adjacency list A[u] contains all vertices v such that there is an edge (u, v) in E. The adjacency list representation's memory requirement is Theta(V + E).

Adjacency List - Undirected: In an undirected graph, the sum of the lengths of all adjacency lists is 2|E|, since if (u, v) is an undirected edge, then u appears in v's adjacency list and vice versa.

Adjacency List - Directed: In a directed graph, the sum of the lengths of all the adjacency lists is |E|, since an edge of the form <u, v> is represented by having v appear in A[u].

Adjacency List - Weighted Undirected: The weight w(u, v) of the edge (u, v) in E is stored with vertex v in u's adjacency list, and vice versa.

Adjacency List - Weighted Directed: The weight w(u, v) of the edge (u, v) in E is stored with vertex v in u's adjacency list.

Adjacency Matrix: The adjacency matrix representation of a graph G = (V, E) consists of a |V| x |V| matrix A = (a_ij) such that:
a_ij = 1 if (i, j) in E, and a_ij = 0 otherwise.
The adjacency matrix representation of a graph requires Theta(V^2) memory, independent of the number of edges in the graph.
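The two representations just described can be contrasted with a minimal sketch (not from the slides) on a small undirected graph; note that the lists total 2|E| entries, while the matrix is always |V| x |V|:

```python
V = ["A", "B", "C", "D", "E"]
E = [("A", "C"), ("A", "D"), ("B", "D"), ("B", "E"), ("C", "E")]

# Adjacency list: one list per vertex; each undirected edge (u, v)
# puts v in u's list and u in v's list, so total length is 2|E|.
adj = {v: [] for v in V}
for u, v in E:
    adj[u].append(v)
    adj[v].append(u)

# Adjacency matrix: |V| x |V| zero/one matrix, Theta(V^2) memory
# regardless of how many edges the graph has.
idx = {v: i for i, v in enumerate(V)}
A = [[0] * len(V) for _ in V]
for u, v in E:
    A[idx[u]][idx[v]] = 1
    A[idx[v]][idx[u]] = 1

print(sum(len(lst) for lst in adj.values()))  # 2|E| = 10
```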
Adjacency Matrix - Undirected

    A  B  C  D  E  F
A   0  0  1  1  0  0
B   0  0  0  1  1  0
C   1  0  0  1  0  1
D   1  1  1  0  1  0
E   0  1  0  1  0  1
F   0  0  1  0  1  0

Adjacency Matrix - Undirected Weighted

    A  B  C  D  E  F
A   0  0  2  2  0  0
B   0  0  0  2  2  0
C   2  0  0  3  0  5
D   2  2  3  0  3  0
E   0  2  0  3  0  5
F   0  0  5  0  5  0

Adjacency Matrix - Directed

    A  B  C  D  E  F  G
A   0  0  1  0  0  0  0
B   0  0  0  1  0  0  0
C   1  0  0  0  1  1  0
D   0  1  0  0  0  1  1
E   0  0  0  0  0  0  0
F   1  1  0  0  1  0  1
G   0  0  0  0  0  0  0

Adjacency Matrix - Directed Weighted

    A  B  C  D  E  F  G
A   0  0  1  0  0  0  0
B   0  0  0  1  0  0  0
C   1  0  0  0  3  5  0
D   0  1  0  0  0  5  3
E   0  0  0  0  0  0  0
F   6  6  0  0  3  0  3
G   0  0  0  0  0  0  0

Graph Embeddings

A graph embedding transforms a graph into a lower-dimensional representation while preserving its topology. Its goal is to turn graphs into a format that machine learning algorithms can understand and process. Machine learning algorithms are tuned for continuous data; thus we need to convert graphs, which are discrete by nature, into a continuous vector space.

Recent algorithms used to produce graph embeddings:
DeepWalk (Perozzi et al., 2014)
Node2Vec (Grover & Leskovec, 2016)

DeepWalk (Perozzi et al., 2014)

DeepWalk belongs to the family of graph embedding techniques that use walks.

Graphs are like texts. Just as a word can be one-hot encoded over a vocabulary, a vertex can be encoded over the vertex set. For example, over the vocabulary (the, dog, is, cute, cat, also, red, but, blue, not), the word "cat" is encoded as [0 0 0 0 1 0 0 0 0 0]; similarly, over vertices A through J, vertex A's row in the adjacency matrix is [0 0 0 1 0 1 0 0 0 0].

Language modeling estimates the likelihood of a specific sequence of words appearing in a corpus.
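The "graphs are like texts" analogy in this section can be sketched in a few lines (an illustration, not slide code): a word's one-hot vector over a vocabulary has the same shape as a vertex's row in an adjacency matrix.

```python
# Vocabulary from the slide's example sentence fragments.
vocab = ["the", "dog", "is", "cute", "cat", "also", "red", "but", "blue", "not"]

def one_hot(word, vocab):
    """One-hot encode a word over a fixed vocabulary."""
    return [1 if w == word else 0 for w in vocab]

print(one_hot("cat", vocab))  # [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
```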
Suppose we have a sequence of words:
W_1^n = (w_0, w_1, ..., w_n)
We want to estimate the likelihood of:
P(w_n | w_0, w_1, ..., w_{n-1})

DeepWalk generalizes language modeling to explore the graph through a stream of short random walks. Suppose we have a sequence of visited vertices:
W_1^n = (v_0, v_1, ..., v_n)
We want to estimate the likelihood of:
P(v_n | v_0, v_1, ..., v_{n-1})

The goal is to learn a latent representation, not only a probability distribution of node co-occurrences. Thus, the authors introduced the mapping function:
Phi: v in V → R^{|V| x d}
which represents the latent social representation associated with each vertex v in the graph.

Thus, DeepWalk estimates the likelihood:
P(Phi(v_i) | Phi(v_0), Phi(v_1), ..., Phi(v_{i-1}))
The goal is to estimate the likelihood of observing node v_i given all the previous nodes visited so far in the random walk.

Node2Vec (Grover & Leskovec, 2016)

Node2Vec is one of the first deep learning attempts to learn embeddings from graph data. Like DeepWalk, Node2Vec utilizes walks to learn graph embeddings. Compared to DeepWalk, Node2Vec incorporates a search bias variable alpha, parameterized by p and q, which allows it to interpolate between BFS and DFS.

Formally, given a source node u, simulate a random walk of fixed length l. Let c_i denote the i-th node in the walk, starting with c_0 = u.
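A minimal sketch of the short random walks DeepWalk feeds to its language model, assuming (as in the original algorithm) that each next vertex is drawn uniformly from the current vertex's neighbors; the graph here is a hypothetical 5-vertex example:

```python
import random

# Hypothetical undirected graph, stored as an adjacency list.
adj = {"A": ["C", "D"], "B": ["D", "E"], "C": ["A", "E"],
       "D": ["A", "B"], "E": ["B", "C"]}

def random_walk(adj, start, length, rng=random):
    """Uniform random walk of `length` vertices starting at `start`."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(adj[walk[-1]]))
    return walk

walk = random_walk(adj, "A", 5)
print(walk)  # e.g. ['A', 'C', 'E', ...] -- sequences like this play the role of sentences
```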
Nodes c_i are generated by the following distribution:
P(c_i = x | c_{i-1} = v) = pi_vx / Z if (v, x) in E, and 0 otherwise,
where pi_vx is the transition probability between v and x and Z is the normalizing constant.

Suppose the walk has just traversed the edge (t, v) and now resides at node v. To decide on the next step, the walk evaluates the transition probability pi_vx on the edges (v, x) leading from v:
pi_vx = alpha_pq(t, x) * w_vx
where w_vx is the weight of the edge going from v to x.

Search bias:
alpha_pq(t, x) = 1/p if d_tx = 0
alpha_pq(t, x) = 1   if d_tx = 1
alpha_pq(t, x) = 1/q if d_tx = 2
where d_tx denotes the shortest-path distance between nodes t and x.

BFS is ideal for learning local neighborhoods: the neighborhood N_S is restricted to nodes which are immediate neighbors of the source.

DFS is better for learning the global structure of the graph: the neighborhood consists of nodes sequentially sampled at increasing distances from the source node.

Applications

Network Compression
Clustering
Link Prediction
Node Classification
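The Node2Vec transition rule described in this section can be sketched as follows (an illustration, not the reference implementation; the neighborhood names t, x1, x2 are hypothetical):

```python
def alpha(p, q, d_tx):
    """Search bias alpha_pq(t, x) as a function of the shortest-path
    distance d_tx between the previous node t and candidate node x."""
    if d_tx == 0:       # returning to the previous node t
        return 1.0 / p
    if d_tx == 1:       # a node also adjacent to t (local, BFS-like step)
        return 1.0
    return 1.0 / q      # d_tx == 2: moving further away (DFS-like step)

def transition_probs(p, q, weights, dists):
    """Unnormalized pi_vx = alpha_pq(t, x) * w_vx, then normalized by Z."""
    pi = {x: alpha(p, q, dists[x]) * w for x, w in weights.items()}
    z = sum(pi.values())
    return {x: val / z for x, val in pi.items()}

# Hypothetical neighborhood of v after traversing edge (t, v):
# unit edge weights; x1 is adjacent to t (d = 1), x2 is not (d = 2).
probs = transition_probs(p=1, q=2,
                         weights={"t": 1, "x1": 1, "x2": 1},
                         dists={"t": 0, "x1": 1, "x2": 2})
print(probs)  # with q > 1, x1 is favoured over the more distant x2
```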