Lecture #6.1 - Data Processing - Apache Spark Graph API.pdf
IE University
MODERN DATA ARCHITECTURES FOR BIG DATA II
APACHE SPARK GRAPH API

AGENDA
1. Graph Concepts
2. GraphX and GraphFrames
3. GraphFrames API
4. Additional References

TIME TO TURN OSBDET ON!
We'll use the course environment by the end of the lesson.

1. GRAPH CONCEPTS

DISCLAIMER
This session is an intro to Graph Processing with Spark. It can serve as a foundation for exploring the topic further. I'll share some interesting resources if you want to do so.

BUILDING ON THE FOUNDATIONS
Spark GraphX/GraphFrames are built on the core APIs we have already studied.

GRAPHS ARE DATA STRUCTURES
Graphs are advanced data structures made up of:
- Nodes/Vertices - the main entities represented (ex. bike stations, airports,...)
- Edges - the relationships between entities (ex. trips, routes,...)
Nodes & Edges can have attributes to describe them better.
Graphs fall into two classes based on edge navigability:
- Undirected graphs - the direction in which edges are traversed is not relevant
- Directed graphs - edges have a direction, so A to B is different from B to A

MULTI-TYPE ENTITY GRAPHS
Graphs with different types of entities and/or edges.
Knowledge graph - an example of a multi-type entity graph.

GRAPH ANALYTICS
Data analysis of the relationships in a graph or network.
Typical graph analytics use cases*:
- Social media & social network graphs
- Recommendation engines
- Fraud detection
- IT infrastructure monitoring
* Graph Database Use Cases: https://neo4j.com/use-cases/

SOCIAL NETWORK ANALYSIS
Key elements of this type of graph:
- Nodes - people
- Edges - friendship
- Communities - sets of people related/connected to each other
- Key members/influencers - individuals connecting communities

WEB SITE ANALYTICS
Key elements of this type of graph:
- Nodes - sites
- Edges - links between sites
- Site relevance - insights produced by the PageRank algorithm*
* More information about this algorithm, created by Google, in "PageRank algorithm, fully explained".

2. GRAPHX AND GRAPHFRAMES

GRAPHX*
Component in Spark for graphs & graph-parallel computation.
Built on top of RDDs, a low-level API, which we don't like.
Graph abstraction - a directed multigraph with:
- Properties for vertices & edges
- A set of fundamental operators (ex. subgraph, joinVertices,...)
- A collection of graph algorithms for graph analytics
* More details in the GraphX Programming Guide.

GRAPHFRAMES*
Apache Spark package** providing DataFrame-based graphs.
Built on top of DataFrames, a high-level API, which we love.
Provides & extends GraphX functionality:
- Motif finding for pattern search within graphs
- Additional graph processing algorithms beyond those in GraphX
* More details in the GraphFrames User Guide.
** A PySpark version of the API can be found in "Welcome to the GraphFrames Python API docs!".

3. GRAPHFRAMES API

EXPLORE THE API IN JUPYTER NOTEBOOK
Jump to OSBDET and explore the GraphFrames API.

4. ADDITIONAL REFERENCES

GRAPH ALGORITHMS
Graph Analytics Book: Reveal Hidden Patterns in Data and Enhance Machine Learning Predictions using Apache Spark & Neo4j

GRAPH POWERED MACHINE LEARNING
Graph Powered Machine Learning Book: a free book which is a practical guide to using graphs in machine learning applications. It walks you through all the stages needed to build complete solutions where graphs play a key role.

CONGRATS, WE'RE DONE!
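
RECAP SKETCH: A DIRECTED GRAPH AND PAGERANK IN PLAIN PYTHON
To tie the graph concepts and the PageRank idea from this lesson together, here is a minimal sketch in plain Python (no Spark involved). It is not how GraphX or GraphFrames implement anything; it only illustrates nodes, directed edges, in-degrees and a basic power-iteration PageRank. The site names, the links, the damping factor of 0.85 and the iteration count are all assumptions made up for the example.

```python
# Hypothetical toy data: sites are nodes, links are directed edges (src, dst).
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")]


def out_neighbors(nodes, edges):
    """Adjacency list of a directed graph: node -> list of link targets."""
    adj = {n: [] for n in nodes}
    for src, dst in edges:
        adj[src].append(dst)
    return adj


def in_degrees(nodes, edges):
    """Number of incoming edges per node (edge direction matters here)."""
    deg = {n: 0 for n in nodes}
    for _, dst in edges:
        deg[dst] += 1
    return deg


def pagerank(nodes, edges, damping=0.85, iterations=50):
    """Basic PageRank by power iteration.

    damping and iterations are assumed defaults for this sketch.
    Nodes without outgoing links spread their rank evenly over all nodes.
    """
    adj = out_neighbors(nodes, edges)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[node] for node in nodes if not adj[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src in nodes:
            for dst in adj[src]:
                # Each node shares its rank equally among its outgoing links.
                new_rank[dst] += damping * rank[src] / len(adj[src])
        rank = new_rank
    return rank


ranks = pagerank(nodes, edges)
for site, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{site}: {score:.3f}")
```

Swapping the direction of an edge changes both the in-degrees and the ranking, which is the practical meaning of "A to B is different from B to A". In the notebook, the same kind of question would be asked of a GraphFrames graph built from a vertices DataFrame and an edges DataFrame instead of Python lists.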