Spark Lecture 6 PDF
Document Details
EURECOM
Raja Appuswamy
Summary
These lecture notes cover Apache Spark, a flexible, in-memory data processing framework written in Scala. Starting from the limitations of MapReduce for iterative workloads, they explore Spark's core concepts: its distributed architecture, RDDs and their execution model, the structured (DataFrame) APIs, and the Spark SQL engine.
Full Transcript
Spark Lecture 6
Raja Appuswamy (Eurecom), Cloud Computing and Distributed Systems

Recap
- MapReduce, introduced by Google: a simple programming model for building distributed applications that process vast amounts of data, and a runtime for executing jobs on large clusters in a reliable, fault-tolerant manner.
- Hadoop made MapReduce broadly available; HDFS became the central data repository, and MapReduce became the de facto standard for batch processing.
- Stonebraker & DeWitt: "MapReduce: a major step backwards."

New applications, new workloads
- Iterative computations. Example: more and more people aiming to get insights from data; Apache Mahout became a popular framework for ML over Hadoop.
- How would we implement traditional K-means with MapReduce?

K-means MapReduce algorithm (a Python sketch of one iteration follows the five-minute-rule discussion below)
- Configure: a single file containing the cluster centers.
- Mapper. Input: input data points. Compute: the distance of a point from each centroid to identify its cluster. Output: (cluster id, data id).
- Reducer. Input: (cluster id, data id). Compute: the new cluster centroid based on the data points assigned. Output: (cluster id, cluster centroid).
- Driver. Each iteration produces new cluster centroids; run multiple iteration jobs (mapper + reducer) until convergence. High overhead, leading to poor performance.

MapReduce & iterative computations
- MapReduce is built for batch processing and is entirely disk based: input and output sit on HDFS.
- One K-means iteration: HDFS read, Map (assign each sample to its closest centroid), network shuffle, Reduce (compute new centroids), HDFS write.
- Each iteration reads and writes data from disk-based HDFS. To understand why this is bad, consider the memory hierarchy.

Understanding the memory hierarchy: how far away is the data? (Jim Gray's analogy)

| Clock ticks | Storage level  | As far away as | Time analogy |
|------------:|----------------|----------------|--------------|
| 10^9        | Tape           | Andromeda      | 2,000 years  |
| 10^6        | Disk           | Pluto          | 2 years      |
| 100         | Memory         | St. Tropez     | 1.5 hr       |
| 10          | On-board cache | This building  | 10 min       |
| 1           | Registers      | This room      | 1 min        |

1980s: the database administrator's dilemma
- How do I improve the performance of my DB server? Should I cache this data in memory, or store it on disk?
- Rule of thumb: if data is accessed more than once within 5 minutes, cache it in memory.

Tandem Computers: price/performance
- Tandem disk: $15k for 180 MB, at 15 accesses/second, i.e. about $1k per access/second.
- Tandem CPU + supporting hardware: $15,000, another $1k per access/second.
- Total cost of accessing data from disk: about $2k per access/second.
- Memory cost: $5k for 1 MB, i.e. $5/KB.

The five-minute rule
- Cost of accessing data from disk: $2k per access/second; memory cost: $5/KB.
- If we keep 1 KB in memory and it is accessed once per second, we save $2k of disk I/O by paying $5 for memory.
- At 1 access every 10 seconds (0.1 access/sec), we save $200 of disk I/O by paying $5 for memory, and so on.
- Break-even point: 1 access every 400 seconds, and 400 seconds is about 5 minutes.

The five-minute rule: then and now
- For a 4 KB page, the RAM-HDD break-even interval was 5 minutes in 1987; it is about 5 hours now, roughly 60x higher, due to the drop in DRAM price.
- Take-away: never, ever go to disk!
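Returning to the K-means MapReduce algorithm above: a minimal sketch, in plain Python rather than any particular Hadoop API, of what one iteration's mapper and reducer compute. One liberty taken: the mapper emits the data point itself rather than just its id, so the reducer can average without a second lookup; all names and structures are illustrative.

```python
import math

def closest_centroid(point, centroids):
    """Return the id of the centroid nearest to `point` (Euclidean distance).

    `centroids` is a dict mapping cluster id -> centroid coordinates.
    """
    return min(centroids, key=lambda cid: math.dist(point, centroids[cid]))

def kmeans_map(point, centroids):
    """Mapper: assign a data point to its closest centroid."""
    yield closest_centroid(point, centroids), point

def kmeans_reduce(cluster_id, points):
    """Reducer: the new centroid is the mean of the points assigned to the cluster."""
    n, dims = len(points), len(points[0])
    yield cluster_id, tuple(sum(p[d] for p in points) / n for d in range(dims))

# Driver (conceptual): run one such MapReduce job per iteration until the
# centroids converge. Each iteration reads the points from HDFS and writes
# the new centroids back, which is exactly the disk round trip criticized above.
```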
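Back to the five-minute rule: a quick check of the break-even arithmetic above, using the lecture's numbers.

```python
disk_cost = 2_000.0   # $ per (access/second) of disk throughput
ram_cost = 5.0        # $ per KB of memory

# Caching a 1 KB page accessed once every T seconds frees disk_cost / T
# dollars of disk throughput; it pays off when disk_cost / T >= ram_cost.
break_even = disk_cost / ram_cost
print(f"{break_even:.0f} seconds")   # 400 seconds, about 5 minutes
```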
See the "Five-Minute Rule" CACM paper for more details: https://cacm.acm.org/magazines/2019/11/240388-the-five-minute-rule-30-years-later-and-its-impact-on-the-storage-hierarchy/fulltext

MapReduce/Hadoop and the memory hierarchy
- Hadoop is misaligned with the five-minute rule: all data is stored on disk, and data is not cached in memory even if the workload would fit.
- This makes Hadoop unfit for new classes of workloads: interactive and iterative applications are bottlenecked by disk.
- MapReduce was also too simple a computational model: algorithm design with just map and reduce functions is non-trivial.

Hadoop: a fractured ecosystem
- Specialized systems emerged with no unified vision: diverse APIs, sparse modules, high operational costs.
- The MapReduce runtime was replaced by more optimized ones.

Lighting a Spark
- Spark: a flexible, in-memory data processing framework written in Scala.
- Central ideas:
  - Exploit memory by caching data, to enable fast data sharing.
  - Generalize the two-stage computational model of MapReduce to a Directed Acyclic Graph (DAG) based model that can support a richer API.

Spark distributed architecture
- Spark application: a user program built on Spark; contains the Spark driver.
- Spark driver: transforms all the Spark operations into DAG computations; communicates with the cluster manager and requests resources (CPU, memory, etc.) for the Spark executors; schedules computations; instantiates the SparkSession.
- SparkSession: a unified conduit to all Spark operations and data.
- Cluster manager: responsible for managing and allocating resources. Four are supported: Standalone, YARN, Mesos, and Kubernetes.
- Executor: responsible for executing tasks. Usually one per node, but this depends on the deployment mode.

RDD: the need for a new abstraction
- We need an efficient way to share data stored in memory.
- The traditional answer is a distributed shared memory (DSM) abstraction: general purpose, it extends single-node shared memory to a cluster, applications can make fine-grained updates to any data in memory, and it can be used to build very efficient applications.
- Problem: fault tolerance. DSM must replicate data across nodes or log updates, which is 10-100x slower than a memory write, and too expensive for data-intensive apps.
- Goal: an in-memory abstraction that provides both fault tolerance and efficiency.

Resilient Distributed Dataset
- RDD (Resilient Distributed Dataset): a restricted form of DSM. An immutable, partitioned collection of objects that can only be built through coarse-grained deterministic transformations.
- RDDs are data structures that either point to a direct data source (e.g. HDFS) or apply transformations to their parent RDD(s) to generate new data elements.
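A minimal PySpark sketch of those two kinds of RDDs, one backed directly by a data source and one derived from a parent; the application name and the HDFS path are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext   # the entry point for the RDD API

# An RDD that points at a direct data source (hypothetical path):
lines = sc.textFile("hdfs:///data/people.txt")

# An RDD built by transforming its parent RDD:
upper = lines.map(lambda line: line.upper())
```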
RDD: an abstraction that encapsulates three things
- Dependencies: instruct Spark how an RDD is constructed. To reproduce results, Spark can recreate an RDD from its dependencies and replay the operations on it; this is the source of resiliency.
- Partitions: split the work so computation can be parallelized across executors, exploiting data locality.
- Compute function: Partition -> Iterator[T], a function that produces an iterator over the data stored in the RDD.

RDD: example
- Query: find the average age for each name. Aggregate all the ages for each name: group by name, then average the ages. (A combined sketch of this query appears after the partition-views slide below.)

RDD transformations
- A set of operations that define how to transform an RDD. Examples: map(), filter(), select(), join(), orderBy(), ...
- As in relational algebra, applying a transformation to an RDD yields a new RDD; RDDs are immutable.
- Transformations are lazily evaluated: the computation that performs the transformation is not executed immediately.

RDD actions
- Actions trigger computation of the chain of transformations. Examples: show(), take(), count(), collect(), ...
- Some actions store data to an external data source (e.g. HDFS); others fetch data from the RDD (and its transformation chain) and convey it to the driver:
  - count(): return the number of elements.
  - take(n): return an array of the first n elements.
  - collect(): return an array of all elements.

RDD & lazy execution demo: https://mediaserver.eurecom.fr/permalink/v12641880f54fqht30dc/iframe/#start=4615

Spark execution
- The Spark driver converts your Spark application into one or more Spark jobs and transforms each job into a DAG, the execution plan.
- Each job is broken into stages. Stages are created based on which operations can be performed in parallel, and they dictate data transfer among the Spark executors.
- Stages are composed of tasks. A task is a unit of execution: it maps to a single core and to one partition of data.

Spark DAG execution: an example
- Goal: find the number of distinct names per first letter.

Spark execution, first two steps
1. Create a DAG of RDDs to represent the computation.
2. Create a logical execution plan for the DAG: split the DAG into "stages" based on dependencies, pipelining as much as possible.

RDD: data set vs partition views
- Much like in Hadoop MapReduce, each RDD is stored physically across multiple nodes as input partitions.
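A minimal PySpark sketch of the running query (average age per name), assuming the sc from the earlier sketch and illustrative data. It pulls together lazy transformations and actions; note in passing that mapValues keeps data in place while reduceByKey shuffles it by key, the narrow/wide distinction discussed next.

```python
people = sc.parallelize([("Alice", 30), ("Bob", 25), ("Alice", 40)])

# Transformations: lazy, nothing executes yet.
sums = (people
        .mapValues(lambda age: (age, 1))                        # pair each age with a count
        .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])))  # sum ages and counts per name
avg_age = sums.mapValues(lambda s: s[0] / s[1])

# Action: triggers execution of the whole transformation chain.
print(avg_age.collect())   # [('Alice', 35.0), ('Bob', 25.0)] (order may vary)
```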
A word about dependencies (1)
- Dependencies determine the need to shuffle data. There are two types: narrow and wide.
- Narrow dependencies: each partition of the parent RDD is used by at most one partition of the child RDD. The task can be executed locally, with no shuffle (e.g. map, flatMap, filter, sample).

A word about dependencies (2)
- Wide dependencies: multiple child partitions may depend on one partition of the parent RDD, so data has to be shuffled (e.g. sortByKey, reduceByKey, groupByKey, cogroup, join, cartesian).

How does Spark execute this job?
1. Create a DAG of RDDs to represent the computation.
2. Create a logical execution plan for the DAG: pipeline as much as possible, and split the DAG into "stages" based on the need to shuffle data.
3. Split each stage into tasks (task = data + computation) and execute all tasks stage by stage:
   - In this example, all tasks from stage 1 are executed together first.
   - After stage 1, a pull-based shuffle occurs: intermediates are written to files and pulled by the next stage.
   - Then the tasks from stage 2 are executed, with operators pipelined within each task.

Putting it all together.

RDD to Structured API
- Spark can support interactive workloads, but working with RDDs is procedural.
- SQL as a high-level programming language offers expressiveness and succinctness, enables compatibility with existing tools (e.g. Business Intelligence via JDBC), and there is a large pool of engineers proficient in SQL.
- Example: SELECT name, avg(age) FROM people GROUP BY name

DataFrame: schema
- The general idea is borrowed from Python Pandas: tabular data with an API.
- Schema to the rescue: a distributed collection of rows organized into named, typed columns. Basic types and structured/complex types are supported.
- The schema defines the column names and their associated types. There are three ways to get a schema definition: (i) define it with a struct type, (ii) define it with DDL, (iii) infer it automatically. (Demo: https://mediaserver.eurecom.fr/permalink/v12641880f54fqht30dc/iframe/#start=6211)
- Columns and Rows are objects with APIs.

APIs: DataSource API
- Enables you to read/write data from/to a DataFrame from a myriad of data sources, in formats such as JSON, CSV, Parquet, Text, Avro, ORC, etc.
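A minimal sketch combining DataSource API reads with the three ways to supply a schema, ending with the SQL query from the slide. The spark session comes from the earlier sketch; people.csv and its columns are illustrative.

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# (i) Programmatic schema, with a struct type:
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age",  IntegerType(), True),
])
df = spark.read.schema(schema).csv("people.csv")

# (ii) Schema as a DDL string:
df2 = spark.read.schema("name STRING, age INT").csv("people.csv")

# (iii) Auto-infer the schema (costs an extra pass over the data):
df3 = spark.read.option("inferSchema", True).csv("people.csv")

# The slide's query, run through a temporary view:
df.createOrReplaceTempView("people")
spark.sql("SELECT name, avg(age) FROM people GROUP BY name").show()
```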
Transformations and actions on DataFrames
- Relational projection: the select() method.
- Relational selection: the filter() or where() method.
- Aggregations: groupBy, orderBy, count, ...
- Descriptive stats: min, max, sum, avg.
- Demo (time taken to infer the schema vs predeclaring it; transformations and actions): https://mediaserver.eurecom.fr/permalink/v12641880f54fqht30dc/iframe/#start=6495

RDD vs DataFrame
- Use RDDs when you want to instruct Spark precisely how to execute a query, and can forgo the code optimization, efficient space utilization, and performance benefits available with DataFrames and Datasets.
- You can always get the RDD underlying a DataFrame: df.rdd.
- Basically: save yourself some time and use DataFrames.

SparkSQL engine
- The substrate on which the structured APIs are built. Core components: Catalyst and Tungsten.

Catalyst optimizer
- Reminiscent of traditional database systems.
- Analysis: SQL to plan, via an abstract syntax tree.
- Logical and physical optimization: use cost-based optimization to pick the optimal plan.

Catalyst example (worked through in the demo below).

Tungsten & code generation
- Takes the optimized physical plan and performs whole-stage code generation: it collapses the whole query into a single function, getting rid of virtual function calls and keeping intermediate data in CPU registers.
- Demo of Catalyst and Tungsten: https://mediaserver.eurecom.fr/permalink/v12641880f54fqht30dc/iframe/#start=7410

Spark & caching
- DataFrame.cache(): store as many of the partitions read as memory allows, in memory across the Spark executors.
- DataFrame.persist(StorageLevel): control how your data is cached.
- unpersist(): remove any cached data.
- cache() and persist() are hints: a DataFrame is not fully cached until you invoke an action.
- Demo of caching performance: https://mediaserver.eurecom.fr/permalink/v12641880f54fqht30dc/iframe/#start=8042

Spark & RDBMS: summary
- Spark: a unified analytics engine. It quickly adopted RDBMS concepts to optimize SQL analytics, and other libraries were developed for machine learning (MLlib), graph analytics (GraphX), and more; the RDD is the underlying abstraction that supports all of these libraries.
- DBMSs have also evolved: from disk-based to in-memory to NVM, and from one-size-fits-all "OldSQL" DBMSs to customized "NewSQL" engines: column stores for business intelligence, highly parallel transaction engines for OLTP, array databases for scientific applications, ...
- NewSQL is still king for structured data management.
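To close the loop on the structured API, a minimal sketch (reusing the hypothetical df from the schema example) of the DataFrame operations listed above, plus explain(), which prints the plans Catalyst produces; look for WholeStageCodegen in the physical plan to see Tungsten's code generation at work.

```python
from pyspark.sql import functions as F

# Projection, selection, and aggregation:
adults = (df.select("name", "age")
            .where(F.col("age") > 18)
            .groupBy("name")
            .agg(F.avg("age").alias("avg_age")))

adults.explain(extended=True)   # parsed, analyzed, optimized, and physical plans
adults.show()                   # action: triggers execution
```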
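And a sketch of the caching calls, again on the hypothetical df; remember that nothing is materialized until an action runs.

```python
from pyspark import StorageLevel

df.cache()        # hint: keep the partitions in executor memory
df.count()        # the action that actually materializes the cache
df.unpersist()    # drop the cached data

# persist() gives explicit control over placement and format:
df.persist(StorageLevel.DISK_ONLY)
df.count()
df.unpersist()
```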