Chapter 3 & 4: Hadoop
Document Details
Mutah University, 2014
Summary
This presentation covers chapters 3 and 4 on Hadoop. It reviews the motivation for Hadoop and the shortcomings of traditional large-scale computing systems, explains the concept of distributed computing, and introduces Apache Hadoop as a software framework for storing, processing, and analyzing big data.
Full Transcript
Chapter 3: Apache Hadoop – Introduction

Course Objectives
· Why is Hadoop needed?
· Learn the concepts of the Hadoop Distributed File System and MapReduce
· Hadoop-able problems
· Core Hadoop technologies and the Hadoop ecosystem
· Developing MapReduce applications
· Common MapReduce algorithms
· Using Hive and Pig for rapid application development

What is Apache Hadoop?
· A software framework for storing, processing, and analyzing "big data"
– Distributed
– Scalable
– Fault-tolerant
– Open source

Facts About Apache Hadoop
· Open source
· Around 60 committers from roughly 10 companies
– Cloudera, Yahoo!, Facebook, Apple, and more
· Hundreds of contributors writing features and fixing bugs
· Many related projects, applications, and tools

A Large (and Growing) Ecosystem
[Figure: ecosystem project logos, including Pig, ZooKeeper, and Impala]

Who Uses Hadoop?
[Figure: logos of organizations using Hadoop]

The Motivation For Hadoop
· What problems exist with traditional large-scale computing systems?
· How does Hadoop address those challenges?
· What kinds of data processing and analysis does Hadoop do best?

Traditional Large-Scale Computation
· Traditionally, computation has been processor-bound
– Relatively small amounts of data
– Lots of complex processing
· The early solution: bigger computers
– Faster processors, more memory
– But even this couldn't keep up

Distributed Systems
· The better solution: more computers
– Distributed systems evolved: use multiple machines for a single job
– MPI (Message Passing Interface), for example
· "In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, we didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers." – Grace Hopper

Distributed Systems: Problems
· Programming for traditional distributed systems is complex
– Data exchange requires synchronization
– Finite bandwidth is available
– Temporal dependencies are complicated
– It is difficult to deal with partial failures of the system
· Ken Arnold, CORBA designer:
– "Failure is the defining difference between distributed and local programming, so you have to design distributed systems with the expectation of failure"
– Developers spend more time designing for failure than they do actually working on the problem itself
· CORBA: Common Object Request Broker Architecture
[Marginal note, translated from Arabic: "there is a problem"]

Distributed Systems: Challenges
· Challenges with distributed systems
– Programming complexity
– Keeping data and processes in sync
– Finite bandwidth
– Partial failures
1"18 Distributed Systems: The Data Bo ttleneck ·Moore’s Law has held firm for over 40 years – Processing power doubles every two years – Processing speed is no longer the problem ·Getting the data to the processors becomes the bottleneck ·Quick calculation – Typical disk data transfer rate: 75MB/sec – Time taken to transfer 100GB of data to the processor: approx 22 minutes! – Assuming sustained reads – Actual time will be worse, since most servers have less than 100GB of RAM available ·A new approach is needed مهم © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"19 Distributed Systems: The Data Bo ttleneck (cont’d) ·Traditionally, data is stored in a central location ·Data is copied to processors at runtime ·Fine for limited amounts of data © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"20 Distributed Systems: The Data Bo ttleneck (cont’d) ·Modern systems have much more data – terabytes+ per day – petabytes+ total ·We need a new approach... © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"21 Chapter Topics The Motivation for Hadoop ·Problems with Traditional Large/Scale Systems ·Requirements for a New Approach ·Hadoop! ·Hadoop/able Problems © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"22 Partial Failure Support ! The system must support par tial failure – Failure of a component should result in a graceful degradation of application performance – Not complete failure of the en tire system © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"23 Data Recoverability ! If a component of the system fails, its workload should be assumed by still- functioning units in the system – Failure should not result in the loss of any data © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1-24 Component Recovery !If a component of the system fails and then recovers, it should be able to rejoin the system – Without requiring a full restart of the en tire system © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"25 Consistency ! Component failures during execu tion of a job should not affect the outcome of the job © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"26 Scalability ·Adding load to the system should result in a graceful decline in performance of individual jobs – Not failure of the system ·Increasing resources should support a proportional increase in load capacity © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"27 Chapter Topics The Motivation for Hadoop ·Problems with Traditional Large/Scale Systems ·Requirements for a New Approach ·Hadoop! ·Hadoop/able Problems © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 
1"28 Hadoop’s History ·Hadoop is based on work done by Google in the late 1990s/early 2000s – Specifically, on papers describing the Google File System (GFS) published in 2003, and MapReduce published in 2004 ·This work takes a radical new approach to the problem of distributed computing – Meets all the requirements we have for reliability and scalability ·Core concept: distribute the data as it is ini tially stored in the system – Individual nodes can work on data local to those nodes – No data transfer over the network is required for ini tial processing © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"29 Core Hadoop Concepts ·Applications are written in high-level code – Developers need not worry about network programming, temporal dependencies or low/level infrastructure ·Nodes talk to each other as little as possible – Developers should not write code which communicates between nodes – ‘Shared nothing’ architecture ·Data is spread among machines in advance – Computation happens where the data is stored, wherever possible – Data is replicated multiple times on the system for increased availability and reliability © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1-30 Hadoop: Very High/Level Overview ·When data is loaded into the system, it is split into ‘blocks’ – Typically 64MB or 128MB ·Map tasks (the first part of the MapReduce system) work on relatively small portions of data – Typically a single block ·A master program allocates work to nodes such that a Map task will work on a block of data stored locally on that node whenever possible – Many nodes work in parallel, each on their own part of the overall dataset © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"31 Fault Tolerance ·If a node fails, the master will detect that failure and re - assign the work to a different node on the system ·Restarting a task does not require communica tion with nodes working on other portions of the data ·If a failed node restarts, it is automa tically added back to the system and assigned new tasks ·If a node appears to be running slowly, the master can redundantly execute another instance of the same task – Results from the first to finish will be used – Known as ‘speculative execution’ © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior written consent. 1-32 Hadoop ! A radical new approach to distributed compu ting – Distribute data when the data is being stored – Run computation where the data is stored © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"33 Hadoop: Very High/Level Overview Data is split into “blocks” when loaded Map tasks typically work on a single block A master program manages tasks Slave Nodes Master Lorem ipsum dolor s Lorem ipsum dolor sit amet, consectetur se amet, consectetur sed adipisicing elit, ado l eiusmod tempor etma adipisicing elit, ado lei incididunt ut libore tua eiusmod tempor etma dolore magna alli quio incididunt ut libore tua dolore magna alli quio ut enim ad minim veni ut enim ad minim veni veniam, quis nostruda veniam, quis nostruda exercitation ul laco es sed exercitation ul laco es laboris nisi ut eres aliquip sed laboris nisi ut eres ex eaco modai consequat. aliquip ex eaco modai Duis hona consequat. 
Core Hadoop Concepts
· Applications are written in high-level code
· Nodes talk to each other as little as possible
· Data is distributed in advance
– Bring the computation to the data
· Data is replicated for increased availability and reliability
· Hadoop is scalable and fault-tolerant

Scalability
· Adding nodes adds capacity proportionally
· Increasing load results in a graceful decline in performance
– Not failure of the system
[Figure: capacity grows linearly with the number of nodes]

Fault Tolerance
· Node failure is inevitable
· What happens?
– The system continues to function
– The master re-assigns tasks to a different node
– Data replication = no loss of data
– Nodes which recover rejoin the cluster automatically
· "Failure is the defining difference between distributed and local programming, so you have to design distributed systems with the expectation of failure." – Ken Arnold (CORBA designer)

Chapter Topics: The Motivation for Hadoop (now: Hadoop-able Problems)

What Can You Do With Hadoop?
[Figure: Hadoop's data storage and data processing serve as the foundation for business intelligence and advanced analytics applications, which in turn drive operational efficiency, innovation, and advantage]

Where Does Data Come From?
· Science
– Medical imaging, sensor data, genome sequencing, weather data, satellite feeds, etc.
· Industry
– Financial, pharmaceutical, manufacturing, insurance, online, energy, retail data
· Legacy
– Sales data, customer behavior, product databases, accounting data, etc.
· System data
– Log files, health and status feeds, activity streams, network messages, web analytics, intrusion detection, spam filters

Common Types of Analysis with Hadoop
· Text mining
· Index building
· Graph creation and analysis
· Pattern recognition
· Collaborative filtering
· Prediction models
· Sentiment analysis
· Risk assessment

What is Common Across Hadoop-able Problems?
· Nature of the data
– Volume
– Velocity
– Variety
· Nature of the analysis
– Batch processing
– Parallel execution
– Distributed data
1"42 Use Case: Opower ·Opower – SaaS for utility companies – Provides insights to customers about their energy usage ·The data – Huge amounts of data – Many different sources ·The analysis, e.g. – Similar homes comparison – Heating and cooling usage – Bill forecasting SaaS: Sotware as a Service © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"43 Benefits of Analyzing with Hadoop ·Previously impossible or impractical analysis ·Lower cost ·Less time ·Greater flexibility ·Near-linear scalability ·Ask Bigger Questions © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1-44 Hadoop: Basic Concepts ·What is Hadoop? ·What features does the Hadoop Distributed File System (HDFS) provide? ·What are the concepts behind MapReduce? ·How does a Hadoop cluster operate? © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"48 The Hadoop Project ·Hadoop is an open-source project overseen by the Apache Software Foundation ·Originally based on papers published by Google in 2003 and 2004 ·Hadoop committers work at several different organizations – Including Cloudera, Yahoo!, Facebook, LinkedIn © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1-50 Hadoop Components ·Hadoop consists of two core components – The Hadoop Distributed File System (HDFS) – MapReduce ·There are many other projects based around core Hadoop – Often referred to as the ‘Hadoop Ecosystem’ – Pig, Hive, HBase, Flume, Oozie, Sqoop, etc – Many are discussed later in the course ·A set of machines running HDFS and MapReduce is known as a Hadoop Cluster – Individual machines are known as nodes – A cluster can have as few as one node, as many as several thousand – More nodes = better performance! © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"51 Hadoop Components (cont’d) Sqoop Hive Pig Mahout Hadoop Ecosyste HBas Flume... m e Oozie CD H MapRedu Hadoop ce Core Componen Hadoop Distributed File ts System Note: CDH is Cloudera’s open source Apache Hadoop distribu tion. © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"52 Core Components: HDFS and MapReduce ·HDFS (Hadoop Distributed File System) – Stores data on the cluster ·MapReduce – Processes data on the cluster MapRedu Hadoop ce Core Componen Hadoop Distributed File ts System © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"53 Hadoop Components: HDFS ·HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster ·Data is split into blocks and distributed across multiple nodes in the cluster – Each block is typically 64MB or 128MB in size ·Each block is replicated multiple times – Default is to replicate each block three times – Replicas are stored on different nodes – This ensures both reliability and availability © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 
1"54 Hadoop Components: MapReduce ·MapReduce is the system used to process data in the Hadoop cluster ·Consists of two phases: Map, and then Reduce – Between the two is a stage known as the shuffle and sort ·Each Map task operates on a discrete portion of the overall dataset – Typically one HDFS block of data ·After all Maps are complete, the MapReduce system distributes the intermediate data to nodes which perform the Reduce phase – Much more on this later! © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"55 Chapter Topics Hadoop Basic Concepts ·What is Hadoop? ·The Hadoop Distributed File System (HDFS) ·How MapReduce Works © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"56 HDFS Basic Concepts ·HDFS is a filesystem written in Java – Based on Google’s GFS ·Sits on top of a native filesystem – Such as ext3, ext4 or xfs ·Provides redundant storage for massive amounts of data – Using readily/available, industry/standard computers HDF S Native OS filesystem Disk Storage © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"57 How Files Are Stored in HDFS ·Files are split into blocks – Each block is usually 64MB or 128MB ·Data is distributed across many machines at load time – Different blocks from the same file will be stored on different machines – This provides for efficient MapReduce processing (see later) ·Blocks are replicated across multiple machines, known as DataNodes – Default replication is three/fold – Meaning that each block exists on three different machines ·A master node called the NameNode keeps track of which blocks make up a file, and where those blocks are located – Known as the metadata © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"58 How Files are Stored (1) ! Data files are split into blocks and distributed to data nodes Block 1 Very Large Block Data 2 File Block 3 © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"59 How Files are Stored (2) ! Data files are split into blocks and distributed to data nodes Block 1 Block 1 Block 1 Very Large Block Data 2 File Block 3 Block 1 © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"60 How Files are Stored (3) ! Data files are split into blocks and distributed to data nodes Each block is ! replicated on multiple nodes (default 3x) Block 1 Block 3 Block 1 Block 1 Very Block 2 Large Block Block 2 Data 2 Block 3 File Block Block 2 3 Block 1 Block 3 © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"61 How Files are Stored (4) ·Data files are split into blocks and distributed to data nodes ·Each block is replicated on multiple nodes (default 3x) ·NameNode stores metadata Block Nam 1 Block e 3 Nod Block 1 Block e 1 Metadata Very Block 2 : Large Block Block 2 informati Data 2 Block 3 on File about Block Block files 2 3 and Block blocks 1 Block 3 © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 
1"62 HDFS: Points To Note ·Although files are split into 64MB or 128MB blocks, if a file is smaller than this the full 64MB/128MB will not be used ·Blocks are stored as standard files on the DataNodes, in a set of directories specified in Hadoop’s configuration files ·Without the metadata on the NameNode, there is no way to access the files in the HDFS cluster ·When a client application wants to read a file: – It communicates with the NameNode to determine which blocks make up the file, and which DataNodes those blocks reside on – It then communicates directly with the DataNodes to read the data – The NameNode will not be a bo ttleneck © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"63 Options for Accessing HDFS put ·FsShell Command line: HDF hadoop fs Clien S Java · t get Cluste API r ·Ecosystem Projects – Flume – Collects data from network sources (e.g., system logs) – Sqoop – Transfers data between HDFS and RDBMS – Hue – Web/based interactive UI. Can browse, upload, download, and view files © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"64 Example: Storing and Retrieving Files (1) Loc al Node Node /logs/ 031512.log A D Node Node B E /logs/ 041213.log Node C HDFS Clust er © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"65 Example: Storing and Retrieving Files (2) Metadat 1.A,B NameNo a ,D de 2.B,D /logs/031512.log: B1,B2,B3 ,E /logs/041213.log: B4,B5 3.A,B ,C 4.A,B ,E 1 Node Node /logs/ 5.C,E 031512.log 2 A 1 3 D1 5 ,D 3 4 2 Node Node B 1 2 E2 5 3 4 4 4 /logs/ 041213.log 5 Node C 3 5 © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"66 Example: Storing and Retrieving Files (3) Metadat 1.A,B NameNo ,D de a 2.B,D /logs/031512.log: B1,B2,B3 ,E /logs/041213.log: B4,B5 3.A,B ,C 4.A,B 1 ,E Node Node 5.C,E /logs/ /logs/ 031512.log 2 A D1 5 ,D 041213.log? 3 1 3 2 4 Node Node B4,B B 1 2 E2 5 5 3 4 4 4 /logs/ 041213.log 5 Node Clien C t 3 5 © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"67 Example: Storing and Retrieving Files (4) Metadat 1.A,B NameNo ,D de a 2.B,D /logs/031512.log: B1,B2,B3 ,E /logs/041213.log: B4,B5 3.A,B ,C 4.A,B 1 ,E Node Node 5.C,E /logs/ /logs/ 2 031512.log 1 3 A D 1 5 ,D 041213.log? 3 4 2 Node Node B4,B B 1 2 E 5 3 4 42 5 4 /logs/ 041213.log 5 Node Clien C t 3 5 © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"68 HDFS NameNode Availability ·The NameNode daemon must be running at all times – If the NameNode stops, the cluster becomes inaccessible ·High Availability mode (in CDH4 and Activ Standb later) e y – Two NameNodes: Ac tive and Nam Name Standby e Node Nod e ·Classic mode – One NameNode Second – One “helper” node called Nam ary e Name SecondaryNameNode Nod Node – Bookkeeping, not backup e © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 
1"69 Important Notes About HDFS ·HDFS performs best with a modest number of large files – Millions, rather than billions, of files – Each file typically 100MB or more ·How HDFS works – Files are divided into blocks, which are replicated across nodes ·Command line access to HDFS – FsShell: hadoop fs – Sub/commands: -get, -put, -ls, -cat, etc ·HDFS is optimized for large, streaming reads of files – Rather than random reads © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"73 Chapter Topics Hadoop Basic Concepts ·What is Hadoop? ·The Hadoop Distributed File System (HDFS) ·How MapReduce Works © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"74 What Is MapReduce? ·MapReduce is a method for distributing a task across multiple nodes ·Each node processes data stored on that node – Where possible ·Consists of two phases: – Map – Reduce © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"75 MapReduce: Terminology ·A job is a ‘full program’ – A complete execution of Mappers and Reducers over a dataset – In MapReduce 2, the term application is oten used in place of ‘job’ ·A task is the execution of a single Mapper or Reducer over a slice of data ·A task attempt is a particular instance of an attempt to execute a task – There will be at least as many task a ttempts as there are tasks – If a task attempt fails, another will be started by the JobTracker – Speculative execution (see later) can also result in more task attempts than completed tasks © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"76 Hadoop Components: MapReduce ·The Mapper – Each Map task (typically) operates on a single HDFS Map block – Map tasks (usually) run on the node where the block is stored ·Shuffle and Sort – Sorts and consolidates intermediate data from all Shuffle mappers and Sort – Happens as Map tasks complete and before Reduce tasks start ·The Reducer – Operates on shuffled/sorted intermediate data Reduce (Map task output) – Produces final output © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"77 Example: Word Count Resul t aardvar 1 Input k Data cat 1 the cat sat on the mat mat 1 the aardvark sat on the sofa Map Reduce on 2 sat 2 sofa 1 the 4 © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"78 Example: The WordCount Mapper WordCountMapper output the 1 cat 1 Map sat 1 Input on 1 Data the 1 the cat sat on the mat mat 1 the aardvark sat on the sofa the 1 aardvark 1 sat 1 Map on 1 the 1 sofa 1 © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"79 Example: Shuffle & Sort Mapper Output the 1 cat 1 sat 1 Intermediate Data on 1 aardvark 1 the 1 cat 1 mat 1 mat 1 the 1 on 1,1 aardvark 1 sat 1,1 sat 1 sofa 1 on 1 the 1,1,1,1 the 1 sofa 1 © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"80 Example: SumReducer Reducer Output reduc aardvark e 1 Intermediate reduc cat 1 Final Data e Result aardvark 1 aardvark 1 reduc mat 1 cat 1 cat 1 e mat 1 mat 1 reduc on 2 e on 2 on 1,1 sat 2 sat 1,1 reduc sat 2 e sofa 1 sofa 1 reduc sofa 1 the 4 the 1,1,1,1 e reduc the 4 e © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 
1"81 Mappers Run in Parallel !Hadoop runs Map tasks on the slave node where the possible) block is stored (when – Many Mappers can run in parallel a 1,1,1,... – Minimizes network traffic ac 1,1,1,1,1.... accumsan 1,1 Block 1 Input Data HDFS Blocks Map ad 1,1,1,1.... adipiscing Block Lorem ipsum dolor sit amet, Lorem ipsum dolor 3 Block 1,1,1,1 consectetur adipiscing elit. sit amet, consec... aliquet Integer nec odio. Praesent tetur adipisc ing elit. Integer... 1 1,1,1,1.... libero. Sed cursus ante dapibus diam. Sed nisi. Nulla quis sem at nibh elementum a 1,1,1,1.... imperdiet. Duis sagittis ipsum. Praesent mauris. Fusce ac 1,1,1,1,1... nec tellus sed augue semper Sed pretium blandit Block ad 1,1,1,1,1... porta. Mauris massa. Vestibulum lacinia arcu eget orci. Ut eu diam at pede susci pit 2 Block Map aliquam 1,1,1 nulla. Class aptent taciti sodales... 3 Block aliquet 1 auctor 1,1,1,1... sociosqu ad litora torquent per conubia nostra, per 3... inceptos himenaeos. Curabitur sodales ligula in libero. Sed dignissim lacinia nunc. Aenean quam. In Curabitur tortor. scelerisque sem at Pellentesque Aenean dolor. Maecenas a 1,1,1,... quam. In scelerisque sem at mattis. Sed con... Block ad 1,1,1,1.... convallis tristique sem... 1 Block Map aliquam 1.1,1 2 Block amet 1,1,1,1 blandit 1,1,1 2... © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"82 MapReduce: The Mapper ·Hadoop attempts to ensure that Mappers run on nodes which hold their portion of the data locally, to avoid network traffi c – Multi ple Mappers run in parallel, each processing a por ti on of the input data ·The Mapper reads data in the form of key/value pairs – The Mapper may use or completely ignore the input key – For example, a standard pa tt ern is to read one line of a file at a ti me – The key is the byte offset into the file at which the line starts – The value is the contents of the line itself – Typically the key is considered irrelevant ·If the Mapper writes anything out, the output must be in the form of key/value pairs © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"83 MapReduce: The Reducer ·After the Map phase is over, all intermediate values for a given intermediate key are combined together into a list ·This list is given to a Reducer – There may be a single Reducer, or mul ti ple Reducers – All values associated with a par ti cular intermediate key are guaranteed to go to the same Reducer – The intermediate keys, and their value lists, are passed to the Reducer in sorted key order – This step is known as the ‘shuffle and sort’ ·The Reducer outputs zero or more final key/value pairs – These are written to HDFS – In practi ce, the Reducer usually emits a single key/value pair for each input key © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 1"84 Why Do We Care About Coun ting Words? ·Word count is challenging over massive amounts of data – Using a single compute node would be too time- consuming – Using distributed nodes requires moving data – Number of unique words can easily exceed available memory – Would need to store to disk ·Sta ti s ti cs are simple aggregate func ti ons – Distribu tive in nature – e.g., max, min, sum, count ·MapReduce breaks complex tasks down into smaller elements which can be executed in parallel ·Many common tasks are very similar to word count – e.g., log file analysis © Copyright 2010-2014 Cloudera. 
Features of MapReduce
· Automatic parallelization and distribution
· Built-in fault tolerance
· A clean abstraction for programmers
– MapReduce hides all of the "housekeeping" away from the developer
– Developers concentrate simply on writing the Map and Reduce functions
· Status and monitoring tools

Hadoop Environments
· Where to develop Hadoop solutions?
– Cloudera's QuickStart VM: Hadoop and the Hadoop ecosystem tools are already installed and configured; works right out of the box, free of charge
– Very useful for testing code before it is deployed to the real cluster
– Alternately, configure a machine to run in Hadoop pseudo-distributed mode; the ecosystem tools must then be installed one by one
· Where to run tested Hadoop solutions?
– Once testing is completed on a small data sample, the Hadoop solution can be run on a Hadoop cluster over all the data
– A Hadoop cluster is usually managed by the system administrator
– It is useful to understand the components of a cluster; this will be covered in a future lecture

Chapter 4: The Hadoop Ecosystem
· What other projects exist around core Hadoop?
· When to use HBase?
· How does Spark compare to MapReduce?
· What are the differences between Hive, Pig, and Impala?
· How is Flume typically deployed?

Chapter Topics: The Hadoop Ecosystem
· Introduction
· Data Storage: HBase
· Data Integration: Flume and Sqoop
· Data Processing: Spark
· Data Analysis: Hive, Pig, and Impala
· Workflow Engine: Oozie
· Machine Learning: Mahout

The Hadoop Ecosystem (1)
[Figure: the Hadoop ecosystem projects (Sqoop, Hive, Pig, Impala, HBase, Flume, Oozie) layered above the Hadoop core components (MapReduce and the Hadoop Distributed File System), packaged together in CDH]

The Hadoop Ecosystem (2)
· Next, a discussion of the key Hadoop ecosystem components: Sqoop, Hive, Pig, Impala, HBase, Flume, and Oozie
2"6 The Hadoop Ecosystem (3) ·Ecosystem projects may be – Built on HDFS and MapReduce – Built on just HDFS – Designed to integrate with or support Hadoop ·Most are Apache projects or Apache Incubator projects – Some others are not managed by the Apache Software Foundation – These are often hosted on GitHub or a similar repository ·Following is an introduction to some of the most significant projects © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 2"7 Chapter Topics The Hadoop Ecosystem ·Introduction ·Data Storage: HBase ·Data Integration: Flume and Sqoop ·Data Processing: Spark ·Data Analysis: Hive, Pig, and Impala ·Workflow Engine: Oozie ·Machine Learning: Mahout © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 2"8 HBase ·HBase is the Hadoop database ·A ‘NoSQL’ datastore ·Can store massive amounts of data – Petabytes+ ·High write throughput – Scales to hundreds of thousands of inserts per second ·Handles sparse data well – No wasted spaces for empty columns in a row ·Limited access model – Optimized for lookup of a row by key rather than full queries – No transactions: single row operations only – Only one column (the ‘row key’) is indexed © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 2"9 HBase vs Traditional RDBMSs RDBMS HBase Data layout Row/oriented Column/oriented Transactions Yes Single row only Query language SQL get/put/scan (or use Hive or Impala) Security Authentication/ Kerberos Authorization Indexes Any column Row/key only Max data size TBs PB+ Read/write Thousands Millions throughput (queries per second) © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 2"10 When To Use HBase ·Use plain HDFS if... – You only append to your dataset (no random write) – You usually read the whole dataset (no random read) ·Use HBase if... – You need random write and/or read – You do thousands of opera tions per second on TB+ of data ·Use an RDBMS if... – Your data fits on one big node – You need full transac ti on support – You need real/time query capabilities © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior written consent. 2"11 Chapter Topics The Hadoop Ecosystem ·Introduction ·Data Storage: HBase ·Data Integration: Flume and Sqoop ·Data Processing: Spark ·Data Analysis: Hive, Pig, and Impala ·Workflow Engine: Oozie ·Machine Learning: Mahout © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 2"12 Flume: Real/Mme Data Import ·What is Flume? – A service to move large amounts of data in real Mme – Example: storing log files in HDFS ·Flume imports data into HDFS as it is generated – Instead of batch/processing it later – For example, log files from a Web server ·Flume is – Distributed – Reliable and available – Horizontally scalable – Extensible © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri tten consent. 2"13 Flume: High/Level Overview ·Collect data as it is produced ·Files, syslogs, Agent Agent Agent Agen Agent stdout or custom t source ·Process in place encrypt compress ·e.g., encrypt, compress ·Pre/process data before Agen Agent storing t ·e.g., transform, scrub, enrich Agent(s) ·Write in parallel ·Scalable throughput HDFS Store in any format ·Text, compressed, binary, or custom sink © Copyright 2010/2014 Cloudera. 
Chapter Topics: The Hadoop Ecosystem (now: Data Integration: Flume and Sqoop)

Flume: Real-Time Data Import
· What is Flume?
– A service to move large amounts of data in real time
– Example: storing log files in HDFS
· Flume imports data into HDFS as it is generated
– Instead of batch-processing it later
– For example, log files from a web server
· Flume is
– Distributed
– Reliable and available
– Horizontally scalable
– Extensible

Flume: High-Level Overview
· Collect data as it is produced
– Files, syslogs, stdout, or a custom source
· Process the data in place
– e.g., encrypt, compress
· Pre-process data before storing it
– e.g., transform, scrub, enrich
· Write in parallel
– Scalable throughput
· Store in any format
– Text, compressed, binary, or a custom sink
[Figure: a tree of Flume agents collects data at the sources, processes it in flight, and writes in parallel to HDFS]

Sqoop: Exchanging Data With RDBMSs
· Sqoop transfers data between RDBMSs and HDFS
– Does this very efficiently via a map-only MapReduce job
– Supports JDBC, ODBC, and several specific databases
– "Sqoop" = "SQL to Hadoop"

Sqoop Custom Connectors
· Custom connectors exist for
– MySQL
– Postgres
– Netezza
– Teradata
– Oracle (partnered with Quest Software)
· Not open source, but free to use

Chapter Topics: The Hadoop Ecosystem (now: Data Processing: Spark)

Apache Spark
· Apache Spark is a fast, general engine for large-scale data processing on a cluster
· Originally developed at UC Berkeley's AMPLab
· Open source Apache project
· Provides several benefits over MapReduce
– Faster
– Better suited for iterative algorithms
– Can hold intermediate data in RAM, resulting in much better performance
– Easier API
– Supports Python, Scala, and Java
– Supports real-time streaming data processing

Spark vs. Hadoop MapReduce
· MapReduce
– Widely used; a huge investment has already been made in it
– Supports, and is supported by, many complementary tools
– Mature and well tested
· Spark
– Flexible
– Elegant
– Fast
– Supports real-time streaming data processing
· Over time, Spark is expected to supplant MapReduce as the general processing framework used by most organizations
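For comparison with the MapReduce WordCount earlier, here is a sketch of the same job in Spark's Java API, assuming the Spark 2.x signatures (the 1.x-era API that would have been current for these slides differs slightly: flatMap returned an Iterable rather than an Iterator). The paths are hypothetical. The whole pipeline is a chain of in-memory transformations, which is where the performance and API benefits come from:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkWordCount");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
          // Read the input from HDFS, one element per line
          JavaRDD<String> lines = sc.textFile("hdfs:///user/dave/input");

          JavaPairRDD<String, Integer> counts = lines
              // split each line into words (the Map step)
              .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
              // pair each word with a count of 1
              .mapToPair(word -> new Tuple2<>(word, 1))
              // sum the counts per word (the Reduce step)
              .reduceByKey((a, b) -> a + b);

          counts.saveAsTextFile("hdfs:///user/dave/output");
        }
      }
    }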
Chapter Topics: The Hadoop Ecosystem (now: Data Analysis: Hive, Pig, and Impala)

Hive and Pig: High-Level Data Languages
· The motivation: MapReduce is powerful, but hard to master
· The solution: Hive and Pig
– Languages for querying and manipulating data
– Leverage existing skill sets: data analysts who use SQL, and programmers who use scripting languages
– Open source Apache projects
– Hive was initially developed at Facebook; Pig was initially developed at Yahoo!
· An interpreter runs on a client machine
– It turns queries into MapReduce jobs
– And submits the jobs to the cluster

Hive
· What is Hive?
– HiveQL: an SQL-like interface to Hadoop

    SELECT * FROM purchases WHERE price > 10000 ORDER BY storeid

Pig
· What is Pig?
– Pig Latin: a dataflow language for transforming large data sets

    purchases = LOAD "/user/dave/purchases" AS (itemID, price, storeID, purchaserID);
    bigticket = FILTER purchases BY price > 10000;
    ...

Hive vs. Pig

                           Hive                              Pig
    Language               HiveQL (SQL-like)                 Pig Latin (a dataflow language)
    Schema                 Table definitions stored in a     Schema optionally defined
                           metastore                         at runtime
    Programmatic access    JDBC, ODBC                        PigServer (Java API)

· JDBC: Java Database Connectivity; ODBC: Open Database Connectivity

Impala: High-Performance Queries
· A high-performance SQL engine for vast amounts of data
– Similar query language to HiveQL
– 10 to 50+ times faster than Hive, Pig, or MapReduce
· Impala runs on Hadoop clusters
– Data is stored in HDFS
– Does not use MapReduce
· Developed by Cloudera
– 100% open source, released under the Apache software license

Which to Choose?
· Use Impala when...
– You need near real-time responses to ad hoc queries
– You have structured data with a defined schema
· Use Hive or Pig when...
– You need support for custom file types, complex data types, or external functions
· Use Pig when...
– You have developers experienced with writing scripts
– Your data is unstructured or multi-structured
· Use Hive when...
– You have analysts familiar with SQL
– You are integrating with BI or reporting tools via ODBC/JDBC

Chapter Topics: The Hadoop Ecosystem (now: Workflow Engine: Oozie)

Oozie
· Oozie
– A workflow engine for MapReduce jobs
– Defines dependencies between jobs
· The Oozie server submits the jobs to the cluster in the correct sequence
· We will investigate Oozie later in the course

Chapter Topics: The Hadoop Ecosystem (now: Machine Learning: Mahout)

Mahout
· Mahout is a machine learning library written in Java
· Used for
– Collaborative filtering (recommendations)
– Clustering (finding naturally occurring "groupings" in data)
– Classification (determining whether new data fits a category)
· Why use Hadoop for machine learning?
– "It's not who has the best algorithms that wins. It's who has the most data."

Key Points
· Hadoop Ecosystem
– Many projects built on, and supporting, Hadoop
– Several will be covered in detail later in the course