SABSE3 Big Data Engineer 2021 Ecosystem Course Guide PDF
Document Details
2021
Ahmed Abdel-Baky, Maria Farid, Heba Aboulmagd, Abdelrahman Hassan, Mohamed El-Khouly, Norhan Khaled, Adel El-Metwally, Nouran El-Sheikh, Dina Sayed
Summary
This document is a course guide for a 2021 Big Data Engineer course. It includes details like course code, authors, and IBM training. The document also contains various notices, disclaimers, and trademark information related to the course and the provider. There are no questions.
Full Transcript
V11.3
Front cover
Course Guide
Big Data Engineer 2021
Big Data Ecosystem
Course code SABSE ERC 3.0
Ahmed Abdel-Baky, Maria Farid, Heba Aboulmagd, Abdelrahman Hassan, Mohamed El-Khouly, Norhan Khaled, Adel El-Metwally, Ramy Said, Nouran El-Sheikh, Dina Sayed
IBM Training
February 2021 edition
Notices
This information was developed for products and services offered in the US. IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to:
IBM Director of Licensing
IBM Corporation
North Castle Drive, MD-NC119
Armonk, NY 10504-1785
United States of America
INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Some jurisdictions do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.
Any references in this information to non-IBM websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk.
IBM may use or distribute any of the information you provide in any way it believes appropriate without incurring any obligation to you.
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to actual people or business enterprises is entirely coincidental.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml.
© Copyright International Business Machines Corporation 2016, 2021.
This document may not be reproduced in whole or in part without the prior written permission of IBM.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
Contents
Trademarks
Course description
Agenda
Unit 1. Introduction to big data
Unit objectives
1.1. Big data overview
Big data overview
Topics
Introduction to big data
Big data: A tsunami that is hitting us
Some examples of big data
Types of big data
The four classic dimensions of big data (the four Vs)
An insight into big data analytic techniques
1.2. Big data use cases
Big data use cases
Topics
Big data analytics use case examples
Common use cases that are applied to big data
Examples of business sectors that use big data
Use cases for big data: Healthcare
The Precision Medicine Initiative and big data
Use cases for big data: Financial services
Financial marketplace example: Visa
Financial
“Data is the new oil”
1.3. Evolution from traditional data processing to big data processing
Evolution from traditional data processing to big data processing
Topics
Traditional versus big data approaches to using data
System of units / Binary system of units
Hardware improvements over the years
Parallel data processing
Online transactional processing system
Online analytical processing system
Meaning of “real time” when applied to big data
More comments on “real time”
1.4. Introduction to Apache Hadoop and the Hadoop infrastructure
Introduction to Apache Hadoop and the Hadoop infrastructure
Topics
A new approach is needed to process big data: Requirements
Introduction to Apache Hadoop and the Hadoop infrastructure
Core Hadoop characteristics
What is Apache Hadoop?
Why and where Hadoop is used and not used
Apache Hadoop core components
The two key components of Hadoop
Differences between RDBMS and Hadoop HDFS
Hadoop infrastructure: Large and constantly growing
Think differently
Unit summary
Review questions
Review questions (cont.)
Review answers
Review answers (cont.)
Unit 2. Introduction to Hortonworks Data Platform (HDP)
Unit objectives
2.1. Hortonworks Data Platform overview
Hortonworks Data Platform overview
Topics
Hortonworks Data Platform
Hortonworks Data Platform
2.2. Data flow
Data flow
Topics
Data Flow
Kafka
Sqoop
2.3. Data access
Data access
Topics
Data access
Hive
Pig
HBase
Accumulo
Phoenix
Storm
Solr
Spark
Druid
2.4. Data lifecycle and governance
Data lifecycle and governance
Topics
Data Lifecycle and Governance
Falcon
Atlas
2.5. Security
Security
Topics
Security
Ranger
Knox
2.6. Operations
Operations
Topics
Operations
Ambari
Cloudbreak
ZooKeeper
Oozie
2.7. Tools
Tools
Topics
Tools
Zeppelin
Ambari Views
2.8. IBM added value components
IBM added value components
Topics
IBM added value components
Db2 Big SQL is SQL on Hadoop
Big Replicate
Information Server and Hadoop: BigQuality and BigIntegrate
Information Server - BigIntegrate: Ingest, transform, process and deliver any data into & within Hadoop
Information Server - BigQuality: Analyze, cleanse and monitor your big data
IBM InfoSphere Big Match for Hadoop
Unit summary
Review questions
Review questions
Review answers
Review answers
Exercise 1: Exploring the lab environment
Exercise objectives
Unit 3. Introduction to Apache Ambari
Unit objectives
3.1. Apache Ambari overview
Apache Ambari overview
Topics
Operations
Apache Ambari
Functions of Apache Ambari
Apache Ambari Metrics System
Apache Ambari architecture
3.2. Apache Ambari Web UI
Apache Ambari Web UI
Topics
Sign in to Apache Ambari Web UI
Navigating Apache Ambari Web UI
The Apache Ambari dashboard
Metric details on the Apache Ambari dashboard
Metric details for time-based cluster components
Service Actions and Alert and Health Checks
Service Check from the Service Actions menu
Host metrics: Example of a host
Non-functioning/failed services: Example of HBase
Managing hosts in a cluster
3.3. Apache Ambari command-line interface (CLI)
Apache Ambari command-line interface (CLI)
Topics
Running Apache Ambari from the command line
3.4. Apache Ambari basic terms
Apache Ambari basic terms
Topics
Apache Ambari terminology
Unit summary
Review questions
Review questions (cont.)
Review answers
Review answers (cont.)
Exercise: Managing Hadoop clusters with Apache Ambari
Exercise objectives
Unit 4. Apache Hadoop and HDFS
Unit objectives
4.1. Apache Hadoop: Summary and recap
Apache Hadoop: Summary and recap
Topics
What is Apache Hadoop
Hadoop infrastructure: Large and constantly growing
The importance of Hadoop
Advantages and disadvantages of Hadoop
4.2. Introduction to Hadoop Distributed File System
Introduction to Hadoop Distributed File System
Topics
Introduction to HDFS
HDFS goals
Brief introduction to HDFS and MapReduce
HDFS architecture
HDFS blocks
HDFS replication of blocks
Setting the rack network topology (rack awareness)
Compression of files
    Which compression format should you use
  4.3. Managing a Hadoop Distributed File System cluster
    Managing a Hadoop Distributed File System cluster
    Topics
    NameNode startup
    NameNode files (as stored in HDFS)
    Adding a file to HDFS: replication pipelining
    Managing the cluster
    HDFS NameNode high availability
    Standby NameNode
    Federated NameNode (HDFS)
    dfs: File system shell (1 of 4)
    dfs: File system shell (2 of 4)
    dfs: File system shell (3 of 4)
    dfs: File system shell (4 of 4)
  Unit summary
  Review questions
  Review answers
  Exercise: File access and basic commands with HDFS
  Exercise objectives
Unit 5. MapReduce and YARN
  Unit objectives
  5.1. Introduction to MapReduce
    Introduction to MapReduce
    Topics
    MapReduce: The Distributed File System (DFS)
    MapReduce explained
    The MapReduce programming model
    The MapReduce execution environments
    MapReduce overview
    MapReduce: Map phase
    MapReduce: Shuffle phase
    MapReduce: Reduce phase
    MapReduce: Combiner (Optional)
    WordCount example
    Map task
    Shuffle
    Reduce
    Combiner (Optional)
    Source code for WordCount.java (1 of 3)
    Source code for WordCount.java (2 of 3)
    Source code for WordCount.java (3 of 3)
    Classes
    Splits
    RecordReader
    InputFormat
  5.2. Hadoop v1 and MapReduce v1 architecture and limitations
    Hadoop v1 and MapReduce v1 architecture and limitations
    Topics
    MapReduce v1 engine
    How Hadoop runs MapReduce v1 jobs
    Fault tolerance
    Issues with the original MapReduce paradigm
    Limitations of classic MapReduce (MRv1)
    Scalability in MRv1: Busy JobTracker
  5.3. YARN architecture
    YARN architecture
    Topics
    YARN
    YARN high-level architecture
    Running an application in YARN (1 of 7)
    Running an application in YARN (2 of 7)
    Running an application in YARN (3 of 7)
    Running an application in YARN (4 of 7)
    Running an application in YARN (5 of 7)
    Running an application in YARN (6 of 7)
    Running an application in YARN (7 of 7)
    How YARN runs an application
    YARN features
    YARN features: Scalability
    YARN features: Multi-tenancy
    YARN features: Compatibility
    YARN features: Higher cluster utilization
    YARN features: Reliability and availability
    YARN major features summarized
    Apache Spark with Hadoop 2+
  5.4. Hadoop and MapReduce v1 compared to v2
    Hadoop and MapReduce v1 compared to v2
    Topics
    Hadoop v1 to Hadoop v2
    YARN modifies MRv1
    Architecture of MRv1
    YARN architecture
    Terminology changes from MRv1 to YARN
  Unit summary
  Review questions
  Review questions (cont.)
  Review answers
  Review answers (cont.)
  Exercise: Running MapReduce and YARN jobs
  Exercise objectives
  Exercise: Creating and coding a simple MapReduce job
  Exercise objectives
Unit 6. Introduction to Apache Spark
  Unit objectives
  6.1. Apache Spark overview
    Apache Spark overview
    Topics
    Big data and Apache Spark
    Ease of use
    Who uses Apache Spark and why
    Apache Spark unified stack
    Apache Spark jobs and shell
    Apache Spark Scala and Python shells
  6.2. Scala overview
    Scala overview
    Topics
    Brief overview of Scala
    Scala: Anonymous functions (Lambda functions)
    Computing WordCount by using Lambda functions
  6.3. Resilient Distributed Dataset
    Resilient Distributed Dataset
    Topics
    Resilient Distributed Dataset
    Creating an RDD
    RDD basic operations
    What happens when an action is run (1 of 8)
    What happens when an action is run (2 of 8)
    What happens when an action is run (3 of 8)
    What happens when an action is run (4 of 8)
    What happens when an action is run (5 of 8)
    What happens when an action is run (6 of 8)
    What happens when an action is run (7 of 8)
    What happens when an action is run (8 of 8)
    RDD operations: Transformations
    RDD operations: Actions
    RDD persistence
    Best practices for which storage level to choose
    Shared variables and key-value pairs
    Programming with key-value pairs
  6.4. Programming with Apache Spark
    Programming with Apache Spark
    Topics
    Programming with Apache Spark
    SparkContext
    Linking with Apache Spark: Scala
    Initializing Apache Spark: Scala
    Linking with Apache Spark: Python
    Initializing Apache Spark: Python
    Linking with Apache Spark: Java
    Initializing Apache Spark: Java
    Passing functions to Apache Spark
    Programming the business logic
    Running Apache Spark examples
    Creating Apache Spark stand-alone applications: Scala
    Running stand-alone applications
  6.5. Apache Spark libraries
    Apache Spark libraries
    Topics
    Apache Spark libraries
    Apache Spark SQL
    Apache Spark Streaming
    Apache Spark Streaming: Internals
    GraphX
  6.6. Apache Spark cluster and monitoring
    Apache Spark cluster and monitoring
    Topics
    Apache Spark cluster overview
    Apache Spark monitoring
  Unit summary
  Review questions
  Review answers
  Exercise: Running Apache Spark applications in Python
  Exercise objectives
Unit 7. Storing and querying data
  Unit objectives
  7.1. Introduction to data and file formats
    Introduction to data and file formats
    Topics
    Introduction to data
    Gathering and cleaning, munging, or wrangling data
    Flat files and text files