1. Introduction to Big Data.pdf
Big Data Analytics
Michał Bryś
© Copyright. All rights reserved. Not to be reproduced without prior written consent.

About me
■ Michał Bryś
■ ML Architect at GetInData
■ 13+ years of experience in the data industry
■ Interests: ML & MLOps
■ e-mail: [email protected]

About GetInData
We work with organizations to grow their business by using data, analytics & technology
Focus on Big Data & Cloud (from day 1)
130+ engineers and data scientists
Founded in 2014 by ex-Spotify engineers
Community Builders (conference, meetups, blog posts, open-source)

Your Experience
■ Have you already used Hadoop or other Big Data tools?
■ Have you worked with any data-related cloud solutions?
  ● Google BigQuery
  ● Amazon Redshift, Athena
  ● Snowflake
  ● Databricks
■ What is your experience with popular programming languages?
  ● Python, SQL, R, Scala
■ What is your experience with ML?

Goals
■ Introduction to Big Data
  ● Fundamental concepts and most popular technologies
■ Hands-on experience with most popular technologies
  ● Inspired by real-world companies and their use-cases
■ Knowledge required to start your journey with Big Data
■ How to transfer your current knowledge about data analysis and machine learning to Big Data
What we won’t do:
■ Learn new data analysis or ML concepts
■ Course won’t be very technical

Program of the course
■ Introduction to Big Data
■ Introduction to cloud DWH (BigQuery)
■ Data visualization in Big Data*
■ Data preparation & exploration with Apache Spark
■ ML with Apache Spark
■ Introduction to MLOps*
Theory, Hands-on exercises
Assessment
■ Active attendance
■ (Optional) various hands-on quests
■ Test
Details here: Big Data Analytics - course information
*optional

Literature & resources
Resources:
➔ Books: https://learning.oreilly.com/home/
➔ Labs:
  ◆ Qwiklabs for Students | https://www.cloudskillsboost.google/
  ◆ DataCamp
➔ GCP credits - check your mailbox!

Today you’ll learn…
■ Basic characteristics of Big Data (the 5 V’s)
■ Concept of distributed computing & storage
■ Evolution of Big Data
■ Hadoop ecosystem fundamentals (HDFS & MapReduce)
■ Key differences between cloud and on-premise

Technologies

Big data - what is it?
Wikipedia definition: Big data is a term used to refer to the study and applications of data sets that are too complex for traditional data-processing application software to adequately deal with.
Common definition: When data stops fitting in one machine
Velocity, Volume, Variety, …
Source: https://www.kdnuggets.com/2017/05/must-know-common-data-quality-issues-big-data.html

Big Data - motivation, opportunities
Source: https://www.linkedin.com/pulse/evolution-analytics-10-20-30-shankar-meganatha/
Source: https://www.3pillarglobal.com/insights/the-evolution-of-big-data-analytics-market/

Big data - new opportunities come with new challenges
Source: https://www.abigdatablog.com/post/series-1-part-7-big-data-challenges

Big data - clusters
What is a cluster?
A bunch of server units grouped together to provide one or more functions, such as storage or compute.
Source: https://books.google.pl/books?id=2Kd9DwAAQBAJ

Distributed storage & computing key concepts
■ Horizontal vs. vertical scalability
■ Fault tolerance & high availability (HA)
■ Data locality vs. separation of compute & storage
■ Autoscaling of resources
■ Data agnosticism

Big data
Let’s focus now on two main elements of Big Data:
■ Storage
■ Computation

Storing Large Datasets
■ How to store large volumes of data in a reliable way?
Popular Solution
■ HDFS
■ S3/GCS
Hadoop Distributed File System
● Data Nodes & Name Nodes
● Distributed read / write
● Replication (default: 3)
● Striping (default block: 128 MB)
● Built-in fault tolerance
Source: https://books.google.pl/books?id=2Kd9DwAAQBAJ

Hadoop Distributed File System
A Definition For Your Uncle
An easy-to-use software program that runs on many inexpensive computers, stores many files redundantly, and still works when some of the computers crash!

HDFS - some examples
$ hdfs dfs -ls /user/tiger
$ hdfs dfs -put songs.txt /user/tiger
$ hdfs dfs -cat /user/tiger/songs.txt
$ hdfs dfs -mkdir songs
$ hdfs dfs -mv songs.txt songs
$ hdfs dfs -rm -r songs

Uploading a File To HDFS
Question?
■ What happens when a file is uploaded to HDFS?
$ hdfs dfs -put songs.txt /user/tiger

Uploading a File To HDFS
Answer!
■ A file is split into smaller, but still large, blocks
■ Each block is stored redundantly on multiple machines

Splitting a File Into Blocks
■ A file is just “sliced” into chunks of 128 MB (or so) each
  ● It does NOT matter if it is text, binary, compressed or not
  ● It does matter later, when reading the data
■ HDFS is data-agnostic
Image source: http://pixgood.com/slicing-bread.html

Reading a File From HDFS
Question?
■ What happens when a file is read from HDFS?
$ hdfs dfs -cat /user/tiger/songs.txt

Reading a File From HDFS
Answer!
■ Some information about the file is needed!
  ● How was the file split into blocks?
  ● Where are these blocks located?
  ● If a block is replicated multiple times, which replica to read from?

Master And Slaves
■ The Master Node manages the metadata information
■ The Slave Nodes store blocks of data and serve them to clients

Properties & use cases of HDFS
■ Assumes streaming reads and writes (not random)
  ● Reading and writing data from the beginning to the end
  ● Immutable files
■ “Write once and read many times”
■ Focuses on throughput (not latency)
  ● The analogy to a big truck (not a Ferrari)
■ Likes batch processing on large files

Bad Use-Cases For HDFS
■ Low-latency requests
  ● e.g. serving the content of MP3 files
■ Random read or random write requests
  ● e.g. serving playlists created by users or user profiles
■ An extremely high number of small files
  ● e.g. small XML files coming from external systems

Big Data file formats
Source: https://www.slideshare.net/julienledem/if-you-have-your-own-columnar-format-stop-now-and-use-parquet
Columnar file formats: ORC, Parquet
Next-gen table formats: Hudi, Iceberg, Delta

But Hadoop is not only about storing data, right?

Sending Computation To Data
■ It is more efficient to send computation to data, isn’t it?
Large volume of data
Computation, e.g. a JAR file
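To make the idea of sending computation to data more concrete, here is a minimal Python sketch. It assumes the defaults quoted earlier (128 MB blocks, replication factor 3). The round-robin placement, the node names, the 300 MB file and all helper functions are hypothetical illustrations, not the actual HDFS placement policy or any real scheduler.

```python
BLOCK_SIZE_MB = 128   # default block size quoted in the slides
REPLICATION = 3       # default replication factor quoted in the slides


def split_into_blocks(file_size_mb):
    """Return the size of each block for a file of the given size."""
    full_blocks, remainder = divmod(file_size_mb, BLOCK_SIZE_MB)
    sizes = [BLOCK_SIZE_MB] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block may be smaller
    return sizes


def place_replicas(num_blocks, nodes):
    """Toy round-robin placement (an assumption, not the real HDFS policy)."""
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + i) % len(nodes)] for i in range(REPLICATION)
        ]
    return placement


def pick_node_for_task(block_id, placement, free_nodes):
    """Prefer a free node that already stores the block (data locality)."""
    for node in placement[block_id]:
        if node in free_nodes:
            return node, True            # local read, no network transfer
    return sorted(free_nodes)[0], False  # fallback: the block must travel


nodes = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical cluster
sizes = split_into_blocks(300)                         # hypothetical 300 MB file
placement = place_replicas(len(sizes), nodes)
choice = pick_node_for_task(0, placement, free_nodes={"node3", "node5"})
print(sizes, placement[0], choice)
```

The task for block 0 lands on a node that already holds a replica, so only the small piece of code travels over the network, not the block itself.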
YARN - Yet Another Resource Negotiator
● Node Managers & Resource Manager(s)
● Co-located with data nodes for data locality
● Resources allocated in containers
● Various scheduling concepts
● Application agnostic
Source: https://books.google.pl/books?id=2Kd9DwAAQBAJ

HDFS + YARN = Core Hadoop
● NodeManagers are co-located with DataNodes
● Resource Manager tries to schedule tasks on a node that already has the data
● Large volumes of data don’t have to be sent over the network

Processing Data In a Distributed Way
■ How to implement a distributed application that processes terabytes of data and runs on tens (or thousands) of machines?
First Solutions
■ MapReduce

MapReduce Model
■ Programming model inspired by functional programming
  ● Two functions: map() and reduce()
  ● They process data in the form of <key, value> pairs
■ Useful for processing large datasets in a distributed way
■ Popularized by Google in 2004 in a research paper

Map And Reduce Functions
■ map: (key1, value1) -> [(key2, value2)]
■ reduce: (key2, [value2]) -> [(key3, value3)]

Counting Words

MapReduce Parallelism
Question
■ How can the MapReduce model be useful for processing very large datasets in a distributed way?
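The map() and reduce() signatures above can be tried out on the classic word-count example. What follows is a single-process Python sketch of the model, not Hadoop's API: run_mapreduce, map_fn and reduce_fn are hypothetical names, the input lines are made up, and the line index stands in for the byte offset of a line.

```python
from collections import defaultdict


def map_fn(key, value):
    """map: (key1, value1) -> [(key2, value2)]; here (offset, line) -> [(word, 1)]."""
    return [(word.lower(), 1) for word in value.split()]


def reduce_fn(key, values):
    """reduce: (key2, [value2]) -> [(key3, value3)]; here (word, [1, ...]) -> [(word, total)]."""
    return [(key, sum(values))]


def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: map_fn is called once per input record.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Shuffle phase: group all intermediate values by their key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: reduce_fn is called once per key, in sorted key order.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


lines = ["big data is big", "data is everywhere"]  # made-up input
records = list(enumerate(lines))  # line index stands in for the byte offset
word_counts = run_mapreduce(records, map_fn, reduce_fn)
print(word_counts)
```

Note that map_fn only ever sees one record and reduce_fn only ever sees one key with its values; the sketch runs them in simple loops, which is the part a real framework distributes.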
MapReduce Parallelism
Answer
■ Many map() functions can run independently at the same time
■ Once they’re done, many reduce() functions can run independently at the same time
  ● They process the intermediate data generated by map() functions

MapReduce at Larger Scale

MapReduce Flow
■ Subset of input data
■ Invokes map() function multiple times
■ List of key-value pairs
■ Keys are sorted, values not (but they could be)
■ Invokes reduce() function multiple times (over a subset of intermediate data)

Number of Streams Per Artist
■ Artist Name, Song Title, Timestamp, Username
■ Offset of the line from the beginning of the file
■ We can influence which Reduce Task an artist should go to (by default the hash function on artist name is used)

MapReduce vs. Spark
■ MapReduce: low-level API, not suitable for ad-hoc analytics
■ Spark:
  ● Much faster
  ● Efficient usage of memory (caching)
  ● SQL interface, Data Frame abstraction
  ● Streaming workloads
  ● Graph processing
  ● Support for various APIs (Python, R, Java, Scala)
  ● More… soon in the course! [5 min. intro]

Cloud models
Source: https://www.c-sharpcorner.com/article/what-is-cloud-computing-explore-the-services-and-deployment-models/

On-premise vs cloud
More on cloud in the next lab!
Source: https://wearekemb.com/en/on-premise-vs-cloud/

Questions?
Exercises
■ Get familiar with: Vertex AI Workbench / Google Colab
■ Go to your group’s folder on Google Drive (details in your mailbox!)