Hadoop Ecosystem 2023 PDF

Document Details

Uploaded by CourteousPointillism

Lakehead University

2023

Abed Alkhateeb

Tags

hadoop, big data, data processing, machine learning

Summary

This presentation from Lakehead University, 2023, covers the Hadoop ecosystem. Topics include Hadoop's components (HDFS and MapReduce), applications (data mining, machine learning), and algorithms for analyzing big data.

Full Transcript

Lakehead University: Hadoop Ecosystem, by Abed Alkhateeb, 2023

Apache Hadoop Ecosystem

- Hadoop is an open-source software framework written in Java (2005), designed to answer the question: "How to process big data with reasonable cost and time?"
- Distributed storage on computer clusters
- Distributed processing of very large data sets
- Two main components:
  - HDFS (Hadoop Distributed File System): provides distributed storage
  - MapReduce: a distributed data processing model

Who Uses Hadoop?

History of Hadoop

Apache Hadoop Ecosystem Pieces

- Flume, a tool for ingesting streaming data into your cluster.
- Sqoop, for ingesting data from external datastores.
- HBase, a tabular database on Hadoop.
- Oozie, for Hadoop task management.
- Pig, a data flow scripting language for analyzing unstructured data.
- Mahout, for machine learning with Hadoop.
- Drill, a low-latency query engine for structured, semi-structured, and nested data.
- Spark, a data processing engine for distributed computing.

Hadoop Distributed File System (HDFS)

- Based on Google's GFS (Google File System)
- Data is distributed across all nodes
- Built around the idea of "write-once, read-many-times" (WORM)
- Files are stored as "blocks" of 64 MB
- Different blocks of the same file are stored on different nodes

MapReduce

- MapReduce is a style of programming designed for:
  - Easy parallel programming.
  - Invisible management of hardware and software failures.
  - Easy management of very-large-scale data.

MapReduce

- MapReduce is a method for distributing a task across multiple nodes; it consists of two phases:
  - Map
  - Reduce
- A user-written Map function is applied to each input element, in parallel (the mapper).
- The system sorts all the key-value pairs by key, forming key-(list of values) pairs.

MapReduce

- Another user-written function, the Reduce function, is applied to each key-(list of values) pair.
- An application of the Reduce function to one key and its list of values is a reducer.
- Each reducer produces some output, and the output of the entire job is the union of what is produced by each reducer.

MapReduce

[Diagram: Input → Mappers → Reducers → Output]

MapReduce: Word Count Example

[Diagram of the word-count data flow; a minimal Python simulation follows.]
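The word-count example above is diagram-only in the transcript, so here is a minimal single-process Python simulation of the same map / shuffle / reduce flow. It illustrates the programming model only, not Hadoop's actual Java API; the names map_fn, reduce_fn, and mapreduce_word_count are invented for this sketch.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Mapper: emit a (word, 1) pair for every word in one input line.
    for word in line.lower().split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reducer: combine the list of values collected for one key.
    yield (word, sum(counts))

def mapreduce_word_count(lines):
    # Map phase: apply the user-written Map function to each input element.
    pairs = [kv for line in lines for kv in map_fn(line)]
    # Shuffle phase: sort all key-value pairs by key, forming
    # key -> (list of values) groups, as the slides describe.
    pairs.sort(key=itemgetter(0))
    output = []
    for word, group in groupby(pairs, key=itemgetter(0)):
        # Reduce phase: one reducer application per key.
        output.extend(reduce_fn(word, [v for _, v in group]))
    return output

print(mapreduce_word_count(["the cat sat", "the dog sat"]))
# [('cat', 1), ('dog', 1), ('sat', 2), ('the', 2)]
```

In real Hadoop the map and reduce phases run on different nodes and the shuffle moves data across the network; this sketch only mimics that pipeline in one process.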
Data Mining

- "Data is the new oil."
- Data mining, the extraction of actionable information from (usually) very large datasets, is the subject of extreme hype, fear, and interest.
- It is part of machine learning.

Machine Learning

- Machine learning = algorithms for inferring unknowns from knowns.
- Applications:
  - Spam detection
  - Climate modeling
  - Handwriting recognition
  - Speech recognition
  - Netflix recommendations

Types of Machine Learning

- Supervised learning (labeled data)
  - Regression: predict a value (e.g., # rooms, # baths, size → house price)
  - Classification: predict a class (e.g., blood test → cancer or not, i.e., 0 or 1)
- Unsupervised learning (unlabeled data)
  - Clustering (find groups, outliers)
  - Recommenders (e.g., Netflix, Amazon)

Association Rules / Frequent Itemsets

- The Market-Basket Model
- Association Rules
- A-Priori Algorithm
- PCY Algorithm

The Market-Basket Model

- A large set of items, e.g., things sold in a supermarket.
- A large set of baskets, each of which is a small set of the items, e.g., the things one customer buys on one day.

Support and Support Threshold

- Simplest question: find sets of items that appear "frequently" in the baskets.
- Support for itemset I = the number of baskets containing all items in I.
- Sometimes given as a percentage of the baskets.
- Given a support threshold s, a set of items appearing in at least s baskets is called a frequent itemset.

Example: Frequent Itemsets

- Items = {milk, coke, pepsi, beer, juice}.
- Support threshold = 3 baskets.
- B1 = {m, c, b}; B2 = {m, p, j}; B3 = {m, b}; B4 = {c, j}; B5 = {m, p, b}; B6 = {m, c, b, j}; B7 = {c, b, j}; B8 = {b, c}
- Frequent itemsets: {m}, {c}, {b}, {j}, {m, b}, {b, c}, {c, j} (re-derived in the first script after the transcript).

Applications

- Supermarkets
  - Story of the "diapers and beer" discovery.
  - Used to position potato chips between diapers and beer to enhance sales of potato chips.
- Plagiarism detection
  - Items = documents; baskets = sentences.
  - A basket/sentence contains all the items/documents that contain that sentence.

Association Rules

- {i1, i2, ..., ik} → j means: "if a basket contains all of i1, ..., ik, then it is likely to contain j."
- Example: {bread, peanut-butter} → jelly.
- The confidence of this association rule is the "probability" of j given i1, ..., ik.

Example: Confidence

- Same baskets as above: B1 = {m, c, b}; B2 = {m, p, j}; B3 = {m, b}; B4 = {c, j}; B5 = {m, p, b}; B6 = {m, c, b, j}; B7 = {c, b, j}; B8 = {b, c}
- An association rule: {m, b} → c.
- Four baskets (B1, B3, B5, B6) contain both m and b; of those, two (B1, B6) also contain c.
- Confidence = 2/4 = 50%.

Main-Memory Bottleneck

- The main issue is finding the frequent itemsets.
- For many frequent-itemset algorithms, main memory is the critical resource.
- As we read baskets, we need to count itemsets.
- In practice, we read the data in passes.
- Thus, we measure the cost of an algorithm by the number of passes it takes.

A-Priori Algorithm

- A two-pass approach called A-Priori limits the need for main memory.
- Key idea (monotonicity): if a set of items appears at least s times, so does every subset of the set. For example, if {m, c, b} is frequent, so are {m, c}, {m, b}, and {c, b}.
- Contrapositive for pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets.

A-Priori Algorithm

- Pass 1: read baskets and count, in main memory, the occurrences of each item.
  - Items that appear at least s times are the frequent items.
- Pass 2: read baskets again and count, in main memory, only those pairs whose items were both found to be frequent in Pass 1 (see the sketch after the transcript).

Picture of A-Priori

[Diagram: C1 (all items) → filter → L1 (frequent items) → construct → C2 (candidate pairs) → filter → L2 (frequent pairs) → construct → C3; the first pass counts the items, the second pass counts the candidate pairs.]

Picture of A-Priori

[Diagram of main memory: Pass 1 holds the item counts; Pass 2 holds the frequent items plus counts of pairs of frequent items.]

PCY (Park-Chen-Yu) Algorithm

- An improvement to A-Priori.
- During Pass 1 of A-Priori, most memory is idle.
- Use that memory to keep counts of buckets into which pairs of items are hashed.
- For each basket, enumerate all its pairs, hash them, and increment the resulting bucket count by 1.
- On Pass 2, we only count pairs of frequent items that also hash to a frequent bucket (see the second sketch after the transcript).

Picture of PCY

[Diagram of main memory: Pass 1 holds the item counts and the hash table of bucket counts; Pass 2 holds the frequent items, a bitmap marking the frequent buckets, and counts of candidate pairs.]

Q&A
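The frequent-itemset and confidence examples above can be checked mechanically. The short script below is not from the slides; it recounts support over the lecture's eight example baskets and verifies the confidence of {m, b} → c.

```python
from itertools import combinations

baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
           {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"}]
s = 3  # support threshold from the slides

def support(itemset):
    # Support of an itemset = number of baskets containing all of its items.
    return sum(1 for basket in baskets if set(itemset) <= basket)

items = sorted(set().union(*baskets))
print([i for i in items if support({i}) >= s])
# ['b', 'c', 'j', 'm']  (p appears in only 2 baskets, so it is not frequent)
print([p for p in combinations(items, 2) if support(p) >= s])
# [('b', 'c'), ('b', 'm'), ('c', 'j')]  i.e., {b, c}, {m, b}, {c, j}

# Confidence of {m, b} -> c = support({m, b, c}) / support({m, b}).
print(support({"m", "b", "c"}) / support({"m", "b"}))  # 0.5, i.e., 2/4 = 50%
```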

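Next, a minimal sketch of the two-pass A-Priori algorithm for frequent pairs, assuming the baskets can be re-read on each pass (as they would be from disk). The helper read_baskets is invented for this sketch to stand in for that re-read.

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(read_baskets, s):
    # Pass 1: count individual items in main memory.
    item_counts = Counter()
    for basket in read_baskets():
        item_counts.update(basket)
    frequent_items = {i for i, c in item_counts.items() if c >= s}

    # Pass 2: by monotonicity, a pair can be frequent only if both of its
    # items are frequent, so count only those candidate pairs.
    pair_counts = Counter()
    for basket in read_baskets():
        kept = sorted(i for i in basket if i in frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {p: c for p, c in pair_counts.items() if c >= s}

baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
           {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"}]
print(apriori_pairs(lambda: iter(baskets), 3))
# {('b', 'c'): 4, ('b', 'm'): 4, ('c', 'j'): 3}
```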
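And a matching sketch of the PCY refinement: Pass 1 additionally hashes every pair into a bucket array, and Pass 2 counts a candidate pair only if its bucket was frequent. The bucket count (10007) and the use of Python's built-in hash are arbitrary illustrative choices, not prescribed by the slides.

```python
from collections import Counter
from itertools import combinations

def pcy_pairs(read_baskets, s, n_buckets=10007):
    bucket_of = lambda pair: hash(pair) % n_buckets

    # Pass 1: count items AND use the otherwise-idle memory to count
    # the buckets into which pairs of items are hashed.
    item_counts = Counter()
    bucket_counts = [0] * n_buckets
    for basket in read_baskets():
        item_counts.update(basket)
        for pair in combinations(sorted(basket), 2):
            bucket_counts[bucket_of(pair)] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= s}
    # Between passes: summarize bucket counts as a bitmap of frequent buckets.
    bitmap = [c >= s for c in bucket_counts]

    # Pass 2: count a pair only if both items are frequent AND the pair
    # hashes to a frequent bucket.
    pair_counts = Counter()
    for basket in read_baskets():
        kept = sorted(i for i in basket if i in frequent_items)
        for pair in combinations(kept, 2):
            if bitmap[bucket_of(pair)]:
                pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}

baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
           {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"}]
print(pcy_pairs(lambda: iter(baskets), 3))  # same frequent pairs as A-Priori
```

In a real implementation the bitmap costs one bit per bucket, so the Pass-1 bucket counts compress roughly 32-fold before Pass 2 begins; the Python list of booleans above only mimics that idea.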