Big Data MapReduce PDF

Summary

This document provides an overview of MapReduce, a programming model used in big data analysis. It explains data locality and process locality, how MapReduce jobs are executed, the components involved in running a job, and the role of combiners and partitioners, with examples. It also covers the advantages and disadvantages of combiners. These topics help in understanding the fundamental concepts of big data and MapReduce.

Full Transcript


Program Name: M.C.A
Course Name: Big Data Analytics and Visualization
By: Dr. Mrs. Ashwini Renavikar

Topics of the session
- MapReduce – concept, need and examples
- Wordcount script
- Matrix multiplication script

MapReduce
- MapReduce is a programming model that runs on Hadoop, a data analytics engine widely used for Big Data. It is used to write applications that run in parallel to process large volumes of data stored on clusters.
- In 2004, Google published a paper on MapReduce; Apache used this model for the Hadoop framework.
- Data locality – the distributed process goes to the distributed data.
- Process locality – the distributed data goes to the distributed process.
- MapReduce is based on data locality.

Election example
Every citizen going to one single place for voting – process locality
- Disadvantage: large movement of people
- Disadvantage: very costly
- Advantage: entire elections can be completed in one day
- Suitable for a small number of people
EVMs and other machinery coming to different places – data locality
- Disadvantage: large movement of data
- Advantage: election will happen in batches / phases
- Suitable for a large number of people

University paper checking example
- All papers are collected at the University.
- The papers are scanned.
- Soft copies are distributed to different places.
- Correction takes place at those different places.
- The outcome (grades) of each paper is sent back to one place.
- Results are compiled at one place.

Issues with MapReduce jobs
- Parallel distributed processing of the code
- Who controls the job?
- What if a particular part (thread) of the job fails?

MapReduce Job
- A MapReduce job is triggered from the edge node.
- The mapper is divided into parts that run on different blocks on different nodes.
- Default mappers – 1
- Maximum mappers – 4
- Individual output is generated for every mapper.
- Then all outputs are combined at some data node.

MapReduce job – Number of mappers and reducers
- For Hadoop 3.x, the total data to be processed is divided by 128 MB (the block size).
- E.g. for 10 GB of data: (10 * 1024) / 128 = 80 blocks.
- Number of mappers will be 80.
- Number of reducers = 10% of mappers = 8.
- 1 block generally has 1 mapper, but this may increase depending upon the size of the data.
- 1 mapper, 1 reducer – allowed
- Many mappers, 1 reducer – allowed
- Many mappers, many reducers – allowed
- Many mappers, many reducers + a reducer – not allowed (use Spark)

Job tracker - responsibilities

Steps in execution of MapReduce
- The job tracker is assigned a MapReduce job by a developer.
- The job tracker requests details of the data and its replication from the name node.
- The name node responds with the details.
- Task trackers are assigned to each data node.
- Every 3 seconds, the task trackers send a heartbeat to the job tracker.
- The MapReduce job is executed at the individual data nodes.
- The output of each task tracker is stored on the disk storage of its own data node.
- Once all mapper tasks are over at all data nodes, a reducer task may start on a new node or on a data node that is already part of the job.
- All outputs are transferred to the reducer node (via HTTP).
- The reducer output is stored on the disk storage of the reducer node.
- The final outcome of the reducer is sent to HDFS, and the job tracker (JT) gets the signal that the job is completed.

Diagram – MapReduce execution

- In case of failure of a particular data node, the task tracker of that data node sends a signal to the job tracker about the failure.
- Only that particular mapper is killed.
- The job tracker collects information from the name node about the replication of that particular data on another data node, and the job continues.
- If multiple blocks of the same data file reside on the same node, multiple mappers can run on that node, but 100% parallelism cannot be achieved. The task tracker (TT) waits for one mapper to complete its execution and then allocates resources to another mapper.
- By default, there can be 2 parallel threads for the same task tracker.
- Changes can be made in mapred-site.xml to increase the number of parallel instances.
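The mapper and reducer estimate worked out earlier in this section (input size divided by the block size, then 10% of the mappers as reducers) can be sketched as a small calculation. This is only an illustration of that rule of thumb; the function name `estimate_tasks` and its parameters are hypothetical, and a real cluster may choose different counts.

```python
import math

BLOCK_SIZE_MB = 128  # block size quoted in the slides for Hadoop 3.x


def estimate_tasks(data_gb, block_mb=BLOCK_SIZE_MB, reducer_ratio=0.10):
    """Return (mappers, reducers) using the slides' rule of thumb."""
    # One mapper per block; round up so a partial last block gets a mapper.
    mappers = math.ceil(data_gb * 1024 / block_mb)
    # Reducers are taken as 10% of the mappers, with at least one reducer.
    reducers = max(1, round(mappers * reducer_ratio))
    return mappers, reducers


if __name__ == "__main__":
    mappers, reducers = estimate_tasks(10)  # the 10 GB example
    print("mappers:", mappers, "reducers:", reducers)
```

Rounding up the block count is an assumption made here so that a final, partly filled block still gets its own mapper.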
I-O formats for Mappers and Reducers
- For a mapper there is an input and an output.
- For a reducer there is an input and an output.
- All inputs and outputs have to be in key-value pair format only.
- There can be a text file input where the offset of the record is the key and the rest of the record is the value. E.g. if 1,AAR,Mobile,45000 is the record, then for the first record 0 is the offset; the record is 18 characters long, so for the next record 19 will be the offset (add 1 to the length of the previous record, for the line break).
- For a key-value pair input, the key will be up to the first tab and the rest of the record will be the value. E.g. 111 AAR,Mobile 45000

How to run MapReduce job
1. Client: Submits the MapReduce job.
2. Yarn node manager: Launches and monitors the compute containers on machines in the cluster.
3. Yarn resource manager: Coordinates the allocation of computing resources on the cluster.
4. MapReduce application master: Coordinates the tasks running the MapReduce job.
5. Distributed filesystem: Shares job files with other entities.

Steps in MapReduce

Combine and Partition
There are two intermediate steps between Map and Reduce.

Combine is an optional process. The combiner is a reducer that runs individually on each mapper server. It reduces the data on each mapper further to a simplified form before passing it downstream. This makes shuffling and sorting easier as there is less data to work with. Often, the combiner class is set to the reducer class itself, which works when the reduce function is commutative and associative. However, if needed, the combiner can be a separate class as well.

Partition is the process that translates the key-value pairs resulting from mappers to another set of key-value pairs to feed into the reducer. It decides how the data has to be presented to the reducer and also assigns it to a particular reducer. The default partitioner determines the hash value for the key resulting from the mapper, and assigns a partition based on this hash value. There are as many partitions as there are reducers.
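The ideas above (offset-keyed text input, a combiner that pre-aggregates on the mapper side, and a hash-based partitioner) can be sketched in plain Python. This is a minimal simulation, not Hadoop code: all function names are hypothetical, the second sample record is made up, and CRC32 is used only as a stand-in for the key hash that a real partitioner would compute.

```python
import zlib
from collections import defaultdict


def read_records(text):
    """Yield (byte offset, line) pairs, like a text-file input to a mapper."""
    offset = 0
    for line in text.split("\n"):
        yield offset, line
        offset += len(line.encode("utf-8")) + 1  # +1 for the line break


def mapper(offset, line):
    """Emit (field, 1) for every comma-separated field in the record."""
    for word in line.split(","):
        yield word, 1


def combiner(pairs):
    """Sum counts per key on the mapper side, before anything is shuffled."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def partition(key, num_reducers):
    """Pick a reducer from a hash of the key (CRC32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


if __name__ == "__main__":
    text = "1,AAR,Mobile,45000\n2,AAR,Laptop,45000"  # second record is made up
    records = list(read_records(text))
    print("offsets:", [offset for offset, _ in records])
    mapped = [pair for off, line in records for pair in mapper(off, line)]
    combined = combiner(mapped)
    print(len(mapped), "pairs before combine,", len(combined), "after")
    for key, count in combined:
        print(key, count, "-> reducer", partition(key, 2))
```

Because summing counts is commutative and associative, the same summing logic could serve as both combiner and reducer in this sketch.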
So, once the partitioning is complete, the data from each partition is sent to a specific reducer.

MapReduce Combiner - What is a combiner?
- A combiner always works between the Mapper and the Reducer.
- The output produced by the Mapper is the intermediate output in terms of key-value pairs, which can be massive in size. Feeding this huge output directly to the Reducer increases network congestion.
- To minimize this network congestion, we put a combiner between the Mapper and the Reducer. Combiners are also known as semi-reducers.
- Adding a combiner to a MapReduce program is optional, not required.
- A combiner is also a class in our Java program, like the Map and Reduce classes, and is used between these two classes.
- A combiner helps produce abstract details or a summary of very large datasets. When very large datasets are processed with Hadoop, a combiner can substantially improve overall performance.
- The key-value pairs generated by the Mapper are known as the intermediate key-value pairs or intermediate output of the Mapper.

How combiner works
- Since the MapReduce Combiner lacks a predefined interface of its own, it must implement the Reducer interface's reduce method.
- Each map output with a key is processed by the combiner, so outputs that share a key are handled the same way the Reducer class would handle them.
- Because it replaces the map's original output data, the combiner can produce summary information even with a large dataset.
- When a MapReduce job is run on a large dataset, the map class generates a sizable amount of intermediate data. Handing all of it to the reducer for further processing causes significant network congestion.

Combiner example

Advantages of combiner
- Reduces the time taken for transferring the data from Mapper to Reducer.
- Reduces the size of the intermediate output generated by the Mapper.
- Improves performance by minimizing network congestion.
- Reduces the workload on the Reducer: by performing some aggregation or reduction on the data in the Mapper phase itself, combiners reduce the number of records that are passed on to the Reducer, which can help improve overall performance.
- Improves fault tolerance: in case of a node failure, the MapReduce job can be re-executed from the point of failure. Since combiners perform some of the aggregation or reduction tasks in the Mapper phase itself, the amount of work that needs to be re-executed is reduced.

Disadvantages of combiner
- The intermediate key-value pairs generated by Mappers are stored on local disk, and combiners run later on to partially reduce the output, which results in expensive disk input-output.
- The MapReduce job cannot depend on the combiner, because there is no guarantee that it will be executed.
- Increased resource usage: combiners require additional CPU and memory resources to perform their operations. This can be especially problematic in large-scale MapReduce jobs that process huge amounts of data.
- Combiners may not always be effective: how much they reduce the data transferred between the Mapper and Reducer depends on the data being processed and the operations being performed. In some cases, using combiners may actually increase the amount of data transferred, which can reduce overall performance.

MapReduce partitioner
A partitioner works like a condition in processing an input dataset. The partition phase takes place after the Map phase and before the Reduce phase.
The number of partitioners is equal to the number of reducers. That means a partitioner will divide the data according to the number of reducers. Therefore, the data passed from a single partitioner is processed by a single Reducer.

A partitioner partitions the key-value pairs of the intermediate Map outputs. It partitions the data using a user-defined condition, which works like a hash function. The total number of partitions is the same as the number of Reducer tasks for the job. Let us take an example to understand how the partitioner works.

Map Tasks
The map task accepts key-value pairs as input, while we have the text data in a text file. The input for this map task is as follows:

Input: The key would be a pattern such as "any special key + filename + line number" (example: key = @input1) and the value would be the data in that line (example: value = 1201 \t gopal \t 45 \t Male \t 50000).

Method: The operation of this map task is as follows.
- Read the value (record data), which comes as the input value from the argument list, as a string.
- Using the split function, separate the gender field (index 3 of the record) and store it in a string variable.

    String[] str = value.toString().split("\t", -3);
    String gender = str[3];

- Send the gender information and the record data value as the output key-value pair from the map task to the partition task.

    context.write(new Text(gender), new Text(value));

- Repeat all the above steps for all the records in the text file.

Output: You will get the gender data and the record data value as key-value pairs.

Partitioner Task
The partitioner task accepts the key-value pairs from the map task as its input. Partition implies dividing the data into segments. According to the given conditional criteria of partitions, the input key-value paired data can be divided into three parts based on the age criteria.

Input: The whole data in a collection of key-value pairs.
key = Gender field value in the record.
value = Whole record data value of that gender.
Method: The process of partition logic runs as follows.
- Read the age field value from the input key-value pair.
- Check the age value against the following conditions:
  - Age less than or equal to 20
  - Age greater than 20 and less than or equal to 30
  - Age greater than 30

Output: The whole data of key-value pairs is segmented into three collections of key-value pairs. The Reducer works individually on each collection.

Reduce Tasks
The number of partitioner tasks is equal to the number of reducer tasks. Here we have three partitioner tasks and hence we have three Reducer tasks to be executed.

Input: The Reducer will execute three times, each time with a different collection of key-value pairs.
key = gender field value in the record.
value = the whole record data of that gender.

Method: The following logic will be applied on each collection.
1. Read the Salary field value of each record.

    String[] str = val.toString().split("\t", -3);

   Note: str[4] has the salary field value.
2. Check the salary against the max variable. If str[4] is greater than max, assign str[4] to max; otherwise skip the step.

    if (Integer.parseInt(str[4]) > max) {
        max = Integer.parseInt(str[4]);
    }

3. Repeat Steps 1 and 2 for each key collection (Male and Female are the key collections). After executing these three steps, you will find one max salary from the Male key collection and one max salary from the Female key collection.

    context.write(new Text(key), new IntWritable(max));

Output: Finally, you will get a set of key-value pair data in three collections of different age groups. It contains the max salary from the Male collection and the max salary from the Female collection in each age group respectively.

After executing the Map, the Partitioner, and the Reduce tasks, the three collections of key-value pair data are stored in three different files as the output. All the three tasks are treated as MapReduce jobs.
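The Map, Partitioner, and Reduce tasks walked through above can be simulated end to end in a few lines of Python. This is a sketch of the logic only, not the Hadoop Java API: the function names are hypothetical, and every sample record except the first (whose layout is quoted in the text) is made up for illustration.

```python
from collections import defaultdict

# Tab-separated records: id, name, age, gender, salary.
# Only the first row comes from the text; the rest are made-up samples.
RECORDS = [
    "1201\tgopal\t45\tMale\t50000",
    "1202\tperson_b\t40\tFemale\t51000",
    "1203\tperson_c\t34\tMale\t30000",
    "1204\tperson_d\t30\tMale\t31000",
    "1205\tperson_e\t20\tMale\t40000",
    "1206\tperson_f\t25\tFemale\t35000",
]


def map_task(record):
    """Emit (gender, whole record) for one input record."""
    fields = record.split("\t")
    return fields[3], record


def partition_task(record):
    """Return the partition number (0, 1 or 2) based on the age field."""
    age = int(record.split("\t")[2])
    if age <= 20:
        return 0
    if age <= 30:
        return 1
    return 2


def reduce_task(pairs):
    """Return the maximum salary per gender within one partition."""
    max_salary = {}
    for gender, record in pairs:
        salary = int(record.split("\t")[4])
        if salary > max_salary.get(gender, -1):
            max_salary[gender] = salary
    return max_salary


def run_job(records):
    """Map every record, route it to a partition, then reduce each partition."""
    partitions = defaultdict(list)
    for record in records:
        gender, value = map_task(record)
        partitions[partition_task(value)].append((gender, value))
    return {p: reduce_task(pairs) for p, pairs in sorted(partitions.items())}


if __name__ == "__main__":
    for partition, result in run_job(RECORDS).items():
        print("partition", partition, result)
```

Each entry of the returned dictionary plays the role of one of the three output files, one per age group.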
The following requirements and specifications of these jobs should be specified in the Configurations:
- Job name
- Input and Output formats of keys and values
- Individual classes for Map, Reduce, and Partitioner tasks

Computing selection and projection in MapReduce
- Selection: selection (the WHERE clause in SQL) lets you apply a condition over the data you have and only get the rows that satisfy the condition.
- Projection: in order to select some columns only, we use the projection operator. It is analogous to SELECT in SQL.

Grouping and aggregation (sum, count, max, min, etc.) in MapReduce
Grouping and Aggregation: Group rows based on some set of columns and apply some aggregation (sum, count, max, min, etc.) on some column of the small groups that are formed. This corresponds to GROUP BY in SQL.

MapReduce Aggregation code in Python
The session below uses the aggregation pipeline of PyMongo; db is assumed to be an already connected database with a things collection whose documents have a tags array.

    >>> from bson.son import SON
    >>> pipeline = [
    ...     {"$unwind": "$tags"},
    ...     {"$group": {"_id": "$tags", "count": {"$sum": 1}}},
    ...     {"$sort": SON([("count", -1), ("_id", -1)])}
    ... ]
    >>> import pprint
    >>> pprint.pprint(list(db.things.aggregate(pipeline)))
    [{u'_id': u'cat', u'count': 3},
     {u'_id': u'dog', u'count': 2},
     {u'_id': u'mouse', u'count': 1}]
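The selection, projection, and grouping operations described above can also be sketched as plain map and reduce functions, without a database. This is a minimal in-memory illustration: the rows, column names, and function names are all hypothetical, and the `shuffle` helper only imitates the grouping by key that the framework performs between Map and Reduce.

```python
from collections import defaultdict

# Hypothetical rows, used only for illustration.
ROWS = [
    {"name": "a", "dept": "sales", "salary": 300},
    {"name": "b", "dept": "sales", "salary": 500},
    {"name": "c", "dept": "hr", "salary": 400},
    {"name": "d", "dept": "hr", "salary": 100},
]


def map_select_project(row, min_salary, columns):
    """Selection (WHERE) followed by projection (SELECT columns)."""
    if row["salary"] >= min_salary:           # selection: keep matching rows
        yield tuple(row[c] for c in columns)  # projection: keep some columns


def map_group(row):
    """Emit (grouping key, value) so that rows are grouped by department."""
    yield row["dept"], row["salary"]


def shuffle(pairs):
    """Collect all values that share a key, as the shuffle phase would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_aggregate(key, values):
    """Aggregate one group: sum, count and max of its values."""
    return key, {"sum": sum(values), "count": len(values), "max": max(values)}


if __name__ == "__main__":
    selected = [t for row in ROWS
                for t in map_select_project(row, 300, ("name", "salary"))]
    print(selected)
    grouped = shuffle(pair for row in ROWS for pair in map_group(row))
    print(dict(reduce_aggregate(k, v) for k, v in grouped.items()))
```

Selection and projection need only a map step, since each row is handled on its own; grouping needs the reduce step because all values for a key must be brought together first.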
