lec 3.pdf.pdf
MapReduce Idea

What is MapReduce (MR)? Parallel

The phases of MapReduce

The phases of a MapReduce job:
split: the data is partitioned across several computer nodes
map: a user-defined map function is applied to each chunk of data
sort & shuffle: the output of the mappers is sorted and distributed to the reducers
reduce: finally, a reduce function is applied to the data and an output is produced

The phases of MapReduce (cont.)

We have seen that a MapReduce job consists of four phases: split, map, sort & shuffle, and reduce.
Splitting, sorting and shuffling are done by the framework. The map and reduce functions are defined by the user.
The user can also interact with the splitting, sorting and shuffling phases and change their default behavior, for example by controlling how the data is split or by defining the sorting comparator.

The phases of MapReduce (cont.)

Notes
The same map (and reduce) function is applied to all the chunks in the data.
Map tasks can run in parallel with one another, and so can reduce tasks, because each task works on its own chunk (or key group) independently of the others.
The split (the division of the input among map tasks) is not the same as the internal partitioning of the stored data into blocks.

The phases of MapReduce (cont.)

The shuffling and sorting phase is often the most costly phase of a MapReduce job.
The mapper takes unsorted data as input and emits key-value pairs.
The purpose of sorting is to give the reducer data that is already grouped by key. This way a reducer can start working as soon as a group (identified by a key) is complete.
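The four phases above can be illustrated with a small single-machine sketch. This is not how Hadoop implements them; it only mimics the data flow. The names `split`, `run_job`, `map_fn` and `reduce_fn`, and the sample temperature readings, are hypothetical and chosen for illustration.

```python
from itertools import groupby
from operator import itemgetter


def split(records, num_splits):
    """Split phase: partition the input into roughly equal chunks."""
    size = max(1, -(-len(records) // num_splits))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def run_job(records, map_fn, reduce_fn, num_splits=2):
    """Run split, map, sort & shuffle, and reduce in sequence (no real parallelism)."""
    chunks = split(records, num_splits)

    # Map phase: the same user-defined map_fn is applied to every chunk.
    pairs = [pair for chunk in chunks for record in chunk for pair in map_fn(record)]

    # Sort & shuffle phase: sorting by key makes equal keys adjacent,
    # so each reducer call receives one complete group.
    pairs.sort(key=itemgetter(0))
    groups = groupby(pairs, key=itemgetter(0))

    # Reduce phase: the user-defined reduce_fn is applied to each key group.
    return [reduce_fn(key, [value for _, value in group]) for key, group in groups]


# Hypothetical user-defined functions: maximum temperature per city.
def map_fn(line):
    city, temp = line.split(",")
    yield city, int(temp)


def reduce_fn(city, temps):
    return city, max(temps)


readings = ["cairo,35", "giza,33", "cairo,38", "luxor,41", "giza,36"]
result = run_job(readings, map_fn, reduce_fn, num_splits=2)
print(result)  # [('cairo', 38), ('giza', 36), ('luxor', 41)]
```

In this sketch only `map_fn` and `reduce_fn` are written by the user; `split` and the sorting and grouping inside `run_job` play the role of the framework.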
Map Task / Reduce Task

MapReduce Daemons
Job Tracker
Task Tracker

JobTracker

The JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop.
The JobTracker performs the following actions in Hadoop:
It accepts MapReduce jobs from client applications.
It talks to the NameNode to determine the location of the data.
It locates available TaskTracker nodes.
It submits the work to the chosen TaskTracker node.

TaskTracker

A TaskTracker node accepts map, reduce or shuffle operations from the JobTracker.
It is configured with a set of slots, which indicate the number of tasks it can accept.
The JobTracker looks for a free slot when it assigns a task.
The TaskTracker notifies the JobTracker of the success or failure of its tasks.
The TaskTracker also sends heartbeat signals to the JobTracker to confirm that it is available, and reports the number of free slots it has.

Popular tasks for MapReduce
Distributed Grep
Distributed Word Count

Distributed Grep

grep is a command-line utility for searching plain-text data sets for lines matching a regular expression.

[Figure: very big data is divided into splits; grep runs on each split and produces matches; cat concatenates them into all matches.]

Distributed Word Count

Word count counts the words in a collection of documents (typically the number of occurrences of each word).

[Figure: very big data is divided into splits; count runs on each split and produces a count; merge combines them into the merged count.]
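The two tasks above, distributed grep and distributed word count, can be sketched on one machine as follows. Each split is processed on its own, and the partial results are then concatenated (grep) or merged (word count). The function names and the sample splits are hypothetical, and the loops run sequentially where a real cluster would process the splits in parallel.

```python
import re
from collections import Counter


def grep_split(lines, pattern):
    """Per-split work for grep: keep the lines of one split that match."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


def distributed_grep(splits, pattern):
    """Run grep on every split, then concatenate ("cat") the matches."""
    matches = []
    for lines in splits:
        matches.extend(grep_split(lines, pattern))
    return matches


def count_split(lines):
    """Per-split work for word count: occurrences of each word in one split."""
    return Counter(word for line in lines for word in line.split())


def distributed_word_count(splits):
    """Count words in every split, then merge the partial counts."""
    total = Counter()
    for lines in splits:
        total.update(count_split(lines))
    return total


# Hypothetical input, already divided into two splits.
splits = [
    ["very big data", "map and reduce"],
    ["big data is big", "hadoop runs jobs"],
]

print(distributed_grep(splits, r"big"))       # ['very big data', 'big data is big']
print(distributed_word_count(splits)["big"])  # 3
```

Because each call to `grep_split` or `count_split` only looks at its own split, the calls do not depend on one another, which is what lets a framework hand them to different nodes.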