Questions and Answers
Who developed MapReduce?
Dean and Ghemawat at Google
MapReduce is designed to process small volumes of data.
False (B)
In MapReduce, the workload is divided into what?
multiple independent tasks
In MapReduce, each task's work is performed in isolation from others.
When using MapReduce, what aspects do you not have to worry about handling?
What type of data does the Map function take as input?
What is the output of the Map function in MapReduce?
Input data is automatically partitioned into how many shards?
The framework groups together intermediate values with different intermediate keys.
What type of data does the Reduce function take as input?
What type of data does the Reduce function produce as output?
Map workers _____ the data by keys.
Map data will be processed by Reduce workers.
Each Reduce worker needs to read its partition from every Map worker.
What gets notified by the master about the location of intermediate files for its partition?
What does the Reduce worker sort the data by?
Master pings each worker non-periodically
If no response is received within a certain time, what is marked as failed?
Map or reduce tasks given to this worker are _____ back to the initial state and rescheduled for other workers.
Flashcards
What is MapReduce?
A programming model for processing large datasets in a distributed and parallel manner.
Who developed MapReduce?
Dean and Ghemawat developed MapReduce at Google.
How does MapReduce process data?
MapReduce efficiently processes large volumes of data using commodity computers in parallel.
How does MapReduce break down tasks?
How do the tasks work?
What is Parallelization?
What is Data distribution?
What is Load balancing?
What is Fault tolerance?
What is the Map function?
What is input data sharding?
What does Map do with data?
How does the framework group intermediate values?
What is the Reduce function?
Merging values for reduction
How are Reduce workers assigned?
Data division.
What is the Master's Role?
How are tasks assigned?
Number of partitions.
Worker Task
What does Mapper produce?
Partitioning of data.
Reduce Task: Sorting
What does Reduce worker do?
What is shuffle?
Sort phase.
What is given to the reduce function?
Ending output task.
What happens after completed?
Study Notes
- MapReduce is a programming model for processing large datasets in a distributed, parallel manner
Development and Design
- Developed by Dean and Ghemawat at Google in 2004
- Designed to efficiently process large volumes of data
- Uses commodity computers working in parallel
- Divides workload into multiple independent tasks
- Schedules tasks across cluster nodes
- Each task's work is performed in isolation from the others
Aspects the framework handles for you
- Parallelization
- Data distribution
- Load balancing
- Fault tolerance
Map stage function
- Transforms an input shard into intermediate key/value pairs
- Input data is automatically partitioned into M shards
- Discards unnecessary data and generates (key, value) pairs
- Framework groups intermediate values with the same intermediate key and passes them to the Reduce function
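The Map stage can be sketched in a few lines of Python. This is a minimal, single-process illustration, not how any real framework is implemented: the `city,temperature` input format and the names `map_fn` and `group_by_key` are assumptions made up for the example.

```python
from collections import defaultdict

def map_fn(shard_lines):
    """Hypothetical user Map function: emit (city, temperature) pairs."""
    for line in shard_lines:
        parts = line.split(",")
        if len(parts) != 2:
            continue  # discard data the job does not need (malformed lines)
        city, temp = parts
        yield (city.strip(), int(temp))

def group_by_key(pairs):
    """Stand-in for the framework: gather values that share an intermediate key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

shard = ["Oslo,3", "Cairo,31", "garbage line", "Oslo,7"]
intermediate = list(map_fn(shard))
grouped = group_by_key(intermediate)
print(grouped)  # {'Oslo': [3, 7], 'Cairo': [31]}
```

In a real deployment the grouping happens across machines during the shuffle; here it is collapsed into one dictionary so the data flow is easy to follow.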
Reduce stage function
- Converts intermediate key/value pairs into result files
- Input comprises a key and the set of values for that key
- Merges the values to form a smaller set of values
- Reduce work is distributed across workers by partitioning the intermediate key space into R pieces with a partitioning function
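As a rough sketch of what "merging the values into a smaller set" can look like, the hypothetical `reduce_fn` below collapses each key's list of values into a single maximum. The input dictionary and the choice of `max` are assumptions for illustration only.

```python
def reduce_fn(key, values):
    """Hypothetical user Reduce function: merge a key's values into one value."""
    return (key, max(values))

# Grouped intermediate data, as the framework would hand it to Reduce.
grouped = {"Oslo": [3, 7], "Cairo": [31]}

# Reduce is called once per key; keys are visited in sorted order.
results = [reduce_fn(key, values) for key, values in sorted(grouped.items())]
print(results)  # [('Cairo', 31), ('Oslo', 7)]
```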
Step 1: Split Input Files
- Input data is divided into M pieces (shards)
- Results in M map tasks; with M workers, each worker processes one shard
- With fewer workers than shards, a worker processes several shards; for example, 2 workers sharing 4 shards process 2 shards each
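The splitting step can be illustrated with a small sketch. It assumes the input is already a list of records and spreads shards over workers round-robin; both `split_into_shards` and `assign_round_robin` are hypothetical helpers, and a real system would typically split by bytes and assign shards dynamically.

```python
def split_into_shards(records, m):
    """Split records into m contiguous shards of nearly equal size."""
    base, extra = divmod(len(records), m)
    shards, start = [], 0
    for i in range(m):
        size = base + (1 if i < extra else 0)  # first `extra` shards get one more
        shards.append(records[start:start + size])
        start += size
    return shards

def assign_round_robin(num_shards, num_workers):
    """Map each worker id to the list of shard ids it will process."""
    assignment = {w: [] for w in range(num_workers)}
    for shard_id in range(num_shards):
        assignment[shard_id % num_workers].append(shard_id)
    return assignment

records = [f"record-{i}" for i in range(10)]
shards = split_into_shards(records, m=4)
sizes = [len(s) for s in shards]
assignment = assign_round_robin(num_shards=4, num_workers=2)
print(sizes)       # [3, 3, 2, 2]
print(assignment)  # {0: [0, 2], 1: [1, 3]}
```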
Step 2: Fork Processes
- Multiple copies of the program start on a cluster of machines
- One master acts as scheduler and coordinator
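- Many workers, each acting as a map worker or a reduce worker
- Idle workers are assigned map tasks; each map task works on one shard, so there are M map tasks
- Idle workers are also assigned reduce tasks, which work on intermediate files; there are R reduce tasks, one per partition, with R defined by the user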
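The master's role as scheduler can be sketched as a toy simulation. The function `schedule` and the worker names are hypothetical, and the simulation assumes every task finishes instantly, so it only shows the hand-out order, not real timing. In practice a reduce task cannot finish until all map output is available.

```python
from collections import deque

def schedule(m, r, idle_workers):
    """Hand out M map tasks, then R reduce tasks, to whichever worker is idle."""
    pending = deque([("map", i) for i in range(m)] +
                    [("reduce", j) for j in range(r)])
    idle = deque(idle_workers)
    log = []
    while pending:
        worker = idle.popleft()   # next idle worker
        task = pending.popleft()  # next unassigned task
        log.append((worker, task))
        idle.append(worker)       # simulation only: the worker is idle again at once
    return log

log = schedule(m=3, r=2, idle_workers=["w0", "w1"])
for worker, task in log:
    print(worker, task)
```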
Step 3: Run Map tasks
- Reads content of the assigned input shard.
- Parses key/value pairs from the input data.
- Passes each pair to a user-defined map function, which produces intermediate key/value pairs that are buffered in memory
Step 4: Partitioning
- Occurs after the intermediate key/value pairs are produced
- The pairs produced by the user's map function are buffered in memory and periodically written to the local disk
- The map worker partitions the data by keys
- A partitioning function divides the data into R regions, where R is the number of reduce tasks
- Map data is processed by reduce workers
- The user's reduce function is called once for each unique key generated
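A minimal sketch of buffering and partitioning on a map worker follows. It assumes a partitioning function of the form "stable hash of the key, modulo R"; `zlib.crc32` is used because Python's built-in `hash()` for strings varies between processes, which would make different map workers disagree. The spill threshold of 2 and the in-memory `regions` dictionary (standing in for files on local disk) are assumptions for the example.

```python
import zlib

R = 3  # number of reduce tasks, chosen by the user

def partition(key, r=R):
    """Assumed partitioning function: a stable hash of the key, modulo R."""
    return zlib.crc32(key.encode("utf-8")) % r

def spill(buffer, regions):
    """Move buffered pairs into their regions (stand-in for a write to local disk)."""
    for key, value in buffer:
        regions[partition(key)].append((key, value))
    buffer.clear()

regions = {i: [] for i in range(R)}
buffer = []
for pair in [("apple", 1), ("pear", 1), ("apple", 1), ("fig", 1)]:
    buffer.append(pair)
    if len(buffer) >= 2:   # hypothetical spill threshold
        spill(buffer, regions)
spill(buffer, regions)     # flush whatever is left
print(regions)
```

Because the function depends only on the key, every pair with the same key lands in the same region, so a single reduce worker sees all of that key's values.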
Step 5: Reduce Task: Sorting
- The reduce worker is notified by the master of the location of the intermediate files for its partition
- Shuffle: uses RPCs to read the data from the local disks of the map workers
- Sort: sorts the data by intermediate key
- Gathers all occurrences of the same key together
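The shuffle and sort on a reduce worker can be sketched as below. The remote reads are replaced by a plain dictionary, `map_outputs`, that pretends to hold one partition as stored by three map workers; the worker names and helper functions are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Partition 0 as held by three hypothetical map workers (stand-in for RPC reads).
map_outputs = {
    "map-0": [("pear", 1), ("apple", 1)],
    "map-1": [("apple", 1)],
    "map-2": [("fig", 1), ("pear", 1)],
}

def shuffle(outputs):
    """Fetch this reducer's partition from every map worker."""
    fetched = []
    for worker in sorted(outputs):
        fetched.extend(outputs[worker])
    return fetched

def sort_and_group(pairs):
    """Sort by intermediate key, then gather all values of each key together."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]

grouped = sort_and_group(shuffle(map_outputs))
print(grouped)  # [('apple', [1, 1]), ('fig', [1]), ('pear', [1, 1])]
```

Sorting first matters here: `itertools.groupby` only merges adjacent items, so the pairs must be ordered by key before grouping.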
Step 6: Reduce Task: Reduce
- The sort phase has grouped the data by unique intermediate key
- The user's Reduce function receives each key and its set of intermediate values
- The output of the Reduce function is appended to an output file
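The following sketch shows one way a reduce task might call the user's function once per key and append each result to an output file. The summing `reduce_fn`, the tab-separated line format, and the file name `part-00000` are all assumptions for illustration.

```python
import os
import tempfile

def reduce_fn(key, values):
    """Hypothetical user Reduce function: sum the values of one key."""
    return sum(values)

def run_reduce_task(grouped, output_path):
    """Call reduce_fn once per unique key and append each result to the output file."""
    with open(output_path, "a", encoding="utf-8") as out:
        for key, values in grouped:
            out.write(f"{key}\t{reduce_fn(key, values)}\n")

# Input as left by the sort phase: one entry per unique key.
grouped = [("apple", [1, 1]), ("fig", [1]), ("pear", [1, 1])]
output_path = os.path.join(tempfile.mkdtemp(), "part-00000")  # assumed file name
run_reduce_task(grouped, output_path)

with open(output_path, encoding="utf-8") as f:
    output_text = f.read()
print(output_text)
```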
Step 7: Return to User
- When all map and reduce tasks finish, the master wakes up the user program
- The MapReduce call in the user program returns and allows the program to resume execution
- The output of MapReduce is available in R output files
Map and Reduce Worker definitions
- Map worker: parses the input into key/value pairs, then writes them to intermediate files
- Shuffle and sort: fetches the relevant partition of the output from every mapper and sorts it by key
- Reduce: takes the sorted output of the mappers as input and calls the user's reduce function once per key, with that key's values, to aggregate the results
Example function: word count
- Counts occurrences of each word in a collection of documents
- Map parses the input and outputs each word with a count of one
- Reduce receives the pairs sorted by key (word) and sums the counts for each word
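The word count example can be written as a short single-process sketch. The function names and the sample documents are made up, and the map, group, sort, and reduce phases all run in one loop here rather than on separate workers.

```python
from collections import defaultdict

def map_fn(document):
    """Emit (word, 1) for every word in one document."""
    for word in document.lower().split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Sum the counts emitted for one word."""
    return sum(counts)

def word_count(documents):
    grouped = defaultdict(list)
    for doc in documents:                 # map phase
        for word, one in map_fn(doc):
            grouped[word].append(one)     # group values by key
    # Visit keys in sorted order and reduce each one.
    return {word: reduce_fn(word, grouped[word]) for word in sorted(grouped)}

counts = word_count(["the quick fox", "the lazy dog", "The fox"])
print(counts)  # {'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```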
Fault Tolerance in MapReduce
- To achieve fault tolerance, the master periodically pings each worker
- If no response is received within a set time, the worker is marked as failed
- Map or reduce tasks assigned to that worker are reset to the initial state and rescheduled on other workers
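- Map tasks that ran on the failed worker are re-executed, because their output was stored on that worker's local disk
- Workers running reduce tasks are notified of any re-execution, so they read the data from the new map worker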
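The failure handling above can be sketched as a single check the master might run after each round of pings. Everything here is an assumption for illustration: the 10-second timeout, the task dictionaries, and the `check_workers` function. The sketch also resets completed map tasks on a failed worker, on the assumption that their output lived on that worker's local disk and is no longer reachable.

```python
TIMEOUT = 10.0  # assumed: seconds without a ping response before a worker is failed

def check_workers(last_response, tasks, now, timeout=TIMEOUT):
    """Mark silent workers as failed and reset their tasks to the initial state."""
    failed = {w for w, t in last_response.items() if now - t > timeout}
    for task in tasks:
        if task["worker"] in failed:
            # In-progress tasks are reset; completed map tasks are reset too,
            # since their output was on the failed worker's local disk.
            if task["state"] == "in_progress" or task["kind"] == "map":
                task["state"] = "idle"   # eligible for rescheduling
                task["worker"] = None
    return failed

last_response = {"w0": 100.0, "w1": 85.0}  # time of each worker's last ping reply
tasks = [
    {"id": "map-0", "kind": "map", "state": "completed", "worker": "w1"},
    {"id": "reduce-0", "kind": "reduce", "state": "in_progress", "worker": "w1"},
    {"id": "map-1", "kind": "map", "state": "in_progress", "worker": "w0"},
]
failed = check_workers(last_response, tasks, now=101.0)
print(failed)  # {'w1'}
```

After the call, both tasks that were on `w1` are back in the idle state with no worker assigned, while the task on `w0` is untouched.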