MapReduce: Processing Big Data


Questions and Answers

Who developed MapReduce?

Dean and Ghemawat at Google

MapReduce is designed to process small volumes of data.

False (B)

In MapReduce, the workload is divided into what?

multiple independent tasks

In MapReduce, each task's work is performed in isolation from others.

True (A)

When using MapReduce, what aspects do you not have to worry about handling?

All of the above (E) — parallelization, data distribution, load balancing, and fault tolerance

What type of data does the Map function take as input?

input shard

What is the output of the Map function in MapReduce?

intermediate key/value pairs

Input data is automatically partitioned into how many shards?

M shards

The framework groups together intermediate values with different intermediate keys.

False (B)

What type of data does the Reduce function take as input?

intermediate key/value pairs

What type of data does the Reduce function produce as output?

result files

Map workers _____ the data by keys.

partition

Map data will be processed by Reduce workers.

True (A)

Each Reduce worker needs to read its partition from every Map worker.

True (A)

What gets notified by the master about the location of intermediate files for its partition?

Reduce worker

What does the Reduce worker sort the data by?

intermediate keys

The master pings each worker non-periodically.

False (B)

If no response is received within a certain time, what is marked as failed?

worker

Map or reduce tasks given to this worker are _____ back to the initial state and rescheduled for other workers.

reset

Flashcards

What is MapReduce?

A programming model for processing large datasets in a distributed and parallel manner.

Who developed MapReduce?

Dean and Ghemawat developed MapReduce at Google.

How does MapReduce process data?

MapReduce efficiently processes large volumes of data using commodity computers in parallel.

How does MapReduce break down tasks?

It divides work into independent tasks and schedules them across cluster nodes.

How do the tasks work?

Each task in MapReduce is performed in isolation from the others.

What is Parallelization?

Splitting the data into smaller parts.

What is Data distribution?

Distributing the data across multiple machines or nodes in a cluster.

What is Load balancing?

Distributing workload to ensure no single machine is overwhelmed.

What is Fault tolerance?

The ability of a system to continue operating properly in the event of failures.

What is the Map function?

The map function transforms input shards into key-value pairs.

What is input data sharding?

Splitting input data into smaller, manageable parts.

What does Map do with data?

It discards irrelevant data and creates sets of key-value pairs.

How does the framework group intermediate values?

It groups intermediate values with the same key and passes them to the Reduce function.

What is the Reduce function?

It transforms intermediate key/value pairs into result files.

Merging values for reduction

Combines and merges values to form a more concise set.

How are Reduce workers assigned?

The intermediate key space is split into R pieces via a partitioning function, one per reduce task.

Data division.

Divides the input data into M pieces.

What is the Master's Role?

Scheduler and coordinator.

How are tasks assigned?

Workers receive tasks from the master.

Number of partitions.

The number of partitions (R) is defined by the user.

Worker Task

Each map worker reads its input shard and parses it into key/value pairs.

What does Mapper produce?

Intermediate data.

Partitioning of data.

Partitioning intermediate key/value pairs into R regions for the reduce workers.

Reduce Task: Sorting

Sorts the data by intermediate keys so that the same keys are together.

What does Reduce worker do?

The worker gets notified by the master of the location of the intermediate files for its partition.

What is shuffle?

The shuffle uses RPCs to read intermediate data from the local disks of the map workers.

Sort phase.

The phase where data is sorted by intermediate key so that all occurrences of the same key are grouped together.

What is given to the reduce function?

The set of intermediate values belonging to that key.

Ending output task.

The output is appended to an output file.

What happens after all tasks complete?

The results are returned to the user.

Study Notes

  • MapReduce is a programming model for processing big data

Development and Design

  • Developed by Dean and Ghemawat at Google in 2004
  • Designed to efficiently process large volumes of data
  • Uses commodity computers working in parallel
  • Divides workload into multiple independent tasks
  • Schedules tasks across cluster nodes
  • Each task runs in isolation from the others

Handling features

  • Parallelization
  • Data distribution
  • Load balancing
  • Fault tolerance

Map stage function

  • Transforms each input shard into intermediate key/value pairs
  • Input data is automatically partitioned into M shards
  • Discards unnecessary data and generates (key, value) sets
  • Framework groups intermediate values with the same intermediate key and passes them to the Reduce function
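As a concrete sketch of the Map stage, a word-count map function can be written as a Python generator that turns one input shard into intermediate (key, value) pairs; the function name and the plain-text shard format are illustrative assumptions, not part of the original framework.

```python
def map_wordcount(shard: str):
    """Map function sketch: transform an input shard (here, a chunk
    of text) into intermediate (key, value) pairs, one per word."""
    for word in shard.split():
        yield (word.lower(), 1)  # emit each word with a count of one

pairs = list(map_wordcount("the cat saw the dog"))
# pairs == [("the", 1), ("cat", 1), ("saw", 1), ("the", 1), ("dog", 1)]
```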

Reduce stage function

  • Converts intermediate key/value pairs into result files
  • Input comprises a key and set of values.
  • Merges the values to form a smaller set of values.
  • Reduce work is distributed by partitioning the intermediate key space into R pieces via a partitioning function
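A matching Reduce function for the word-count sketch, under the same illustrative assumptions, merges the set of values for one intermediate key into a smaller result:

```python
def reduce_wordcount(key: str, values: list):
    """Reduce function sketch: merge all values seen for one
    intermediate key into a single (key, result) pair."""
    return (key, sum(values))  # total count for this word

result = reduce_wordcount("the", [1, 1])
# result == ("the", 2)
```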

Step 1: Split Input Files

  • Input data divides into M pieces
  • Produces M map tasks, one per shard; ideally each worker processes one shard
  • With fewer workers than shards, each worker processes several shards in turn (e.g., 4 shards on 2 workers means 2 shards per worker)
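The splitting step can be sketched as simple fixed-size chunking; a real implementation splits on record boundaries, and `split_into_shards` is a hypothetical helper name.

```python
def split_into_shards(text: str, m: int):
    """Split input data into M roughly equal shards.
    (A real splitter respects record boundaries; this one may
    cut mid-record and is for illustration only.)"""
    n = len(text)
    size = (n + m - 1) // m  # ceiling division: shard size
    return [text[i:i + size] for i in range(0, n, size)]

shards = split_into_shards("abcdefgh", 4)
# shards == ["ab", "cd", "ef", "gh"]
```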

Step 2: Fork Processes

  • Multiple copies of the program start on a cluster of machines
  • One master acts as scheduler and coordinator
  • Many workers map or reduce workers
  • Idle workers are assigned tasks; there are M map tasks, one per input shard
  • There are R reduce tasks, one per partition of the intermediate files, with R defined by the user

Step 3: Run Map tasks

  • Reads content of the assigned input shard.
  • Parses key/value pairs from the input data
  • Passes each pair to a user-defined Map function, which produces intermediate key/value pairs that are buffered in memory

Step 4: Partitioning

  • It occurs after intermediate key/value pairs are produced
  • Intermediate pairs buffered by the user's map function are periodically written to the local disk
  • The map worker partitions the data by key
  • The data is partitioned into R regions (R = the number of reduce tasks) using a partitioning function
  • Map data is processed by reduce workers
  • User reduce functions are called once for each unique key generated
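The partitioning function is conventionally a hash of the key modulo R. The sketch below uses CRC32 instead of Python's built-in `hash`, which is randomized between runs; the function name is an illustrative assumption.

```python
import zlib

def partition(key: str, r: int) -> int:
    """Assign an intermediate key to one of R regions.
    CRC32 is used because it is stable across runs, unlike
    Python's salted built-in hash()."""
    return zlib.crc32(key.encode()) % r

# Every occurrence of the same key lands in the same region,
# so one reduce worker sees all values for that key.
assert partition("the", 4) == partition("the", 4)
```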

Step 5: Reduce Task

  • The reduce worker is notified by the master of the location of the intermediate files for its partition
  • It uses RPCs to read the data from the map workers' local disks (the shuffle)
  • Arranges data by intermediate key
  • Gathers occurrences of the same key
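The sort-and-group step can be sketched with the Python standard library; `sort_and_group` is an illustrative name, not an API from the framework.

```python
from itertools import groupby
from operator import itemgetter

def sort_and_group(pairs):
    """Sort intermediate pairs by key so that all occurrences of
    the same key sit together, then group the values per key."""
    pairs = sorted(pairs, key=itemgetter(0))        # arrange by intermediate key
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, [v for _, v in group]            # gather values for that key

grouped = dict(sort_and_group([("b", 1), ("a", 1), ("b", 1)]))
# grouped == {"a": [1], "b": [1, 1]}
```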

Step 6: Reduce Task

  • The sort phase groups together all values sharing the same intermediate key
  • User's Reduce function receives the key and set of intermediate values
  • The Reduce function output appends to an output file

Step 7: Return to User

  • When all map and reduce tasks finish, the master wakes up the user
  • The MapReduce call in the user program returns and allows the program to resume execution
  • The output of MapReduce is available in R output files

Map and Reduce Worker definitions

  • The map worker parses data into key/value pairs, then writes them to an intermediate file
  • Shuffle and sort fetches the relevant partitions of the mappers' output and sorts them by key
  • Reduce takes the mappers' sorted output as input and calls the user's reduce function once per key, with that key's values, to aggregate the results

Example function

  • Map counts occurrences of each word in the collection of documents: it parses the input and outputs each word with a count of one
  • Reduce sorts by key (word) and sums the counts for each word
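Putting the stages together, a minimal single-process word-count sketch (no cluster, no intermediate files, no fault tolerance) shows the data flow; all names here are illustrative.

```python
from collections import defaultdict

def map_fn(shard):
    for word in shard.split():
        yield word, 1                 # emit (word, 1) for every occurrence

def reduce_fn(word, counts):
    return word, sum(counts)          # total occurrences of the word

def mapreduce(shards, map_fn, reduce_fn):
    # Map phase: run the map function over every shard.
    intermediate = defaultdict(list)
    for shard in shards:
        for key, value in map_fn(shard):
            intermediate[key].append(value)   # shuffle: group values by key
    # Reduce phase: one call per unique intermediate key, in sorted order.
    return dict(reduce_fn(k, v) for k, v in sorted(intermediate.items()))

result = mapreduce(["the cat", "the dog"], map_fn, reduce_fn)
# result == {"cat": 1, "dog": 1, "the": 2}
```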

Fault Tolerance in Map Reduce

  • To achieve fault tolerance, the master periodically pings each worker
  • If no response occurs in a set time, the worker is marked as failed
  • Map and reduce tasks assigned to the failed worker are reset to their initial state and rescheduled on other workers
  • Completed map tasks are re-executed as well, since their output is stored on the failed worker's local disk
  • All reduce workers are notified of any re-execution
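The ping-and-timeout logic can be sketched as follows; the class and method names are assumptions for illustration, not the paper's API.

```python
import time

class Master:
    """Sketch of worker-failure detection: record the time of each
    worker's last ping response and mark a worker failed when no
    response has arrived within the timeout."""

    TIMEOUT = 10.0  # seconds without a response before marking failure

    def __init__(self, workers):
        self.last_seen = {w: time.monotonic() for w in workers}

    def record_ping(self, worker):
        # Called whenever a worker answers a ping.
        self.last_seen[worker] = time.monotonic()

    def failed_workers(self):
        # Workers whose tasks should be reset and rescheduled.
        now = time.monotonic()
        return [w for w, t in self.last_seen.items()
                if now - t > self.TIMEOUT]
```

In a real system the master would then reset the failed worker's map and reduce tasks to the idle state and hand them to other workers.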
