MapReduce: Processing Big Data


Questions and Answers

Who developed MapReduce?

Dean and Ghemawat at Google

MapReduce is designed to process small volumes of data.

False (B)

In MapReduce, the workload is divided into what?

multiple independent tasks

In MapReduce, each task's work is performed in isolation from others.

True (A)

When using MapReduce, what aspects do you not have to worry about handling?

All of the above (E) — parallelization, data distribution, load balancing, and fault tolerance

What type of data does the Map function take as input?

input shard

What is the output of the Map function in MapReduce?

intermediate key/value pairs

Input data is automatically partitioned into how many shards?

M shards

The framework groups together intermediate values with different intermediate keys.

False (B)

What type of data does the Reduce function take as input?

intermediate key/value pairs

What type of data does the Reduce function produce as output?

result files

Map workers _____ the data by keys.

partition

Map data will be processed by Reduce workers.

True (A)

Each Reduce worker needs to read its partition from every Map worker.

True (A)

What gets notified by the master about the location of intermediate files for its partition?

Reduce worker

What does the Reduce worker sort the data by?

intermediate keys

The master pings each worker non-periodically.

False (B)

If no response is received within a certain time, what is marked as failed?

worker

Map or reduce tasks given to this worker are _____ back to the initial state and rescheduled for other workers.

reset

Flashcards

What is MapReduce?

A programming model for processing large datasets in a distributed and parallel manner.

Who developed MapReduce?

Dean and Ghemawat developed MapReduce at Google.

How does MapReduce process data?

MapReduce efficiently processes large volumes of data using commodity computers in parallel.

How does MapReduce break down tasks?

It divides work into independent tasks and schedules them across cluster nodes.

How do the tasks work?

Each task in MapReduce is performed in isolation from the others.

What is Parallelization?

Splitting the data into smaller parts.

What is Data distribution?

Distributing the data across multiple machines or nodes in a cluster.

What is Load balancing?

Distributing workload to ensure no single machine is overwhelmed.

What is Fault tolerance?

The ability of a system to continue operating properly in the event of failures.

What is the Map function?

The map function transforms input shards into key-value pairs.

What is input data sharding?

Splitting input data into smaller, manageable parts.

What does Map do with data?

It discards irrelevant data and creates sets of key-value pairs.

How does the framework group intermediate values?

It groups intermediate values with the same key and passes them to the Reduce function.

What is the Reduce function?

It transforms intermediate key/value pairs into result files.

Merging values for reduction

Combines and merges values to form a more concise set.

How are Reduce workers assigned?

The intermediate key space is split into R pieces via a partitioning function, one per reduce task.

Data division.

Divides the input data into M pieces.

What is the Master's Role?

Scheduler and coordinator.

How are tasks assigned?

Workers receive tasks from the master.

Number of partitions.

The number of partitions (R) is defined by the user.

Worker Task

Each map worker reads its input shard and parses it into key/value pairs.

What does Mapper produce?

Intermediate data.

Partitioning of data.

Partitioning intermediate key/value pairs into R regions for the reduce workers.

Reduce Task: Sorting

Sorts the data by intermediate keys so that the same keys are together.

What does Reduce worker do?

The worker gets notified by the master of the location of the intermediate files for its partition.

What is shuffle?

The shuffle uses RPCs to read intermediate data from the local disks of the map workers.

Sort phase.

The phase where data is sorted by intermediate key so that all occurrences of the same key are grouped together.

What is given to the reduce function?

The set of intermediate values belonging to that key.

Ending output task.

The output is appended to an output file.

What happens after all tasks complete?

The results are returned to the user.

Study Notes

  • MapReduce is a programming model for processing big data

Development and Design

  • Developed by Dean and Ghemawat at Google in 2004
  • Designed to efficiently process large volumes of data
  • Uses commodity computers working in parallel
  • Divides workload into multiple independent tasks
  • Schedules tasks across cluster nodes
  • Each task runs in isolation from the others

Handling features

  • Parallelization
  • Data distribution
  • Load balancing
  • Fault tolerance

Map stage function

  • Transforms each input shard into intermediate key/value pairs
  • Input data is automatically partitioned into M shards
  • Discards unnecessary data and generates (key, value) sets
  • Framework groups intermediate values with the same intermediate key and passes them to the Reduce function
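As a concrete sketch of the Map stage, a word-count map function can be written as a Python generator that turns one input shard into intermediate (key, value) pairs; the function name and the plain-text shard format are illustrative assumptions, not part of the original framework.

```python
def map_wordcount(shard: str):
    """Map function sketch: transform an input shard (here, a chunk
    of text) into intermediate (key, value) pairs, one per word."""
    for word in shard.split():
        yield (word.lower(), 1)  # emit each word with a count of one

pairs = list(map_wordcount("the cat saw the dog"))
# pairs == [("the", 1), ("cat", 1), ("saw", 1), ("the", 1), ("dog", 1)]
```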

Reduce stage function

  • Converts intermediate key/value pairs into result files
  • Input comprises a key and set of values.
  • Merges the values to form a smaller set of values.
  • Reduce work is distributed by partitioning the intermediate key space into R pieces via a partitioning function
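A matching Reduce function for the word-count sketch, under the same illustrative assumptions, merges the set of values for one intermediate key into a smaller result:

```python
def reduce_wordcount(key: str, values: list):
    """Reduce function sketch: merge all values seen for one
    intermediate key into a single (key, result) pair."""
    return (key, sum(values))  # total count for this word

result = reduce_wordcount("the", [1, 1])
# result == ("the", 2)
```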

Step 1: Split Input Files

  • Input data divides into M pieces
  • Produces M map tasks, one per shard; ideally each worker processes one shard
  • With fewer workers than shards, each worker processes several shards in turn (e.g., 4 shards on 2 workers means 2 shards per worker)
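The splitting step can be sketched as simple fixed-size chunking; a real implementation splits on record boundaries, and `split_into_shards` is a hypothetical helper name.

```python
def split_into_shards(text: str, m: int):
    """Split input data into M roughly equal shards.
    (A real splitter respects record boundaries; this one may
    cut mid-record and is for illustration only.)"""
    n = len(text)
    size = (n + m - 1) // m  # ceiling division: shard size
    return [text[i:i + size] for i in range(0, n, size)]

shards = split_into_shards("abcdefgh", 4)
# shards == ["ab", "cd", "ef", "gh"]
```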

Step 2: Fork Processes

  • Multiple copies of the program start on a cluster of machines
  • One master acts as scheduler and coordinator
  • Many workers map or reduce workers
  • Idle workers are assigned tasks; there are M map tasks, one per input shard
  • There are R reduce tasks, one per partition of the intermediate files, with R defined by the user

Step 3: Run Map tasks

  • Reads content of the assigned input shard.
  • Parses key/value pairs from the input data
  • Passes each pair to a user-defined Map function, which produces intermediate key/value pairs that are buffered in memory

Step 4: Partitioning

  • It occurs after intermediate key/value pairs are produced
  • Intermediate pairs buffered by the user's map function are periodically written to the local disk
  • The map worker partitions the data by key
  • The data is partitioned into R regions (R = the number of reduce tasks) using a partitioning function
  • Map data is processed by reduce workers
  • User reduce functions are called once for each unique key generated
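The partitioning function is conventionally a hash of the key modulo R. The sketch below uses CRC32 instead of Python's built-in `hash`, which is randomized between runs; the function name is an illustrative assumption.

```python
import zlib

def partition(key: str, r: int) -> int:
    """Assign an intermediate key to one of R regions.
    CRC32 is used because it is stable across runs, unlike
    Python's salted built-in hash()."""
    return zlib.crc32(key.encode()) % r

# Every occurrence of the same key lands in the same region,
# so one reduce worker sees all values for that key.
assert partition("the", 4) == partition("the", 4)
```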

Step 5: Reduce Task

  • The reduce worker is notified by the master of the location of the intermediate files for its partition
  • It uses RPCs to read the data from the map workers' local disks (the shuffle)
  • Arranges data by intermediate key
  • Gathers occurrences of the same key
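The sort-and-group step can be sketched with the Python standard library; `sort_and_group` is an illustrative name, not an API from the framework.

```python
from itertools import groupby
from operator import itemgetter

def sort_and_group(pairs):
    """Sort intermediate pairs by key so that all occurrences of
    the same key sit together, then group the values per key."""
    pairs = sorted(pairs, key=itemgetter(0))        # arrange by intermediate key
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, [v for _, v in group]            # gather values for that key

grouped = dict(sort_and_group([("b", 1), ("a", 1), ("b", 1)]))
# grouped == {"a": [1], "b": [1, 1]}
```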

Step 6: Reduce Task

  • The sort phase groups together all values sharing the same intermediate key
  • User's Reduce function receives the key and set of intermediate values
  • The Reduce function output appends to an output file

Step 7: Return to User

  • When all map and reduce tasks finish, the master wakes up the user
  • The MapReduce call in the user program returns and allows the program to resume execution
  • The output of MapReduce is available in R output files

Map and Reduce Worker definitions

  • The map worker parses data into key/value pairs, then writes them to an intermediate file
  • Shuffle and sort fetches the relevant partitions of the mappers' output and sorts them by key
  • Reduce takes the mappers' sorted output as input and calls the user's reduce function once per key, with that key's values, to aggregate the results

Example function

  • Map counts occurrences of each word in the collection of documents: it parses the input and outputs each word with a count of one
  • Reduce sorts by key (word) and sums the counts for each word
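Putting the stages together, a minimal single-process word-count sketch (no cluster, no intermediate files, no fault tolerance) shows the data flow; all names here are illustrative.

```python
from collections import defaultdict

def map_fn(shard):
    for word in shard.split():
        yield word, 1                 # emit (word, 1) for every occurrence

def reduce_fn(word, counts):
    return word, sum(counts)          # total occurrences of the word

def mapreduce(shards, map_fn, reduce_fn):
    # Map phase: run the map function over every shard.
    intermediate = defaultdict(list)
    for shard in shards:
        for key, value in map_fn(shard):
            intermediate[key].append(value)   # shuffle: group values by key
    # Reduce phase: one call per unique intermediate key, in sorted order.
    return dict(reduce_fn(k, v) for k, v in sorted(intermediate.items()))

result = mapreduce(["the cat", "the dog"], map_fn, reduce_fn)
# result == {"cat": 1, "dog": 1, "the": 2}
```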

Fault Tolerance in Map Reduce

  • To achieve fault tolerance, the master periodically pings each worker
  • If no response occurs in a set time, the worker is marked as failed
  • Map and reduce tasks assigned to the failed worker are reset to their initial state and rescheduled on other workers
  • Completed map tasks are re-executed as well, since their output is stored on the failed worker's local disk
  • All reduce workers are notified of any re-execution
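The ping-and-timeout logic can be sketched as follows; the class and method names are assumptions for illustration, not the paper's API.

```python
import time

class Master:
    """Sketch of worker-failure detection: record the time of each
    worker's last ping response and mark a worker failed when no
    response has arrived within the timeout."""

    TIMEOUT = 10.0  # seconds without a response before marking failure

    def __init__(self, workers):
        self.last_seen = {w: time.monotonic() for w in workers}

    def record_ping(self, worker):
        # Called whenever a worker answers a ping.
        self.last_seen[worker] = time.monotonic()

    def failed_workers(self):
        # Workers whose tasks should be reset and rescheduled.
        now = time.monotonic()
        return [w for w, t in self.last_seen.items()
                if now - t > self.TIMEOUT]
```

In a real system the master would then reset the failed worker's map and reduce tasks to the idle state and hand them to other workers.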
