Big Data Analytics: Map-Reduce

WelcomeElegy

12 Questions

What does the map function do in a MapReduce environment?

The map function processes each input record and emits intermediate key-value pairs (in word counting, one (word, 1) pair per word).

What is the role of the reduce function in MapReduce?

The reduce function processes values associated with a key and produces a final output.

What are some tasks that the Map-Reduce environment takes care of?

Partitioning the input data, scheduling program execution, performing group by key step, handling machine failures, and managing inter-machine communication.

What is a common recommendation for the number of map tasks compared to the number of nodes in a cluster?

Make the number of map tasks much larger than the number of nodes in the cluster

The master node in MapReduce takes care of coordinating task status and pings workers periodically to detect failures.

True

What is the main motivation for using Map-Reduce in big data analytics?

To address challenges in distributing computation and working with big data.

What is one of the challenges addressed by Map-Reduce?

Distributing computation

Map-Reduce is an elegant way to work with big data.

True

Map-Reduce brings computation close to the ____________.

data

Match the following components with their descriptions:

Map-Reduce = Google's computational/data manipulation model
Distributed File System = Provides global file namespace
Chunk servers = Responsible for storing chunks of data spread across machines

What is the typical usage pattern of a Distributed File System like Hadoop HDFS?

Storing huge files, rarely updating in place, and frequently reading/appending data.

What does the Reduce step in Map-Reduce involve?

Aggregating and summarizing data

Study Notes

Map-Reduce Overview

  • Map-Reduce is a computational model for processing large-scale data sets in a parallel and distributed manner.
  • It's a fundamental concept in big data analytics, particularly in data mining and machine learning.

Single Node Architecture

  • Traditional data mining approach: single node with a CPU, memory, and disk.
  • Limitations: processing large datasets takes a long time, and disk I/O is a major bottleneck.

Motivation: Google Example

  • Google's example: processing 20+ billion web pages (400+ TB) would take 4 months to read the data using a single computer.
  • Solution: distribute the data across multiple nodes (commodity Linux machines) and use a commodity network (Ethernet) to connect them.

Cluster Architecture

  • A cluster consists of multiple racks, each with 16-64 nodes.
  • Each node has a CPU, memory, and disk, and is connected to a switch.
  • Nodes are connected to other nodes in the same rack and to other racks.

Large-Scale Computing Challenges

  • Distributing computation across multiple nodes is a major challenge.
  • Making it easy to write distributed programs is another challenge.
  • Machines fail, and with a large number of machines, failures are frequent.

Idea and Solution

  • Copying data over a network takes time, so bring computation close to the data.
  • Store files multiple times for reliability.
  • Map-Reduce addresses these problems by providing a scalable and reliable way to process large datasets.

Storage Infrastructure

  • Distributed File System (DFS) provides a global file namespace.
  • Google's GFS and Hadoop's HDFS are examples of DFS.
  • Typical usage pattern: huge files (100s of GB to TB), data rarely updated in place, and reads and appends are common.

Distributed File System

  • Chunk servers: files are split into contiguous chunks, typically 16-64MB, and replicated (usually 2x or 3x).
  • Master node: stores metadata about where files are stored, and might be replicated.
  • Client library: talks to the master to find chunk servers and connects directly to chunk servers to access data.

Programming Model: MapReduce

  • Sequentially read a lot of data, map it, group by key, reduce, and write the result.
  • Programmer specifies two methods: map and reduce.
  • Input: a set of key-value pairs.
  • Output: a new set of key-value pairs.

Word Counting Example

  • Map: read input and produce a set of key-value pairs.
  • Reduce: collect all values belonging to the key and output the result.
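As a rough single-process sketch (the function names and driver are ours, not part of any framework), the word-count map and reduce steps, plus the group-by-key step the environment performs between them, look like this:

```python
from collections import defaultdict

def map_fn(document):
    # Map: emit a (word, 1) pair for every word in the input document.
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: sum all the counts collected for the same word.
    yield (word, sum(counts))

def run_wordcount(documents):
    # The framework's group-by-key (shuffle) step, simulated locally:
    # gather every value emitted for each key.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    result = {}
    for key, values in groups.items():
        for out_key, out_value in reduce_fn(key, values):
            result[out_key] = out_value
    return result

print(run_wordcount(["the cat sat", "the dog sat"]))
# → {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```

In a real cluster the map calls, the shuffle, and the reduce calls each run as many parallel tasks on different machines; only the two user-supplied functions stay the same.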

Map-Reduce Environment

  • Takes care of partitioning the input data, scheduling the program's execution, performing the group by key step, handling machine failures, and managing inter-machine communication.

Map-Reduce in Parallel

  • All phases are distributed with many tasks doing the work.
  • Programmer specifies map and reduce functions and input files.

Data Flow

  • Input and final output are stored on a distributed file system.
  • Scheduler tries to schedule map tasks close to the physical storage location of input data.
  • Intermediate results are stored on local file systems of map and reduce workers.

Coordination: Master

  • Master node takes care of coordination: task status, idle tasks, and worker failure detection.

Dealing with Failures

  • Map worker failure: reset the worker's completed and in-progress map tasks to idle (their output lives on the failed machine's local disk), and notify reducers of the change.
  • Reduce worker failure: reset only in-progress reduce tasks to idle and restart them (completed reduce output is already on the distributed file system).
  • Master failure: abort the whole MapReduce job and notify the client.

Task Granularity and Pipelining

  • Fine granularity tasks: map tasks >> machines.
  • Improves dynamic load balancing and speeds up recovery from worker failures.
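A toy simulation (the task sizes and the greedy scheduler here are illustrative assumptions, not part of MapReduce itself) shows why many small tasks balance better than one big task per machine:

```python
import random

def makespan(task_durations, num_machines):
    # Greedy dynamic scheduling: each idle machine grabs the next task,
    # so the job ends when the most-loaded machine finishes.
    loads = [0.0] * num_machines
    for d in task_durations:
        i = loads.index(min(loads))
        loads[i] += d
    return max(loads)

random.seed(0)
machines = 4
total_work = 400.0

# Coarse granularity: one big task per machine (no rebalancing possible).
coarse = [random.uniform(50, 150) for _ in range(machines)]
coarse = [d * total_work / sum(coarse) for d in coarse]  # same total work

# Fine granularity: 100 small tasks of varying size, same total work.
fine = [random.uniform(1, 7) for _ in range(100)]
fine = [d * total_work / sum(fine) for d in fine]

print(makespan(coarse, machines))  # one slow task dominates the job
print(makespan(fine, machines))    # close to the ideal total_work / machines
```

With coarse tasks the job takes as long as the single slowest task; with fine-grained tasks the scheduler keeps every machine busy, which is the same mechanism that lets a failed worker's tasks be redistributed quickly.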

Refinements: Backup Tasks

  • Problem: slow workers significantly lengthen the job completion time.
  • Solution: spawn backup copies of tasks near the end of the phase.
  • Effect: dramatically shortens job completion time.

Refinements: Combiners

  • Often, a map task will produce many pairs of the form (k, v1), (k, v2), ... for the same key k.
  • Can save network time by pre-aggregating values in the mapper.
  • Works only if the reduce function is commutative and associative.
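A minimal sketch of the combiner idea for word counting (the helper name is ours): instead of emitting one pair per occurrence, the mapper pre-aggregates locally, which is safe here because addition is commutative and associative.

```python
from collections import Counter

def map_with_combiner(document):
    # Without a combiner, the mapper emits one (word, 1) pair per
    # occurrence. Counting locally first sends one (word, n) pair per
    # distinct word, cutting the data shipped over the network.
    return list(Counter(document.split()).items())

pairs = map_with_combiner("to be or not to be")
print(sorted(pairs))
# → [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

The reducer is unchanged: summing the pre-aggregated counts gives the same totals as summing the raw 1s.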

Problems Suited for Map-Reduce

  • Link analysis and graph processing.
  • Machine learning algorithms.
  • Statistical machine translation.
  • Joining large datasets.

Cost Measures for Algorithms

  • Communication cost: total I/O of all processes.
  • Elapsed communication cost: max of I/O along any path.
  • Computation cost: analogous to communication cost, but count only running time of processes.
  • Note: asymptotic big-O analysis is less useful in MapReduce, since constant factors and the sheer volume of I/O dominate at this scale.
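A worked example of the two communication-cost measures, using a hypothetical job with made-up sizes:

```python
# Hypothetical two-phase job; all sizes are illustrative assumptions.
# Phase 1: 10 map tasks each read 1 GB of input.
# Phase 2: 5 reduce tasks each read 0.4 GB of intermediate data
#          and write 0.1 GB of final output.
map_io = [1.0] * 10          # GB of I/O per map task
reduce_io = [0.4 + 0.1] * 5  # GB of I/O per reduce task

# Communication cost: total I/O summed over every process.
total_cost = sum(map_io) + sum(reduce_io)

# Elapsed communication cost: max I/O along any map -> reduce path,
# since tasks within the same phase run in parallel.
elapsed_cost = max(map_io) + max(reduce_io)

print(total_cost)    # → 12.5
print(elapsed_cost)  # → 1.5
```

The total cost tracks what the job consumes (and what you pay for), while the elapsed cost tracks the critical path, i.e. the wall-clock lower bound set by I/O.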

This quiz covers the concept of Map-Reduce in big data analytics, including its role in large-scale computing for data mining and distributed/parallel programming, and how Map-Reduce addresses the challenges of distributing computation.
