Hadoop and Apache Spark Overview


How many nodes are typically used in the distributed architecture of Apache Hadoop?

Thousands

What is the purpose of Apache Spark?

To provide high performance for batch and streaming data

What is the purpose of giving a hardcoded value of 1 to each of the tokens or words in MapReduce?

To indicate that every word occurs once

What is the approximate input file size for Hadoop?

1TB

What does the MapReduce process do?

Sorts and shuffles data

What is the typical scale of input files processed by Hadoop?

About 1 TB (Hadoop imposes no fixed maximum file size)

What is Intel BigDL used for?

To write learning programs to run on data stored in HDFS or Hive

What is the main benefit of Apache Spark compared to Hadoop?

It is more efficient for processing large datasets

What is the approximate speed of Apache Spark compared to Hadoop?

100 times faster

What is the primary purpose of the MapReduce process in Hadoop?

To provide a framework for writing applications that run on Hadoop

What is the advantage of Apache Spark over Hadoop?

It is faster

Study Notes

  • Hadoop is a scalable, distributed data storage and analytics engine that can store and analyze massive amounts of unstructured and semi-structured data.

  • The distributed architecture of Apache Hadoop comprises multiple nodes (sometimes thousands), which are individual servers that run an off-the-shelf operating system and the Apache Hadoop software.

  • Input file sizes can be about one terabyte (TB).

  • Hadoop consists of two parts or sub-projects: MapReduce and HDFS.

  • MapReduce is a computational model and software framework for writing applications that run on Hadoop.

  • A Word Count Example of MapReduce

  • First, we divide the input into four splits as shown in the figure. This will distribute the work among all the map nodes.

  • Then, we tokenize the words in each of the mappers and give a hardcoded value (1) to each of the tokens or words. The rationale for the hardcoded value of 1 is that each occurrence of a word counts as exactly one occurrence.

  • Now, a list of key-value pairs will be created, where the key is the individual word and the value is 1. So, for the first line (Welcome to Hadoop) we have three key-value pairs: Welcome, 1; to, 1; Hadoop, 1.

  • After the mapper phase, a partition process takes place where sorting and shuffling happen so that all the tuples with the same key are sent to the corresponding reducer.

  • So, after the sorting and shuffling phase, each reducer will have a unique key and a list of values corresponding to that key. For example: Bad, [1]; is, [1, 1]; etc.

  • Each reducer counts the values present in its list of values. As shown in the figure, the reducer gets the list of values [1, 1] for the key Bear. It then counts the number of ones in that list and gives the final output as — Bear, 2.

  • Finally, all the output key/value pairs are then collected and written in the output file.
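The map, shuffle, and reduce steps above can be sketched in plain Python. This is a toy single-process illustration of the word-count flow, not the actual distributed Hadoop API; the input lines are hypothetical stand-ins for the input splits.

```python
from collections import defaultdict

def map_phase(line):
    # Tokenize the line and emit (word, 1) for every token:
    # each occurrence of a word counts as exactly one occurrence.
    return [(word, 1) for word in line.split()]

def shuffle_phase(pairs):
    # Group values by key, as the sort-and-shuffle step does,
    # so each reducer sees one key and a list of values.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Each reducer sums the list of 1s for its key.
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical input, split into lines as the splits in the example.
splits = ["Welcome to Hadoop", "Hadoop is fast", "Hadoop is scalable"]
mapped = [pair for line in splits for pair in map_phase(line)]
counts = reduce_phase(shuffle_phase(mapped))
```

Here `counts` maps each word to its total, e.g. the key `Hadoop` receives the value list `[1, 1, 1]` from the shuffle and reduces to 3.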

  • Apache Spark is an open-source cluster computing framework for big data and machine learning.

  • Apache Spark builds on the MapReduce computational model and is optimized to run in memory, unlike Hadoop's MapReduce, which writes intermediate data to and from computer hard drives.

  • Apache Spark is up to 100 times faster than Hadoop for processing large datasets.

  • Apache Spark provides high performance for batch and streaming data using its DAG scheduler, a query optimizer, and a physical execution engine.

  • Intel's Value Adds to Apache Spark include its Distributed Deep Learning library, Intel BigDL, and its rich deep learning support.

  • Intel BigDL can efficiently scale out to perform data analytics at a “big data scale” by using Spark, as well as efficient implementations of synchronous stochastic gradient descent (SGD) and all-reduce communications in Spark.

  • You can use Intel BigDL to write your learning programs to run on data stored in HDFS or Hive.

  • Intel also provides pointers and system settings for hardware and software that give the best performance in most situations.
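The in-memory versus on-disk distinction behind Spark's speed advantage can be illustrated with a toy Python sketch. This is a hypothetical illustration of the idea, not the Spark or MapReduce engines themselves: the MapReduce-style pipeline persists each stage's intermediate result to disk and reads it back, while the Spark-style pipeline keeps intermediates in memory.

```python
import json
import os
import tempfile

def stage1(data):
    # First processing stage (arbitrary example transformation).
    return [x * 2 for x in data]

def stage2(data):
    # Second processing stage, consuming stage1's output.
    return [x + 1 for x in data]

def run_with_disk(data):
    # MapReduce-style: the intermediate result is written to disk
    # between stages, then read back — extra I/O on every stage boundary.
    intermediate = stage1(data)
    fd, path = tempfile.mkstemp(suffix=".json")
    with os.fdopen(fd, "w") as f:
        json.dump(intermediate, f)
    with open(path) as f:
        intermediate = json.load(f)
    os.remove(path)
    return stage2(intermediate)

def run_in_memory(data):
    # Spark-style: the intermediate result stays in memory between stages.
    return stage2(stage1(data))
```

Both pipelines compute the same answer; the difference is purely where the intermediate data lives, which is where the performance gap comes from at scale.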

Test your knowledge of Hadoop and Apache Spark with this quiz covering topics such as distributed architecture, MapReduce, HDFS, Apache Spark's features, and Intel's contributions to the framework. Learn about the key concepts and technologies behind these powerful big data and machine learning solutions.
