Questions and Answers
How many nodes are typically used in the distributed architecture of Apache Hadoop?
What is the purpose of Apache Spark?
What is the purpose of giving a hardcoded value of 1 to each of the tokens or words in MapReduce?
What is the approximate input file size for Hadoop?
What does the MapReduce process do?
What is the maximum file size that can be processed by Hadoop?
What is Intel BigDL used for?
What is the main benefit of Apache Spark compared to Hadoop?
What is the approximate speed of Apache Spark compared to Hadoop?
What is the primary purpose of the MapReduce process in Hadoop?
What is the advantage of Apache Spark over Hadoop?
Study Notes
- Hadoop is a scalable, distributed data storage and analytics engine that can store and analyze massive amounts of unstructured and semi-structured data.
- The distributed architecture of Apache Hadoop comprises multiple nodes (sometimes thousands), which are individual servers running an off-the-shelf operating system and the Apache Hadoop software.
- Input file sizes can be on the order of one terabyte (TB).
- Hadoop consists of two parts or sub-projects: MapReduce and HDFS.
- MapReduce is a computational model and software framework for writing applications that run on Hadoop.
- A word count example of MapReduce (a runnable sketch follows these steps):
- First, the input is divided into four splits, which distributes the work among all the map nodes.
- Then, each mapper tokenizes the words in its split and assigns a hardcoded value of 1 to each token or word. The rationale behind the hardcoded value of 1 is that every word, in itself, occurs once.
- Now, a list of key-value pairs is created, where each key is an individual word and each value is 1. So, for the first line (Welcome to Hadoop), we have three key-value pairs: Welcome, 1; to, 1; Hadoop, 1.
- After the mapper phase, a partition process takes place, in which sorting and shuffling happen so that all the tuples with the same key are sent to the corresponding reducer.
- After the sorting and shuffling phase, each reducer has a unique key and a list of values corresponding to that key, for example: Bad, [1]; is, [1,1]; and so on.
- Each reducer counts the values in its list. For example, the reducer for the key Bear receives the list of values [1,1], counts the number of ones in that list, and gives the final output: Bear, 2.
- Finally, all the output key-value pairs are collected and written to the output file.
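To make the steps above concrete, here is a minimal, self-contained Python sketch that simulates the map, shuffle/sort, and reduce phases in a single process. The sample splits and function names are illustrative assumptions, not Hadoop APIs; a real job would express the same mapper and reducer logic through Hadoop Streaming or the MapReduce Java API.

```python
from collections import defaultdict

# Illustrative input: four "splits", one line each (hypothetical data).
splits = [
    "Welcome to Hadoop",
    "Hadoop is good",
    "Bear is not bad",
    "Bear likes Hadoop",
]

def mapper(line):
    """Map phase: emit (word, 1) for every token; each word, in itself, occurs once."""
    for word in line.split():
        yield (word, 1)

def shuffle(mapped_pairs):
    """Shuffle/sort phase: group all values by key, as the partitioner would."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reducer(key, values):
    """Reduce phase: count the ones in the value list, e.g. Bear, [1, 1] -> (Bear, 2)."""
    return (key, sum(values))

# Run the simulated job: map every split, shuffle, then reduce each key.
mapped = [pair for line in splits for pair in mapper(line)]
for word, counts in shuffle(mapped).items():
    print(reducer(word, counts))   # e.g. ('Bear', 2)
```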
- Apache Spark is an open-source cluster computing framework for big data and machine learning (see the PySpark sketch below).
- Apache Spark builds on the Hadoop MapReduce model and is optimized to run in memory, unlike Hadoop's MapReduce, which writes data to and from computer hard drives.
- Apache Spark is up to 100x faster than Hadoop for processing large datasets.
- Apache Spark provides high performance for batch and streaming data using its DAG scheduler, a query optimizer, and a physical execution engine.
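For comparison, below is a sketch of the same word count using the standard PySpark RDD API. It assumes a local Spark installation and a hypothetical input file input.txt; the intermediate RDDs stay in memory rather than being written to disk between stages, which is the source of Spark's speed advantage described above.

```python
from pyspark.sql import SparkSession

# Start a local Spark session (assumes PySpark is installed).
spark = SparkSession.builder.appName("WordCount").master("local[*]").getOrCreate()
sc = spark.sparkContext

# The same map -> shuffle -> reduce flow, expressed as RDD transformations.
counts = (
    sc.textFile("input.txt")                  # hypothetical input file
      .flatMap(lambda line: line.split())     # map: tokenize each line
      .map(lambda word: (word, 1))            # map: emit (word, 1)
      .reduceByKey(lambda a, b: a + b)        # shuffle + reduce: sum the ones
)

for word, count in counts.collect():
    print(word, count)

spark.stop()
```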
- Intel's value adds to Apache Spark include its distributed deep learning library, Intel BigDL, and its rich deep learning support (a sketch follows these notes).
- Intel BigDL can efficiently scale out to perform data analytics at a "big data scale" by using Spark, together with efficient implementations of synchronous stochastic gradient descent (SGD) and all-reduce communication in Spark.
- You can use Intel BigDL to write deep learning programs that run on data stored in HDFS or Hive.
- Intel also provides pointers and system settings for hardware and software that give the best performance in most situations.
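As a rough sketch of the BigDL workflow, the snippet below trains a tiny model on an RDD of samples. It follows the classic BigDL 1.x Python API; module paths, class names, and signatures vary between BigDL releases, so treat every name here as an assumption to check against the BigDL documentation. The random training data stands in for data that would normally be read from HDFS or Hive.

```python
import numpy as np
from pyspark.sql import SparkSession

# BigDL imports follow the classic BigDL 1.x Python API; these module paths
# and signatures differ across releases, so treat them as assumptions.
from bigdl.util.common import init_engine, Sample
from bigdl.nn.layer import Sequential, Linear, ReLU
from bigdl.nn.criterion import MSECriterion
from bigdl.optim.optimizer import Optimizer, SGD, MaxEpoch

spark = SparkSession.builder.appName("BigDLSketch").master("local[*]").getOrCreate()
sc = spark.sparkContext
init_engine()  # initialize BigDL's engine on the Spark cluster

# Hypothetical training data: an RDD of (features, label) Samples. In practice
# this RDD could be read from HDFS or Hive, as the notes describe.
train_rdd = sc.parallelize(
    [Sample.from_ndarray(np.random.rand(4), np.random.rand(1)) for _ in range(256)]
)

# A tiny feed-forward model.
model = Sequential().add(Linear(4, 8)).add(ReLU()).add(Linear(8, 1))

# Distributed training: BigDL runs synchronous mini-batch SGD, aggregating
# gradients across Spark executors via all-reduce, as mentioned above.
optimizer = Optimizer(
    model=model,
    training_rdd=train_rdd,
    criterion=MSECriterion(),
    optim_method=SGD(learningrate=0.01),
    end_trigger=MaxEpoch(2),
    batch_size=32,
)
trained_model = optimizer.optimize()
spark.stop()
```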
Description
Test your knowledge of Hadoop and Apache Spark with this quiz covering topics such as distributed architecture, MapReduce, HDFS, Apache Spark's features, and Intel's contributions to the framework. Learn about the key concepts and technologies behind these powerful big data and machine learning solutions.