Data Products in Big Data Analytics and Hadoop

Summary

This document discusses data products and their advantages in a Big Data Analytics and Hadoop Ecosystem context. It covers defining data products, highlighting their benefits, and outlining a typical analytical workflow.

Full Transcript

UNIT 1 Fundamentals of Big Data Analytics and Hadoop Ecosystem –

▪ The volume of data being generated every second is overwhelming. Data is increasingly changing how we work, play, and entertain ourselves, and technology has come to describe every facet of life, from the food we prepare to our online social interactions.
▪ Yet we expect highly personalized and finely tuned products and services well suited to our behavior and nature, which has created a market opportunity for a new kind of technology: the data product.
▪ Data products are created using data science processing chains: the thoughtful application of complex algorithms, possibly predictive or inferential, to a specific dataset.

What is a Data Product? What are the advantages of a Data Product in Hadoop?

Defining a Data Product –
▪ Traditionally, a data product is any application that combines data with algorithms.
▪ But writing software is not just combining data with algorithms; a data product is the combination of data and statistical algorithms used to generate inferences or predictions. Ex. Facebook's "People You May Know".
▪ That definition, however, limits data products to single software instances (for example, a web application). A data product is not just an application driven by data; it is an application that derives its value from data and, in the process, creates additional data as output. It's a data product, not just an application with data.
▪ A data product is a cost-effective engine that extracts value from data while also generating more data.
▪ Data products have been described as systems that learn from data and can self-adapt in a number of ways.
▪ Data products are self-adapting economic engines: they derive their value from data, create additional data as they make inferences or predictions on new data, and influence human behavior with that very data.
▪ Data products are no longer just programs that run behind a web interface; they are becoming an important part of every domain of activity in the modern world.

Using Hadoop to build data products at scale –
The typical analytical workflow that data scientists follow when creating a data product is: Ingestion → Wrangling → Modeling → Reporting & Visualization. This is the data science pipeline; it is human-designed and augmented by the use of languages like R and Python. Building a data product on Hadoop allows the data to grow in size, the processing to run fast, and a larger variety of data to be brought into the computation, which in turn helps derive insights without requiring human interaction at every step.

Data products advantages –

Data products are designed to meet demand
Users create data products to meet specific demands and use cases. They are focused and designed to meet demand. They help focus attention on the core needs of datasets and the business problems surrounding them, achieving greater insight and clarity.

Data products enable data democratization across the organization
Anyone with access rights can produce a data product for themselves or others.

Power to the business functions
Business owners understand the problems they want to answer. Data products empower business functions to solve problems quickly and with agility. They put the power of an ETL pipeline alongside the power of business context. The result is faster insights and better use of data.
Save, share, collaborate
Data products allow every business unit to create, save, and share business insights, placing data more directly in the hands of those who use it. Data products help short-circuit misunderstandings between departments and create greater cross-functional collaboration, reducing the need for long conversations and multiple iterations and speeding the time to insight.

Data products are easily reused and adapted to fit new scenarios
This ensures that different teams can benefit from work already started elsewhere in the organization. In general, their packaged nature helps maintain both focus and reuse.

Data products are discoverable and accessible
Data products are searchable and organized to help teams quickly find and access the information they need to succeed. Discoverability and accessibility ensure that data is always available to answer the right questions.

Data products support data federation at their core
The same datasets can be viewed differently by different teams within the business, so it is no longer necessary to make multiple copies of a dataset. Instead, data products allow customized access to the same data, tailored to the needs of each business function.

Data products facilitate committed data ownership
Data products provide a superior user experience compared to traditional query methods. They are focused, curated, and ready to use, and this precision helps derive insights.

Organizational benefits of data products
Data products also solve specific business problems and enhance the organizational processes of the businesses that use them.

Teamwork: both data consumers and data producers benefit from data products
Data consumers benefit from greater accessibility, increased reliability, and the ability to bring the data closer to the business context. Data moves closer to the underlying business contexts, directly engaging the teams that own that context. Data producers benefit from a reduced workload, owing to the improved self-sufficiency of data consumers. Technical teams can address exceptions and overall trends rather than being drawn into day-to-day needs across the entire business.

Data products enhance efficiency in several different ways
They reduce the time spent liaising between teams, improve autonomy and self-sufficiency, and ensure that the underlying datasets are correct. Data producers particularly benefit from data products by tracking their use over time.

Data products are iterable
The teams that own data products improve them over time. Because they are easy to access, share, and maintain, the time between iterations is much shorter than with traditional methods, speeding progress.

Data products can also enhance security
Data products include specific roles, responsibilities, and permissions from the outset. This is critical for any business, especially in highly regulated industries and for sensitive datasets, including those involving high-risk personal or financial data.

Data products are highly scalable and increase agility
Data products connect data and people and are social and organizational by nature. Curated and ready for consumption, data products open up the data world and reduce bottlenecks across the business. A data product developed in one setting is easy to use in another. They are shared within teams and across teams. Individual data products can also be connected to build more complex solutions, providing another layer of use.
Data products help facilitate an agile workflow and reduce costs
Their focused, curated nature makes them well suited to iterative workflows that move business problems forward step by step. They can also be updated and modified easily: if the problem changes or the data shifts, the data product shifts along with it. In agile workplaces, this is a powerful tool for facilitating dynamic growth. Data products save money by making data more accessible to non-technical users, reducing reliance on expensive central IT teams. They reduce complexity, while the self-service approach saves time.

Using Large Datasets as an Advantage –
Humans have extraordinary vision for large-scale patterns, such as woods and clearings visible through the foliage. Statistical methodologies allow us to deal with both noisy and meaningful data, describing it with aggregations and indices or analyzing it directly through inference. As our ability to gather data has increased, so has the requirement for more generalization. Smart grids, quantified selves, mobile technology, sensors, and wired homes all require personalised statistical inference. Scale is measured not only by the amount of data but also by the number of facets that must be explored: a forest view for individual trees. Hadoop is distinct because of the economics of data processing and because it is a platform. Hadoop's release was interesting in that it came at a time when the world needed a solution for large-scale data analytics.

Creating Data Products using Hadoop –
Hadoop grew out of the big data challenges faced by tech giants such as Google, Yahoo, and Facebook. Data issues are no longer limited to tech behemoths; they also affect commercial and public organisations of all sizes, from large companies to startups, from federal agencies to cities, and perhaps even individuals. Computing services are also becoming more available and affordable. Data scientists can get on-demand, instant access to large clusters by using cloud computing platforms such as Google Compute Engine or Amazon EC2, at a fraction of the cost of traditional data centres and with no data centre management required. Hadoop is democratizing big data computing and making it more open to everyone.

The Data Science Pipeline –
The characteristics of the data science pipeline are as follows: it is human-driven, it is concerned with the development of practical data visualisations, and its workflow aims to produce results that enable humans to make decisions. An analyst takes in a large volume of data, performs operations on it to convert it into a normal form so that different calculations can be carried out, and finally presents the results visually. With the overwhelming growth in the volume and velocity at which many businesses now generate data, this human-powered model is not scalable.
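To make this human-driven workflow concrete, here is a minimal Python sketch of the four stages. The file name, column name, and the use of pandas/matplotlib are illustrative assumptions, not part of the original text, and the "model" is a simple aggregate standing in for a real statistical or predictive model.

import pandas as pd

# Ingestion: pull raw data from a source (an illustrative CSV file).
raw = pd.read_csv("reviews_raw.csv")

# Wrangling: clean and normalize the data into an analysis-ready form.
clean = raw.dropna(subset=["Rating"]).assign(Rating=lambda df: df["Rating"].astype(int))

# Modeling: a simple aggregation stands in for an inferential or predictive model.
counts = clean.groupby("Rating").size()

# Reporting & Visualization: present the result for a human decision maker.
print(counts)
counts.plot(kind="bar", title="Reviews per rating")  # requires matplotlib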
Explain the Big Data pipeline of Hadoop.

The Big Data Pipeline –
We implement a machine learning feedback loop into the data science pipeline to create a framework that enables the development of scalable, automated data analysis and insight generation solutions. The new framework is the big data pipeline, which
✓ is not human-driven,
✓ is an iterative model,
✓ has four primary phases, and
✓ ensures scalability and automation.

The four stages of the Big Data Pipeline are: staging, ingestion, computation, and workflow management.

In its most basic form, this model, like the data science pipeline, takes raw data and transforms it into insights. The workflow management stage generates a reusable data product as its output by turning the ingestion, staging, and computation phases into an automated workflow. A feedback mechanism is often needed during workflow management so that the output of one job can be automatically fed in as the data input for the next, allowing for self-adaptation.

The ingestion phase involves both the model's initialization and its interaction with users. Users may define data source locations or annotate data during initialization; during interaction, users receive the predictions given by the model and in turn give important feedback that strengthens it.

The staging phase executes transformations on the data to make it usable and storable so that it can be processed. Staging tasks include data normalization, standardization, and data management.

The computation phase takes the most time; its key responsibilities are extracting insights from data, performing aggregations or reports, and building machine learning models for recommendations, regressions, clustering, or classification.

The workflow management phase involves tasks such as abstraction, orchestration, and automation, which make it possible to operationalize the workflow steps. The final output should be an automated program that can be run as desired.
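The following is a conceptual Python sketch of this loop. All of the functions are illustrative stubs that are not from the original text; in a real deployment each phase would be a distributed job, for example a MapReduce job coordinated by a workflow tool.

# Conceptual sketch of the four-phase big data pipeline with a feedback loop.
def ingest(feedback=None):
    # Initialization / user interaction: pull new raw records,
    # optionally taking earlier output into account.
    return ["Raw Record 1", "Raw Record 2"]

def stage(raw_records):
    # Staging: normalize and standardize records so they can be stored and processed.
    return [r.strip().lower() for r in raw_records]

def compute(staged_records):
    # Computation: derive insights, aggregates, or model output from staged data.
    return {"record_count": len(staged_records)}

# Workflow management: automate the phases and feed each run's output into the next.
feedback = None
for iteration in range(3):
    staged = stage(ingest(feedback))
    insights = compute(staged)
    feedback = insights  # self-adaptation: output becomes input for the next run
    print(iteration, insights)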
Explain the Hadoop Ecosystem. Explain the Hadoop core components.

The Hadoop Ecosystem –
Hadoop encompasses a multiplicity of tools that are designed and implemented to work together. As a result, Hadoop can be used for many things, and people often define it based on the way they are using it. For some, Hadoop is a data management system bringing together massive amounts of structured and unstructured data that touch nearly every layer of the traditional enterprise data stack, positioned to occupy a central place within a data center. For others, it is a massively parallel execution framework bringing the power of supercomputing to the masses, positioned to fuel the execution of enterprise applications. Some see Hadoop as an open source community creating tools and software for solving Big Data problems. Because Hadoop provides such a wide array of capabilities that can be adapted to solve many problems, many consider it a basic framework. Certainly, Hadoop provides all of these capabilities, but Hadoop should be classified as an ecosystem comprised of many components that range from data storage, to data integration, to data processing, to specialized tools for data analysts.

Hadoop Core Components –
Starting from the bottom of the diagram in Figure 1-1, Hadoop's ecosystem consists of the following:

HDFS – A foundational component of the Hadoop ecosystem is the Hadoop Distributed File System (HDFS). HDFS is the mechanism by which a large amount of data can be distributed over a cluster of computers; data is written once but read many times for analytics. It provides the foundation for other tools, such as HBase. (A small Python sketch of basic HDFS interaction follows this component list.)

MapReduce – Hadoop's main execution framework is MapReduce, a programming model for distributed, parallel data processing that breaks jobs into map phases and reduce phases (hence the name). Developers write MapReduce jobs for Hadoop, using data stored in HDFS for fast data access. Because of the way MapReduce works, Hadoop brings the processing to the data in a parallel fashion, resulting in fast implementation.

HBase – A column-oriented NoSQL database built on top of HDFS, HBase is used for fast read/write access to large amounts of data. HBase uses Zookeeper for its management to ensure that all of its components are up and running.

Zookeeper – Zookeeper is Hadoop's distributed coordination service. Designed to run over a cluster of machines, it is a highly available service used for the management of Hadoop operations, and many components of Hadoop depend on it.

Oozie – A scalable workflow system, Oozie is integrated into the Hadoop stack and is used to coordinate the execution of multiple MapReduce jobs. It is capable of managing a significant amount of complexity, basing execution on external events that include timing and the presence of required data.

Pig – An abstraction over the complexity of MapReduce programming, the Pig platform includes an execution environment and a scripting language (Pig Latin) used to analyze Hadoop data sets. Its compiler translates Pig Latin into sequences of MapReduce programs.

Hive – An SQL-like, high-level language used to run queries on data stored in Hadoop, Hive enables developers not familiar with MapReduce to write data queries that are translated into MapReduce jobs. Like Pig, Hive was developed as an abstraction layer, but it is geared more toward database analysts who are more familiar with SQL than with Java programming.

The Hadoop ecosystem also contains several frameworks for integration with the rest of the enterprise:

Sqoop – A connectivity tool for moving data between relational databases, data warehouses, and Hadoop. Sqoop leverages the database to describe the schema of the imported/exported data, and MapReduce for parallelization and fault tolerance.

Flume – A distributed, reliable, and highly available service for efficiently collecting, aggregating, and moving large amounts of data from individual machines to HDFS. It is based on a simple and flexible architecture and provides streaming data flows. It leverages a simple, extensible data model, allowing you to move data from multiple machines within an enterprise into Hadoop.

Hadoop's ecosystem is growing to provide newer capabilities and components, such as the following:

Whirr – A set of libraries that allows users to easily spin up Hadoop clusters on top of Amazon EC2, Rackspace, or any virtual infrastructure.

Mahout – A machine-learning and data-mining library that provides MapReduce implementations for popular algorithms used for clustering, regression, and statistical modeling.

BigTop – A formal process and framework for packaging and interoperability testing of Hadoop's sub-projects and related components.

Ambari – A project aimed at simplifying Hadoop management by providing support for provisioning, managing, and monitoring Hadoop clusters.
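Since HDFS is normally reached through the hadoop command line (or language bindings), a minimal way to experiment with it from Python is simply to shell out to the standard hadoop fs commands. The sketch below assumes a running Hadoop installation with the hadoop CLI on the PATH; the paths and file name are illustrative only.

import subprocess

# Copy a local file into HDFS, list the directory, and read part of the file back.
subprocess.run(["hadoop", "fs", "-mkdir", "-p", "/user/demo/reviews"], check=True)
subprocess.run(["hadoop", "fs", "-put", "-f", "Hotel_Reviews.csv", "/user/demo/reviews/"], check=True)
subprocess.run(["hadoop", "fs", "-ls", "/user/demo/reviews"], check=True)

# HDFS is write-once/read-many: files are replaced or appended to, not edited in place.
out = subprocess.run(
    ["hadoop", "fs", "-cat", "/user/demo/reviews/Hotel_Reviews.csv"],
    capture_output=True, text=True, check=True,
)
print(out.stdout[:200])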
What is DFS? Explain the features of DFS.

Distributed File System –
A Distributed File System (DFS), as the name suggests, is a file system that is distributed across multiple file servers or multiple locations. It allows programs to access or store remote files as they do local ones, allowing programmers to access files from any network or computer. The main purpose of a DFS is to allow users of physically distributed systems to share their data and resources by using a common file system. A collection of workstations and mainframes connected by a Local Area Network (LAN) is a typical DFS configuration. A DFS is implemented as part of the operating system. In a DFS, a namespace is created, and this process is transparent for the clients.

DFS has two components:
Location Transparency – achieved through the namespace component.
Redundancy – achieved through a file replication component.

In the case of failure and heavy load, these components together improve data availability by allowing data in different locations to be shared and logically grouped under one folder, known as the "DFS root". It is not necessary to use both components of DFS together; it is possible to use the namespace component without the file replication component, and it is perfectly possible to use the file replication component between servers without the namespace component.

File system replication: Early iterations of DFS made use of Microsoft's File Replication Service (FRS), which allowed straightforward file replication between servers. FRS recognises new or updated files and distributes the most recent version of the whole file to all servers. Windows Server 2003 R2 introduced "DFS Replication" (DFSR). It improves on FRS by copying only the portions of files that have changed and by minimising network traffic with data compression. It also provides flexible configuration options to manage network traffic on a configurable schedule.

Features of DFS:

Transparency:
▪ Structure transparency – The client does not need to know the number or locations of file servers and storage devices. Multiple file servers should be provided for performance, adaptability, and dependability.
▪ Access transparency – Both local and remote files should be accessible in the same manner. The file system should automatically locate the accessed file and deliver it to the client's side.
▪ Naming transparency – The name of a file should give no hint of its location, and once a name is given to a file it should not change when the file moves from one node to another.
▪ Replication transparency – If a file is replicated on multiple nodes, the copies and their locations should be hidden from clients.

User mobility: The user's home directory is automatically brought to the node where the user logs in.

Performance: Performance is measured by the average amount of time needed to satisfy client requests. This time covers CPU time plus the time taken to access secondary storage plus network access time. It is advisable that the performance of a distributed file system be comparable to that of a centralized file system.

Simplicity and ease of use: The user interface of a file system should be simple, and the number of commands should be small.

High availability: A distributed file system should be able to continue operating in the face of partial failures such as a link failure, a node failure, or a storage drive crash.
A highly reliable and adaptable distributed file system should have multiple independent file servers controlling multiple independent storage devices.

Scalability: Since growing the network by adding new machines or joining two networks together is routine, a distributed system will inevitably grow over time. A good distributed file system should therefore be built to scale quickly as the number of nodes and users grows, and service should not be substantially disrupted as it does.

Security: A distributed file system should be secure so that its users can trust that their data will be kept private. Security mechanisms must be implemented to safeguard the information in the file system from unwanted and unauthorized access.

Heterogeneity: Heterogeneity in distributed systems is unavoidable as a result of their huge scale. Users of heterogeneous distributed systems have the option of using multiple computer platforms for different purposes.

High reliability: The likelihood of data loss should be minimized as much as feasible. Users should not feel forced to make backup copies of their files because of the system's unreliability; rather, the file system should create backup copies of key files that can be used if the originals are lost. Many file systems employ stable storage as a high-reliability strategy.

Data integrity: A file system is frequently shared by multiple users, and the integrity of data saved in a shared file must be guaranteed by the file system. Concurrent access requests from many users competing for access to the same file must be correctly synchronized using a concurrency control method. Atomic transactions are a high-level concurrency management mechanism frequently offered to users for data integrity.

What is DFS? Explain the applications of DFS.

Applications:

NFS – NFS stands for Network File System. It is a client-server architecture that allows a computer user to view, store, and update files remotely. The NFS protocol is one of several distributed file system standards for Network-Attached Storage (NAS).

CIFS – CIFS stands for Common Internet File System. CIFS is a dialect of SMB; that is, CIFS is an implementation of the SMB protocol designed by Microsoft.

SMB – SMB stands for Server Message Block. It is a file-sharing protocol invented by IBM. The SMB protocol was created to allow computers to perform read and write operations on files on a remote host over a Local Area Network (LAN). The directories on the remote host made available via SMB are called "shares".

Hadoop – Hadoop is a group of open-source software services. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. The core of Hadoop consists of a storage part, the Hadoop Distributed File System (HDFS), and a processing part, the MapReduce programming model.

NetWare – NetWare is a discontinued computer network operating system developed by Novell, Inc. It primarily used cooperative multitasking to run different services on a personal computer, using the IPX network protocol.

Explain the working of DFS. Explain its advantages and disadvantages.

Working of DFS:
There are two ways in which DFS can be implemented:

Standalone DFS namespace – It allows only DFS roots that exist on the local computer and do not use Active Directory.
A standalone DFS can be accessed only on the computer on which it is created. It does not provide any fault tolerance and cannot be linked to any other DFS. Standalone DFS roots are rarely encountered because of their limited utility.

Domain-based DFS namespace – It stores the configuration of DFS in Active Directory, making the DFS namespace root accessible at \\<domain name>\<DFS root> or \\<server name>\<DFS root>.

Advantages:
DFS allows multiple users to access or store data.
It allows data to be shared remotely.
It improves file availability, access time, and network efficiency.
It improves scalability and the ability to exchange data.
A DFS provides transparency of data even if a server or disk fails.

Disadvantages:
In a DFS, nodes and connections need to be secured, so security is a concern.
Messages and data may be lost in the network while moving from one node to another.
Database connectivity in a DFS is complicated, and handling a database is harder than in a single-user system.
Overloading may occur if all nodes try to send data at once.

MapReduce with Python –
As the name suggests, dealing with "BIG" data, a very large amount of data, is a daunting task. MapReduce is a built-in programming model in Apache Hadoop that processes your data in parallel on the cluster. This chapter looks at how MapReduce works with an example dataset using Python.

MapReduce: Analyse big data
MapReduce transforms the data using Map by dividing it into key/value pairs, takes the output of the map as input, and aggregates the data together with Reduce. MapReduce also handles cluster failures for you.

What is MapReduce? Explain the working of Hadoop Streaming.

How MapReduce Works –
To understand MapReduce, let's take a real-world example. You have a dataset that consists of hotel reviews and ratings, and you need to find out how many reviews each rating has.

Dataset Snapshot –
You would go through the Rating column and count how many reviews there are for each of the ratings 1, 2, 3, 4, and 5. This is easy to do with a small amount of data, but big data can run to billions or trillions of records. That is where MapReduce comes in.

What's happening under Hadoop Streaming –
Fig. Working of Hadoop MapReduce in Data Processing
The client node submits MapReduce jobs to the resource manager, Hadoop YARN. Hadoop YARN manages and monitors cluster resources, keeping track of available capacity, available nodes, and so on. The needed data is fetched from the Hadoop Distributed File System (HDFS) in parallel. The Node Manager then manages the MapReduce tasks on its node. The MapReduce application master, located in a Node Manager, keeps track of each Map and Reduce task and distributes them across the cluster with the help of YARN. Map and Reduce tasks connect to the HDFS cluster to get the data they need to process and to write output data. (A bare-bones sketch of the streaming mechanism itself appears below.)

MapReduce is written in Java but is capable of running jobs written in different languages such as Ruby, Python, and C++. Here we are going to use Python with the mrjob package. We will count the number of reviews for each rating (1, 2, 3, 4, 5) in the dataset.
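mrjob runs on top of Hadoop Streaming, which launches the mapper and reducer as ordinary processes: each reads lines on stdin and emits tab-separated key/value lines on stdout. As an mrjob-free illustration of that contract for the same rating count, the two hypothetical scripts below (mapper.py and reducer.py, assuming a Review,Rating column order) could be plugged directly into Hadoop Streaming.

# --- mapper.py ---
import sys
import csv

# Read CSV lines from stdin and emit "rating<TAB>1" for every review.
for line in sys.stdin:
    for row in csv.reader([line]):
        if len(row) >= 2 and row[1] != "Rating":  # skip the header row
            print(f"{row[1]}\t1")

# --- reducer.py ---
import sys
from itertools import groupby

# Hadoop Streaming sorts the mapper output by key before the reducer sees it,
# so all lines for one rating arrive together.
def parse(lines):
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, int(value)

for rating, group in groupby(parse(sys.stdin), key=lambda kv: kv[0]):
    print(f"{rating}\t{sum(count for _, count in group)}")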
Step 1: Transform raw data into key/value pairs in parallel. The mapper reads the data file and makes the Rating the key; the value is 1 for each review.

Step 2: Shuffle and sort, handled by the MapReduce model. The process of transferring the mappers' intermediate output to the reducer is known as shuffling. It collects all the 1s emitted for each individual key and sorts them by key.

Step 3: Process the data using Reduce. Reduce counts the values (the 1s) for each key.

Prerequisites
1. Install Python
2. Install Hadoop
3. Install mrjob: pip install mrjob, or python setup.py test && python setup.py install

Here we have a job called 'NoRatings' consisting of a mapper function, a reducer function, and a steps definition. The steps function is used to define the mappers and reducers for the job. Note: yield is used instead of a return statement in these functions. Here is the full code, saved as NoRatings.py:

from mrjob.job import MRJob
from mrjob.step import MRStep
import csv

# Split the header by ',' to get the column names.
columns = 'Review,Rating'.split(',')

class NoRatings(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings)
        ]

    # Mapper function
    def mapper_get_ratings(self, _, line):
        reader = csv.reader([line])
        for row in reader:
            zipped = zip(columns, row)
            diction = dict(zipped)
            ratings = diction['Rating']
            # Output as key/value pairs
            yield ratings, 1

    # Reducer function
    def reducer_count_ratings(self, key, values):
        yield key, sum(values)

if __name__ == "__main__":
    NoRatings.run()

Verify that both the dataset (Hotel_Reviews.csv) and the Python file (NoRatings.py) are in the current directory:

ls
# or
hadoop fs -ls

Run the Python script with mrjob. We can run the NoRatings.py script locally or with Hadoop.

# Run locally (Hotel_Reviews.csv is the dataset):
python NoRatings.py Hotel_Reviews.csv

# Run with Hadoop:
# python [python file] -r hadoop --hadoop-streaming-jar [path of the Hadoop streaming jar] [dataset]
python NoRatings.py -r hadoop --hadoop-streaming-jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar Hotel_Reviews.csv

Python MapReduce –
MapReduce is a programming model. It enables the processing and creation of large amounts of data by separating work into discrete jobs, and it allows work to be performed in parallel across a cluster of machines. The MapReduce programming model was inspired by the functional programming constructs map and reduce. This section presents the MapReduce programming model in Python and explains how data flows across the model's various stages.

What is MapReduce? Explain the features of MapReduce.

MapReduce is a method of programming that offers great scalability across the hundreds or thousands of machines in a Hadoop cluster. MapReduce, as the processing component, lies at the heart of Apache Hadoop. The phrase MapReduce refers to two independent functions performed by Hadoop programs. The first is a map task, which takes one set of data and changes it into another set of data in which individual items are broken down into tuples (key/value pairs). The reduce job takes a map's output as input and merges those data tuples into a smaller collection of tuples. The reduce job is always executed after the map job, as the term MapReduce implies. To speed up processing, MapReduce runs logic on the servers where the data already resides rather than transporting the data to the location of the application or logic. MapReduce initially appeared as a tool for analyzing Google search results.
However, it quickly gained popularity due to its ability to partition and analyze terabytes of data in parallel, producing results faster.

Features of MapReduce –
Some of the features of MapReduce are discussed below:

MapReduce is a highly scalable framework. This is due to its ability to distribute and store huge amounts of data across multiple servers, which can all run at the same time and are economically priced.

MapReduce provides high security. The security mechanisms it relies on are those of HBase and HDFS; only authenticated users can read and manipulate the data.

Big data volumes can be stored and handled very affordably with the MapReduce programming framework and Hadoop's scalable design. This type of system is extremely cost-effective and scalable.

MapReduce programs run faster because of parallel processing: splitting a job into smaller tasks makes each task simpler to handle, and parallel processing allows many processors to carry out these broken-down tasks.

Hadoop MapReduce is built on a simple programming model, one of the technology's most worthwhile features. As a result, programmers can write efficient MapReduce applications that handle tasks quickly.

Explain the implementation of MapReduce in Python.

Implementation of MapReduce in Python –
There are three main phases in the MapReduce framework, corresponding to the map, filter, and reduce models of functional programming. They let the programmer write simpler, shorter code without having to worry about complexities such as loops and branching. map and filter are built-in functions in Python (in the __builtins__ module) and do not need to be imported. reduce, on the other hand, must be imported because it lives in the functools package.

Map Phase
The map() function in Python has the following syntax:

map(func, *iterable)

where func is the function that will be applied to each element of iterable (however many iterables there are). Before we move on to an example, keep the following in mind:

The map() function in Python 2 returns a list. In Python 3, it produces a map object, which is a generator. The built-in list() function can be used on the map object to receive the result as a list: list(map(func, *iterable)).

The number of parameters accepted by func must equal the number of iterables supplied.

The code below shows the use of the map() function:

my_pets = ['Alfred', 'Tabitha', 'William', 'Arla']
uppered_pets = list(map(str.upper, my_pets))
print(uppered_pets)

Output: ['ALFRED', 'TABITHA', 'WILLIAM', 'ARLA']

The above code converts the lower-case input strings to upper case using map(). With respect to the syntax of map(), func here is str.upper and iterable is the my_pets list. We only pass one iterable to str.upper because that method's definition requires just one argument; if the function you pass requires two, three, or n arguments, you must pass two, three, or n iterables.

Shuffle and Sort –
As the mappers finish, the intermediate outputs from the map stage are moved to the reducers. This process of shifting output from mappers to reducers is known as shuffling. The shuffle is driven by a partitioner function, which is responsible for directing the flow of key/value pairs from mappers to reducers. The partitioner is given the mapper's output key and the number of reducers, and it ensures that all values for the same key are routed to the same reducer. Sorting is the final stage before the reducers begin processing data: the Hadoop framework organizes the intermediate keys and values for each partition before passing them on to the reducer.
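Hadoop's default partitioner is a hash partitioner written in Java, but the idea is easy to sketch in Python. The function below is a conceptual stand-in, not Hadoop's implementation: every occurrence of a key hashes to the same reducer index, so all of its values end up at the same reducer.

# Conceptual sketch of hash partitioning: same key -> same reducer.
def choose_partition(key, num_reducers):
    # Note: Python randomizes str hashes between runs (PYTHONHASHSEED),
    # but within a single run a key always maps to the same partition.
    return hash(key) % num_reducers

num_reducers = 3
for key in ["1", "2", "3", "4", "5", "4", "4"]:
    print(f"rating {key} -> reducer {choose_partition(key, num_reducers)}")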
Filter Phase –
filter() requires a function that returns boolean values (True or False); it then passes each element of the iterable through that function, discarding the elements for which it returns False. The syntax is as follows:

filter(func, iterable)

The following points should be noted about filter(): in contrast to map(), only one iterable is required; the func argument must return a boolean type (if not, filter simply returns the iterable supplied to it); and since only one iterable is required, func can only accept one parameter. The filter runs each element of the iterable through func and returns only those that evaluate to true.

The code below shows the use of filter():

scores = [66, 90, 68, 59, 76, 60, 88, 74, 81, 65]

def is_A_student(score):
    return score > 75

over_75 = list(filter(is_A_student, scores))
print(over_75)

Output: [90, 76, 88, 81]

In the above code, there is a list (iterable) of the scores of 10 students in an exam. The code uses filter() to keep only those who scored more than 75.

Reduce Phase
reduce applies a function of two parameters cumulatively to the elements of an iterable, optionally starting with an initial argument. The syntax is as follows:

reduce(func, iterable[, initial])

where func is the function to which each element of the iterable is cumulatively applied, and initial is an optional value that is placed before the items of the iterable in the calculation; it also serves as a default when the iterable is empty. The following points should be noted about reduce(): the first argument to func is the first element of the iterable (if initial is not specified) and the second argument is the second element; if initial is given, it becomes the first argument to func and the first element of the iterable becomes the second argument. reduce() reduces the iterable to a single value.

The code below shows the use of reduce():

from functools import reduce

numbers = [3, 4, 6, 9, 34, 12]

def custom_sum(first, second):
    return first + second

result = reduce(custom_sum, numbers)
print(result)

Output: 68

In the above code, reduce() returns the sum of all elements of the iterable supplied to it. It takes the first and second elements of numbers and passes them to custom_sum, which computes and returns their sum. reduce() then uses that result as the first argument to custom_sum and the third element of numbers as the second argument, and so on until numbers is exhausted.

MapReduce – Advanced Concepts:

Map-only job: So far we have seen both a mapper and a reducer. In a map-only job there is no reducer; the output of the mapper is the final output and is stored in HDFS. The main advantage is that, since there are no reducers, there is no data shuffling between mapper and reducer, so a map-only job is more efficient. Map-only jobs are used for problems such as data parsing, where no aggregation or summation is needed.

Combiner –
We can have a combiner in between map and reduce. It is mostly like a reducer: the tasks done by the combiner are also aggregation, filtering, and so on. The main difference between a combiner and a reducer is that the reducer processes data from all the mappers on all the nodes, whereas a combiner processes data from the mappers of a single node. This reduces the number of intermediate outputs generated by the mappers, and so reduces the movement of key/value pairs between mapper and reducer. We can define our own combiner function. For example, when a mapper generates many repeated keys, the combiner can aggregate them and reduce the size of the data being sent to the reducer. A combiner internally implements the reducer method. We can take a simple word count example and analyse the combiner, as in the sketch below.
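Here is a minimal word count sketch using mrjob (the job name, file name, and simple word regex are illustrative assumptions): the combiner method pre-aggregates counts on each mapper node before anything crosses the network, and the reducer then performs the final aggregation across all nodes.

from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordCountWithCombiner(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for every word in the input line.
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combiner(self, word, counts):
        # Runs per mapper node: sum local counts before the shuffle.
        yield word, sum(counts)

    def reducer(self, word, counts):
        # Final aggregation of the (already partially summed) counts.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCountWithCombiner.run()

Run it locally with, for example: python word_count.py some_text_file.txt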
Data Locality: Data locality means moving code or computation closer to the data stored in HDFS. It is easy to move a few kilobytes of code to where the data is stored; we cannot move HDFS data, which is very large, to the code. This process of moving computation is called data locality, and it minimizes network congestion. The MapReduce code or job comes to the slave nodes and processes the blocks stored on each slave, so the mapper operates on the data located on the slaves. If the data is located on the same node where the mapper is running, this is referred to as data local; here the computation is closest to the data. Data local is the best choice, but it is not always possible due to resource constraints on a busy cluster. In such situations it is preferred to run the mapper on a different node on the same rack as the data; in this case data moves between nodes of the same rack. Data can also travel across racks, which is the least preferred scenario.

MapReduce API –
In this section, we focus on the MapReduce APIs: the classes and methods used in MapReduce programming.

MapReduce Mapper Class
In MapReduce, the role of the Mapper class is to map the input key/value pairs to a set of intermediate key/value pairs. It transforms the input records into intermediate records, which are associated with a given output key and passed to the Reducer for the final output.

Methods of Mapper Class
void cleanup(Context context) – This method is called only once, at the end of the task.
void map(KEYIN key, VALUEIN value, Context context) – This method is called once for each key/value pair in the input split.
void run(Context context) – This method can be overridden to control the execution of the Mapper.
void setup(Context context) – This method is called only once, at the beginning of the task.

MapReduce Reducer Class
In MapReduce, the role of the Reducer class is to reduce the set of intermediate values. Its implementations can access the Configuration for the job via the JobContext.getConfiguration() method.

Methods of Reducer Class
void cleanup(Context context) – This method is called only once, at the end of the task.
void reduce(KEYIN key, Iterable<VALUEIN> values, Context context) – This method is called once for each key.
void run(Context context) – This method can be used to control the tasks of the Reducer.
void setup(Context context) – This method is called only once, at the beginning of the task.

MapReduce Job Class
The Job class is used to configure the job and submit it. It also controls the execution and queries the state. Once the job is submitted, the set methods throw IllegalStateException.

Methods of Job Class
Counters getCounters() – This method is used to get the counters for the job.
long getFinishTime() – This method is used to get the finish time for the job.
Job getInstance() – This method is used to generate a new Job without any cluster.
Job getInstance(Configuration conf) – This method is used to generate a new Job without any cluster, with the provided configuration.
Job getInstance(Configuration conf, String jobName) – This method is used to generate a new Job without any cluster, with the provided configuration and job name.
String getJobFile() – This method is used to get the path of the submitted job configuration.
String getJobName() – This method is used to get the user-specified job name.
JobPriority getPriority() – This method is used to get the scheduling priority of the job.
void setJarByClass(Class c) – This method is used to set the jar by providing the class name with the .class extension.
void setJobName(String name) – This method is used to set the user-specified job name.
void setMapOutputKeyClass(Class class) – This method is used to set the key class for the map output data.
void setMapOutputValueClass(Class class) – This method is used to set the value class for the map output data.
void setMapperClass(Class cls) – This method is used to set the Mapper for the job.
