Big Data Analytics PDF
Document Details
Uploaded by WellIntentionedOnyx2727
Dayananda Sagar College of Engineering
Summary
This document provides an overview of big data analytics, including its definition, the rise of big data, and its advantages. It also discusses data types, technologies like Hadoop, and the phases involved in big data analytics.
Full Transcript
Big Data Analytics

Definition of big data: Big data is a term applied to data sets whose size or type is beyond the ability of traditional relational databases to capture, manage and process with low latency (minimum delay).

Rise of big data: In earlier days data generation was limited, and a single storage unit and processor were enough. In recent years data generation has increased enormously. Big data comes from sensors, devices, video/audio, networks, log files, transactional applications, the web, and social media, much of it generated in real time and at a very large scale.

Definition of big data analytics: Big data analytics is the use of advanced analytic techniques against very large, diverse data sets that include structured, semi-structured and unstructured data, from different sources, and in sizes ranging from terabytes to zettabytes.

[Slide: Data Generation in Real Time]

Advantages of big data analysis and techniques:
Businesses can use advanced analytics techniques such as text analytics, machine learning, predictive analytics, data mining, statistics and natural language processing to gain new insights.
Analysis of big data allows analysts, researchers and business users to make better and faster decisions using data that was previously inaccessible or unusable.
It handles unstructured data.
Risk management: banking systems use it to detect fraudulent activities and discrepancies.
Product development and innovation: Rolls-Royce analyzes engine designs.
Improved customer experience: Delta Air Lines uses tweets to monitor the customer experience.
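The three kinds of data named in the definition above can be illustrated with a minimal sketch. The sample records and the `field_names` helper are made up for illustration and are not taken from the slides.

```python
import csv
import io
import json

# Structured: fixed columns, so it maps directly onto a relational table.
structured = io.StringIO("id,amount\n1,250.0\n2,99.5\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing records whose fields can differ per record.
semi_structured = json.loads(
    '[{"id": 1, "tags": ["web"]}, {"id": 2, "device": {"os": "linux"}}]'
)

# Unstructured: free text with no schema; meaning has to be extracted,
# for example with text analytics.
unstructured = "Flight was delayed two hours but the crew was great."


def field_names(record):
    """Hypothetical helper: list the fields present in one record."""
    return sorted(record.keys())
```

Every row in `rows` has the same columns, while the two JSON records expose different fields, which is the practical difference a relational schema struggles with.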
[Slide: Structured Data, Unstructured Data, Semi-structured Data]
[Slide: Structured Data and Unstructured Data]
[Slide: Parallel Processing and Distributed Storage]

Data vs Big Data: Big data is just data with:
More volume
Faster data generation (velocity)
Multiple data formats (variety)
The ability to turn data into useful insights for your business (value)
Trustworthiness in terms of quality and accuracy (veracity)

World's data volume to grow 40% per year, with data coming from various human and machine activity. Source: http://e27.co/worlds-data-volume-to-grow-40-per-year-50-times-by-2020-aureus-20150115-2/

The main challenges that big data faced and the solutions:

Challenge                                      Solution
Single central storage                         Distributed storage
Serial processing                              Parallel processing
One input                                      Multiple inputs
One processor                                  Multiple processors
One output                                     One output
Lack of ability to process unstructured data   Ability to process every type of data

[Slide: The Phases of Big Data Analytics]
[Slide: Types of Data Analytics]

Hadoop: Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications in scalable clusters of computer servers. Hadoop systems can handle various forms of structured and unstructured data, giving users more flexibility for collecting, processing, analyzing and managing data than relational databases and data warehouses provide. It is one of the most commonly used frameworks for handling big data. There are three components of Hadoop.
1. Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit of Hadoop.
2. Hadoop MapReduce - Hadoop MapReduce is the processing unit of Hadoop.
3. Hadoop YARN - Hadoop YARN is the resource management unit of Hadoop.

Hadoop: Master and slave nodes form the HDFS cluster.
The name node is called the master, and the data nodes are called the slaves. The name node is responsible for the workings of the data nodes. It also stores the metadata. The data nodes read, write, process, and replicate the data. They also send signals, known as heartbeats, to the name node. These heartbeats show the status of each data node.

Hadoop HDFS: Data is stored in a distributed manner in HDFS. There are two components of HDFS:
Name node - one
Data node - multiple
HDFS is specially designed for storing huge datasets on commodity hardware, and Hadoop enables you to use commodity machines as your data nodes. HDFS maintains all the coordination between the clusters and hardware, thus working at the heart of the system.

Features of HDFS:
Provides distributed storage
Can be implemented on commodity hardware
Provides data security
Highly fault-tolerant: data blocks are replicated across machines, so if one machine goes down, the data it held is still available from the copies on other machines

Hadoop MapReduce phase: Hadoop MapReduce is the processing unit of Hadoop. In the MapReduce approach, the processing is done at the slave nodes, and the final result is sent to the master node. The code that is sent to the nodes to process the data is usually very small in comparison to the data itself.

Hadoop YARN: YARN is the resource management unit of Hadoop and acts like an operating system for Hadoop. It is not a file system; it is a resource management layer that sits on top of HDFS. It is responsible for managing cluster resources to make sure no single machine is overloaded, and it performs job scheduling to make sure that jobs are scheduled in the right place.

Hadoop Ecosystem: Apart from those Hadoop components, the Hadoop ecosystem has other capabilities that help with big data processing.
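Before turning to the wider ecosystem, the MapReduce phase described above can be sketched as a word count. This is a single-process simulation, assuming each string in `blocks` stands in for a data block held by one data node; the function names are hypothetical, and real Hadoop jobs are written against the Hadoop Java API or run through Hadoop Streaming rather than like this.

```python
from collections import defaultdict
from itertools import chain


def map_phase(block):
    """Runs where the block lives: emit a (word, 1) pair for every word."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Group the emitted values by key so each key is reduced once."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values for each key into the final result."""
    return {key: sum(values) for key, values in grouped.items()}


# Assumption: two blocks, each processed independently by map_phase.
blocks = ["big data big insight", "data nodes store data"]
mapped = chain.from_iterable(map_phase(block) for block in blocks)
word_counts = reduce_phase(shuffle(mapped))
```

Note that only the small `map_phase` and `reduce_phase` functions would need to travel to the nodes; the blocks themselves stay where they are stored.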
The following comprise the Hadoop ecosystem:
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Programming-based data processing
Spark: In-memory data processing
Pig, Hive: Query-based processing of data services
HBase: NoSQL database
Mahout, Spark MLlib: Machine learning algorithm libraries
Solr, Lucene: Searching and indexing
Zookeeper: Managing the cluster
Oozie: Job scheduling

Hadoop Ecosystem: https://www.geeksforgeeks.org/hadoop-ecosystem/

Pig: Pig was originally developed by Yahoo. It works with the Pig Latin language, a query-based language similar to SQL. Pig is a platform for structuring the data flow and for processing and analyzing huge data sets. Pig executes the commands, and in the background all the MapReduce activities are taken care of. After processing, Pig stores the result in HDFS. The Pig Latin language is specially designed for this framework and runs on the Pig Runtime, just the way Java runs on the JVM. Pig helps to achieve ease of programming and optimization and hence is a major segment of the Hadoop ecosystem.

Hive: The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in distributed storage.
With Hive you can write the schema for the data in HDFS.
Hive provides many libraries that enable you to read various data formats such as XML, JSON, or even compressed formats.
You can create your own data parser (to convert data to a readable format that is suitable for analysis) in Java.
Hive supports SQL for reading your data.
Hive converts your SQL into Java MapReduce code and runs it on the cluster.

Hive: With the help of SQL methodology and interface, Hive performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language). It is highly scalable and supports both batch processing and interactive querying.
Also, Hive supports most SQL data types, which makes query processing easier. Like other query processing frameworks, Hive comes with two components: JDBC drivers and the Hive command line. JDBC, along with ODBC drivers, establishes the data storage permissions and connection, whereas the Hive command line helps in the processing of queries.

Mahout: Mahout brings machine learning capability to a system or application. Machine learning, as the name suggests, helps the system improve itself based on patterns, on user or environmental interaction, or on algorithms. Mahout provides libraries for collaborative filtering, clustering, and classification, which are all machine learning techniques, and it allows algorithms to be invoked as needed through these libraries.

Apache Spark: Spark is a data processing engine. It is a platform that handles processing-intensive tasks such as batch processing, interactive or iterative real-time processing, graph conversions, and visualization. It processes data in memory, which makes it faster than MapReduce for many workloads. Spark is best suited for real-time data whereas Hadoop is best suited for structured data or batch processing, hence many companies use the two side by side.

Apart from all of these, there are some other components that carry out important tasks in making Hadoop capable of processing large datasets. They are as follows:

Zookeeper: Zookeeper helps the different tools and services communicate with each other. Coordination and synchronization among the resources and components of Hadoop used to be a major management problem and often resulted in inconsistency. Zookeeper overcame these problems by performing synchronization, inter-component communication, grouping, and maintenance.
Oozie: Oozie performs the task of a scheduler, scheduling jobs and binding them together as a single unit. There are two kinds of jobs, namely Oozie workflow jobs and Oozie coordinator jobs. Oozie workflow jobs are jobs that need to be executed in a sequentially ordered manner, whereas Oozie coordinator jobs are those that are triggered when some data or external stimulus is given to them.
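The difference between the two job kinds can be sketched conceptually as follows. This is not Oozie's API (real Oozie jobs are defined in XML and run on a Hadoop cluster); `run_workflow`, `coordinator`, the action names and the `logs/day1` input are all hypothetical, chosen only to show sequential execution versus data-triggered execution.

```python
def run_workflow(actions, context):
    """Workflow job: run the actions strictly in order.

    Each action receives the context, so it can use earlier results.
    """
    for name, action in actions:
        context[name] = action(context)
    return context


def coordinator(workflow, available_inputs, required_input):
    """Coordinator job: trigger the workflow only when its input exists."""
    if required_input not in available_inputs:
        return None  # Nothing to do yet; the trigger has not fired.
    return workflow({"input": required_input})


def ingest(context):
    return f"loaded {context['input']}"


def aggregate(context):
    return f"aggregated after: {context['ingest']}"


def daily_workflow(context):
    return run_workflow([("ingest", ingest), ("aggregate", aggregate)], context)


# The input has not arrived, so the workflow is not started.
skipped = coordinator(daily_workflow, set(), "logs/day1")
# The input is available, so the workflow runs its actions in order.
result = coordinator(daily_workflow, {"logs/day1"}, "logs/day1")
```

Here `aggregate` depends on the output of `ingest`, which is why the workflow must preserve order, while the coordinator only decides whether the workflow starts at all.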