Big Data and Hadoop Intro Lecture 5 PDF
Uploaded by EntertainingTrombone
Summary
This lecture provides an introduction to big data and Hadoop. It covers data types, sources, challenges in handling big data, and unlocking big data solutions, including Hadoop ecosystem components. The lecture also discusses examples of how big data is used in practice.
Full Transcript
Introduction to Big Data

An in-depth introduction to big data, with practical advice on how to start a career in this fast-growing field. Be ready to digest a concentrated big data tablet that will put you on the right path.

Agenda:

Data nowadays:
- Data types
- Fun facts about data nowadays
- Where we generate data from
- The effect of a lack of data on business decisions
- Future of data size

Big Data:
- What is big data?
- How big is big data?
- The famous Vs of big data
- Challenges of dealing with such amounts of data
- Why consider a career in big data?

Unlocking big data solutions:
- Hadoop
- Hadoop ecosystem zoo
- Big data landscape
- Top big data companies
- How to start a career in big data
- Questions

Data unit measures: [figure]

Data types:
- Structured data: information with a degree of organization that is readily searchable and quick to consolidate into facts. Examples: RDBMS, spreadsheets.
- Unstructured data: information with a lack of structure that is time and energy consuming to search, find and consolidate into facts. Examples: email, documents, images, reports.
- Semi-structured data: for example, XML data.

Challenges for unstructured data:
- How do you store billions of files?
- How long does it take to migrate hundreds of terabytes of data every 3-5 years?
- Data has no structure
- Data redundancy
- Data backup
- Resource limitations

Sources of data generation:
- Social media
- Sensors
- Cell phones
- GPS
- Purchases
- WWW
- E-mails
- Media streaming
- Healthcare
- IoT

Facts about data:
- 70% of data is created by individuals, but enterprises are responsible for storing and managing 80% of it.
- 52% of travelers use social media to plan their vacations.
- 35% of purchases on Amazon come through recommendations.
- 75% of what people watch on Netflix comes from recommendations.

Lack of data and business decisions: [figure]

Can a traditional DBMS solve this?

Size: First, data size has increased tremendously, into the range of petabytes (one petabyte = 1,024 terabytes). RDBMS finds it challenging to handle such huge data volumes.
To address this, RDBMS deployments add more central processing units (CPUs) or more memory to the database management system, that is, they scale up vertically.

Data types: Second, the majority of the data comes in a semi-structured or unstructured format from social media, audio, video, texts, and emails. Unstructured data is outside the purview of RDBMS because relational databases cannot categorize it. They are designed to accommodate structured data such as weblog, sensor and financial data.

Velocity: Third, big data is generated at a very high velocity. RDBMS struggles with high velocity because it is designed for steady data retention rather than rapid growth.

Cost: Finally, even if RDBMS is used to handle and store big data, it will turn out to be very expensive.

What is big data:
- Big data is a term that describes the large volume of data, both structured and unstructured, that is generated on a day-to-day basis. But it is not the amount of data that is important. It is what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves.
- Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.
- Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate to deal with them. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy.
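To make the petabyte scale mentioned under "Size" more concrete, here is a minimal sketch of converting between binary data units. The names `UNITS`, `to_bytes` and `convert` are illustrative only and do not come from the lecture; the sketch assumes the 1,024-based convention used in the slide.

```python
# Binary data units, each 1,024 times larger than the previous one.
# (Assumption: the slide's "1 PB = 1,024 TB" convention applies throughout.)
UNITS = ["B", "KB", "MB", "GB", "TB", "PB"]


def to_bytes(value, unit):
    """Convert a value expressed in `unit` to bytes."""
    return value * 1024 ** UNITS.index(unit)


def convert(value, from_unit, to_unit):
    """Convert a value between any two units listed in UNITS."""
    return to_bytes(value, from_unit) / 1024 ** UNITS.index(to_unit)


if __name__ == "__main__":
    print(convert(1, "PB", "TB"))  # 1024.0
    print(convert(1, "PB", "GB"))  # 1048576.0
```

Read the other way, a single petabyte would fill roughly a thousand one-terabyte disks, which is why adding CPUs or memory to one database server eventually stops being enough.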
The Vs:
- Velocity
- Volume
- Variety
- Veracity
- Variability
- Visualization
- Value

Big data in action: UPS
- UPS stores a large amount of data, much of which comes from sensors in its vehicles (GPS).
- Data analytics and data science drive ORION (On-Road Integrated Optimization and Navigation), the world's largest operations research project.
- Savings of more than 8.4 million gallons of fuel.
- 85 million miles cut off daily routes.
- UPS estimates that saving one daily mile per driver saves the company $30 million.

Big data in action: Walmart
"We want to know what every product in the world is. We want to know who every person in the world is. And we want to have the ability to connect them together in a transaction." - Neil Ashe, CEO of Global E-commerce at Walmart
- Walmart collects 2.5 petabytes of information from 1 million customers across 6,000 stores.
- Its big data system (Kosmix) supports pricing strategies and advertising campaigns.
- Reported impact: 30% on revenue, and online sales increased by 40%.

Big data in quotes:
- "Without big data analytics, companies are blind and deaf, wandering out onto the web like deer on a freeway." - Geoffrey Moore
- "The world is one big data problem." - Andrew McAfee
- "Data is the new science. Big Data holds the answers." - Pat Gelsinger
- "With too little data, you won't be able to make any conclusions that you trust. With loads of data you will find relationships that aren't real… Big data isn't about bits, it's about talent." - Douglas Merrill

Big data market forecast: The big data market is expected to cross $50 billion by 2017.

Big data jobs trend: IBM, Cisco and Oracle together advertised 26,488 open positions that required big data expertise in the last twelve months. The median advertised salary for professionals with big data expertise is $124,000 a year (about 465,012 SAR a year, or 465,012 / 12 = 38,751 SAR a month).

How to solve big data: Hadoop is a big data analysis engine.

What is Hadoop:
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.

Hadoop history:
Nutch is a mature, production-ready web crawler that enables fine-grained configuration and relies on Apache Hadoop data structures, which are well suited to batch processing. Hadoop itself began as part of the Nutch project.

Why Hadoop is important:
- Ability to store and process huge amounts of any kind of data, quickly. With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that is a key consideration.
- Computing power. Hadoop's distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.
- Fault tolerance. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.
- Flexibility. Unlike traditional relational databases, you don't have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.
- Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data.
- Scalability. You can easily grow your system to handle more data simply by adding nodes. Little administration is required.

Scalability:
- Horizontal scaling means that you scale by adding more machines to your pool of resources.
- Vertical scaling means that you scale by adding more power (CPU, RAM) to an existing machine.

How is Hadoop being used?
Going beyond its original goal of searching millions (or billions) of web pages and returning relevant results, many organizations are looking to Hadoop as their next big data platform. Popular uses today include: [figure]

Hadoop ecosystem:
- Data ingestion: Sqoop, Flume, Storm, NiFi, Kafka
- Querying layer: Hive, Impala
- Processing layer: MapReduce, Spark, Pig
- Storage layer: HDFS, HBase, HCatalog
- Cluster monitoring, provisioning and management

Hadoop | Data Ingestion

Sqoop: Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases.

Flume: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms.

Storm: Storm is a real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. A Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation however needed.

NiFi: An easy to use, powerful, and reliable system to process and distribute data. Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic in a web-based user interface.

Kafka: Kafka is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.

Fluentd: A large-scale log aggregator with analytics. Fluentd is an open source data collector for a unified logging layer. Fluentd allows you to unify data collection and consumption for better use and understanding of data.
Samza: Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.

Hadoop | Data Storage Layer

HDFS: The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines. Hadoop and HDFS were derived from the Google File System (GFS) paper.

HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).

HCatalog: A metadata and table management system for Hadoop. It shares the metadata with other tools like MapReduce, Pig and Hive. It provides one consistent data model for all Hadoop tools along with a shared schema.

Hadoop | Data Processing Layer

MapReduce: MapReduce is the heart of Hadoop. It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster with a parallel, distributed algorithm.

Pig: A SQL-like scripting language and execution environment for creating complex MapReduce transformations. Functions are written in Pig Latin (the language) and translated into executable MapReduce jobs. Pig also allows the user to create user-defined functions (UDFs) using Java.

Spark: An in-memory data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. It runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

Hadoop | Data Querying Layer

Hive: A distributed data warehouse built on top of HDFS to manage and organize large amounts of data. Hive provides a query language based on SQL semantics (HiveQL), which is translated by the runtime engine to MapReduce jobs for querying the data.
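The MapReduce paradigm described above, which Pig and Hive both compile down to, can be sketched in plain Python with a word count. This is a single-process illustration of the map, shuffle and reduce steps, not Hadoop's actual API; the function names `map_phase`, `shuffle_phase`, `reduce_phase` and `word_count` are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: collapse each key's list of values into one total."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    # On a real cluster each document (or input split) would be mapped on a
    # different node; here the map calls simply run one after another.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    docs = ["big data is big", "Hadoop stores big data"]
    print(word_count(docs))
```

Because each map call only looks at its own document and each reduce call only looks at one key, both steps can be spread over many machines, which is where the scalability comes from.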
Impala: An open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop.

Hadoop | Management Layer

Ambari: An intuitive, easy-to-use Hadoop management web UI. Apache Ambari was donated by the Hortonworks team. It is a powerful and pleasant interface for Hadoop and other typical applications from the Hadoop ecosystem.

Big data existing solutions:
- Ambari: A web interface for managing, configuring and testing Hadoop services and components. It is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance.
- Mahout: A scalable machine learning and data mining library.
- ZooKeeper: A high-performance coordination service for distributed applications.
- Avro: A data serialization system.
- Chukwa: A data collection system for monitoring large distributed systems.
- Oozie: A Java web application used to schedule Apache Hadoop jobs.
- Pig: A platform for manipulating data stored in HDFS via a high-level language called Pig Latin. It does data extractions, transformations and loading, and basic analysis in batch mode.
- Hive: A data warehousing and SQL-like query language that presents data in the form of tables. Hive programming is similar to database programming.
- Spark: An open-source cluster computing framework with in-memory analytics.
- MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines.
- YARN: A framework for job scheduling and cluster resource management.
- HDFS: A distributed file system that stores large files across multiple machines.
- HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries.
- HCatalog: A table and storage management layer for Hadoop that enables Hadoop applications (Pig, MapReduce, and Hive) to read and write data in tabular form as opposed to files.

Other Apache projects:
- Apache Flink: An open source platform for distributed stream and batch data processing.
- Apache Falcon: A feed management and data processing platform.
- Apache Ranger: A framework to enable, monitor and manage comprehensive data security across the Hadoop platform.
- Apache Tez: A framework to develop a generic application that can be used to process complex data-processing tasks.
- Apache Tika: A toolkit that detects and extracts metadata and text from over a thousand different file types.
- Apache Parquet: A columnar storage format available to any project in the Hadoop ecosystem.
- Apache Zeppelin: A web-based notebook that enables interactive data analytics.
- Apache Drill: A schema-free SQL query engine for Hadoop, NoSQL and cloud storage.

Top leading big data companies: [figure]

The Apache Software Foundation (ASF) is an American non-profit corporation that supports Apache projects.

How to start:
1. Identify business use cases tied to business outcomes, metrics and your big data roadmap.
2. Identify big data champions from both the business and IT sides of your organization.
3. Select infrastructure, tools and architecture for your big data POC/implementation.
4. Staff the project with the right big data skills or a strategic big data implementation partner.
5. Run your project/POC in sprints or short projects with tangible and measurable outcomes.
6. Try to scale your successful POC up to test your logic implementation against the big dataset.

What can I do now?
- Certification Path | Administration: Cloudera, Hortonworks
- Certification Path | Development: Cloudera, Hortonworks
- Certification Path | Data Science: Cloudera, Hortonworks

Questions

Thanks for your time