Big Data Part 1 Lectures: Getting Started with BD (Summer 2021) PDF
Document Details
University of Petra
2021
Qasem Abu Al-Haija
Summary
This document is an overview of Big Data, part of lectures, by Qasem Abu Al-Haija, PhD, Assistant Professor. It covers topics like getting started with big data, Hadoop system, Spark system, and distributed computing.
Full Transcript
Big Data Part #1 Lectures: Getting Started with BD. Summer 2021. Qasem Abu Al-Haija, PhD, Assistant Professor, Department of Data Science & Artificial Intelligence, University of Petra, Amman, Jordan.

Outline
Getting Started with BD
Overview of Hadoop System
Overview of Spark System
Distributed Computing
Big Data_Dr. Qasem Abu Al-Haija 2

Getting Started with BD

Data All Around
Lots of data is being collected and warehoused:
Web data, e-commerce
Financial transactions, bank/credit transactions
Online trading and purchasing
Social networks

Big Data Fundamentals
Google processes 20 PB a day (2008)
Facebook has 60 TB of daily logs
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
Cost of 1 TB of disk: $35
Time to read 1 TB from disk: ~3 hours (at 100 MB/s)

Big Data (BD) and the Three V's
Data exceeds the processing capacity of conventional database systems.
Volume: size is too big; starts at terabyte scale (10^12 bytes) and has no upper limit.
Velocity: latency of data processing relative to growing demand (30 KB/s to 30 GB/s).
Variety: heterogeneous; diversity of sources, formats, quality, and structures.
3Vs diagram: sources of BD.

Big Data and Storage Units
Review of data storage measurement units:
Byte (B) = 8 bits
Kilobyte (KB) = 1024 bytes
Megabyte (MB) = 1024 kilobytes
Gigabyte (GB) = 1024 megabytes
Terabyte (TB) = 1024 gigabytes
Petabyte (PB) = 1024 terabytes
Going from small to large, multiply by 2^10 at every step; going from large to small, multiply by 2^-10.
Remember: bit = b = 0 or 1; 1024 = 2^10.

Life Cycle of BD Problem Solving
Data must first be captured, and then organized and integrated. After that, data can be analyzed based on the problem being addressed. Finally, management takes action based on the outcome of that analysis.
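The 1024-per-step relationship in the storage-unit table can be sketched in a few lines of Python (an illustrative sketch; the function names are my own, not from the lectures):

```python
# Each step from one storage unit to the next multiplies by 2**10 = 1024.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB"]

def to_bytes(value, unit):
    """Convert a value in the given unit to bytes (large -> small: x 2**10 per step)."""
    return value * (1024 ** UNITS.index(unit))

def from_bytes(n_bytes, unit):
    """Convert a byte count to the given unit (small -> large: x 2**-10 per step)."""
    return n_bytes / (1024 ** UNITS.index(unit))

print(to_bytes(1, "TB"))                      # 1 TB expressed in bytes
print(from_bytes(to_bytes(20, "PB"), "TB"))   # Google's 20 PB/day expressed in TB
```

For instance, eBay's 6.5 PB of user data is `to_bytes(6.5, "PB")` bytes, and converting it back with `from_bytes` recovers 6.5.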
Also, five questions for BD problem solving (5 HOWs):
How much data will my organization need to manage today and in the future?
How often will my organization need to manage data in real time or near real time?
How much risk can my organization afford? Is my industry subject to strict security, compliance, and governance requirements?
How important is speed to my need to manage data?
How certain or precise does the data need to be?

BD Architecture
BD architecture must include a variety of services that enable companies to make use of diverse/huge data sources in a fast and effective manner.

BD Architecture - Main Components
Interfaces and feeds
o BD relies on picking up lots of data from lots of sources, so open Application Programming Interfaces (APIs) will be core to any BD architecture.
o An API is a software intermediary that allows two applications to talk to each other. Example: each time you use an app like Facebook, send an instant message, or check the weather on your phone, you are using an API.
Redundant physical infrastructure
o Since BD uses huge volumes of data, the data may be physically stored in many different locations and linked together through networks, the use of a distributed file system, and various big data analytic tools and applications.
o Example: storage redundancy - keep multiple copies of the data on different machines (e.g., use the cloud).
Security infrastructure
o You will need to take into account who is allowed to see your data and under what circumstances they are allowed to do so.
o You will need to be able to verify the identity of users as well as protect the privacy of your data. Example: use of firewalls.
Firewalls: security systems to defend communications from external cyber-attacks. They are installed at network edges to monitor traffic flow by filtering all incoming data packets; the filtration process either "allows," "denies," or "drops/resets" each incoming packet.

Operational data (OD) sources
o Traditionally, an OD source consisted of highly structured data managed by the line of business in a relational database via the SQL language.
o But in BD, OD encompasses a broader set of data sources, including structured, unstructured, and semi-structured data types: either machine-generated (created by a machine, with no human intervention) or human-generated (created by humans in interaction with computers).
o To cope with the diverse data types, NoSQL (Not only SQL) data management tools for the BD world can process different data types, including document, graph, key-value, and columnar (wide-column) database architectures:
Document store: data is stored hierarchically in (JSON/XML) format. Application: CouchDB.
Key-value store: the simplest; data is represented as key-value pairs. Application: DynamoDB.
Wide-column store: similar to key-value, but allows a very large number of columns. Application: Cassandra.
Graph store: data is stored in a graph structure (nodes, edges, data properties). Application: Neo4j.

Operational Data Types
Structured data: data that has a defined length and format, such as numbers, dates, and strings. It is stored and manipulated in a traditional relational database management system (RDBMS), and you can query it using a language like Structured Query Language (SQL). It is collected from "traditional" sources such as customer relationship management (CRM) data, operational enterprise resource planning (ERP) data, and financial data.
Unstructured data: data that does not follow a specified format and does not fit into a structured database format.
It is commonly generated from human activities such as audio, video, images, and documents. Usually, enterprise data is composed of 20% structured data and 80% unstructured data.
Semi-structured data: data that does not fit into a structured database but is structured by tags that are useful for creating a form of order and hierarchy in the data, such as XML, email, and webpages. Example of simple label/value pairs: =Jones, =Jane, =Sarah.

Sources of Structured Data
Machine-generated structured data:
o Sensor data: Radio-Frequency ID (RFID) tags, smart meters, medical devices, and Global Positioning System (GPS) data.
o Web log data: when servers, applications, and networks operate, they capture all kinds of data about their activity.
o Point-of-sale data: when the cashier swipes the bar code of any product that you are purchasing, all the data associated with the product is generated. Now consider all the products across all the people!
o Financial data: predefined rules that automate processes; stock trading data is a good example of this.
Human-generated structured data:
o Input data: any data entered by a human into a computer, such as name, age, income, survey responses, etc.
o Click-stream data: data generated every time you click a link on a website.
o Gaming-related data: every move you make in a game can be recorded.

Sources of Unstructured Data
Machine-generated unstructured data:
o Satellite images: this includes weather data and data captured by satellite surveillance imagery (Google Earth).
o Scientific data: this includes seismic imagery, atmospheric data, and high-energy physics.
o Photographs and video: this includes security, surveillance, and traffic video.
o Radar or sonar data: this includes vehicular, meteorological, and oceanographic seismic profiles.
Human-generated unstructured data:
o Text internal to your company: think of all the text within documents, logs, survey results, and e-mails.
o Social media data: this data is generated from social media platforms such as Facebook or Twitter.
o Mobile data: this includes data such as text messages and location information.
o Website content: this comes from any site delivering unstructured content, like YouTube or Instagram.

Understanding the role of relational databases (RDB) in BD
In an RDB, data is stored in tables, and the database contains a schema.
Schema: a structural representation that defines database elements (tables, columns, relationships).
The data is stored in columns (data attributes) and rows (data records).
Example: a schema defines a simple database of two tables, where Table 1 stores product info and Table 2 stores demographic info. Tables are updated (add, delete, read, queried) via SQL, and tables are queried using a common key (relationship) - for example, an SQL query to determine the gender of customers who purchased a specific product.
PostgreSQL: the most common available open-source RDB.

Understanding the role of a CMS in BD management
A Content Management System (CMS) is computer software that can manage the complete life cycle of content, including publishing, editing, modifying, organizing, and deleting content from a central interface.
A CMS deals with unstructured data, including web content, document content, and others. CMSs are often used to run websites containing blogs, news, and shopping, and typically aim to avoid the need for hand coding. They require support from unstructured-data technologies like Hadoop, MapReduce, and streaming.
CMSs are essential for BD management since they provide low cost and workflow management, easy customization, ease of use, and good search engine optimization.

Non-Real-Time and Real-Time Processing for BD
Real-time processing: requires a continual input, constant processing, and steady output of data.
Great examples include data streaming, radar systems, and bank ATMs. Spark is a great tool to use for real-time processing.
Non-real-time (batch) processing: involves three separate steps. First, data is collected, usually over a period of time. Second, the data is processed by a separate program. Third, the data is output. Great examples include payroll and billing activities, which usually occur on monthly cycles. MapReduce is a great tool to use for batch processing and analytics.

Organizing data services and tools
What to do with this growing amount and variety of sources?
Aggregation and statistics: data warehousing and OLAP (online analytical processing), ...
Indexing, searching, and querying: keyword-based search, pattern matching (RDF/XML), ...
Knowledge discovery: data mining, statistical modeling, prediction, classification, ...

Therefore, data is a very critical asset to study, since it helps business leaders make decisions based on facts, statistical numbers, and trends. Due to this growing scope of data, Big Data Science has emerged as a multidisciplinary field.

Very Common Tools for BD
Big Table:
o Developed by Google as a distributed storage system intended to manage highly scalable structured data with huge volumes.
o Unlike a traditional relational database model, Big Table is a sparse, distributed, persistent multidimensional sorted map.
MapReduce:
o Developed by Google as a way of efficiently executing a set of functions against a large amount of data in batch mode.
o "map" distributes tasks across a large number of systems, balancing loads and managing recovery from failures.
o The "reduce" component aggregates all the elements back together to provide a result.
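The map/reduce flow described above can be sketched in plain Python as a single-machine toy word count (function names are illustrative; this shows the idea only, not Google's or Hadoop's actual implementation):

```python
from collections import defaultdict

def map_phase(documents):
    """'map': tag each word in each document with a (key, value) pair -- here (word, 1)."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """'reduce': aggregate each key's values back together into a single result."""
    return {key: sum(values) for key, values in grouped.items()}

docs = ["big data big clusters", "big data tools"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # word frequencies across all documents
```

The shuffle step, which groups values by key between map and reduce, is the part a real framework performs across the network, which is why the map tasks can run independently on many machines.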
Very Common Tools for BD
Hadoop:
o An Apache-managed software framework derived from MapReduce and Big Table to support massive data volumes.
o Hadoop allows applications based on MapReduce to run on large clusters of commodity hardware to speed computations and hide latency.
Spark:
o An Apache-managed open-source distributed processing system used for BD workloads.
o It utilizes in-memory computing as a fast and general large-scale data processing engine.
o Spark was created for big data and is MUCH faster than Hadoop alone.
HDFS & AWS: tools for BD storage
o Used to store huge data via a nonhierarchical data storage system (data lakes).
o HDFS: Hadoop Distributed File System. AWS: Amazon Web Services S3.
o The HDFS platform uses clusters of commodity servers to store big data.
o The AWS platform is a cloud architecture that is available for storing big data.
Python & R: languages for BD programming
o R is specifically tailored for statistical computing/analysis and provides newer adaptations for big data like MapR and RHadoop.
o Python is mostly used for BD development and processing; it is the most compatible programming language with Hadoop.

Approaches for BD Analytics (Traditional & Advanced)
Analytical data warehouses and data marts
o Provide compression, multilevel partitioning, and a parallel processing architecture.
Big data analytics
o The capability to manage and analyze petabytes of data enables companies to deal with clusters of information that could have an impact on the business.
Reporting and visualization
o These are tools for looking at the context of how data is related and the impact of those relationships on the future.
Big data applications
o Rely on huge volumes, velocities, and varieties of data to transform market behavior.
o In healthcare: a BD app can monitor premature infants to determine when intervention is needed.
o In manufacturing: a BD app
can prevent a machine from shutting down during a production run.
o In traffic management: a BD app can reduce the number of traffic jams on busy city highways to decrease accidents, save fuel, and reduce pollution.

More about Data Analytics (DA)
DA concerns converting raw data into actionable insights.

BD Analytics vs. BI Analytics
What is Business Intelligence (BI)? BI is concerned with performing descriptive analysis of data, using technology and skills to make informed/intelligent business decisions. The set of tools used for BI collects, governs, and transforms data.

The Power of BD
Companies have always had to deal with lots of data in lots of forms. The change that BD brings is what you can do with that information. If you have the right technology in place, you can use BD to anticipate and solve business problems and react to opportunities. With BD, you can analyze data patterns to change everything: manage cities, prevent failures, conduct experiments, manage traffic, improve customer satisfaction, enhance product quality, etc.
Top BD applications.

Overview of Hadoop System
https://www.youtube.com/watch?v=aReuLtY0YMI&t=304

More about Hadoop System
BD cannot be handled using a traditional RDBMS due to the 3Vs, so data engineers turn to the Hadoop data processing platform (the typical BD solution). Hadoop divides BD into smaller datasets that are manageable to analyze. More precisely, Hadoop is an open-source framework written in Java that allows distributed processing of big datasets across a cluster of commodity hardware.
Open source: source code is freely available; it may be redistributed and modified.
Distributed processing: data is distributed on multiple nodes to be processed independently.
Cluster: multiple nodes (machines) connected together via a LAN (Local Area Network).
Commodity hardware: economic and affordable machines, typically low-performance hardware.

More about Hadoop System: Frameworks
The Hadoop platform (v2 or later) is composed of three main frameworks:
MapReduce for bulk/batch data processing, implemented in the Java language.
YARN for resource management, implemented in the Java language.
HDFS for data storage; SQL is used to query data from HDFS, with tools like Hive or Spark SQL.

More about Hadoop System: Hadoop Nodes
More about Hadoop System: YARN + HDFS
More about Hadoop System: MapReduce
Map the data: tag the data by (key, value) pairs.
Reduce the pairs into smaller sets of data using aggregation operations.
More about Hadoop System: Putting it all together

Overview of Apache Spark Ecosystem
Apache Spark is an in-memory distributed computing application: a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing. It sits on top of HDFS and acts as a secondary processing framework, to be used in parallel with the large-scale batch processing work that is done by MapReduce. Because Spark processes data in micro-batches, with 3-second cycle times, it can be used to significantly decrease time-to-insight in cases where time is essential.

Spark In-Memory Computing

Main Components of Spark
1. Spark SQL: a built-in SQL package to work with and query structured data using Spark. Alternatively, you can use HiveQL with the Spark processing engine.
2. GraphX: the GraphX library is how you store and process network data from/within Spark.
3. Streaming: the Streaming module is where BD stream processing takes place.
This module breaks a continuously streaming data source into much smaller data streams, called DStreams (discretized streams). Because DStreams are small, these batch cycles can be completed within three seconds, which is why it is called micro-batch processing.
4. MLlib: the MLlib module is where you analyze data, generate statistics, and deploy machine learning algorithms within the Spark environment. MLlib has APIs for Java, Scala, Python, and R. The MLlib module allows you to build machine learning models in Python or R but pull data directly from HDFS (rather than via an intermediary repository). This helps reduce the reliance that data scientists sometimes have on data engineers. Furthermore, computations are known to be up to 100 times faster when processed in-memory using Spark as opposed to the standard MapReduce framework.
https://www.youtube.com/watch?v=2PVzOHA3ktE&t=43s

Distributed Computing System (DCS)

From a Single Computer to DS
1945-1985: computers were large and expensive and operated independently.
1985-now: two advances in technology: powerful microprocessors and high-speed computer networks.
Result: distributed systems, putting together computing systems composed of a large number of computers connected by a high-speed network.

Distributed Computing (DC)
DC is the field of computing science that studies distributed systems, specifically the use of distributed systems to solve computational problems. DC technology was introduced 50 years ago in computer science research to solve complex problems without the expense of massive computing systems. There is no single DC model, as computing resources can be distributed in many ways:
o Example 1: you may distribute a set of programs on the same physical server and use messaging services to enable them to communicate and pass information.
o Example 2: it is also possible to have many different systems or servers, each with its own memory, that can work together to solve one problem.

Distributed Computing Systems (DCS)
You can think of a DS as breaking down an application into individual computing agents, distributed over a network, that work together on a cooperative task.
Motivation for DC:
Scalability: can solve larger problems without larger computers.
Openness & heterogeneity: applications and data may be difficult to relocate and reconfigure.
Fault tolerance: redundant processing agents for system availability.
Divide and conquer: the "Work" is partitioned into parts w1, w2, w3, each handled by a "worker"; the workers' results r1, r2, r3 are combined into the "Result."

What is a Distributed System (DS)?
DS: a collection of autonomous hosts that are connected through a computer network. Each host executes components and operates a distribution middleware. The middleware enables the components to coordinate their activities, so that users perceive the system as a single, integrated computing facility.

Middleware Layer (MWL)
A set of tools that provide a uniform means and style of access to system resources across all platforms:
o Enables programmers to use the same method to access data.
o Enables programmers to use standard programming interfaces and protocols.
o There is both a client and a server component to middleware.
o It provides uniform access to different systems.

Why Distributed Systems?
Constructing information-sharing distributed systems from diverse sources that are heterogeneous, networked, physically disparate, and multi-vendor.

Centralized vs. Distributed Computing
Centralized systems:
o Have non-autonomous components.
o Are often built using homogeneous technology.
o Multiple users share the resources of a centralized system at all times.
o Have a single point of control and of failure.
Distributed systems:
o Have autonomous components.
o May be built using heterogeneous technology.
o Are executed in concurrent processes.
o Have multiple points of failure.

Parallel versus Distributed Computing
Parallel computing systems consist of multiple processors that communicate with each other using a shared memory.
Distributed computing systems contain multiple processors connected by a communication network.

Distributed System Goals
Resource accessibility: the existing resources in a distributed system can be accessed, or remotely accessed, across the multiple computers in the system.
Openness: concerned with extensions and improvements of distributed systems. The distributed system must be open in terms of hardware and software.
Scalability: concerned with how the DS handles growth as the number of users of the system increases. Mostly we scale the DS by adding more computers to the network.
Transparency: the DS should be perceived by users and programmers as a whole rather than as a collection of cooperating components. This includes access, location, concurrency, etc.
Heterogeneity: DS components have variety and differences in networks, computer hardware, operating systems, programming languages, and implementations by different developers.
Concurrency: concurrent execution of activities over different components running on multiple machines as part of a DS (reduces latency and increases the throughput of the DS).
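The divide-and-conquer pattern from the earlier slides (partition the work, run workers concurrently, combine the results) can be sketched with Python threads (a minimal single-machine illustration; the names are my own):

```python
from concurrent.futures import ThreadPoolExecutor

def worker(partition):
    """Each worker computes a partial result r_i from its partition w_i."""
    return sum(partition)

def divide_and_conquer(data, n_workers=3):
    """Partition the work, run the workers concurrently, then combine their results."""
    chunk = (len(data) + n_workers - 1) // n_workers
    partitions = [data[i:i + chunk] for i in range(0, len(data), chunk)]  # "Partition"
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(worker, partitions))              # "workers"
    return sum(partial_results)                                           # "Combine"

print(divide_and_conquer(list(range(1, 101))))  # same answer as one machine: 5050
```

The concurrent workers are what reduce latency and increase throughput; the combine step plays the role of the "Result" box in the diagram.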
Advantages and Disadvantages of DCS
Advantages: shareability, expandability, local autonomy, improved performance, improved reliability & availability, potential cost reductions.
Disadvantages: network reliance, complexities, security, multiple points of failure.

Example 1: Distribution of the Domain Name System (DNS)
Instead of one centralized service, divide the work into parts and distribute them geographically, with each part handling one aspect of the job. The DNS namespace is organized as a tree of domains. Each domain is divided into zones; names in each zone are handled by a different name server. (The WWW consists of many, perhaps millions, of servers.) An example is dividing the DNS name space into zones.

Example 2: Massively Multiplayer Online Games (MMOG)
A very large number of users sharing a virtual world.

Example 3: Detection of an Escaped Animal in a Zoo
Imagine a zoo with an array of security cameras. Each security camera records video footage in a digital format. The cameras send their video data to a computer cluster placed at zoo headquarters. That cluster runs video analysis algorithms to detect escaped animals. The cluster also sends the video data to a cloud computing server, which analyzes terabytes of video data to discover historical trends.

Types of Distributed Computing Systems (INF5040, ifi/UiO)
Cluster computing systems, grid computing systems, and cloud computing systems.

Type 1: Cluster Computing Systems
A collection of similar PCs, closely connected, all running the same OS: a collection of computing nodes plus a master node, where the master runs middleware for parallel execution and management. A cluster consists of nodes, a network, an operating system, and cluster middleware. The nodes consist of the master and the other computing nodes. The master node receives client requests and handles the job queue and the allocation of nodes to a particular parallel program to run on the cluster. The computing nodes may only need to run a standard operating system.
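The master node's role (receive client requests, manage the job queue, allocate work to computing nodes) can be imitated on one machine with threads and a shared queue (a toy sketch, not real cluster middleware; all names are illustrative):

```python
import queue
import threading

job_queue = queue.Queue()          # the master's job queue
results = []
results_lock = threading.Lock()

def computing_node(node_id):
    """A computing node repeatedly takes a job from the master's queue and runs it."""
    while True:
        job = job_queue.get()
        if job is None:            # sentinel: no more work for this node
            job_queue.task_done()
            return
        with results_lock:
            results.append((node_id, job * job))  # stand-in for real processing
        job_queue.task_done()

# The "master": start the nodes, enqueue client requests, then signal shutdown.
nodes = [threading.Thread(target=computing_node, args=(i,)) for i in range(3)]
for node in nodes:
    node.start()
for job in range(10):
    job_queue.put(job)
for _ in nodes:
    job_queue.put(None)
job_queue.join()
for node in nodes:
    node.join()

print(sorted(r for _, r in results))  # all 10 jobs completed across the 3 nodes
```

The queue is what decouples the master from the nodes: nodes pull work when free, which gives a simple form of load balancing.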
Type 2: Grid Computing Systems
A federation of autonomous and heterogeneous computer systems (hardware, OS, ...) in different administrative domains. A layered architecture for grid computing systems:
Fabric layer: provides interfaces to local resources at a specific site to allow sharing of resources.
Connectivity layer: provides communication protocols for supporting grid transactions that span the usage of multiple resources.
Resource layer: responsible for managing a single resource, providing configuration information, creating processes, or reading data.
Collective layer: handles access to multiple resources, resource discovery, and the allocation and scheduling of tasks onto multiple resources.
Applications: make use of the grid computing environment.

Type 3: Distributed Computing as a Utility (Cloud Computing)
View: distributed resources as a commodity or utility.
o Resources are provided by service suppliers and effectively rented rather than owned by the end user.
The term cloud computing captures this vision of computing as a utility. (Diagram: clients access SaaS offerings such as Salesforce CRM and LotusLive, PaaS offerings such as Google App Engine, and IaaS resources over the Internet.)
Cloud service types:
Infrastructure as a Service (IaaS) offers essential compute, storage, and networking resources on demand, on a pay-as-you-go basis.
Software as a Service (SaaS) allows users to connect to and use cloud-based apps over the Internet. Common examples are email, calendaring, and office tools (such as Microsoft Office 365).
Platform as a Service (PaaS) is a complete development and deployment environment with resources to enable you to deliver everything from simple cloud-based apps to sophisticated, cloud-enabled enterprise applications.

Why We Need DC for BD
Some BD requires complex analysis, so the data is moved to an external service where lots of spare resources are available for processing.
It is economically infeasible to buy enough computing resources to handle emerging requirements and analytics, so we need to take advantage of existing hardware resources by automating processes like load balancing and optimization across a huge cluster of nodes. Recent BD software includes built-in rules that can treat all the nodes of a DC system as one big pool of computing, using virtualization technology.

Getting Performance Right
Just having a faster computer isn't enough to ensure the right level of performance to handle big data. You need to be able to distribute components of your big data service across a series of nodes. Simply put: use distributed computing systems to handle big data with a high level of performance.
An evolution of DC relies on jobs being divided for distributed execution of tasks. In distributed computing:
o A node is an element contained within a cluster of systems or within a rack.
o A node typically includes CPU, memory, and some kind of disk.
o In a BD environment, these nodes are typically clustered together to provide scale.

Getting Performance Right - Two Examples
Processing a growing data source of BD:
o This is the case when you start with a BD analysis and continue to add more data sources.
o To accommodate the growth, an organization simply adds more nodes to a cluster so that it can scale out to accommodate growing requirements.
o However, it isn't enough to simply expand the number of nodes in the cluster; rather, it is important to be able to send parts of the big data analysis to different physical environments.
Executing different BD algorithms in parallel:
o This is the case when you want to execute many different algorithms in parallel, even within the same cluster, to achieve the speed of analysis required.
o To accommodate this execution, it is possible to distribute BD analysis across networks to take advantage of available capacity, based on requirements for performance.
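Scaling out by adding nodes can be illustrated with simple hash partitioning, a common scheme for spreading records across a cluster (illustrative only; real systems often use consistent hashing instead, to limit data movement when nodes are added):

```python
import hashlib

def assign_node(record_key, n_nodes):
    """Hash-partition a record key onto one of n_nodes."""
    digest = hashlib.md5(record_key.encode()).hexdigest()
    return int(digest, 16) % n_nodes

records = [f"record-{i}" for i in range(1000)]

for n_nodes in (3, 6):   # scale out by adding nodes to the cluster
    load = [0] * n_nodes
    for r in records:
        load[assign_node(r, n_nodes)] += 1
    print(n_nodes, "nodes ->", load)  # per-node record counts shrink as nodes are added
```

Doubling the node count roughly halves each node's share of the records, which is the scale-out effect the slides describe.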
Getting Performance Right - Possible Solutions
Using cloud computing as a DC system for BD analytics:
o Cloud computing is an excellent example of how DC can make BD operate successfully and scale.
o It offers fast networks and inexpensive clusters of hardware that can be combined to increase performance.
o These clusters are supported by software automation that enables dynamic scaling and load balancing.
Using MapReduce computing as a DC system for BD analytics:
o MapReduce is an excellent example of how DC can make BD operationally visible and affordable.
o Combining distributed computing, improved hardware systems, and practical solutions such as MapReduce and Hadoop is changing data management in profound ways.