Realtime and Big Data Analytics Lecture 1 PDF
Summary
This document presents an introduction to realtime and Big Data analytics. It covers fundamental concepts and terminology, such as datasets, data analysis, and data analytics. It also explains the characteristics of Big Data and the different types of data.
Full Transcript
Realtime and Big Data Analytics
Lecture 1: Understanding Big Data

What is Big Data?

Big Data is a field dedicated to the analysis, processing, and storage of large collections of data that frequently originate from disparate sources.

Big Data addresses distinct requirements:
  Combine multiple unrelated datasets.
  Process large amounts of unstructured data.
  Harvest hidden information in a time-sensitive manner.

Agenda
  Concepts and terminology
    Datasets
    Data analysis
    Data analytics
  Big Data characteristics
  Different types of data

Concepts and terminology: Datasets

A dataset is a collection or group of related data. Each dataset member (“datum”) shares the same set of attributes or properties as others in the same dataset.

Examples:
  Tweets stored in a flat file.
  A collection of image files in a directory.
  An extract of rows from a database table stored in a CSV formatted file.
  Historical weather observations that are stored as XML files.

Concepts and terminology: Data analysis

Data analysis is the process of examining data to find facts, relationships, patterns, insights and/or trends. The overall goal of data analysis is to support better decision-making.

Example: the analysis of ice cream sales data.
  Determine how the number of ice cream cones sold is related to the daily temperature.
  The results of such an analysis would support decisions related to how much ice cream a store should order in relation to weather forecast information.

Concepts and terminology: Data analytics

Data analytics is a discipline that includes the management of the complete data lifecycle, which encompasses collecting, cleansing, organizing, storing, analyzing and governing data.

Examples:
  In business-oriented environments, data analytics results can lower operational costs and facilitate strategic decision-making.
  In the scientific domain, data analytics can help identify the cause of a phenomenon to improve the accuracy of predictions.
  In service-based environments, data analytics can help strengthen the focus on delivering high-quality services by driving down costs.

There are four general categories of analytics, distinguished by the results they produce: descriptive, diagnostic, predictive, and prescriptive analytics.

Concepts and terminology: Descriptive analytics

Descriptive analytics aims to answer questions about events that have already occurred. It contextualizes data to generate information.

Examples:
  What was the sales volume over the past 12 months?
  What is the number of support calls received as categorized by severity and geographic location?
  What is the monthly commission earned by each sales agent?

The operational systems are queried via descriptive analytics tools to generate static reports or dashboards.

[Figure: customer relationship management system, enterprise resource planning system, online transaction processing system]

Concepts and terminology: Diagnostic analytics

Diagnostic analytics aims to determine the cause of a phenomenon that occurred in the past, using questions that focus on the reason behind the event. The goal is to determine what information is related to the phenomenon in order to answer questions about why something has occurred.

Examples:
  Why were Q2 sales less than Q1 sales?
  Why have there been more support calls originating from the Eastern region than from the Western region?
  Why was there an increase in patient re-admission rates over the past three months?

Diagnostic analytics usually collects data from multiple sources and stores it in a structure that lets users perform interactive drill-down and roll-up analysis.
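
The drill-down and roll-up analysis mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the SALES records, the region and quarter dimensions, and the function names roll_up and drill_down are hypothetical and do not come from the lecture. Rolling up aggregates along one dimension; drilling down splits one aggregate by another dimension, which is how a question such as "Why were Q2 sales less than Q1 sales?" might be explored.

```python
from collections import defaultdict

# Hypothetical sales records (region, quarter, amount); not data from the lecture.
SALES = [
    ("East", "Q1", 120.0),
    ("East", "Q2", 80.0),
    ("West", "Q1", 100.0),
    ("West", "Q2", 105.0),
]


def roll_up(records, key_index):
    """Aggregate amounts along one dimension (0 = region, 1 = quarter)."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key_index]] += record[2]
    return dict(totals)


def drill_down(records, quarter):
    """Break one quarter's total down by region."""
    selected = [record for record in records if record[1] == quarter]
    return roll_up(selected, key_index=0)


by_quarter = roll_up(SALES, key_index=1)
print(by_quarter)               # totals per quarter (roll-up)
print(drill_down(SALES, "Q1"))  # Q1 split by region (drill-down)
print(drill_down(SALES, "Q2"))  # Q2 split by region (drill-down)
```

With this made-up data, the roll-up shows that Q2 is lower than Q1, and comparing the two drill-downs shows that the drop comes from the East region.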
[Figure: online analytical processing system]

Concepts and terminology: Predictive analytics

Predictive analytics aims to determine the outcome of an event that might occur in the future. Information is combined to build models that generate predictions about the future based on past events.

Examples:
  What are the chances that a customer will default on a loan if they have missed a monthly payment?
  What will be the patient survival rate if Drug B is administered instead of Drug A?
  If a customer has purchased Products A and B, what are the chances that they will also purchase Product C?

Predictive analytics uses large datasets of internal and external data and various data analysis techniques to provide user-friendly front-end interfaces.

[Figure: online analytical processing system]

Concepts and terminology: Prescriptive analytics

Prescriptive analytics builds upon the results of predictive analytics by prescribing actions that should be taken. The focus is not only on which prescribed option is best to follow, but also why.

Examples:
  Among three drugs, which one provides the best results?
  When is the best time to trade a particular stock?

Prescriptive analytics uses business rules and large amounts of internal and external data to simulate outcomes and prescribe the best course of action.
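
As a rough sketch of "simulate outcomes and prescribe the best course of action", the Python snippet below revisits the ice cream store from the data analysis example. Everything in it is an assumption made for illustration: the demand scenarios and their probabilities, the price and cost, the candidate order quantities, and the function names expected_profit and prescribe. It evaluates each candidate action against every scenario, applies one business rule, and picks the action with the highest expected profit.

```python
# Hypothetical demand scenarios: (probability, cones demanded).
SCENARIOS = [(0.25, 50), (0.5, 100), (0.25, 150)]
PRICE, COST = 3.0, 1.0             # assumed sale price and unit cost per cone
CANDIDATE_ORDERS = [50, 100, 150]  # assumed actions to choose from


def expected_profit(order_qty):
    """Expected profit of one order quantity across all scenarios."""
    total = 0.0
    for probability, demand in SCENARIOS:
        # Business rule: the store cannot sell more cones than it stocked.
        sold = min(order_qty, demand)
        total += probability * (sold * PRICE - order_qty * COST)
    return total


def prescribe(candidates):
    """Return the candidate action with the highest expected profit."""
    return max(candidates, key=expected_profit)


for quantity in CANDIDATE_ORDERS:
    print(quantity, expected_profit(quantity))
print("prescribed order:", prescribe(CANDIDATE_ORDERS))
```

The per-candidate profits also answer the "why" part: they show how much better the prescribed option is than the alternatives under the assumed scenarios.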
[Figure: online analytical processing system]

[Figure: Data analytics: summary]

Big Data characteristics
  Volume
  Velocity
  Variety
  Veracity
  Value

Big Data characteristics: Volume

[Figure: Volume of data/information created, captured, copied, and consumed worldwide. Source: https://www.statista.com/statistics/871513/worldwide-data-created/]

Unit        Power of 2  Power of 10  Example
Kilobyte    2^10        10^3         Text document
Megabyte    2^20        10^6         Picture, song
Gigabyte    2^30        10^9         Movie
Terabyte    2^40        10^12        Hard drive (the Library of Congress holds 300+ TB of data)
Petabyte    2^50        10^15        Rack of nodes
Exabyte     2^60        10^18        Data center
Zettabyte   2^70        10^21        All Internet data
Yottabyte   2^80        10^24        ???
Brontobyte  2^90        10^27
Geopbyte    2^100       10^30

Where do all those data come from?
  Online transactions, such as point-of-sale and banking.
  Scientific and research experiments, such as the Large Hadron Collider and Atacama Large Millimeter/Submillimeter Array telescope.
  Sensors, such as GPS sensors, RFIDs, smart meters and telematics.
  Social media, such as Facebook and X.
Can you think of more?

Big Data characteristics: Velocity

[Figure. Source: https://www.domo.com/data-never-sleeps]

Big Data characteristics: Variety

Big Data solutions need to support multiple formats and types of data.

[Figure labels: unstructured data, semi-structured data]

Big Data characteristics: Veracity

Veracity refers to the quality of data. Data with a high signal-to-noise ratio has more veracity.
  Signal is data that has value and leads to meaningful information.
  Noise is data that can’t be converted into information and thus has no value.

Examples:
  Online user registration data usually has a high signal-to-noise ratio.
  Blog postings usually have a low signal-to-noise ratio.
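
One way to make the signal-to-noise idea concrete is to count how many records can be turned into usable information and how many cannot. The Python sketch below does this for registration data. It is an assumption-laden illustration, not a method from the lecture: the RECORDS sample, the validation rule in is_signal (a non-empty name plus a plausibly shaped email address), and the function names are all hypothetical.

```python
import re

# Assumed rule for a plausibly shaped email address; deliberately simplistic.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Hypothetical user registration records; not data from the lecture.
RECORDS = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Lin", "email": "lin@example.org"},
    {"name": "", "email": "nobody@example.com"},  # missing name: treated as noise
    {"name": "Sam", "email": "sam@example.net"},
]


def is_signal(record):
    """Treat a record as signal when it passes the assumed validation rule."""
    email = record.get("email") or ""
    return bool(record.get("name")) and bool(EMAIL_PATTERN.match(email))


def signal_to_noise(records):
    """Ratio of usable records to unusable ones."""
    if not records:
        raise ValueError("no records to assess")
    signal = sum(1 for record in records if is_signal(record))
    noise = len(records) - signal
    return float("inf") if noise == 0 else signal / noise


print(signal_to_noise(RECORDS))
```

What counts as signal depends entirely on the questions being asked of the data, so the rule in is_signal would differ from one analysis to the next.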
Big Data characteristics: Value

Value is defined as the usefulness of data for an enterprise.

Can you think of other factors that may impact value?

Value is also impacted by data lifecycle-related concerns:
  How well has the data been stored?
  Were valuable attributes of the data removed during data cleansing?
  Are the right types of questions being asked during data analysis?
  Are the results of the analysis being accurately communicated to the appropriate decision-makers?

Big Data characteristics: Summary, the five Vs
  Volume
  Velocity
  Variety
  Veracity
  Value

Different types of data: Taxonomy
  By source: human-generated data, machine-generated data
  By format: structured data, unstructured data, semi-structured data, metadata

[Figure: Human-generated data]

[Figure: Machine-generated data]

Different types of data: Structured data

Structured data conforms to a data model or schema. It is used to capture relationships between different entities and is therefore most often stored in a relational database.

Examples:
  Banking transactions.
  Invoices.
  Customer records.
Can you think of more?

Different types of data: Unstructured data

Unstructured data does not conform to a data model or schema. It is either textual or binary and often conveyed via files that are self-contained and non-relational. The majority of data is unstructured.

Different types of data: Semi-structured data

Semi-structured data has a defined level of structure and consistency, but is not relational in nature. Instead, it is hierarchical or graph-based. This kind of data is commonly stored in text-based files.

Can you think of more?

Different types of data: Semi-structured data, XML (Extensible Markup Language)

Tove Jani Reminder Don't forget me this weekend!
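
To show what hierarchical, text-based data looks like in practice, the Python sketch below wraps the four values from the XML slide in markup and reads them back with the standard library. The element names (note, to, from, heading, body) are assumptions chosen for illustration, since the slide's markup is not preserved in this transcript; the function name note_to_dict is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Element names are assumed for illustration; only the four text values
# appear in the transcript.
NOTE_XML = """
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>
"""


def note_to_dict(xml_text):
    """Parse a flat XML document into a {tag: text} dictionary."""
    root = ET.fromstring(xml_text.strip())
    return {child.tag: child.text for child in root}


note = note_to_dict(NOTE_XML)
for tag, text in note.items():
    print(f"{tag}: {text}")
```

The tags describe each value without requiring a relational schema, which is what makes the format semi-structured.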
Different types of data: Semi-structured data, JSON (JavaScript Object Notation)

    {
      "firstName": "John",
      "lastName": "Smith",
      "isAlive": true,
      "age": 27,
      "address": {
        "streetAddress": "21 2nd Street",
        "city": "New York",
        "state": "NY",
        "postalCode": "10021-3100"
      },
      "phoneNumbers": [
        { "type": "home", "number": "212-555-1234" },
        { "type": "office", "number": "646-555-4567" }
      ],
      "children": [],
      "spouse": null
    }

Different types of data: Semi-structured data, CSV (comma-separated values)

    Year,Make,Model,Description,Price
    1997,Ford,E350,"ac, abs, moon",3000.00
    1999,Chevy,"Venture ""Extended Edition""","",4900.00
    1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
    1996,Jeep,Grand Cherokee,"MUST SELL! air, moon roof, loaded",4799.00

The same data shown as a table:

    Year  Make   Model                                   Description                        Price
    1997  Ford   E350                                    ac, abs, moon                      3000.00
    1999  Chevy  Venture "Extended Edition"                                                 4900.00
    1999  Chevy  Venture "Extended Edition, Very Large"                                     5000.00
    1996  Jeep   Grand Cherokee                          MUST SELL! air, moon roof, loaded  4799.00

Different types of data: Metadata

Metadata provides information about a dataset’s characteristics and structure. It is mostly machine-generated and can be appended to data.

Examples:
  XML tags providing the author and creation date of a document.
  Attributes providing the file size and resolution of a digital photograph.

Different types of data: Summary, data variety

Big Data solutions need to support multiple formats and types of data.

[Figure labels: unstructured data, semi-structured data]

References

Thomas Erl, Wajid Khattak, and Paul Buhler. Big Data Fundamentals: Concepts, Drivers & Techniques.