Introduction to Big Data Analytics and the Data Analytics Life Cycle
Big Data Analytics (Monson, 2024-2025)

Data Structure
- Unstructured: images, videos, PDFs, memos, white papers, email body text
- Semi-structured: HTML, XML, JSON, email data
- Structured: Excel files, SQL databases, point-of-sale data

Data Deluge
More data is generated than can be successfully and efficiently managed or captured. The main reasons:
- Everything is online.
- Our capacity to produce data grows faster than the infrastructure and technologies required to enable data-driven research.

Introduction: Big Data
Big Data is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value.

The three V's:
❖ Volume: huge volume of data
❖ Variety: complexity of data types and structures
❖ Velocity: speed of new data creation and growth

Why Big Data?
The more data we have for analysis, the greater the analytical accuracy and the greater the confidence in decisions based on those findings. Greater analytical accuracy, in turn, has a greater positive impact: enhanced operational efficiency, reduced cost and time, new products and services, and optimization of existing services.

Characteristics of Data
- Composition: deals with the structure of data, that is, the sources, the granularity, the types, and whether the data is static or real-time streaming.
- Condition: deals with the state of data, that is, "Can one use this data as is for analysis?" or "Does it require cleansing for further enhancement and enrichment?"
- Context: deals with where the data was generated, why it was generated, and how sensitive it is.

Definition of Big Data
Data → Information → Actionable intelligence → Better decisions → Enhanced business value
1) Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
2) Big data refers to datasets whose size is typically beyond the storage capacity of, and too complex for, traditional database software tools.
3) Big data is anything beyond the human and technical infrastructure needed to support storage, processing, and analysis.
4) It is data that is big in volume, velocity, and variety.

Evolution
The World Wide Web (WWW) and the Internet of Things (IoT) have led to an onslaught of structured, unstructured, and multimedia data, driving the evolution of big data and big data analytics.

Data Repositories
A data repository is a centralized place to hold, share, and organize data in a logical manner. Types, from an analyst's perspective:
- Spreadsheets and data marts: low-volume databases for record keeping; contain data specific to individual departments.
- Data warehouses: centralized data containers in a purpose-built space; one place for storage; require significant time for data organization.
- Analytic sandbox (testing environment): flexible, high-performance analysis; low cost but high risk.
Benefits of data repositories: an easier onboarding process, easier collaboration, and higher visibility and insight.

Data Warehouse
A central repository system for storing and analyzing structured and semi-structured data drawn, typically on a regular basis, from multiple sources, including transactional systems, relational databases, and other sources.
Key characteristics:
1. Subject-oriented
2. Integrated
3. Non-volatile
4. Time-variant
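Below is a minimal Python sketch (not part of the original slides) illustrating the structured vs. semi-structured distinction above: structured data arrives with a fixed tabular schema, while semi-structured JSON carries its own nested structure that must be flattened before tabular analysis. The file names and fields are hypothetical.

```python
# Hypothetical illustration: loading structured vs. semi-structured data.
import json
import pandas as pd

# Structured: rows and columns with a schema known up front,
# e.g. point-of-sale records exported to CSV.
sales = pd.read_csv("sales.csv")

# Semi-structured: JSON embeds its own flexible, nested structure.
with open("events.json") as f:
    events = json.load(f)               # a list of nested dicts
events_df = pd.json_normalize(events)   # flatten nesting into columns

# Unstructured data (images, video, free text) has no such schema and
# needs specialized processing before it can be tabulated at all.
print(sales.head())
print(events_df.head())
```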
Different Types of Analytics
Descriptive, diagnostic, predictive, and prescriptive.

Classification of Analytics: First School of Thought
- Basic analytics: primarily slicing and dicing of data to help with basic business insights; reporting on historical data, basic visualization, etc.
- Operationalized analytics: analytics that is woven into the enterprise's business processes.
- Advanced analytics: largely forecasting the future by way of predictive and prescriptive modelling.
- Monetized analytics: analytics used to derive direct business revenue.

Classification of Analytics: Second School of Thought
- Descriptive analytics (What happened?): mid-1990s to 2009
- Diagnostic analytics (Why did it happen?) and predictive analytics (What will happen?): roughly 2005 to 2012
- Prescriptive analytics (How can we make it happen?): 2012 to present

Data Science vs. Data Analytics
Data science looks forward:
- Nature of work: explore, discover, investigate, and visualize
- Predictive analytics: What will happen next? Prescriptive analytics: What should be done to prevent problems and to optimize?
- Output: a data product
Data analytics looks backward:
- Nature of work: report
- Descriptive analytics: What happened? Diagnostic analytics: Why did it happen?
- Output: reports and dashboards
[Figure: business value (low to high) plotted against time (past to future); value rises as work moves from backward-looking analytics to forward-looking data science.]

Examples of each type:
- Descriptive: month-over-month sales growth; total revenue per subscriber
- Diagnostic: examining market demand
- Predictive: early detection of allergic reactions; fraud detection
- Prescriptive: determining staffing needs; improving company culture; investment decisions

Traditional Business Intelligence (BI) versus Big Data
What is BI? A broad term encompassing data mining, process analysis, performance benchmarking, and descriptive analytics. Its outcome: straightforward reports, performance measures, and recent trends helpful for management decisions.
- Traditional BI: data is housed in a central server; data is analyzed in an offline mode; structured data.
- Big data: data resides in a distributed file system; real-time streaming as well as offline modes; structured, semi-structured, and unstructured data.

BI versus Data Science
- Typical techniques: BI uses standard and ad hoc reporting, dashboards, alerts, queries, and details on demand; data science (predictive analytics and data mining) uses optimization, predictive modelling, forecasting, and statistical analysis.
- Data types: BI works on structured data from traditional sources in manageable datasets; data science works on structured and unstructured data from many types of sources in very large datasets.
- Common questions: BI asks "What happened last quarter?", "How many units sold?", and "Where is the problem, and in which situation?"; data science asks "What if...?", "What's the optimal scenario for our business?", "What will happen next?", "What if these trends continue?", and "Why is this happening?"
- Analytical approach: BI is largely retrospective (past to present); data science is exploratory (present to future).

Current Analytical Architecture
1. Data is loaded into the warehouse and must be understood.
2. Due to the level of control on the EDW (enterprise data warehouse, on a server or in the cloud), local systems emerge as departmental warehouses.
3. Data is read by additional applications.
4. Analysts get data from the server.
5. Analysts create data extracts from the EDW to analyze data offline.
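To make the descriptive/predictive/prescriptive distinction above concrete, here is a small Python sketch (not from the slides) on a made-up monthly sales series: a descriptive summary of what happened, followed by a naive trend forecast of what may happen next.

```python
# Hypothetical illustration: descriptive vs. predictive analytics.
import numpy as np

sales = np.array([100.0, 104.0, 110.0, 117.0, 121.0, 130.0])  # 6 months

# Descriptive: what happened? Summarize the past.
mom_growth = np.diff(sales) / sales[:-1] * 100
print(f"Average month-over-month growth: {mom_growth.mean():.1f}%")

# Predictive: what will happen next? Fit a trend line and extrapolate.
months = np.arange(len(sales))
slope, intercept = np.polyfit(months, sales, deg=1)  # least-squares fit
print(f"Naive forecast for month 7: {slope * len(sales) + intercept:.1f}")

# Prescriptive analytics would go one step further and recommend an
# action (e.g. stock or staffing levels) given the forecast.
```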
Drivers of Big Data: Emerging Ecosystem and New Approaches
- Data devices and the "sensornet"
- Data collectors
- Data aggregators
- Data users and buyers

Key Roles of the New Big Data Ecosystem
- Deep analytical talent: advanced training in quantitative disciplines, e.g., mathematics, statistics, and machine learning.
- Data-savvy (intelligent, knowledgeable) professionals: less technical depth.
- Technology and data enablers: support people (database administrators, programmers, etc.) who provide the technical expertise to support analytical projects, provision and administer analytic sandboxes, and manage large-scale data architectures.

Challenges of Big Data
- Scale: storage is one major concern with both RDBMS and NoSQL big data platforms.
- Security: lack of proper authentication and authorization mechanisms, especially on NoSQL big data platforms.
- Schema: rigid schemas have no place.
- Continuous availability: how to provide 24/7 support, given that almost all RDBMS and NoSQL big data platforms have a certain amount of downtime built in?
- Consistency: should one opt for strict consistency or eventual consistency?
- Partition tolerance: how to build partition-tolerant systems that can handle both hardware and software failures?
- Data quality: how to maintain data accuracy, completeness, timeliness, etc.? Do we have appropriate metadata in place? (A sketch of basic data-quality checks appears after these challenge lists.)

Challenges with Big Data
- Data volume: data today grows at an exponential rate, and this high tide of data will keep rising. The key questions are: "Will all this data be useful for analysis?", "Do we work with all of it or with a subset?", and "How do we separate the knowledge from the noise?"
- Storage: cloud computing is the answer to managing big data infrastructure as far as cost-efficiency, elasticity, and easy upgrading/downgrading are concerned, but it complicates the decision of whether to host big data solutions outside the enterprise.
- Data retention: how long should one retain this data? Some data may be required for long-term decisions, while other data quickly becomes irrelevant and obsolete.
- Skilled professionals: to develop, manage, and run the applications that generate insights, organizations need professionals with a high level of proficiency in data science.
- Other challenges: the capture, storage, search, analysis, transfer, and security of big data.
- Visualization: big data refers to datasets whose size is typically beyond the storage capacity of traditional database software tools, and there is no explicit definition of how big a dataset must be to count as big data. Data visualization (computer graphics) is becoming popular as a separate discipline, and there are very few data visualization experts.
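As referenced under the data quality challenge above, here is a minimal Python sketch (not from the slides) of basic checks for completeness, uniqueness, validity, and timeliness. The DataFrame and its columns are hypothetical.

```python
# Hypothetical illustration: basic data-quality checks with pandas.
import pandas as pd

def data_quality_report(df: pd.DataFrame) -> None:
    # Completeness: fraction of missing values per column.
    print("Missing values per column:\n", df.isna().mean())
    # Uniqueness: duplicated records skew counts and aggregates.
    print("Duplicate rows:", df.duplicated().sum())
    # Validity: a domain rule, e.g. amounts must be positive.
    print("Non-positive amounts:", (df["amount"] <= 0).sum())
    # Timeliness: how recent is the newest record?
    print("Latest record:", pd.to_datetime(df["timestamp"]).max())

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [25.0, -5.0, -5.0, None],
    "timestamp": ["2024-01-05", "2024-01-06", "2024-01-06", "2024-01-09"],
})
data_quality_report(df)
```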
Big Data Technologies
✣ Cheap and ample storage
✣ Faster processors to help with quicker processing of big data
✣ Affordable, open-source, distributed big data platforms, such as Hadoop
✣ Parallel processing, clustering, virtualization, large grid environments, high connectivity, and high throughput (throughput: the rate at which something is processed)
✣ Cloud computing and other flexible resource-allocation arrangements

Activities and Profile of a Data Scientist
Main activities:
❑ Reframe business challenges as analytics challenges
❑ Design, implement, and deploy statistical models and data mining techniques on big data
❑ Develop insights that lead to actionable recommendations
Five main sets of skills and behavioral characteristics:
- Quantitative skill
- Technical aptitude
- Skeptical mindset and critical thinking
- Curious and creative
- Communicative and collaborative

Introduction to Big Data Analytics
Technology-enabled analytics: gaining a meaningful, deeper, and richer insight into your business to steer it in the right direction. It requires a tight handshake between three communities: IT, business users, and data scientists.

Importance of big data: the various approaches to the analysis of data and what each leads to:
✤ Reactive - Business Intelligence
✤ Reactive - Big Data Analytics
✤ Proactive - Analytics
✤ Proactive - Big Data Analytics

BIG DATA ANALYTICS LIFE CYCLE
1. Discovery
2. Data preparation
3. Model planning
4. Model building
5. Communicate results
6. Operationalize

1. Discovery
1. Learning the business domain
2. Resources
3. Framing the problem
4. Identifying key stakeholders
5. Interviewing the analytics sponsor
6. Developing initial hypotheses
7. Identifying potential data sources

Preparing the Analytic Sandbox or the Workspace
The team can explore data without interfering with live production databases. The team needs to collect all kinds of data: summary-level aggregated data, structured data, raw data feeds, and unstructured text data.

2. Performing ETLT: Getting the Most out of Data Integration
ETLT integrates ETL (extract, transform, load) and ELT (extract, load, transform). It is a "best of both worlds" approach that speeds up data ingestion while ensuring data quality and security.
- ETL: extract raw data into a staging area; transform and aggregate the data (e.g., SORT, JOIN); load the data into the warehouse.
- ELT: extract raw, unprepared data from source applications and databases; load the unprepared data into the warehouse; use a high-performance, cloud-based data warehouse to process the transformations.
- ETLT: extract the raw data; apply a first transformation to one data source at a time; load the prepared data into the data warehouse; then transform and integrate the multiple data sources there.

2. Data Conditioning and Data Cleansing
Highlight gaps within the dataset; identify and gather data from outside the organization through open APIs, data sharing, and purchasing.
- Data conditioning: optimizes the movement and management of data to protect it and increase its productivity; information moves through the I/O path; complements the data storage functionality.
- Data cleansing: a technique that searches for and corrects records that are inaccurate or corrupt. Commonly used techniques: data standardization and data enhancement.

2. Survey and Visualize
"Overview first, zoom and filter, then details-on-demand." Surveying and visualizing help answer the following about your data (a short sketch follows this list):
1. Does the distribution stay consistent over time?
2. What are the granularity of the data, the range of values, and the level of aggregation?
3. Does the data represent the population of interest?
4. For time-related variables, is the chosen timestamp apt for the problem?
5. Is the data standardized/normalized?
6. Are scales consistent?
7. For geospatial datasets, are state or country abbreviations consistent across the data? Are personal names normalized? English units? Metric units?
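Here is a minimal Python sketch (not from the slides) of the "overview first, zoom and filter, then details-on-demand" survey step. The file name and column names are hypothetical.

```python
# Hypothetical illustration: surveying a dataset before modeling.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("transactions.csv", parse_dates=["date"])

# Overview first: range of values, granularity, level of aggregation.
print(df.describe(include="all"))
print("Date range:", df["date"].min(), "to", df["date"].max())

# Does the distribution stay consistent over time? Compare monthly means.
print(df.groupby(df["date"].dt.to_period("M"))["amount"].mean())

# Zoom and filter: inspect one variable's distribution for outliers.
df["amount"].plot(kind="hist", bins=50, title="Transaction amounts")
plt.show()
```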
2. Common Tools for the Data Preparation Phase
- Hadoop: performs massively parallel data ingestion and custom analysis, e.g., for web traffic parsing, GPS location analytics, genomic analysis, and combining massive unstructured data feeds from multiple sources.
- Alpine Miner: provides a graphical user interface (GUI) for creating analytic workflows, including data manipulations.
- OpenRefine (formerly Google Refine): a free, open-source, GUI-based tool for performing data transformations; popular for data wrangling.
- Data Wrangler: an interactive tool for data cleaning and transformation.

Data Exploration and Variable Selection
Why is it important?
- Patterns and trends: are there recurring themes or relationships between different data points?
- Anomalies: do any data points fall outside the expected range, potentially indicating errors or outliers?
- Revealing latent insights
- Foundation for advanced analysis and modeling
- Adaptability and innovation
- Risk mitigation and compliance
How does it work?
- Exploratory data analysis (EDA): univariate and bivariate analysis; box plots, scatter plots, histograms, distribution plots, and correlation matrices.
- Feature engineering: enhancing prediction models by introducing or modifying features; data normalization, scaling, encoding, and creating new variables.
- Model building and validation: cross-validation methods are used.

3. Model Selection
The phase in which the team determines the methods, techniques, and workflow it intends to follow for the subsequent model building phase, looking back at the hypotheses. Points to consider (a model-comparison sketch appears after the tools list below):
- Assess the structure of the datasets.
- The analytical techniques should meet the business objectives and accept or reject the working hypotheses.
- Determine the requirement: a single model or a model ensemble.

Research on model planning in industry verticals:
- Consumer packaged goods: multiple linear regression, automatic relevance determination (ARD), random forest, and decision trees
- Retail banking: multiple regression
- Retail business: logistic regression, ARD, and decision trees
- Wireless telecom: neural networks, decision trees, hierarchical neuro-fuzzy systems, rule evolvers, and logistic regression

3. Common Tools for the Model Planning Phase
- R: a programming environment and language for data manipulation; builds interpretive models with high-quality code; executes statistical tests and analyses against big data; creates quality plots.
- SAS: easy to manage, access, and observe data from various sources; organizes modules for web, social media, and marketing analytics; builds prediction models for customer behavior and communications.
- SQL: performs in-database analytics of common data mining functions, involved aggregations, and basic predictive models.
- RapidMiner: provides advanced analytics such as data mining and machine learning without any programming; can generate analytics based on real-life data.
- Tableau Public: connects to any data source, corporate or web-based; data can be shared through social media.
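To illustrate the model planning decisions above (which technique, single model or ensemble), here is a minimal Python sketch (not from the slides) that compares a few candidate techniques with cross-validation on a toy scikit-learn dataset before committing to one for the model building phase.

```python
# Hypothetical illustration: comparing candidate modeling techniques.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic regression": LogisticRegression(max_iter=5000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest (an ensemble)": RandomForestClassifier(random_state=0),
}

# Score every candidate the same way; the winner moves on to building.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy {scores.mean():.3f}")
```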
4. Why Is Model Building Important?
It extracts insights and knowledge from the data to drive business decisions and strategies. The data is divided into:
1. Training data (70-80%)
2. Testing data (20-30%)
or, when hyperparameter tuning is an important step during model training:
1. Training data (60-80%)
2. Validation data (10-20%)
3. Testing data (10-20%)
The focus is on creating a model that captures the underlying patterns and relationships in the data rather than one that simply memorizes the training data. (A split-and-evaluate sketch appears at the end of these notes.)

Tools for Model Building
- Commercial tools: SAS Enterprise Miner, SPSS Modeler, MATLAB, Alpine Miner, STATISTICA, Mathematica
- Open-source tools: R, Python, SQL, WEKA

5. Communicate Results
- Is the model robust enough? Is it a success or a failure?
- Compare the outcomes of the modeling to the criteria established for success and failure.
- Articulate the findings and outcomes to the various team members and stakeholders, taking warnings and assumptions into account.
- Identify the key findings, quantify the business value, and develop a narrative to summarize and convey the results to stakeholders.

6. Operationalize
1. Communicate the benefits of the project more broadly and set up a pilot project.
2. Enable the team to learn about the model's performance on a small scale.
3. Make adjustments easily before full deployment.
4. The team delivers the final reports, briefings, and code.
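As referenced in the model building section, here is a minimal Python sketch (not from the slides) of the train/validation/test workflow, following the 60/20/20 variant from the notes, on a synthetic scikit-learn dataset.

```python
# Hypothetical illustration: 60/20/20 train/validation/test workflow.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First hold out 20% as the final test set...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)
# ...then split the remainder into 60% train / 20% validation overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)  # 0.25 * 0.80 = 0.20

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Validation accuracy:", model.score(X_val, y_val))  # tune here
print("Test accuracy:", model.score(X_test, y_test))      # report once
```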