Introduction to Big Data Analytics
Prince Sattam Bin Abdulaziz University
Dr. Sana Fakhfakh
Summary
This document provides an introduction to big data analytics, covering topics such as the generation and growth of big data, the importance of big data analytics, and the various types of data involved. It also touches on the industries benefiting from data analysis.
Full Transcript
BIG DATA ANALYTICS (CS 0654)
Master of Data Science
Introduction to Big Data Analytics
By Dr. Sana Fakhfakh
Department of Information Systems
Prince Sattam Bin Abdulaziz University, Al-kharj, Riyadh, KSA

Introduction: Big Data Analytics
- Big Data Generation and Growth
- What is Big Data
- Importance of Big Data Analytics
- Industries Benefiting from Data Analytics
- Sources of Data (people, machines, organizations)
- Aspects of Bigness (the 5 V’s of big data)
- Types of Data (table, text, multimedia, stream, sequence, graphs)
- The Analytics Process (preprocessing, analytics, visualization)

Big Data Generation and Growth
- Data has been generated at an exploding rate in recent years.
- Organizations collect trillions of bytes of information about their customers, suppliers, and operations every day.
- Large pools of data are being captured, communicated, aggregated, stored, and analyzed by businesses, academia, and governments.
- Individuals with smartphones on social network sites are continuously fueling the exponential growth of multimedia data.

Big Data Generation and Growth
Source: expandedramblings.com

Big Data Generation and Growth: Where does data come from?
- Internet users generate about 2.5 quintillion bytes of data each day.
- In 2018, internet users spent 2.8 million years online.
- Social media accounts for 33% of the total time spent online.
- In 2019, there were 2.3 billion active Facebook users.
- Twitter users send nearly half a million tweets every minute.
- By 2020, every person will generate 1.7 megabytes in just a second.
- By 2020, there will be 40 trillion gigabytes of data (40 zettabytes).
- 90% of all data has been created in the last two years.

What is Big Data
- “Big data”: datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.
- As technology advances over time, the size of datasets that qualify as big data will also increase.
- The definition varies by sector, depending on the kinds of available software tools and the sizes of datasets in a particular industry.
- With those caveats, big data in many sectors today will range from a few dozen terabytes to multiple petabytes (thousands of terabytes).

Data Analytics
- Data: Set of values of qualitative or quantitative variables
- Information: Meaningful or organized data
- Data Analytics: The process of examining data in order to draw and communicate useful conclusions about the information it contains
Source: https://enablecomp.com/

Data Analytics: Then and Now
- Data analytics has been around for years.
- Even in the 1950s, businesses were using basic analytics (manual examination) on data (essentially numbers in a spreadsheet) to uncover insights and trends.
- New tools and technologies bring speed and efficiency to these techniques.
- Today, businesses analyze data and can identify insights for immediate decisions.
- The ability to work faster and stay agile gives organizations a competitive edge they did not have before.

Why is Big Data Analytics Important
Organizations analyze data
- to identify new opportunities
- to gain insights that lead to smarter business decisions
- to identify methods for more efficient operations
- to achieve larger revenues and higher profits
- to keep customers satisfied
Top three areas in which businesses got the most value
- Cost reduction
- Faster, better decision making
- New products and services

What Enterprises Use Big Data Analytics For
- Competitor Analysis: Online traffic to websites and related social media
- Market Analysis: Trends and market segment analysis
- Productivity Enhancement: Analyze employee tracking data
- Cost Cutting: Reduce energy bills, optimize routes, predict demands, process efficiency and automation
- Targeted Marketing: Analyze purchasing history and target the right people for a product
- Improved Customer Relations: Analyze customer feedback and make adjustments

Industries Benefiting from Big Data Analytics
- Retail: Advertising, targeted marketing, recommendation systems, customer loyalty, inventory management, demand prediction
- Banking and Financial: Customer loyalty and churn, fraud detection, risk assessment
- Brands: 66% of brands use data analytics for product and service launches and to choose appropriate timings
- Logistics and Transportation: Fleet management, maintenance needs, driver risk assessment, real-time tracking
- Health Care: Efficiency in healthcare operations, predictive analytics, outbreak prediction, immunization strategy
- Government & Utility Companies: Surveys & census, development planning, health, education, energy supply & demand management

Big Data Analytics - Market

Sources of Big Data

Sources: Machine Generated Data
- Biggest source of big data
- Temperature sensors, GPS navigators, satellite imagery, apps, an increasing number of smart devices, IoT
- A 12-hour flight produces 84 TB of data: sensors, temperature, pressure, accelerometer, turbulence
- Smart City, Smart Transportation
- Think about the volume of video data collected at the Lahore Safe City Authority Control Room
- Generally, such data is unstructured

Sources: People Generated Data
- Blogs, social network posts, keyword searches, photo sharing, pictures, emails, ratings and reviews
- Daily Facebook data (30+ PB) exceeds all US academic libraries (2 PB)
- Companies use 12 PB/day of Twitter data for sentiment analysis around their products
- Typically unstructured, or at best semi-structured (for example, emails, where the header has some structure), except in a few cases such as filling in a survey form
- Generally more text: 500 million tweets per day

Sources: Organization Generated Data
- LUMS students data, TCS shipment tracking data
- Government open data, stock records, banks, e-commerce, medical records
- Optimized routes and scheduling can save 50m by reducing each driver's route by one mile
- Combine Walmart sales data with Twitter sentiment analyses or events to launch a new product
- Estimate demands
- Fraud detection
- Highly structured data

Categories of Data

Aspects of Big: The 5 V’s
1. Volume
2. Velocity
3. Variety
4. Veracity
5. Value

Aspects of Big: The 5 V’s – Volume
- Volume: size, scale, dimensionality
- Challenges: Acquisition, storage, retrieval, processing time
- High-dimensional data carries more information, which is a blessing
- It is also a big curse; dealing with large dimensions is a core topic in this course

Aspects of Big: The 5 V’s – Velocity
- Velocity: Speed of data is very high
- Number of emails, Twitter messages, photos, videos, etc. per second
- Late decisions imply missed opportunities
- Real-time processing vs batch processing (end of the day)

Aspects of Big: The 5 V’s – Variety
- Variety: Structural variety, different formats, models
- Medium variety: audio, text, video, DBMS, files, traffic logs, XML, code
- Online vs offline, real-time vs intermittent data (another way data varies)
- Challenges: requirements of analytics, semantics, how to interpret
Source: https://openautomationsoftware.com/

Aspects of Big: The 5 V’s – Veracity
- Veracity: Quality of data
- Data could have many issues (biases, anomalies, inconsistent measurements and units, incomplete and duplicate records)
- Volatility in data: updated/outdated, changing trends/sentiments
- Trustworthiness and reliability of sources and generation/processing
- Fake news, rumours, fake likes, fake followers
Source: https://datafloq.com/

Aspects of Big: The 5 V’s – Value
- Value: Data can be turned into big value
- Data that has no value is of no use to the company
- Should be able to meet strategic objectives
- Should amplify other technology innovations

Types of Data
- Relational Data
- Text Data
- Multimedia Data
- Time Series Data
- Sequential Data
- Streams
- Graphs and Homogeneous Networks
- Graphs and Heterogeneous Networks

Types of Data: Text
- Blogs, webpages, tweets, documents, emails
- High dimensionality, vocabulary, information retrieval, natural language processing
- The latest search engine for Walmart.com uses text analysis, machine learning and even synonym mining to produce relevant search results.
- Wal-Mart says adding semantic search has increased the share of online shoppers completing a purchase by 10% to 15%. “In Wal-Mart terms, that is billions of dollars.”

Types of Data: Multimedia
- Image, audio, video
- A ‘fast food and video’ company is training cameras on drive-through lanes to determine what to display on its digital menu board. When the lines are longer, the menu features products that can be served up quickly; when the lines are shorter, the menu features higher-margin items that take longer to prepare.

Types of Data: Time Series
- Sequence of data points at equally spaced time intervals
- Sensor data, stock market data, Forex rates, temporal tracking (GPS), smart meter data (AMI)
- Understand the underlying forces and structure of the observed data and fit a model to forecast, monitor or control
- Economic forecasting, sales forecasting, stock market analysis, yield projections, process and quality control, inventory studies, workload projections, census analysis
Figure: market momentum. Source: "Application of Time Series Analysis in Financial Economics" by @Statswork, https://link.medium.com/n3FJPzhIadb

Types of Data: Sequential Data
- Bio-sequences
- Discretized music and audio data
- Text
Source: Sijo Asokan (slideshare.net)

Types of Data: Streams
- Real-time data
- Single-pass algorithms/online algorithms
- Irreversible decisions
- Small-memory algorithms

Types of Data: Graphs/Homogeneous Networks
- G = (V, E): data items represented as graphs
- Could have similarity on edges
- Could have weights on vertices, edges or both
- Facebook, webgraph, Twitter, co-authorship graphs (bibliometric), citation networks

Types of Data: Heterogeneous Networks
- Nodes represent different entities
- Authors and conferences

The Analytics Process
Business Objective
- Why are we seeking data analytics in the first place?
- How can we reduce production costs without sacrificing quality?
- What are some ways to increase sales with our current resources?
- Do customers view our brand in a favorable way?
Data Collection
- What data is needed and available?
- Identify sources of data and relevance of data
- Are there enough instances, and are all relevant features there?
- Identify datasets, acquire and retrieve
- Sources: RDBMS, .txt, webservices (soup), RSS, tweets
- Experiments, synthetic data generation, surveys

The Analytics Process
Data Preparation
- Make the data ready for analytics
- Exploratory Data Analysis: Describe, summarize, visualize
- Pre-process: Improve data quality, clean data, transformation, standardization, normalization
Data Analysis
- Apply analytical techniques
- Supervised and unsupervised learning, graph analytics
Report and Deployment
- Communicate results and findings, and apply conclusions to gain benefit

Data Analytics Tasks and Methods
Data Analytics is the process
- to discover patterns in data
- to find relationships in data
- to (automatically) extract knowledge from data
- to summarize data in ways that are understandable and useful
Discovering knowledge from data often requires learning

Data Analytics Tasks and Methods
Descriptive Analytics
- Uncover patterns, correlations, trends & trajectories describing data
- Explanatory in nature
- Requires post-processing to validate and explain the results
- Clustering/grouping the data or detecting outliers (anomalies) in data
Predictive Analytics
- Predict the value of an attribute based on the values of other attributes
- Predicted attribute: Target/dependent/response variable
- Attributes used to predict: Predictor/explanatory/independent variables
- Classification: nominal target attribute (class labels)
- Regression: numeric target attribute

Data Analytics Tasks
- Clustering: Partition data into meaningful groups
- Outlier Detection: Detect points that are unusual (unlike others)
- Classification: Assign (predefined) class labels to each object
- Regression: Find a function that models a (continuous) target variable
- Association Analysis: Find patterns in data that describe relationships
- Recommendation: Predict an unknown rating based on known ratings
- Community Detection: Find (overlapping) communities of nodes in networks
- Centrality and Important Nodes: Find important (or evaluate importance of) nodes in networks

Machine Learning for Data Analytics
Supervised Learning
- For some data items the correct results (values of the target variable) are given (ground truth)
- We want to learn a model that generalizes, i.e. the model is able to perform accurately on new/unseen/unlabeled data items
- Classification, where the target is a categorical attribute
- Regression, where the target is a continuous attribute
Figure labels: Training Data, Known Labels, Model, Test Data, Predict Target Variable Values

Machine Learning for Data Analytics
Figure panels: Binary Classification, Multi-Class Classification, Regression (axes x1, x2)

Machine Learning for Data Analytics
Unsupervised Learning
- No correct output is provided
- Learning and analytics are done using statistical properties of data
- Clustering
- Outlier detection
- Modeling the density of data
- Dimensionality reduction
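
The difference between supervised and unsupervised learning can be illustrated with a small sketch. The Python code below is not taken from the slides: the toy points, the labels "low" and "high", and all function names are hypothetical, and the two methods (a nearest-centroid classifier and a basic k-means loop with naive initialization) are just one simple choice for each setting. The supervised part learns from training data with known labels and predicts labels for unseen test data; the unsupervised part groups the same points without looking at any labels.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def fit_nearest_centroid(train_x, train_y):
    """Supervised: learn one centroid per class from labeled training data."""
    groups = {}
    for x, y in zip(train_x, train_y):
        groups.setdefault(y, []).append(x)
    return {label: centroid(pts) for label, pts in groups.items()}

def predict(model, x):
    """Assign x the label of the closest class centroid."""
    return min(model, key=lambda label: math.dist(model[label], x))

def kmeans(points, k, iterations=10):
    """Unsupervised: group points into k clusters without using any labels."""
    centers = list(points[:k])  # naive initialization: the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(centers[j], p))
            clusters[nearest].append(p)
        # Recompute each center; keep the old one if a cluster is empty.
        centers = [centroid(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical toy data: two well-separated groups in two dimensions.
train_x = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
train_y = ["low", "low", "high", "high"]   # known labels (ground truth)
test_x = [(2.0, 1.5), (8.5, 9.0)]          # unseen, unlabeled items

# Supervised learning: fit on labeled data, predict on test data.
model = fit_nearest_centroid(train_x, train_y)
predictions = [predict(model, x) for x in test_x]
print("Predicted labels:", predictions)

# Unsupervised learning: cluster all points, ignoring the labels.
centers, clusters = kmeans(train_x + test_x, k=2)
print("Cluster sizes:", [len(c) for c in clusters])
```

On real data one would normally use a library implementation and evaluate the supervised model on a held-out test set with known labels; the sketch only shows where labels enter the process and where they do not.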