Big Data Analytics
Summary
This presentation provides an introduction to big data analytics. It covers data format, problem types (classification, regression, clustering), and various models. The presentation also discusses data preparation and evaluation measures.
Full Transcript
Introduction to Big Data Analytics

Agenda: Students Will Be Able To:
1. Identify the purpose of Big Data Analytics
2. Discuss several analytical models
3. Discuss major considerations in model performance
4. Evaluate model performance

Big Data Analytics

BRAINSTORM: Based on exposure to literature, conversations with clients, and professional publications: what is big data analytics?

Big Data Analytics – Several Definitions
◦ To extract insight or information from data
◦ To discover interesting relationships within data
◦ “Extracting useful knowledge from data to solve business problems” (Provost & Fawcett, “Data Science for Business”)

Data Format

Name             Age   Occupation   Purchase?   Total Spent
Jane Smith       38    Designer     Yes         40,000
James Doe        19    Student      No          N/A
Sally Q. Public  50    CEO          Yes         9,000

Rows are also known as records or observations. Columns are also known as fields, attributes, or dimensions. Name, Age, and Occupation are input fields; Purchase? and Total Spent are labels/output fields, which can be binary, categorical, or numeric.

NOTE: This format expects inputs and outputs to be at the same level (i.e., for an individual, for a visit to the store, for a webpage view, etc.).

Data Format in Different Disciplines

BRAINSTORM: How does big data format differ in different applications? Does it pose any unique challenges?

Data Preparation
1. Data preparation is an inseparable task in any analytics problem.
2. Data can be noisy, incomplete, inconsistent, skewed, sparse, and more.
3. As a result, data may need to be scaled, replaced, transformed, compressed, binned, discretized, and more.

Problem Types and Paradigms
◦ SUPERVISED: classification, regression. Supervised methods assume labels/outputs.
◦ UNSUPERVISED: clustering, visualization. Unsupervised methods lack labels/outputs.

Problem Types: Classification

Given a set of inputs, predict the most likely value for a categorical or binary output.

Example 1:
Given data: every new customer must record their name, age, and occupation.
  Jimmy G. Generic, 10, Student
Desired outcome: will this new customer make a purchase?

Other examples (can you specify the given data and the desired outcome?):
1. Given the dollar amount, GPS location of the store, category of the retailer, and date and time of a purchase – is it fraudulent?
2. Given the pulse, respiration, and blood pressure – will the patient have a complication?

Problem Types: Regression

Given a set of inputs, predict the most likely value for a real-valued output.

Example 1:
Given data: every new customer must record their name, age, and occupation.
  Jimmy G. Generic, 10, Student
Desired outcome: IF this customer makes a purchase, how much are they likely to spend?

Can you think of other real-life examples?

Problem Types: Clustering

Given a set of inputs, with or without outputs, assign a class label to each record based upon similarity to other records.

[Figure: salespeople plotted by days as a salesperson, average $ value, and sales/day, falling into visible groups]

Can you think of some real-life examples?

Problem Types and Business

BRAINSTORM: Now that we’ve introduced the major problem types in data mining, form groups and create a list of at least five business use cases for each of the problem types.

Models

Let’s have a look at some models…
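Before turning to specific models, here is a minimal end-to-end sketch of the classification example above, assuming pandas and scikit-learn. The column names and toy values mirror the customer table; the choice of a decision tree (covered below) and everything else not in the slides is illustrative.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Training data: inputs (name, age, occupation) and a binary label (purchase).
data = pd.DataFrame({
    "name": ["Jane Smith", "James Doe", "Sally Q. Public"],
    "age": [38, 19, 50],
    "occupation": ["Designer", "Student", "CEO"],
    "purchase": ["Yes", "No", "Yes"],
})

# Data preparation: the name is an identifier, not a predictive input, so it
# is dropped; the categorical occupation is one-hot encoded into numeric columns.
X = pd.get_dummies(data[["age", "occupation"]], columns=["occupation"])
y = data["purchase"]

# Learning: fit the classifier. Inference: predict for the new customer
# from the slide (Jimmy G. Generic, 10, Student).
model = DecisionTreeClassifier().fit(X, y)
new = pd.get_dummies(
    pd.DataFrame({"age": [10], "occupation": ["Student"]}),
    columns=["occupation"],
).reindex(columns=X.columns, fill_value=0)
print(model.predict(new))  # most likely value of the binary output
```

The same pipeline with a regressor and Total Spent as the output column would address the regression variant of this example.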
The Basic Supervised Learning Framework

y = f(x): x is the input, f is the prediction function, and y is the output (e.g., a classification).

Learning: given a training set of labeled examples {(x1, y1), …, (xN, yN)}, estimate the parameters of the prediction function f.
Inference: apply f to a never-before-seen test example x and output the predicted value y = f(x).

Learning and Inference Pipeline

Learning: training samples + training labels → features → training → learned model.
Inference: test sample → features → learned model → prediction.

K-Nearest Neighbors
1. Non-parametric approach
   ◦ (Does not actually extract model parameters; relies on having all examples in memory)
2. Relies on a distance function between each point and the others
3. Each nearby point “votes” on the class label

http://mines.humanoriented.com/classes/2010/fall/csci568/portfolio_exports/lguo/image/knn/knn.jpg

K-Nearest Neighbors, Continued
1. Parameters can be tuned:
   ◦ K (the number of neighbors that vote)
   ◦ The distance metric
2. What about 1 nearest neighbor?
3. What about 100 nearest neighbors?

Decision Tree Models
1. Recursive algorithm approach
2. Uses some measure of purity
   ◦ Given a certain split on an input, how “pure” is the output?
   ◦ Information Gain, based on Entropy, is a popular measure

Decision Trees, Continued
1. Test the Information Gain of the output variable given a split on each of the input variables.
2. First split: the one with maximum Information Gain.
3. Recursively repeat for each of the nodes, which correspond to split points of the original input values.
4. Stop when the leaf nodes have reached an adequate level of “purity”.
5. Handle numeric fields with cutpoints (beyond the scope of this session).

Why Decision Trees?
1. Many modern implementations, including ID3 and C4.5/C5.0
2. Intuitive, interpretable models
3. Fast, scalable
4. Can handle many dimensions, and mixed categorical and numeric inputs
5. Can handle multiple-valued outputs, not just binary
6. Produces an explicit model (unlike instance-based k-NN): takes longer to train, but scores very quickly
7. Sometimes called CART – Classification and Regression Trees; can be used for regression, beyond the scope of this session

Unsupervised Learning

Goal: segment data into meaningful segments; detect patterns. There is no target (outcome) variable to predict or classify.
Methods: clustering, association rules, etc.

Association Rules

Goal: produce rules that define “what goes with what”.
Example: “If X was purchased, Y was also purchased.”
Rows are transactions. Used in recommender systems – “Our records show you bought X; you may also like Y.” Also called “affinity analysis”.

Unsupervised Learning: Clustering
◦ Discover groups of “similar” data points

K-Means Clustering
1. K is a given – the number of clusters.
2. Place K cluster centers randomly throughout the space, assigning all points to one of the K clusters.
3. Calculate the centroid of the points assigned to each cluster (in all dimensions) and move each cluster center to this new point.
4. Calculate the distance of all points to the new centroids and reassign each point to its nearest center.
5. Repeat steps 3 and 4 until convergence (cluster centroids move very little from one iteration to the next).
A from-scratch sketch of this loop follows the example below.

K-Means Clustering Example

Notice that the clusters are extremely imbalanced in iteration 1. Imagine the green points on the bottom left pulling that centroid down and left; likewise with blue.
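The K-Means loop above is simple enough to write out directly. Here is a from-scratch sketch assuming only NumPy; the function and variable names are illustrative, and a real project would typically use a library implementation such as sklearn.cluster.KMeans.

```python
import numpy as np

def k_means(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: place K cluster centers "randomly" (here: at K distinct data points).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest center (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the centroid of its assigned points
        # (a center that lost all its points simply stays put).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 5: converged when the centroids barely move between iterations.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers  # Step 4 repeats against these new centers.
    return labels, centers

# Toy usage: two well-separated 2-D blobs, as in the slide's scatter plots.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centers = k_means(pts, k=2)
```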
Evaluation Measures

How can we evaluate the goodness of a model?

Model Performance Considerations

PROBLEM SCENARIO: We have clickstream data on how long a customer has viewed an item (total) and how many times they’ve clicked it. We pull 10 records from the database and build a model. (Which type of model? Which type of problem?)

[Figure: scatter plot of # of times clicked vs. total time spent viewing; X = purchase, O = nonpurchase. A single cutpoint q on viewing time separates all 10 records.]

SUCCESS! IF Total_Time >= q THEN Purchase ELSE Nonpurchase

Model Performance Considerations, Continued

We deploy the system, and 10 new customers arrive. Suddenly our classifier looks much worse. With more rules, using both dimensions, we can still have reasonable accuracy, but we can’t have a perfect classifier using these two dimensions… Can we imagine a…

Model Performance Considerations, Continued

http://datascience.stackexchange.com/questions/361/when-is-a-model-underfitted

Which of these did we display in the original model for our clickstream customers? What if we kept adding rules, and kept adding dimensions?

Hold-out Evaluation

Holdout method:
◦ The given data is randomly partitioned into two independent sets
◦ A training set (e.g., 2/3) for model construction
◦ A test set (e.g., 1/3) for accuracy estimation
◦ Random sampling: a variation of holdout – repeat holdout k times; accuracy = average of the accuracies obtained

What is wrong with hold-out methods? What if your data is imbalanced?

Testing a Model

Cross-validation lets us choose random subsets of the data to hold out from the model-building stage. 10-fold cross-validation is the most popular.

https://chrisjmccormick.files.wordpress.com/2013/07/10_fold_cv.png

10-Fold Cross-Validation Utility
1. Choose between different modeling algorithms
2. Tune parameters of the chosen algorithm to reduce bias or variance
3. Create an estimate of model performance

Conclusion

Topics introduced:
1. Definition of data mining
2. Data format – inputs, outputs, dimensions
3. Problem types and paradigms
4. Model performance considerations; measuring and comparing models
5. K-Nearest Neighbors: non-parametric classification
6. Decision tree classification
7. K-Means: unsupervised clustering
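To make item 4 above concrete, here is a minimal 10-fold cross-validation sketch, assuming scikit-learn. The two features echo the clickstream scenario (total viewing time and click count); the data is synthetic and every name below is illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the clickstream records: [total_time_viewing, n_clicks]
# per customer, with a purchase / non-purchase label.
rng = np.random.default_rng(0)
X = rng.random((100, 2))
y = (X[:, 0] + 0.1 * rng.standard_normal(100) > 0.5).astype(int)

# Stratified folds keep the purchase/non-purchase balance in every fold,
# which addresses the imbalanced-data concern raised against plain hold-out.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=cv)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Comparing these scores across models or parameter settings is exactly the “choose between algorithms” and “tune parameters” utility listed above.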