Module 4 - Advanced Analytics - Theory and Methods PDF
Summary
This document is a module on advanced analytics, focusing on linear and logistic regression models. It covers topics like general descriptions, technical explanations, use cases, and diagnostics for validating these models.
Full Transcript
Module 4 – Advanced Analytics - Theory and Methods
EMC2 PROVEN PROFESSIONAL. Copyright © 2012 EMC Corporation. All Rights Reserved.

Module 4: Advanced Analytics – Theory and Methods
Lesson 3: Linear Regression

During this lesson the following topics are covered:
- General description of regression models
- Technical description of a linear regression model
- Common use cases for the linear regression model
- Interpretation and scoring with the linear regression model
- Diagnostics for validating the linear regression model
- The Reasons to Choose (+) and Cautions (-) of the linear regression model

Regression

Regression focuses on the relationship between an outcome and its input variables.
- In other words, we don't just predict the outcome, we also have a sense of how changes in individual drivers affect the outcome.
The outcome can be continuous or discrete.
- When it's discrete, we are predicting the probability that the outcome will occur.
Example Questions:
- I want to predict the lifetime value (LTV) of this customer (and understand what drives LTV).
- I want to predict the probability that this loan will default (and understand what drives default).
Our examples: Linear Regression, Logistic Regression

Linear Regression - What is it?

Used to estimate a continuous value as a linear (additive) function of other variables
- Income as a function of years of education, age, gender
- House price as a function of median home price in neighborhood, square footage, number of bedrooms/bathrooms
- Neighborhood house sales in the past year based on unemployment, stock price, etc.
Input variables can be continuous or discrete.
Output:
- A set of coefficients that indicate the relative impact of each driver.
- A linear expression for predicting outcome as a function of drivers.

Linear Regression - Use Cases

The preferred method for almost any problem where we are predicting a continuous outcome
- Try this first; if it fails, then try something more complicated
Examples:
- Customer lifetime value
- Home value
- Loss given default on loan
- Income as a function of demographics

Example: Predict Mortgage Foreclosure/Delinquency Rates

fdq_rate = -0.9 + 0.66 CurrentUnemp + 1.06 ChgInUnem1yr + 0.22 hicost_mort_rate

Technical Description

$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots$$
Here b_0 is the constant, b_1, b_2, ... are the coefficients, and x_1, x_2, ... are the independent variables.
Solve for the b_i
- Ordinary Least Squares
  - storage quadratic in number of variables
  - must invert a matrix
Categorical variables are expanded to a set of indicator variables, one for each possible value other than a default level.

Representing Categorical Variables

State is a categorical variable: 50 possible values.
Expand it to 49 indicator (0/1) variables:
- The remaining level is the "default level"
- This is done automatically by standard packages
Gender is categorical, too, but binary
- so one variable: genderMale, which is 0 for females

What do the Coefficients b_i Mean?
Change in y as a function of unit change in x_i
- all other things being equal
Example: income in units of $10K, age in years, b_age = 2
- For the same gender, years of education, and state of residence, a person's income increases by 2 units ($20K) for every year older
Standard packages also report the significance of each b_i as a p-value: the probability of seeing an estimate at least this far from zero if, in reality, b_i = 0
- b_i is "significant" if this p-value is small

Diagnostics

Hold-out data
- Does the model predict well on data it hasn't seen?
N-fold cross-validation
- Partition the data into N groups.
- Fit N models, each time holding out one group, and calculate the residuals on the held-out group.
- Estimated prediction error is the average over all the held-out residuals (typically squared, so that positive and negative errors do not cancel).
R^2: The fraction of the variance in the output variable that the model can explain.
- It is also the square of the correlation between the true output and the predicted output. You want it close to 1.

Diagnostics (Continued)

Sanity check the coefficients
- Do the signs make sense? Are the coefficients excessively large?
  - Wrong sign is an indication of correlated inputs, but doesn't necessarily affect predictive power.
  - Excessively large coefficient magnitudes may indicate strongly correlated inputs; you may want to consider eliminating some variables, or using regularized regression techniques.
    - Ridge, Lasso
  - Infinite magnitude coefficients could indicate a variable that strongly predicts a subset of the output (and doesn't predict well on the rest).
    - Plot output vs. this input, and see if you should segment the data before regressing.

Diagnostics (Continued)

[Figure: prediction vs. true outcome for the initial model. Overpredicts for low true values, underpredicts at higher values.]
Plot it!
- Prediction vs. true outcome
Look for:
- Systematic over/under prediction
- Non-consistent variance
  - The data cloud should be symmetric about the line of true prediction
- Glaring outliers
You will see other diagnostic plots in the lab
[Figure: prediction vs. true outcome after improving the model. Not quite consistent variance, but much better.]

Linear Regression - Reasons to Choose (+) and Cautions (-)

Reasons to Choose (+)
- Concise representation (the coefficients)
- Robust to redundant variables, correlated variables
  - Lose some explanatory value
- Explanatory value
  - Relative impact of each variable on the outcome
- Easy to score data
Cautions (-)
- Does not handle missing values well
- Assumes that each variable affects the outcome linearly and additively
  - Variable transformations and modeling variable interactions can alleviate this
  - A good idea to take the log of monetary amounts or any variable with a wide dynamic range
- Can't handle variables that affect the outcome in a discontinuous way
  - Step functions
- Doesn't work well with discrete drivers that have a lot of distinct values
  - For example, ZIP code

Check Your Knowledge

1. How is the measure of significance used in determining the explanatory value of a driver with linear regression models?
2. Detail the challenges with categorical values in a linear regression model.
3. Describe the N-fold cross-validation method used for diagnosing a fitted model.
4. List two use cases of linear regression models.
5. List and discuss two standard sanity checks that you will perform on the coefficients derived from a linear regression model.
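
One way to make question 3 concrete is the minimal sketch below. It fits a one-variable least-squares line, reports R^2 on the training data, and estimates prediction error with 5-fold cross-validation. The data are made up (y = 3 + 2x plus noise), the function names are illustrative, and squared residuals are an assumed choice of error measure; this is plain Python, not the R workflow used in the lab.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1 * x for a single input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def r_squared(ys, preds):
    """Fraction of the variance in ys that the predictions explain."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def cross_validate(xs, ys, n_folds):
    """Mean squared prediction error over n_folds held-out groups."""
    squared_errors = []
    for fold in range(n_folds):
        # Every n_folds-th point (offset by fold) is held out; the rest train.
        train_x = [x for i, x in enumerate(xs) if i % n_folds != fold]
        train_y = [y for i, y in enumerate(ys) if i % n_folds != fold]
        b0, b1 = fit_line(train_x, train_y)
        for i, (x, y) in enumerate(zip(xs, ys)):
            if i % n_folds == fold:
                squared_errors.append((y - (b0 + b1 * x)) ** 2)
    return sum(squared_errors) / len(squared_errors)

# Made-up data: y = 3 + 2x plus Gaussian noise with standard deviation 1.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

b0, b1 = fit_line(xs, ys)
train_r2 = r_squared(ys, [b0 + b1 * x for x in xs])
cv_mse = cross_validate(xs, ys, n_folds=5)
print(f"b0={b0:.2f} b1={b1:.2f} R2={train_r2:.3f} CV MSE={cv_mse:.3f}")
```

Because the points were generated in random order, assigning every fifth point to a fold amounts to a random partition; with ordered data you would shuffle first.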

Module 4: Advanced Analytics – Theory and Methods
Lesson 3: Linear Regression - Summary

During this lesson the following topics were covered:
- General description of regression models
- Technical description of a linear regression model
- Common use cases for the linear regression model
- Interpretation and scoring with the linear regression model
- Diagnostics for validating the linear regression model
- The Reasons to Choose (+) and Cautions (-) of the linear regression model

Lab Exercise 6: Linear Regression

This Lab is designed to investigate and practice Linear Regression. After completing the tasks in this lab you should be able to:
- Use R functions for Linear Regression (Ordinary Least Squares – OLS)
- Predict the dependent variables based on the model
- Investigate different statistical parameter tests that measure the effectiveness of the model

Lab Exercise 6: Linear Regression - Workflow

1. Set Working directory
2. Use random number generators to create data for the OLS Model
3. Generate the OLS model using R function "lm"
4. Print and visualize the results and review the plots generated
5. Generate Summary Outputs
6. Introduce a slight non-linearity and test the model
7. Perform In-database Analysis of Linear Regression
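
Steps 2 and 3 of this workflow (create data with random number generators, then fit an OLS model) can be sketched outside R as follows. The lab itself uses R's lm; the plain-Python version below is only an illustration of what such a fit computes. It builds the matrix X^T X, whose size is quadratic in the number of variables, and solves the resulting linear system instead of forming an explicit inverse. The data-generating coefficients (1, 2, -3), the variable names, and the helper functions are all assumptions made for the example.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def ols(rows, ys):
    """Return [b0, b1, ..., bp] minimizing the sum of squared residuals."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the constant b0
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)

# Made-up data: one continuous input and one 0/1 indicator (like genderMale).
rng = random.Random(1)
rows, ys = [], []
for _ in range(500):
    x1 = rng.uniform(0, 10)
    gender_male = rng.randint(0, 1)
    rows.append((x1, gender_male))
    ys.append(1.0 + 2.0 * x1 - 3.0 * gender_male + rng.gauss(0, 0.5))

coefficients = ols(rows, ys)
print([round(b, 2) for b in coefficients])
```

The printed coefficients should land close to the values used to generate the data, which is the kind of check the lab's synthetic-data step makes possible.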

Module 4: Advanced Analytics – Theory and Methods
Lesson 4: Logistic Regression

During this lesson the following topics are covered:
- Technical description of a logistic regression model
- Common use cases for the logistic regression model
- Interpretation and scoring with the logistic regression model
- Diagnostics for validating the logistic regression model
- Reasons to Choose (+) and Cautions (-) of the logistic regression model

Logistic Regression

Used to estimate the probability that an event will occur as a function of other variables
- The probability that a borrower will default as a function of his credit score, income, the size of the loan, and his existing debts
Can be considered a classifier, as well
- Assign the class label with the highest probability
Input variables can be continuous or discrete
Output:
- A set of coefficients that indicate the relative impact of each driver
- A linear expression for predicting the log-odds of the outcome as a function of drivers (binary classification case)
  - The log-odds are easily converted to the probability of the outcome

Logistic Regression Use Cases

The preferred method for many binary classification problems:
- Especially if you are interested in the probability of an event, not just predicting the "yes or no"
- Try this first; if it fails, then try something more complicated
Binary Classification examples:
- The probability that a borrower will default
- The probability that a customer will churn
Multi-class example
- The probability that a politician will vote yes/vote no/not show up to vote on a given bill
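
The conversion from log-odds to probability mentioned in this lesson can be sketched in a few lines. The coefficients and inputs below are hypothetical values picked only to show the arithmetic, and the function names are illustrative; standard packages perform this conversion for you.

```python
import math

def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of logit: maps a log-odds value z back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(coefficients, intercept, inputs):
    """Evaluate the linear expression for the log-odds, then convert it."""
    log_odds = intercept + sum(b * x for b, x in zip(coefficients, inputs))
    return sigmoid(log_odds)

# Hypothetical coefficients and inputs, chosen only for illustration.
p = predict_probability(coefficients=[0.8, -0.5], intercept=-1.0, inputs=[2.0, 1.0])
print(round(p, 3))
```

Here the log-odds come to 0.1, so the predicted probability is just above one half; a log-odds of 0 corresponds to a probability of exactly 0.5.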

Logistic Regression Model - Example

Training data: default is 0/1
- default=1 if loan defaulted
The model will return the probability that a loan with given characteristics will default
If you only want a "yes/no" answer, you need a threshold
- The standard threshold is 0.5

Logistic Regression - Visualizing the Model

Overall fraction of default: ~20%
Logistic regression returns a score that estimates the probability that a borrower will default
The graph compares the distribution of defaulters and non-defaulters as a function of the model's predicted probability, for borrowers scoring higher than 0.1
[Figure: distributions of predicted probability for defaulters and non-defaulters. Blue = defaulters]

Technical Description (Binary Case)

$$\ln\frac{P(y=1)}{1 - P(y=1)} = b_0 + b_1 x_1 + b_2 x_2 + \dots$$
y=1 is the case of interest: 'TRUE'
LHS is called logit(P(y=1))
- hence, "logistic regression"
logit(P(y=1)) is inverted by the sigmoid function
- standard packages can return probability for you
Categorical variables are expanded as with linear regression
Iterative, not closed form solution
- "Iteratively re-weighted least squares"

What do the Coefficients b_i Mean?

Invert the logit expression:
$$\frac{P(y=1)}{1 - P(y=1)} = \exp(b_0 + b_1 x_1 + b_2 x_2 + \dots)$$
exp(b_j) tells us how the odds of y=1 change for every unit change in x_j: the odds are multiplied by exp(b_j)
Example: b_creditScore = -0.69
- exp(b_creditScore) ≈ 0.5 = 1/2
- for the same income, loan, and existing debt, the odds of default are halved for every point increase in credit score
Standard packages return the significance of the coefficients in the same way as in linear regression
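
The credit score example can be checked numerically. The sketch below takes the coefficient -0.69 from the example and an assumed starting probability of default of 20% (not a figure tied to any particular borrower), then applies a one-point increase in credit score. It shows that the odds are halved while the probability falls by somewhat less than half.

```python
import math

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

def prob(o):
    """Probability corresponding to odds o."""
    return o / (1.0 + o)

b_credit_score = -0.69                 # coefficient from the example
odds_multiplier = math.exp(b_credit_score)

p_before = 0.20                        # assumed starting probability of default
odds_after = odds(p_before) * odds_multiplier
p_after = prob(odds_after)

print(f"multiplier={odds_multiplier:.3f}")
print(f"odds: {odds(p_before):.3f} -> {odds_after:.3f}")
print(f"probability: {p_before:.3f} -> {p_after:.3f}")
```

The distinction matters when reading coefficients: exp(b_j) acts on the odds, so the change in probability depends on where the borrower starts.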

An Interesting Fact About Logistic Regression

"The probability mass equals the counts"
If 13% of our loan risk training set defaults
- The sum of all the training set scores will be 13% of the number of training examples
If 40% of applicants with income < $50,000 default
- The sum of all the training set scores of people in this income category will be 40% of the number of examples in this income category (provided the category is one of the model's input variables)

Diagnostics

Hold-out data:
- Does the model predict well on data it hasn't seen?
N-fold cross-validation: Formal estimate of generalization error
"Pseudo-R^2": 1 – (deviance/null deviance)
- Deviance, null deviance both reported by most standard packages
- The fraction of "variance" that is explained by the model
- Used the way R^2 is used

Diagnostics (Cont.)

Sanity check the coefficients
- Do the signs make sense? Are the coefficients excessively large?
  - Wrong sign is an indication of correlated inputs, but doesn't necessarily affect predictive power.
  - Excessively large coefficient magnitudes may indicate strongly correlated inputs; you may want to consider eliminating some variables, or using regularized regression techniques.
    - Unfortunately, regularized logistic regression is not standard.
  - Infinite magnitude coefficients could indicate a variable that strongly predicts a subset of the output (and doesn't predict well on the rest).
    - Try a Decision Tree on that variable, to see if you should segment the data before regressing.

Diagnostics: ROC Curve

- Area under the curve (AUC) tells you how well the model predicts.
  (Ideal AUC = 1)
- For logistic regression, the ROC curve can help set the classifier threshold

Diagnostics: Plot the Histograms of Scores

[Figure: histograms of scores for the two classes, annotated "good separation"]

Logistic Regression - Reasons to Choose (+) and Cautions (-)

Reasons to Choose (+)
- Explanatory value
  - Relative impact of each variable on the outcome, in a more complicated way than linear regression
- Robust with redundant variables, correlated variables
  - Lose some explanatory value
- Concise representation with the coefficients
- Easy to score data
- Returns good probability estimates of an event
- Preserves the summary statistics of the training data
  - "The probabilities equal the counts"
Cautions (-)
- Does not handle missing values well
- Assumes that each variable affects the log-odds of the outcome linearly and additively
  - Variable transformations and modeling variable interactions can alleviate this
  - A good idea to take the log of monetary amounts or any variable with a wide dynamic range
- Cannot handle variables that affect the outcome in a discontinuous way
  - Step functions
- Doesn't work well with discrete drivers that have a lot of distinct values
  - For example, ZIP code

Check Your Knowledge

1. What is a logit and how do we compute class probabilities from the logit?
2. How is the ROC curve used to diagnose the effectiveness of the logistic regression model?
3. What is Pseudo R^2 and what does it measure in a logistic regression model?
4. How do you describe a binary class problem?
5. Compare and contrast linear and logistic regression methods.
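
As a worked illustration for question 2, the sketch below builds ROC points by sweeping the classification threshold over a set of scores and then computes the area under the curve with the trapezoid rule. The eight scores and labels are invented for the example, and the function names are illustrative rather than taken from any package.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lowering the threshold step by step classifies more examples as positive.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Invented model scores and true labels (1 = event occurred).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
print(points)
print(auc(points))  # 0.8125
```

Each point on the curve corresponds to one candidate threshold, which is why the curve can guide the choice of threshold: pick the point whose trade-off between true positives and false positives suits the application.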

Module 4: Advanced Analytics – Theory and Methods
Lesson 4: Logistic Regression - Summary

During this lesson the following topics were covered:
- Technical description of a logistic regression model
- Common use cases for the logistic regression model
- Interpretation and scoring with the logistic regression model
- Diagnostics for validating the logistic regression model
- Reasons to Choose (+) and Cautions (-) of the logistic regression model

Lab Exercise 7: Logistic Regression

This Lab is designed to investigate and practice Logistic Regression. After completing the tasks in this lab you should be able to:
- Use R functions for Logistic Regression (also known as Logit)
- Predict the dependent variables based on the model
- Investigate different statistical parameter tests that measure the effectiveness of the model

Lab Exercise 7: Logistic Regression - Workflow

1. Set the Working Directory
2. Define the problem and review input data
3. Read in and Examine the Data
4. Build and Review logistic regression Model
5. Review and interpret the coefficients
6. Visualize the Model Using the Plot Function
7. Use relevel Function to re-level the Price factor with value 30 as the base reference
8. Plot the ROC Curve
9. Predict Outcome given Age and Income
10. Predict outcome for a sequence of Age values at price 30 and income at its mean
11. Predict outcome for a sequence of income at price 30 and Age at its mean
12. Use Logistic regression as a classifier
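
As a closing illustration of two points from this lesson, the "probability mass equals the counts" fact and the pseudo-R^2 diagnostic, the sketch below fits a one-variable logistic regression by Newton's method, which for this model is equivalent to iteratively re-weighted least squares. The ten data points are invented, the function names are illustrative, and a fixed iteration count stands in for a proper convergence test; this is plain Python, not the R functions used in the lab.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, iterations=25):
    """Fit logit(P(y=1)) = b0 + b1 * x by Newton's method."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        ps = [sigmoid(b0 + b1 * x) for x in xs]
        # Gradient of the log-likelihood with respect to b0 and b1.
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
        # Entries of the 2x2 matrix X^T W X, with weights p * (1 - p).
        ws = [p * (1.0 - p) for p in ps]
        h00 = sum(ws)
        h01 = sum(w * x for w, x in zip(ws, xs))
        h11 = sum(w * x * x for w, x in zip(ws, xs))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system for the coefficient update.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def deviance(ys, ps):
    """Minus twice the log-likelihood of 0/1 outcomes under probabilities ps."""
    return -2.0 * sum(y * math.log(p) + (1 - y) * math.log(1.0 - p)
                      for y, p in zip(ys, ps))

# Invented training data: the outcome becomes more likely as x grows.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

b0, b1 = fit_logistic(xs, ys)
scores = [sigmoid(b0 + b1 * x) for x in xs]

# Null model: predict the overall fraction of positives for every example.
base_rate = sum(ys) / len(ys)
null_deviance = deviance(ys, [base_rate] * len(ys))
pseudo_r2 = 1.0 - deviance(ys, scores) / null_deviance

print(f"sum of scores={sum(scores):.6f} count of positives={sum(ys)}")
print(f"pseudo-R2={pseudo_r2:.3f}")
```

At the fitted coefficients the training scores sum to the number of positive examples, which is the "counts" property for the whole training set; the same holds within any category that is itself represented by an indicator variable in the model.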