FBA Lecture 04 - MLR and Classification
Document Details
University of Nottingham
Dr Evgeniya Lukinova
Summary
This document is a lecture on foundational business analytics, focusing on multiple linear regression (MLR) and multiple logistic regression. It includes real-world examples from Experian.
Full Transcript
BUSI4371: Foundational Business Analytics
Instructor: Dr Evgeniya Lukinova
Week 4: MLR and Classification

0. The story so far
[Slide diagram: the course's map of machine learning. SUPERVISED LEARNING splits by output type into CLASSIFICATION (discrete) and REGRESSION (continuous, covered so far via point models and linear regression); UNSUPERVISED LEARNING splits into CLUSTERING and DIMENSIONALITY REDUCTION.]

0. The Linear Model

1. Today's Learning Objectives
We will cover two fundamental analytics models:
- Multiple Linear Regression
- Multiple Logistic Regression
You'll recognize the equations underpinning both models. You'll know conceptually what regression and classification techniques do differently, and what a hyperplane is. You will know what "Parametric Modelling" is, and its importance to business analytics. You'll learn why "optimization" is key to our work.

i. Industry Example: Experian
Experian is one of the leading users of modern business analytics. They apply analytics both to internal processes and, mostly, externally (selling analytics services/products in 80 countries). They have almost 18k employees in 39 countries and a revenue of ~£4.7b. They gave us several examples of specific projects we can study.
[Slide diagram: Experian's service areas (Credit Risk, Market Intelligence, Customer Insight, Vehicle Insight) across the customer lifecycle (Plan, Know, Acquire, Manage, Grow), with products such as Application Processing, Fraud and ID Solutions, Consumer Services, and Portfolios and Regulation.]

ii. Example Experian Regression Projects
- Opticians Store Locations: What will the likely number of customers be for a store at a particular location? Predicting a number of people (a continuous variable).
- Credit Rating Prediction: What are the likely total losses to the business if customers default? Predicting an amount in £ (a continuous variable).
- Northern Power Grid: What is the vulnerability of customers in LSOAs to power outages? Predicting a vulnerability "score" (a continuous variable).

iii. Regression Problems
All of these analytics projects try to predict continuous variables, making them regression problems. We have covered two ways of solving these so far: point models (the mean, median and mode) and linear models. Remember that our linear models are based on the equation for a straight line, y = β0 + β1x.
iv. Customer "Numbers" Modelling
The input features (with some being slightly cryptic) were:
✓ Daytime Population
✓ AvRegion
✓ AvCategory
✓ AvMaturity
✓ HHLDS
✓ pUP
✓ NCompetitors
Experian is using far more than one input variable, and just one output feature, to make their predictions. When your linear regression problem has more than one input variable, you are doing "multiple linear regression".
v. Other Experian Classification Projects
- Insurance Product Purchase: Given historical data, will a person buy a new insurance product? Predicting purchase/no-purchase (a discrete, categorical variable).
- Credit Default Prediction: Is a credit-card user going to default next month? Predicting default/non-default (a discrete, categorical variable).
- Gym: Will a customer "churn", i.e. will they stop coming to the gym? Predicting churner/non-churner (a discrete, categorical variable).

vi. Customer "Conversion" Modelling
This is a very common marketing challenge. Will a potential customer "convert" or "not-convert" if direct mailed? Predicting convert/no-convert (a discrete, categorical variable). Experian says that to do this they used an approach called "Logistic Regression".

2. Multiple Linear Regression (MLR)
The reasons to try MLR are normally clear: when you have more than one independent variable (or "input feature") and are predicting a continuous dependent variable (or "output feature"), you are doing multiple regression. And if your chosen model is a straight line, you are doing "multiple linear regression". Models now have "planes" connecting their points together, following the equation y = β0 + β1x1 + β2x2 + ... + βnxn.

As machine learners, for any model we train it. This means tuning each parameter (β0, β1, ..., βn) so that the model's predictions have the least possible error when tested on historical data, resulting in estimates (b0, b1, ..., bn).

And that's all Experian have done here. They have:
- Isolated input features in collaboration with the client (domain experts).
- Created an MLR model.
- Trained that model on historical data examples to find the best estimates for the parameters.
- Used the resulting model to make new predictions.

3. "Parametric" Models
MLR is an example of a Parametric Model: any model that has an equation whose parameters need "training". Very occasionally there is a direct "closed form" formula to do this (e.g. the least squares algorithm). However, more often than not, you have to use a computer to find the best parameters possible. This process is called "optimization"; to perform it you use an "optimizer" or a "solver".

Parametric Model: makes strong assumptions about the form of the mapping function and the data. A parametric model may work well if the assumptions turn out to be correct (and badly if they are wrong). It summarizes the data with a set of parameters of fixed size (independent of the number of training examples). E.g., linear regression, logistic regression, Naive Bayes.

Non-Parametric Model: does not make strong assumptions about the form of the mapping function. Non-parametric does not mean that they have NO parameters! A non-parametric algorithm uses a flexible number of parameters. Non-parametric methods are good when you have a lot of data and no prior knowledge about what to model. E.g., k-Nearest Neighbors, Decision Trees, Random Forests.

3. Optimizing parameters
To find the best parameters for your analytics model:
- First define an "objective function" (normally an equation that represents 'error').
- Tell the optimizer/solver which parameters it can adjust the values of.
- Press "GO".
- Wait until the optimizer finds a parameterization that has the minimum error it can find, OR let it continue until you have run out of patience / time!

3. Optimizing parameters in MLR
For MLR, the objective function is just the total error our model produces when tested on historical data. Recall: the objective function in MLR traditionally minimizes the sum of squared errors, SSE = Σ (yi − ŷi)², the difference for each data point between the model's prediction and the reality (the "residuals"), squared and added up!
3. Testing an optimized model
Once we have a model, we often want to know whether we have got trustworthy results. For this there are a number of formal testing procedures: there are numerous parametric 'tests' (e.g. the F-test). These tests assign each variable's coefficient in a regression a score called a "p-value", testing whether that coefficient is significant and the variable is reliable enough to use in your model (or whether its apparent effects are actually just random). We will come back to these in the future! For now, just pick the model that gives the best predictions.

Task I. Multiple Regression Exercise!

4. Classification Problems
We noted that several of Experian's real-world analytics projects weren't predicting a continuous variable. Instead, they were predicting a category:
- Purchase / no-purchase
- Defaulter / non-defaulter
- Churner / non-churner
- Converter / non-converter
When we are predicting a category or "class", approaches such as linear regression just won't work.

4. Classification Business Problems
In business situations, models often tend to predict a class rather than a continuous value: "classifiers" are more common than "regressors". We already have a model that can do this: the mode, a point model (e.g. malignant/benign, pregnant/not, sun/rain/snow, blue/green/brown).

Classifiers are extremely common in business analytics, across public analytics, consumer analytics, financial analytics and sports analytics: malignant/benign, breach/non-breach, pregnant/not-pregnant, upturn/downturn, win/loss/draw, outbreak/no outbreak, lapser/non-lapser, give credit/reject, buy/sell.

5. Why would "linear regression" fail?
Linear regression will do the best it can, but it is completely the wrong model to use. How can anyone be 50% pregnant? Or negatively pregnant!? The model gives spurious predictions.
[Figure: a straight line fitted to binary outcomes. Y-axis: pregnant / not-pregnant; x-axis: increase in purchases of folic acid.]

5. The solution: Logistic Regression
The answer? Use a different line for the model, and use a "probabilistic" space. That line is called the "Logistic Function", and any model that uses it is Logistic Regression. It also has an equation we can optimize!
[Figure: an S-shaped logistic curve. Y-axis: probability of pregnancy, from 0 to 1 with 0.5 marked; x-axis: increase in purchases of folic acid.]

5. Logistic Regression
The equation for logistic regression looks a bit more intimidating: y = 1 / (1 + e^(−(β0 + β1x))). However, for any data point it's not actually that much harder to calculate than when we were doing linear regression.

5. Optimizing a Logistic Regression model
✓ Note, we have no error function here. Instead, because we interpret this ratio as the probability that the data point is in the target class, we can find the best parameters using an approach called "maximum likelihood estimation".
✓ To do this, for each data point we work out the "log likelihood" that our model would have assigned it to the class it is actually in.

So, first note that for any given data point xi: if its true label is the "target class" (1, e.g. pregnant), the model gets the answer right with probability yi. But if its true label is not the target class (0, e.g. not-pregnant), then the model gets the answer right with probability (1 − yi). We want to maximise these probabilities.
5. Optimizing a Logistic Regression model
Let's denote the true class of xi as ci. Then the likelihood that the model would get every class right is the product

L = Πi yi^ci (1 − yi)^(1 − ci)

Because it is difficult to calculate the product, we can take natural logarithms and calculate the sum instead:

log L = Σi [ ci ln(yi) + (1 − ci) ln(1 − yi) ]

(You won't be tested on these formulas, so you can, if you're panicking, just trust me about them.)

5. Optimizing a Logistic Regression model
Note that the equation we got is a bit of a trick. For a particular i: if the true class label ci is one, the term reduces to ln(yi); if the true class label ci is zero, it reduces to ln(1 − yi).

5. Optimizing a Logistic Regression model
In the end we just sum these log likelihood values for every data point; it's just a formula. We find values of the parameters (β0, β1) that maximise that sum. This gives us a model with the highest, or "max", likelihood of producing the true data. All this will make more sense after you've done the tutorial!

5. Optimizing a Logistic Regression model
However, before then, note how we make real predictions from our model: if the probability produced is >0.5 we predict the target class; otherwise, we predict the other class. So in a way we are just creating a "separating line". This is called a "linear discriminant": a line (or "hyperplane") that separates the two classes.
[Figure: the logistic curve with the 0.5 probability threshold marked. Y-axis: probability of pregnancy; x-axis: increase in purchases of folic acid.]

6. "Partitioning" the Feature Space
[Figure: a one-feature example. X-axis: increase in folic acid purchases; the hyperplane (linear discriminant) separates "Customer Not Pregnant" from "Customer Pregnant", shading the area where we are at least 50% sure that, if we predicted "PREGNANT", we would be correct.]
We can also use two dimensions... or three, or four... (but then it gets hard to visualize!)
[Figure: a two-feature example. Axes: Health Food Purchases vs Donut Purchases; the hyperplane separates "Not Obese" from "Obese", shading the area where we are at least 50% confident that, if we predicted "OBESE", we would be correct.]

7. Multiple Logistic Regression
Just like linear regression, we can generalize to more input features: y = 1 / (1 + e^(−(β0 + β1x1 + β2x2 + ...))). Here β0 sets the location of the kink in the now multidimensional S-shape, β1 the stretch of the slant in the S-shape across the first feature, β2 the stretch of the slant in the S-shape across the second feature... and so on!

Task II. Logistic Regression exercise

Homework - Week 4
Of course, more extra reading and exercises:
- Read sections 14.1 and 14.7 of Chapter 14 of "Basic Business Statistics", provided on Moodle. Ignore the testing and dummy variable sections for now, but do the exercises for multiple linear regression and logistic regression.
- Read Chapter 4 of "Data Science for Business", which will teach you more about logistic regression and more about linear discriminants as classifiers.
- Continue to practice Python! Which brings us to...

Schedule (lecture; Python lab):
Week 1. Overview / Intro to Analysis; Python: Intro to Python
Week 2. Fundamental Business Stats; Python: Flow & Data Structures
Week 3. Making Linear Predictions (Quiz 1); Python: Data Structures II
Week 4. Classification Models; Python: Functions
Week 5. Decision Trees (Quiz 2); Python: Pandas
Week 6. Node Impurity & Entropy; Python: More Pandas
Week 7. Success & Ensemble Methods (Quiz 3); Python: Revision & Sklearn I
Week 8. Naïve Bayes & kNN; Python Test (20th Nov)
Week 9. Clustering I (Quiz 4); Python: Plotting & Sklearn II
Week 10. Clustering II; Python: Bringing it all Together!
Week 11. Dimensionality reduction (Quiz 5); Python: Revision