

Full Transcript


MARKETING ANALYTICS
K S Deepika, Assistant Professor, Department of Management Studies
Applications in classification and data reduction

Dimensionality Reduction

The number of input features, variables, or columns present in a given dataset is known as its dimensionality, and the process of reducing these features is called dimensionality reduction. "It is a way of converting a higher-dimensional dataset into a lower-dimensional dataset while ensuring that it provides similar information." It is used in machine learning to train algorithms. Fields of usage include speech recognition, signal processing, bioinformatics, data visualization, noise reduction, and cluster analysis.

Dimensionality reduction techniques fall into two families:
- Feature selection: filter methods, wrapper methods, and intrinsic/embedded methods
- Feature extraction: Principal Component Analysis (PCA), Factor Analysis, and Singular Value Decomposition (SVD)

Feature Extraction

Feature extraction is a part of the dimensionality reduction process in which an initial set of raw data is divided and reduced to more manageable groups: "taking the original features, mapping them into a lower-dimensional space, and expressing them as functions of the original feature set." The lower dimensions should be uncorrelated and should capture a large share of the variance. Feature extraction can be applied to images, text, geospatial data, dates and times, web data, and sensor data.

Factor Analysis

Factor analysis is an interdependence technique: a set of techniques which, by analyzing correlations between variables, reduces their number into fewer factors that explain much of the original data more economically.

Assumptions:
1. Variables must be related; there should be a sufficient number of correlations (Bartlett's test).
2. Sample size: a minimum of 50 observations, preferably 100; a minimum of 5 observations per item, preferably 10.

Types:
- Exploratory FA (EFA), e.g., Principal Component Analysis (PCA; Thurstone)
- Confirmatory FA (CFA), e.g., Structural Equation Modelling (SEM)

Issues with FA include overloading and cross-loading; a common remedy is to identify the variable with the lowest communality and delete it. What counts as a "high" loading is discussed later. The proportion of variance in any one of the original variables that is captured by the extracted factors is known as its communality.

Key terms:
- Factor: a linear composite of variables.
- Factor score: a respondent's score on a given factor.
- Eigenvalue: the sum of squared loadings of all variables on a factor; factors with eigenvalues less than 1 are omitted.
- Scree plot: a plot of eigenvalues used to decide how many factors to retain.

Factor analysis is done in two stages, extraction of factors and rotation of the solution obtained in the first stage, and is best performed with interval- or ratio-scaled variables.

STEPS
1. Load the data.
2. Run the factor analysis.
3. Examine the output: correlation matrix, Bartlett's test of sphericity, KMO (> 0.5 indicates sampling adequacy), communalities, total variance explained, scree plot, component matrix, and rotated component matrix.

KMO and Bartlett's Test

The KMO measure of sampling adequacy is a test to assess the appropriateness of using factor analysis on the data set. Bartlett's test of sphericity tests the null hypothesis that the variables in the population correlation matrix are uncorrelated. KMO returns values between 0 and 1. If the KMO value lies between 0.8 and 1, the sampling is adequate. If it is below 0.6 (including the 0.5-0.6 range), the sampling is not adequate and remedial action is needed. If the KMO value is closer to 0, the data contain a large number of partial correlations relative to the sum of correlations, and the data are not suited to factor analysis.
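As a quick illustration of these adequacy checks, here is a minimal sketch in Python, assuming the third-party factor_analyzer package; the small synthetic dataset and variable names are mine, not part of the original slides:

```python
# Minimal sketch: sampling-adequacy checks before factor analysis.
# Assumes the third-party `factor_analyzer` package; the data below
# are synthetic stand-ins for real survey responses.
import numpy as np
import pandas as pd
from factor_analyzer.factor_analyzer import (
    calculate_bartlett_sphericity,
    calculate_kmo,
)

rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 1))                # one hidden factor
items = latent + 0.5 * rng.normal(size=(100, 4))  # four correlated items
df = pd.DataFrame(items, columns=["q1", "q2", "q3", "q4"])

chi_square, p_value = calculate_bartlett_sphericity(df)  # H0: uncorrelated
kmo_per_item, kmo_overall = calculate_kmo(df)            # values in 0..1

print(f"Bartlett chi-square = {chi_square:.2f}, p = {p_value:.4f}")
print(f"Overall KMO = {kmo_overall:.3f}")  # > 0.5 suggests sampling adequacy
```

A significant Bartlett result together with a KMO comfortably above 0.5 is the green light to proceed with extraction.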
Scree Plot

[The deck shows a scree plot of eigenvalues against the number of factors at this point.]

Applications

In marketing research, a common application of factor analysis is to understand the underlying motives of consumers who buy a product category or a brand. For example, a two-wheeler manufacturer may want to determine which variables his potential customers think about when they consider his product.

Worked Example: Characteristics of a Great Teacher

The school system of a major city wanted to determine the characteristics of a great teacher, so it asked 120 students to rate the importance of each of the following 9 criteria on a Likert scale of 1 to 10, with 10 meaning that a characteristic is extremely important and 1 meaning that it is not important:
- Setting high expectations for the students
- Entertaining
- Able to communicate effectively
- Having expertise in their subject
- Able to motivate
- Caring
- Charismatic
- Having a passion for teaching
- Friendly and easy-going

Questions to answer:
- Identify the number of factors. Group the variables into their respective components and name the factors.
- What is an eigenvalue, and how do you decide the number of factors?
- What does it mean for the model to be a good fit, and which output determines this?
- How do you infer that the number of samples taken is sufficient for factor analysis?
- How much communality is contributed by factor 1, by factor 2, and by both together?

Introduction to the Two-Wheeler Example

1. Factor analysis is a set of techniques for understanding variables by grouping them into "factors" consisting of similar variables.
2. It can also be used to confirm whether a hypothesized set of variables groups into a factor or not.
3. It is most useful when a large number of variables needs to be reduced to a smaller set of "factors" that contain most of the variance of the original variables.
4. Generally, factor analysis is done in two stages: extraction of factors and rotation of the solution obtained in the first stage.
5. Factor analysis is best performed with interval- or ratio-scaled variables.

Let us assume that twenty two-wheeler owners were surveyed by the manufacturer (or by a marketing research company on its behalf). They were asked to indicate, on a seven-point scale (1 = completely agree, 7 = completely disagree), their agreement or disagreement with a set of ten statements relating to their perceptions of, and some attributes of, two-wheelers. The objective of the factor analysis is to find underlying "factors," fewer than 10 in number, that are linear combinations of the original 10 variables.

The research design for data collection can be stated as follows: twenty 2-wheeler users were surveyed about their perceptions and image attributes of the vehicles they owned. Ten questions were asked of each, all answered on a scale of 1 to 7 (1 = completely agree, 7 = completely disagree):
1. I use a 2-wheeler because it is affordable.
2. It gives me a sense of freedom to own a 2-wheeler.
3. Low maintenance cost makes a 2-wheeler very economical in the long run.
4. A 2-wheeler is essentially a man's vehicle.
5. I feel very powerful when I am on my 2-wheeler.
6. Some of my friends who don't have their own vehicle are jealous of me.
7. I feel good whenever I see the ad for a 2-wheeler on TV, in a magazine or on a hoarding.
8. My vehicle gives me a comfortable ride.
9. I think 2-wheelers are a safe way to travel.
10. Three people should be legally allowed to travel on a 2-wheeler.

Input Data

The input data containing the responses of the twenty respondents to the 10 statements form a 20-row by 10-column matrix (Appendix 1), reproduced below.

S.No.  Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Q10
 1      1   4   1   6   5   6   5   2   3   2
 2      2   3   2   4   3   3   3   5   5   2
 3      2   2   2   1   2   1   1   7   6   2
 4      5   1   4   2   2   2   2   3   2   3
 5      1   2   2   5   4   4   4   1   1   2
 6      3   2   3   3   3   3   3   6   5   3
 7      2   2   5   1   2   1   2   4   4   5
 8      4   4   3   4   4   5   3   2   3   3
 9      2   3   2   6   5   6   5   1   4   1
10      1   4   2   2   1   2   1   4   4   1
11      1   5   1   3   2   3   2   2   2   1
12      1   6   1   1   1   1   1   1   2   2
13      3   1   4   4   4   3   3   6   5   3
14      2   2   2   2   2   2   2   1   3   2
15      2   5   1   3   2   3   2   2   1   6
16      5   6   3   2   1   3   2   5   5   4
17      1   4   2   2   1   2   1   1   1   3
18      2   3   1   1   2   2   2   3   2   2
19      3   3   2   3   4   3   4   3   3   3
20      4   3   2   7   6   6   6   2   3   6

Steps in Factor Analysis

[The deck presents the correlation matrix and the factor-extraction output at this point.]

Interpretation of the Output

1. The first step in interpreting the output is to look at the factors extracted, their eigenvalues, and the cumulative percentage of variance (Fig. 3, reproduced below).

Fig. 3: Final Statistics

Variable    Communality        Factor  Eigenvalue  Pct of Var  Cum Pct
VAR00001    .72243             1       3.88282     38.8        38.8
VAR00002    .45214             2       2.77701     27.8        66.6
VAR00003    .73056             3       1.37475     13.7        80.3
VAR00004    .94488
VAR00005    .95038
VAR00006    .91376
VAR00007    .95474
VAR00008    .79869
VAR00009    .77745
VAR00010    .78946

2. We note that three factors have been extracted, based on our criterion that only factors with eigenvalues of 1 or more should be extracted. The Cum Pct (cumulative percentage of variance explained) column in Fig. 3 shows that the three factors together account for 80.3 percent of the total variance (the information contained in the original ten variables). This is a good bargain: we economise on the number of variables (from 10 down to 3 underlying factors) while losing only about 20 percent of the information content (80 percent is retained by the 3 factors extracted from the 10 original variables). This represents a reasonably good solution to our problem.

Steps in Factor Analysis: Factor Rotation

Output - Factor Matrix (Unrotated):
Fig. 2: Factor Matrix (Unrotated)

Variable    Factor 1   Factor 2   Factor 3
VAR00001     .17581     .66967     .49301
VAR00002    -.13577    -.60774     .25369
VAR00003    -.10651     .81955     .21827
VAR00004     .96647    -.03627    -.09745
VAR00005     .95098     .16594    -.13593
VAR00006     .95184    -.08442    -.02522
VAR00007     .97128     .09591    -.04636
VAR00008    -.32171     .77498    -.03757
VAR00009    -.06890     .73502    -.48213
VAR00010     .16143     .31862    -.81356

Interpretation of the Output

Now we try to interpret what these 3 extracted factors represent. This we can accomplish by looking at Figs. 4 and 2, the rotated and unrotated factor matrices.

Fig. 4: Rotated Factor Matrix

Variable    Factor 1   Factor 2   Factor 3
VAR00001     .13402     .34749     .76402
VAR00002    -.18143    -.64300    -.07596
VAR00003    -.10944     .62985     .56742
VAR00004     .96986    -.06383    -.01338
VAR00005     .96455     .13362     .04660
VAR00006     .94544    -.13868     .02600
VAR00007     .97214     .02862     .09411
VAR00008    -.26169     .85203     .06517
VAR00009     .00891     .87772    -.08347
VAR00010     .07209    -.10990     .87874

Looking down the column for Factor 1 in Fig. 4, variables 4, 5, 6 and 7 load high, so factor 1 is a combination of these four variables (statements 4 to 7).

1. Now we will attempt to interpret factor 2. We look in Fig. 4, down the column for Factor 2, and find that variables 8 and 9 have high loadings of 0.85203 and 0.87772 respectively. This indicates that factor 2 is a combination of these two variables.
2. But if we look at Fig. 2, the unrotated factor matrix, a slightly different picture emerges. Here, variable 3 also has a high loading on factor 2, along with variables 8 and 9. It is left to the researcher which interpretation to use, as there are no hard and fast rules. Assuming we decide to use all three variables, the related statements are "low maintenance", "comfort" and "safety" (statements 3, 8 and 9). We may combine these variables into a factor called "utility" or "functional features", or any other similar phrase that captures the essence of these three statements.
3. For interpreting factor 3, we look at the column labelled Factor 3 in Fig. 4 and find that variables 1 and 10 load high on it. According to the unrotated factor matrix of Fig. 2, only variable 10 loads high on factor 3. Supposing we stick to Fig. 4, the combination of "affordability" and "cost saving by three people legally riding on a 2-wheeler" gives the impression that factor 3 could be "economy" or "low cost".
4. We have now completed interpretation of the 3 factors with eigenvalues of 1 or more.
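The extraction-and-rotation workflow above can be reproduced outside SPSS. A minimal sketch, assuming Python with the third-party factor_analyzer package; the input matrix is the one tabulated earlier, and the numbers printed will differ slightly from the SPSS figures in the slides:

```python
# Minimal sketch: principal-component extraction with varimax rotation,
# mirroring the two-wheeler example. Assumes the third-party
# `factor_analyzer` package.
import numpy as np
from factor_analyzer import FactorAnalyzer

# 20 respondents x 10 statements (the input-data table above).
X = np.array([
    [1, 4, 1, 6, 5, 6, 5, 2, 3, 2], [2, 3, 2, 4, 3, 3, 3, 5, 5, 2],
    [2, 2, 2, 1, 2, 1, 1, 7, 6, 2], [5, 1, 4, 2, 2, 2, 2, 3, 2, 3],
    [1, 2, 2, 5, 4, 4, 4, 1, 1, 2], [3, 2, 3, 3, 3, 3, 3, 6, 5, 3],
    [2, 2, 5, 1, 2, 1, 2, 4, 4, 5], [4, 4, 3, 4, 4, 5, 3, 2, 3, 3],
    [2, 3, 2, 6, 5, 6, 5, 1, 4, 1], [1, 4, 2, 2, 1, 2, 1, 4, 4, 1],
    [1, 5, 1, 3, 2, 3, 2, 2, 2, 1], [1, 6, 1, 1, 1, 1, 1, 1, 2, 2],
    [3, 1, 4, 4, 4, 3, 3, 6, 5, 3], [2, 2, 2, 2, 2, 2, 2, 1, 3, 2],
    [2, 5, 1, 3, 2, 3, 2, 2, 1, 6], [5, 6, 3, 2, 1, 3, 2, 5, 5, 4],
    [1, 4, 2, 2, 1, 2, 1, 1, 1, 3], [2, 3, 1, 1, 2, 2, 2, 3, 2, 2],
    [3, 3, 2, 3, 4, 3, 4, 3, 3, 3], [4, 3, 2, 7, 6, 6, 6, 2, 3, 6],
])

fa = FactorAnalyzer(n_factors=3, method="principal", rotation="varimax")
fa.fit(X)

print("Eigenvalues:", np.round(fa.get_eigenvalues()[0][:3], 3))
print("Communalities:", np.round(fa.get_communalities(), 3))
print("Rotated loadings:\n", np.round(fa.loadings_, 3))
```

The rotated loadings matrix printed here plays the role of Fig. 4, and the communalities correspond to the first column of Fig. 3.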
We will now look at some additional issues which may be important in using factor analysis.

Additional Issues in Interpreting Solutions

1. We must guard against the possibility that a variable may load highly on more than one factor. Strictly speaking, a variable should load close to 1.00 on one and only one factor, and close to 0 on the others. If this is not the case, it indicates either that the sample of respondents holds more than one opinion about the variable, or that the question/variable is unclear in its phrasing.
2. The other issue important in the practical use of factor analysis is the question "what should be considered a high loading, and what should not?" Unfortunately there is no clear-cut guideline, and often we must look at relative values in the factor matrix. Sometimes 0.7 may be treated as a high value, while at other times 0.9 could be the cutoff.
3. The proportion of variance in any one of the original variables which is captured by the extracted factors is known as communality. For example, Fig. 3 tells us that after 3 factors were extracted and retained, the communality is 0.72243 for variable 1, 0.45214 for variable 2, and so on. This means that 72.24 percent of the variance (information content) of variable 1 is captured by our 3 extracted factors together. Variable 2 exhibits a low communality of 0.45214, implying that only 45.21 percent of its variance is captured by the extracted factors. This may partially explain why variable 2 does not appear in our final interpretation of the factors: it may be an independent variable which does not combine well with any other variable and should be investigated separately. "Freedom" could be a distinct concept in the minds of our target audience.
4. As a final comment, it is the author's recommendation that we use the rotated factor matrix (rather than the unrotated factor matrix) for interpreting factors, particularly when the principal components method is used for extraction in stage 1.

Discriminant Analysis for Classification and Prediction
K S Deepika, Department of Management Studies

Application Areas

1. The major application area for this technique is where we want to distinguish between two or three sets of objects or people, based on knowledge of some of their characteristics.
2. Examples include the selection process for a job, the admission process of an educational programme in a college, or dividing a group of people into potential buyers and non-buyers.
3. Discriminant analysis can be, and in fact is, used by credit rating agencies to rate individuals and classify them into good or bad lending risks. The detailed example discussed later shows how to do this.
4. To summarise, we can use linear discriminant analysis when we have to classify objects into two or more groups based on knowledge of some variables (characteristics) related to them. Typically these groups would be users versus non-users, potentially successful versus potentially unsuccessful salesmen, high-risk versus low-risk consumers, or similar.

Methods, Data etc.

1. Discriminant analysis is very similar to the multiple regression technique. The form of the equation in a two-variable discriminant analysis is:

   Y = a + k1*x1 + k2*x2

2. This is called the discriminant function. As in a regression analysis, Y is the dependent variable and x1 and x2 are independent variables; k1 and k2 are the coefficients of the independent variables, and a is a constant. In practice, there may be any number of x variables.
3. Please note that Y in this case is a categorical variable (unlike in regression analysis, where it is continuous), while x1 and x2 are continuous (metric) variables. k1 and k2 are determined by appropriate algorithms in the computer package used, but the underlying objective is that these two coefficients should maximise the separation or difference between the two groups of the Y variable.
4. Y will have 2 possible values in a 2-group discriminant analysis, 3 values in a 3-group discriminant analysis, and so on.
5. k1 and k2 are also called the unstandardised discriminant function coefficients.
6. As mentioned above, Y is a classification into 2 or more groups and is therefore a 'grouping' variable in the terminology of discriminant analysis. That is, groups are formed on the basis of existing data and coded as 1 and 2.
7. The independent (x) variables are continuous-scale variables used as predictors of the group to which an object will belong. Therefore, to be able to use discriminant analysis, we need some data on Y and the x variables from experience and/or past records.

Building a Model for Prediction/Classification

Assuming we have data on both the Y and x variables of interest, we estimate the coefficients of the model, which is a linear equation of the form shown earlier, and use these coefficients to calculate the Y value (discriminant score) for any new data point we want to classify into one of the groups. A decision rule is formulated for this process: the cut-off score is usually the midpoint of the mean discriminant scores of the two groups.

Accuracy of Classification: the existing data points are then classified using the equation, and the accuracy of the model is determined. This output is given by the classification matrix (also called the confusion matrix), which tells us what percentage of the existing data points is correctly classified by the model.

Stepwise / Fixed Model: just as in regression, we have the option of entering one variable at a time (stepwise) into the discriminant equation, or of entering all the variables we plan to use. Depending on the correlations between the independent variables and the objective of the study (exploratory or predictive/confirmatory), the choice is left to the student.

Relative Importance of Independent Variables

1. Suppose we have two independent variables, x1 and x2. How do we know which one is more important in discriminating between groups?
2. The coefficients of x1 and x2 provide the answer, but not the raw (unstandardised) coefficients. To overcome the problem of different measurement units, we must obtain standardised discriminant coefficients, which are available from the computer output.
3. The higher the standardised discriminant coefficient of a variable, the higher its discriminating power.

A Priori Probability of Classification into Groups

The discriminant analysis algorithm requires us to assign an a priori (before analysis) probability of a given case belonging to one of the groups. There are two ways of doing this:
- Assign an equal probability to all groups. Thus, in a 2-group discriminant analysis, we can assign 0.5 as the probability of a case being assigned to either group.
- Formulate any other rule for the assignment of probabilities. For example, the probabilities could be proportional to group sizes in the sample data: if two-thirds of the sample is in one group, the a priori probability of a case being in that group would be 0.66 (two-thirds).
[The deck lists the steps involved in conducting discriminant analysis, and the corresponding SPSS procedure, at this point.]

Case Study

We now turn to a complete worked example which will clarify many of the concepts explained earlier, beginning with the problem statement and input data.

Suppose State Bank of Bhubaneswar (SBB) wants to start a credit card division. It wants to use discriminant analysis to set up a system to screen applicants and classify them as either 'low risk' or 'high risk' (risk of default on credit card bill payments), based on information collected from their credit card applications. Suppose SBB has managed to get from SBI, its sister bank, data on SBI's credit card holders who turned out to be 'low risk' (no default) and 'high risk' (defaulting on payments) customers. These data on 18 customers are given in Fig. 1.

Fig. 1: Input Data

S.No.  RISKLOHI  AGE  INCOME  YRSMARID
 1        1       35   40000     8
 2        1       33   45000     6
 3        1       29   36000     5
 4        2       22   32000     0
 5        2       26   30000     1
 6        1       28   35000     6
 7        2       30   31000     7
 8        2       23   27000     2
 9        1       32   48000     6
10        2       24   12000     4
11        2       26   15000     3
12        1       38   25000     7
13        1       40   20000     5
14        2       32   18000     4
15        1       36   24000     3
16        2       31   17000     5
17        2       28   14000     3
18        1       33   18000     6

We will perform a discriminant analysis and advise SBB on how to set up its system to screen potential good customers (low risk) from bad customers (high risk). In particular, we will build a discriminant function (model) and find out:
1. The percentage of customers it is able to classify correctly.
2. The statistical significance of the discriminant function.
3. Which variables (age, income, or years of marriage) are relatively better at discriminating between 'low' and 'high' risk applicants.
4. How to classify a new credit card applicant as 'low risk' or 'high risk', by building a decision rule and a cut-off score.
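The slides run this analysis in a package such as SPSS or STATISTICA. As a cross-check, here is a minimal sketch assuming Python with scikit-learn; note that scikit-learn's scaling of the discriminant function differs from the SPSS/STATISTICA output reproduced below, so the classifications, not the raw coefficients, are what should match:

```python
# Minimal sketch: two-group linear discriminant analysis on the
# SBB case data (Fig. 1). Assumes scikit-learn.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix

# Columns: AGE, INCOME, YRSMARID; y: 1 = low risk, 2 = high risk.
X = np.array([
    [35, 40000, 8], [33, 45000, 6], [29, 36000, 5], [22, 32000, 0],
    [26, 30000, 1], [28, 35000, 6], [30, 31000, 7], [23, 27000, 2],
    [32, 48000, 6], [24, 12000, 4], [26, 15000, 3], [38, 25000, 7],
    [40, 20000, 5], [32, 18000, 4], [36, 24000, 3], [31, 17000, 5],
    [28, 14000, 3], [33, 18000, 6],
])
y = np.array([1, 1, 1, 2, 2, 1, 2, 2, 1, 2, 2, 1, 1, 2, 1, 2, 2, 1])

# priors=[0.5, 0.5] mirrors the equal a priori probabilities assumed
# in the slides.
lda = LinearDiscriminantAnalysis(priors=[0.5, 0.5]).fit(X, y)

predicted = lda.predict(X)
print(confusion_matrix(y, predicted))        # the classification matrix
print("Accuracy:", (predicted == y).mean())  # proportion correctly classified
```

The confusion matrix printed here plays the role of Fig. 3 in the interpretation that follows.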
Interpretation

The input data are given in Fig. 1. We will now find answers to the four questions raised earlier, using the computer output.

Q1. How good is the model? How many of the 18 data points does it classify correctly?

To answer this question, we look at the output labelled Fig. 3. This is part of the discriminant analysis output from any computer package such as SPSS, SYSTAT, STATISTICA or SAS (there could be minor variations in the exact numbers obtained, and major variations could occur if different options are chosen; for example, if equal a priori probabilities are chosen for classification into the two groups, as we have assumed while generating this output, you will very likely see similar numbers in your output).

Fig. 3: Classification Matrix (discrbkl.sta)
Rows: observed classifications. Columns: predicted classifications (a priori P = .50000 for each group).

Observed   Percent Correct   G_1 (Predicted)   G_2 (Predicted)
G_1           100.0000              9                 0
G_2            88.8889              1                 8
Total          94.4444             10                 8

This output (Fig. 3) is called the classification matrix (also known as the confusion matrix), and it indicates that the discriminant function we have obtained is able to classify 94.44 percent of the 18 objects correctly. This figure is in the Percent Correct column of the classification matrix. More specifically, it also says that of the 10 cases predicted to be in group 1, 9 were observed to be in group 1 and 1 in group 2 (from column G_1). Similarly, from column G_2, we understand that all 8 of the cases predicted to be in group 2 were observed to be in group 2. Thus, on the whole, only 1 case out of 18 was misclassified by the discriminant model, giving us a classification (or prediction) accuracy of (18-1)/18, or 94.44 percent. As mentioned earlier, this level of accuracy may not hold for all future classification of new cases, but it is still a pointer towards the model being a good one, assuming the input data were relevant and scientifically collected. There are ways of checking the validity of the model, but these will be discussed separately.

Statistical Significance

Q2. How significant, statistically speaking, is the discriminant function?

This question is answered by looking at Wilks' Lambda and the probability value for the F test given in the computer output as part of Fig. 3. Wilks' Lambda indicates how well the model discriminates between the groups. It ranges from 0 to 1, where 0 means total discrimination and 1 means no discrimination. The value of Wilks' Lambda here is 0.318; a low value (closer to 0) indicates better discriminating power, so 0.318 is an indicator of a good model. The probability value of the F test indicates that the discrimination between the two groups is highly significant.

Relative Importance of the Variables

Q3. We have 3 independent (predictor) variables: age, income, and number of years married. Which of these is a better predictor of a person being a low or high credit risk?

To answer this question, we look at the standardised coefficients in the output, given in Fig. 5.

Fig. 5: Standardized Coefficients for Canonical Variables (discrbkl.sta)

Variable    Root 1
AGE         .923955
INCOME      .774780
YRSMARID    .151298
Eigenval    2.136012
Cum.Prop    1.000000

This output shows that Age is the best predictor, with a coefficient of 0.92, followed by Income with 0.77; Years of Marriage is last, with a coefficient of 0.15. Recall that the absolute value of the standardised coefficient of each variable indicates its relative importance.

Classification

Q4. How do we classify a new credit card applicant as 'high risk' or 'low risk', and decide whether to accept or refuse him a credit card?

This is the most important question to be answered; recall that State Bank of Bhubaneswar wanted a decision model for screening credit card applicants. The way to do this is to use the outputs in Fig. 4 (raw or unstandardised coefficients of the discriminant function) and Fig. 6 (means of canonical variables). Fig. 6 gives us the new means for the transformed group centroids.

Fig. 6: Means of Canonical Variables (discrbkl.sta)

Group    Root 1
G_1:1    1.37793
G_2:2   -1.37792

Thus, the new mean for group 1 (low risk) is 1.37793, and the new mean for group 2 (high risk) is -1.37792. This means that the midpoint of the two is 0.
This is clear when we plot the two means on a straight line and locate their midpoint, as shown below:

   -1.37 ................. 0 ................. +1.37
   Mean of Group 2                       Mean of Group 1
   (High Risk)                           (Low Risk)

This also gives us a decision rule for classifying any new case: if the discriminant score of an applicant falls to the right of the midpoint, we classify him as 'low risk', and if it falls to the left of the midpoint, we classify him as 'high risk'. In this case, the midpoint is 0. Therefore, any positive (greater than 0) discriminant score leads to classification as 'low risk', and any negative (less than 0) discriminant score leads to classification as 'high risk'.

But how do we compute the discriminant score of an applicant? We take the applicant's age, income and years of marriage (from his application) and plug them into the unstandardised discriminant function. This gives us his discriminant score.

The Model

Fig. 4: Raw Coefficients for Canonical Variables (discrbkl.sta)

Variable    Root 1
AGE          .24560
INCOME       .00008
YRSMARID     .08465
Constant   -10.00335
Eigenval    2.13601
Cum.Prop    1.00000

From Fig. 4, the unstandardised (raw) discriminant function is:

   Y = -10.0036 + 0.24560 (Age) + 0.00008 (Income) + 0.08465 (Years Married)

where Y gives the discriminant score of any person whose age, income and years married are known.

Let us take the example of a credit card applicant to SBB who is aged 40, has an income of Rs. 25,000 per month, and has been married for 15 years. Plugging these values into the discriminant function above, his discriminant score Y is

   -10.0036 + 40(0.24560) + 25000(0.00008) + 15(0.08465)
   = -10.0036 + 9.824 + 2 + 1.26975
   = 3.09015

According to our decision rule, any discriminant score to the right of the midpoint of 0 leads to classification in the low-risk group. Therefore, we should give this person a credit card, as he is a low-risk customer. The same process is followed for any new applicant: if his discriminant score is to the left of the midpoint of 0, he should be denied a credit card, as he is a 'high risk' customer. We have completed answering the four questions raised by State Bank of Bhubaneswar.
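The decision rule is simple enough to implement directly. A minimal sketch in Python, using the rounded raw coefficients from Fig. 4 exactly as quoted above:

```python
# Minimal sketch: SBB's screening rule from the worked example.
# Coefficients are the rounded raw values in Fig. 4 / the text above.
def discriminant_score(age: float, income: float, yrs_married: float) -> float:
    return -10.0036 + 0.24560 * age + 0.00008 * income + 0.08465 * yrs_married

def classify(score: float) -> str:
    # Midpoint of the two group means is 0: positive -> low risk.
    return "low risk" if score > 0 else "high risk"

score = discriminant_score(age=40, income=25000, yrs_married=15)
print(round(score, 5), classify(score))  # 3.09015 low risk -> issue the card
```

Any new application can be screened by calling these two functions with the applicant's three values.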
Logistic Regression for Classification and Prediction
K S Deepika, Department of Management Studies

Introduction

Logistic regression is used to distinguish between two or more groups. Typical application areas are cases where one wishes to predict the likelihood of an entity belonging to one group or another, such as response to a marketing effort (likelihood of purchase/non-purchase), creditworthiness (high/low risk of default), or insurance (high/low risk of an accident claim). It is similar to discriminant analysis in its applications.

Binomial Logistic Regression using SPSS Statistics

A binomial logistic regression (often referred to simply as logistic regression) predicts the probability that an observation falls into one of two categories of a dichotomous dependent variable, based on one or more independent variables that can be either continuous or categorical. For example, you could use binomial logistic regression to understand whether exam performance can be predicted based on revision time, test anxiety and lecture attendance (i.e., where the dependent variable is "exam performance", measured on a dichotomous scale, "passed" or "failed", and you have three independent variables: "revision time", "test anxiety" and "lecture attendance"). Alternatively, you could use binomial logistic regression to understand whether drug use can be predicted based on prior criminal convictions, drug use amongst friends, income, age and gender (i.e., where the dependent variable is "drug use", measured on a dichotomous scale, "yes" or "no", and you have five independent variables).

Why do we use Logistic Regression rather than Linear Regression?

Logistic regression is used when the dependent variable is binary, whereas in linear regression the dependent variable is continuous. A second problem is that if we add an outlier to our dataset, the best-fit line in linear regression shifts to fit that point. Suppose we use linear regression to find the best-fit line, which aims at minimising the distance between predicted and actual values, and apply a threshold of 0.5 to the fitted value h(x): if h(x) is greater than 0.5 we predict a malignant tumour (1), and if it is less than 0.5 we predict a benign tumour (0). Everything seems fine at first, but if we add some outliers to the dataset, the best-fit line shifts towards them, and the fixed threshold starts misclassifying cases.

[The deck shows graphs of the fitted line before and after adding outliers at this point.]

Another problem with linear regression is that the predicted values may be out of range: we know that a probability lies between 0 and 1, but with linear regression this "probability" may exceed 1 or go below 0. To overcome these problems we use logistic regression, which converts the straight best-fit line of linear regression into an S-curve using the sigmoid function, which always gives values between 0 and 1.

So, fundamentally, logistic regression is a classification algorithm, used to classify the elements of a set into two groups (binary classification) by calculating the probability for each element of the set. Logistic regression is the appropriate analysis to conduct when the dependent variable is binary: we predict the values of a categorical variable.

Steps of Logistic Regression

In a logistic regression model, we decide on a probability threshold. If the predicted probability for a particular element is higher than the threshold, we classify that element into one group; otherwise into the other.

Step 1: To calculate the binary separation, we first determine the best-fitted line by following the linear regression steps.

Step 2: The regression line obtained from linear regression is highly susceptible to outliers, so it will not do a good job of classifying the two classes. The predicted value is therefore converted into a probability by feeding it to the sigmoid function.
The logistic regression hypothesis generalises the linear regression hypothesis by passing it through the logistic function, also known as the sigmoid (activation) function. The equation of the sigmoid is:

   S(z) = 1 / (1 + e^(-z))

We can feed any real number z to the sigmoid function and it will return a value between 0 and 1. Thus, if we feed the output value ŷ of the regression to the sigmoid function, it returns a probability between 0 and 1.

Step 3: Finally, the output value of the sigmoid function is converted into 0 or 1 (discrete values) based on a threshold value, usually 0.5. In this way we get the binary classification.

Now that we have the basic idea of how linear regression and logistic regression are related, let us revisit the process with an example.

Comparison of Linear Regression and Logistic Regression

Consider a problem where we are given a dataset containing height and weight for a group of people, and our task is to predict the weight for new entries in the height column. This is a regression problem: we build a linear regression model, train it with the provided height and weight values, and then predict the weight for any given unknown height.

Now suppose we have an additional field, obesity, and we have to classify whether a person is obese or not depending on their height and weight. This is clearly a classification problem, where we have to segregate the dataset into two classes (obese and not obese). For the new problem we can again follow the linear regression steps and build a regression line; this time the line will be based on the two parameters, height and weight, and will fit between two discrete sets of values. As this regression line is highly susceptible to outliers, it will not do a good job of classifying the two classes. To get a better classification, we feed the output values from the regression line to the sigmoid function, which returns a probability for each output value. Then, based on a predefined threshold value, we can easily classify each output into one of the two classes, obese or not obese.

[The deck compares the graphical patterns of logistic and linear regression at this point.]

Finally, we can summarise the similarities and differences between the two models. The linear and logistic probability models are given by the following equations:

   p = a0 + a1*x1 + a2*x2 + ... + ak*xk              (1)  (linear model)
   ln[p/(1-p)] = b0 + b1*x1 + b2*x2 + ... + bk*xk    (2)  (logistic model)

where p is the probability. From equations (1) and (2): in the linear model, the probability p is a linear function of the regressors, whereas in the logistic model the log odds ln[p/(1-p)] are a linear function of the regressors.
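A minimal sketch of the score-sigmoid-threshold pipeline described in Steps 1-3 (Python assumed; the coefficients are made up purely for illustration):

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted coefficients (b0, b1) for a one-predictor model;
# in practice these come from maximum-likelihood estimation.
b0, b1 = -4.0, 0.08

def predict(x: float, threshold: float = 0.5) -> int:
    score = b0 + b1 * x                 # Step 1: linear score (the log odds)
    p = sigmoid(score)                  # Step 2: squash into a probability
    return 1 if p >= threshold else 0   # Step 3: apply the threshold

for x in (20, 50, 80):
    print(x, round(sigmoid(b0 + b1 * x), 3), predict(x))
```

With these made-up coefficients, the 0.5 probability threshold corresponds to a score of 0, i.e. x = 50: the threshold on the probability scale is just a cut-off on the log-odds scale.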
How It Is Done

To achieve this, a regression is first performed with a transformed value of Y, called the logit function. The equation (shown below for two independent variables) is:

   Logit(Y) = ln(odds) = a + k1*x1 + k2*x2

where "odds" refers to the odds of Y being equal to 1. To understand the difference between odds and probabilities, consider the following example.

Example of Odds and Probability

When a coin is tossed, the probability of heads showing up is 0.5, but the odds of belonging to the group "heads" are 1.0. Odds are defined as the probability of belonging to one group divided by the probability of belonging to the other. Thus, odds = p/(1-p), and for the coin toss example, odds = 0.5/0.5 = 1.

Assumptions of Binomial Logistic Regression

When you choose to analyse your data using binomial logistic regression, part of the process involves checking that the data can actually be analysed with this technique: it is only appropriate if the data "pass" the assumptions required for binomial logistic regression to give a valid result.

- Assumption #1: Your dependent variable should be measured on a dichotomous scale. Examples of dichotomous variables include gender (two groups: "male" and "female"), presence of heart disease ("yes"/"no"), personality type ("introversion"/"extroversion"), and body composition ("obese"/"not obese"). If your dependent variable was measured on a continuous scale instead, you will need to carry out multiple regression.
- Assumption #2: You have one or more independent variables, which can be either continuous (interval or ratio) or categorical (nominal). Examples of continuous variables include revision time (measured in hours), intelligence (IQ score), exam performance (measured from 0 to 100), and weight (kg). Examples of nominal variables include gender (2 groups: male and female) and profession (e.g., 5 groups: surgeon, doctor, nurse, dentist, therapist).
- Assumption #3: You should have independence of observations, and the dependent variable should have mutually exclusive and exhaustive categories.

Example

A health researcher wants to predict whether the incidence of heart disease can be predicted from age, weight, gender and VO2max (maximal aerobic capacity, an indicator of fitness and health). To this end, the researcher recruited 100 participants to perform a maximal VO2max test, recording their age, weight and gender as well. The participants were also evaluated for the presence of heart disease. A binomial logistic regression was then run to determine whether the presence of heart disease could be predicted from VO2max, age, weight and gender.
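Before walking through the SPSS procedure, here is a hedged sketch of the same model fitted in Python with statsmodels; the data are synthetic stand-ins, since the researcher's actual file is not part of the slides:

```python
# Minimal sketch: the heart-disease logistic model outside SPSS.
# Assumes statsmodels and numpy; the data below are synthetic
# stand-ins for the researcher's 100 participants.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 100
age = rng.integers(30, 70, n)
weight = rng.normal(78, 12, n)
gender = rng.integers(0, 2, n)   # 0 = female, 1 = male
vo2max = rng.normal(40, 8, n)

# Synthetic outcome: older, heavier, less fit -> higher risk.
log_odds = -2.0 + 0.05 * (age - 50) + 0.03 * (weight - 78) - 0.08 * (vo2max - 40)
heart_disease = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

X = sm.add_constant(np.column_stack([age, weight, gender, vo2max]))
model = sm.Logit(heart_disease, X).fit(disp=False)
print(model.summary())  # coefficients, Wald tests, pseudo R-squared
```

The coefficient table printed by summary() corresponds to the "Variables in the Equation" output discussed later in this section.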
Setup in SPSS Statistics

In this example, there are six variables: (1) heart_disease, whether the participant has heart disease, "yes" or "no" (the dependent variable); (2) VO2max, the maximal aerobic capacity; (3) age, the participant's age; (4) weight, the participant's weight (technically, their mass); (5) gender, the participant's gender (variables 2 to 5 being the independent variables); and (6) caseno, the case number.

Test Procedure in SPSS Statistics

The steps below show how to analyse your data using a binomial logistic regression in SPSS Statistics when none of the assumptions in the previous section have been violated.

1. Click Analyze > Regression > Binary Logistic... on the main menu. You will be presented with the Logistic Regression dialogue box.
2. Transfer the dependent variable heart_disease into the Dependent box, and the independent variables age, weight, gender and VO2max into the Covariates box.
3. Click on the Categorical button. You will be presented with the Logistic Regression: Define Categorical Variables dialogue box. SPSS Statistics requires you to define all the categorical predictors in the logistic regression model; it does not do this automatically.
4. Transfer the categorical independent variable, gender, from the Covariates box to the Categorical Covariates box.
5. Click on the Continue button. You will be returned to the Logistic Regression dialogue box.
6. Click on the Options button and select the desired output options.
7. Click on the Continue button. You will be returned to the Logistic Regression dialogue box.
8. Click on the OK button. This will generate the output.

Logistic Regression vs Linear Models

Unlike multiple linear regression or linear discriminant analysis, logistic regression fits an S-shaped curve to the data. This curved relationship ensures that the predicted values are always between 0 and 1.

Numerical Example

To see how logistic regression works, and to compare it with discriminant analysis, consider the case study described in the discriminant analysis chapter on customer loyalty at Raymond's showroom.
The data are shown below.

Input Data

FREQ  AVGPURCH  YEARS  LOYALTY
 15     24765     3       0
 17     18654     4       0
 29     20320     1       0
 25     41230     7       1
 29     31462     5       1
 41      7232     6       0
 14     45352     4       0
 27     45320     5       1
 32     51500     5       1
 29     45782     7       1
 40     59990     9       1
 13      8920     3       0
 33     23250     5       1
  3     35000     6       0
 18     14235     2       0
 21     25550     3       0
 39     33330     7       1
 31     31654     4       1

Independent Variables

The independent variables are:
- FREQ: frequency of purchase in a year
- AVGPURCH: average purchase by the customer in a year
- YEARS: number of years the customer has been purchasing from Raymond

Score Computation

As in a regression, we can compute the score for any observation. Consider the first observation in our data, with values of 15, 24765 and 3 for the three independent variables respectively. Using the B coefficients from the output table titled "Predictors", the score for this person is

   -416.973 + 9.478(15) + 0.006(24765) - 5.733(3) = -133.68

When building a logistic regression equation involving categorical variables, remember to include the coefficients of the categorical variables in the equation.

Converting the Score into a Probability

This score is the log of the odds of being disloyal (dependent value of 1). To convert it into a probability p of being disloyal, we use the transformation

   p = e^(-133.68) / [1 + e^(-133.68)] ≈ 0

Since the probability of disloyalty is practically 0, this person is classified by the model as loyal (forecast value of the dependent variable: 0).

Classification of a New Customer

As shown above for an existing customer, the values of the independent variables are used to compute a score, which is then transformed to obtain a probability of disloyalty. If this probability is greater than 0.5, the customer is classified as disloyal (1); if it is less than 0.5, he or she is classified as loyal (0).
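A minimal sketch of this score-to-probability-to-class pipeline (Python assumed; the coefficients are the rounded values quoted above, so the computed score will differ slightly from the slide's -133.68, though the resulting probability and classification are the same):

```python
import math

# Rounded B coefficients quoted in the slides
# (constant, FREQ, AVGPURCH, YEARS).
B0, B_FREQ, B_AVG, B_YEARS = -416.973, 9.478, 0.006, -5.733

def disloyalty_probability(freq: float, avgpurch: float, years: float) -> float:
    score = B0 + B_FREQ * freq + B_AVG * avgpurch + B_YEARS * years  # log odds
    return 1.0 / (1.0 + math.exp(-score))  # sigmoid -> probability

p = disloyalty_probability(freq=15, avgpurch=24765, years=3)
label = 1 if p > 0.5 else 0                # 1 = disloyal, 0 = loyal
print(f"p(disloyal) = {p:.6f} -> classified as {label}")  # ~0 -> loyal
```

The same function classifies any new customer: compute the score from the three predictors, convert it to a probability, and compare against the 0.5 threshold.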
Omnibus Tests of Model Coefficients

The Omnibus Tests of Model Coefficients table is used to check that the new model (with explanatory variables included) is an improvement over the baseline model. It uses chi-square tests to see whether there is a significant difference between the log-likelihoods (specifically the -2LLs) of the baseline model and the new model. If the new model has a significantly reduced -2LL compared to the baseline, it suggests that the new model explains more of the variance in the outcome and is an improvement. There are three different versions: Step, Block and Model. The Model row always compares the new model to the baseline. The Step and Block rows matter only if you are adding the explanatory variables in a stepwise or hierarchical manner; if we were building the model up in stages, these rows would compare the -2LLs of the newest model with the previous version, to ascertain whether each new set of explanatory variables was causing an improvement.

Hosmer-Lemeshow Test

The Hosmer-Lemeshow test evaluates the null hypothesis that the predictions made by the model fit the observed group memberships. A chi-square statistic is computed comparing the observed frequencies with those predicted by the model. A non-significant chi-square indicates that the data fit the model well.

Variance Explained

To understand how much variation in the dependent variable can be explained by the model (the equivalent of R2 in multiple regression), consult the Model Summary table. It contains the Cox & Snell R Square and Nagelkerke R Square values, both of which are methods of calculating the explained variation. These values are sometimes referred to as pseudo R2 values; they will be lower than in multiple regression and are interpreted in the same manner, but with more caution. In this example, the explained variation in the dependent variable ranges from 24.0% to 33.0%, depending on whether you reference the Cox & Snell R2 or the Nagelkerke R2 method respectively. Nagelkerke R2 is a modification of Cox & Snell R2, the latter of which cannot achieve a value of 1; for this reason, it is preferable to report the Nagelkerke R2 value.

Category Prediction

Binomial logistic regression estimates the probability of an event (in this case, having heart disease) occurring. If the estimated probability of the event is greater than or equal to 0.5 (better than an even chance), SPSS Statistics classifies the event as occurring (e.g., heart disease being present); if it is less than 0.5, the event is classified as not occurring (e.g., no heart disease). It is very common to use binomial logistic regression to predict whether cases can be correctly classified (i.e., predicted) from the independent variables, so a method is needed to assess the effectiveness of the predicted classification against the actual classification. There are many such methods, whose usefulness often depends on the nature of the study, but all revolve around the observed and predicted classifications presented in the Classification Table. Notice that the table carries a footnote stating "The cut value is .500": if the probability of a case being classified into the "yes" category is greater than 0.5, the case is classified as "yes"; otherwise it is classified as "no". Whilst the classification table appears very simple, it actually provides a lot of important information about the binomial logistic regression result, including the following (see the sketch after this list):

A. The percentage accuracy in classification (PAC), which reflects the percentage of cases that can be correctly classified with the independent variables added (not just the overall model).
B. Sensitivity: the percentage of cases that had the observed characteristic (e.g., "yes" for heart disease) and were correctly predicted by the model (true positives).
C. Specificity: the percentage of cases that did not have the observed characteristic (e.g., "no" for heart disease) and were correctly predicted as not having it (true negatives).
D. The positive predictive value: the percentage of correctly predicted cases "with" the observed characteristic, relative to the total number of cases predicted as having the characteristic.
E. The negative predictive value: the percentage of correctly predicted cases "without" the observed characteristic, relative to the total number of cases predicted as not having the characteristic.
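A minimal sketch of these five quantities computed from a 2x2 classification table (Python assumed; the counts are made up purely for illustration):

```python
# Minimal sketch: classification-table metrics from true/false
# positive/negative counts. The counts below are illustrative only.
tp, fn = 28, 12   # observed "yes": correctly / incorrectly predicted
tn, fp = 43, 17   # observed "no":  correctly / incorrectly predicted

total = tp + fn + tn + fp
accuracy    = (tp + tn) / total   # A. percentage accuracy in classification
sensitivity = tp / (tp + fn)      # B. true-positive rate
specificity = tn / (tn + fp)      # C. true-negative rate
ppv         = tp / (tp + fp)      # D. positive predictive value
npv         = tn / (tn + fn)      # E. negative predictive value

for name, value in [("accuracy", accuracy), ("sensitivity", sensitivity),
                    ("specificity", specificity), ("PPV", ppv), ("NPV", npv)]:
    print(f"{name}: {100 * value:.1f}%")
```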
oE. The negative predictive value, which is the percentage of correctly predicted cases "without" the observed characteristic compared to the total number of cases predicted as not having the characteristic. Variables in the equation The "Variables in the Equation" table shows the contribution of each independent variable to the model and its statistical significance. This table is shown below: MARKETING ANALYTICS Binomial Logistic Regression using SPSS Statistics A logistic regression was performed to ascertain the effects of age, weight, gender and VO2max on the likelihood that participants have heart disease. The logistic regression model was statistically significant, χ2(4) = 27.402, p <.0005. The model explained 33.0% (Nagelkerke R2) of the variance in heart disease and correctly classified 71.0% of cases. Males were 7.02 times more likely to exhibit heart disease than females. Increasing age was associated with an increased likelihood of exhibiting heart disease, but increasing VO2max was associated with a reduction in the likelihood of exhibiting heart disease. THANK YOU K S Deepika Assistant Professor, Department of Management Studies [email protected]
