Basic Business Statistics 12th Edition - Multiple Regression PDF
2012
Summary
This document is chapter 2 of a basic business statistics textbook. It introduces multiple regression, explaining how to develop models, interpret coefficients, and use categorical variables. This document also contains examples and exercises.
Full Transcript
Basic Business Statistics 12th Edition
Chapter 2 Introduction to Multiple Regression
Copyright ©2012 Pearson Education, Inc. publishing as Prentice Hall Chap 14-1

Learning Objectives
In this chapter, you learn:
How to develop a multiple regression model
How to interpret the regression coefficients
How to determine which independent variables to include in the regression model
How to determine which independent variables are more important in predicting a dependent variable
How to use categorical variables in a regression model
How to predict a categorical dependent variable using logistic regression

The Multiple Regression Model DCOVA
Idea: Examine the linear relationship between 1 dependent variable (Y) and 2 or more independent variables (Xi)
Multiple regression model with k independent variables:
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i$
where $\beta_0$ is the Y-intercept, $\beta_1, \ldots, \beta_k$ are the population slopes, and $\varepsilon_i$ is the random error

Multiple Regression Equation
The coefficients of the multiple regression model are estimated using sample data
Multiple regression equation with k independent variables:
$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki}$
where $\hat{Y}_i$ is the estimated (or predicted) value of Y, $b_0$ is the estimated intercept, and $b_1, \ldots, b_k$ are the estimated slope coefficients
In this chapter we will use Excel or Minitab to obtain the regression slope coefficients and other regression summary measures.

Multiple Regression Equation (continued)
Two variable model: $\hat{Y} = b_0 + b_1 X_1 + b_2 X_2$
[Figure: the fitted two-variable equation drawn against the axes Y, X1, and X2]
Example: 2 Independent Variables
A distributor of frozen dessert pies wants to evaluate factors thought to influence demand
Dependent variable: Pie sales (units per week)
Independent variables: Price (in $) and Advertising (in $100s)
Data are collected for 15 weeks

Pie Sales Example
Multiple regression equation: Sales = b0 + b1 (Price) + b2 (Advertising)

Week   Pie Sales   Price ($)   Advertising ($100s)
  1      350         5.50         3.3
  2      460         7.50         3.3
  3      350         8.00         3.0
  4      430         8.00         4.5
  5      350         6.80         3.0
  6      380         7.50         4.0
  7      430         4.50         3.0
  8      470         6.40         3.7
  9      450         7.00         3.5
 10      490         5.00         4.0
 11      340         7.20         3.5
 12      300         7.90         3.2
 13      440         5.90         4.0
 14      450         5.00         3.5
 15      300         7.00         2.7

Excel Multiple Regression Output
Sales = 306.526 - 24.975(Price) + 74.131(Advertising)

Regression Statistics
Multiple R            0.72213
R Square              0.52148
Adjusted R Square     0.44172
Standard Error       47.46341
Observations         15

ANOVA
              df    SS           MS           F         Significance F
Regression     2    29460.027    14730.013    6.53861   0.01201
Residual      12    27033.306     2252.776
Total         14    56493.333

              Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%
Intercept      306.52619      114.25389        2.68285   0.01993    57.58835   555.46404
Price          -24.97509       10.83213       -2.30565   0.03979   -48.57626    -1.37392
Advertising     74.13096       25.96732        2.85478   0.01449    17.55303   130.70888
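The slides obtain the coefficients from Excel or Minitab. As a rough cross-check, the same least-squares fit can be sketched in plain Python by solving the normal equations on the 15 weeks of pie data. The helper name `fit_ols` and the use of Gauss-Jordan elimination are illustrative assumptions, not something the chapter prescribes.

```python
# Pie sales data from the table above: (sales, price in $, advertising in $100s).
data = [
    (350, 5.50, 3.3), (460, 7.50, 3.3), (350, 8.00, 3.0), (430, 8.00, 4.5),
    (350, 6.80, 3.0), (380, 7.50, 4.0), (430, 4.50, 3.0), (470, 6.40, 3.7),
    (450, 7.00, 3.5), (490, 5.00, 4.0), (340, 7.20, 3.5), (300, 7.90, 3.2),
    (440, 5.90, 4.0), (450, 5.00, 3.5), (300, 7.00, 2.7),
]


def fit_ols(rows, y):
    """Return least-squares coefficients by solving the normal equations."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(a[idx][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [x - factor * p for x, p in zip(a[row], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


# The leading 1.0 in each row is the intercept column.
rows = [(1.0, price, adv) for _, price, adv in data]
y = [float(sales) for sales, _, _ in data]
b0, b1, b2 = fit_ols(rows, y)
print(f"Sales = {b0:.3f} + ({b1:.3f})(Price) + ({b2:.3f})(Advertising)")
```

The printed coefficients should agree with the Excel output to about three decimal places; a statistics package remains the practical choice for standard errors and p-values.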
Minitab Multiple Regression Output
Sales = 306.526 - 24.975(Price) + 74.131(Advertising)

The regression equation is
Sales = 307 - 25.0 Price + 74.1 Advertising

Predictor      Coef      SE Coef   T       P
Constant       306.50    114.30     2.68   0.020
Price          -24.98     10.83    -2.31   0.040
Advertising     74.13     25.97     2.85   0.014

S = 47.4634   R-Sq = 52.1%   R-Sq(adj) = 44.2%

Analysis of Variance
Source            DF   SS      MS      F      P
Regression         2   29460   14730   6.54   0.012
Residual Error    12   27033    2253
Total             14   56493

The Multiple Regression Equation
Sales = 306.526 - 24.975(Price) + 74.131(Advertising)
where Sales is in number of pies per week, Price is in $, and Advertising is in $100s.
b1 = -24.975: sales will decrease, on average, by 24.975 pies per week for each $1 increase in selling price, net of the effects of changes due to advertising
b2 = 74.131: sales will increase, on average, by 74.131 pies per week for each $100 increase in advertising, net of the effects of changes due to price

Using The Equation to Make Predictions
Predict sales for a week in which the selling price is $5.50 and advertising is $350:
Sales = 306.526 - 24.975(Price) + 74.131(Advertising)
      = 306.526 - 24.975(5.50) + 74.131(3.5)
      = 428.62
Predicted sales is 428.62 pies
Note that Advertising is in $100s, so $350 means that X2 = 3.5

Predictions in Excel using PHStat
PHStat | regression | multiple regression …
Check the “confidence and prediction interval estimates” box
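The point prediction on the "Using The Equation to Make Predictions" slide is easy to get wrong because advertising is measured in $100s. A minimal sketch, assuming the rounded coefficients from the regression output and a hypothetical helper `predict_sales` that takes advertising in plain dollars:

```python
# Rounded coefficients from the regression output.
B0, B1, B2 = 306.526, -24.975, 74.131


def predict_sales(price_dollars, advertising_dollars):
    """Predicted weekly pie sales; advertising is converted from $ to $100s."""
    advertising_hundreds = advertising_dollars / 100.0
    return B0 + B1 * price_dollars + B2 * advertising_hundreds


predicted = predict_sales(5.50, 350)
print(round(predicted, 2))  # 428.62
```

This reproduces only the point estimate; the confidence and prediction intervals still come from PHStat or Minitab.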
Predictions in PHStat (continued)
[Figure: PHStat output worksheet with callouts for the input values, the predicted Ŷ value, the confidence interval for the mean value of Y given these X values, and the prediction interval for an individual Y value given these X values]

Predictions in Minitab
Predicted Values for New Observations
New Obs   Fit     SE Fit   95% CI            95% PI
1         428.6   17.2     (391.1, 466.1)    (318.6, 538.6)

Values of Predictors for New Observations
New Obs   Price   Advertising
1         5.50    3.50

Fit is the predicted Ŷ value. The 95% CI is the confidence interval for the mean value of Y, given these X values. The 95% PI is the prediction interval for an individual Y value, given these X values. Price and Advertising are the input values.

Coefficient of Multiple Determination
Reports the proportion of total variation in Y explained by all X variables taken together
$r^2 = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}}$

Multiple Coefficient of Determination In Excel
$r^2 = \frac{SSR}{SST} = \frac{29460.0}{56493.3} = 0.52148$
52.1% of the variation in pie sales is explained by the variation in price and advertising
(Excel output repeated from the Excel Multiple Regression Output slide; the entries used here are R Square 0.52148, Regression SS 29460.027, and Total SS 56493.333.)
Multiple Coefficient of Determination In Minitab
$r^2 = \frac{SSR}{SST} = \frac{29460.0}{56493.3} = 0.52148$
52.1% of the variation in pie sales is explained by the variation in price and advertising
(Minitab output repeated from the Minitab Multiple Regression Output slide; the entries used here are R-Sq = 52.1%, Regression SS 29460, and Total SS 56493.)

Adjusted r2
r2 never decreases when a new X variable is added to the model
This can be a disadvantage when comparing models
What is the net effect of adding a new variable?
We lose a degree of freedom when a new X variable is added
Did the new X variable add enough explanatory power to offset the loss of one degree of freedom?

Adjusted r2 (continued)
Shows the proportion of variation in Y explained by all X variables adjusted for the number of X variables used
$r^2_{adj} = 1 - \left[(1 - r^2)\left(\frac{n - 1}{n - k - 1}\right)\right]$
(where n = sample size, k = number of independent variables)
Penalizes excessive use of unimportant independent variables
Smaller than r2
Useful in comparing among models
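The two formulas above can be checked against the pie sales output. The sketch below assumes only the sums of squares reported in the ANOVA table, with n = 15 and k = 2; the function names are illustrative.

```python
def r_squared(ssr, sst):
    """Coefficient of multiple determination: SSR / SST."""
    return ssr / sst


def adjusted_r_squared(r2, n, k):
    """Adjusted r2 for n observations and k independent variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# SSR and SST from the pie sales ANOVA table.
r2 = r_squared(29460.027, 56493.333)
r2_adj = adjusted_r_squared(r2, n=15, k=2)
print(round(r2, 5), round(r2_adj, 5))
```

Both values should match the R Square and Adjusted R Square entries reported by Excel (0.52148 and 0.44172) up to rounding.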
Adjusted r2 in Excel
$r^2_{adj} = 0.44172$
44.2% of the variation in pie sales is explained by the variation in price and advertising, taking into account the sample size and number of independent variables
(Excel output repeated from the Excel Multiple Regression Output slide; the entry used here is Adjusted R Square 0.44172.)

Adjusted r2 in Minitab
$r^2_{adj} = 0.44172$
44.2% of the variation in pie sales is explained by the variation in price and advertising, taking into account the sample size and number of independent variables
(Minitab output repeated from the Minitab Multiple Regression Output slide; the entry used here is R-Sq(adj) = 44.2%.)

Is the Model Significant?
F Test for Overall Significance of the Model
Shows if there is a linear relationship between all of the X variables considered together and Y
Use F-test statistic
Hypotheses:
H0: β1 = β2 = … = βk = 0 (no linear relationship)
H1: at least one βi ≠ 0 (at least one independent variable affects Y)
F Test for Overall Significance
Test statistic:
$F_{STAT} = \frac{MSR}{MSE} = \frac{SSR / k}{SSE / (n - k - 1)}$
where FSTAT has numerator d.f. = k and denominator d.f. = (n - k - 1)

F Test for Overall Significance In Excel (continued)
$F_{STAT} = \frac{MSR}{MSE} = \frac{14730.0}{2252.8} = 6.5386$
With 2 and 12 degrees of freedom
The P-value for the F test is the Significance F entry, 0.01201
(Excel output repeated from the Excel Multiple Regression Output slide; the entries used here are Regression MS 14730.013, Residual MS 2252.776, F 6.53861, and Significance F 0.01201.)

F Test for Overall Significance In Minitab
$F_{STAT} = \frac{MSR}{MSE} = \frac{14730.0}{2252.8} = 6.5386$
With 2 and 12 degrees of freedom
The P-value for the F test is the P entry in the Analysis of Variance table, 0.012
(Minitab output repeated from the Minitab Multiple Regression Output slide; the entries used here are Regression MS 14730, Residual Error MS 2253, F 6.54, and P 0.012.)
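As a quick arithmetic check of the overall F statistic, the sketch below rebuilds MSR and MSE from the sums of squares in the ANOVA table. The critical value 3.885 is the one the slides quote for α = .05 with 2 and 12 d.f.; it is hard-coded here rather than computed, and the function name is illustrative.

```python
def overall_f(ssr, sse, n, k):
    """F statistic for overall significance: MSR / MSE."""
    msr = ssr / k
    mse = sse / (n - k - 1)
    return msr / mse


# SSR and SSE from the pie sales ANOVA table (n = 15 weeks, k = 2 predictors).
f_stat = overall_f(ssr=29460.027, sse=27033.306, n=15, k=2)

F_CRITICAL = 3.885  # F0.05 with 2 and 12 d.f., as quoted in the slides
reject_h0 = f_stat > F_CRITICAL
print(round(f_stat, 4), reject_h0)
```

Because the statistic exceeds the critical value, the sketch reaches the same decision as the slides: reject H0.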
F Test for Overall Significance (continued)
H0: β1 = β2 = 0
H1: β1 and β2 not both zero
α = .05, df1 = 2, df2 = 12
Critical value: F0.05 = 3.885
Test statistic: $F_{STAT} = \frac{MSR}{MSE} = 6.5386$
Decision: Since the FSTAT test statistic is in the rejection region (p-value < .05), reject H0
Conclusion: There is evidence that at least one independent variable affects Y
[Figure: F distribution with the rejection region (area α = .05) to the right of F0.05 = 3.885]

Residuals in Multiple Regression
Two variable model: $\hat{Y} = b_0 + b_1 X_1 + b_2 X_2$
Residual: $e_i = (Y_i - \hat{Y}_i)$
The best fit equation is found by minimizing the sum of squared errors, $\sum e^2$
[Figure: a sample observation Yi and its fitted value Ŷi at (x1i, x2i), plotted against the axes Y, X1, and X2]

Multiple Regression Assumptions
Errors (residuals) from the regression model: $e_i = (Y_i - \hat{Y}_i)$
Assumptions:
The errors are normally distributed
Errors have a constant variance
The model errors are independent

Residual Plots Used in Multiple Regression
These residual plots are used in multiple regression:
Residuals vs. Ŷi
Residuals vs. X1i
Residuals vs. X2i
Residuals vs. time (if time series data)
Use the residual plots to check for violations of regression assumptions

Are Individual Variables Significant?
Use t tests of individual variable slopes
Shows if there is a linear relationship between the variable Xj and Y holding constant the effects of other X variables
Hypotheses:
H0: βj = 0 (no linear relationship)
H1: βj ≠ 0 (linear relationship does exist between Xj and Y)

Are Individual Variables Significant?
(continued)
H0: βj = 0 (no linear relationship)
H1: βj ≠ 0 (linear relationship does exist between Xj and Y)
Test statistic:
$t_{STAT} = \frac{b_j - 0}{S_{b_j}}$ (df = n - k - 1)

Are Individual Variables Significant? Excel Output (continued)
t Stat for Price is tSTAT = -2.306, with p-value .0398
t Stat for Advertising is tSTAT = 2.855, with p-value .0145

              Coefficients   Standard Error   t Stat     P-value
Intercept      306.52619      114.25389        2.68285   0.01993
Price          -24.97509       10.83213       -2.30565   0.03979
Advertising     74.13096       25.96732        2.85478   0.01449
(Remaining Excel output is the same as on the Excel Multiple Regression Output slide.)

Are Individual Variables Significant? Minitab Output
t Stat for Price is tSTAT = -2.31, with p-value .040
t Stat for Advertising is tSTAT = 2.85, with p-value .014

Predictor      Coef      SE Coef   T       P
Constant       306.50    114.30     2.68   0.020
Price          -24.98     10.83    -2.31   0.040
Advertising     74.13     25.97     2.85   0.014
(Remaining Minitab output is the same as on the Minitab Multiple Regression Output slide.)

Inferences about the Slope: t Test Example
H0: βj = 0
H1: βj ≠ 0
From the Excel output:
For Price tSTAT = -2.306, with p-value .0398
For Advertising tSTAT = 2.855, with p-value .0145
d.f. = 15 - 2 - 1 = 12, α = .05, tα/2 = 2.1788
The test statistic for each variable falls in the rejection region (p-values < .05)
Decision: Reject H0 for each variable
Conclusion: There is evidence that both Price and Advertising affect pie sales at α = .05
[Figure: t distribution with rejection regions (area α/2 = .025 each) below -2.1788 and above 2.1788]

Confidence Interval Estimate for the Slope
Confidence interval for the population slope βj:
$b_j \pm t_{\alpha/2} S_{b_j}$ where t has (n - k - 1) d.f.

              Coefficients   Standard Error
Intercept      306.52619      114.25389
Price          -24.97509       10.83213
Advertising     74.13096       25.96732

Here, t has (15 - 2 - 1) = 12 d.f.
Example: Form a 95% confidence interval for the effect of changes in price (X1) on pie sales:
-24.975 ± (2.1788)(10.832)
So the interval is (-48.576, -1.374)
(This interval does not contain zero, so price has a significant effect on sales)

Confidence Interval Estimate for the Slope (continued)
Confidence interval for the population slope βj

              Coefficients   Standard Error   …   Lower 95%   Upper 95%
Intercept      306.52619      114.25389       …    57.58835   555.46404
Price          -24.97509       10.83213       …   -48.57626    -1.37392
Advertising     74.13096       25.96732       …    17.55303   130.70888

Example: Excel output also reports these interval endpoints:
Weekly sales are estimated to be reduced by between 1.37 and 48.58 pies for each increase of $1 in the selling price, holding the effect of advertising constant

Testing Portions of the Multiple Regression Model
Contribution of a Single Independent Variable Xj
SSR(Xj | all variables except Xj) = SSR(all variables) - SSR(all variables except Xj)
Measures the contribution of Xj in explaining the total variation in Y (SST)
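The t statistics and slope confidence intervals shown on the preceding slides follow directly from each coefficient and its standard error. A minimal sketch, assuming the critical value t = 2.1788 (12 d.f., α = .05) quoted in the slides is supplied by hand; `slope_inference` is an illustrative name:

```python
T_CRITICAL = 2.1788  # t(alpha/2 = .025) with 12 d.f., as quoted in the slides


def slope_inference(b, se, t_crit=T_CRITICAL):
    """Return (t statistic, (lower, upper)) for H0: beta_j = 0."""
    t_stat = (b - 0.0) / se
    half_width = t_crit * se
    return t_stat, (b - half_width, b + half_width)


# Coefficients and standard errors from the Excel output.
price_t, price_ci = slope_inference(-24.97509, 10.83213)
adv_t, adv_ci = slope_inference(74.13096, 25.96732)

for name, t_stat, ci in [("Price", price_t, price_ci), ("Advertising", adv_t, adv_ci)]:
    significant = abs(t_stat) > T_CRITICAL
    print(f"{name}: t = {t_stat:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f}), reject H0: {significant}")
```

The endpoints can differ from Excel's in the fourth decimal place because the critical value is rounded to four decimals here.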
Testing Portions of the Multiple Regression Model (continued)
Contribution of a Single Independent Variable Xj, assuming all other variables are already included (consider here a 2-variable model):
SSR(X1 | X2) = SSR(all variables) - SSR(X2)
SSR(all variables) comes from the ANOVA section of the regression for $\hat{Y} = b_0 + b_1 X_1 + b_2 X_2$
SSR(X2) comes from the ANOVA section of the regression for $\hat{Y} = b_0 + b_2 X_2$
Measures the contribution of X1 in explaining SST

The Partial F-Test Statistic
Consider the hypothesis test:
H0: variable Xj does not significantly improve the model after all other variables are included
H1: variable Xj significantly improves the model after all other variables are included
Test using the F-test statistic (with 1 and n - k - 1 d.f.):
$F_{STAT} = \frac{SSR(X_j \mid \text{all variables except } j)}{MSE}$

Testing Portions of Model: Example
Example: Frozen dessert pies
Test at the α = .05 level to determine whether the price variable significantly improves the model given that advertising is included

Testing Portions of Model: Example (continued)
H0: X1 (price) does not improve the model with X2 (advertising) included
H1: X1 does improve model
α = .05, df = 1 and 12
F0.05 = 4.75

ANOVA (for X1 and X2)
              df    SS             MS
Regression     2    29460.02687    14730.01343
Residual      12    27033.30647     2252.775539
Total         14    56493.33333

ANOVA (for X2 only)
              df    SS
Regression     1    17484.22249
Residual      13    39009.11085
Total         14    56493.33333
Testing Portions of Model: Example (continued)

ANOVA (for X1 and X2)
              df    SS             MS
Regression     2    29460.02687    14730.01343
Residual      12    27033.30647     2252.775539
Total         14    56493.33333

ANOVA (for X2 only)
              df    SS
Regression     1    17484.22249
Residual      13    39009.11085
Total         14    56493.33333

$F_{STAT} = \frac{SSR(X_1 \mid X_2)}{MSE(\text{all})} = \frac{29{,}460.03 - 17{,}484.22}{2252.78} = 5.316$
Conclusion: Since FSTAT = 5.316 > F0.05 = 4.75, reject H0; adding X1 does improve the model

Relationship Between Test Statistics
The partial F test statistic developed in this section and the t test statistic are both used to determine the contribution of an independent variable to a multiple regression model.
The hypothesis tests associated with these two statistics always result in the same decision (that is, the p-values are identical).
$t^2_{STAT} = F_{STAT}$

Coefficient of Partial Determination for k variable model
$r^2_{Yj.(\text{all variables except } j)} = \frac{SSR(X_j \mid \text{all variables except } j)}{SST - SSR(\text{all variables}) + SSR(X_j \mid \text{all variables except } j)}$
Measures the proportion of variation in the dependent variable that is explained by Xj while controlling for (holding constant) the other independent variables

Coefficient of Partial Determination in Excel
Coefficients of Partial Determination can be found using Excel:
PHStat | regression | multiple regression …
Check the “coefficient of partial determination” box

Regression Analysis: Coefficients of Partial Determination
Intermediate Calculations
SSR(X1,X2)      29460.02687
SST             56493.33333
SSR(X2)         17484.22249
SSR(X1 | X2)    11975.80438
SSR(X1)         11100.43803
SSR(X2 | X1)    18359.58884
Coefficients
r2 Y1.2         0.307000188
r2 Y2.1         0.404459524
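The partial F statistic, its link to the t statistic, and the coefficients of partial determination can all be recomputed from the sums of squares listed above. The sketch below assumes only those reported values; the function names are illustrative.

```python
# Sums of squares reported for the pie sales example.
SST = 56493.33333
SSR_ALL = 29460.02687   # model with price (X1) and advertising (X2)
SSR_X1 = 11100.43803    # model with price only
SSR_X2 = 17484.22249    # model with advertising only
MSE_ALL = 2252.775539   # MSE of the full model


def partial_f(ssr_all, ssr_without_j, mse_all):
    """Partial F statistic for adding X_j to a model with the other variables."""
    return (ssr_all - ssr_without_j) / mse_all


def partial_r2(sst, ssr_all, ssr_without_j):
    """Coefficient of partial determination for X_j given the other variables."""
    contribution = ssr_all - ssr_without_j
    return contribution / (sst - ssr_all + contribution)


f_price = partial_f(SSR_ALL, SSR_X2, MSE_ALL)   # SSR(X1 | X2) / MSE
t_price = -2.30565                              # t Stat for Price from the Excel output
r2_y1_2 = partial_r2(SST, SSR_ALL, SSR_X2)
r2_y2_1 = partial_r2(SST, SSR_ALL, SSR_X1)

print(round(f_price, 3), round(t_price ** 2, 3))
print(round(r2_y1_2, 4), round(r2_y2_1, 4))
```

The squared t statistic for Price agrees with the partial F statistic, which illustrates the relationship stated on the "Relationship Between Test Statistics" slide.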
Using Dummy Variables
A dummy variable is a categorical independent variable with two levels: yes or no, on or off, male or female
Coded as 0 or 1
Assumes the slopes associated with numerical independent variables do not change with the value for the categorical variable
If more than two levels, the number of dummy variables needed is (number of levels - 1)

Dummy-Variable Example (with 2 Levels)
$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2$
Let:
Y = pie sales
X1 = price
X2 = holiday (X2 = 1 if a holiday occurred during the week; X2 = 0 if there was no holiday that week)

Dummy-Variable Example (with 2 Levels) (continued)
Holiday: $\hat{Y} = b_0 + b_1 X_1 + b_2(1) = (b_0 + b_2) + b_1 X_1$
No holiday: $\hat{Y} = b_0 + b_1 X_1 + b_2(0) = b_0 + b_1 X_1$
Different intercept, same slope
If H0: β2 = 0 is rejected, then “Holiday” has a significant effect on pie sales
[Figure: Y (sales) vs. X1 (Price), showing two lines with the same slope and intercepts b0 + b2 (holiday) and b0 (no holiday)]

Interpreting the Dummy Variable Coefficient (with 2 Levels)
Example: Sales = 300 - 30(Price) + 15(Holiday)
Sales: number of pies sold per week
Price: pie price in $
Holiday: 1 if a holiday occurred during the week, 0 if no holiday occurred
b2 = 15: on average, sales were 15 pies greater in weeks with a holiday than in weeks without a holiday, given the same price

Dummy-Variable Models (more than 2 Levels)
The number of dummy variables is one less than the number of levels
Example: Y = house price; X1 = square feet
If style of the house is also thought to matter: Style = ranch, split level, colonial
Three levels, so two dummy variables are needed
Dummy-Variable Models (more than 2 Levels) (continued)
Example: Let “colonial” be the default category, and let X2 and X3 be used for the other two categories:
Y = house price
X1 = square feet
X2 = 1 if ranch, 0 otherwise
X3 = 1 if split level, 0 otherwise
The multiple regression equation is:
$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3$

Interpreting the Dummy Variable Coefficients (with 3 Levels)
Consider the regression equation:
$\hat{Y} = 20.43 + 0.045 X_1 + 23.53 X_2 + 18.84 X_3$
For a colonial: X2 = X3 = 0, so $\hat{Y} = 20.43 + 0.045 X_1$
For a ranch: X2 = 1; X3 = 0, so $\hat{Y} = 20.43 + 0.045 X_1 + 23.53$
With the same square feet, a ranch will have an estimated average price of 23.53 thousand dollars more than a colonial.
For a split level: X2 = 0; X3 = 1, so $\hat{Y} = 20.43 + 0.045 X_1 + 18.84$
With the same square feet, a split-level will have an estimated average price of 18.84 thousand dollars more than a colonial.

Interaction Between Independent Variables
Hypothesizes interaction between pairs of X variables
Response to one X variable may vary at different levels of another X variable
Contains two-way cross product terms
$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 = b_0 + b_1 X_1 + b_2 X_2 + b_3 (X_1 X_2)$

Effect of Interaction
Given: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon$
Without interaction term, effect of X1 on Y is measured by β1
With interaction term, effect of X1 on Y is measured by β1 + β3X2
Effect changes as X2 changes
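The three-level house style example above can be sketched as code: one categorical value is turned into two 0/1 dummies, with colonial as the baseline, and then plugged into the estimated equation. The 2,000 square feet used below is a hypothetical input chosen only for illustration, and prices are read as thousands of dollars, following the slide's interpretation.

```python
# Coefficients from the estimated house price equation (price in $ thousands).
B0, B_SQFT, B_RANCH, B_SPLIT = 20.43, 0.045, 23.53, 18.84

STYLES = ("colonial", "ranch", "split level")


def encode_style(style):
    """Two 0/1 dummies (X2, X3) for three levels; colonial is the baseline (0, 0)."""
    if style not in STYLES:
        raise ValueError(f"unknown style: {style!r}")
    return (1 if style == "ranch" else 0, 1 if style == "split level" else 0)


def predict_price(square_feet, style):
    """Estimated average house price in thousands of dollars."""
    x2, x3 = encode_style(style)
    return B0 + B_SQFT * square_feet + B_RANCH * x2 + B_SPLIT * x3


square_feet = 2000  # hypothetical house size, not from the slides
for style in STYLES:
    print(style, round(predict_price(square_feet, style), 2))
```

At any fixed square footage, the ranch and split-level predictions exceed the colonial prediction by exactly b2 and b3, which is the interpretation given on the slide.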
Interaction Example
Suppose X2 is a dummy variable and the estimated regression equation is Ŷ = 1 + 2X1 + 3X2 + 4X1X2
X2 = 1: Ŷ = 1 + 2X1 + 3(1) + 4X1(1) = 4 + 6X1
X2 = 0: Ŷ = 1 + 2X1 + 3(0) + 4X1(0) = 1 + 2X1
Slopes are different if the effect of X1 on Y depends on X2 value
[Figure: Y vs. X1 for X1 from 0 to 1.5, showing the two lines Ŷ = 4 + 6X1 and Ŷ = 1 + 2X1]

Significance of Interaction Term
Can perform a partial F test for the contribution of a variable to see if the addition of an interaction term improves the model
Multiple interaction terms can be included
Use a partial F test for the simultaneous contribution of multiple variables to the model

Simultaneous Contribution of Independent Variables
Use partial F test for the simultaneous contribution of multiple variables to the model
Let m variables be an additional set of variables added simultaneously
To test the hypothesis that the set of m variables improves the model:
$F_{STAT} = \frac{[SSR(\text{all}) - SSR(\text{all except new set of } m \text{ variables})] / m}{MSE(\text{all})}$
(where FSTAT has m and n - k - 1 d.f.)

Logistic Regression
Used when the dependent variable Y is binary (i.e., Y takes on only two values)
Examples:
Customer prefers Brand A or Brand B
Employee chooses to work full-time or part-time
Loan is delinquent or is not delinquent
Person voted in last election or did not
Logistic regression allows you to predict the probability of a particular categorical response
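The interaction example earlier in this group of slides (Ŷ = 1 + 2X1 + 3X2 + 4X1X2) can be checked numerically. The sketch below uses those coefficients directly and shows that the slope in X1 equals b1 + b3X2; the function names are illustrative.

```python
def predict(x1, x2):
    """Estimated Y for the interaction example: 1 + 2*X1 + 3*X2 + 4*X1*X2."""
    return 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2


def slope_in_x1(x2):
    """Effect of a one-unit increase in X1 at a given X2: b1 + b3 * X2."""
    return 2 + 4 * x2


for x2 in (0, 1):
    intercept = predict(0, x2)
    rise = predict(1, x2) - predict(0, x2)  # change in Y per unit of X1
    print(f"X2 = {x2}: intercept = {intercept}, slope = {rise}")
```

With X2 = 0 the line is 1 + 2X1, and with X2 = 1 it is 4 + 6X1, matching the two lines on the slide.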
Logistic Regression (continued)
Logistic regression is based on the odds ratio, which represents the probability of a success compared with the probability of failure
$\text{Odds ratio} = \frac{\text{probability of success}}{1 - \text{probability of success}}$
The logistic regression model is based on the natural log of this odds ratio

Logistic Regression (continued)
Logistic Regression Model:
$\ln(\text{odds ratio}) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i$
where k = number of independent variables in the model and εi = random error in observation i
Logistic Regression Equation:
$\ln(\text{estimated odds ratio}) = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki}$

Estimated Odds Ratio and Probability of Success
Once you have the logistic regression equation, compute the estimated odds ratio:
$\text{Estimated odds ratio} = e^{\ln(\text{estimated odds ratio})}$
The estimated probability of success is
$\text{Estimated probability of success} = \frac{\text{estimated odds ratio}}{1 + \text{estimated odds ratio}}$

Chapter Summary
Developed the multiple regression model
Tested the significance of the multiple regression model
Discussed adjusted r2
Discussed using residual plots to check model assumptions
Tested individual regression coefficients
Tested portions of the regression model
Used dummy variables
Evaluated interaction effects
Discussed logistic regression

Example of Logistic Regression
The marketing department for a credit card company wants to organize a campaign to convince existing holders of the company’s standard credit card to upgrade to the company’s premium card for a nominal annual fee.
The marketing department begins with the question “Which of the existing standard credit cardholders should be the target for the campaign?” The department has access to data from a sample of 30 cardholders who were contacted during last year’s campaign. That data indicates whether the cardholder upgraded to a premium card (0 = no, 1 = yes). The department wants to predict the categorical variable (i.e., did the customer upgrade to a premium card?) using two independent variables: total amount of credit card purchases (in thousands of dollars) in the prior year and whether the cardholder ordered additional credit cards (at extra cost) for other members of the household (0 = no, 1 = yes).

Example of Logistic Regression
b0 is -6.940: for a credit cardholder who did not charge any purchases last year and who does not have additional cards, the estimated natural logarithm of the odds ratio of purchasing the premium card is -6.940.
b1 is 0.13947: holding constant the effect of whether the credit cardholder has additional cards for members of the household, for each increase of $1,000 in annual credit card spending using the company’s card, the estimated natural logarithm of the odds ratio of purchasing the premium card increases by 0.13947. Therefore, cardholders who charged more in the previous year are more likely to upgrade to a premium card.
b2 is 2.774: holding constant the annual credit card spending, the estimated natural logarithm of the odds ratio of purchasing the premium card increases by 2.774 for a credit cardholder who has additional cards for members of the household compared with one who does not have additional cards. Therefore, cardholders possessing additional cards for other members of the household are much more likely to upgrade to a premium card.
Example of Logistic Regression
The regression coefficients suggest that the credit card company should develop a marketing campaign that targets cardholders who tend to charge large amounts to their cards, and households that possess more than one card.
As was the case with least-squares regression models, a main purpose of performing logistic regression analysis is to provide predictions of a dependent variable. For example, consider a cardholder who charged $36,000 last year and possesses additional cards for members of the household. What is the probability the cardholder will upgrade to the premium card during the marketing campaign?

Example of Logistic Regression
Therefore, the odds are 2.3512 to 1 that a credit cardholder who spent $36,000 last year and has additional cards will purchase the premium card during the campaign.
The estimated probability is 0.7016 that a credit cardholder who spent $36,000 last year and has additional cards will purchase the premium card during the campaign. In other words, you predict 70.16% of such individuals will purchase the premium card.

Example of Logistic Regression
The deviance statistic is frequently used to determine whether or not the current model provides a good fit to the data. The deviance statistic follows a chi-square distribution with n - k - 1 degrees of freedom.
The null and alternative hypotheses are:
H0: The model is a good-fitting model.
H1: The model is not a good-fitting model.
Because the p-value = 0.828 > α, you do not reject H0, and you conclude that the model is a good fit.
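The odds of 2.3512 and the probability of 0.7016 quoted above can be reproduced from the three coefficients. The sketch below reads the intercept as -6.940 (a decimal-point reading of the value given with the coefficient interpretations, which is an assumption, though it does reproduce the quoted odds) and uses illustrative function names.

```python
import math

# Estimated logistic regression coefficients from the example;
# the intercept is read as -6.940 (assumption, see lead-in).
B0, B1, B2 = -6.940, 0.13947, 2.774


def estimated_log_odds(purchases_thousands, has_additional_cards):
    """ln(estimated odds ratio) = b0 + b1*X1 + b2*X2."""
    return B0 + B1 * purchases_thousands + B2 * has_additional_cards


def odds_to_probability(odds):
    """Estimated probability of success = odds / (1 + odds)."""
    return odds / (1.0 + odds)


# Cardholder who charged $36,000 (X1 = 36) and has additional cards (X2 = 1).
log_odds = estimated_log_odds(36, 1)
odds = math.exp(log_odds)
probability = odds_to_probability(odds)
print(round(log_odds, 5), round(odds, 4), round(probability, 4))
```

The three steps mirror the slides: evaluate the logistic regression equation, exponentiate to get the estimated odds ratio, then convert the odds to a probability.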
Example of Logistic Regression
Now that you have concluded that the model is a good-fitting one, you need to evaluate whether each of the independent variables makes a significant contribution to the model in the presence of the others. The test statistic is based on the ratio of the regression coefficient to the standard error of the regression coefficient. In logistic regression, this ratio is defined by the Wald statistic, which approximately follows the normal distribution.
The Wald statistic (labeled Z) is 2.05 for X1 and 2.33 for X2. Each of these is greater than the critical value of +1.96 for the normal distribution at the 0.05 level of significance (the p-values are 0.04 and 0.02). You can conclude that each of the two independent variables makes a contribution to the model in the presence of the other. Therefore, you should include both these independent variables in the model.
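The p-values quoted for the two Wald statistics can be checked with the standard normal distribution. A minimal sketch using only the Python standard library; the two-sided p-value is computed with the complementary error function, and the list order (2.05, then 2.33) simply follows the order in which the statistics are quoted above.

```python
import math

Z_CRITICAL = 1.96  # two-sided critical value at the 0.05 level, as quoted above


def two_sided_p_value(z):
    """Two-sided p-value for a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2.0))


wald_z = [2.05, 2.33]  # Wald statistics quoted in the example
p_values = [two_sided_p_value(z) for z in wald_z]

for z, p in zip(wald_z, p_values):
    print(f"Z = {z}: p-value = {p:.3f}, significant: {abs(z) > Z_CRITICAL}")
```

Both p-values round to the 0.04 and 0.02 reported in the example, so both variables are retained.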