Summary

This document provides lecture notes on Multiple Regression Analysis, covering multiple linear regression, regression with Excel, building good regression models, multicollinearity, and interactions. The notes are from a data analysis and business modelling course (ITIS 1P97) taught by Leila Tahmooresnejad; the slide content is copyright 2020 Pearson Education.

Full Transcript

Multiple Regression Analysis
ITIS 1P97: Data Analysis and Business Modelling
Instructor: Leila Tahmooresnejad
Copyright © 2020 Pearson Education, Inc.

Summary: Simple Linear Regression

Simple linear regression relates a dependent variable to a single independent variable. In the population, the true values for the slope and intercept are not known. We estimate the parameters from the sample data:

Ŷ = b0 + b1X

where
Ŷ = predicted value of Y
b0 = estimate of β0, based on sample results
b1 = estimate of β1, based on sample results

Summary: Excel Functions

b0 = INTERCEPT(known_y's, known_x's)
b1 = SLOPE(known_y's, known_x's)
Ŷ = TREND(known_y's, known_x's, new_x's)

Summary: Residuals

Error = (Actual value) − (Predicted value):

ei = Yi − Ŷi (8.3)

Summary: Coefficient of Determination

The coefficient of determination, R² (R-squared), is a measure of the "fit" of the line to the data; its value is between 0 and 1:

R² = SSR / SST, where SSR = Σ(Ŷi − Ȳ)² and SST = Σ(Yi − Ȳ)²

The sample correlation coefficient, r, is an expression of the strength of the linear relationship and is always between +1 and −1. [Figure: scatter plots illustrating (a) perfect positive correlation, r = +1, and (b) positive correlation.]

Summary: Regression Output

For the model Ŷ = b0 + b1X, |r| is the sample correlation coefficient, and R² is the coefficient of determination; an R² close to 1 is desirable. Significance F (the p-value for the overall model) evaluates the overall significance of the model in explaining the variance in the dependent variable; a low value (e.g., less than 0.05) indicates a significant relationship between X and Y. For the intercept and each coefficient, a p-value less than 0.05 means we reject the null hypothesis that the coefficient is zero.

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.731255223
R Square            0.534734202
Adjusted R Square   0.523102557
Standard Error      7287.722712
Observations        42

ANOVA
             df   SS             MS             F         Significance F
Regression    1   2,441,633,669  2,441,633,669  45.97236  3.79802E-08
Residual     40   2,124,436,093  53,110,902.3
Total        41   4,566,069,762

              Coefficients  Standard Error  t Stat      P-value  Lower 95%    Upper 95%
Intercept     32673.2199    8831.950745     3.69943412  0.00065  14823.1816   50523.2582
Square Feet   35.03637258   5.16738385      6.78029223  3.8E-08  24.59270025  45.48004491

Topics

– Multiple Linear Regression
– Regression with Excel
– Building Good Regression Models
– Multicollinearity
– Interactions

Multiple Linear Regression

A linear regression model with more than one independent variable is called a multiple linear regression model:

Y = β0 + β1X1 + β2X2 + ... + βkXk + ε (8.10)

where
Y is the dependent variable,
X1, ..., Xk are the independent (explanatory) variables,
β0 is the intercept term,
β1, ..., βk are the regression coefficients for the independent variables,
ε is the error term.
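These estimates are not Excel-specific. As a minimal sketch (not part of the original slides), the same least-squares coefficients, residuals, and R² can be computed with NumPy; the data values below are made up purely for illustration:

```python
import numpy as np

# Hypothetical sample: Y and two explanatory variables X1, X2.
X = np.array([[1500.0, 3], [2100.0, 4], [1200.0, 2],
              [1800.0, 3], [2400.0, 4]])
y = np.array([280_000.0, 350_000.0, 250_000.0, 305_000.0, 370_000.0])

# Prepend a column of ones so the first coefficient is the intercept b0.
X_design = np.column_stack([np.ones(len(y)), X])

# Solve the least-squares problem: minimize ||X_design @ b - y||^2.
b, _, _, _ = np.linalg.lstsq(X_design, y, rcond=None)
b0, b1, b2 = b

y_hat = X_design @ b           # predicted values, Y-hat
e = y - y_hat                  # residuals e_i = Y_i - Y-hat_i  (8.3)

# R^2 = SSR / SST, matching the formulas above.
sst = np.sum((y - y.mean()) ** 2)
ssr = np.sum((y_hat - y.mean()) ** 2)
print(f"b0={b0:.2f}, b1={b1:.4f}, b2={b2:.2f}, R^2={ssr / sst:.4f}")
```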
Examples

Real Estate Pricing
– Dependent variable: house price
– Independent variables: square footage, number of bedrooms, location, age of the house

Student Performance
– Dependent variable: final exam score
– Independent variables: hours spent studying, attendance, prior academic history

Healthcare
– Dependent variable: risk of developing heart disease
– Independent variables: age, body mass index, blood pressure, cholesterol levels

Employee Performance
– Dependent variable: employee performance metric
– Independent variables: hours of training, years of experience, job satisfaction

Estimated Multiple Regression Equation

We estimate the regression coefficients (called partial regression coefficients) b0, b1, b2, ..., bk, then use the model:

Ŷ = b0 + b1X1 + b2X2 + ... + bkXk (8.11)

The partial regression coefficients represent the expected change in the dependent variable when the associated independent variable is increased by one unit while the values of all other independent variables are held constant.

Example

Suppose a multiple regression model predicts a student's final exam score (Y) from three independent variables: hours spent studying (X1), attendance percentage (X2), and prior GPA (X3):

Ŷ = b0 + b1X1 + b2X2 + b3X3

– If the partial regression coefficient for X1 is b1, then for every one-hour increase in studying (holding attendance and prior GPA constant), the student's final exam score is expected to increase by b1 points.
– If the partial regression coefficient for X2 is b2, then for every one-percentage-point increase in attendance (holding studying and prior GPA constant), the final exam score is expected to increase by b2 points.
– If the partial regression coefficient for X3 is b3, then for every one-point increase in prior GPA (holding studying and attendance constant), the final exam score is expected to increase by b3 points.

Multiple Regression

Key differences, in the context of multiple regression:
– Multiple R: the multiple correlation coefficient
– R Square: the coefficient of multiple determination
– ANOVA tests for significance of the entire model; that is, it computes an F-statistic for testing the hypotheses:

H0: β1 = β2 = ... = βk = 0
H1: at least one βj is not 0

Hypotheses About Individual Regression Coefficients

The multiple linear regression output also provides information to test hypotheses about each of the individual regression coefficients.
– If we reject the null hypothesis that the slope associated with independent variable i is 0, then independent variable i is significant and improves the ability of the model to predict the dependent variable.
– If we cannot reject H0, then that independent variable is not significant and probably should not be included in the model.

Regression with Excel

Data > Data Analysis > Regression
– Input Y Range (with header)
– Input X Range (with header)
– Check the Labels box

Excel outputs a table with many useful regression statistics (a code equivalent is sketched below). The independent variables in the spreadsheet must be in contiguous columns, so you may have to move the columns of data around manually before applying the tool.
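For readers working outside Excel, here is a sketch of producing an equivalent summary table (coefficients, t-stats, p-values, R², ANOVA F) with statsmodels; the data are simulated, not the course dataset:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 42
X = rng.normal(size=(n, 2))                 # two made-up predictors
y = 5.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

X_design = sm.add_constant(X)               # adds the intercept column
model = sm.OLS(y, X_design).fit()

# Analogous to Excel's SUMMARY OUTPUT: R^2, F-statistic, coefficients.
print(model.summary())
print("p-values:", model.pvalues)           # tests H0: beta_j = 0 for each j
```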
Example: Interpreting Regression Results for the Colleges and Universities Data

Predict student graduation rates using several indicators. Dependent variable: Graduation %.

The value of R² indicates that 53% of the variation in the dependent variable is explained by these independent variables. All coefficients are statistically significant.

Adjusted R²

R² can only increase as more predictors are added to the model, even irrelevant ones (e.g., predicting home market value from square footage versus from the color of the front door). The adjusted R² is a modified version of R² that adjusts for the number of predictors in a regression model: it increases only if the additional predictors improve the model more than would be expected by chance.

Building Good Regression Models

A good regression model should include only significant independent variables. However, it is not always clear exactly what will happen when we add or remove variables from a model; variables that are (or are not) significant in one model may (or may not) be significant in another. Therefore, you should not drop all insignificant variables at one time, but rather take a more structured approach.

Adding an independent variable to a regression model will always result in an R² equal to or greater than the R² of the original model. Adjusted R² reflects both the number of independent variables and the sample size, and may either increase or decrease when an independent variable is added or dropped; an increase in adjusted R² indicates that the model has improved.

Systematic Model-Building Approach

1. Construct a model with all available independent variables. Check the significance of each independent variable by examining its p-value.
2. Identify the independent variable having the largest p-value that exceeds the chosen level of significance.
3. Remove the variable identified in step 2 from the model and evaluate the adjusted R². Do not remove all variables with large p-values at the same time; remove only one at a time.
4. Continue until all remaining variables are significant.

(A code sketch of this procedure appears after the banking example below.)

Example: Identifying the Best Regression Model

Banking data; dependent variable: Bank Balance.

Age    Education   Income    Home Value   Household Wealth   Bank Balance
35.9   14.8        $91,033   $183,104     $220,741           $38,517
37.7   13.8        $86,748   $163,843     $223,152           $40,618
36.8   13.8        $72,245   $142,732     $176,926           $35,206
35.3   13.2        $70,639   $145,024     $166,260           $33,434
35.3   13.2        $64,879   $135,951     $148,868           $28,162
34.8   13.7        $75,591   $155,334     $188,310           $36,708
39.3   14.4        $80,615   $181,265     $201,743           $38,766

Home Value has the largest p-value, so drop it and re-run the regression. (The slides show the regression output after removing Home Value.)
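A minimal sketch of automating this backward-elimination procedure with statsmodels, assuming a numeric DataFrame `df` shaped like the banking data; the helper function and its name are illustrative, not from the slides:

```python
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(df: pd.DataFrame, target: str, alpha: float = 0.05):
    """Drop the predictor with the largest p-value above alpha, one at a time."""
    predictors = [c for c in df.columns if c != target]
    while predictors:
        X = sm.add_constant(df[predictors])
        fit = sm.OLS(df[target], X).fit()
        pvals = fit.pvalues.drop("const")     # p-values for the slopes only
        worst = pvals.idxmax()
        if pvals[worst] <= alpha:             # all predictors significant: stop
            return fit, predictors
        print(f"dropping {worst} (p={pvals[worst]:.3f}), "
              f"adjusted R^2 was {fit.rsquared_adj:.4f}")
        predictors.remove(worst)              # step 3: remove only one variable
    return None, []
```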
Multicollinearity

Multicollinearity occurs when there are strong correlations among the independent variables, so that they predict each other better than they predict the dependent variable.

Example

Suppose we are building a multiple regression model to predict the price of houses in a neighborhood based on two independent variables: square footage of the house (X1) and the number of bedrooms (X2).

House   Square Footage (X1)   Number of Bedrooms (X2)   Price (Y)
A       1,800                 3                         $300,000
B       2,200                 4                         $350,000
C       1,600                 2                         $270,000
D       2,000                 3                         $310,000
E       2,400                 4                         $360,000

House B has both more square footage and more bedrooms than House A; House E has both more square footage and more bedrooms than House D; House C has both less square footage and fewer bedrooms than House A. Is there likely to be a high correlation between square footage (X1) and the number of bedrooms (X2)? Yes: larger houses often have more bedrooms, and smaller houses tend to have fewer bedrooms.

When significant multicollinearity is present:
– It becomes difficult to isolate the effect of one independent variable on the dependent variable.
– The signs of coefficients may be the opposite of what they should be, making it difficult to interpret regression coefficients.
– Multicollinearity can lead to misleading conclusions about variable importance: variables that are truly significant may appear insignificant.

Detecting Multicollinearity

Multicollinearity can be detected through various methods, including correlation matrices and variance inflation factors (VIFs). High correlation coefficients or high VIF values (typically above 5 or 10) are indicators of multicollinearity; correlations exceeding ±0.7 may indicate it. The variance inflation factor is the better indicator, but it is not computed in Excel (a code sketch for computing it appears at the end of this example).

Addressing Multicollinearity

There are several ways to address multicollinearity:
– Remove one or more highly correlated variables from the model, if possible, to retain only the most relevant predictors.
– Collect more data to reduce the effects of multicollinearity.

Example: Identifying Potential Multicollinearity

Colleges and Universities correlation matrix: none of the correlations exceed the recommended threshold of ±0.7.

Banking data (table above): the correlation matrix shows that large correlations exist among the independent variables.
– Starting from the model with all the independent variables, if we remove Wealth, the adjusted R² drops to 0.9201 and we discover that Education is no longer significant.
– Dropping Education and leaving only Age and Income in the model results in an adjusted R² of 0.9202.
– However, if we remove Income from the model instead of Wealth, the adjusted R² drops only to 0.9345, and all remaining variables (Age, Education, and Wealth) are significant.
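Since Excel does not compute VIFs, here is a sketch of computing a correlation matrix and VIFs in Python with statsmodels; the predictor values are fabricated so that two of them are nearly collinear:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Made-up predictors: x2 is deliberately almost collinear with x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)
x3 = rng.normal(size=100)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

print(X.corr().round(2))                      # |r| > 0.7 is a warning flag

X_const = sm.add_constant(X)
for i, name in enumerate(X.columns, start=1): # skip the constant at index 0
    vif = variance_inflation_factor(X_const.values, i)
    print(f"VIF({name}) = {vif:.1f}")         # above 5-10 suggests multicollinearity
```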
Practical Issues in Trendline and Regression Modeling

Identifying the best regression model often requires experimentation and trial and error. The independent variables selected should make sense in attempting to explain the dependent variable: logic should guide your model development. In many applications, behavioral, economic, or physical theory might suggest that certain variables should belong in a model.

Example

Imagine you work for a retail company that wants to understand what factors influence customer purchase behavior in its online store. Your goal is to build a regression model to predict the total amount spent by customers in a single visit to the website. You have access to various independent variables, including:
– Advertising spend: the amount of money spent on online advertising for the products.
– Website load time: the time it takes for the website to load fully.
– Customer reviews: the average rating and number of customer reviews for the products.
– Customer demographics: the customer's age, gender, and location.
– Weather data: weather conditions on the day of the visit (temperature).

Avoiding illogical variables: weather data, while interesting, may not have a strong theoretical basis for directly predicting online purchase behavior. Weather conditions might impact certain types of businesses (e.g., outdoor retail).

Additional variables increase R² and therefore help to explain a larger proportion of the variation. Even though a variable with a large p-value is not statistically significant, its insignificance could simply be the result of sampling error, and a modeler might wish to keep it. Still, good models are as simple as possible.

Overfitting

Overfitting means fitting a model too closely to the sample data, at the risk of not fitting it well to the population in which we are interested. In fitting the crude oil prices, we noted that the R² value will increase if we fit higher-order polynomial functions to the data, for example:

y = 0.005x³ − 0.111x² + 0.648x + 59.947

While this might provide a better mathematical fit to the sample data, doing so can make it difficult to explain the phenomena rationally. In multiple regression, if we add too many terms to the model, the model may not adequately predict other values from the population. Overfitting can be mitigated by using good logic, intuition, theory, and parsimony.

Example

Your task is to build a multiple regression model to predict house prices based on various features: square footage, number of bedrooms, number of bathrooms, neighborhood crime rate, and proximity to schools. You decide to create a complex regression model with many variables, including interaction terms and polynomial features (e.g., squared terms). When you evaluate this complex model on your sample data, it fits the data almost perfectly. But when you apply the complex model to the testing data, you notice it does not predict prices as accurately as it did on the sample: it has learned relationships that do not hold in the broader population of houses (poor generalization).

Stepwise Regression

Stepwise regression systematically adds or deletes independent variables. A forward stepwise procedure puts the most significant variable in first, then adds the next variable that will improve the model the most. Backward stepwise regression begins with all the independent variables and deletes the least helpful, one at a time.

Regression with Categorical Variables

Regression analysis requires numerical data. Categorical data can be included as independent variables, but must be coded numerically using dummy variables. For variables with two categories, code them as 0 and 1, as in the sketch below.
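A minimal sketch of this 0/1 dummy coding in pandas, using a made-up two-category MBA column:

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [28, 35, 42, 30],
    "MBA": ["yes", "no", "yes", "no"],   # two-category variable
})

# Code the 2-level categorical as 0/1 (yes = 1, no = 0).
df["MBA"] = (df["MBA"] == "yes").astype(int)
print(df)
```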
Example: A Model with Categorical Variables

Employee Salaries provides data for 35 employees. Predict Salary using Age and MBA (coded as yes = 1, no = 0):

Y = β0 + β1X1 + β2X2 + ε

where
Y = salary
X1 = age
X2 = MBA indicator (0 or 1)

The fitted model is:

Salary = 893.59 + 1,044.15 × Age + 14,767.23 × MBA

– If MBA = 0: Salary = 893.59 + 1,044.15 × Age
– If MBA = 1: Salary = 15,660.82 + 1,044.15 × Age

Estimate the salary for a 30-year-old with an MBA (Age = 30, MBA = 1):

Salary = 893.59 + 1,044.15 × 30 + 14,767.23 × 1 = $46,985.32

Interactions

An interaction occurs when the effect of one variable depends on another variable. We can test for interactions by defining a new variable as the product of the two variables, X3 = X1 × X2, and testing whether this variable is significant, leading to the alternative model:

Y = β0 + β1X1 + β2X2 + β3X3 + ε, where X3 = X1 × X2

Example

Job performance (dependent variable), with job satisfaction and employee experience as independent variables:

Job Performance = b0 + b1(Job Satisfaction) + b2(Employee Experience) + b3(Job Satisfaction × Employee Experience)

If b3 is statistically significant, this suggests that employee experience has a moderating effect on the relationship between job satisfaction and job performance. In practical terms, if b3 is positive and significant, it could mean that as employee experience increases, the positive relationship between job satisfaction and job performance becomes stronger. On the other hand, if b3 is negative, it could indicate that higher levels of employee experience weaken that positive relationship.

Example: Incorporating Interaction Terms in a Regression Model

Define an interaction between Age and MBA (MBA × Age) and re-run the regression. In the resulting output, the MBA indicator is not significant; we would typically drop it and re-run the regression analysis, which results in the model:

Salary = 3,323.11 + 984.25 × Age + 425.58 × MBA × Age

However, statisticians recommend that if interactions are significant, the first-order terms should be kept in the model regardless of their p-values. Thus, using the first regression model, we have:

Salary = 3,902.51 + 971.31 × Age − 2,971.08 × MBA + 501.85 × MBA × Age
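A sketch of fitting such an interaction model programmatically; the employee records below are simulated from the fitted equation purely for illustration, so the recovered coefficients only approximate the slide's values:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Made-up employee records; the real coefficients come from the 35-employee dataset.
rng = np.random.default_rng(1)
df = pd.DataFrame({"Age": rng.integers(25, 60, size=35),
                   "MBA": rng.integers(0, 2, size=35)})
df["Salary"] = (3902.51 + 971.31 * df["Age"] - 2971.08 * df["MBA"]
                + 501.85 * df["MBA"] * df["Age"]
                + rng.normal(scale=2000, size=35))

df["MBAxAge"] = df["MBA"] * df["Age"]        # interaction term X3 = X1 * X2
X = sm.add_constant(df[["Age", "MBA", "MBAxAge"]])
fit = sm.OLS(df["Salary"], X).fit()
print(fit.params.round(2))                   # b0, b_Age, b_MBA, b_MBAxAge
```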
Interpreting the coefficients of this model:
– Salary and Age (positive): a positive coefficient for Age suggests that as age increases, salary also increases, holding other variables constant.
– Salary and MBA (negative, not significant): a negative coefficient for holding an MBA implies that having an MBA is associated with a decrease in salary compared to not having one. However, because this variable is not significant, the effect is not statistically reliable; it could be due to random chance.
– Age × MBA (positive): the effect of age on salary is stronger for those who hold an MBA than for those who do not. Although having an MBA alone doesn't increase salary, its true value emerges with experience (age): older individuals with an MBA experience a more rapid increase in salary than their non-MBA counterparts.

Categorical Variables with More Than Two Levels

When a categorical variable has k > 2 levels, we need to add k − 1 additional variables to the model. Examples:
– Salary and education level: High School, Bachelor's Degree, Master's Degree, PhD
– Employee performance and skill level: Entry-Level, Intermediate, Advanced, Expert
– Sales and customer satisfaction: Very Dissatisfied, Dissatisfied, Neutral, Satisfied, Very Satisfied

Example: A Regression Model with Multiple Levels of Categorical Variables

The data are measurements of the surface finish of 35 parts produced on a lathe, along with the revolutions per minute (RPM) of the spindle and which of four types of cutting tools was used.
– Surface finish: dependent variable
– RPM of the spindle: independent variable
– Tool type (A, B, C, D): categorical independent variable

Because we have k = 4 levels of tool type, we define a regression model of the form:

Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + ε

where
Y = surface finish
X1 = RPM
X2 = 1 if tool type is B and 0 if not
X3 = 1 if tool type is C and 0 if not
X4 = 1 if tool type is D and 0 if not

Add three columns to the data, one for each of the tool type variables; when X2 = X3 = X4 = 0, the tool type is A (the baseline). The regression results give:

Surface finish = 24.49 + 0.098 × RPM − 13.31 × Type B − 20.49 × Type C − 26.04 × Type D
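A sketch of the k − 1 dummy coding and fit for this example; the six rows below are hypothetical stand-ins for the 35-part dataset, with tool type A as the baseline dropped by drop_first:

```python
import pandas as pd
import statsmodels.api as sm

# A few hypothetical lathe measurements; the real dataset has 35 parts.
df = pd.DataFrame({
    "RPM":    [250, 300, 350, 400, 450, 500],
    "Tool":   ["A", "B", "C", "D", "A", "B"],
    "Finish": [45.4, 32.0, 28.1, 30.3, 68.0, 52.7],
})

# k = 4 levels -> k - 1 = 3 dummy columns; dropping "A" makes it the baseline.
dummies = pd.get_dummies(df["Tool"], prefix="Tool", drop_first=True).astype(float)
X = sm.add_constant(pd.concat([df[["RPM"]], dummies], axis=1))
fit = sm.OLS(df["Finish"], X).fit()
print(fit.params)   # intercept, RPM, Tool_B, Tool_C, Tool_D coefficients
```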
Interpreting the coefficients:
– Tool Type B coefficient (−13.31): using cutting tool type B is predicted to decrease the surface finish by 13.31 units compared to the baseline tool type A, assuming RPM remains constant.
– Tool Type C coefficient (−20.49): using cutting tool type C is predicted to decrease the surface finish by 20.49 units compared to the baseline tool type A, assuming RPM remains constant.
– Tool Type D coefficient (−26.04): using cutting tool type D is predicted to decrease the surface finish by 26.04 units compared to the baseline tool type A, assuming RPM remains constant.
– Intercept (24.49): when RPM is zero and the baseline cutting tool type A is used, the predicted surface finish is 24.49 units. However, be cautious with this interpretation, as an RPM of zero may not be practical in a real-world setting.
– RPM coefficient (0.098): for each additional revolution per minute, the surface finish is predicted to increase by 0.098 units, assuming the tool type remains constant. This indicates a positive relationship between RPM and surface finish quality.

Regression Models with Nonlinear Terms

Curvilinear models may be appropriate when scatter charts or residual plots show nonlinear relationships. A second-order polynomial might be used:

Y = β0 + β1X + β2X² + ε

Here β1 represents the linear effect of X on Y and β2 represents the curvilinear effect.

Example: Modeling Beverage Sales Using Curvilinear Regression

The U-shape of the residual plot (a second-order polynomial trendline was fit to the residual data) suggests that a linear relationship is not appropriate, so add a variable for temperature squared. The resulting model (reproduced in the code sketch at the end of these notes) is:

Sales = 142,850 − 3,643.17 × Temperature + 23.3 × Temperature²

Cautions and Pitfalls

– If the assumptions are not met, the statistical tests may not be valid.
– Correlation does not necessarily mean causation.
– Multicollinearity makes interpreting coefficients problematic, but the model may still be good.
– A t-test for the intercept (b0) may be ignored, as this point is often outside the range of the model.
– A linear relationship may not be the best relationship, even if the F-test returns an acceptable value.
– Even though a relationship is statistically significant, it may not have any practical value.
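As referenced above, a closing sketch of the curvilinear (second-order polynomial) fit: regress sales on temperature and temperature squared. The temperature and sales values here are simulated around the slide's fitted equation, not the actual beverage data:

```python
import numpy as np
import statsmodels.api as sm

# Made-up (temperature, sales) pairs; the real model was fit to beverage data.
rng = np.random.default_rng(7)
temp = rng.uniform(60, 100, size=50)
sales = 142_850 - 3_643.17 * temp + 23.3 * temp**2 + rng.normal(scale=500, size=50)

# Second-order polynomial: Y = b0 + b1*X + b2*X^2 + e
X = sm.add_constant(np.column_stack([temp, temp**2]))
fit = sm.OLS(sales, X).fit()
print(fit.params)   # estimates of b0, b1 (linear effect), b2 (curvilinear effect)
```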
