Business Analytics & Machine Learning: Regression Diagnostics
Summary
Lecture notes for the Business Analytics & Machine Learning course (Prof. Dr. Martin Bichler, Technical University of Munich) on regression diagnostics: the Gauss-Markov assumptions, outlier detection, and tests for multicollinearity, heteroscedasticity, autocorrelation, and endogeneity, with an outlook on panel data and fixed effects models, plus recommended literature.
Full Transcript
Business Analytics & Machine Learning
Regression Diagnostics
Prof. Dr. Martin Bichler
Department of Computer Science, School of Computation, Information, and Technology
Technical University of Munich

Course Content
− Introduction
− Regression Analysis
− Regression Diagnostics
− Logistic and Poisson Regression
− Naive Bayes and Bayesian Networks
− Decision Tree Classifiers
− Data Preparation and Causal Inference
− Model Selection and Learning Theory
− Ensemble Methods and Clustering
− Dimensionality Reduction
− Association Rules and Recommenders
− Convex Optimization
− Neural Networks
− Reinforcement Learning
[The slide shows these topics as a supervised-learning course map that groups the individual methods, e.g. linear, logistic, and Poisson regression, PCR, lasso, naive Bayes, decision trees, ensembles, and neural networks, under regression and classification.]

Recommended Literature
− Introduction to Econometrics, James H. Stock and Mark W. Watson, Chapters 6, 7, 10, 17, 18
− The Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani, Jerome Friedman, Section 3: Linear Methods for Regression, http://web.stanford.edu/~hastie/Papers/ESLII.pdf
− An Introduction to Statistical Learning: With Applications in R, Gareth James, Trevor Hastie, Robert Tibshirani, Section 3: Linear Regression

Multiple Linear Regression
$y = X\beta + \varepsilon$, written out:
$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$
with dimensions $y$: $n \times 1$, $X$: $n \times (p+1)$, $\beta$: $(p+1) \times 1$, $\varepsilon$: $n \times 1$.

Reminder: Least Squares Estimation
$X$ is $n \times (p+1)$, $y$ is the vector of outputs.
$\mathrm{RSS}(\beta) = (y - X\beta)^T (y - X\beta)$
If $X$ has full rank, then $X^T X$ is positive definite, i.e. it is invertible.
$\mathrm{RSS}(\beta) = y^T y - 2\beta^T X^T y + \beta^T X^T X \beta$
First-order condition: $\frac{\partial \mathrm{RSS}}{\partial \beta} = -2 X^T y + 2 X^T X \beta = 0$
$\hat{\beta} = (X^T X)^{-1} X^T y$
$\hat{y} = X (X^T X)^{-1} X^T y$, where $H = X (X^T X)^{-1} X^T$ is the "hat" or projection matrix.
Source: Hastie et al. 2016, p. 46
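Purely as an illustration of the closed-form formulas on this slide, a minimal numpy sketch (the synthetic data and all variable names are assumptions, not from the lecture):

import numpy as np

# Synthetic example data: n observations, p predictors, plus an intercept column
rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # shape n x (p+1)
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Closed-form OLS estimate: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat (projection) matrix H = X (X^T X)^{-1} X^T and fitted values y_hat = H y
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

print(beta_hat)   # should be close to beta_true

Solving the normal equations with np.linalg.solve avoids explicitly inverting $X^T X$, which is the numerically preferable way to evaluate the same formula.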
Agenda for Today
The linear regression model for a random sample (no bias in the sampling) is computationally simple and "best" if certain assumptions are satisfied. Today, we discuss the main assumptions and introduce selected tests that help you check whether these assumptions hold in your data set. Several alternative tests have been developed for each assumption; we introduce one of them per assumption, which will enable you to use multiple linear regression in applications.

Gauss-Markov Theorem
The Gauss-Markov theorem states that in a linear regression model in which the errors have expectation zero, are uncorrelated, and have equal variances, the best linear unbiased estimator (BLUE) of the coefficients is the ordinary least squares (OLS) estimator.
− "Unbiased" means $E[\hat{\beta}_j] = \beta_j$.
− "Best" means giving the lowest variance of the estimate compared to other linear unbiased estimators.
− The restriction to unbiased estimation is not always the best choice (this will be discussed in the context of ridge regression later).

Bias, Consistency, and Efficiency
− Unbiased: $E[\hat{\beta}] = \beta$; the expected value of the estimator "is true".
− Consistent: $\mathrm{var}(\hat{\beta})$ decreases with increasing sample size $n$.
− Efficient: $\mathrm{var}(\hat{\beta}) < \mathrm{var}(\tilde{\beta})$; the estimator $\hat{\beta}$ has lower variance than any other linear unbiased estimator $\tilde{\beta}$.
[The slide illustrates these properties with sampling distributions of $\hat{\beta}$ around $\beta$, e.g. for $n = 100$ versus $n = 1000$.]

Gauss-Markov Assumptions in Detail
The OLS estimator is the best linear unbiased estimator (BLUE), iff
1) Linearity: linear relationship in the parameters $\beta$
2) No multicollinearity of predictors: no linear dependency between predictors
3) Homoscedasticity: the residuals exhibit constant variance
4) No autocorrelation: there is no correlation between the $i$th and $j$th residual terms
5) The expected value of the residual vector, given $X$, is 0 ($E[\varepsilon \mid X] = 0$), i.e., exogeneity ($\mathrm{cov}(\varepsilon, X) = 0$)

When Linearity Does Not Hold: Try to Reformulate
For non-linear relationships, the OLS estimator from the last class might not be appropriate. Often, you can adapt your data such that you can still use multiple linear regression. The following reformulations lead to models that are again linear in $\beta$:
− Polynomial regression (still a linear model): $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \cdots + \varepsilon$
− Transform either only $X$, only $Y$, or both variables, e.g. $Y = \beta_0 X^{\beta_1}$ into $\log Y = \log \beta_0 + \beta_1 \log X + \varepsilon$
− Piecewise linear regression (aka segmentation): $Y = \beta_0 + \beta_1 X \cdot [X > X_K] + \varepsilon$, where $[X > X_K] = 0$ if $X \le X_K$ and $[X > X_K] = 1$ if $X > X_K$
[The slide shows example plots of a piecewise linear fit, a logarithmic fit, and a polynomial fit.]

Outliers
An outlier is an observation that is unusually small or large. Several possibilities need to be investigated when an outlier is observed:
− There was an error in recording the value.
− The point does not belong in the sample.
− The observation is valid.
Identify outliers from the scatter diagram. There are also methods for "robust" regression.

Cook's Distance to Detect Outliers
Cook's distance measures the effect of deleting a given observation:
$D_k = \frac{\sum_i \left( \hat{y}_i - \hat{y}_{i(k)} \right)^2}{p \, s^2}$
where $p$ is the number of independent variables, $\hat{y}_{i(k)}$ is the fitted response value for observation $i$ when excluding observation $k$, and $s^2 = \frac{e^T e}{n-p}$ is the mean squared error of the regression model.

# Calculate Cook's distance
res = model.fit()
influence = res.get_influence()
cooks_d = influence.cooks_distance

The relevant size depends on the context. Rule of thumb: check values > 4/n.
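To make the rule of thumb concrete, here is a self-contained sketch with statsmodels (the data and the deliberately added outlier are illustrative assumptions; cooks_distance returns the distances together with their p-values):

import numpy as np
import statsmodels.api as sm

# Illustrative data (not from the lecture) with one clearly outlying response appended
rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=30)
x = np.append(x, 2.5)
y = np.append(y, 20.0)   # outlier

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()
influence = res.get_influence()
cooks_d, p_values = influence.cooks_distance   # distances and corresponding p-values

# Rule of thumb from the slide: inspect observations with D_k > 4/n
n = len(y)
print("Potential outliers:", np.where(cooks_d > 4 / n)[0])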
Gauss-Markov Assumptions in Detail (recap; see the list of five assumptions above)

Multicollinearity
Independent variables must not be linearly dependent. If two independent variables were linearly dependent, one could simply omit one of them. To check for linear dependencies among the columns of a matrix, we can use its rank.
− The observation matrix $X$ has shape $n \times (p+1)$, with $p + 1 < n$.
− We know: $\mathrm{rank}(X) \le p + 1$.
− If $\mathrm{rank}(X) < p + 1$, there is an (exact) linear relationship among the variables.
− If $\mathrm{rank}(X) < p + 1$, then $X^T X$ is singular and it is impossible to calculate the closed-form solution. (Remember: $\hat{y} = X\hat{\beta} = X (X^T X)^{-1} X^T y$.)
− Also, high correlation between independent variables leads to issues w.r.t. the significance of predictors.

Check for Multicollinearity
A basic check for multicollinearity is to calculate the correlation coefficient for each pair of predictor variables. Large correlations (both positive and negative) indicate problems.
− "Large" means greater than the correlations between the predictors and the response.
It is possible that the pairwise correlations are small, and yet a linear dependence exists among three or even more variables. Alternatively, use the variance inflation factor (VIF).

Variance Inflation Factor
$\mathrm{VIF}_k = \frac{1}{1 - R_k^2}$, where $R_k^2$ is the $R^2$ obtained when the predictor in question ($k$) is set as the dependent variable and regressed on the other predictors.
For example, if $\mathrm{VIF}_k = 10$, then the respective $R_k^2$ would be 90%. This would mean that 90% of the variance in the predictor in question can be explained by the other independent variables. Because so much of the variance is captured elsewhere, removing the predictor in question should not cause a substantive decrease in the overall $R^2$.
The rule of thumb is to remove variables with VIF scores greater than 10.

Consequence: Non-Significance
If a variable has a non-significant $t$-value, then either
− the variable is not related to the response (small $t$-value, small VIF, small correlation with the response), or
− the variable is related to the response, but it is not required in the regression because it is strongly related to a third variable that is in the regression, so we don't need both (small $t$-value, high VIF, high correlation with the response).
The usual remedy is to drop one or more variables from the model.

Example
Y      X1  X2  X3  X4
78.5    7  26   6  60
74.3    1  29  15  52
104.3  11  56   8  20
87.6   11  31   8  47
95.9    7  52   6  33
109.2  11  55   9  22
102.7   3  71  17   6
72.5    1  31  22  44
93.1    2  54  18  22
115.9  21  47   4  26
83.8    1  40  23  34
113.3  11  66   9  12
109.4  10  68   8  12
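A sketch of how this example could be fit with statsmodels, mirroring the regression output and the VIF loop shown next (the DataFrame construction and the names data and features are assumptions for illustration; the values are the thirteen observations from the table above):

import pandas as pd
import statsmodels.api as sm

data = pd.DataFrame({
    "Y":  [78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5, 93.1, 115.9, 83.8, 113.3, 109.4],
    "X1": [7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10],
    "X2": [26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68],
    "X3": [6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8],
    "X4": [60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12],
})

# 'features' includes a constant column, matching the VIF loop used below
features = sm.add_constant(data[["X1", "X2", "X3", "X4"]])
model = sm.OLS(data["Y"], features).fit()
print(model.summary())
print(data.corr())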
Example: High R²

OLS Regression Results
Dep. Variable: Y            R-squared: 0.982
Model: OLS                  Adj. R-squared: 0.974
Method: Least Squares       F-statistic: 111.5
Date: Mon, 06 Nov 2023      Prob (F-statistic): 4.76e-07
Time: 11:08:39              Log-Likelihood: -26.918
No. Observations: 13        AIC: 63.84
Df Residuals: 8             BIC: 66.66

           coef    std err      t      P>|t|    [0.025    0.975]
const    62.4054    70.071    0.891    0.399   -99.179   223.989
X1        1.5511     0.745    2.083    0.071    -0.166     3.269
X2        0.5102     0.724    0.705    0.501    -1.159     2.179
X3        0.1019     0.755    0.135    0.896    -1.638     1.842
X4       -0.1441     0.709   -0.203    0.844    -1.779     1.491

The R² is high, but X2, X3, and X4 have large p-values.

data.corr()
       Y     X1     X2     X3     X4
Y   1.00   0.73   0.82  -0.53  -0.82
X1  0.73   1.00   0.23  -0.82  -0.25
X2  0.82   0.23   1.00  -0.14  -0.97
X3 -0.53  -0.82  -0.14   1.00   0.03
X4 -0.82  -0.25  -0.97   0.03   1.00

Several pairs of predictors are highly correlated (e.g., X2 and X4, X1 and X3).

VIFs for the Data

from statsmodels.stats.outliers_influence import variance_inflation_factor as vif

for index, variable_name in enumerate(features.columns):
    if variable_name != "const":
        print(f"VIF for variable {variable_name} is {vif(features, index)}")

X1: 38.49621    X2: 254.42317    X3: 46.86839    X4: 282.51286

The VIFs for X2 and X4 are very large.

Drop X4

for index, variable_name in enumerate(features.columns):
    if variable_name != "const":
        print(f"VIF for variable {variable_name} is {vif(features, index)}")

X1: 3.251068    X2: 1.063575    X3: 3.142125

The VIFs are now small.

OLS Regression Results
Dep. Variable: Y            R-squared: 0.982
Model: OLS                  Adj. R-squared: 0.976
Method: Least Squares       F-statistic: 166.3
Date: Mon, 06 Nov 2023      Prob (F-statistic): 3.37e-08
Time: 11:08:39              Log-Likelihood: -26.952
No. Observations: 13        AIC: 61.90
Df Residuals: 9             BIC: 64.16

           coef    std err      t      P>|t|    [0.025    0.975]
const    48.1936     3.913   12.315    0.000    39.341    57.046
X1        1.6959     0.205    8.290    0.000     1.233     2.159
X2        0.6569     0.044   14.851    0.000     0.557     0.757
X3        0.2500     0.185    1.354    0.209    -0.168     0.668

R² has hardly decreased, and X1 and X2 are now significant.

Can you explain the intuition behind the VIF?
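One way to build that intuition is to compute the VIF by hand: regress each predictor on all the other predictors and plug the resulting R² into 1/(1 − R²). A minimal sketch, assuming the data DataFrame from the sketch above (the helper name vif_by_hand is an illustrative assumption):

import statsmodels.api as sm

def vif_by_hand(df, predictor):
    # VIF of one predictor: 1 / (1 - R^2) from regressing it on the remaining predictors
    others = df.drop(columns=[predictor])
    aux = sm.OLS(df[predictor], sm.add_constant(others)).fit()
    return 1.0 / (1.0 - aux.rsquared)

for col in ["X1", "X2", "X3", "X4"]:
    print(col, vif_by_hand(data[["X1", "X2", "X3", "X4"]], col))

The values should match those returned by variance_inflation_factor above: a large VIF simply means the auxiliary regression has a large R², i.e. the predictor is almost a linear combination of the others.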
Gauss-Markov Assumptions in Detail (recap; see the list of five assumptions above)

Homoscedasticity
When the requirement of a constant variance is not violated, we have homoscedasticity. If the data is not homoscedastic, a different estimator (e.g., weighted least squares) might be better than OLS. We also assume the residuals to be normally distributed; this can be seen as a sixth Gauss-Markov assumption (independent of homoscedasticity).
[The slide shows a scatter plot and a residual plot in which the spread of the data points does not change much across x.]

Heteroscedasticity
When the requirement of a constant variance is violated, we have heteroscedasticity ($\mathrm{var}(\varepsilon_i \mid x_{1i}, \ldots, x_{pi})$ is not constant). Heteroscedasticity leads to biased estimates of the standard errors and hence distorted $p$-values of significance tests.
Example: annual income predicted by age. Entry-level jobs are often paid very similarly, but as people grow older, their income distribution spreads out.
[The slide shows a scatter plot and a residual plot in which the spread increases with x.]

Glejser Test
Apart from visual inspection, the Glejser test is a simple statistical test. It regresses the residuals on the explanatory variable that is thought to be related to the heteroscedastic variance.
1. Estimate the original regression with OLS and find the sample residuals $e_i$.
2. Regress the absolute value $|e_i|$ on the explanatory variable $X$ that is associated with heteroscedasticity (index $i$ describes the $i$th observation), e.g. using
$|e_i| = \gamma_0 + \gamma_1 X_i + v_i$
$|e_i| = \gamma_0 + \gamma_1 \sqrt{X_i} + v_i$
$|e_i| = \gamma_0 + \gamma_1 \frac{1}{X_i} + v_i$
3. Select the equation with the highest $R^2$ and lowest standard errors to represent heteroscedasticity.
4. Perform a $t$-test on $\gamma_1$. If $\gamma_1$ is statistically significant, the null hypothesis of homoscedasticity can be rejected.
We test whether one of the independent variables is significantly related to the variance of our residuals.

White Test
The White test assumes more complex relationships and also models interaction terms. You do not have to choose a particular $X$ and do not need normally distributed residuals. The test is also based on an auxiliary regression of $e^2$ on all the explanatory variables ($X_j$), their squares ($X_j^2$), and all their cross products:
$e^2 = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^2 + \beta_4 X_2^2 + \beta_5 X_1 X_2 + v$
With more than two independent variables, you need to include the product of each independent variable with each other independent variable. A large $R^2$ counts against homoscedasticity.
The test statistic is $n R^2$, where $n$ is the sample size. The statistic is asymptotically chi-square ($\chi^2$) distributed with $df = p$, where $p$ is the number of explanatory variables in the auxiliary model.*
H0: all the variances $\sigma_i^2$ are equal (i.e., homoscedastic). Reject H0 if $\chi^2 > \chi^2_{crit}$ or if the p-value is low.
*The sum of squared standard-normal random variables is asymptotically $\chi^2$ distributed.
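statsmodels ships an implementation of this test. A hedged sketch on deliberately heteroscedastic synthetic data (the data-generating process is an illustrative assumption; het_white builds the auxiliary regression with squares and cross products internally):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white

# Illustrative data: residual spread grows with x, as in the income-by-age example
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=200)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3 * x)

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()

# White test on the fitted model's residuals and design matrix
lm_stat, lm_pvalue, f_stat, f_pvalue = het_white(res.resid, res.model.exog)
print(f"LM statistic = {lm_stat:.2f}, p-value = {lm_pvalue:.4f}")   # small p-value: reject homoscedasticity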
Gauss-Markov Assumptions in Detail (recap; see the list of five assumptions above)

Applications of Linear Regressions to Time Series Data
Average hours worked per week by manufacturing workers:

Period  Hours    Period  Hours    Period  Hours    Period  Hours
 1      37.2     11      36.9     21      35.6     31      35.7
 2      37.0     12      36.7     22      35.2     32      35.5
 3      37.4     13      36.7     23      34.8     33      35.6
 4      37.5     14      36.5     24      35.3     34      36.3
 5      37.7     15      36.3     25      35.6     35      36.5
 6      37.7     16      35.9     26      35.6
 7      37.4     17      35.8     27      35.6
 8      37.2     18      35.9     28      35.9
 9      37.3     19      36.0     29      36.0
10      37.2     20      35.7     30      35.7

Forecasting a Linear Trend Using Multiple Regression
Model: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, where $Y_i$ is the data value for period $i$.
Estimated trend: $\hat{Y} = 37.416 - 0.0614 X_i$

model = ols('target ~ period', data=data).fit()
print(model.summary())

OLS Regression Results
Dep. Variable: y            R-squared: 0.612
Model: OLS                  Adj. R-squared: 0.600
Method: Least Squares       F-statistic: 51.95
Time: 11:53:29              Prob (F-statistic): 2.90e-08
No. Observations: 35        Log-Likelihood: -24.835
Df Residuals: 33            AIC: 53.67
Df Model: 1                 BIC: 56.78
Covariance Type: nonrobust

           coef    std err      t       P>|t|    [0.025    0.975]
const    37.416     0.175    213.744    0.000    37.053    37.765
x1       -0.0614    0.008     -7.208    0.000    -0.078    -0.044

Omnibus: 0.812              Durbin-Watson: 0.278
Prob(Omnibus): 0.666        Jarque-Bera (JB): 0.155
Skew: 0.047                 Prob(JB): 0.925
Kurtosis: 3.312             Cond. No. 42.3

The P>|t| column gives the p-value for the hypothesis $\beta = 0$.

Hours Worked Data: A Linear Trend Line
[The slide shows the hours-worked series with the fitted linear trend line.]

Autocorrelation
Examining the residuals over time, no pattern should be observed if the errors are independent. Autocorrelation can be detected by graphing the residuals against time, or by the Durbin-Watson statistic.
[The slide shows two residual-versus-time plots: one without and one with a systematic pattern.]

Reasons leading to autocorrelation:
− an important variable was omitted,
− functional misfit,
− measurement error in an independent variable.
Use the Durbin-Watson (DW) statistic to test for first-order autocorrelation. DW takes values within [0, 4]. For no serial correlation, a value close to 2 (e.g., 1.5-2.5) is expected.
$DW = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$
− $DW = 2$: no autocorrelation
− $DW = 0$: perfect positive autocorrelation
− $DW = 4$: perfect negative autocorrelation

Test for Autocorrelation in Python

from statsmodels.formula.api import ols
from statsmodels.stats.stattools import durbin_watson

model = ols('target ~ period', data=data).fit()
durbin_watson(model.resid)

0.2775895

Breusch-Godfrey Test
The Breusch-Godfrey test also considers higher-order autoregressive schemes:
$Y_t = \beta_1 + \beta_2 X_t + \epsilon_t$
$\epsilon_t = \rho_1 u_{t-1} + \rho_2 u_{t-2} + \cdots + \rho_p u_{t-p} + u_t$
Estimate the regression using OLS and obtain the $R^2$ of the auxiliary regression of the residuals. If the sample size is large, Breusch and Godfrey have shown that $(n - p) R^2$ follows a $\chi^2$ distribution. The null hypothesis is that there is no autocorrelation, i.e. $\rho_j = 0$ for all $j$. If $(n - p) R^2$ exceeds the critical value at the chosen level of significance, we reject the null hypothesis, in which case at least one $\rho_j$ is statistically different from zero.

Modeling Seasonality
A regression can estimate both the trend and additive seasonal indexes:
− Create dummy variables which indicate the season.
− Regress on time and the seasonal variables.
− Use the multiple regression model to forecast.
For any season, e.g., season 1, create a column with 1 for time periods which are in season 1, and zero for other time periods. For $s$ seasons, only $s - 1$ dummy variables are required.

Dummy Variables
Quarterly input data with a trend variable $t$ and seasonal dummy variables $Q_1$ (Spring), $Q_2$ (Summer), $Q_3$ (Fall):

Sales   t   Q1   Q2   Q3
3497    1    1    0    0    Year 1
3484    2    0    1    0
3553    3    0    0    1
3837    4    0    0    0    (not Spring, not Summer, not Fall: the quarter without its own dummy)
3726    5    1    0    0    Year 2
3589    6    0    1    0

Modelling Seasonality
The model which is fitted (assuming quarterly data) is
$y = \beta_0 + \beta_1 t + \beta_2 Q_1 + \beta_3 Q_2 + \beta_4 Q_3$
Only three quarters are explicitly modelled. Otherwise $Q_1 = 1 - (Q_2 + Q_3 + Q_4)$ would hold for all four quarters, i.e. perfect multicollinearity. This allows us to test for seasonality.
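A sketch of how this trend-plus-seasonality model could be fitted with statsmodels (the DataFrame seasonal_data and its column names are assumptions for illustration; the values are the six quarterly observations as reconstructed in the table above, so the fit is purely illustrative):

import pandas as pd
from statsmodels.formula.api import ols

# Six quarterly observations; Q1-Q3 are the seasonal dummies, the all-zero row is the omitted quarter
seasonal_data = pd.DataFrame({
    "sales": [3497, 3484, 3553, 3837, 3726, 3589],
    "t":     [1, 2, 3, 4, 5, 6],
    "Q1":    [1, 0, 0, 0, 1, 0],
    "Q2":    [0, 1, 0, 0, 0, 1],
    "Q3":    [0, 0, 1, 0, 0, 0],
})

# y = b0 + b1*t + b2*Q1 + b3*Q2 + b4*Q3
seasonal_model = ols("sales ~ t + Q1 + Q2 + Q3", data=seasonal_data).fit()
print(seasonal_model.params)

The t-tests on the dummy coefficients are what "allows us to test for seasonality": if no seasonal dummy is significant, a pure trend model suffices.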
Gauss-Markov Assumptions in Detail (recap; see the list of five assumptions above)

Gender Bias?
Consider the acceptance rates for the following groups of men and women who applied to college.

Counts    Accepted   Not accepted   Total        Percents   Accepted   Not accepted
Men       198        162            360          Men        55%        45%
Women     88         112            200          Women      44%        56%
Total     286        274            560

A higher percentage of men were accepted: is there evidence of discrimination?

Gender Bias?
Consider the acceptance rates broken down by type of school.

Computer Science
Counts    Accepted   Not accepted   Total        Percents   Accepted   Not accepted
Men       18         102            120          Men        15%        85%
Women     24         96             120          Women      20%        80%
Total     42         198            240

Management
Counts    Accepted   Not accepted   Total        Percents   Accepted   Not accepted
Men       180        60             240          Men        75%        25%
Women     64         16             80           Women      80%        20%
Total     244        76             320

Explanations?
Within each school a higher percentage of women were accepted! There was no discrimination against women: women rather applied to schools with lower acceptance rates, while men applied to schools with higher acceptance rates.
This is an example of Simpson's paradox: when the omitted (aka confounding) variable (type of school) is ignored, the data seem to suggest discrimination against women. However, when the type of school is considered, the association is reversed and suggests discrimination against men. This can happen if, say, the majority of women apply to the department with the lower acceptance rate. Considering the department is important for this problem.
See also https://en.wikipedia.org/wiki/Simpson%27s_paradox.
But we often do not have all relevant variables in the data...

Endogeneity due to Omitted Variables
Endogeneity means $\mathrm{corr}(\varepsilon_i, X_i) \ne 0$, which implies $E[\varepsilon_i \mid X_i] \ne 0$.
Simple test: analyze the correlation of the residuals and an independent variable.
Reasons for endogeneity: measurement errors, variables that affect each other, omitted variables(!).
Omitted (aka confounding) variables:
− True model: $y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + e_i$
− Estimated model: $y_i = \beta_0 + \beta_1 x_1 + u_i$
Now $u_i = \beta_2 x_2 + e_i$. If $x_1$ and $u_i$ are correlated and $x_2$ affects $y$, this leads to endogeneity. Why is this a problem?

Omitted Variable Bias
A reason for endogeneity might be that relevant variables are omitted from the model. For example, the enthusiasm or the willingness to take risks of an individual describe unobserved heterogeneity. Can we control for these effects when estimating our regression model?
Various techniques have been developed to address endogeneity in panel data. In panel data, the same individual is observed multiple times!
See https://en.wikipedia.org/wiki/Omitted-variable_bias and https://en.wikipedia.org/wiki/Endogeneity_(econometrics).
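To see why the omitted variable is a problem for the coefficient itself, here is a small simulation sketch (all numbers and names are illustrative assumptions; the true coefficient on x1 is 2.0, and x2 is both correlated with x1 and relevant for y):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 5000

# x2 is correlated with x1 and affects y, but is omitted from the second model
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)

full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
omitted = sm.OLS(y, sm.add_constant(x1)).fit()

print(full.params[1])     # close to the true value 2.0
print(omitted.params[1])  # biased: roughly 2.0 + 1.5 * 0.8, since x1 partly stands in for x2

No matter how large n gets, the short regression does not converge to the true coefficient: the bias is a property of the model, not of the sample size.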
Panel Data vs. Cross-Section Data
Cross-section data collected in observational studies refers to data observing many subjects (such as individuals, firms, or countries/regions) at the same point in time, or without regard to differences in time.
A panel data set, or longitudinal data set, is one where there are repeated observations on the same units, which makes it possible to overcome an omitted variable bias.
− A balanced panel is one where every unit is surveyed in every time period.
− In an unbalanced panel, some individuals have not been recorded in some time periods.

The Panel Data Structure
For each individual $i = 1, \ldots, n$ and each wave $t$, the predictors $x_{1it}, x_{2it}, \ldots, x_{pit}$ are observed; stacking the waves of all individuals gives the panel data matrix.

Modeling Fixed Effects
Fixed effects
− assume the $\lambda_i$ are constants (there is endogeneity),
− the effects are correlated with the other covariates.*
See also https://en.wikipedia.org/wiki/Fixed_effects_model
Random effects (not discussed here)
− assume the $\lambda_i$ are drawn independently from some probability distribution,
− the effects are uncorrelated with the other covariates.
See also https://en.wikipedia.org/wiki/Random_effects_model
Specific packages in Python are available for fixed, random, and mixed effects models, which combine both (e.g., PanelOLS in linearmodels.panel). We only deal with fixed effects in this class. You might want to test which effect is most likely in applications.
*We talk about a factor if it is a categorical variable and a covariate if it is continuous.

Fixed Effects Models
[The slide shows data for four individuals with parallel fitted lines: the slope is the same for each individual, only the constant varies.]

The Fixed Effect Model
Treat $\lambda_i$ (the individual-specific heterogeneity) as a constant for each individual:
$y_{it} = (\beta_0 + \lambda_i) + \beta_1 x_{1it} + \beta_2 x_{2it} + \cdots + \beta_p x_{pit} + \varepsilon_{it}$
$\lambda_i$ is part of the constant, but varies by individual $i$.
Various estimators exist for fixed effect models: first differences, within, between, and the least squares dummy variable estimator.

First-Differences Estimator
Eliminate unobserved heterogeneity by taking first differences.
Original equation:
$y_{it} = \beta_0 + \lambda_i + \beta_1 x_{1it} + \beta_2 x_{2it} + \cdots + \beta_p x_{pit} + \varepsilon_{it}$
Lag one period and subtract; the constant and the individual effects are eliminated:
$y_{it} - y_{it-1} = \beta_1 (x_{1it} - x_{1it-1}) + \beta_2 (x_{2it} - x_{2it-1}) + \cdots + \beta_p (x_{pit} - x_{pit-1}) + (\varepsilon_{it} - \varepsilon_{it-1})$
Transformed equation:
$\Delta y_{it} = \beta_1 \Delta x_{1it} + \beta_2 \Delta x_{2it} + \cdots + \beta_p \Delta x_{pit} + \Delta \varepsilon_{it}$

How to Estimate a Model with Fixed Effects
− Least squares dummy variable estimator: use a dummy variable for each individual (or firm, etc.) which we assume to have a fixed effect.
− Within estimator: take deviations from individual means and apply least squares,
$y_{it} - \bar{y}_i = \beta_1 (x_{1it} - \bar{x}_{1i}) + \cdots + \beta_p (x_{pit} - \bar{x}_{pi}) + (\varepsilon_{it} - \bar{\varepsilon}_i)$
which relies on variation within individuals.
Fixed effects models will be revisited once we discuss causality.

Gauss-Markov Assumptions in Detail (recap)
The OLS estimator is the best linear unbiased estimator (BLUE), iff
1) Linearity: linear relationship in the parameters $\beta$
2) No multicollinearity of predictors: no linear dependency between predictors
3) Homoscedasticity: the residuals exhibit constant variance
4) No autocorrelation: there is no correlation between the $i$th and $j$th residual terms
5) The expected value of the residual vector, given $X$, is 0 ($E[\varepsilon \mid X] = 0$), i.e., exogeneity ($\mathrm{cov}(\varepsilon, X) = 0$)
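Closing out the fixed-effects discussion above, a minimal sketch of the least squares dummy variable and within estimators using statsmodels (the synthetic panel and all names are illustrative assumptions; the lecture also mentions PanelOLS in linearmodels.panel as a dedicated alternative):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Small synthetic panel: 4 individuals, 5 periods each, common slope 2.0, individual-specific constants
rng = np.random.default_rng(1)
individuals = np.repeat(np.arange(4), 5)
x1 = rng.normal(size=20)
lam = np.array([0.0, 3.0, 6.0, 9.0])[individuals]
y = 1.0 + lam + 2.0 * x1 + rng.normal(scale=0.1, size=20)
panel = pd.DataFrame({"individual": individuals, "x1": x1, "y": y})

# Least squares dummy variable (LSDV) estimator: one dummy per individual
lsdv = smf.ols("y ~ x1 + C(individual)", data=panel).fit()

# Within estimator: demean y and x1 by individual, then OLS without a constant
demeaned = panel.copy()
for col in ["y", "x1"]:
    demeaned[col] = panel[col] - panel.groupby("individual")[col].transform("mean")
within = smf.ols("y ~ x1 - 1", data=demeaned).fit()

# Both estimators recover the same slope on x1, close to the true value 2.0
print(lsdv.params["x1"], within.params["x1"])

The equivalence of the two slope estimates is the standard result behind the within transformation: demeaning removes the individual-specific constants just as the dummies absorb them.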