Applied Econometrics Regression Diagnostics & Complications PDF
University of Oulu, Oulu Business School
Elias Oikarinen
Summary
This document is a lecture handout on applied econometrics, covering regression diagnostics and complications. It discusses concepts such as multicollinearity, heteroscedasticity, and autocorrelation, and includes an empirical illustration.
Full Transcript
Applied Econometrics: Regression Diagnostics & Complications
Lecture 2, Autumn 2023
D.Sc. (Econ.) Elias Oikarinen, Professor (Associate) of Economics, University of Oulu, Oulu Business School
(Several slides in this handout are adapted from Damodar Gujarati, Econometrics by Example, second edition.)

This Handout
• The key diagnostics and potential complications in applied econometric analysis:
- Multicollinearity
- Heteroscedasticity
- Autocorrelation
- Specification errors
• Empirical illustration, the consideration of which is a major focus during the lecture (also the 1st EViews exercise)
• Readings: Brooks, Chapter 5, or Gujarati, Part II, Chapters 4-7; EViews 10 User Guide, Ch. 24, Specification and Diagnostic Tests

I MULTICOLLINEARITY (Gujarati, Ch. 4; Brooks, Ch. 5.8)

MULTICOLLINEARITY
➢ One of the assumptions of the classical linear regression model (CLRM) is that there is no exact linear relationship among the regressors.
➢ If there are one or more such relationships among the regressors, we call it multicollinearity, or collinearity for short.
➢ Perfect collinearity: an exact linear relationship between the variables exists.
➢ Imperfect collinearity: the regressors are highly (but not perfectly) collinear.

CONSEQUENCES
➢ If collinearity is not perfect, but high, several consequences ensue:
➢ The OLS estimators are still BLUE, but one or more regression coefficients have large standard errors relative to the values of the coefficients, thereby making the t-ratios small.
➢ Therefore, one may conclude (misleadingly) that the true values of these coefficients are not different from zero.
➢ Also, the regression coefficients may be very sensitive to small changes in the data, especially if the sample is relatively small.

VARIANCE INFLATION FACTOR
➢ For the regression model $Y_i = B_1 + B_2 X_{2i} + B_3 X_{3i} + u_i$ it can be shown that
$\operatorname{var}(b_2) = \frac{\sigma^2}{\sum x_{2i}^2 (1 - r_{23}^2)} = \frac{\sigma^2}{\sum x_{2i}^2}\,\mathrm{VIF}$
and
$\operatorname{var}(b_3) = \frac{\sigma^2}{\sum x_{3i}^2 (1 - r_{23}^2)} = \frac{\sigma^2}{\sum x_{3i}^2}\,\mathrm{VIF}$,
where $\sigma^2$ is the variance of the error term $u_i$ and $r_{23}$ is the coefficient of correlation between $X_2$ and $X_3$.

VARIANCE INFLATION FACTOR (CONT.)
➢ $\mathrm{VIF} = \frac{1}{1 - r_{23}^2}$ is the variance-inflating factor (for a regression with more than two explanatory variables, see Gujarati p. 70, eq. 4.9).
➢ VIF is a measure of the degree to which the variance of the OLS estimator is inflated because of collinearity.

DETECTION OF MULTICOLLINEARITY
➢ 1. High R² but few significant t-ratios.
➢ 2. High pair-wise correlations among explanatory variables.
➢ 3. High partial correlation coefficients.
➢ 4. Significant F-test for auxiliary regressions (regressions of each regressor on the remaining regressors).
➢ 5. High Variance Inflation Factor (VIF), particularly exceeding 10 in value, and low Tolerance Factor (TOL, the inverse of VIF).
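A rough illustration of points 4-5 in Python with statsmodels; the simulated data and variable names are assumptions for illustration, not part of the lecture material (the lecture itself uses EViews):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Simulated data: x3 is highly (but not perfectly) collinear with x2
rng = np.random.default_rng(42)
n = 100
x2 = rng.normal(size=n)
x3 = 0.95 * x2 + 0.1 * rng.normal(size=n)  # r23 close to 1
y = 1.0 + 2.0 * x2 + 3.0 * x3 + rng.normal(size=n)

X = sm.add_constant(pd.DataFrame({"x2": x2, "x3": x3}))
print(sm.OLS(y, X).fit().bse)  # inflated standard errors on x2 and x3

# VIF = 1 / (1 - R^2) from the auxiliary regression of each regressor
# on the remaining regressors; values above 10 are a common warning sign
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, i))
```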
Multicollinearity Can Cause a Wrong Parameter Sign
• Multicollinearity can in some cases even lead to estimating an incorrect parameter sign (see the example below)
• Hence, an unexpected parameter sign may signal multicollinearity complications

What should we do about multicollinearity?
• It depends on the aim of the analysis, i.e., there are several different options:
• Nothing, for we often have no control over the data.
• Nothing, if we are interested in the coefficient(s) on just some particular variable(s) that is (are) not significantly correlated with the other explanatory variables.
• Indeed, in such a case we typically want to include as many control variables as possible to avoid notable omitted variable bias.
• Redefine the model by excluding variables; this may attenuate the problem, provided we do not omit relevant variables.
• [Principal components analysis (Gujarati pp. 76-78).]
• If we keep a model with notable multicollinearity, the consequences of multicollinearity have to be considered when interpreting the results.

Example of regression results with multicollinearity problem
• Key long-term demand drivers for housing prices: population and income level
• Assuming no notable shifts in the supply curve, we can estimate the long-run equation for housing prices
• Solution: combine income and population into one variable (aggregate income = population × income per capita)
• This assumes similar coefficients for both income and population
• The example is based on panel data for 50 U.S. metropolitan areas

II HETEROSCEDASTICITY (Gujarati, Ch. 5; Brooks, Ch. 5.4)

HETEROSCEDASTICITY
➢ One of the assumptions of the classical linear regression model (CLRM) is that the variance of $u_i$, the error term, is constant, or homoscedastic.
➢ In reality, the error variance is often not constant, i.e. the errors are heteroscedastic.
➢ Reasons are many, including:
➢ Clustered volatility (time series, e.g. asset returns)
➢ Spatial autocorrelation (e.g. house prices)
➢ The presence of outliers in the data
➢ Incorrect functional form of the regression model

Homoscedastic vs. Heteroscedastic residuals
• Cross-section regression
• Later on in the course: many examples regarding time series data and models

CONSEQUENCES
➢ If heteroscedasticity exists, several consequences ensue:
➢ The OLS estimators are still unbiased and consistent, yet they are less efficient, making statistical inference less reliable (i.e., the estimated t values may not be reliable).
➢ Thus, the estimators are not best linear unbiased estimators (BLUE); they are simply linear unbiased* estimators (LUE).
➢ In the presence of heteroscedasticity, the BLUE estimators are provided by the method of weighted least squares (WLS).
* An estimator of a given parameter is said to be unbiased if its expected value is equal to the true value of the parameter. In other words, an estimator is unbiased if it produces parameter estimates that are on average correct.

DETECTION OF HETEROSCEDASTICITY
➢ Graph a histogram of the squared residuals
➢ Graph the squared residuals against the predicted Y
➢ Breusch-Pagan (BP) test (Gujarati, pp. 86-87)
➢ White's test of heteroscedasticity (Gujarati, pp. 87-89)
➢ Other tests, such as the Park, Glejser, Spearman's rank correlation, and Goldfeld-Quandt tests of heteroscedasticity
➢ More on these in the time series econometrics stage
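A minimal sketch of the Breusch-Pagan and White tests with statsmodels; the heteroscedastic data-generating process is an assumption for illustration:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, het_white

# Simulated heteroscedastic data: error standard deviation grows with x
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(1, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3 * x)

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()

# Both tests relate the squared residuals to the regressors;
# H0: homoscedasticity
bp_lm, bp_p, _, _ = het_breuschpagan(res.resid, X)
w_lm, w_p, _, _ = het_white(res.resid, X)
print(f"Breusch-Pagan LM p-value: {bp_p:.4f}")
print(f"White LM p-value:         {w_p:.4f}")
```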
REMEDIAL MEASURES
➢ What should we do if we detect heteroscedasticity? (Gujarati, pp. 89-94; Brooks, Ch. 5.4.3)
➢ Use the method of Weighted Least Squares (WLS):
➢ Divide each observation by the (heteroscedastic) $\sigma_i$ and estimate the transformed model by OLS (yet the true variance is rarely known).
➢ If the true error variance is proportional to the square of one of the regressors, we can divide both sides of the equation by that variable and run the transformed regression.
➢ Take the natural log of the dependent variable (sometimes helps).
➢ Use White's heteroscedasticity-consistent standard errors, or robust standard errors.
➢ Valid in large samples.

Heteroscedasticity and Time Series Econometrics
• Often a visual check of the variables and residuals can be useful
• Various tests are available (also in EViews)
• If heteroscedasticity is observed, a GARCH model can be estimated to cater for it and to forecast volatility
• To be discussed and analyzed in more detail later during the course

III AUTOCORRELATION (Gujarati, Ch. 6; Brooks, Ch. 5.5)

AUTOCORRELATION
➢ One of the assumptions of the classical linear regression model (CLRM) is that the covariance between $u_i$, the error term for observation i, and $u_j$, the error term for observation j, is zero.
➢ Often this is not the case → autocorrelated residuals
➢ Reasons for autocorrelation include:
➢ Possible strong correlation between the shock in time t and the shock in time t+1
➢ Sluggish adjustment to shocks
➢ Autocorrelation is more common in time series data

CONSEQUENCES
➢ If autocorrelation exists, several consequences ensue:
➢ The OLS estimators are still unbiased and consistent*.
➢ They are still normally distributed in large samples.
➢ They are no longer efficient (i.e. not the best possible), meaning that they are no longer BLUE.
➢ In most cases the standard errors are underestimated.
➢ Thus, the hypothesis-testing procedure becomes suspect, since the estimated standard errors may not be reliable, even asymptotically (i.e. in large samples).
* An estimator of a given parameter is said to be consistent if it converges in probability to the true value of the parameter as the sample size tends to infinity.

DETECTION OF AUTOCORRELATION
➢ For time series models (for cross-section data there are specific tests, e.g. for spatial autocorrelation)
➢ Graphical method:
➢ Plot the values of the residuals, $e_t$, chronologically
➢ If a discernible pattern exists, autocorrelation is likely a problem
➢ Durbin-Watson test (e.g. Gujarati, pp. 101-102)
➢ Breusch-Godfrey (BG) test (e.g. Gujarati, pp. 102-104)
➢ Many other tests

REMEDIAL MEASURES
➢ If we find autocorrelation in an application, we need to take care of it: depending on its severity, we may be drawing misleading conclusions, as the usual OLS standard errors could be severely biased.
➢ First-difference transformation:
➢ If the autocorrelation is of AR(1) type, we have $u_t = \rho u_{t-1} + v_t$.
➢ Assume $\rho = 1$ and run a first-difference model (taking the first difference of the dependent variable and all regressors).
➢ Generalized transformation:
➢ Estimate the value of $\rho$ through a regression of the residual on the lagged residual, and use that value to run the transformed regression.
➢ Newey-West method: generates HAC (heteroscedasticity and autocorrelation consistent) standard errors (see the sketch below).
➢ Model evaluation.

Issues w.r.t. Autocorrelation and Time Series Econometrics
• Autocorrelation is typical in time series data
• Estimation of univariate models that include the lagged dependent variable: ARIMA models
• Models that (also) include other lagged variables: e.g. ARIMAX models, VAR models
• Forecasts are often based, to a notable extent, on observed autocorrelation
• Cross-autocorrelations often play an important role in the models and forecasts
• We study issues related to autocorrelation in more detail later on
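A sketch of the Durbin-Watson and Breusch-Godfrey tests and the Newey-West correction in statsmodels; the AR(1) setup is an assumption for illustration:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

# Simulated regression with AR(1) errors, rho = 0.7
rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.7 * u[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + u

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()

print("Durbin-Watson:", durbin_watson(res.resid))      # ~2 under no AR(1)
lm, lm_p, _, _ = acorr_breusch_godfrey(res, nlags=2)   # H0: no autocorrelation
print("Breusch-Godfrey LM p-value:", lm_p)

# Remedial: Newey-West HAC standard errors (coefficient estimates are
# unchanged; only the covariance matrix is corrected)
res_hac = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 4})
print(res_hac.bse)
```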
IV MODEL SPECIFICATION ERRORS (Gujarati, Ch. 7; Brooks, Ch. 5.9-5.13)

MODEL SPECIFICATION ERRORS
➢ One of the assumptions of the classical linear regression model (CLRM) is that the model is specified correctly.
➢ By correct specification we mean one or more of the following:
➢ 1. The model does not exclude any "core" variables.
➢ 2. The model does not include superfluous variables.
➢ 3. The functional form of the model is suitably chosen.
➢ 4. There are no errors of measurement in the regressand and regressors.
➢ 5. Outliers in the data, if any, are taken into account.
➢ 6. The probability distribution of the error term is well specified.
➢ 7. The regressors are nonstochastic.

OMISSION OF RELEVANT VARIABLES
➢ If we omit a relevant variable, because we do not have the data, or because we have not studied the underlying economic theory or prior research in the area thoroughly, or just due to carelessness, we are underfitting the model.

CONSEQUENCES
➢ 1. If the omitted variables are correlated with the variables included in the model, the coefficients of the estimated model are biased.
➢ This bias does not disappear as the sample size gets larger (i.e., the estimated coefficients of the misspecified model are also inconsistent).
➢ This is called omitted variable bias.
➢ 2. Even if the incorrectly excluded variables are not correlated with the variables included in the model, the intercept of the estimated model is biased.
➢ 3. The disturbance variance is incorrectly estimated.
➢ 4. The variances of the estimated coefficients of the misspecified model are biased.
➢ 5. In consequence, the usual confidence intervals and hypothesis-testing procedures become suspect, leading to misleading conclusions about the statistical significance of the estimated parameters.
➢ 6. Furthermore, forecasts based on the incorrect model, and the forecast confidence intervals based on it, will be unreliable.

On Omitted Variable Bias
• For omitted variable bias to occur, two conditions must be fulfilled:
• At least one explanatory variable $X_i$ is correlated with the omitted variable
• The omitted variable is a determinant of the dependent variable Y

F-TEST TO COMPARE TWO MODELS
➢ Gujarati, pp. 117-118
➢ If the original model is the "restricted" model (r), and the model with the added (previously omitted) variable, which could also be a squared term or an interaction term, is the "unrestricted" model (ur), we can compare the two using an F-test (sketched in code below):

$F = \frac{(R^2_{ur} - R^2_r)/m}{(1 - R^2_{ur})/(n - k)}$

where m = number of restrictions (or omitted variables), n = number of observations, and k = number of parameters in the unrestricted model.
➢ A rejection of the null suggests that the omitted variables belong in the model.
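A minimal sketch of this F-test on simulated data, computing the statistic directly from the two R² values and cross-checking against statsmodels' built-in nested-model comparison:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Simulated data where x2 genuinely belongs in the model
rng = np.random.default_rng(2)
n = 100
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1.0 + 2.0 * df.x1 + 1.5 * df.x2 + rng.normal(size=n)

res_r = smf.ols("y ~ x1", data=df).fit()        # restricted model
res_ur = smf.ols("y ~ x1 + x2", data=df).fit()  # unrestricted model

m, k = 1, 3  # one restriction; unrestricted model has 3 parameters
F = ((res_ur.rsquared - res_r.rsquared) / m) / ((1 - res_ur.rsquared) / (n - k))
print(F, stats.f.sf(F, m, n - k))

# statsmodels computes the same test directly:
print(res_ur.compare_f_test(res_r))  # (F statistic, p-value, df difference)
```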
DETECTION OF OMISSION OF VARIABLES
➢ Ramsey's Regression Specification Error (RESET) test (Gujarati, pp. 118-119)
➢ Lagrange Multiplier (LM) test (Gujarati, pp. 118-121)

RAMSEY'S RESET TEST
➢ 1. From the (incorrectly) estimated model, we first obtain the estimated, or fitted, values of the dependent variable, $\hat{Y}_i$.
➢ 2. Re-estimate the original model including $\hat{Y}_i^2$ and $\hat{Y}_i^3$ (and possibly higher powers of the fitted dependent variable) as additional regressors.
➢ 3. The initial model is the restricted model and the model in Step 2 is the unrestricted model.
➢ 4. Under the null hypothesis that the restricted (i.e., the original) model is correct, we can use the previously mentioned F-test.
➢ 5. If the F-test in Step 4 is statistically significant, we can reject the null hypothesis; that is, the restricted model is not appropriate in the present situation. By the same token, if the F-statistic is statistically insignificant, we do not reject the original model.
➢ The test does not suggest any specific alternative model.

LAGRANGE MULTIPLIER TEST
➢ 1. From the original model, we obtain the estimated residuals, $e_i$.
➢ 2. If in fact the original model is the correct model, then the residuals $e_i$ obtained from this model should not be related to the regressors omitted from that model.
➢ 3. We now regress $e_i$ on the regressors in the original model and the omitted variables from the original model. This is the auxiliary regression.
➢ 4. If the sample size is large, it can be shown that n (the sample size) times the R² obtained from the auxiliary regression follows the chi-square distribution with df equal to the number of regressors omitted from the original regression.
➢ 5. If the computed chi-square value exceeds the critical chi-square value at the chosen level of significance, or if its p-value is sufficiently low, we reject the original (or restricted) regression. That is to say, the original model was misspecified. (Both the RESET and LM tests are sketched in code below.)
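Both tests can be implemented directly from the steps above; the quadratic data-generating process below is an assumed example of a misspecified linear model:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Data with a quadratic effect that a purely linear model misses
rng = np.random.default_rng(3)
n = 200
df = pd.DataFrame({"x": rng.uniform(0, 4, size=n)})
df["y"] = 1.0 + 2.0 * df.x + 1.5 * df.x**2 + rng.normal(size=n)

res = smf.ols("y ~ x", data=df).fit()

# RESET: add powers of the fitted values, then F-test their joint significance
df["fit2"] = res.fittedvalues**2
df["fit3"] = res.fittedvalues**3
res_ur = smf.ols("y ~ x + fit2 + fit3", data=df).fit()
print("RESET F-test:", res_ur.compare_f_test(res))

# LM test: regress the residuals on the original regressors plus the
# omitted term (x^2 here); n * R^2 ~ chi2(number of omitted regressors)
df["e"] = res.resid
aux = smf.ols("e ~ x + I(x**2)", data=df).fit()
lm_stat = n * aux.rsquared
print("LM stat:", lm_stat, "p-value:", stats.chi2.sf(lm_stat, df=1))
```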
INCLUSION OF IRRELEVANT OR UNNECESSARY VARIABLES
➢ Sometimes researchers add variables in the hope that the R² value of their model will increase, in the mistaken belief that the higher the R², the better the model. This is called overfitting a model. If the variables are not economically meaningful and relevant, such a strategy is not recommended.

CONSEQUENCES
➢ 1. The OLS estimators of the "incorrect" or overfitted model are all unbiased and consistent.
➢ 2. The error variance is correctly estimated.
➢ 3. The usual confidence interval and hypothesis-testing procedures remain valid.
➢ 4. However, the estimated coefficients of such a model are generally inefficient (their variances will be larger than those of the true model, due to possible multicollinearity for one).

MISSPECIFICATION OF THE FUNCTIONAL FORM OF A REGRESSION MODEL
➢ Sometimes researchers mistakenly do not account for the nonlinear nature of variables in a model. Moreover, some dependent variables (such as wage, which tends to be skewed to the right) are more appropriately entered in natural log form.

ERRORS OF MEASUREMENT
➢ One of the assumptions of CLRM is that the model used in the analysis is correctly specified.
➢ Although not explicitly spelled out, this presumes that the values of the regressand as well as the regressors are accurate. That is, they are not guess estimates, extrapolated, interpolated or rounded off in any systematic manner, or recorded with errors.

CONSEQUENCES
➢ Consequences of errors of measurement in the regressand:
➢ 1. The OLS estimators are still unbiased.
➢ 2. The variances and standard errors of the OLS estimators are still unbiased.
➢ 3. But the estimated variances, and the standard errors, are larger than in the absence of such errors.
➢ In short, errors of measurement in the regressand do not pose a very serious threat to OLS estimation.

CONSEQUENCES
➢ Consequences of errors of measurement in the regressors:
➢ 1. The OLS estimators are biased as well as inconsistent.
➢ 2. Errors in a single regressor can lead to biased and inconsistent estimates of the coefficients of the other regressors in the model.
➢ It is not easy to establish the size and direction of bias in the estimated coefficients.
➢ It is often suggested that we use instrumental or proxy variables for variables suspected of having measurement errors.
➢ The proxy variables must satisfy two requirements: they must be highly correlated with the variables for which they are a proxy, and they must be uncorrelated with the usual equation error as well as the measurement error.
➢ But such proxies are not easy to find.
➢ We should thus be very careful in collecting the data and making sure that obvious errors are eliminated.

OUTLIERS, LEVERAGE, AND INFLUENCE DATA
➢ OLS gives equal weight to every observation in the sample.
➢ This may create problems if we have observations that are not "typical" of the rest of the sample.
➢ Such observations, or data points, are known as outliers, leverage points, or influence points.

OUTLIERS, LEVERAGE, AND INFLUENCE DATA
➢ Outliers: In the context of regression analysis, an outlier is an observation with a large residual ($e_i$), large in comparison with the residuals of the rest of the observations.
➢ Leverage: An observation is said to exert (high) leverage if it is disproportionately distant from the bulk of the sample observations. Such observations can pull the regression line towards themselves, which may distort the slope of the regression line.
➢ Influential point: If a levered observation in fact pulls the regression line toward itself, it is called an influential point. The removal of such a data point from the sample can dramatically change the slope of the estimated regression line.

PROBABILITY DISTRIBUTION OF THE ERROR TERM
➢ The classical normal linear regression model (CNLRM), an extension of CLRM, assumes that the error term $u_i$ in the regression model is normally distributed.
➢ This assumption is critical if the sample size is relatively small, for the commonly used tests of significance, such as t and F, are based on the normality assumption.

JARQUE-BERA (JB) TEST OF NORMALITY
➢ This is a large-sample test.
➢ The test statistic is

$JB = n\left[\frac{S^2}{6} + \frac{(K-3)^2}{24}\right]$

where n is the sample size, S = skewness coefficient, and K = kurtosis coefficient.
➢ For a normally distributed variable, S = 0 and K = 3, and in that case the JB statistic is zero. Therefore, the closer the value of JB is to zero, the better the normality assumption holds.
➢ Since in practice we do not observe the true error term, we use its proxy, $e_i$. The null hypothesis is the joint hypothesis that S = 0 and K = 3.
➢ The statistic follows the chi-square distribution with 2 df (because we are imposing two restrictions, namely that skewness is zero and kurtosis is 3). If the computed JB statistic exceeds the critical chi-square value, we reject the hypothesis that the error term is normally distributed.
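A minimal sketch of the JB statistic, computed by hand from the formula above and cross-checked against scipy's built-in version; the skewed sample below is an assumed stand-in for regression residuals:

```python
import numpy as np
from scipy import stats

# Skewed, fat-tailed sample standing in for regression residuals
rng = np.random.default_rng(4)
e = rng.gamma(shape=2.0, scale=1.0, size=500) - 2.0

n = len(e)
S = stats.skew(e)
K = stats.kurtosis(e, fisher=False)  # "raw" kurtosis, equals 3 under normality
JB = n * (S**2 / 6 + (K - 3) ** 2 / 24)
p = stats.chi2.sf(JB, df=2)
print(f"JB = {JB:.2f}, p-value = {p:.4f}")

# scipy's built-in version gives (essentially) the same statistic
print(stats.jarque_bera(e))
```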
What If the Error Term is Non-Normal?
• Potential solutions:
• Adding omitted variables
• Using natural log transformations
• Sometimes it is not possible to get rid of non-normality in a sensible manner

Non-normality of residuals
• OLS is still BLUE
• In large samples the F- and t-tests are still reliable
• For smaller samples, the reliability of the tests is decreased → this needs to be understood when drawing conclusions from the analysis
• It is possible to compute reliable confidence intervals for basically any form of residuals, but this is out of the scope of this course

THE SIMULTANEITY PROBLEM
➢ There are many situations where a unidirectional relationship between Y and the Xs cannot be maintained, since some Xs affect Y, but in turn Y also affects one or more Xs.
➢ In other words, there may be a feedback relationship between the Y and X variables.
➢ Simultaneous equation regression models are models that take such feedback relationships among variables into account.
➢ Endogenous variables are variables whose values are determined in the model.
➢ Exogenous variables are variables whose values are not determined in the model.
➢ Sometimes, exogenous variables are called predetermined variables, for their values are determined independently or fixed, such as tax rates fixed by the government.
➢ Estimate the parameters using the Method of Indirect Least Squares (ILS) or the Method of Two-Stage Least Squares (2SLS).

REAL-LIFE (EMPIRICAL) EXAMPLE
(Excellent illustrative examples are also provided in Gujarati's and Brooks's books.)

Empirical Illustration: Hedonic housing price function
• Background: The value of a dwelling consists of the value and amount of various dwelling characteristics that provide utility to households (size, location, condition, vintage, etc.)
• A hedonic price model aims to capture the effect of various factors on housing prices
• It can also be used to predict a dwelling's value
• Numerous factors affect the utility and thus prices
• For illustration purposes, we use a small sample (typically there would be many thousands of observations) of 81 observed flat transactions in the Malmi area of Helsinki during 2019

Empirical Illustration: Variables
• Price = price of the transacted flat
• Price_sqm = price per m²
• Size = size in m²
• Age = age of the dwelling at the time of transaction
• Cond_good = in good condition, dummy variable; Cond_satis = in satisfactory condition, dummy variable; omitted group = poor condition
• Lift = dummy variable for a lift (value 1 if there is a lift, 0 otherwise)
• bd2 and bd3 = dummy variables for 2- and 3-room flats (omitted group = 1 room, i.e. studio)
• Rooms = number of rooms in the flat
• Pihlajisto and Pmaki = dummy variables for flats located in the subareas of Pihlajisto and Pukinmäki (omitted group = core Malmi area)
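The lecture estimates everything in EViews; a roughly equivalent Python setup, anticipating the log-log specification eventually chosen below, might look as follows. The file name malmi_flats_2019.csv and the lower-case column names are assumptions, not part of the original material:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file holding the 81 Malmi flat transactions
df = pd.read_csv("malmi_flats_2019.csv")

# Log transformations for the continuous variables (assumes age > 0)
df["ln_price"] = np.log(df["price"])
df["ln_size"] = np.log(df["size"])
df["ln_age"] = np.log(df["age"])

# Log-log hedonic specification; the dummies enter in levels
res = smf.ols(
    "ln_price ~ ln_size + ln_age + cond_good + cond_satis"
    " + lift + pihlajisto + pmaki",
    data=df,
).fit()
print(res.summary())
```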
Empirical Illustration: Data, Descriptive Statistics
• In EViews, one has to make a group of the variables to get a table of descriptive statistics on several variables
• Some examples w.r.t. interpreting the descriptive statistics:
• Large differences across observations regarding price (also per m²) and size → leverage observations? → outliers / influential points?
• 38% of observations are in "good" condition and 58% in "satisfactory" condition → only 4% are in "poor" condition
• 41% of the flats are in buildings with a lift

Variable Correlations with P-values
• EViews: View – Covariance Analysis (in the group)
• Price and price per m² have opposite correlation signs with size → non-linear size effects on price.
• No huge correlations among the explanatory variables (except, obviously, for some dummies): this diminishes potential multicollinearity complications.
• Lower average prices in Pihlajisto than in core Malmi – what about after controlling for the other variables?

Lin-Lin Model with the Whole Data
• Quick – Estimate Equation
• The model seems very problematic
• With the whole data, the RESET test rejects, and there is a very large outlier observation (obs 8) regardless of the model form
• This is also a leverage observation (price per m² only 893 € at a size of 94 m²)
• Later on we will also see that this is an influential point
• The reliability of obs 8 is altogether doubtful → remove obs 8 and re-estimate

Log-Log Model with the Whole Data
• The sign on size does not make sense here either, even though this model should better capture the non-linear effect
• Otherwise, similar complications as in the lin-lin model → remove obs 8 and re-estimate

Models After Outlier Removed
• The sign on size makes sense now
• Maybe multicollinearity complications? (size together with bd2 and bd3)
• NOTE: We should not compare the models based on Adj. R² – why?

Ramsey RESET Test
• View – Stability Diagnostics
• RESET rejects the lin-lin model but accepts the log-log model
• This is in line with prior expectations (economic theory): why?
• We proceed with the log-log model.
• Note: knowing this from experience, a researcher would likely use the ln-transformed variables already in the computation of the correlation table.

Ramsey RESET Test
• UG II, pp. 224-225
• Specify the number of fitted terms to include in the test regression.
• The fitted terms are the powers of the fitted values from the original regression, starting with the square, or second power.
• For example, if you specify 1, then the test will add the fitted y in the second power to the regression.
• The EViews output reports the test regression and the F-statistic and log likelihood ratio for testing the hypothesis that the coefficients on the powers of the fitted values are all zero.

Variance Inflation Factors
• View – Coefficient Diagnostics
• Given the multicollinearity complications and the insignificance of bd2 and bd3, let's remove those variables
• Whether to remove them or not depends on the aim of the analysis, and is also a "matter of taste" to some extent
• AIC and Adj. R² would recommend leaving them in the model, while SC prefers the more parsimonious regression

Eventual Price Model
• Lift could possibly be removed
• Correct sign, but the estimate is not too accurate (i.e. the s.e. of the estimate is pretty large)
• It is obvious that the condition variables are correlated with each other – this is not a reason to remove them, though
• Removing them would potentially induce notable omitted variable bias

Residual Homoscedasticity
• The White test rejects homoscedastic residuals (View – Coefficient Diagnostics); UG II, p. 199 *
• Let's estimate with heteroscedasticity-consistent standard errors (see the sketch below)
• This is not a "large" sample, but the correction still improves the accuracy of the p-values
• EViews UG II, pp. 33-36
* Test of the null hypothesis of no heteroskedasticity against heteroskedasticity of unknown, general form. We do not have a large sample here, but the test still gives some guidance for us.
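Continuing the hypothetical Python setup sketched earlier, White-type robust standard errors require only a cov_type argument (HC1 is one of several available White-style corrections); the coefficient estimates are identical to plain OLS:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("malmi_flats_2019.csv")  # hypothetical file, as above
for col in ("price", "size", "age"):
    df["ln_" + col] = np.log(df[col])

# White (HC1) robust covariance: only the standard errors and
# p-values change relative to plain OLS
res_robust = smf.ols(
    "ln_price ~ ln_size + ln_age + cond_good + cond_satis"
    " + lift + pihlajisto + pmaki",
    data=df,
).fit(cov_type="HC1")
print(res_robust.summary())
```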
Eventual Price Model: Interpretations
• Assuming that the point estimates reflect the true coefficients pretty well:
• Flats in good condition are (on average) 19%* more valuable than similar flats in poor condition
• A one % increase in size increases the price by approx. 0.38%
• A one % increase in age lowers the price by approx. 0.17%
• The price level is lower in Pihlajisto (approx. 22%) and Pukinmäki (approx. 14%) than in core Malmi*
* These values for the dummy variables are approximations; more accurately:
Good: e^0.186 − 1 ≈ 0.204 ≈ +20%
Pihlajisto: e^−0.225 − 1 ≈ −0.201 ≈ −20%
Pukinmäki: e^−0.143 − 1 ≈ −0.133 ≈ −13%

Normality of Residuals
• View – Residual Diagnostics – Histogram / Normality Test
• For comparison: residuals from the model with the leverage (influential) observation included:
[Residual histogram omitted. Sample 1-81, 81 observations: mean −1.84e−15, median 0.017, maximum 0.317, minimum −0.833, std. dev. 0.146, skewness −2.25, kurtosis 14.77; Jarque-Bera 536.0 (p = 0.000000) → normality clearly rejected.]

Using the Model to Predict a Flat Value
• Given the observed characteristics of a flat, we can use the model to predict its market value
• Assume a flat that is:
• 10 years old
• 50 m²
• in good condition
• without a lift
• located in Pihlajisto
• ln(Price) ≈ 11.00 + 0.186 + ln(50)·0.383 − ln(10)·0.166 − 0.225 ≈ 12.08
• Price ≈ e^12.08 ≈ 175,580 €
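The same back-of-the-envelope prediction in code, using the rounded coefficients from the slide above; the exponentiated value is a rough point prediction that ignores the retransformation bias of log models:

```python
import math

# Coefficients from the estimated log-log model (rounded, from the slide)
b0, b_good, b_ln_size, b_ln_age, b_pihlajisto = 11.00, 0.186, 0.383, -0.166, -0.225

# Flat: 10 years old, 50 m2, good condition, no lift, in Pihlajisto
ln_price = (b0 + b_good + math.log(50) * b_ln_size
            + math.log(10) * b_ln_age + b_pihlajisto)
print(round(ln_price, 2))          # ~12.08
print(round(math.exp(ln_price)))   # roughly 175,000-176,000 EUR
```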
Possible Complications with the Model
• Simultaneity?
• Omitted variable bias?
• Accuracy of the coefficient estimates?
• Reliability of the p-values?

Illustration of Omitted Variable Bias
• Omission of Age:
• Age is expected to affect the dependent variable
• It is correlated with condition and with a subarea (Pmaki) dummy
• The model is misspecified, with some biasing effect on the coefficient estimates (the omitted variable in this case does not considerably affect the standard errors of the coefficients)

Illustration of Omitted Variable Bias
• Omission of Size:
• Size is expected to affect the dependent variable
• It is not strongly correlated with the other explanatory variables, but there are still substantially greater standard errors and changes in the coefficients (so the omitted variable influences many coefficients despite not being correlated with those variables; this can happen in small samples like this one)

Testing for Omitted Variable
• Should age be included in the model: comparison between the model without age (the restricted model) and the model with age
• F-test (see the formula above): View – Coefficient Diagnostics – Omitted Variables Test

Correlations of Variables in Final Model
• Covariance Analysis: Ordinary; Sample: 1 81; Included observations: 80 (balanced sample, listwise missing value deletion)
• Entries: correlation, with the p-value in parentheses; lower triangle shown (diagonal = 1), values rounded to three decimals.

              PIHLAJISTO      PMAKI           LN_AGE          LN_PRICE        LN_SIZE         LIFT            COND_GOOD
PMAKI        -0.447 (0.000)
LN_AGE        0.127 (0.263)   0.360 (0.001)
LN_PRICE     -0.267 (0.017)  -0.210 (0.062)  -0.639 (0.000)
LN_SIZE       0.083 (0.465)   0.031 (0.782)  -0.052 (0.647)   0.474 (0.000)
LIFT          0.434 (0.000)   0.050 (0.659)   0.124 (0.272)  -0.185 (0.100)  -0.029 (0.800)
COND_GOOD    -0.100 (0.380)   0.147 (0.194)  -0.303 (0.006)   0.365 (0.001)  -0.017 (0.883)  -0.093 (0.411)
COND_SATISF   0.138 (0.224)  -0.130 (0.249)   0.281 (0.012)  -0.298 (0.007)   0.082 (0.472)   0.155 (0.169)  -0.925 (0.000)

Illustration of Omitted Variable Bias
• From Fuerst-Oikarinen-Harjunen (2016)*
• We are interested in the influence of energy rating on housing prices
• High-rated flats are, on average, 18% more expensive than D-rated flats
• However, the price difference is clearly to a major extent due to some other variables, i.e. other differences in characteristics between the high-rated and average (D-rated) flats → the observed price premium from being high-rated drops to 1.3%!
* Article included in Moodle: "Additional Material"