Simple Linear Regression Analysis PDF
Document Details
Uploaded by MemorableHonor
Summary
This document introduces simple linear regression analysis, a statistical technique used in business and economics for examining relationships between variables. It covers the model, inferences, diagnostics, remedial measures, and the matrix approach to linear regression.
Full Transcript
Simple Linear Regression Analysis

Introduction to Regression Analysis
Simple Linear Regression Model
Inferences in Regression Analysis
Diagnostics and Remedial Measures
Matrix Approach to Linear Regression Analysis

Introduction to Regression Analysis

Regression analysis is one of the most important and widely used statistical techniques in business and economic analysis for examining the functional relationships between two or more variables. One variable is specified to be the dependent/response variable (DV), denoted by $Y$, and the other one or more variables are called the independent/predictor/explanatory variables (IV), denoted by $X_i$, $i = 1, 2, \ldots, k$.

There are two different situations:
(a) $Y$ is a random variable and the $X_i$ are fixed, non-random variables; e.g., when predicting the sales of a company, the year is the fixed $X_i$ variable.
(b) Both $X_i$ and $Y$ are random variables; e.g., all survey data are of this type. In this situation, cases are selected randomly from the population, and both $X_i$ and $Y$ are measured.

Main Purposes

Regression analysis can be used for either of two main purposes:
(1) Descriptive: The kind of relationship and its strength are examined. This examination can be done graphically or by the use of descriptive equations. Tests of hypotheses and confidence intervals can serve to draw inferences regarding the relationship.
(2) Predictive: The equation relating $Y$ and $X_i$ can be used to predict the value of $Y$ for a given value of $X_i$. Prediction intervals can also be used to indicate a likely range of the predicted value of $Y$.

Description of Methods of Regression

The general form of a probabilistic model is

$$Y = \text{deterministic component} + \text{random error}$$

As you will see, the random error plays an important role in testing hypotheses and finding confidence intervals for the parameters in the model.

Simple regression analysis means that the value of the dependent variable $Y$ is estimated on the basis of only one independent variable: $Y = f(X) + \varepsilon$.
On the other hand, multiple regression is concerned with estimating the value of the dependent variable $Y$ on the basis of two or more independent variables: $Y = f(X_1, X_2, \ldots, X_k) + \varepsilon$, where $k \ge 2$.

Simple Linear Regression Model

We begin with the simplest of probabilistic models, the simple linear regression model. That is, $f(X)$ is a simple linear function of $X$, $f(X) = \beta_0 + \beta_1 X$. The model can be stated as follows:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, 2, \ldots, n$$

where
$Y_i$ is the value of the response variable in the $i$th trial;
$X_i$ is a non-random variable, the value of the predictor variable in the $i$th trial;
$\varepsilon_i$ is a random error with $E(\varepsilon_i) = 0$, $\mathrm{var}(\varepsilon_i) = \sigma^2$, and $\mathrm{cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \ne j$;
$\beta_0$ and $\beta_1$ are parameters.

Important Features of the Model

(1) The response $Y_i$ in the $i$th trial is the sum of two components: the constant term $\beta_0 + \beta_1 X_i$ and the random term $\varepsilon_i$. Hence, $Y_i$ is a random variable.
(2) $E(Y_i) = \beta_0 + \beta_1 X_i$.
(3) $Y_i$ in the $i$th trial exceeds or falls short of the value of the regression function by the error term amount $\varepsilon_i$.
(4) $\mathrm{var}(Y_i) = \mathrm{var}(\varepsilon_i) = \sigma^2$. Thus, the regression model assumes that the probability distributions of $Y$ have the same variance $\sigma^2$, regardless of the level of the predictor variable $X$.
(5) Since the error terms $\varepsilon_i$ and $\varepsilon_j$ are uncorrelated, so are $Y_i$ and $Y_j$.
(6) In summary, the regression model implies that the responses $Y_i$ come from probability distributions whose means are $E(Y_i) = \beta_0 + \beta_1 X_i$ and whose variances are $\mathrm{var}(Y_i) = \sigma^2$, the same for all levels of $X$. Further, any two responses $Y_i$ and $Y_j$ are uncorrelated.

Estimating the Model Parameters

$E(\varepsilon) = 0$ is equivalent to saying that $E(Y)$ equals the deterministic component of the model. That is, $E(Y) = \beta_0 + \beta_1 X$, where the constants $\beta_0$ and $\beta_1$ are the population parameters. This is called the population regression equation (line). Denoting the estimates of $\beta_0$ and $\beta_1$ by $\hat{\beta}_0 = b_0$ and $\hat{\beta}_1 = b_1$ respectively, we can then estimate $E(Y)$ by $\hat{Y}$ from the sample regression equation (or the fitted regression line) $\hat{Y} = b_0 + b_1 X$.
The problem of fitting a line to a sample of points is essentially the problem of efficiently estimating the parameters $\beta_0$ and $\beta_1$ by $b_0$ and $b_1$ respectively. The best known method for doing this is called the least squares method (LSM).

The Least Squares Method

The principle of least squares is illustrated in the following figure.

[Figure: actual $Y$ values, the estimated line $\hat{Y} = b_0 + b_1 x$, and the residuals $e_1, e_2, e_3, e_4$; axes $X$ and $Y$.]

For every observed $Y_i$ in a sample of points, there is a corresponding predicted value $\hat{Y}_i$, equal to $b_0 + b_1 X_i$. The sample deviation of the observed value $Y_i$ from the predicted $\hat{Y}_i$ is $e_i = Y_i - \hat{Y}_i$, called a residual; that is, $e_i = Y_i - b_0 - b_1 X_i$.

We shall find $b_0$ and $b_1$ so that the sum of the squares of the errors (residuals)

$$SSE = \sum e_i^2 = \sum (Y_i - \hat{Y}_i)^2 = \sum (Y_i - b_0 - b_1 X_i)^2$$

is a minimum. This minimization procedure for estimating the parameters is called the method of least squares.

Differentiating SSE with respect to $b_0$ and $b_1$, we have

$$\frac{\partial(SSE)}{\partial b_0} = -2 \sum (Y_i - b_0 - b_1 X_i), \qquad \frac{\partial(SSE)}{\partial b_1} = -2 \sum (Y_i - b_0 - b_1 X_i) X_i.$$

Setting the partial derivatives equal to zero and rearranging the terms, we obtain the equations (called the normal equations)

$$n b_0 + b_1 \sum X_i = \sum Y_i \quad \text{and} \quad b_0 \sum X_i + b_1 \sum X_i^2 = \sum X_i Y_i,$$

which may be solved simultaneously to yield computing formulas for $b_0$ and $b_1$ as follows:

$$b_1 = \frac{SS_{xy}}{SS_{xx}} \; \left(= r \, \frac{s_y}{s_x}\right), \qquad b_0 = \bar{Y} - b_1 \bar{X},$$

where

$$SS_{xy} = \sum (X_i - \bar{X})(Y_i - \bar{Y}) = \sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{n}, \qquad SS_{xx} = \sum (X_i - \bar{X})^2 = \sum X_i^2 - \frac{(\sum X_i)^2}{n}.$$

Properties of Least Squares Estimators

(1) Gauss-Markov Theorem: Under the conditions of the regression model, the least squares estimators $b_0$ and $b_1$ are unbiased estimators (i.e., $E(b_0) = \beta_0$ and $E(b_1) = \beta_1$) and have minimum variance among all unbiased linear estimators.
(2) The estimated value of $Y$ (i.e., $\hat{Y} = b_0 + b_1 X$) is an unbiased estimator of $E(Y) = \beta_0 + \beta_1 X$, with minimum variance in the class of unbiased linear estimators.

Note that the common variance $\sigma^2$ cannot be estimated by LSM.
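The computing formulas above translate almost line by line into code. The following is a minimal sketch in plain Python; the function name `fit_simple_ols` and the five data points are made up for illustration and do not come from the course examples.

```python
def fit_simple_ols(x, y):
    """Least squares estimates (b0, b1) via b1 = SSxy / SSxx and b0 = ybar - b1 * xbar."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, with at least 2 points")
    sum_x, sum_y = sum(x), sum(y)
    # SSxy = sum(Xi*Yi) - (sum Xi)(sum Yi)/n ; SSxx = sum(Xi^2) - (sum Xi)^2/n
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    ss_xx = sum(xi * xi for xi in x) - sum_x ** 2 / n
    if ss_xx == 0:
        raise ValueError("all x values are identical, so the slope is undefined")
    b1 = ss_xy / ss_xx
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


# Hypothetical illustrative data (not from the source examples).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
b0, b1 = fit_simple_ols(x, y)
print(f"fitted line: y_hat = {b0:.3f} + {b1:.3f} x")
```

For these five points $SS_{xy} = 6$ and $SS_{xx} = 10$, so the slope is 0.6 and the intercept is $4 - 0.6 \times 3 = 2.2$.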
We can prove that the following statistic is an unbiased point estimator of $\sigma^2$ (you should try to prove it):

$$s^2 = \frac{SSE}{n-2} = \frac{SS_{yy} - b_1 SS_{xy}}{n-2}$$

Properties of Fitted Regression Line

(1) The sum of the residuals is zero: $\sum e_i = 0$.
(2) The sum of the squared residuals, $\sum e_i^2$, is a minimum.
(3) $\sum \hat{Y}_i = \sum Y_i$.
(4) $\sum X_i e_i = 0$.
(5) $\sum \hat{Y}_i e_i = 0$.
(6) The regression line always goes through the point $(\bar{X}, \bar{Y})$.

All properties can be proved directly by using the normal equations, $\sum (Y_i - b_0 - b_1 X_i) = 0$ and $\sum (Y_i - b_0 - b_1 X_i) X_i = 0$, or equivalently $n b_0 + b_1 \sum X_i = \sum Y_i$ and $b_0 \sum X_i + b_1 \sum X_i^2 = \sum X_i Y_i$.

Example 1

A random sample of 42 firms was chosen from the S&P500 firms listed in the Spring 2003 Special Issue of Business Week (The Business Week Fifty Best Performers). The dividend yield (DIVYIELD) and the 2002 earnings per share (EPS) were recorded for the 42 firms. These data are in a file named DIV3. Using dividend yield as the DV and EPS as the IV, plot the scatter diagram and run a regression using SPSS.
(a) Find the estimated regression line.
(b) Find the predicted values of the DV given EPS = 1 and EPS = 2.

Example 1: Solution

Coefficients (Dependent Variable: Divyield)

| Model 1 | B (unstandardized) | Std. Error | Beta (standardized) | t | Sig. |
|---|---|---|---|---|---|
| (Constant) | 2.034 | .541 | | 3.762 | .001 |
| EPS | .374 | .239 | .240 | 1.562 | .126 |

$\hat{y} = 2.034 + 0.374x$

Example 1: Scatter Diagram

[Figure: scatter diagram with the fitted line $\hat{y} = 2.034 + 0.374x$.]

Normal Error Regression Model

No matter what the form of the distribution of the error terms $\varepsilon_i$ (and hence of the $Y_i$), the LSM provides unbiased point estimators of $\beta_0$ and $\beta_1$ that have minimum variance among all unbiased linear estimators. To set up interval estimates and make tests, however, we need to make an assumption about the form of the distribution of the $\varepsilon_i$. The standard assumption is that the error terms $\varepsilon_i$ are normally distributed, and we will adopt it here.
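The properties of the fitted regression line listed above, and the two expressions for $s^2$, can be checked numerically. The sketch below uses plain Python; the helper name `fit_and_residuals` and the data are illustrative assumptions, not part of the course material.

```python
def fit_and_residuals(x, y):
    """Fit y_hat = b0 + b1*x by least squares; return b0, b1, fitted values, residuals."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - xbar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    b0 = ybar - b1 * xbar
    fitted = [b0 + b1 * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    return b0, b1, fitted, resid


# Hypothetical illustrative data (not from the source examples).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)
b0, b1, fitted, resid = fit_and_residuals(x, y)

sum_e = sum(resid)                                   # property (1): should be 0
sum_xe = sum(xi * e for xi, e in zip(x, resid))      # property (4): should be 0
sum_fe = sum(f * e for f, e in zip(fitted, resid))   # property (5): should be 0
line_at_xbar = b0 + b1 * (sum(x) / n)                # property (6): equals ybar

sse = sum(e * e for e in resid)
s2 = sse / (n - 2)                                   # unbiased estimator of sigma^2

# Second form of s^2: (SSyy - b1*SSxy) / (n - 2)
ybar, xbar = sum(y) / n, sum(x) / n
ss_yy = sum((yi - ybar) ** 2 for yi in y)
ss_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
s2_alt = (ss_yy - b1 * ss_xy) / (n - 2)

print(sum_e, sum_xe, sum_fe, line_at_xbar, s2, s2_alt)
```

Up to floating-point rounding, the three sums are zero and both forms of $s^2$ agree (0.8 for this data).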
Since the functional form of the probability distribution of the error terms is now specified, we can use the maximum likelihood method to obtain estimators of the parameters $\beta_0$, $\beta_1$ and $\sigma^2$. In fact, the MLE and LSE for $\beta_0$ and $\beta_1$ are the same. The MLE for $\sigma^2$ is biased: $\hat{\sigma}^2 = \sum e_i^2 / n = SSE/n = s^2 (n-2)/n$. A normal error term greatly simplifies the theory of regression analysis (see the comments on page 32).

Normality & Constant Variance Assumptions

[Figure: error densities $f(e)$ at $X_1$ and $X_2$ along the line $E(Y) = \beta_0 + \beta_1 X$; axes $X$ and $Y$.]

Inferences Concerning the Regression Coefficients

Aside from merely estimating the linear relationship between $X$ and $Y$ for purposes of prediction, we may also be interested in drawing certain inferences about the population parameters $\beta_0$ and $\beta_1$. To make inferences or test hypotheses concerning these parameters, we must know the sampling distributions of $b_0$ and $b_1$. (Note that $b_0$ and $b_1$ are statistics, i.e., functions of the random sample; therefore, they are random variables.)

Inferences Concerning $\beta_1$

(a) $b_1$ is a normal random variable for the normal error model.
(b) $E(b_1) = \beta_1$. That is, $b_1$ is an unbiased estimator of $\beta_1$.
(c) $\mathrm{Var}(b_1) = \sigma^2 / SS_{xx}$, which is estimated by $s^2(b_1) = s^2 / SS_{xx}$, where $s^2$ is the unbiased estimator of $\sigma^2$.
(d) The $(1 - \alpha)100\%$ confidence interval for $\beta_1$ ($\sigma^2$ unknown) is
$$b_1 - t_{\alpha/2}\, s(b_1) < \beta_1 < b_1 + t_{\alpha/2}\, s(b_1),$$
where $t_{\alpha/2}$ is a value of the t-distribution with $(n - 2)$ degrees of freedom, and $s(b_1)$ is the standard error of $b_1$, i.e., $s(b_1) = s / (SS_{xx})^{1/2}$.
(e) Hypothesis test of $\beta_1$: To test the null hypothesis $H_0: \beta_1 = 0$ against a suitable alternative, we can use the t-distribution with $n - 2$ degrees of freedom to establish a critical region and then base our decision on the value of $t = b_1 / s(b_1)$.

Inferences Concerning $\beta_0$

(a) $b_0$ is a normal random variable for the normal error model.
(b) $E(b_0) = \beta_0$. That is, $b_0$ is an unbiased estimator of $\beta_0$.
(c) $\mathrm{Var}(b_0) = \sigma^2 \sum X_i^2 / (n\, SS_{xx})$, which is estimated by $s^2(b_0) = s^2 \sum X_i^2 / (n\, SS_{xx})$, where $s^2$ is the unbiased estimator of $\sigma^2$.
(d) The $(1 - \alpha)100\%$ confidence interval for $\beta_0$ ($\sigma^2$ unknown) is
$$b_0 - t_{\alpha/2}\, s(b_0) < \beta_0 < b_0 + t_{\alpha/2}\, s(b_0),$$
where $t_{\alpha/2}$ is a value of the t-distribution with $(n - 2)$ degrees of freedom, and $s(b_0) = s \left( \sum X_i^2 / (n\, SS_{xx}) \right)^{1/2}$.
(e) Hypothesis test of $\beta_0$: To test the null hypothesis $H_0: \beta_0 = 0$ against a suitable alternative, we can use the t-distribution with $n - 2$ degrees of freedom to establish a critical region and then base our decision on the value of $t = b_0 / s(b_0)$.

Some Considerations: Effects of Departures From Normality

If the probability distributions of $Y$ are not exactly normal but do not depart seriously from normality, the sampling distributions of $b_0$ and $b_1$ will be approximately normal. Even if the distributions of $Y$ are far from normal, the estimators $b_0$ and $b_1$ generally have the property of asymptotic normality as the sample size increases. Thus, with sufficiently large samples, the confidence intervals and decision rules given earlier still apply even if the probability distributions of $Y$ depart far from normality.

Inferences Concerning E(Y)

(1) The sampling distribution of $\hat{Y}_i$ is normal for the normal error model.
(2) $\hat{Y}_i$ is an unbiased estimator of $E(Y_i)$, because $E(Y_i) = \beta_0 + \beta_1 X_i$ and $E(\hat{Y}_i) = E(b_0 + b_1 X_i) = \beta_0 + \beta_1 X_i = E(Y_i)$.
(3) The variance of $\hat{Y}_i$ is
$$\mathrm{var}(\hat{Y}_i) = \sigma^2 \left[ \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SS_{xx}} \right],$$
and the estimated variance of $\hat{Y}_i$ is
$$s^2(\hat{Y}_i) = s^2 \left[ \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SS_{xx}} \right].$$
(4) The $(1 - \alpha)100\%$ confidence interval for the mean response $E(Y_i)$ is
$$\hat{Y}_i - t_{\alpha/2,\,(n-2)}\, s(\hat{Y}_i) < E(Y_i) < \hat{Y}_i + t_{\alpha/2,\,(n-2)}\, s(\hat{Y}_i).$$

Note that the confidence limits for $E(Y_i)$ are not sensitive to moderate departures from the assumption that the error terms are normally distributed. Indeed, the limits are not sensitive to substantial departures from normality if the sample size is large.

Prediction of New Observation

The distinction between estimation of the mean response $E(Y_i)$, discussed in the preceding section, and prediction of a new response $Y_{i(new)}$, discussed now, is basic.
In the former case, we estimate the mean of the distribution of $Y$. In the present case, we predict an individual outcome drawn from the distribution of $Y$.

Prediction Interval for $Y_{i(new)}$

When the regression parameters are unknown, they must be estimated. The mean of the distribution of $Y$ is estimated by $\hat{Y}$, as usual, and the variance of the distribution of $Y$ is estimated by MSE (i.e., $s^2$). From the figure on the next slide, we can see that there are two probability distributions of $Y$, corresponding to the upper and lower limits of a confidence interval for $E(Y)$.

Prediction Interval

[Figure: prediction limits for $Y_{i(new)}$ drawn for $E(Y_i)$ located at either end of the confidence limits for $E(Y_i)$.]

Prediction Interval (cont.)

Since we cannot be certain of the location of the distribution of $Y$, prediction limits for $Y_{i(new)}$ clearly must take account of two elements: (a) variation in the possible location of the distribution of $Y$; and (b) variation within the probability distribution of $Y$. That is,

$$\mathrm{var}(pred_i) = \mathrm{var}(Y_{i(new)} - \hat{Y}_i) = \mathrm{var}(Y_{i(new)}) + \mathrm{var}(\hat{Y}_i) = \sigma^2 + \mathrm{var}(\hat{Y}_i).$$

An unbiased estimator of $\mathrm{var}(pred_i)$ is

$$s^2(pred_i) = s^2 + s^2(\hat{Y}_i) = s^2 \left[ 1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SS_{xx}} \right].$$

The $(1 - \alpha)100\%$ prediction interval for $Y_{i(new)}$ is

$$\hat{Y}_i - t_{\alpha/2,\,(n-2)}\, s(pred_i) < Y_{i(new)} < \hat{Y}_i + t_{\alpha/2,\,(n-2)}\, s(pred_i).$$

Comments on Prediction Interval

The prediction limits, unlike the confidence limits for a mean response $E(Y_i)$, are sensitive to departures from normality of the error term distribution.

Prediction intervals resemble confidence intervals. However, they differ conceptually. A confidence interval represents an inference on a parameter and is an interval that is intended to cover the value of the parameter. A prediction interval, on the other hand, is a statement about the value to be taken by a random variable, the new observation $Y_{i(new)}$.

Hyperbolic Interval Bands

[Figure: hyperbolic interval bands around the fitted line; axes $X$ and $Y$, with $\bar{X}$ and $X_{given}$ marked.]

Example 2

The vice-president of marketing for a large firm is concerned about the effect of advertising on sales of the firm's major product.
To investigate the relationship between advertising and sales, data on the two variables were gathered from a random sample of 20 sales districts. These data are available in a file named SALESAD3. Sales (DV) and advertising (IV) are both expressed in hundreds of dollars.
(a) What is the sample regression equation relating sales to advertising?
(b) Is there a linear relationship between sales and advertising?
(c) What conclusion can be drawn from the test result?
(d) Find the 95% confidence interval estimate for the mean value of the DV given that IV = 410.
(e) Find the 95% prediction interval for the individual value of the DV given that IV = 410.
(f) Construct a 95% confidence interval estimate of $\beta_1$.

Example 2: SPSS Outputs

Model Summary (Predictors: (Constant), adv; Dependent Variable: sales)

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .930 | .864 | .857 | 594.80820 |

Coefficients (Dependent Variable: sales)

| Model 1 | B (unstandardized) | Std. Error | Beta (standardized) | t | Sig. | 95% CI for B, Lower Bound | 95% CI for B, Upper Bound |
|---|---|---|---|---|---|---|---|
| (Constant) | -57.281 | 509.750 | | -.112 | .912 | -1128.2 | 1013.7 |
| Adv | 17.570 | 1.642 | .930 | 10.702 | .000 | 14.121 | 21.019 |

$\hat{y} = -57.281 + 17.57x$

Example 2: Scatter Plot

[Figure: scatter plot with the fitted line $\hat{y} = -57.281 + 17.57x$.]

The Coefficient of Determination

In many regression problems, the major reason for constructing the regression equation is to obtain a tool that is useful in predicting the value of the dependent variable $Y$ from some known value of the independent variable $X$. Thus, we often wish to assess the accuracy of the regression line in predicting the $Y$ values. $R^2$, called the coefficient of determination, provides a summary measure of how well the regression line fits the sample. It has a proportional reduction in error interpretation.
That is, $R^2$ is the proportion of the variability in the dependent variable that is explained by the independent variable (see the figure), namely,

$$R^2 = \frac{\text{Sum of squares due to regression}}{\text{Total sum of squares}}$$

Partitioning Variation

[Figure: partitioning of the variation in $Y$.]

Partitioning Variation (Cont.)

The variation in the dependent variable $Y$ can be partitioned into two parts: variation explained by the regression and unexplained variation.

Total Variation = Explained Variation + Unexplained Variation

The total sum of squares is $SST = \sum (Y_i - \bar{Y})^2$. SS(Total) can be subdivided into two components:
SSR = the sum of squares due to regression (explained variation)
SSE = the sum of squares due to error (unexplained variation).

That is, $SST = SSR + SSE$, namely,

$$\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2$$

Computing Formulas

The various sums of squares may be found more simply by using the following formulas.

$$SST = SS_{yy} = \sum (Y_i - \bar{Y})^2 = \sum Y_i^2 - \frac{(\sum Y_i)^2}{n}$$
$$SSR = \sum (\hat{Y}_i - \bar{Y})^2 = b_1\, SS_{xy}$$
$$SSE = \sum (Y_i - \hat{Y}_i)^2 = SST - SSR$$

Now we can calculate $R^2$ by using the following equation:

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = \frac{b_1\, SS_{xy}}{SS_{yy}}, \qquad 0 \le R^2 \le 1.$$

The computations are usually summarized in tabular form (ANOVA table).

ANOVA Table for Simple Regression

While the t-test is used to test the significance of individual independent variables, the ANOVA table provides an overall test of the significance of the whole set of independent variables. The test is an F-test with d.f. $(k, n-k-1)$, where $k$ is the number of independent variables in the model:

$$F = \frac{MSR}{MSE} = \frac{R^2 / k}{(1 - R^2)/(n - k - 1)},$$

which for simple linear regression ($k = 1$) reduces to $F = (n-2)R^2 / (1 - R^2)$. For the simple linear regression model, the F-test is equivalent to the t-test for the parameter $\beta_1$. This is not the case for the multiple regression model.

Example 2 (Cont.)

(a) Find SST, SSR, SSE, and $R^2$.
(b) Present an ANOVA summary table.
(c) Test the hypothesis $H_0: \beta_1 = 0$ against $H_a: \beta_1 \ne 0$ by using an F-statistic. Let $\alpha = 0.05$.
Solution:

ANOVA (Predictors: (Constant), adv; Dependent Variable: sales)

| Model 1 | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Regression | 4.052E7 | 1 | 4.052E7 | 114.539 | .000 |
| Residual | 6368342.383 | 18 | 353796.799 | | |
| Total | 4.689E7 | 19 | | | |

Description of Methods of Regression: Case When X is Random

For the variable-X case, both $X$ and $Y$ are random variables measured on cases that are randomly selected from a population. The fixed-X regression model applies in this case when we treat the $X$ values as if they were pre-selected. This technique is justifiable theoretically by conditioning on the $X$ values that happened to be obtained in the sample (Textbook page 83). Therefore all the previous discussion and formulas are precisely the same for this case as for the fixed-X case.

Since both $X$ and $Y$ are considered random variables, other parameters can be useful for describing the model, namely the covariance of $X$ and $Y$, denoted by $\sigma_{XY}$ (or $\mathrm{Cov}(X, Y)$), and the correlation coefficient, denoted by $\rho$, which are measures of how the two variables vary together.

Correlation Coefficient

The correlation coefficient $\rho = \sigma_{XY} / (\sigma_X \sigma_Y)$ is a measure of the direction and the strength of linear association between two variables. It is dimensionless, and it may take any value between $-1$ and $1$, inclusive.

A positive correlation (i.e., $\rho > 0$) means that as one variable increases, the other likewise increases. A negative correlation (i.e., $\rho < 0$) means that as one variable increases, the other decreases. If $\rho = 0$ for two variables, then we say that the variables are uncorrelated and that there is no linear association between them.

Note that $\rho$ measures only linear relationship. The variables may be perfectly related in a curvilinear relationship even if $\rho = 0$.

Correlation Coefficient and $R^2$

The sample correlation coefficient $r$ is an estimator for $\rho$. The equation for the sample correlation coefficient is

$$r = \frac{SS_{xy}}{\left[ (SS_{xx})(SS_{yy}) \right]^{1/2}}.$$

Simple regression techniques and correlation methods are related.
In correlation, $r$ is an estimator for the population correlation coefficient $\rho$. In regression, $r^2 = R^2$ is simply a measure of closeness of fit. Thus the sample correlation coefficient $r$ is used to estimate the direction and the strength of the linear relationship between two variables, whereas the coefficient of determination $r^2 = R^2$ is the proportion of the squared error that the regression equation can explain when we use the regression equation rather than the sample mean as a predictor.

Test of Coefficient of Correlation

Note that tests of hypotheses and confidence intervals for the variable-X case require that $X$ and $Y$ be jointly normally distributed. That is, $X$ and $Y$ follow a bivariate normal distribution. Under this assumption, we can test whether there is a linear relationship between the $X$ and $Y$ variables (i.e., whether $\rho = 0$) by using the following t-test. The same conclusion as for testing the population slope $\beta_1$ will be drawn.
(1) $H_0: \rho = 0$ against $H_a: \rho \ne 0$ (or $\rho > 0$ or $\rho < 0$).
(2) The test statistic is
$$t = \frac{r \sqrt{n-2}}{\sqrt{1 - r^2}}.$$
Under $H_0$ the statistic $t$ has the t-distribution with $(n-2)$ degrees of freedom.

Example 2 (cont.)

Use the data in the example to test whether there is a significant linear relationship between sales and advertising expense (both in hundreds of dollars). Use $\alpha = 0.05$.

Solution:
(1) $H_0: \rho = 0$ against $H_a: \rho \ne 0$.
(2) $\alpha = 0.05$, $n = 20$, $df = n - 2 = 18$ and $t_{0.025,\,18} = 2.101$.
(3) The rejection rule: if $|t| > 2.101$, then reject $H_0$.
(4) Computations: $r = SS_{xy} / (SS_{xx}\, SS_{yy})^{1/2} = 0.9296$. The test statistic is
$$t = \frac{0.9296 \sqrt{18}}{\sqrt{1 - 0.9296^2}} = 10.701.$$
(5) We reject $H_0$ at $\alpha = 0.05$ since $t = 10.701 > 2.101$ and conclude that there is a significant linear relationship between sales and advertising expense.

Further Examination of Computer Output

Standardized Regression Coefficient

The standardized regression coefficient is the slope in the regression equation if $X$ and $Y$ are standardized.
After standardization the intercept in the regression equation will be zero, and for simple linear regression the standardized slope will be equal to the correlation coefficient $r$. In multiple regression, the standardized regression coefficients help quantify the relative contribution of each $X$ variable.

Coefficients (Dependent Variable: sales)

| Model 1 | B (unstandardized) | Std. Error | Beta (standardized) | t | Sig. | 95% CI for B, Lower Bound | 95% CI for B, Upper Bound |
|---|---|---|---|---|---|---|---|
| (Constant) | -57.281 | 509.750 | | -.112 | .912 | -1128.227 | 1013.665 |
| Adv | 17.570 | 1.642 | .930 | 10.702 | .000 | 14.121 | 21.019 |

Here the standardized coefficient Beta = .930 is the sample correlation coefficient $r$.

Checking for Violations of Assumptions

We usually do not know in advance whether a linear regression model is appropriate for our data set. Therefore, it is necessary to check whether the necessary assumptions are violated. The analysis of the residuals is frequently a helpful tool for this purpose. The basic principles apply to all statistical models discussed in this course.

Residuals: In model building, a residual is what is left after the model is fit. It is the difference between an observed value of $Y$ and the predicted value of $Y$, i.e., $\text{Residual}_i = e_i = Y_i - \hat{Y}_i$. In regression analysis, the true errors are assumed to be independent normal variables with a mean of 0 and a constant variance of $\sigma^2$. If the model is appropriate for the data, the residuals $e_i$, which are estimates of the true errors, should have similar characteristics. (Refer to pages 102~103.)

Identification of equality of variance: Scatter plots can also be used to detect whether the assumption of constant variance of $y$ for all values of $x$ is being violated. If the spread of the residuals increases or decreases with the values of the independent variable or with the predicted values, then the assumption of homogeneity of variance is being violated.
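As a rough numerical counterpart to the residual plot described above, one can compare the typical size of the residuals at low and high values of $x$. The sketch below is only an informal check, not one of the formal tests for constant variance; the helper names and the synthetic data (errors whose size grows with $x$) are assumptions made for illustration.

```python
def residuals_from_fit(x, y):
    """Residuals e_i = Y_i - (b0 + b1*X_i) from a least squares fit."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    ss_xx = sum((xi - xbar) ** 2 for xi in x)
    ss_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = ss_xy / ss_xx
    b0 = ybar - b1 * xbar
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


def spread_ratio(x, resid):
    """Mean |residual| in the upper half of x divided by that in the lower half."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    half = len(order) // 2
    lower = [abs(resid[i]) for i in order[:half]]
    upper = [abs(resid[i]) for i in order[half:]]
    return (sum(upper) / len(upper)) / (sum(lower) / len(lower))


# Synthetic data (hypothetical): the error magnitude is proportional to x,
# so the constant-variance assumption is violated by construction.
x = [float(i) for i in range(1, 11)]
noise = [0.2 * xi * (1 if i % 2 == 0 else -1) for i, xi in enumerate(x)]
y = [1.0 + 2.0 * xi + ei for xi, ei in zip(x, noise)]

resid = residuals_from_fit(x, y)
ratio = spread_ratio(x, resid)
print(f"upper-half / lower-half mean |residual| = {ratio:.2f}")
```

A ratio far from 1 only suggests that the spread changes with $x$; the residual plot itself remains the primary diagnostic.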
Identification of independence: Usually this assumption is relatively easy to meet since observations appear in a random position, and hence successive error terms are also likely to be random. However, in time series data or repeated measures data, this problem of dependence between successive error terms often occurs.

Checking for Violations of Assumptions (Cont.)

Identification of normality: A critical assumption of the simple linear regression model is that the error terms associated with each $x_i$ have a normal distribution. Note that it is unreasonable to expect the observed residuals to be exactly normal; some deviation is expected because of sampling variation. Even if the errors are normally distributed in the population, sample residuals are only approximately normal.

Another way to compare the observed distribution of residuals to that expected under the assumption of normality is to plot the two cumulative distributions against each other for a series of points. If the two distributions are identical, a straight line results. This is called a P-P plot (a cumulative probability plot).

Checking for Violations of Assumptions (Cont.)

Identification of linearity: For simple regression, a scatter plot gives a good indication of how well a straight line fits the data. Another convenient method is to plot the residuals against the predicted values. If the assumptions of linearity and homogeneity of variance are met, there should be no relationship between the predicted and residual values, i.e., the residuals should be randomly distributed around the horizontal line through zero. You should be suspicious of any observable pattern.

Identification of outliers: In combination with a scatter plot of the observed dependent and independent variables, the plot of residuals can be used to identify observations that appear to fall a long way from the main cluster of observations (a residual that is larger than $3s$ in absolute value is an outlier).
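The $3s$ rule for outliers mentioned above can be sketched as follows, with $s = \sqrt{SSE/(n-2)}$. The function name `flag_outliers` and the synthetic data (20 points on a line, one of them shifted) are illustrative assumptions.

```python
import math


def flag_outliers(x, y, threshold=3.0):
    """Return (indices with |e_i| > threshold * s, s) for a simple least squares fit."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    ss_xx = sum((xi - xbar) ** 2 for xi in x)
    ss_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = ss_xy / ss_xx
    b0 = ybar - b1 * xbar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(e * e for e in resid) / (n - 2))  # s^2 = SSE / (n - 2)
    flagged = [i for i, e in enumerate(resid) if abs(e) > threshold * s]
    return flagged, s


# Synthetic data (hypothetical): points on y = 3 + 2x, with one observation shifted.
x = [float(i) for i in range(1, 21)]
y = [3.0 + 2.0 * xi for xi in x]
y[9] += 10.0  # the observation at x = 10 is moved 10 units off the line

flagged, s = flag_outliers(x, y)
print("flagged indices:", flagged, " s =", round(s, 3))
```

Because an outlier also inflates $s$, the rule can miss outliers in very small samples; the sample here is large enough for the shifted point to be flagged.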
Overview of Tests Involving Residuals

Tests for randomness in the residuals: runs test
Tests for autocorrelation in the residuals in time order: Durbin-Watson test
Tests for normality: correlation test (Shapiro-Wilk test), chi-square test, Kolmogorov test
Tests for constancy of error variance: Brown-Forsythe (modified Levene) test*, Cook-Weisberg (Breusch-Pagan) test*
F-test for lack of fit: tests whether a linear regression function is a good fit for the data*
(Note that the tests marked with * are valid only for large samples or under strong assumptions.)

Overview of Remedial Measures

If the linear regression normal error model is not appropriate for a data set, there are two basic choices:
(1) Abandon the model and develop and use a more appropriate model (non-normal, nonlinear models).
(2) Employ some transformation(s) on the data.

Transformations:
Transformations for nonlinear relation
Transformations for nonnormality and unequal variances
Box-Cox transformations

What to Watch Out For

In the development of the theory for linear regression, the sample is assumed to be obtained randomly in such a way that it represents the whole population you are studying. Often, convenience samples, which are samples of easily available cases, are taken for economic or other reasons. Such samples are likely to lead to an underestimate of the variance and possibly to bias in the regression line. The lack of randomness in the sample can seriously invalidate our inferences. Confidence intervals are often optimistically narrow because the sample is not truly a random one from the whole population to which we wish to generalize.

What to Watch Out For (Cont.)

Association versus causality: A common mistake made when using regression analysis is to assume that a strong fit (high $R^2$) of a regression of $Y$ on $X$ automatically means that "X causes Y". (1) The reverse could be true: $Y$ causes $X$. (2) There may be a third variable related to both $X$ and $Y$.

Forecasting outside the range of the explanatory variables.
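Of the residual tests listed in the overview above, the Durbin-Watson test is based on a statistic that is simple to compute from residuals taken in time order, $d = \sum_{t=2}^{n} (e_t - e_{t-1})^2 / \sum_{t=1}^{n} e_t^2$. The sketch below computes only the statistic; the function name and the toy residual sequences are assumptions, and the critical values needed for a formal decision come from Durbin-Watson tables, which are not reproduced here.

```python
def durbin_watson(resid):
    """Durbin-Watson statistic d = sum((e_t - e_{t-1})^2) / sum(e_t^2), residuals in time order."""
    if len(resid) < 2:
        raise ValueError("need at least two residuals")
    denom = sum(e * e for e in resid)
    if denom == 0:
        raise ValueError("all residuals are zero")
    numer = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    return numer / denom


# Toy residual sequences (hypothetical), chosen to show the two extremes.
alternating = [1.0, -1.0, 1.0, -1.0]   # sign flips every period: negative autocorrelation
persistent = [1.0, 1.0, -1.0, -1.0]    # neighbors tend to agree: positive autocorrelation

d_alt = durbin_watson(alternating)
d_per = durbin_watson(persistent)
print(d_alt, d_per)
```

The statistic lies between 0 and 4. Values near 2 are consistent with no first-order autocorrelation, values well below 2 point toward positive autocorrelation, and values well above 2 toward negative autocorrelation.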
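Matrix Approach to Simple Linear Regression Analysis

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, 2, \ldots, n$$

This implies

$$y_1 = \beta_0 + \beta_1 x_1 + \varepsilon_1, \quad y_2 = \beta_0 + \beta_1 x_2 + \varepsilon_2, \quad \ldots, \quad y_n = \beta_0 + \beta_1 x_n + \varepsilon_n.$$

Let $Y_{n\times 1} = (y_1, y_2, \ldots, y_n)'$, $X_{n\times 2} = [1_{n\times 1}, (x_1, x_2, \ldots, x_n)']$, $\beta_{2\times 1} = (\beta_0, \beta_1)'$ and $\varepsilon_{n\times 1} = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n)'$. Then the normal model in matrix terms is

$$Y_{n\times 1} = X_{n\times 2}\, \beta_{2\times 1} + \varepsilon_{n\times 1}, \quad \text{or simply} \quad Y = X\beta + \varepsilon,$$

where $\varepsilon$ is a vector of independent normal variables with $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \mathrm{Var}(Y) = \sigma^2 I$.

LS Estimation in Matrix Terms

Normal equations: The equations $n b_0 + b_1 \sum X_i = \sum Y_i$ and $b_0 \sum X_i + b_1 \sum X_i^2 = \sum X_i Y_i$ in matrix terms are

$$X'X b = X'Y, \quad \text{where } b = (b_0, b_1)'.$$

Estimated regression coefficients:

$$(X'X)^{-1} X'X b = (X'X)^{-1} X'Y \quad \Rightarrow \quad b = (X'X)^{-1} X'Y$$

LSM in matrix notation:

$$Q = \sum [Y_i - (\beta_0 + \beta_1 X_i)]^2 = (Y - X\beta)'(Y - X\beta) = Y'Y - \beta'X'Y - Y'X\beta + \beta'X'X\beta = Y'Y - 2\beta'X'Y + \beta'X'X\beta$$

$$\frac{\partial Q}{\partial \beta} = \left[ \frac{\partial Q}{\partial \beta_0}, \frac{\partial Q}{\partial \beta_1} \right]' = -2X'Y + 2X'X\beta$$

Equating to the zero vector, dividing by 2, and substituting $b$ for $\beta$ gives $b = (X'X)^{-1} X'Y$.

Fitted Values and Residuals in Matrix Terms

Fitted values: $\hat{Y} = Xb = HY$, where $H = X(X'X)^{-1}X'$.

Residuals: $e = Y - \hat{Y} = (I - H)Y$.

Variance-covariance matrix: Since $I - H$ is symmetric and idempotent,

$$\mathrm{Var}(e) = \mathrm{Var}[(I - H)Y] = (I - H)\, \mathrm{Var}(Y)\, (I - H)' = (I - H)\, \sigma^2 I\, (I - H)' = \sigma^2 (I - H),$$

which is estimated by $s^2(e) = MSE\, (I - H)$.

ANOVA in Matrix Terms

With $J$ denoting the $n \times n$ matrix of ones,

$$SS(Total) = \sum Y_i^2 - \frac{(\sum Y_i)^2}{n} = Y'Y - \frac{Y'JY}{n}$$
$$SSE = e'e = (Y - Xb)'(Y - Xb) = Y'Y - b'X'Y$$
$$SSR = b'X'Y - \frac{Y'JY}{n}$$

Note that $Xb = HY$ and $b'X' = (Xb)' = (HY)' = Y'H$; then

$$SS(T) = Y'(I - J/n)Y = Y'A_1Y, \quad SSE = Y'(I - H)Y = Y'A_2Y, \quad SSR = Y'(H - J/n)Y = Y'A_3Y.$$

Since $A_1$, $A_2$ and $A_3$ are symmetric, SS(T), SSE and SSR are quadratic forms of the $Y_i$. Quadratic forms play an important role in statistics because all sums of squares in the ANOVA for linear statistical models can be expressed as quadratic forms.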
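The matrix formulas above can be traced step by step for a small data set. The sketch below uses plain Python lists as matrices so that each product is visible; the helper functions (`transpose`, `matmul`, `inverse_2x2`) and the data are illustrative assumptions, and in practice a numerical library would normally handle the linear algebra.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("X'X is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


# Hypothetical illustrative data (not from the source examples).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(y)

X = [[1.0, xi] for xi in x]      # n x 2 design matrix [1, x]
Y = [[yi] for yi in y]           # n x 1 response vector
Xt = transpose(X)

XtX_inv = inverse_2x2(matmul(Xt, X))
XtY = matmul(Xt, Y)
b = matmul(XtX_inv, XtY)                 # b = (X'X)^{-1} X'Y, a 2 x 1 matrix

H = matmul(matmul(X, XtX_inv), Xt)       # hat matrix H = X (X'X)^{-1} X'
fitted = matmul(H, Y)                    # Y_hat = H Y

yty = sum(yi * yi for yi in y)
sse = yty - sum(bj[0] * v[0] for bj, v in zip(b, XtY))   # SSE = Y'Y - b'X'Y
sst = yty - sum(y) ** 2 / n                              # SST = Y'Y - Y'JY/n
ssr = sst - sse
trace_H = sum(H[i][i] for i in range(n))                 # equals the number of parameters

print("b =", [round(v[0], 4) for v in b], " SSE =", round(sse, 4), " SSR =", round(ssr, 4))
```

For this data the matrix route reproduces the scalar formulas: $b = (2.2, 0.6)'$, $SSE = 2.4$ and $SSR = 3.6$.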
Inferences in Matrix Terms

The variance-covariance matrix of $b$ is $\mathrm{Var}(b) = \sigma^2 (X'X)^{-1}$. The estimated variance-covariance matrix of $b$ is $s^2(b) = MSE\, (X'X)^{-1}$.

Mean response: Let $X_h = (1, x_h)'$. Then $\mathrm{Var}(\hat{Y}_h) = \sigma^2 X_h'(X'X)^{-1} X_h$. The estimated variance of $\hat{Y}_h$ in matrix notation is $s^2(\hat{Y}_h) = MSE\, (X_h'(X'X)^{-1} X_h)$.

Prediction of new observation: $s^2(pred) = MSE\, (1 + X_h'(X'X)^{-1} X_h)$.

Multiple Regression Analysis (I)

General Multiple Linear Regression Model
Linear Regression Model in Matrix Terms
Estimation and Inferences in Matrix Terms
ANOVA Results
$R^2$ and Adjusted $R^2$
Indicator (Dummy) Variables
Partial F-test
Beta (Standardized) Coefficients
Coefficients of Partial Determination
Multicollinearity and Its Effects
Polynomial Regression Models

General Multiple Linear Regression Model

In most research problems where regression analysis is applied, more than one independent variable is needed in the regression model. When this model is linear in the coefficients, it is called a multiple linear regression model. The fundamental principles of the simple linear regression model can be extended to the multiple regression model. The model, called the general linear regression model, can be written as follows:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik} + \varepsilon_i,$$

where
$X_{ij}$ ($j = 1, 2, \ldots, k$; $i = 1, 2, \ldots, n$) are independent variables;
$\beta_j$ ($j = 0, 1, 2, \ldots, k$) are parameters (partial coefficients);
$\varepsilon_i$ ($i = 1, 2, \ldots, n$) are independent $N(0, \sigma^2)$.

This implies that the $Y_i \sim N(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik},\ \sigma^2)$ and are independent.

Various Examples of General Linear Regression Model

First-order model: When $X_1, \ldots, X_k$ represent $k$ different predictor variables, the general linear regression model is called a first-order model, in which there are no interaction effects between the predictor variables.

Interaction model: For instance, $Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i1} X_{i2} + \varepsilon_i$.

Second-order model: A second-order model, e.g., for the case of two independent variables, is defined as
$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i1} X_{i2} + \beta_4 X_{i1}^2 + \beta_5 X_{i2}^2 + \varepsilon_i.$$

Polynomial model: $Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \cdots + \beta_m X_i^m + \varepsilon_i$.

General Linear Regression Model in Matrix Terms

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik} + \varepsilon_i, \quad i = 1, 2, \ldots, n$$

Let $Y_{n\times 1} = (Y_1, Y_2, \ldots, Y_n)'$, $X_{n\times(k+1)} = [1_{n\times 1}, X_1, X_2, \ldots, X_k]$ (where $X_j$ is the column of values of the $j$th predictor), $\beta_{(k+1)\times 1} = (\beta_0, \beta_1, \ldots, \beta_k)'$ and $\varepsilon_{n\times 1} = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n)'$. Then the general linear regression model in matrix terms is

$$Y_{n\times 1} = X_{n\times(k+1)}\, \beta_{(k+1)\times 1} + \varepsilon_{n\times 1}, \quad \text{or simply} \quad Y = X\beta + \varepsilon,$$

where $\varepsilon$ is a vector of independent normal variables with $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2 I$. Consequently, $Y$ is a vector of independent normal variables with $E(Y) = X\beta$ and $\mathrm{Var}(Y) = \sigma^2 I$.

Estimation in Matrix Terms

The least squares normal equations: $X'Xb = X'Y$, where $b = (b_0, b_1, b_2, \ldots, b_k)'$.

Estimated regression coefficients (LSE and MLE): $b = (X'X)^{-1} X'Y$.

Properties of the estimators: They are minimum variance unbiased, consistent, and sufficient.

Fitted values: $\hat{Y} = Xb = HY$, where $H = X(X'X)^{-1}X'$.

Residuals: $e = Y - \hat{Y} = (I - H)Y$.

Variance-covariance matrix: $\mathrm{Var}(e) = \sigma^2 (I - H)$, which is estimated by $s^2(e) = MSE\, (I - H)$.

Inferences in Matrix Terms

The variance-covariance matrix of $b$ is $\mathrm{Var}(b) = \sigma^2 (X'X)^{-1}$. The estimated variance-covariance matrix of $b$ is $s^2(b) = MSE\, (X'X)^{-1} = s^2 (X'X)^{-1}$.

Inferences: $b_i$ is a normally distributed random variable for the normal model. The $(1 - \alpha)100\%$ confidence interval for $\beta_i$ is

$$b_i - t_{\alpha/2}\, s(b_i) < \beta_i < b_i + t_{\alpha/2}\, s(b_i),$$

where $t_{\alpha/2}$ is a value of the t-distribution with $df = (n - k - 1)$.
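For more than one predictor, the normal equations $X'Xb = X'Y$ form a $(k+1) \times (k+1)$ linear system. The sketch below builds $X'X$ and $X'Y$ and solves the system by Gaussian elimination in plain Python; the function names and the noise-free data are assumptions for illustration, and solving the system directly (rather than forming $(X'X)^{-1}$) is a choice made here, not something prescribed by the text above.

```python
def solve_linear_system(a, rhs):
    """Solve a z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(a, rhs)]   # augmented matrix [a | rhs]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular (perfectly collinear predictors?)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back substitution
        known = sum(aug[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (aug[r][n] - known) / aug[r][r]
    return sol


def fit_multiple_ols(rows, y):
    """Return b = (b0, b1, ..., bk) solving the normal equations X'X b = X'Y."""
    X = [[1.0] + list(r) for r in rows]              # prepend the column of ones
    p = len(X[0])
    xtx = [[sum(xi[a] * xi[b] for xi in X) for b in range(p)] for a in range(p)]
    xty = [sum(xi[a] * yi for xi, yi in zip(X, y)) for a in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical noise-free data generated from y = 1 + 2*x1 + 3*x2,
# so the fit should recover the coefficients (1, 2, 3).
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0)]
y = [1.0 + 2.0 * x1 + 3.0 * x2 for x1, x2 in rows]

b = fit_multiple_ols(rows, y)
print([round(bj, 6) for bj in b])
```

With real data the residuals would not be zero, and $s^2(b) = MSE\,(X'X)^{-1}$ would additionally require the inverse or an equivalent factorization.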
Hypothesis test of $\beta_i$
To test the null hypothesis $H_0: \beta_i = 0$ against an alternative $H_a$, we may use the test statistic $t = b_i / s(b_i)$.

6 Inferences in Matrix Terms
Mean Response
Let $X_h = (1, x_{h1}, x_{h2}, \dots, x_{hk})'$. Then $\mathrm{Var}(\hat{Y}_h) = \sigma^2 X_h'(X'X)^{-1}X_h$.
The estimated variance of $\hat{Y}_h$ in matrix notation is $s^2(\hat{Y}_h) = MSE\,(X_h'(X'X)^{-1}X_h)$.
The $(1 - \alpha)100\%$ confidence interval for the mean response $E(Y_h)$ is as follows:
$\hat{Y}_h - t_{\alpha/2,\,(n-k-1)}\, s(\hat{Y}_h) < E(Y_h) < \hat{Y}_h + t_{\alpha/2,\,(n-k-1)}\, s(\hat{Y}_h)$
Prediction of New Observation
$s^2(\mathrm{pred}) = MSE\,(1 + X_h'(X'X)^{-1}X_h)$

7 ANOVA
In matrix terms, $SST = Y'(I - J/n)Y$, $SSE = Y'(I - H)Y$ and $SSR = Y'(H - J/n)Y$, where J is the $n \times n$ matrix of ones.
$E(MSE) = \sigma^2$, and $E(MSR)$ is $\sigma^2$ plus a nonnegative quantity; e.g., for k = 2,
$E(MSR) = \sigma^2 + [\beta_1^2 \sum (X_{i1} - \bar{X}_1)^2 + \beta_2^2 \sum (X_{i2} - \bar{X}_2)^2 + 2\beta_1\beta_2 \sum (X_{i1} - \bar{X}_1)(X_{i2} - \bar{X}_2)]/2$
The F-test associated with the ANOVA table is a test of the null hypothesis
$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$
$H_a$: One or more of the $\beta_i$ values are not equal to zero.
In other words, it is a test of whether there is a linear relationship between the dependent variable Y and the entire set of independent variables $X_i$ ($i = 1, 2, \dots, k$).

8 Dividend Example (cont.)
A random sample of 42 firms was chosen from the S&P500 firms listed in the Spring 2003 Special Issue of Business Week (The Business Week Fifty Best Performers). The dividend yield (DIVYIELD), the 2002 earnings per share (EPS), and the stock price (PRICE) were recorded for the 42 firms. These data are in a file named DIV4. Using dividend yield as the DV and EPS and PRICE as the IVs, run a regression using SPSS.
(a) What is the sample regression equation?
(b) What conclusion(s) can be drawn based on the outputs?
(c) Is it necessary to test each coefficient individually to see if either PRICE or EPS is related to DIVYIELD? Why or why not?

9 Dividend Example (cont.)
ANOVA (b)
Model 1      Sum of Squares   df   Mean Square   F       Sig.
Regression   12.677           2    6.338         1.865   .168 (a)
Residual     132.532          39   3.398
Total        145.208          41
a. Predictors: (Constant), price, eps
b. Dependent Variable: divyield

Coefficients (a)
Model 1      B       Std. Error   Beta    t        Sig.
(Constant)   2.450   .653                 3.753    .001
EPS          .604    .314         .387    1.925    .062
PRICE        -.029   .026         -.227   -1.129   .266
(B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.)
a. Dependent Variable: divyield

10 Example 1
A human resources manager is interested in developing a multiple regression model to estimate the salary Y (in thousands of dollars) for employees from experience $X_1$ (in years) with the firm and from performance $X_2$ (as measured by an index). Data were collected for 15 employees and are presented in the following table:

11 Example 1 (cont.)
(a) Find the estimated multiple regression equation for Y regressed on $X_1$ and $X_2$.
(b) Find the predicted value for Y given $X_1 = 10$ and $X_2 = 60$.
(c) Find s.
(d) Test the overall significance of the regression relationship.
(e) Test each independent variable separately to see whether it contributes explanatory power to the regression equation. Use $\alpha = 5\%$.

12 Example 1 - Solution

13 Example 1 - Solution (cont.)
(a) $\hat{Y} = 8.49 + 2.778 X_1 + 0.0656 X_2$
(b) If $X_1 = 10$ and $X_2 = 60$, the predicted value for Y is $\hat{Y} = 8.49 + 2.778(10) + 0.0656(60) = 40.25$
(c) $s = (15.163)^{1/2} = 3.894$
(d) $H_0: \beta_1 = \beta_2 = 0$ vs $H_a$: $H_0$ is not true. From the ANOVA table, the p-value = 0.008 is very small, so we reject $H_0$ and conclude that there is a linear relationship among salary, experience and performance.
(e) $H_{0i}: \beta_i = 0$ vs $H_{ai}: \beta_i \ne 0$, $i = 1, 2$. From the coefficients table, the p-values for $\beta_1$ and $\beta_2$ are 0.103 and 0.788 respectively, so we do not reject either null hypothesis at the 5% level.

14 Coefficient of Determination - $R^2$
The definition of the coefficient of determination $R^2$ for multiple regression analysis is the same as for simple regression analysis. That is, $R^2$ is the proportion of the total variation of Y that is explained by the relationship between Y and the independent variables X's. It is an important summary statistic that is used to help evaluate how well the multiple regression model fits the data.
The equation for $R^2$ is as follows:
$R^2 = SSR/SST = 1 - SSE/SST$.
The $R^2$ value will generally increase as more independent variables are included in a multiple regression equation, given a fixed number of observations.

15 Why Do We Need to Consider $R_a^2$?
The reason the $R^2$ value increases is that as additional independent variables X's are included in a regression equation, the value of SST does not change, but SSR generally increases (equivalently, SSE decreases), and therefore $R^2$ generally increases. The additional independent variables may not contribute significantly to the explanation of the dependent variable Y, but they do increase $R^2$.
Adding more independent variables to the regression equation for the purpose of increasing $R^2$ often results in overfitting and produces worse models rather than better ones.
To help prevent overfitting in regression analysis, we use the so-called Adjusted R Square (written as $R_a^2$) value as the measure of how well the model fits the data.

16 Adjusted R Square - $R_a^2$
The $R_a^2$ value incorporates the effect of including additional independent variables in a multiple regression equation. This value is computed by the following formula:
$R_a^2 = 1 - \frac{SSE/(n - k - 1)}{SST/(n - 1)} = 1 - \frac{SSE}{SST} \cdot \frac{n - 1}{n - k - 1}$
where k is the number of independent variables and n is the size of the sample.
$R_a^2$ will always be smaller than $R^2$ (provided $R^2 < 1$ and $k \ge 1$). Unlike $R^2$, $R_a^2$ takes into account ("adjusts for") both the sample size n and the number of independent variables k in the model. It may actually become smaller when another X variable is introduced into the model, because any decrease in SSE may be more than offset by the loss of a degree of freedom in the denominator n - k - 1.

17 Example 1 (cont.)
Find the $R_a^2$ and interpret its value.
Solution
$R^2 = SSR/SST = 228.447/410.4 = 0.557$
$R_a^2 = 1 - (181.95/410.4)(14/12) = 0.483$
The value of $R_a^2$ can be interpreted in the following way: $R_a^2 = 0.4828$ means that approximately 48.3% of the total variation in the values of Y (salary) can be explained by a linear relationship with the independent variables (experience and performance), after adjusting for the number of independent variables.

18 Example 1 (cont.)
Model Summary for $\hat{Y} = 8.49 + 2.778 X_1 + 0.0656 X_2$
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .746 (a)   .557       .483                3.89394
a. Predictors: (Constant), performa, experien

Model Summary for $\hat{Y} = 8.959 + 3.148 X_1$
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .744 (a)   .554       .520                3.75291
a. Predictors: (Constant), experien

19 Indicator (Dummy) Variables
There are many occasions in which qualitative (categorical) variables need to be considered as a part of the model development. Examples of qualitative independent variables and some possible categories are sex (male or female), marital status (married or not married), and so on.
If qualitative variables are to be included in a regression model, they must be quantified; that is, they must be assigned numerical values. Quantification can be accomplished by using indicator (dummy) variables. Indicator variables are assigned the values 0 or 1, for example,

20 Dummy Variables (Cont.)
When the qualitative variable has c categories, we use (c - 1) indicator variables. Say, in a study, there are 4 age groups, 0-10, 11-20, 21-40, and 40+, so we need (c - 1) = (4 - 1) = 3 indicator variables.
The reason why we only need three dummy variables is that the category "40+" is treated as the "default group", or the "otherwise group".

21 Example 2
A female executive at a certain company claims that male executives earn higher salaries, on average, than female executives with the same education, experience, and responsibilities.
To support her claim, she wants to model the salary Y of an executive using a qualitative independent variable representing the gender of an executive (male or female).
(a) Write a model for mean executive salary, E(Y), using a dummy variable for the gender of an executive.
(b) Interpret the parameters in the model.

22 Example 2 (cont.)
(a) The model for executive salary is $Y = \beta_0 + \beta_1 X + \varepsilon$. The mean salary is $E(Y) = \beta_0 + \beta_1 X$.
(b) The advantage of using a 0-1 coding scheme is that the coefficients are easily interpreted.
If X = 1 (male), $\mu_M = E(Y) = \beta_0 + \beta_1(1) = \beta_0 + \beta_1$.
If X = 0 (female), $\mu_F = E(Y) = \beta_0 + \beta_1(0) = \beta_0$.
Then $\beta_1 = \mu_M - \mu_F$. That is, $\beta_0$ represents the mean salary for females, and $\beta_1$ represents the difference between the mean salary for males and the mean salary for females.
Therefore, when a 0-1 coding convention is used, $\beta_0$ will always represent the mean response associated with the level of the qualitative variable assigned the value 0 (called the base level), and $\beta_1$ will always represent the difference between the mean response for the level assigned the value 1 and the mean for the base level.

23 Example - Employment Discrimination
Data for the following variables for 93 employees of Harris Bank Chicago in 1977 are available:
Y = beginning salaries in dollars (SALARY)
$X_1$ = years of schooling at the time of hire (EDUCAT)
$X_2$ = number of months of previous work experience (EXPER)
$X_3$ = number of months after January 1, 1969, that the individual was hired (MONTHS)
$X_4$ = indicator variable coded 1 for males and 0 for females (MALE)
(a) Is there evidence that Harris Bank discriminated against female employees?
(b) What salary would you forecast, on average, for males with 12 years of education, 10 months of experience, and with MONTHS equal to 15? What salary would you forecast, on average, for females if all other factors are equal?

24 Employment Discrimination (cont.)
ANOVA (b)
Model 1      Sum of Squares   df   Mean Square    F        Sig.
Regression   2.367E7          4    5916337.848    22.978   .000 (a)
Residual     2.266E7          88   257476.579
Total        4.632E7          92
a. Predictors: (Constant), males, exper, months, educat
b. Dependent Variable: salary

Coefficients (a)
Model 1      B          Std. Error   Beta   t        Sig.
(Constant)   3526.422   327.725             10.760   .000
educat       90.020     24.694       .290   3.645    .000
exper        1.269      .588         .162   2.159    .034
months       23.406     5.201        .338   4.500    .000
males        722.461    117.822      .486   6.132    .000
(B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.)
a. Dependent Variable: salary

25 Employment Discrimination (cont.)
$\hat{Y} = 3526.422 + 722.461\,\mathrm{Males} + 90.02\,\mathrm{Educat} + 1.269\,\mathrm{Exper} + 23.406\,\mathrm{Months}$
(a) Yes. There is a difference in salaries, on average, for male and female workers after accounting for the effects of the EDUCAT, EXPER, and MONTHS variables. Males' salaries are, on average, $722 higher, a statistically significant difference (p-value < 0.001).
(b) Forecast of average salary for males with 12 years of education, 10 months of experience and with MONTHS equal to 15:
$\hat{Y} = 3526.422 + 722.461 + 90.020(12) + 1.269(10) + 23.406(15) = 5692.903$
Forecast of average salary for females with 12 years of education, 10 months of experience and with MONTHS equal to 15:
$\hat{Y} = 3526.422 + 90.020(12) + 1.269(10) + 23.406(15) = 4970.442$

26 Interaction Regression Models
We define the first-order linear model as follows:
$E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$
The assumption that a first-order model will adequately characterize the relationship between E(Y) and the independent variables is equivalent to assuming that the independent variables do not "interact"; that is, we assume that the effect on E(Y) of a change in $X_i$ (for a fixed value of $X_j$) is the same regardless of the value of $X_j$.
Thus, "no interaction" is equivalent to saying that the effect of changes in one variable (say $X_i$) on E(Y) is independent of the value of the second variable (say $X_j$).
However, if the relationship between E(Y) and $X_i$ does, in fact, depend on the value of $X_j$ held fixed, then the first-order model is not appropriate for predicting Y. In this case, we need another model that takes this dependence into account: the interaction model.

27 Interaction Model with Two Independent Variables
$E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2$
where
$(\beta_1 + \beta_3 X_2)$ represents the change in E(Y) for every 1-unit increase in $X_1$, holding $X_2$ fixed;
$(\beta_2 + \beta_3 X_1)$ represents the change in E(Y) for every 1-unit increase in $X_2$, holding $X_1$ fixed;
$\beta_0$ is the intercept of the model, the value of E(Y) when $X_1 = X_2 = 0$.
The cross-product term, $\beta_3 X_1 X_2$, is called an interaction term.

28 Example 1 (cont.)
Is there evidence that $X_1$ and $X_2$ interact? Test at $\alpha = 0.05$.
Solution
$H_0: \beta_3 = 0$ vs $H_a: \beta_3 \ne 0$
Since the p-value = 0.169 is greater than $\alpha = 0.05$, $H_0$ is not rejected. There is insufficient evidence to indicate that $X_1$ (experience) and $X_2$ (performance) interact at the 5% level.

29 Comparing Nested Models
To be successful model builders, we require a statistical method that will allow us to determine (with a high degree of confidence) which one among a set of candidate models best fits the data. One such technique, which we are going to discuss, is for nested models.
Two models are nested if one model contains all the terms of the second model and at least one additional term. The more complex of the two models is called the complete model (or full model) and the simpler of the two is called the reduced model. For example,
(a) $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \varepsilon$ (complete model)
    $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$ (reduced model)
(b) The first-order and second-order models are nested.

30 Partial F-Test for Comparing Nested Models
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_g X_g + \varepsilon$ (reduced model)
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_g X_g + \beta_{g+1} X_{g+1} + \dots + \beta_k X_k + \varepsilon$ (complete model)
$H_0: \beta_{g+1} = \beta_{g+2} = \dots = \beta_k = 0$ vs $H_a$: $H_0$ is not true
The test statistic is
$F = \frac{(SSE_R - SSE_C)/(k - g)}{MSE_C}$
where
$SSE_R$ = Sum of squared errors for the reduced model.
$SSE_C$ = Sum of squared errors for the complete model.
$MSE_C$ = Mean square error for the complete model.
k - g = Number of parameters tested (in $H_0$).
k = Number of independent variables in the complete model.
For the partial F-test, $df_1 = (k - g)$ and $df_2 = (n - k - 1)$.

31 Example 1 (Cont.)
The complete model: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon$
The reduced model: $Y = \beta_0 + \beta_1 X_1 + \varepsilon$
where $X_1$ is Experience and $X_2$ is Performance. Test whether the complete model fits the data better at the 5% level.
Solution
$H_0: \beta_2 = \beta_3 = 0$ vs $H_a$: $H_0$ is not true
$SSE_R = 183.096$, $SSE_C = 152.004$, $MSE_C = 13.819$
k - g = 3 - 1 = 2 (= $df_1$), (n - k - 1) = 15 - 3 - 1 = 11 (= $df_2$)
The test statistic is $F = [(SSE_R - SSE_C)/2]/MSE_C = [(183.096 - 152.004)/2]/13.819 = 1.125$
Since $1.125 < F_{0.05,(2,11)} = 3.98$, we do not reject $H_0$ and conclude that there is insufficient evidence to indicate the complete model is better than the reduced model at the 5% level.

32 Example 1 (Cont.)
[SPSS output for the reduced model and the full model]

33 Example (Cont.)

34 Beta (Standardized) Coefficients
It is inappropriate to interpret the $b_j$'s as indicators of the relative importance of the independent variables, since the independent variables generally measure different concepts and so have different units of measurement.
For this reason, the regression output always includes so-called Beta coefficients (standardized regression coefficients), which are the coefficients of the independent variables when all variables are expressed in standardized form. (Refer to section 7.5 on pages 273 ~ 276.)
Beta coefficients can be calculated directly from the regression coefficients using the following formula:

35 Beta Coefficients (cont.)
$Beta_j = b_j (s_{x_j}/s_y) = b_j (SS_{x_j x_j}/SS_{yy})^{1/2}$,
where $s_{x_j}$ is the standard deviation of the jth independent variable, $s_y$ is the standard deviation of the dependent variable and $b_j$ is the unstandardized partial regression coefficient for the jth independent variable.
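The Beta formula can be checked numerically: rescaling each $b_j$ by $s_{x_j}/s_y$ should reproduce the slopes obtained when every variable is first converted to z-scores. The sketch below uses simulated data and helper names (`ols`, `z`) that are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical data with predictors on very different scales.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(10.0, 2.0, n)        # e.g. years
x2 = rng.normal(500.0, 100.0, n)     # e.g. an index
y = 5.0 + 1.5 * x1 + 0.02 * x2 + rng.normal(0.0, 1.0, n)

def ols(X, y):
    """Return least squares coefficients for design matrix X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Unstandardized fit: columns are [1, x1, x2].
X = np.column_stack([np.ones(n), x1, x2])
b = ols(X, y)

# Beta_j = b_j * (s_xj / s_y)
s_x = np.array([x1.std(ddof=1), x2.std(ddof=1)])
s_y = y.std(ddof=1)
beta_formula = b[1:] * s_x / s_y

def z(v):
    """Standardize a variable to mean 0 and sample standard deviation 1."""
    return (v - v.mean()) / v.std(ddof=1)

# The same quantities from a regression on z-scores.
Z = np.column_stack([np.ones(n), z(x1), z(x2)])
beta_zfit = ols(Z, z(y))[1:]

print(beta_formula, beta_zfit)
```

The two printed vectors agree up to floating-point error, which is why software can report Beta coefficients without refitting the model.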
When two or more independent variables are entered, the beta coefficients can be used to directly compare the importance of each independent variable in relation to the dependent variable. For example, if $Beta_1 = 0.85$ and $Beta_2 = 0.23$, then independent variable $x_1$ is more important to the dependent variable than $x_2$.

36 Example 3
Mrs. Goh, a real estate agent, wants to develop a multiple regression model to find the relationship between the sale price of houses and various characteristics of the houses. She collected data on six variables, recorded in the table, for 13 houses that were sold recently. The six variables are:
Price: Sale price of a house in thousands of dollars
Lot size: Size of the lot in acres
Living area: Living area in square feet
Age: Age of a house in years
Corner: Whether or not a house is on a corner lot
Garage: Whether or not a house has a garage

37 Example 3 (Cont.)
Discuss the following SPSS printouts for the model based on the above data.

38 Example 3 - Discussion
The model is useful.
The model fits the data well.

39 Example 3 - Discussion (Cont.)
From the beta coefficients in the above table, we can say that AREA is the most important independent variable; the next one is CORNER. The less important variables are AGE and SIZE.

40 Example 3 - Discussion (Cont.)
After dropping SIZE and AGE, $R^2 = 0.976$ and $R_a^2 = 0.968$. [Printout annotation: Significant]

41 Example 3 - Discussion (Cont.)
Partial F-test
$H_0: \beta_s = \beta_a = 0$ vs $H_a$: $H_0$ is not true
$SSE_R = 1062.059$, $SSE_C = 642.24$, $MSE_C = 91.75$
k - g = 5 - 3 = 2 (= $df_1$), (n - k - 1) = 13 - 5 - 1 = 7 (= $df_2$)
The test statistic is $F = [(SSE_R - SSE_C)/2]/MSE_C = [(1062.059 - 642.24)/2]/91.75 = 2.288$
Since $2.288 < F_{0.05,(2,7)} = 4.74$, we do not reject $H_0$ and conclude that there is insufficient evidence to indicate the complete model is better than the reduced model at the 5% level.

42 Extra Sums of Squares
In the textbook, the difference $(SSE_R - SSE_C)$ is called an extra sum of squares.
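The extra sum of squares is the numerator ingredient of the partial F statistic, so the two worked tests (Example 1 and Example 3) can be reproduced with a few lines. The function name `partial_f` and its argument names are illustrative assumptions; the inputs are the sums of squares quoted in the slides.

```python
def partial_f(sse_reduced, sse_complete, n, k, g):
    """Partial F statistic for nested models.

    n = sample size, k = predictors in the complete model,
    g = predictors in the reduced model.
    """
    df1 = k - g
    df2 = n - k - 1
    extra_ss = sse_reduced - sse_complete      # extra sum of squares
    mse_complete = sse_complete / df2          # MSE of the complete model
    return (extra_ss / df1) / mse_complete, df1, df2

# Example 3: dropping SIZE and AGE (n = 13, k = 5, g = 3).
f3, df1_3, df2_3 = partial_f(1062.059, 642.24, n=13, k=5, g=3)
# Example 1: adding X2 and X1*X2 to the model with X1 only (n = 15, k = 3, g = 1).
f1, df1_1, df2_1 = partial_f(183.096, 152.004, n=15, k=3, g=1)

print(round(f3, 3), (df1_3, df2_3))   # 2.288 (2, 7)
print(round(f1, 3), (df1_1, df2_1))   # 1.125 (2, 11)
```

Each statistic is then compared with the tabled critical value quoted in the slides (4.74 and 3.98 respectively); the sketch deliberately leaves the table lookup out.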
An extra sum of squares measures the marginal reduction in the error sum of squares when one or several predictor variables are added to the regression model, given that other predictor variables are already in the model.
Equivalently, one can view an extra sum of squares as measuring the marginal increase in the regression sum of squares when one or several predictor variables are added to the regression model.
The reason for the equivalence of the marginal reduction in the error sum of squares and the marginal increase in the regression sum of squares is that SST = SSR + SSE. That is, since SST does not depend on the regression model fitted, any reduction in SSE implies an identical increase in SSR.

43 Coefficients of Partial Determination
Extra sums of squares are not only useful for tests on the regression coefficients of a multiple regression model; they are also encountered in descriptive measures of relationship called coefficients of partial determination.
$R^2$ measures the proportionate reduction in the variation of Y achieved by the introduction of the entire set of X variables considered in the model.
A coefficient of partial determination, in contrast, measures the marginal contribution of one X variable when all others are already included in the model.
Let us consider the following model:
$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i$

44 Coefficients of Partial Determination (cont.)
$SSE(X_1)$ measures the variation in Y when $X_1$ is included in the model.
$SSE(X_1, X_2)$ measures the variation in Y when both $X_1$ and $X_2$ are included in the model.
Hence, the relative marginal reduction in the variation in Y associated with $X_2$ when $X_1$ is already in the model is
$r^2_{Y2.1} = SSR(X_2|X_1)/SSE(X_1) = [SSE(X_1) - SSE(X_1, X_2)]/SSE(X_1)$
The above is the coefficient of partial determination between Y and $X_2$, given that $X_1$ is in the model.
Similarly, we can define the coefficient of partial determination between Y and $X_1$, given that $X_2$ is in the model, as follows:
$r^2_{Y1.2} = SSR(X_1|X_2)/SSE(X_2) = [SSE(X_2) - SSE(X_1, X_2)]/SSE(X_2)$

45 Coefficients of Partial Determination (cont.)
General Case
$r^2_{Y1.23} = SSR(X_1|X_2, X_3)/SSE(X_2, X_3)$
$r^2_{Y2.13} = SSR(X_2|X_1, X_3)/SSE(X_1, X_3)$
$r^2_{Y3.12} = SSR(X_3|X_1, X_2)/SSE(X_1, X_2)$
$r^2_{Y4.123} = SSR(X_4|X_1, X_2, X_3)/SSE(X_1, X_2, X_3)$
Comments (on page 270)
(a) A coefficient of partial determination is between 0 and 1.
(b) A coefficient of partial determination can be interpreted as a coefficient of simple determination.

46 Multicollinearity
Often, two or more of the independent variables used in the multiple regression model will contribute redundant information. That is, the independent variables will be correlated with each other. When the independent variables are highly correlated, we say that multicollinearity exists.
A few problems arise when serious multicollinearity is present in the regression analysis.
First, high correlation among the independent variables increases the likelihood of rounding errors in the calculations of the $\beta_i$ estimates, standard errors, and so forth.
Second, and more important, the regression results may be confusing and misleading.

47 Example
$b = (X'X)^{-1}X'Y$, $\mathrm{var}(b) = \sigma^2 (X'X)^{-1}$

48 Example (cont.)
Three important effects are illustrated in the sequence of matrices:
(a) The sampling variances of the estimated coefficients increase sharply with increasing collinearity between the independent variables.
(b) Greater covariances between the independent variables produce greater sampling covariances for the LS coefficients.
(c) Small variations in the data (say, dropping or adding a few observations) may produce substantial variations in the LS coefficients.
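Effects (a) and (b) can be seen in a small sketch. It assumes two predictors that have been centered and scaled so that $X'X$ for the slopes is proportional to their correlation matrix; under that assumption $\mathrm{var}(b) = \sigma^2 (X'X)^{-1}$ is proportional to the inverse printed below. The function name `coef_cov_factor` and the chosen correlations are illustrative only.

```python
import numpy as np

def coef_cov_factor(r):
    """Inverse of the 2 x 2 correlation matrix of two predictors with correlation r.

    Up to a constant, this is (X'X)^{-1} for centered and scaled predictors.
    """
    corr = np.array([[1.0, r], [r, 1.0]])
    return np.linalg.inv(corr)

for r in (0.0, 0.5, 0.9, 0.99):
    inv = coef_cov_factor(r)
    # Diagonal entry: factor multiplying the variance of each slope estimate.
    # Off-diagonal entry: factor multiplying the covariance between the slopes.
    print(f"r = {r:4.2f}  variance factor = {inv[0, 0]:8.3f}  "
          f"covariance factor = {inv[0, 1]:8.3f}")
```

The diagonal entry equals $1/(1 - r^2)$ and the off-diagonal entry equals $-r/(1 - r^2)$, so both grow without bound in magnitude as the correlation approaches 1.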
49 Detecting Multicollinearity
(1) Significant correlation between pairs of independent variables in the model.
(2) Nonsignificant t tests for all (or nearly all) of the individual $\beta$ parameters when the F-test for overall model adequacy, $H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$, is significant.
(3) Signs opposite from what is expected in the estimated parameters.
(4) $VIF_i = (1 - R_i^2)^{-1} \ge 10$ for some i, or a mean VIF, i.e., $(\sum VIF_i)/k$, considerably larger than 1, $i = 1, 2, \dots, k$ (pages 408~409). Here $R_i^2$ is the coefficient of determination when $X_i$ is regressed on the other independent variables.
(5) A more sophisticated method is to use Principal Components Analysis.
One of the commonly used simple methods for dealing with multicollinearity is to drop one or more of the highly correlated independent variables from the multiple regression model.

50 Example 1 (Cont.)
[SPSS output; printout annotations: Significant (overall F-test), Nonsignificant (individual t tests)]

51 Example 1 (Cont.)
$X_1$ (experience) and $X_2$ (performance) are highly correlated.

52 Example 1 (Cont.)
The mean VIF value = (3.75 + 3.75)/2 = 3.75 is considerably larger than 1.

53 Example 1 (Cont.)
After dropping the independent variable $x_2$ (performance). [Printout annotation: Significant]

54 Polynomial Regression Models
Polynomial regression models for quantitative predictor variables are among the most frequently used curvilinear response models in practice because they are easily handled as a special case of the general linear regression model.
Polynomial models have two basic types of uses:
(1) When the true curvilinear response function is indeed a polynomial function
(2) When the true curvilinear response function is unknown (or complex) but a polynomial function is a good approximation
The following are several commonly used polynomial regression models.

55 Commonly Used Polynomial Models
Second-Order model with one predictor variable
$Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$, $i = 1, 2, \dots, n$
where $x_i = X_i - \bar{X}$, the regression coefficient $\beta_0$ represents the mean response of Y when x = 0 (i.e. $X = \bar{X}$), $\beta_1$ is often called the linear effect coefficient, and $\beta_2$ is called the quadratic effect coefficient.
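A quick numerical check shows what the centering $x_i = X_i - \bar{X}$ does to the relationship between the linear and quadratic terms. The predictor values below are hypothetical and equally spaced; for such a symmetric design the centered terms become uncorrelated, while for other designs the correlation is typically reduced rather than removed.

```python
import numpy as np

X = np.arange(1.0, 11.0)            # hypothetical predictor values 1, 2, ..., 10
x = X - X.mean()                    # centered predictor x_i = X_i - X-bar

corr_raw = np.corrcoef(X, X**2)[0, 1]        # correlation between X and X^2
corr_centered = np.corrcoef(x, x**2)[0, 1]   # correlation between x and x^2

print(f"corr(X, X^2) = {corr_raw:.3f}")
print(f"corr(x, x^2) = {corr_centered:.3f}")
```

The raw terms are almost perfectly correlated, whereas the centered terms are uncorrelated up to floating-point error.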
The reason for using a centered predictor variable in the polynomial model is that X and $X^2$ often will be highly correlated. Centering the predictor variable often reduces the multicollinearity substantially.
Higher-Order model with one predictor variable
$Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \dots + \beta_m x_i^m + \varepsilon_i$, $i = 1, 2, \dots, n$
where $x_i = X_i - \bar{X}$.

56 Polynomial Models (cont.)
Second-Order model with two predictor variables
$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1}^2 + \beta_4 x_{i2}^2 + \beta_5 x_{i1}x_{i2} + \varepsilon_i$, $i = 1, 2, \dots, n$
where $x_{i1} = X_{i1} - \bar{X}_1$, $x_{i2} = X_{i2} - \bar{X}_2$, and $\beta_5$ is called the interaction effect coefficient.
Second-Order model with three predictor variables
$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i1}^2 + \beta_5 x_{i2}^2 + \beta_6 x_{i3}^2 + \beta_7 x_{i1}x_{i2} + \beta_8 x_{i1}x_{i3} + \beta_9 x_{i2}x_{i3} + \varepsilon_i$, $i = 1, 2, \dots, n$
where $x_{i1} = X_{i1} - \bar{X}_1$, $x_{i2} = X_{i2} - \bar{X}_2$, $x_{i3} = X_{i3} - \bar{X}_3$, and $\beta_7$, $\beta_8$ and $\beta_9$ are called the interaction effect coefficients.

57 Implementation of Polynomial Models
Fitting polynomial regression models presents no new problems, since they are special cases of the general linear regression model.
For example, technically, the quadratic model includes only one independent variable X, but we can think of the model as a general linear model with two independent variables $X_1$ (= X) and $X_2$ (= $X^2$).
Hence, all earlier results on fitting apply, as do the earlier results on making inferences.

58 Implementation of Polynomial Models (cont.)
How can you choose an appropriate linear model to fit to a set of data: first-order, second-order or higher-order?
Since most relationships in the real world are curvilinear, a good choice would be a higher-order linear model. If you are fairly certain, based on your experience, knowledge, or prior information (past research in this area), that the relationships between E(Y) and the independent variables are approximately first-order and that the independent variables do not interact, you could select a first-order model for the data.
Keep in mind that you may be forced to use a first-order model rather than a second-order or higher-order model simply because you do not have sufficient data to estimate all parameters in a higher-order model.

59 Example
Refer to the case example in the textbook (pages 300 - 305).
The second-order polynomial model was used first:
$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1}^2 + \beta_4 x_{i2}^2 + \beta_5 x_{i1}x_{i2} + \varepsilon_i$, $i = 1, 2, \dots, 11$
where $x_{i1} = X_{i1} - 1$ and $x_{i2} = X_{i2} - 20$.

60 Example (cont.)

61 Example (cont.)

62 Example (cont.)
Partial F-test
$H_0: \beta_3 = \beta_4 = \beta_5 = 0$ vs $H_a$: $H_0$ is not true.
SSER = 7700.33 SSEC = 5240