Chapter 6 (Part II): Trendlines and Regression Analysis
Universiti Teknologi MARA, Johor
Nur Liyana Mohamed Yousop
Summary
This document provides a comprehensive overview of trendlines and regression analysis, focusing on multiple linear regression. It covers the multiple linear regression model and its estimated equation, the use of Excel's Regression tool, and the interpretation of regression results. It also discusses model-building issues and explains how to check the assumptions involved in regression modeling.
Full Transcript
CHAPTER 6 (PART II)
Trendlines and Regression Analysis
Prepared by: Nur Liyana Mohamed Yousop

MULTIPLE LINEAR REGRESSION
• A linear regression model with more than one independent variable is called a multiple linear regression model.

ESTIMATED MULTIPLE REGRESSION EQUATION
• We estimate the regression coefficients, called partial regression coefficients, b0, b1, b2, ..., bk, and then use the model:
  Ŷ = b0 + b1X1 + b2X2 + ... + bkXk
• The partial regression coefficients represent the expected change in the dependent variable when the associated independent variable is increased by one unit while the values of all other independent variables are held constant.

EXCEL REGRESSION TOOL
• The independent variables in the spreadsheet must be in contiguous columns, so you may have to move the columns of data around manually before applying the tool.
• Key differences from simple linear regression:
  • Multiple R and R Square are called the multiple correlation coefficient and the coefficient of multiple determination, respectively, in the context of multiple regression.
  • ANOVA tests for significance of the entire model; that is, it computes an F-statistic for testing the hypotheses:
    H0: β1 = β2 = ... = βk = 0
    H1: at least one βj is not equal to 0

INTERPRETING REGRESSION RESULTS
Data: Salary Data (a sample of 100 employees in a firm).
Goal: predict current salary using the following indicators: beginning salary, previous experience (in months) when hired, and total years of education.

Regression model:
  Y = β0 + β1X1 + β2X2 + β3X3 + ε
  Ŷ = −4139.2377 + 1.7302X1 − 10.9071X2 + 719.1221X3
where
  Y : current salary
  X1 : beginning salary
  X2 : previous experience (months)
  X3 : total years of education

• The R² value of 0.8031 indicates that 80.31% of the variation in current salary (the DV) is explained by the IVs; the remaining 19.69% of the variation is explained by other variables.
• Judging by the magnitudes of the coefficients, total years of education (719.1221 per year) has a larger per-unit impact on current salary than the other variables.

• The ANOVA test is slightly different for MLR compared with SLR. The significance test for the model is:
  H0: β1 = β2 = ... = βn = 0
  H1: at least one βm is not equal to 0
• Since the p-value of Significance F = 0.000 < α = 5%, we reject the null hypothesis. Therefore, at least one slope is statistically different from zero: the independent variables explain variation in the dependent variable, and the model is a good fit. (A scripted version of this fit appears at the end of this part.)

MODEL BUILDING ISSUES
• A good regression model should include only significant independent variables.
• However, it is not always clear exactly what will happen when we add or remove variables from a model; variables that are (or are not) significant in one model may (or may not) be significant in another.
• Therefore, you should not drop all insignificant variables at one time, but rather take a more structured approach.
• Adding an independent variable to a regression model will always result in an R² equal to or greater than the R² of the original model.
• Adjusted R² reflects both the number of independent variables and the sample size and may either increase or decrease when an independent variable is added or dropped. An increase in adjusted R² indicates that the model has improved.
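The salary regression above can be reproduced outside Excel, which also makes the R², adjusted R², and Significance F values just discussed easy to read off. Below is a minimal sketch using Python's statsmodels, assuming a hypothetical file salary.csv with illustrative column names (current_salary, begin_salary, experience_months, education_years); the file and column names are assumptions, not the original data set.

```python
# A minimal sketch of the salary regression in Python (statsmodels).
# "salary.csv" and its column names are hypothetical placeholders,
# not the original data file.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("salary.csv")

X = df[["begin_salary", "experience_months", "education_years"]]
X = sm.add_constant(X)      # adds the intercept term b0
y = df["current_salary"]

model = sm.OLS(y, X).fit()

print(model.params)         # b0, b1, b2, b3: the partial regression coefficients
print(model.rsquared)       # R^2: share of variation in salary explained by the IVs
print(model.rsquared_adj)   # adjusted R^2, useful when comparing models
print(model.f_pvalue)       # Significance F: if < 0.05, reject H0
```

If f_pvalue comes out below α = 0.05, the conclusion is the one stated above: at least one slope differs from zero.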
SYSTEMATIC MODEL BUILDING APPROACH
1. Construct a model with all available independent variables. Check the significance of the independent variables by examining their p-values.
2. Identify the independent variable having the largest p-value that exceeds the chosen level of significance.
3. Remove the variable identified in step 2 from the model and evaluate the adjusted R². (Do not remove all variables with p-values that exceed α at the same time; remove only one at a time.)
4. Continue until all variables are significant.

IDENTIFYING THE BEST REGRESSION MODEL
Banking Data: the relationship between average bank balance and age, education, income, home value, and wealth.
Result: Home Value has the largest p-value; drop it and re-run the regression.

MULTICOLLINEARITY
• Multicollinearity occurs when there are strong correlations among the independent variables, so that they can predict each other better than the dependent variable.
• When significant multicollinearity is present, it becomes difficult to isolate the effect of one independent variable on the dependent variable, the signs of coefficients may be the opposite of what they should be (making the regression coefficients difficult to interpret), and p-values can be inflated.
• Correlations exceeding ±0.7 may indicate multicollinearity.
• The variance inflation factor (VIF) is a better indicator, but it is not computed in Excel (see the scripted sketch at the end of this part).

IDENTIFYING POTENTIAL MULTICOLLINEARITY
• Colleges and Universities correlation matrix: no correlation exceeds the recommended threshold of ±0.7.
• Banking Data correlation matrix: large correlations exist.
• Salary Data: what do you think?

MODEL 1: BEFORE DROPPING ANY VARIABLES

MODEL 2: DROP HOME VALUE
Bank regression after removing Home Value:
✓ Adjusted R² improves slightly, from 94.41% to 94.43%.
✓ All X variables are significant (p-value < α = 5%).

MODEL 3: DROP WEALTH
✓ If we remove Wealth from the model, the adjusted R² drops to 91.93% and Education is no longer significant.

MODEL 4: DROP HOME VALUE AND WEALTH
✓ If we remove Home Value and Wealth from the model, the adjusted R² drops to 92.01% and Education is no longer significant.

MODEL 5: DROP HOME VALUE, WEALTH AND EDUCATION
✓ Dropping Home Value, Wealth and Education, leaving only Age and Income in the model, results in an adjusted R² of 92.02%, and all variables are significant.

MODEL 6: DROP HOME VALUE AND INCOME
✓ If we remove Income from the model instead of Wealth, the adjusted R² drops only to 93.45%, and all remaining variables (Age, Education, and Wealth) are significant.

SUMMARY (p-values of the coefficients; a hyphen indicates the variable is not included in that model)

             Model 1    Model 2    Model 3    Model 4    Model 5    Model 6
Age          0.0000**   0.0000**   0.0000**   0.0000**   0.0000**   0.0000**
Education    0.0541*    0.0039**   0.4112     0.3392     -          0.0000**
Income       0.0005**   0.0000**   0.0000**   0.0000**   0.0000**   -
Home Value   0.9157     -          0.0000**   -          -          -
Wealth       0.4075     0.0000**   -          -          -          0.0000**
R²           0.9469     0.9465     0.9225     0.9225     0.9218     0.9365
Adj-R²       0.9441     0.9443     0.9193     0.9201     0.9202     0.9345

PRACTICAL ISSUES IN TRENDLINE AND REGRESSION MODELING
• Identifying the best regression model often requires experimentation and trial and error.
• The independent variables selected should make sense in attempting to explain the dependent variable: logic should guide your model development. In many applications, behavioral, economic, or physical theory might suggest that certain variables should belong in a model.
• Additional variables increase R² and, therefore, help to explain a larger proportion of the variation.
• Even though a variable with a large p-value is not statistically significant, its insignificance could simply be the result of sampling error, and a modeler might wish to keep it.
• Good models are as simple as possible (the principle of parsimony).
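Neither VIFs nor automated backward elimination are available in Excel's Regression tool. The sketch below shows both ideas in Python, assuming a hypothetical banking.csv with illustrative column names (balance, age, education, income, home_value, wealth); it illustrates the systematic approach above, not the original analysis.

```python
# Minimal sketch: VIFs plus one-at-a-time backward elimination.
# "banking.csv" and its column names are hypothetical placeholders.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("banking.csv")
y = df["balance"]
predictors = ["age", "education", "income", "home_value", "wealth"]

# Variance inflation factors, computed on the design matrix with constant
X = sm.add_constant(df[predictors])
for i, name in enumerate(predictors, start=1):  # index 0 is the constant
    print(name, variance_inflation_factor(X.values, i))

# Backward elimination: drop the variable with the largest p-value
# exceeding alpha, one at a time, until all remaining are significant.
alpha = 0.05
while True:
    X = sm.add_constant(df[predictors])
    model = sm.OLS(y, X).fit()
    pvals = model.pvalues.drop("const")  # p-values of the slopes only
    worst = pvals.idxmax()
    if pvals[worst] <= alpha:
        break
    print(f"dropping {worst} (p = {pvals[worst]:.4f}), "
          f"adj R2 was {model.rsquared_adj:.4f}")
    predictors.remove(worst)

print("final predictors:", predictors, "adj R2:", model.rsquared_adj)
```

Note that the loop mirrors the slides' advice: it removes only one variable at a time and re-examines the p-values and adjusted R² after each removal.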
OVERFITTING
• Overfitting means fitting a model too closely to the sample data, at the risk of not fitting it well to the population in which we are interested.
• In multiple regression, if we add too many terms to the model, the model may not adequately predict other values from the population.
• Overfitting can be mitigated by using good logic, intuition, theory, and parsimony.

CHECKING ASSUMPTIONS FOR MULTIPLE LINEAR REGRESSION

Assumption: Linearity (a linear relationship between the IVs and the DV).
Verification: examine the scatter diagram (it should appear linear) and the residual plot (it should appear random).
If the assumption is met: the residuals are randomly scattered about zero and do not exhibit a specific pattern.

Assumption: Normality of errors (the errors are normally distributed with mean 0).
Verification: view a histogram of the standard residuals, or apply a formal goodness-of-fit test (e.g. Pearson chi-square, Jarque-Bera, and others).
If the assumption is met: the residuals follow a bell-shaped distribution.

Assumption: Homoscedasticity, i.e. constant variance (the variance around the regression line is similar for all values of the IVs).
Verification: examine the residual plot.
If the assumption is met: there are no dramatic differences in the spread of the residuals for different values of the IVs.

Assumption: Independence of errors, i.e. no autocorrelation (the error terms should not be correlated with one another; if they are, the problem of autocorrelation is present).
Verification: the Durbin-Watson statistic d, which takes values between 0 and 4. A value of d = 2 means there is no autocorrelation; a value substantially below 2 (and especially less than 1) means the data are positively autocorrelated; a value substantially above 2 means the data are negatively autocorrelated.
If the assumption is met: no autocorrelation, i.e. roughly 1.5 ≤ d ≤ 2.5.

ESTIMATED MODEL
MODEL 2: bank regression after removing Home Value.
  Balance = β0 + β1 Age + β3 Education + β4 Income + β5 Wealth
  Estimated: Balance = −12,432.4567 + 325.0653 Age + 773.3800 Education + 0.1597 Income + 0.0730 Wealth

CHECKING REGRESSION ASSUMPTIONS FOR MODEL 2
• Linearity: there is a linear trend in the scatterplot and no pattern in the residual plot.
• Normality of errors: the residual histogram (Data → Data Analysis → Histogram) appears slightly skewed, but this is not a serious departure.
[Figure: histogram of standard residuals; Frequency vs. bins from −3 to 3]
• Homoscedasticity: the residual plot shows no serious difference in the spread of the data for different X values; the variances along the line of best fit remain similar.
• Autocorrelation: checked with the Durbin-Watson statistic (a scripted check appears below).
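These checks can be scripted as well. Below is a minimal sketch of the linearity/homoscedasticity plot, the normality checks, and the Durbin-Watson statistic, assuming model is a fitted statsmodels OLS result such as the ones sketched earlier; the setup is illustrative, not the slides' Excel workflow.

```python
# Minimal sketch of the assumption checks for a fitted OLS model.
# Assumes `model` is a statsmodels regression result (see earlier sketches).
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

resid = model.resid

# Linearity / homoscedasticity: residuals vs fitted values should form a
# patternless band of roughly constant spread around zero.
plt.scatter(model.fittedvalues, resid)
plt.axhline(0)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Normality of errors: the histogram should look bell-shaped, and the
# Jarque-Bera test should not reject normality (p > 0.05).
plt.hist(resid, bins=20)
plt.show()
jb_stat, jb_p = stats.jarque_bera(resid)
print("Jarque-Bera p-value:", jb_p)

# Independence of errors: roughly 1.5 <= d <= 2.5 suggests no
# troubling autocorrelation.
print("Durbin-Watson:", durbin_watson(resid))
```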
REGRESSION WITH CATEGORICAL VARIABLES
• Regression analysis requires numerical data.
• Categorical data can be included as independent variables, but must be coded numerically using dummy variables.
• For variables with two categories, code them as 0 and 1.

A MODEL WITH CATEGORICAL VARIABLES
Data: Employee Salaries, which provides data for 35 employees.
Predict Salary using Age and MBA (coded as yes = 1, no = 0):
  Salary = β0 + β1 Age + β2 MBA

RESULTS FROM DUMMY REGRESSION
  Estimated: Salary = 893.5876 + 1044.1460 Age + 14,767.2316 MBA
• If MBA = 0: Salary = 893.5876 + 1044.1460 Age
• If MBA = 1: Salary = 15,660.8192 + 1044.1460 Age
• The coefficient of multiple determination (R²) is 0.9528: for our sample problem, 95.28% of the variation in salary can be explained by Age and MBA. The estimated equation fits the data well.

DUMMY REGRESSION ASSUMPTIONS
• Assumptions??? The same assumption checks as for any multiple linear regression apply.

END OF CHAPTER 6
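As a closing illustration (not part of the original slides), here is a minimal sketch of the dummy-variable regression above, assuming a hypothetical employees.csv with illustrative columns salary, age, and mba holding "yes"/"no" values.

```python
# Minimal sketch of the dummy-variable regression.
# "employees.csv" and its column names are hypothetical placeholders.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("employees.csv")
df["mba"] = (df["mba"] == "yes").astype(int)  # code yes=1, no=0

X = sm.add_constant(df[["age", "mba"]])
model = sm.OLS(df["salary"], X).fit()
b0, b1, b2 = model.params  # intercept, Age slope, MBA shift

# The dummy shifts the intercept but leaves the Age slope common to both groups:
print(f"MBA = 0: salary = {b0:.4f} + {b1:.4f} * Age")
print(f"MBA = 1: salary = {b0 + b2:.4f} + {b1:.4f} * Age")
```

The two printed equations reproduce the slides' point: the MBA dummy shifts the intercept while the slope on Age is shared by both groups.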