Correlation and Regression Analysis
Kyung Sam Park, Korea University Business School
Summary: This document is a transcript of lecture slides providing an overview of correlation and regression analysis, including fundamental concepts and applications.
Full Transcript
Correlation Analysis (상관분석) and Regression Analysis (회귀분석)
(Chapters 13, 14)
Kyung Sam Park
Professor of LSOM, Korea University Business School
[email protected]

Contents
- Correlation analysis (상관분석): the correlation coefficient (상관계수)
- Regression analysis (회귀분석): simple regression analysis (단순회귀분석) and multiple regression analysis (다중회귀분석)
- Incorporating qualitative variables (질적변수), also called categorical variables (범주형변수)
- The problem of multicollinearity (다중공선성 문제)

Correlation Analysis
Relationship between two variables:
- Independent variable (독립변수): the predictor variable, portrayed on the horizontal axis (X).
- Dependent variable (종속변수): the resulting variable, portrayed on the vertical axis (Y).
Examples: Figure A shows a positive correlation; Figure B shows a negative correlation.
Note: Beware of spurious correlation. For example, the consumption of peanuts and the consumption of aspirin may have a strong correlation, but this does not mean that an increase in the consumption of peanuts caused the consumption of aspirin to increase.

Correlation Analysis
The correlation coefficient (−1 ≤ r ≤ 1) is a measure of the strength of the linear relationship between two sets of variables (or data):

    r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{(n - 1)\, s_x s_y}

where n = number of data points, s_x = standard deviation of X, and s_y = standard deviation of Y.

Comments on Correlation Coefficient
The correlation coefficient measures how tightly the actual data (or observations) cluster around the straight (linear) line. It says nothing about how steep the increasing rate (or slope) is.
[Figure: two scatter plots of crime rate against unemployment rate, one with r = 0.8 and one with r = 0.9. In the r = 0.8 plot, r is lower but the slope is sharper.]

Simple Regression Analysis
Study of the linear relationship between two variables:
- First, estimate a regression equation (i.e., a linear line).
- Second, run a hypothesis test on the estimated slope (b), and examine the fitting performance (r²) of the equation.

Estimation of a regression equation Y = a + bX. Using the OLS (Ordinary Least Squares) principle, estimate a (intercept) and b (slope):

    b = r \frac{s_y}{s_x} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}, \qquad a = \bar{Y} - b\bar{X}

where r is the correlation coefficient, \bar{Y} the mean of the Y data, and \bar{X} the mean of the X data.
The OLS principle: determine the regression equation by minimizing the sum of the squares of the distances between the actual Y values and the predicted values of Y.

Hypothesis Test for the Slope
Two approaches: the T-test and the F-test.
T-test:
- Calculate all the individual slopes b_i and take their mean (b = \sum b_i / n), which behaves like a sample mean.
- Compute the T statistic: T = (b − 0) / s_b, where s_b is the standard deviation of the b_i data.
- Test H0: β = 0 against H1: β ≠ 0, where β is the population mean slope corresponding to the sample mean slope b.
- p-value = 2 × the tail probability.
[Example figure: for the plotted data the fitted line is Y = 1 + 2X, with intercept a, Mean(b) = 2, and St. dev. (s_b) = 1, so T = (2 − 0) / 1 = 2.]

Hypothesis Test for the Slope
F-test (ANOVA), for the same hypotheses (H0: β = 0 against H1: β ≠ 0):

Variation  | Sum of squares | df    | Mean square (= Variance) | F value
Regression | SSR            | 1     | SSR/1 = MSR              | MSR/MSE
Residual   | SSE            | n − 2 | SSE/(n − 2) = MSE        |
Total      | TSS            | n − 1 |                          |

The total sum of squares decomposes as TSS = SSR + SSE, where, with \hat{Y} = a + bX the predicted value,

    \text{TSS} = \sum (Y - \bar{Y})^2, \quad \text{SSR} = \sum (\hat{Y} - \bar{Y})^2, \quad \text{SSE} = \sum (Y - \hat{Y})^2

Fitting Performance
Coefficient of determination, r² (결정계수): how well the regression line represents, covers, and fits the data points. r²% of the variation in Y is explained, or accounted for, by the variation in X.

    r^2 = \frac{\text{SSR}}{\text{TSS}} = 1 - \frac{\text{SSE}}{\text{TSS}}
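To tie the quantities above together, here is a minimal Python sketch (an addition to this transcript, not from the slides) that computes r, the OLS estimates a and b, the slope t-test, the ANOVA F-test, and r² on a small made-up sample. The data values are hypothetical; the formulas are the ones defined above.

```python
import numpy as np
from scipy import stats

# Hypothetical sample data (not from the slides)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])
n = len(X)

# Correlation coefficient: r = sum((X-Xbar)(Y-Ybar)) / ((n-1) * sx * sy)
sx, sy = X.std(ddof=1), Y.std(ddof=1)
r = ((X - X.mean()) * (Y - Y.mean())).sum() / ((n - 1) * sx * sy)

# OLS estimates: b = r * sy/sx, a = Ybar - b * Xbar
b = r * sy / sx
a = Y.mean() - b * X.mean()
Y_hat = a + b * X

# ANOVA decomposition: TSS = SSR + SSE
TSS = ((Y - Y.mean()) ** 2).sum()
SSR = ((Y_hat - Y.mean()) ** 2).sum()
SSE = ((Y - Y_hat) ** 2).sum()

# F-test for H0: beta = 0, with df = 1 and n - 2
MSR, MSE = SSR / 1, SSE / (n - 2)
F = MSR / MSE
p_F = stats.f.sf(F, 1, n - 2)

# t-test for the slope (in simple regression, t^2 = F)
se_b = np.sqrt(MSE / ((X - X.mean()) ** 2).sum())
t = b / se_b
p_t = 2 * stats.t.sf(abs(t), n - 2)

# Coefficient of determination: r2 = SSR/TSS = 1 - SSE/TSS
r2 = SSR / TSS

print(f"r={r:.4f}, a={a:.4f}, b={b:.4f}, t={t:.2f}, F={F:.2f}, "
      f"p={p_F:.4g}, r2={r2:.4f}")
```

Note that the t-test and F-test agree here, as the slides say they test the same hypotheses: for one predictor, the F statistic is exactly the square of the slope's t statistic.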
Simple Regression Analysis: Example
[Excel regression output for a simple regression example; the screenshot did not survive extraction.]

Multiple Regression
- Two or more independent variables (X1, X2, …, Xk).
- Regression equation: Y = a + b1X1 + b2X2 + … + bkXk.
- Check r².
- Find out which independent variables influence Y significantly.
- What if qualitative variables are involved (e.g., with and without a garage)?
- Etc.

Multiple Regression: Example

Home | Heating cost ($) Y | Outside temp (°F) X1 | Insulation thickness (in) X2 | Furnace age (years) X3
1  | 250 | 35 | 3  | 6
2  | 360 | 29 | 4  | 10
3  | 165 | 36 | 7  | 3
4  | 43  | 60 | 6  | 9
5  | 92  | 65 | 5  | 6
6  | 200 | 30 | 5  | 5
7  | 355 | 10 | 6  | 7
8  | 290 | 7  | 10 | 10
9  | 230 | 21 | 9  | 11
10 | 120 | 55 | 2  | 5
11 | 73  | 54 | 12 | 4
12 | 205 | 48 | 5  | 1
13 | 400 | 20 | 5  | 15
14 | 320 | 39 | 4  | 7
15 | 72  | 60 | 8  | 6
16 | 272 | 20 | 5  | 8
17 | 94  | 58 | 7  | 3
18 | 190 | 40 | 8  | 11
19 | 235 | 27 | 9  | 8
20 | 139 | 30 | 7  | 5

Excel Output

Regression statistics: r = 0.896755, r² = 0.80417, adjusted r² = 0.767452, standard error = 51.04855, observations = 20.

ANOVA
Source         | df | SS       | MS       | F        | p-value
Regression     | 3  | 171220.5 | 57073.49 | 21.90118 | 6.56E-06
Residual error | 16 | 41695.28 | 2605.955 |          |
Total          | 19 | 212915.8 |          |          |

          | Coefficients | Stand. err. | t Stat   | p-value  | Lower 95% | Upper 95%
Intercept | 427.1938     | 59.60143    | 7.167509 | 2.24E-06 | 300.8444  | 553.5432
X1        | -4.58266     | 0.772319    | -5.93364 | 2.1E-05  | -6.21991  | -2.94542
X2        | -14.8309     | 4.754412    | -3.11939 | 0.006606 | -24.9098  | -4.75196
X3        | 6.101032     | 4.01212     | 1.52065  | 0.147862 | -2.40428  | 14.60635

Excel Output: Summary
Regression equation: Y = 427.19 − 4.58X1 − 14.83X2 + 6.10X3.
If X1 = 30, X2 = 5, X3 = 10: Y = 427.19 − 4.58(30) − 14.83(5) + 6.10(10) = 276.6.
r² = 0.804, i.e., about 80%.
Overall test: from the ANOVA table, the p-value (6.56×10⁻⁶) is essentially 0, so the model as a whole is significant.
Individual tests:
1. X1 is significant (p-value = 2.1×10⁻⁵).
2. X2 is significant (p-value = 0.0066).
3. X3 is not significant (p-value = 0.148).

Qualitative (or Categorical) Variable

Home | Heating cost ($) Y | Outside temp (°F) X1 | Insulation (in) X2 | Garage X4 (1 = with, 0 = without)
1  | 250 | 35 | 3  | 0
2  | 360 | 29 | 4  | 1
3  | 165 | 36 | 7  | 0
4  | 43  | 60 | 6  | 0
5  | 92  | 65 | 5  | 0
6  | 200 | 30 | 5  | 0
7  | 355 | 10 | 6  | 1
8  | 290 | 7  | 10 | 1
9  | 230 | 21 | 9  | 0
10 | 120 | 55 | 2  | 0
11 | 73  | 54 | 12 | 0
12 | 205 | 48 | 5  | 1
13 | 400 | 20 | 5  | 1
14 | 320 | 39 | 4  | 1
15 | 72  | 60 | 8  | 0
16 | 272 | 20 | 5  | 1
17 | 94  | 58 | 7  | 0
18 | 190 | 40 | 8  | 1
19 | 235 | 27 | 9  | 0
20 | 139 | 30 | 7  | 0

Excel Output (translated from Korean)

Regression statistics: multiple r = 0.932651, r² = 0.869838, adjusted r² = 0.845433, standard error = 41.61842, observations = 20.

ANOVA
Source     | df | SS       | MS       | F        | Significance F
Regression | 3  | 185202.3 | 61734.09 | 35.64133 | 2.59E-07
Residual   | 16 | 27713.48 | 1732.093 |          |
Total      | 19 | 212915.8 |          |          |

          | Coefficients | Stand. err. | t Stat   | p-value  | Lower 95% | Upper 95%
Intercept | 393.6657     | 45.00128    | 8.747876 | 1.71E-07 | 298.2672  | 489.0641
X1        | -3.96285     | 0.652657    | -6.07186 | 1.62E-05 | -5.34642  | -2.57928
X2        | -11.334      | 4.001531    | -2.8324  | 0.01201  | -19.8168  | -2.85109
X4        | 77.4321      | 22.78282    | 3.398706 | 0.00367  | 29.13468  | 125.7295

Y = 393.67 − 3.96X1 − 11.33X2 + 77.43X4.
With a garage, $77.43 more heating cost is needed. All the independent variables are significant.

Categorical Variable: Three Regions
With three regions (A, B, C), 3 − 1 = 2 dummy variables are needed. Either coding works:

Coding 1: X3 X4 with A: 1 0; B: 0 1; C: 0 0
Coding 2: X3 X4 with A: 0 0; B: 1 0; C: 0 1

Obs | Sales (Y) | Ad. cost (X1) | Bonus (X2) | Region (X3)
1  | 960  | 370 | 230 | A
2  | 890  | 410 | 240 | A
3  | 1050 | 410 | 270 | A
4  | 1200 | 450 | 290 | B
5  | 1400 | 520 | 280 | C
6  | 1500 | 640 | 320 | C
7  | 1600 | 630 | 290 | C
8  | 1000 | 450 | 305 | A
9  | 1000 | 490 | 240 | A
10 | 1100 | 500 | 270 | B
11 | 1300 | 480 | 330 | C
12 | 1550 | 620 | 260 | C
13 | 1040 | 450 | 235 | A
14 | 1050 | 440 | 250 | B
15 | 1100 | 490 | 230 | B
16 | 1220 | 540 | 270 | B
17 | 1500 | 610 | 265 | C
18 | 1560 | 600 | 280 | C
19 | 1630 | 585 | 310 | C
20 | 1160 | 520 | 290 | A
21 | 1200 | 535 | 270 | B
22 | 1290 | 480 | 310 | B
23 | 1460 | 540 | 290 | C
24 | 1580 | 580 | 290 | C
25 | 1120 | 500 | 170 | B
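Before the Excel output for this model, here is a minimal Python sketch (numpy only; an addition to this transcript, not part of the slides) showing how the region labels are turned into the two dummy columns under Coding 1 and fitted by OLS. Since it uses the slide's own data and coding, it should reproduce the coefficient estimates in the output that follows.

```python
import numpy as np

# Sales data from the slide: Y, advertising cost X1, bonus X2, region A/B/C
Y  = np.array([960, 890, 1050, 1200, 1400, 1500, 1600, 1000, 1000, 1100,
               1300, 1550, 1040, 1050, 1100, 1220, 1500, 1560, 1630, 1160,
               1200, 1290, 1460, 1580, 1120], float)
X1 = np.array([370, 410, 410, 450, 520, 640, 630, 450, 490, 500,
               480, 620, 450, 440, 490, 540, 610, 600, 585, 520,
               535, 480, 540, 580, 500], float)
X2 = np.array([230, 240, 270, 290, 280, 320, 290, 305, 240, 270,
               330, 260, 235, 250, 230, 270, 265, 280, 310, 290,
               270, 310, 290, 290, 170], float)
region = list("AAABCCCAABCCABBBCCCABBCCB")

# Dummy coding with C as the baseline: A -> (1, 0), B -> (0, 1), C -> (0, 0)
X3 = np.array([1.0 if r == "A" else 0.0 for r in region])
X4 = np.array([1.0 if r == "B" else 0.0 for r in region])

# OLS fit of Y = a + b1*X1 + b2*X2 + b3*X3 + b4*X4 via least squares
X = np.column_stack([np.ones_like(Y), X1, X2, X3, X4])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)

for name, b in zip(["Intercept", "X1", "X2", "X3", "X4"], beta):
    print(f"{name:9s} {b:9.2f}")
# Should recover the equation in the output below:
# Y = 572.20 + 1.24*X1 + 0.73*X2 - 298.34*X3 - 212.83*X4
```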
Categorical Variable: Output

Regression statistics: multiple r = 0.964498, r² = 0.93026, adjusted r² = 0.916309, standard error = 67.18806, n = 25.

ANOVA
Source     | df | SS       | MS       | F        | p-value
Regression | 4  | 1204251  | 301062.8 | 66.69188 | 2.8E-11
Residual   | 20 | 90284.71 | 4514.235 |          |
Total      | 24 | 1294536  |          |          |

          | Coefficients | Stand. err. | t Stat   | p-value  | Lower 95% | Upper 95%
Intercept | 572.197      | 222.2452    | 2.574622 | 0.018092 | 108.6021  | 1035.793
X1        | 1.2442       | 0.304009    | 4.09263  | 0.00057  | 0.610045  | 1.878347
X2        | 0.73258      | 0.446141    | 1.64203  | 0.11621  | -0.19806  | 1.663214
X3        | -298.34      | 55.55891    | -5.3697  | 3E-05    | -414.231  | -182.443
X4        | -212.83      | 44.58088    | -4.7739  | 0.00012  | -305.82   | -119.831

Y = 572.20 + 1.24X1 + 0.73X2 − 298.34X3 − 212.83X4, with the dummy coding X3, X4: A: 1 0; B: 0 1; C: 0 0.

Multicollinearity, or Intercorrelation (다중공선성)
Correlation among the independent variables.

Hospital | Nurses (Y) | Outpatient (X1) | X-ray (X2) | Inpatient (X3) | Population (X4)
1  | 696   | 44  | 2048  | 1340  | 95
2  | 1033  | 20  | 3940  | 620   | 128
3  | 1603  | 19  | 6505  | 570   | 367
4  | 1611  | 49  | 5720  | 1500  | 357
5  | 1613  | 45  | 11520 | 1365  | 240
6  | 1854  | 55  | 5780  | 1690  | 433
7  | 2160  | 59  | 5970  | 1640  | 467
8  | 2305  | 94  | 8460  | 2870  | 787
9  | 3503  | 128 | 20110 | 3655  | 1805
10 | 3570  | 96  | 13310 | 2910  | 609
11 | 3740  | 131 | 10770 | 3920  | 1037
12 | 4026  | 127 | 15540 | 3865  | 1268
13 | 10340 | 252 | 36190 | 7680  | 1577
14 | 11730 | 410 | 34700 | 12450 | 1694
15 | 15410 | 460 | 39200 | 14100 | 3314

Comments on Multicollinearity
Why it happens, and a cure for it. Consider the toy data:

Y | X1 | X2
1 | 1  | 1
2 | 2  | 2
3 | 3  | 3

Possible regression equations (each fits the data perfectly, r² = 100%):
(1) Y' = X1
(2) Y' = X2
(3) Y' = 0.5X1 + 0.5X2
(4) Y' = 2X1 − X2

If some (or all) of the correlation coefficients among the independent variables are high, then multiple equations like these are quite possible, so the multicollinearity problem is likely to occur. Otherwise, only one equation would exist, so it is unlikely to occur.
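The slides stop at the diagnosis. One standard numerical check (an addition here, not mentioned on the slides) is the pairwise correlation matrix of the X's together with each variable's variance inflation factor (VIF). A minimal numpy sketch on the hospital data above:

```python
import numpy as np

# Independent variables from the hospital table on the slide
X1 = np.array([44, 20, 19, 49, 45, 55, 59, 94, 128, 96, 131, 127,
               252, 410, 460], float)                                # Outpatient
X2 = np.array([2048, 3940, 6505, 5720, 11520, 5780, 5970, 8460, 20110,
               13310, 10770, 15540, 36190, 34700, 39200], float)     # X-ray
X3 = np.array([1340, 620, 570, 1500, 1365, 1690, 1640, 2870, 3655,
               2910, 3920, 3865, 7680, 12450, 14100], float)         # Inpatient
X4 = np.array([95, 128, 367, 357, 240, 433, 467, 787, 1805,
               609, 1037, 1268, 1577, 1694, 3314], float)            # Population
X = np.column_stack([X1, X2, X3, X4])

# Pairwise correlations among the X's: high values warn of multicollinearity
print(np.round(np.corrcoef(X, rowvar=False), 3))

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing X_j on the others
def vif(X, j):
    y = X[:, j]
    Z = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2)

for j, name in enumerate(["X1", "X2", "X3", "X4"]):
    print(f"VIF({name}) = {vif(X, j):.1f}")  # values above ~10 are a common red flag
```

In the spirit of the slide's toy example, a common cure is to drop (or combine) one of the highly correlated predictors and refit the equation.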