Correlation and Regression (Stat 101, AY 2020-2021)
Summary
This document provides an overview of correlation and regression analysis, specifically focusing on simple linear regression. It includes examples, calculations, and interpretations of coefficients, with notes on how to perform regression analysis in Excel and the assumptions for the method.
Full Transcript
Correlation and Regression
Chapter 10, Stat 101, 1st Semester, AY 2020-2021

Objectives
1. Interpret the linear correlation coefficient properly.
2. Know the basics of statistical modelling.
3. Implement simple linear regression modelling.
4. Interpret the coefficients of the simple linear regression model.

Correlation Analysis (Learning Objective 1)
Purpose: to measure the strength and direction of the linear association between two variables. It is used when we are interested in describing how two variables change relative to each other.

[Figure: scatter plot of Weight against Height]

Scatter Diagram
A two-dimensional graph used to visualize the possible underlying relationship between two variables by plotting individual pairs of observations.

Example
We have the following variables collected from 10 students:
X = high school grade in algebra
Y = Stat 101 final exam scores

Student    X    Y
1          90   85
2          87   87
3          85   89
4          85   90
5          95   92
6          96   94
7          82   80
8          78   75
9          75   60
10         84   78

[Figure: scatter plot of Stat 101 final exam scores (Y) against HS algebra grade (X)]

[Figure: four scatter plots labeled W, X, Y, and Z illustrating different patterns of association]

Linear Correlation Coefficient (ρ)
A measure of the strength and direction of the linear relationship existing between two variables, say X and Y, that is independent of their respective scales of measurement:

ρ = [E(XY) − E(X)E(Y)] / sqrt(Var(X) Var(Y))

Properties
1. A linear correlation coefficient can only assume values between -1 and 1, inclusive of endpoints: −1 ≤ ρ ≤ 1.
2. The sign of ρ describes the direction of the linear relationship between X and Y. A positive value of ρ (ρ > 0) means that the line slopes upward to the right, and so when X increases, Y is expected to increase. A negative value of ρ (ρ < 0) means that the line slopes downward to the right, and so when X increases, Y is expected to decrease.
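The population coefficient ρ above is estimated from sample data by Pearson's r. A minimal sketch in Python using the ten-student dataset from the example (the function name pearson_r is our own):

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation: r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# HS algebra grades (X) and Stat 101 final exam scores (Y) from the slides
algebra = [90, 87, 85, 85, 95, 96, 82, 78, 75, 84]
stat101 = [85, 87, 89, 90, 92, 94, 80, 75, 60, 78]

r = pearson_r(algebra, stat101)
print(round(r, 4))  # about 0.8611, matching the slides
```

Note that r is unit-free: rescaling X or Y (say, grades to a 0-1 scale) leaves it unchanged, which is what "independent of their respective scales of measurement" means.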
Hypothesis Test on ρ
Under Ho: ρ = ρ0, the test statistic

T = (r − ρ0) sqrt(n − 2) / sqrt(1 − r²)

follows a t-distribution with v = n − 2 degrees of freedom.

Example
Using the example on the relationship of high school algebra grade with Stat 101 final exam scores, we got a linear correlation coefficient estimate of 0.8611. Is there sufficient evidence at the 0.05 level of significance that there is a positive correlation between high school algebra grade and Stat 101 final exam grade?

Solution
Let ρ be the correlation between HS algebra grade and Stat 101 final exam grade. We are testing:
Ho: ρ = 0
Ha: ρ > 0
Decision Rule: Reject Ho if T > t(0.05, 8) = 2.306
Test Statistic: T = (r − ρ0) sqrt(n − 2) / sqrt(1 − r²) = (0.8611 − 0) sqrt(10 − 2) / sqrt(1 − 0.8611²) = 4.7896
Decision: Since T = 4.79 > 2.306, we reject Ho at α = 0.05.
Conclusion: We can conclude that high school algebra grade and Stat 101 final exam grade are positively correlated.

Simple Linear Regression (Learning Objectives 2, 3, and 4)

Example
[Table and scatter plot of the 10 students' HS algebra grades (X) and Stat 101 final exam scores (Y), same data as in the earlier example]

Simple Linear Regression
Purpose: to evaluate the relative impact of a predictor on a particular outcome. A simple linear regression model contains only one explanatory (independent) variable, denoted by X, with n observations X_i, i = 1, 2, ..., n, and is linear with respect to both the regression coefficients and the response (dependent) variable.

Purposes of Linear Regression
We can use linear regression modelling for the following purposes:
1. Describe the linear relationship between variables.
2. Determine how much one variable (X) affects another variable (Y) on the average.
3. Predict the value of the dependent variable given a value of the independent variable.
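The test statistic above is straightforward to compute directly. A small stdlib-only sketch (the critical value 2.306 = t(0.05, 8) is taken from the slides rather than computed, since computing t quantiles needs a stats library):

```python
import math

def corr_t_stat(r, n, rho0=0.0):
    """T = (r - rho0) * sqrt(n - 2) / sqrt(1 - r^2), with df = n - 2."""
    return (r - rho0) * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

T = corr_t_stat(0.8611, 10)
critical = 2.306  # t(0.05, v = 8), taken from the slides
print(round(T, 2), T > critical)  # about 4.79, True -> reject Ho: rho = 0
```

Starting from the rounded r = 0.8611 gives T ≈ 4.79, agreeing with the slides' 4.7896 (which uses the unrounded correlation) to two decimals.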
Simple Linear Regression
The model is given by the equation

Y_i = β0 + β1 X_i + ε_i,   i = 1, 2, ..., n

where
Y_i is the value of the response variable for the ith element;
X_i is the value of the explanatory variable for the ith element;
β0 is a regression coefficient that gives the Y-intercept of the regression line;
β1 is the regression coefficient that gives the slope of the regression line;
ε_i is the random error term for the ith element, where the ε_i are independent, normally distributed with mean 0 and variance σ²; and
n is the number of observations.

Interpretation of Coefficients
β0 is the value of the mean of Y when X = 0.
β1 gives the amount of change in the mean or expected value of Y for every unit increase in the value of X.

Simple Linear Regression Model
In this model, we assume that the X_i's are simply constant values that are used to explain the Y_i's, which are the random variables of interest. Note that because of the uncertain nature of the random error terms ε_i, two or more observations having the same value of X will not necessarily have the same value for Y.

Example
[Table and scatter plot of the 10 students' HS algebra grades (X) and Stat 101 final exam scores (Y), same data as in the earlier example]

Random Error Term
A random error term may be thought of as a representation of the effect of factors other than X that are not explicitly stated in the model but affect the response variable to some extent. It accounts for the inherent variation, the basic and unpredictable element of randomness, in responses. It also accounts for measurement errors in recording the value of the response variable.

Random Error Term Assumptions
- The error terms are independent from one another;
- The error terms are normally distributed;
- The error terms all have a mean of 0; and
- The error terms have constant variance σ².

Random Error Term
Since ε_i ~ Normal(0, σ²), it follows that the Y_i's follow a normal distribution where the expected value of Y_i
is β0 + β1 x_i and the variance is still σ². Thus, we have Y_i ~ Normal(β0 + β1 x_i, σ²), with the Y_i's independent.

Steps in Simple Linear Regression
1. Obtain the equation that best fits the data.
2. Evaluate the equation to determine the strength of the relationship for prediction and estimation.
3. Determine if the assumptions about the error terms are satisfied.
4. If the model fits the data adequately, use the equation for prediction and for describing the nature of the relationship between the variables.

Method of Least Squares
In this method, we minimize the squared deviations of the observed values of Y from the expected values of Y, that is, the squares of the error terms:

ε_i = Y_i − E(Y_i) = Y_i − (β0 + β1 X_i)

So we require β0 and β1 to be those values that minimize

Σ (from i = 1 to n) ε_i²

The method is hinged on the idea that the "best-fitting" line is the one that minimizes the sum of squares of the deviations of the observed values of Y from their expected values.

[Figure: scatter plot of Weight against Height with a fitted line]

Estimated Regression Equation
Ŷ = b0 + b1 X
where Ŷ is the estimated value of Y, and b0 and b1 are the estimates for β0 and β1. For every unit increase in X, Ŷ increases by b1 units. If X = 0, then Ŷ is equal to b0.

The estimated regression equation is appropriate only for the relevant range of X. If X = 0 is not included in the range of the sample data, b0 will NOT have a meaningful interpretation.
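For simple linear regression, the least-squares minimization has closed-form solutions: b1 = Sxy / Sxx and b0 = ȳ − b1 x̄, where Sxy and Sxx are the sums of cross-products and squared deviations. A minimal sketch fitting the slides' ten-student data (the function name least_squares is our own):

```python
def least_squares(xs, ys):
    """Return (b0, b1) minimizing the sum of squared deviations
    of the observed Y values from the fitted line b0 + b1*X."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx          # slope
    b0 = my - b1 * mx       # intercept: the fitted line passes through (x-bar, y-bar)
    return b0, b1

algebra = [90, 87, 85, 85, 95, 96, 82, 78, 75, 84]
stat101 = [85, 87, 89, 90, 92, 94, 80, 75, 60, 78]
b0, b1 = least_squares(algebra, stat101)
print(round(b0, 4), round(b1, 4))  # about -29.1883 and 1.3091, as in the Excel output
```

The second comment reflects a useful property of least squares: the fitted line always passes through the point of means (x̄, ȳ).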
Example
We have the following variables collected from 10 students:
X = high school grade in algebra
Y = Stat 101 final exam scores

[Table and scatter plot of the 10 students' data, same as in the earlier example]

Example: Regression Analysis in Excel

Regression Statistics
Multiple R           0.861068712
R Square             0.741439326
Adjusted R Square    0.709119242
Standard Error       5.494265981
Observations         10

ANOVA
             df    SS             MS             F              Significance F
Regression   1     692.5043306    692.5043306    22.94051342    0.00137364
Residual     8     241.4956694    30.18695867
Total        9     934

             Coefficients    Standard Error    t Stat          P-value        Lower 95%       Upper 95%
Intercept    -29.18831972    23.48754171       -1.242714971    0.249161165    -83.35068803    24.97404858
X            1.30908191      0.273316125       4.789625603     0.00137364     0.678813796     1.939350025

Example
Fitting the line, we have:
ŷ = −29.1883 + 1.3091x
or, stylized:
stat = −29.1883 + 1.3091 × algebra

Interpretation
For every one-unit increase in high school algebra grade, the mean Stat 101 final exam score increases by 1.3091 units. We do not interpret the y-intercept, since there is no observation where the value of X is equal to zero.

Coefficient of Determination (R²)
The coefficient of determination, R², is the proportion of the variability in the observed values of the response variable that can be explained by the explanatory variable through their linear relationship. For simple linear regression, R² = r², where r is Pearson's correlation coefficient.

The realized value of the coefficient of determination, R², will be between 0 and 1 because −1 ≤ r ≤ 1. If a model has perfect predictability, then R² = 1, but if a model has no predictive capability, then R² = 0. An R² between 0 and 1 indicates the extent to which the dependent variable is predictable.
An R² of 0.10 means that 10 percent of the variance in Y is predictable from X; an R² of 0.20 means that 20 percent is predictable; and so on.

Example
From the regression output, R Square = 0.741439326: approximately 74% of the variability in Stat 101 final exam scores can be explained by the HS algebra grades.

Predicting Y from X
The predicted value of Y at a given value x of the explanatory variable X is computed by substituting x into the prediction equation Ŷ = b0 + b1 X. We can thus predict the mean of Y given a value of X by plugging that value into the model. However, we cannot use the model to predict Y when the value of X is outside the bounds of the observed X.

Example
If a new student will enroll in Stat 101, what is her expected Stat 101 final exam score if her high school algebra grade is 88%? What if it is 98%? Note that the model is only interpretable for values of X from 75 to 96.
a. Ŷ = −29.1883 + 1.3091(88) = 86.0109 (computed with the unrounded coefficients)
b. Since 98 is outside the scope of X, we cannot use the model to predict E(Y).
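Putting the pieces together, R² can be obtained as r², and prediction should be refused outside the observed range of X. A sketch using the fitted values from the slides (the coefficients, r, and the 75-96 range come from the slides; the range-check guard and function name are our own):

```python
B0, B1 = -29.1883, 1.3091   # fitted coefficients from the slides
X_MIN, X_MAX = 75, 96       # observed range of HS algebra grades

def predict_stat101(algebra_grade):
    """Predicted Stat 101 final exam score, or None outside the
    observed range of X (the model must not be extrapolated)."""
    if not X_MIN <= algebra_grade <= X_MAX:
        return None
    return B0 + B1 * algebra_grade

r = 0.8611                  # correlation estimate from the slides
r_squared = r ** 2          # for simple linear regression, R^2 = r^2 (about 0.74)

print(predict_stat101(88))  # about 86.01
print(predict_stat101(98))  # None: 98 is outside the observed range of X
```

Returning None for out-of-range inputs makes the "no extrapolation" rule explicit in code instead of relying on the analyst to remember it.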
Software Applications: Regression using MS Excel
Simple linear regression analysis may be performed using the Regression option in MS Excel through either of the following:
1. Data Analysis Toolpak: from the menu bar, Data -> Data Analysis -> Regression
2. PHStat: from the menu bar, Add-ins -> PHStat -> Regression -> Simple Linear Regression

END OF CHAPTER 10