Regression Analysis PDF
Document Details
Uploaded by WarmheartedExpressionism
Summary
This document provides an introduction to regression analysis, specifically for STAT 115. It covers topics like types of variables (independent and dependent), describing linear patterns with regression lines, and the principle of least squares.
Full Transcript
REGRESSION ANALYSIS
STAT 115

INTRODUCTION
In analyzing data for the health sciences disciplines, we often want to learn something about the relationship between two numeric variables. We may, for example, be interested in studying the relationship between blood pressure and age, height and weight, the concentration of an injected drug and heart rate, the consumption level of some nutrient and weight gain, the intensity of a stimulus and reaction time, or total family income and medical care expenditures. The nature and strength of the relationships between variables may be examined by regression and correlation analysis, two statistical techniques that, although related, serve different purposes.

TYPES OF VARIABLES

INDEPENDENT
An independent variable is one that stands alone and isn't changed by the other variables you are trying to measure. For example, someone's age might be an independent variable. Other factors (such as what they eat, how much they go to school, how much television they watch) aren't going to change a person's age.

DEPENDENT
A dependent variable is something that depends on other factors. For example, a test score could be a dependent variable because it could change depending on several factors, such as how much you studied, how much sleep you got the night before you took the test, or even how hungry you were when you took it. The dependent variable is the variable for which we want to make a prediction.

DESCRIBING LINEAR PATTERNS WITH A REGRESSION LINE
o Scatterplots show us a lot about a relationship, but we often want more specific numerical descriptions of how the dependent and independent variables are related.
o For example, suppose that we are examining the weights and heights of a sample of college women. We might want to know what the increase in average weight is for each 1-inch increase in height.
o Or, we might want to estimate the average weight for women with a specific height, like 5'10".
REGRESSION ANALYSIS
Regression analysis is the area of statistics used to examine the relationship between a quantitative dependent variable and one or more independent variables. A key element of regression analysis is the estimation of an equation that describes how, on average, the dependent variable is related to the independent variables. The regression equation can be used to answer the types of questions that we just asked about the weights and heights of college women.

A regression equation can also be used to make predictions. For instance, it might be useful for colleges to have an equation for the connection between verbal SAT scores and college grade point average (GPA). They could use that equation to predict the potential GPAs of future students, based on their verbal SAT scores.

SIMPLE LINEAR REGRESSION
Simple linear regression analysis is a statistical tool that gives us the ability to estimate the mathematical relationship between a dependent variable (usually called y) and an independent variable (usually called x). The simplest kind of relationship between two variables is a straight line, so the straight-line regression model is frequently used. Before using a straight-line regression model, we should always examine a scatterplot to verify that the pattern actually is linear.

SIMPLE LINEAR REGRESSION MODEL
The equation for a straight line relating y and x is

$$y = a + bx$$

where a is the "y-intercept" and b is the slope. The y represents the vertical direction, and x represents the horizontal direction. The slope tells us how much the y variable changes for each increase of one unit in the x variable.

PRINCIPLE OF LEAST SQUARES
In regression analysis, our objective is to use the data to position a line that best represents the relationship between the two variables. Our first approach is to use a scatter diagram to visually position the line.
However, a line drawn using a straight edge has one disadvantage: its position is based in part on the judgment of the person drawing the line. The accompanying figure shows hand-drawn lines that represent the judgments of four people. All the lines except line A seem to be reasonable; that is, each line is centered among the graphed data. However, each would result in a different estimate of the slope. We would prefer a method that results in a single, best regression line. This method is called the least squares principle. It gives what is commonly referred to as the "best-fitting" line.

The method of least squares chooses the values of a and b to minimize the sum of squared errors, $\sum (Y - \hat{Y})^2$, for the line

$$\hat{Y} = a + bX$$

where:
o $\hat{Y}$, read "Y hat," is the estimated value of the Y variable for a selected X value.
o "a" is the estimated value of Y where the regression line crosses the Y-axis, that is, when X is zero.
o "b" is the slope of the line, or the average change in $\hat{Y}$ for each change of one unit (either increase or decrease) in the independent variable X.
o X is any value of the independent variable that is selected.

FORMULAS FOR "a" AND "b"

$$b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$

$$a = \bar{y} - b\bar{x}$$

EXAMPLE 1:
The weekly advertising expenditure (x) and weekly sales (y) are presented in the following table.

y       x
1250    41
1380    54
1425    63
1425    54
1450    48
1300    46
1400    62
1510    61
1575    64
1650    71

(i) Fit the least squares regression line of Y = Sales on X = Advertising Cost.
(ii) Estimate the value of sales when advertising expenditure is $50.

SOLUTION:
The necessary calculations are given below:

X     Y       X²      XY
41    1250    1681    51250
54    1380    2916    74520
63    1425    3969    89775
54    1425    2916    76950
48    1450    2304    69600
46    1300    2116    59800
62    1400    3844    86800
61    1510    3721    92110
64    1575    4096    100800
71    1650    5041    117150
ΣX = 564    ΣY = 14365    ΣX² = 32604    ΣXY = 818755

CALCULATIONS OF "a" AND "b"
From the previous table we have:

$$n = 10, \quad \sum x = 564, \quad \sum x^2 = 32604, \quad \sum y = 14365, \quad \sum xy = 818755$$

The least squares estimates of the regression coefficients are:

$$b = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2} = \frac{10(818755) - (564)(14365)}{10(32604) - (564)^2} = \frac{85690}{7944} \approx 10.8$$

$$a = \bar{y} - b\bar{x} = 1436.5 - 10.787(56.4) \approx 828$$

The intercept is computed with the slope carried to three decimals (10.787) before rounding.

ESTIMATED REGRESSION LINE
The estimated regression line is:

$$\hat{y} = 828 + 10.8x$$

$$\text{Sales} = 828 + 10.8 \times \text{Expenditure}$$

This means that if the weekly advertising expenditure is increased by $1, we would expect the weekly sales to increase by $10.8.

REGRESSION LINE FOR PREDICTION PURPOSE
Estimated values for the sample data are obtained by substituting the x value into the estimated regression function. If the advertising expenditure is $50, then the estimated sales value is:

$$\text{Sales} = 828 + 10.8(50) = 1368$$

This is called the point estimate (forecast) of the dependent variable (sales).

EXAMPLE 2:
The following scores represent a nurse's assessment (X) and a physician's assessment (Y) of the condition of 10 patients at time of admission to a trauma center.

X: 18 13 18 15 10 12 8 4 7 3
Y: 23 20 18 16 14 11 10 7 6 4

(A) Construct a scatter diagram for these data.
(B) Find the regression equation.
(C) Predict the value of the physician's assessment when the nurse's assessment is 2.
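As a numerical check on Example 1, the raw-sums formulas for "a" and "b" can be evaluated directly. The following is a minimal Python sketch, not part of the original slides; the function name `least_squares_line` and the variable names are illustrative assumptions.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line y-hat = a + b*x using the raw-sums formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Slope: b = (n*Sxy - Sx*Sy) / (n*Sx2 - Sx^2)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: a = y-bar - b * x-bar
    a = sum_y / n - b * (sum_x / n)
    return a, b


# Example 1 data: weekly advertising expenditure (x) and weekly sales (y)
advertising = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]
sales = [1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650]

a, b = least_squares_line(advertising, sales)
predicted_sales_at_50 = a + b * 50

print(f"slope b = {b:.4f}")
print(f"intercept a = {a:.4f}")
print(f"estimated sales at x = 50: {predicted_sales_at_50:.2f}")
```

Because the sketch carries the slope unrounded, its prediction at x = 50 comes out near 1367.5 rather than the 1368 obtained from the rounded coefficients 828 and 10.8.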
SOLUTION:

X     18    13    18    15    10    12    8     4     7     3     ΣX = 108
Y     23    20    18    16    14    11    10    7     6     4     ΣY = 129
X²    324   169   324   225   100   144   64    16    49    9     ΣX² = 1424
XY    414   260   324   240   140   132   80    28    42    12    ΣXY = 1672

SCATTER PLOT
[Scatter plot: Physician's Assessment (vertical axis) against Nurse's Assessment (horizontal axis)]

SOLUTION: CONT---

$$n = 10, \quad \sum x = 108, \quad \sum x^2 = 1424, \quad \sum y = 129, \quad \sum xy = 1672$$

$$b = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2} = \frac{10(1672) - (108)(129)}{10(1424) - (108)^2} = \frac{2788}{2576} \approx 1.08$$

$$a = \bar{y} - b\bar{x} = 12.9 - 1.08(10.8) = 1.236$$

ESTIMATED REGRESSION LINE
The estimated regression line is:

$$\hat{y} = 1.236 + 1.08x$$

$$\text{Physician's Assessment} = 1.236 + 1.08 \times \text{Nurse's Assessment}$$

This means that if the nurse's assessment is increased by 1 point, we can expect the physician's assessment to increase by 1.08 points.

REGRESSION LINE FOR PREDICTION PURPOSE
Estimated values for the sample data are obtained by substituting the x value into the estimated regression function. If the nurse's assessment is x = 2, then the estimated physician's assessment is:

$$\text{Physician's Assessment} = 1.236 + 1.08(2) = 3.396 \approx 3.40$$

EXAMPLE 3:
The number of hours spent per week viewing TV, y, and the number of years of education, x, were recorded for 10 randomly selected individuals. The results are given in the table below.
(A) Plot the data in a scatter plot.
(B) Calculate the coefficient of correlation.
(C) Find the least-squares line for these data.
(D) Compute the error sum of squares for the data.

X     Y     (X - X̄)   (Y - Ȳ)   (X - X̄)²   (Y - Ȳ)²   (X - X̄)(Y - Ȳ)
12    10    -2.1      -0.6      4.41       0.36       1.26
14    9     -0.1      -1.6      0.01       2.56       0.16
11    15    -3.1      4.4       9.61       19.36      -13.64
16    8     1.9       -2.6      3.61       6.76       -4.94
16    5     1.9       -5.6      3.61       31.36      -10.64
18    4     3.9       -6.6      15.21      43.56      -25.74
12    20    -2.1      9.4       4.41       88.36      -19.74
20    4     5.9       -6.6      34.81      43.56      -38.94
10    16    -4.1      5.4       16.81      29.16      -22.14
12    15    -2.1      4.4       4.41       19.36      -9.24
ΣX = 141    ΣY = 106    Σ(X - X̄)² = 96.9    Σ(Y - Ȳ)² = 284.4    Σ(X - X̄)(Y - Ȳ) = -143.6

$$\bar{X} = \frac{\sum X}{n} = \frac{141}{10} = 14.1 \quad \text{and} \quad \bar{Y} = \frac{\sum Y}{n} = \frac{106}{10} = 10.6$$

(A) SCATTER DIAGRAM
[Scatter diagram: number of hours spent per week viewing TV (Y) against number of years of education (X)]

(B) COEFFICIENT OF CORRELATION
Putting the sums in the formula we get:

$$r_{xy} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}} = \frac{-143.6}{\sqrt{(96.9)(284.4)}} = \frac{-143.6}{166.0071} \approx -0.8650$$

This indicates a strong negative correlation.

For Estimating Regression Line

$$b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{-143.6}{96.9} \approx -1.48$$

$$a = \bar{y} - b\bar{x} = 10.6 - 14.1(-1.4819) \approx 31.495$$

The estimated equation is:

$$\hat{y} = 31.495 - 1.48x$$

The fitted or estimated values using the least squares line, the errors, and the sum of squares of error are given in the table below:

PRACTICE QUESTIONS
1. Find the slope and y-intercept of the following straight lines.
(a) y = 1.5x - 2.5
(b) y = 3.0 - 2.5x
(c) 2y = 4x - 6
(d) 16x - 32y + 8 = 0

2. The following are the weights (kg) and blood glucose levels (mg/100 ml) of 16 apparently healthy adult males:
(a) Prepare a scatter plot to examine the relationship between the two variables.
(b) Find the simple linear regression equation.
(c) Find the predicted values.

PRACTICE QUESTIONS CONT.
4. Given are five observations for two variables, x and y.

X     Y
3     55
12    40
6     55
20    10
14    15

(a) Develop a scatter diagram for these data.
(b) What does the scatter diagram developed in part (a) indicate about the relationship between the two variables?
(c) Calculate the coefficient of correlation between x and y.
(d) Try to approximate the relationship between x and y by drawing a straight line through the data.
(e) Develop the estimated regression equation by computing the values of "a" and "b".
(f) Use the estimated regression equation to predict the value of y when x = 10.

5. WageWeb conducts surveys of salary data and presents summaries on its web site. Based on salary data as of October 1, 2002, WageWeb reported that the average annual salary for sales vice presidents was $142,111, with an average annual bonus of $15,432 (WageWeb.com, March 13, 2003).
Assume the following data are a sample of the annual salary and bonus for 10 sales vice presidents. Data are in thousands of dollars.
a. Develop a scatter diagram for these data with salary as the independent variable.
b. What does the scatter diagram developed in part (a) indicate about the relationship between salary and bonus?
c. Use the least squares method to develop the estimated regression equation.
d. Provide an interpretation for the slope of the estimated regression equation.
e. Predict the bonus for a vice president with an annual salary of $120,000.
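The practice questions above follow the same steps as Example 3: deviation sums, the coefficient of correlation, the least-squares line, and the error sum of squares. The following minimal Python sketch walks through those steps on the Example 3 data. It is an illustration rather than part of the original slides; the helper name `deviation_sums` and all variable names are assumptions, and the error sum of squares is taken to mean the sum of squared differences between each observed y and its fitted value.

```python
import math


def deviation_sums(xs, ys):
    """Return x-bar, y-bar, and the sums Sxx, Syy, Sxy of deviations from the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return x_bar, y_bar, sxx, syy, sxy


# Example 3 data: years of education (x) and weekly hours of TV viewing (y)
education = [12, 14, 11, 16, 16, 18, 12, 20, 10, 12]
tv_hours = [10, 9, 15, 8, 5, 4, 20, 4, 16, 15]

x_bar, y_bar, sxx, syy, sxy = deviation_sums(education, tv_hours)

# Coefficient of correlation
r = sxy / math.sqrt(sxx * syy)

# Least-squares slope and intercept
b = sxy / sxx
a = y_bar - b * x_bar

# Error sum of squares: squared gaps between observed and fitted y values
fitted = [a + b * x for x in education]
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(tv_hours, fitted))

print(f"Sxx = {sxx:.2f}, Syy = {syy:.2f}, Sxy = {sxy:.2f}")
print(f"r = {r:.4f}")
print(f"y-hat = {a:.3f} + ({b:.3f})x")
print(f"SSE = {sse:.3f}")
```

Swapping in the x and y lists from a practice question reuses the same calculation without change.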