Simple Linear Regression PDF
Colegio de San Juan de Letran Calamba, School of Engineering and Architecture
Cynthia SM Jacob
Summary
This document covers simple linear regression, including its introduction, statistical models, empirical models, regression analysis, and examples. It details the process of finding the best linear relationship between variables, including the method of least squares. The document also highlights statistical concepts crucial for making predictions and decisions.
Full Transcript
IE013B STATISTICAL ANALYSIS FOR INDUSTRIAL ENGINEERING 2
SIMPLE LINEAR REGRESSION
Prepared by Cynthia SM Jacob, School of Engineering & Architecture

INTRODUCTION

STATISTICS
- It is the study of how to learn from data.
- It helps one to collect the right data, perform the correct analysis, and effectively present the results with statistical knowledge.
- Statistical modeling is key to making scientific discoveries, data-driven decisions, and predictions.

STATISTICAL MODEL
- involves a mathematical relationship between random and non-random variables
- helps identify relationships between variables and make predictions by applying the model to raw data
Examples of common data sets: census data, social media data, public health data

EMPIRICAL MODELS
An empirical model is one based on observed data rather than on a theoretical relationship.

Deterministic model - demonstrates an exact relationship between the variables; does not include elements of randomness
Example: $s_t = s_0 + vt$ measures the displacement of a particle from the origin at time $t = 0$, with velocity $v$ after time $t$

Probabilistic model - contains a random component that affects the relationship but is not being measured
Examples:
- Fuel mileage of a vehicle is related to its engine, but this is not the only determinant of it.
- Power consumption of a house is related to its size, but not purely determined by it.
- Meteorological models that predict the chances of weather occurrences use probabilities.

REGRESSION ANALYSIS
It is the collection of statistical tools used to model and explore relationships between variables that are related in a probabilistic manner.
Regression analysis is used to:
- forecast the value of a dependent variable (Y) from observed values of the independent variable (X)
- analyze the relationship between a dependent and an independent variable

EXAMPLE: In a chemical process, suppose the yield of the product is related to the process-operating temperature. Regression analysis can be used to:
- build a model to predict yield at a given temperature level;
- determine the optimal temperature level to maximize yield

The table shows the purity of oxygen produced ($y$) in a chemical distillation process and the percentage of hydrocarbons ($x$) present in the main condenser of the distillation unit.

SIMPLE LINEAR REGRESSION
$$E(Y \mid x) = \mu_{Y|x} = \beta_0 + \beta_1 x$$
Regression coefficients:
- $\beta_0$ - the intercept
- $\beta_1$ - the slope
Regression analysis deals with finding the best linear relationship between $x$ and $Y$.

THE SIMPLE LINEAR REGRESSION MODEL
$$Y = \beta_0 + \beta_1 x + \epsilon$$
- a statistical model containing the random component $\epsilon$, representing a random error
- $\epsilon$ is a random variable with $E(\epsilon) = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2$.
- Since $E(\epsilon) = 0$, then at a specific $x$, the $y$-values are distributed around the true regression line $y = \beta_0 + \beta_1 x$.

THE REGRESSION LINE
[Figure: the regression line]

THE FITTED REGRESSION LINE
$$\hat{y} = b_0 + b_1 x$$
- an estimate of the true regression line
- $\hat{y}$ is the fitted or predicted value
- $b_0$ and $b_1$ are estimates of the regression coefficients

A residual is an error in the fit of the model $\hat{y} = b_0 + b_1 x$ and is given by
$$e_i = y_i - \hat{y}_i \quad \text{for } i = 1, 2, \ldots, n$$
which also translates to
$$y_i = b_0 + b_1 x_i + e_i$$

The method of least squares is an estimation procedure that minimizes the $SSE$.
$SSE$ - residual sum of squares; sum of squares of the errors about the regression line

The method determines $b_0$ and $b_1$ so as to minimize
$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
$a$ is often used to represent $b_0$ and $b$ to represent $b_1$. The fitted line is given by
$$\hat{y} = a + bx$$

THE METHOD OF LEAST SQUARES
Sum of Squares:
$$S_{xy} = \sum xy - \frac{\sum x \sum y}{n}$$
$$S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}$$
$$S_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n}$$
Estimating the regression coefficients:
$$b = b_1 = \frac{S_{xy}}{S_{xx}}$$
$$a = b_0 = \bar{y} - b\bar{x}$$

EXAMPLE 1: The grades of a class of 9 students on a midterm report ($x$) and on the final examination ($y$) are as follows:

x   77  50  71  72  81  94  96  99  67
y   82  66  78  34  47  85  99  99  68

(a) Estimate the linear regression line.
(b) Estimate the final examination grade of a student who received a grade of 85 on the midterm report.

EXAMPLE 2: A study was done to study the effect of ambient temperature (X) on the electric power consumed by a chemical plant (Y). Other factors were held constant, and the data were collected from an experimental pilot plant.

Y (Watts)   73  84  91  87  78  89  80  94
X (°C)      -3   7  22  14  -1  16   1  23

(a) Estimate the linear regression line.
(b) Predict the power consumption for an ambient temperature of 18 degrees Celsius.

EXAMPLE 3: An industrial engineer working for a manufacturing company has noticed a deviation in the accuracy of a machine after it runs for long periods without a cool down cycle. This is especially concerning because the company wants to increase production (longer machine operating times without a cool down) because of a large contract the company will start in 3-4 months. The industrial engineer decides to monitor the machining process to determine the point (hours of operation) when the machine is producing parts that could be out of tolerance.
Over the course of several months, the industrial engineer monitored the machining process to determine a relationship between hours of machine use and millimeters off target the machine was. The data collected is shown in tabular form (Table 1) and as a scatter plot (Figure 1).

EXAMPLE 3 (continued):
Table 1: Off-target measured as a function of machine use

Based on the above data, the industrial engineer would like to determine the number of hours of machine use that would produce 2 millimeters off target, because many parts would fail quality check at that point. Determine the number of hours of operation that produces 2 millimeters off target based on a least squares fit for the data.

Figure 1: Off-target as a function of machine use

PARTITIONING THE VARIATION
Total variation is made up of two parts:
$$SST = SSR + SSE$$
(total sum of squares = regression sum of squares + error sum of squares)

$$SST = \sum (y_i - \bar{y})^2 = S_{yy}$$
$$SSR = \sum (\hat{y}_i - \bar{y})^2 = b\,S_{xy}$$
$$SSE = \sum (y_i - \hat{y}_i)^2 = SST - SSR$$

- TOTAL SUM OF SQUARES (SST) - measures the variation of the $y_i$ values around their mean
- REGRESSION SUM OF SQUARES (SSR) - explained variation attributable to the relationship between x and y
- ERROR SUM OF SQUARES (SSE) - variation attributable to factors other than the relationship between x and y

EXAMPLE 4: The following data are diastolic blood pressure (DBP) measurements taken at different times after an intervention for n = 5 persons.
For each person, the data available include the time of the measurement and the DBP level. Of interest is the relationship between these two variables. Fit a regression line.

Patient   Time (x)   DBP (y)
1          0         72
2          5         66
3         10         70
4         15         64
5         20         66

[Figure: scatter plot of diastolic blood pressure (y) against minutes (x), with fitted line y = 70.4 - 0.28x]

REGRESSION t-TEST
$H_0: \beta_1 = 0$ (no linear relationship between X and Y)
$H_1: \beta_1 \neq 0$ (a linear relationship exists between X and Y)
$H_1: \beta_1 > 0$ (a positive linear relationship exists between X and Y)
$H_1: \beta_1 < 0$ (a negative linear relationship exists between X and Y)

Test statistic:
$$t = \frac{b_1 - \beta_{10}}{s / \sqrt{S_{xx}}}$$
where the standard error of $b_1$ is
$$s_{b_1} = \frac{s}{\sqrt{S_{xx}}} = \frac{\sqrt{SSE/(n-2)}}{\sqrt{S_{xx}}} = \sqrt{\frac{SSE}{(n-2)\,S_{xx}}}$$

EXAMPLE 4 (continued)

               Coefficients   Standard Error   t Stat     P-value
Intercept      70.4           2.172556098      32.40423   6.46E-05
X Variable 1   -0.28          0.177388463      -1.57846   0.212573

Test if the two variables have a significant linear relationship, using $\alpha = 0.05$.

$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$
$\alpha = 0.05$
Critical values: $\pm t_{0.025,3} = \pm 3.182$
Critical regions: $t > 3.182$ and $t < -3.182$

$$SST = S_{yy} = 22892 - \frac{338^2}{5} = 43.2$$
$$SSR = b\,S_{xy} = -0.28\left(3310 - \frac{(50)(338)}{5}\right) = 19.6$$
$$SSE = SST - SSR = 43.2 - 19.6 = 23.6$$
$$t = \frac{b_1 - \beta_{10}}{s / \sqrt{S_{xx}}} = \frac{-0.28 - 0}{\sqrt{\dfrac{23.6}{(5-2)\left(750 - \frac{50^2}{5}\right)}}} \approx -1.5785$$

Since $-3.182 < -1.5785 < 3.182$, fail to reject $H_0$. There is no significant linear relationship between X and Y.

EXAMPLE 5: The data below show 30 observations on driver age and the maximum distance (feet) at which individuals can read a highway sign.

Age        18   20   22   23   23   25   27   28   29   32   37   41   46   49   53
Distance   510  590  560  510  460  490  560  510  460  410  420  460  450  380  460

Age        55   63   65   66   67   68   70   71   72   73   74   75   77   79   82
Distance   420  350  420  300  410  300  390  320  370  280  420  460  360  310  360

1. Determine the regression line.
2. Test whether age and distance are significantly linearly related.
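The hand computation for Example 4 can be checked with a short script. What follows is a minimal sketch in plain Python, assuming the sum-of-squares formulas from the method of least squares and the tabulated critical value 3.182 quoted in the example; the function names `fit_line` and `slope_t_test` are illustrative and do not come from the slides.

```python
import math


def fit_line(x, y):
    """Least-squares estimates from the S_xx, S_yy, S_xy shortcut formulas."""
    n = len(x)
    sxx = sum(v * v for v in x) - sum(x) ** 2 / n
    syy = sum(v * v for v in y) - sum(y) ** 2 / n
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    b1 = sxy / sxx                       # slope
    b0 = sum(y) / n - b1 * sum(x) / n    # intercept: y-bar minus b1 * x-bar
    return b0, b1, sxx, syy, sxy


def slope_t_test(x, y, beta10=0.0):
    """t statistic for H0: beta1 = beta10, with the SST/SSR/SSE partition."""
    n = len(x)
    b0, b1, sxx, syy, sxy = fit_line(x, y)
    sst = syy
    ssr = b1 * sxy
    sse = sst - ssr
    s = math.sqrt(sse / (n - 2))         # residual standard deviation
    se_b1 = s / math.sqrt(sxx)           # standard error of the slope
    t = (b1 - beta10) / se_b1
    return {"b0": b0, "b1": b1, "sst": sst, "ssr": ssr,
            "sse": sse, "se_b1": se_b1, "t": t}


# Example 4 data: minutes after the intervention and DBP readings.
minutes = [0, 5, 10, 15, 20]
dbp = [72, 66, 70, 64, 66]

result = slope_t_test(minutes, dbp)
T_CRIT = 3.182  # t_{0.025, 3}, taken from the example (not computed here)
reject_h0 = abs(result["t"]) > T_CRIT

print(f"fitted line: y = {result['b0']:.1f} + ({result['b1']:.2f})x")
print(f"SST = {result['sst']:.1f}, SSR = {result['ssr']:.1f}, SSE = {result['sse']:.1f}")
print(f"t = {result['t']:.4f}, reject H0: {reject_h0}")
```

The same two functions can be called with the age and distance lists from Example 5; only the critical value has to be looked up again, because the degrees of freedom change to n - 2 = 28.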
EXAMPLE 6: Stretched handspans and heights are measured in inches for 30 college students. The data are shown below using y = height and x = stretched handspan.

Handspan   21.5  23.5  22.5  18    23.5  20.0  23.0  24.5  21.0  20.5  18.5  21.0  19.5  22.0  20.0
Height     68    71    73    64    68    59    73    75    65    69    64    67    67    69    62

Handspan   22.5  18.5  21.5  24.5  20.5  24.5  20.5  24.5  21.0  21.0  18.5  18.0  19.5  20.5  21.0
Height     69    64    74    73    66    74    66    74    73    69    64    67    60    75    64

1. Determine the regression line.
2. Test whether height and handspan are significantly linearly related.

EXAMPLE 7: You are a manufacturer who wants to obtain a quality measure on a product, but the procedure to obtain the measure is expensive. There is an indirect approach, which uses a different product score (Score 1) in place of the actual quality measure (Score 2). This approach is less costly but is also less precise. You can use regression to see whether Score 1 explains a significant amount of the variance in Score 2, and so determine whether Score 1 is an acceptable substitute for Score 2. The results from a simple linear regression analysis are given below:

We are concerned with testing the null hypothesis that Score 1 is not a significant predictor of Score 2 versus the alternative that Score 1 is a significant predictor of Score 2. More formally, we are testing:

$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$

Based on the results, what is the appropriate conclusion at $\alpha = 0.05$?
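Regression output of the kind referred to in Example 7 usually reports a P-value for the slope, and the conclusion follows from comparing it with $\alpha$. As a hedged illustration of where such a P-value comes from, the sketch below integrates the Student t density numerically with Simpson's rule and applies it to the slope statistic from the Example 4 output (t Stat = -1.57846 with n - 2 = 3 degrees of freedom). The function names `t_pdf` and `two_sided_p_value` are assumptions made for this sketch; statistical software would normally use a library routine instead.

```python
import math


def t_pdf(u, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + u * u / df) ** (-(df + 1) / 2)


def two_sided_p_value(t, df, steps=2000):
    """P(|T| >= |t|), using Simpson's rule on [0, |t|]; steps must be even."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2       # Simpson weights 4, 2, 4, ..., 4
        total += weight * t_pdf(i * h, df)
    area = total * h / 3                 # P(0 <= T <= |t|)
    return 1 - 2 * area                  # both tails, by symmetry


ALPHA = 0.05
t_stat = -1.57846   # slope t statistic from the Example 4 output
df = 3              # n - 2 with n = 5

p_value = two_sided_p_value(t_stat, df)
reject_h0 = p_value < ALPHA

print(f"two-sided p-value = {p_value:.4f}")
print("reject H0" if reject_h0 else "fail to reject H0")
```

The P-value approach and the critical-value approach lead to the same decision: a two-sided P-value below $\alpha$ corresponds to a t statistic beyond $\pm t_{\alpha/2,\,n-2}$.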