Simple Linear Regression PDF

Document Details

School of Engineering and Architecture

Cynthia SM Jacob

Tags

simple linear regression statistical analysis industrial engineering mathematics

Summary

This document provides lecture notes on simple linear regression, a statistical method used to model the relationship between two variables. It covers topics like empirical models, regression analysis, and examples. The notes are suitable for undergraduate students in engineering or related fields.

Full Transcript

IE013B STATISTICAL ANALYSIS FOR INDUSTRIAL ENGINEERING 2
SIMPLE LINEAR REGRESSION
Prepared by Cynthia SM Jacob, School of Engineering & Architecture

INTRODUCTION

STATISTICS
- It is the study of how to learn from data.
- It helps one to collect the right data, perform the correct analysis, and effectively present the results with statistical knowledge.
- Statistical modeling is key to making scientific discoveries, data-driven decisions, and predictions.

STATISTICAL MODEL
- involves a mathematical relationship between random and non-random variables
- helps identify relationships between variables and make predictions by applying the model to raw data

Examples of common data sets: census data, social media data, public health data

EMPIRICAL MODELS

An empirical model is one based on observed data rather than on a theoretical relationship.

Deterministic model – demonstrates an exact relationship between the variables; does not include elements of randomness

Example: $d_t = d_0 + vt$ gives the displacement of a particle from the origin after time $t$, where $d_0$ is its displacement at time $t = 0$ and $v$ is its velocity.

Probabilistic model – contains a random component that affects the relationship but is not being measured

Examples:
- Fuel mileage of a vehicle is related to its engine, but this is not the only determinant of it.
- Power consumption of a house is related to its size, but not purely determined by it.
- Meteorological models that predict the chances of weather occurrences use probabilities.

REGRESSION ANALYSIS

It is the collection of statistical tools used to model and explore relationships between variables that are related in a probabilistic manner.
Regression analysis is used to:
- forecast the value of a dependent variable (Y) from observed values of the independent variable (X)
- analyze the relationship between a dependent and an independent variable

EXAMPLE: In a chemical process, suppose the yield of the product is related to the process-operating temperature. Regression analysis can be used to:
- build a model to predict yield at a given temperature level;
- determine the optimal temperature level to maximize yield

The table shows the purity of oxygen produced ($y$) in a chemical distillation process and the percentage of hydrocarbons ($x$) present in the main condenser of the distillation unit.

SIMPLE LINEAR REGRESSION

$E(Y \mid x) = \mu_{Y|x} = \beta_0 + \beta_1 x$

Regression coefficients:
- $\beta_0$: the intercept
- $\beta_1$: the slope

Regression analysis deals with finding the best linear relationship between $Y$ and $X$.

THE SIMPLE LINEAR REGRESSION MODEL

$Y = \beta_0 + \beta_1 x + \epsilon$

This is a statistical model containing the random component $\epsilon$, which represents a random error. $\epsilon$ is a random variable with $E(\epsilon) = 0$ and $Var(\epsilon) = \sigma^2$. Since $E(\epsilon) = 0$, at a specific $x$ the $y$-values are distributed around the true regression line $\mu_{Y|x} = \beta_0 + \beta_1 x$.

[Figure: the regression line]

THE FITTED REGRESSION LINE

$\hat{y} = b_0 + b_1 x$
- an estimate of the true regression line
- $\hat{y}$ is the fitted or predicted value
- $b_0$ and $b_1$ are estimates of the regression coefficients

A residual is an error in the fit of the model $\hat{y} = b_0 + b_1 x$ and is given by

$e_i = y_i - \hat{y}_i$ for $i = 1, 2, \dots, n$

which also translates to

$y_i = b_0 + b_1 x_i + e_i$

The method of least squares is an estimation procedure that minimizes the $SSE$.
$SSE$ – residual sum of squares; sum of squares of the errors about the regression line

The method determines $b_0$ and $b_1$ so as to minimize

$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

$a$ is often used to represent $b_0$ and $b$ to represent $b_1$. The fitted line is given by $\hat{y} = a + bx$

THE METHOD OF LEAST SQUARES

Sum of Squares:

$S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}$

$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$

$S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}$

Estimating the regression coefficients:

$b = b_1 = \frac{S_{xy}}{S_{xx}}$

$a = b_0 = \bar{y} - b\bar{x}$

EXAMPLE 1: The grades of a class of 9 students on a midterm report ($x$) and on the final examination ($y$) are as follows:

x: 77 50 71 72 81 94 96 99 67
y: 82 66 78 34 47 85 99 99 68

(a) Estimate the linear regression line.
(b) Estimate the final examination grade of a student who received a grade of 85 on the midterm report.

EXAMPLE 2: A study was done to study the effect of ambient temperature (X) on the electric power consumed by a chemical plant (Y). Other factors were held constant, and the data were collected from an experimental pilot plant.

Y (Watts): 73 84 91 87 78 89 80 94
X (℃):     -3  7 22 14 -1 16  1 23

(a) Estimate the linear regression line.
(b) Predict the power consumption for an ambient temperature of 18 degrees Celsius.

EXAMPLE 3: An industrial engineer working for a manufacturing company has noticed a deviation in the accuracy of a machine after it runs for long periods without a cool down cycle. This is especially concerning because the company wants to increase production (longer machine operating times without a cool down) because of a large contract the company will start in 3-4 months. The industrial engineer decides to monitor the machining process to determine the point (hours of operation) when the machine is producing parts that could be out of tolerance.
Over the course of several months, the industrial engineer monitored the machining process to determine a relationship between hours of machine use and how many millimeters off target the machine was. The data collected are shown in tabular form (Table 1) and as a scatter plot (Figure 1).

EXAMPLE 3 (cont’n):

Table 1: Off-target measured as a function of machine use

Based on the above data, the industrial engineer would like to determine the number of hours of machine use that would produce 2 millimeters off target, because many parts would fail quality check at that point. Determine the number of hours of operation that produces 2 millimeters off target based on a least squares fit for the data.

Figure 1: Off-target as a function of machine use

PARTITIONING THE VARIATION

Total Variation is made up of two parts:

$SST = SSR + SSE$
(Total Sum of Squares = Regression Sum of Squares + Error Sum of Squares)

$SST = \sum (y_i - \bar{y})^2 = S_{yy}$

$SSR = \sum (\hat{y}_i - \bar{y})^2 = b S_{xy}$

$SSE = \sum (y_i - \hat{y}_i)^2 = SST - SSR$

TOTAL SUM OF SQUARES (SST) – measures the variation of the $y_i$ values around their mean
REGRESSION SUM OF SQUARES (SSR) – explained variation attributable to the relationship between x and y
ERROR SUM OF SQUARES (SSE) – variation attributable to factors other than the relationship between x and y

EXAMPLE 4: The following data are diastolic blood pressure (DBP) measurements taken at different times after an intervention for n = 5 persons.
For each person, the data available include the time of the measurement and the DBP level. Of interest is the relationship between these two variables. Fit a regression line.

Patient:  1  2  3  4  5
Time (x): 0  5 10 15 20
DBP (y): 72 66 70 64 66

[Figure: scatter plot of diastolic blood pressure $y$ against minutes $x$, with the fitted line $\hat{y} = 70.4 - 0.28x$]

REGRESSION t-TEST

$H_0: \beta_1 = 0$ (no linear relationship between X and Y)

The alternative is one of:
$H_1: \beta_1 \neq 0$ (a linear relationship exists between X and Y)
$H_1: \beta_1 > 0$ (a positive linear relationship exists between X and Y)
$H_1: \beta_1 < 0$ (a negative linear relationship exists between X and Y)

Test statistic:

$t = \frac{b_1 - \beta_{10}}{s / \sqrt{S_{xx}}}$

where $\beta_{10}$ is the hypothesized value of $\beta_1$, the statistic has $n - 2$ degrees of freedom, and

standard error $= s_{b_1} = \frac{s}{\sqrt{S_{xx}}} = \frac{\sqrt{SSE / (n-2)}}{\sqrt{S_{xx}}} = \sqrt{\frac{SSE}{(n-2) S_{xx}}}$

EXAMPLE 4 (cont’n)

              Coefficients  Standard Error  t Stat    P-value
Intercept     70.4          2.172556098     32.40423  6.46E-05
X Variable 1  -0.28         0.177388463     -1.57846  0.212573

Test if the two variables have a significant linear relationship, using $\alpha = 0.05$.

$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$
$\alpha = 0.05$
Critical values: $\pm t_{0.025,3} = \pm 3.182$
Critical regions: $t > 3.182$ and $t < -3.182$

$SST = S_{yy} = 22892 - \frac{338^2}{5} = 43.2$

$SSR = b S_{xy} = -0.28 \left( 3310 - \frac{(50)(338)}{5} \right) = 19.6$

$SSE = SST - SSR = 43.2 - 19.6 = 23.6$

$t = \frac{b_1 - \beta_{10}}{\sqrt{\frac{SSE}{(n-2) S_{xx}}}} = \frac{-0.28 - 0}{\sqrt{\frac{23.6}{(5-2)\left(750 - \frac{50^2}{5}\right)}}} \approx -1.5785$

Since $-3.182 < -1.5785 < 3.182$, the test statistic does not fall in a critical region.
→ Fail to reject $H_0$. There is no significant linear relationship between X and Y.

EXAMPLE 5: The data below show 30 observations on driver age and the maximum distance (feet) at which individuals can read a highway sign.

Age:       18  20  22  23  23  25  27  28  29  32  37  41  46  49  53
Distance: 510 590 560 510 460 490 560 510 460 410 420 460 450 380 460

Age:       55  63  65  66  67  68  70  71  72  73  74  75  77  79  82
Distance: 420 350 420 300 410 300 390 320 370 280 420 460 360 310 360

1. Determine the regression line.
2. Test whether age and distance are significantly linearly related.
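The least-squares fit and the t statistic worked out by hand in Example 4 can be sketched in Python using only the sum-of-squares formulas from these notes. The function name `fit_simple_linear_regression` and the dictionary keys are illustrative choices, not part of the lecture material, and the hypothesized slope is assumed to be $\beta_{10} = 0$.

```python
from math import sqrt


def fit_simple_linear_regression(x, y):
    """Least-squares fit via S_xx, S_xy, S_yy (hypothetical helper, beta_10 = 0)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)

    # Corrected sums of squares and cross-products
    s_xx = sum(xi * xi for xi in x) - sum_x ** 2 / n
    s_yy = sum(yi * yi for yi in y) - sum_y ** 2 / n
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n

    # Estimated slope and intercept
    b1 = s_xy / s_xx
    b0 = sum_y / n - b1 * sum_x / n

    # Partition of the total variation: SST = SSR + SSE
    sst = s_yy
    ssr = b1 * s_xy
    sse = sst - ssr

    # Standard error of the slope and the t statistic for H0: beta_1 = 0
    se_b1 = sqrt(sse / ((n - 2) * s_xx))
    t_stat = b1 / se_b1

    return {"b0": b0, "b1": b1, "sst": sst, "ssr": ssr, "sse": sse,
            "se_b1": se_b1, "t": t_stat, "df": n - 2}


# Example 4 data: time in minutes (x) and diastolic blood pressure (y)
minutes = [0, 5, 10, 15, 20]
dbp = [72, 66, 70, 64, 66]

result = fit_simple_linear_regression(minutes, dbp)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

For the Example 4 data this gives the same coefficients, sums of squares, and t statistic as the hand computation. The same function could be called with the Example 5 age and distance lists to carry out that exercise.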
EXAMPLE 6: Stretched handspans and heights are measured in inches for 30 college students. The data are shown below using y = height and x = stretched handspan.

Handspan: 21.5 23.5 22.5 18.0 23.5 20.0 23.0 24.5 21.0 20.5 18.5 21.0 19.5 22.0 20.0
Height:     68   71   73   64   68   59   73   75   65   69   64   67   67   69   62

Handspan: 22.5 18.5 21.5 24.5 20.5 24.5 20.5 24.5 21.0 21.0 18.5 18.0 19.5 20.5 21.0
Height:     69   64   74   73   66   74   66   74   73   69   64   67   60   75   64

1. Determine the regression line.
2. Test whether height and handspan are significantly linearly related.

EXAMPLE 7: You are a manufacturer who wants to obtain a quality measure on a product, but the procedure to obtain the measure is expensive. There is an indirect approach, which uses a different product score (Score 1) in place of the actual quality measure (Score 2). This approach is less costly but also less precise. You can use regression to see whether Score 1 explains a significant amount of the variance in Score 2, and so determine whether Score 1 is an acceptable substitute for Score 2.

The results from a simple linear regression analysis are given below:

We are interested in testing the null hypothesis that Score 1 is not a significant predictor of Score 2 against the alternative that Score 1 is a significant predictor of Score 2. More formally, we are testing:

$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$

Based on the results, what is the appropriate conclusion at $\alpha = 0.05$?
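A conclusion like the one asked for in Example 7 is usually read off the P-value column of the regression output: reject $H_0$ when the two-sided P-value is below $\alpha$. The regression output for Example 7 is not reproduced in this transcript, so the sketch below illustrates the rule with the slope statistic from Example 4 instead ($t = -1.57846$ with $n - 2 = 3$ degrees of freedom). The helper names and the use of Simpson's rule to integrate the t density are assumptions made to keep the example free of external libraries; a statistics package would normally supply the P-value directly.

```python
from math import gamma, pi, sqrt


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coeff * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_stat, df, steps=2000):
    """P(|T| > |t_stat|), integrating the density on [0, |t_stat|] with Simpson's rule."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    # The density is symmetric, so the central area is twice this integral
    return 1 - 2 * area_zero_to_t


def decide(p_value, alpha=0.05):
    """Decision rule: reject H0 only when the P-value is below alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


# Slope test from Example 4: t = -1.57846 with 3 degrees of freedom
p = two_sided_p_value(-1.57846, df=3)
print(f"p-value = {p:.4f}, decision: {decide(p)}")
```

The computed P-value agrees closely with the 0.212573 listed for X Variable 1 in the Example 4 output, and since it exceeds 0.05 the rule gives the same "fail to reject" conclusion as the critical-value approach. For Example 7, the same comparison would be made with the P-value reported for the Score 1 coefficient.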
