
UNIT 10: SIMPLE LINEAR REGRESSION AND CORRELATION

I. INTRODUCTION

In engineering, data analysis plays a vital role in interpreting and understanding the relationships between various factors. Two fundamental techniques used in this process are simple linear regression and correlation. Simple linear regression builds a mathematical model that describes the linear association between a single independent variable (predictor) and a dependent variable (response). This model allows engineers to predict the expected value of the dependent variable based on changes in the independent variable. Correlation, on the other hand, quantifies the strength and direction of this linear relationship. By analyzing the correlation coefficient, engineers can determine whether a change in one variable is accompanied by a consistent change, positive or negative, in the other. Together, these techniques provide a powerful foundation for engineers to analyze data, identify trends, and make informed decisions.

II. LEARNING OBJECTIVES

At the end of this unit, the students are expected to:
1. Gain knowledge on the development of empirical models;
2. Explain the different approaches to regression and correlation;
3. Gain knowledge on the different testing techniques; and
4. Explain the uses of the different techniques in simple linear regression and correlation.

III. CONTENTS

A. Empirical Models

Empirical models, sometimes called statistical models, are built on the foundation of observation rather than established theory. They identify patterns and relationships within data, allowing researchers to make predictions about future events. Many problems in engineering and the sciences involve the study or analysis of the relationship between two or more variables that are related in a non-deterministic manner. For example:

X-VARIABLE            Y-VARIABLE
Size of house         Energy consumption
Weight of vehicle     Fuel usage of an automobile
Age of concrete       Compressive strength of concrete

In each of these examples, the value of the response of interest y cannot be predicted solely from the corresponding x. To explore this kind of relationship between variables that are related in a non-deterministic manner, regression analysis is used. For instance, in a chemical process, suppose that the yield of the product is related to the process-operating temperature. Regression analysis can be used to build a model to predict yield at a given temperature level.

Table 11.1. Oxygen Purity and Hydrocarbon Levels

Observation Number    Hydrocarbon Level x (%)    Purity y (%)
1                     0.99                       90.01
2                     1.02                       89.05
3                     1.15                       91.43
4                     1.29                       93.74
5                     1.46                       96.73
6                     1.36                       94.45
7                     0.87                       87.59
8                     1.23                       91.77
9                     1.55                       99.42
10                    1.40                       93.65
11                    1.19                       93.54
12                    1.15                       92.52
13                    0.98                       90.56
14                    1.01                       89.54
15                    1.11                       89.85
16                    1.20                       90.39
17                    1.26                       93.25
18                    1.32                       93.41
19                    1.43                       94.98
20                    0.95                       87.33

By inspecting the scatter diagram of these data, although no simple curve passes exactly through all the points, there is a strong indication that the points lie scattered randomly around a straight line. Therefore, it is reasonable to assume that the mean of the random variable Y is related to x by the straight-line relationship

E(Y|x) = μ_{Y|x} = β0 + β1x

where:
β1 = slope (coefficient)
x  = independent variable
β0 = constant/intercept

The slope and intercept of the line are called regression coefficients. Moreover, while the mean of Y is a linear function of x, the actual observed value Y does not fall exactly on a straight line.
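For readers who want to reproduce the scatter diagram described above, a minimal Python sketch (assuming matplotlib is installed; the variable names are illustrative) plots the Table 11.1 data:

import matplotlib.pyplot as plt

# Hydrocarbon level x (%) and oxygen purity y (%) from Table 11.1
x = [0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
     1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95]
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
     93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33]

plt.scatter(x, y)                       # the points scatter around a straight line
plt.xlabel("Hydrocarbon level x (%)")
plt.ylabel("Oxygen purity y (%)")
plt.title("Scatter diagram of Table 11.1")
plt.show()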
To generalize the straight-line mean relationship above to a probabilistic linear model, assume that the mean of Y is a linear function of x, but that for a fixed value of x the actual value of Y is determined by the simple linear regression model:

Y = β0 + β1x + ε

where:
β1 = slope (coefficient)
x  = independent variable
β0 = constant/intercept
ε  = random error term

A regression model of this kind may come from a theoretical relationship, or it may be chosen as an empirical model by inspecting a scatter diagram when there is no theoretical knowledge of the relationship between x and y. To understand this better, suppose there is a specific value of x. Even though x is fixed, the equation includes a random element ε; therefore, for a fixed x, the random component on the right side of the equation dictates the possible values of Y.

⮚ If the mean and variance of ε are 0 and σ², respectively, then the mean of Y given x is
E(Y|x) = E(β0 + β1x + ε) = β0 + β1x + E(ε) = β0 + β1x

⮚ The variance of Y given x is
V(Y|x) = V(β0 + β1x + ε) = V(β0 + β1x) + V(ε) = 0 + σ² = σ²

To conclude, the true regression model is a line of mean values, where:
- the height of the regression line at any x is the expected value of Y for that x;
- the slope β1 can be interpreted as the change in the mean of Y for a unit change in x;
- the variability of Y at a given x is determined by the error variance σ², and the Y-values have the same distribution and variance at each x value.

B. REGRESSION: Modeling Linear Relations - The Least Squares Approach

Regression analysis is the statistical term for the models that estimate the relationships among variables. It helps the researcher understand how the value of the dependent variable changes with an independent variable when the other independent variables are held fixed. Among all regression techniques, linear regression is the most popular form; it attempts to show the relationship between the independent (or explanatory) variable and the dependent variable by fitting a linear equation to observed data.

The most common method for fitting a regression line is least-squares regression, introduced by the German mathematician Gauss. It calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations (or residuals) from each data point to the line (if a point lies on the fitted line exactly, its vertical deviation is 0). Because the deviations are first squared and then summed, there are no cancellations between positive and negative values.

Algebraically, the fitted line has the form y = mx + b, where m is the slope of the line, x is the independent variable, and b is the y-intercept. To find the values of m and b, the following formulas are used:

For the slope:        m = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²)
For the y-intercept:  b = (Σy − mΣx) / n

where, for both formulas, n is the total number of data points, x is the independent variable, and y is the dependent variable.

In statistics, the model underlying the fitted line or least squares line is expressed as

yᵢ = β0 + β1xᵢ + εᵢ

where β0 is the y-intercept, β1 is the slope of the line, and εᵢ is the error with mean zero and (unknown) variance σ². As in the algebraic form, x represents the independent variable. The estimates β̂0 and β̂1 are the point estimates of β0 and β1, and they are called the least squares estimates: they minimize the sum of the squared vertical deviations.
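These slope and intercept formulas translate directly into a short Python sketch (plain Python, no external libraries; the data arrays used here are those of the first worked example below):

def least_squares(x, y):
    """Return slope m and intercept b of the least-squares line y = m*x + b."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # slope
    b = (sum_y - m * sum_x) / n                                    # y-intercept
    return m, b

# Data of the first worked example that follows:
x = [1, 2, 3, 4, 5, 6, 7]
y = [1.5, 3.8, 6.7, 9.0, 11.2, 13.6, 16]
m, b = least_squares(x, y)
print(m, b)    # approximately 2.414 and -0.829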
The sum of the squares of the deviations of the observations from the true regression line is

L = Σᵢ εᵢ² = Σᵢ [yᵢ − (β0 + β1xᵢ)]²

The least squares estimate of the slope β1 of the true regression line is

β̂1 = Sxy / Sxx = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

Computing formulas for the numerator and denominator of β̂1 are

Sxy = Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n    and    Sxx = Σxᵢ² − (Σxᵢ)²/n

The least squares estimate of the intercept β0 of the true regression line is

β̂0 = (Σyᵢ − β̂1Σxᵢ)/n = ȳ − β̂1x̄

In computing β̂0, carry extra digits in β̂1 because, if x̄ is large in magnitude, rounding will affect the final answer.

Example: Write the linear equation that best fits the data in the table shown below.

x    1     2     3     4     5     6     7
y    1.5   3.8   6.7   9.0   11.2  13.6  16

Let n = 7. Creating a table to find the values of xy and x²:

x        y        xy        x²
1        1.5      1.5       1
2        3.8      7.6       4
3        6.7      20.1      9
4        9.0      36        16
5        11.2     56        25
6        13.6     81.6      36
7        16       112       49
Σx = 28  Σy = 61.8  Σxy = 314.8  Σx² = 140

The values Σx = 28, Σy = 61.8, Σxy = 314.8, and Σx² = 140 are found. Finding the values of the slope and y-intercept:

For the slope:
m = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²)
m = [7(314.8) − (28)(61.8)] / [7(140) − (28)²]
m = 473.2 / 196
m = 2.4142857

For the y-intercept:
b = (Σy − mΣx) / n
b = [61.8 − (2.4142857)(28)] / 7
b = −0.828571

Substituting these values into the algebraic equation y = mx + b, the equation of the estimated regression line is y = 2.41x − 0.83.

Example: The cetane number is a critical property in specifying the ignition quality of a fuel used in a diesel engine. Determination of this number for a biodiesel fuel is expensive and time-consuming. The article "Relating the Cetane Number of Biodiesel Fuels to Their Fatty Acid Composition: A Critical Study" (J. of Automobile Engr., 2009: 565-583) included the following data on iodine value x and cetane number y for a sample of 14 biofuels. The iodine value is the amount of iodine necessary to saturate a sample of 100 g of oil. The article's authors fit the simple linear regression model to these data, so following their lead:

x   132.0  129.0  120.0  113.2  105.0  92.0  84.0  83.2  88.4  59.0  80.0  81.5  71.0  69.2
y   46.0   48.0   51.0   52.1   54.0   52.0  59.0  58.7  61.6  64.0  61.4  54.6  58.8  58.0

Let n = 14. In finding the regression line, creating a table helps us solve for xy, x², and y². Calculating the sums of each column:

x        y       xy         x²           y²
132.0    46.0    6072       17424        2116
129.0    48.0    6192       16641        2304
120.0    51.0    6120       14400        2601
113.2    52.1    5897.72    12814.24     2714.41
105.0    54.0    5670       11025        2916
92.0     52.0    4784       8464         2704
84.0     59.0    4956       7056         3481
83.2     58.7    4883.84    6922.24      3445.69
88.4     61.6    5445.44    7814.56      3794.56
59.0     64.0    3776       3481         4096
80.0     61.4    4912       6400         3769.96
81.5     54.6    4449.9     6642.25      2981.16
71.0     58.8    4174.8     5041         3457.44
69.2     58.0    4013.6     4788.64      3364
Total    1307.5  779.2      71,347.3     128,913.93   43,745.22

To solve for the estimated slope of the true regression line, β̂1 = Sxy/Sxx, first find the values of Sxx and Sxy:

Sxx = Σxᵢ² − (Σxᵢ)²/n = 128,913.93 − (1307.5)²/14 = 6802.7693
Sxy = Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n = 71,347.3 − (1307.5)(779.2)/14 = −1424.41429

Substituting these values into the equation of the estimated slope of the true regression line:

β̂1 = Sxy/Sxx = −1424.41429 / 6802.7693 = −0.20938742

The data suggest that the expected change in true average cetane number associated with a 1 g increase in iodine value is −0.209, that is, a decrease of about 0.209. Since x̄ = 93.392857 and ȳ = 55.657143, the intercept of the true regression line is estimated as

β̂0 = ȳ − β̂1x̄ = 55.657143 − (−0.20938742)(93.392857) = 75.212432

Therefore, the equation of the estimated regression line is ŷ = 75.212 − 0.2094x. A scatter plot of the data with the least squares line illustrates this negative linear relationship.
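The cetane-number fit can be checked with the Sxx and Sxy computing formulas in a few lines of Python (plain Python; the printed values should agree with β̂1 ≈ −0.2094 and β̂0 ≈ 75.21 up to rounding):

x = [132.0, 129.0, 120.0, 113.2, 105.0, 92.0, 84.0,
     83.2, 88.4, 59.0, 80.0, 81.5, 71.0, 69.2]            # iodine value
y = [46.0, 48.0, 51.0, 52.1, 54.0, 52.0, 59.0,
     58.7, 61.6, 64.0, 61.4, 54.6, 58.8, 58.0]            # cetane number
n = len(x)

Sxx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n                    # ~6802.77
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n    # ~-1424.41

b1 = Sxy / Sxx                        # estimated slope, ~-0.2094
b0 = sum(y) / n - b1 * sum(x) / n     # estimated intercept, ~75.21
print(b1, b0)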
C. CORRELATION: ESTIMATING THE STRENGTH OF LINEAR RELATION

In the context of regression analysis, correlation measures the strength and direction of the linear relationship between two variables. Here we focus on the correlation between two random variables X and Y and how it is quantified using the correlation coefficient.

Concept of Correlation
Correlation is a statistical technique used to determine the degree to which two variables move in relation to each other. Unlike regression analysis, which predicts the value of a dependent variable based on the value of an independent variable, correlation quantifies the relationship between the variables without implying causation.

Mathematical Formulation
The population correlation coefficient ρ measures the linear association between two variables and is defined as

ρ = σ_XY / (σ_X σ_Y)

where σ_XY is the covariance between X and Y, and σ_X and σ_Y are the standard deviations of X and Y.

The sample correlation coefficient r (Pearson's r) estimates ρ and is given by

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[Σ(xᵢ − x̄)² Σ(yᵢ − ȳ)²]

Interpretation of the Correlation Coefficient
r = +1: perfect positive linear relationship.
r = −1: perfect negative linear relationship.
r = 0: no linear relationship.
Values of r close to +1 or −1 indicate a strong linear relationship, while values close to 0 indicate a weak or no linear relationship.

3 TYPES OF CORRELATION SCATTER DIAGRAMS
Zero Correlation - If two variables X and Y have no correlation, then ρ = 0. This means that the two variables have no linear relationship at all.
Positive Correlation - Two variables X and Y have a positive correlation if high values of X tend to occur with high values of Y, and low values of X tend to occur with low values of Y.
Negative Correlation - Two variables X and Y have a negative correlation if high values of X tend to occur with low values of Y, and low values of X tend to occur with high values of Y.

EXAMPLE: Analyze the correlation between the compressive strength of concrete (Y) and the curing time (X) using Pearson's r.

Find the means of variables X and Y:
x̄ = (3 + 7 + 14 + 21 + 28 + 35 + 42 + 49 + 56 + 63)/10 = 31.8
ȳ = (12.5 + 18.7 + 25.4 + 32.1 + 38.0 + 42.3 + 46.7 + 50.5 + 53.8 + 56.2)/10 = 37.62

Plug in the values and solve for r:
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[Σ(xᵢ − x̄)² Σ(yᵢ − ȳ)²]
r = 2766.7 / √[(3861.6)(2046.2)]
r = 2766.7 / 2810.9
r = 0.98

CONCLUSION: The Pearson correlation coefficient r = 0.98 indicates a very strong positive linear relationship between the curing time and the compressive strength of concrete. This implies that as the curing time increases, the compressive strength of concrete also increases significantly. The scatter diagram of these data shows a positive linear pattern.
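The same calculation can be sketched in Python (plain Python; the last curing time is taken as 63 days, consistent with the stated mean of 31.8):

import math

# Curing time in days (last value taken as 63, consistent with the mean of 31.8)
x = [3, 7, 14, 21, 28, 35, 42, 49, 56, 63]
# Compressive strength of concrete
y = [12.5, 18.7, 25.4, 32.1, 38.0, 42.3, 46.7, 50.5, 53.8, 56.2]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

Sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
Sxx = sum((xi - x_bar) ** 2 for xi in x)
Syy = sum((yi - y_bar) ** 2 for yi in y)

r = Sxy / math.sqrt(Sxx * Syy)
print(round(r, 2))    # ~0.98, a very strong positive linear relationship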
D. HYPOTHESIS TESTS IN SIMPLE LINEAR REGRESSION

When examining the relationship between a quantitative outcome (response) variable and a single quantitative predictor variable, simple linear regression is the most commonly used analytical method. A key aspect of evaluating the adequacy of a linear regression model involves testing statistical hypotheses about the model parameters and constructing appropriate confidence intervals. In simple linear regression, this is equivalent to asking "Are X and Y correlated?"

To test hypotheses about the slope and intercept of the regression model, we must make the additional assumption that the error component in the model, ε, is normally distributed. Thus, the complete assumptions are that the errors are normally and independently distributed with mean zero and variance σ², abbreviated NID(0, σ²).

Variables' Roles
Dependent Variable
- This is the variable whose values we want to explain or forecast.
- Its values depend on something else.
- We denote it as Y.
Independent Variable
- This is the variable that explains the other one.
- Its values are independent.
- We denote it as X.

The mathematical formula of the linear regression can be written as Y = β0 + β1X + e, where:
- β0 and β1 are known as the regression beta coefficients or parameters:
  ○ β0 (intercept): indicates the starting value of Y when X is zero.
  ○ β1 (slope): indicates how much Y changes for a one-unit change in X.
- e is the error term (also known as the residual error), the part of Y that cannot be explained by the regression line.

In reviewing the model Y = β0 + β1X + e, as long as the slope β1 has any non-zero value, X will add value in helping predict the expected value of Y. However, if there is no correlation between X and Y, the value of the slope β1 will be zero.

Example:
1. Suppose you are a social researcher interested in the relationship between income and happiness. You survey 500 people whose incomes range from 15k to 75k and ask them to rank their happiness on a scale from 1 to 10. Your independent variable X (income) and dependent variable Y (happiness) are both quantitative, so you can do a regression analysis to see if there is a linear relationship between them.

Null Hypothesis (H0): There is no linear relationship between income and happiness; β1 = 0 (income does not predict happiness).
Alternative Hypothesis (HA): There is a linear relationship between income and happiness; β1 ≠ 0 (income predicts happiness).

Here, β1 represents the slope of the regression line. If β1 = 0, changes in income do not predict changes in happiness. If β1 ≠ 0, changes in income are associated with changes in happiness, indicating a significant linear relationship.

I. USE OF T-TESTS

In linear regression, the t-test is a statistical hypothesis testing technique used to test hypotheses about the linearity of the relationship between the response variable and the predictor variables. The t-test is performed to assess the significance of regression coefficients in the linear regression model. A statistically significant variable has a strong relationship with the dependent variable and contributes significantly to the model's accuracy.

Suppose we wish to test the hypothesis that the slope equals a constant, say β1,0. The appropriate hypotheses are

H0: β1 = β1,0
H1: β1 ≠ β1,0

The test statistic

T0 = (β̂1 − β1,0) / se(β̂1) = (β̂1 − β1,0) / √(σ̂²/Sxx)

follows the t distribution with n − 2 degrees of freedom under H0: β1 = β1,0. We reject H0: β1 = β1,0 if

|t0| > t_{α/2, n−2}

A similar procedure can be used to test hypotheses about the intercept.
To test

H0: β0 = β0,0
H1: β0 ≠ β0,0

the statistic to be used is

T0 = (β̂0 − β0,0) / se(β̂0) = (β̂0 − β0,0) / √(σ̂²[1/n + x̄²/Sxx])

and we reject the null hypothesis if the computed value of this test statistic, t0, is such that |t0| > t_{α/2, n−2}.

A very important special case of the hypotheses about the slope is

H0: β1 = 0
H1: β1 ≠ 0

These hypotheses relate to the significance of regression. Failure to reject H0: β1 = 0 is equivalent to concluding that there is no linear relationship between x and Y. This may imply either that x is of little value in explaining the variation in Y and that the best estimator of Y for any x is ŷ = ȳ, or that the true relationship between x and Y is not linear. Alternatively, if H0: β1 = 0 is rejected, this implies that x is of value in explaining the variability in Y. Rejecting H0: β1 = 0 could mean either that the straight-line model is adequate or that, although there is a linear effect of x, better results could be obtained by adding higher-order polynomial terms in x.

Example: Consider the data in Table 11.1, where y is the purity of oxygen produced in a chemical distillation process and x is the percentage of hydrocarbons present in the main condenser of the distillation unit. Test for significance of regression using the t-test with n − 2 degrees of freedom, at the 5% level of significance.

Testing the slope:
H0: β1 = 0
H1: β1 ≠ 0

Decision rule: reject H0 if |t0| > t_{α/2, n−2} at the 5% level of significance.

The test statistic is

T0 = (β̂1 − β1,0) / √(σ̂²/Sxx)

where:
β̂1 = Sxy/Sxx
σ̂² = SSE/(n − 2), the estimator of the variance
SSE = Σyᵢ² − β̂0Σyᵢ − β̂1Σxᵢyᵢ, the error sum of squares
Sxx = Σxᵢ² − (Σxᵢ)²/n
β̂0 = ȳ − β̂1x̄

Using these formulas to compute t0, with

n = 20, x̄ = 1.1960, ȳ = 92.1605,
Σxᵢ = 23.92, Σyᵢ = 1,843.21, Σxᵢ² = 29.2892, Σxᵢyᵢ = 2,214.6566, Σyᵢ² = 170,044.532:

1.) Sxx = Σxᵢ² − (Σxᵢ)²/n = 29.2892 − (23.92)²/20 = 0.68088
2.) Sxy = Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n = 2,214.6566 − (23.92)(1,843.21)/20 = 10.17744
3.) β̂1 = Sxy/Sxx = 10.17744/0.68088 = 14.94748
4.) β̂0 = ȳ − β̂1x̄ = 92.1605 − (14.94748)(1.196) = 74.28331
5.) SSE = Σyᵢ² − β̂0Σyᵢ − β̂1Σxᵢyᵢ = 170,044.532 − (74.28331)(1,843.21) − (14.94748)(2,214.6566) = 21.25704
6.) σ̂² = SSE/(n − 2) = 21.25704/(20 − 2) = 1.18

so the t-statistic becomes

t0 = (β̂1 − β1,0) / √(σ̂²/Sxx) = (14.94748 − 0) / √(1.18/0.68088) = 11.35

CONCLUSION: Since the reference value is t_{0.025,18} = 2.101, the value of the test statistic is very far into the critical region, implying that H0: β1 = 0 should be rejected. Rejecting the null hypothesis implies that x is of value in explaining the variability in y.
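The whole slope test can be reproduced from the raw data of Table 11.1 with a short Python sketch (plain Python; the critical value 2.101 is the tabled t_{0.025,18} quoted above):

import math

x = [0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
     1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95]
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
     93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33]
n = len(x)

Sxx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
b1 = Sxy / Sxx                               # ~14.947
b0 = sum(y) / n - b1 * sum(x) / n            # ~74.283

SSE = (sum(yi ** 2 for yi in y) - b0 * sum(y)
       - b1 * sum(xi * yi for xi, yi in zip(x, y)))
sigma2 = SSE / (n - 2)                       # ~1.18, estimator of the variance

t0 = (b1 - 0) / math.sqrt(sigma2 / Sxx)      # ~11.35
print(t0)
# Compare |t0| with t_{0.025,18} = 2.101 from a t table: H0: beta1 = 0 is rejected.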
II. ANALYSIS OF VARIANCE (ANOVA) APPROACH TO TEST SIGNIFICANCE OF REGRESSION

Analysis of Variance (ANOVA) is a statistical procedure used to compare variances across the means (or averages) of different groups, and it is used in a range of scenarios to determine whether there is any difference between those means. ANOVA is designed to evaluate whether or not a null hypothesis can be rejected when testing hypotheses; it can test whether the means of three or more groups are equal, as well as the overall effect of a categorical predictor, or the interaction effect between two or more categorical predictors, on the outcome.

For regression, the analysis of variance identity is written as

Σᵢ (yᵢ − ȳ)² = Σᵢ (ŷᵢ − ȳ)² + Σᵢ (yᵢ − ŷᵢ)²

This equation may also be written as

SST = SSR + SSE

where:
SST = total corrected sum of squares
SSR = regression sum of squares
SSE = error sum of squares

For simple linear regression, the regression mean square (MSR) and the mean square error (MSE) are just SSR and SSE divided by their respective degrees of freedom:

MSR = SSR/1
MSE = SSE/(n − 2)

F-Test
An F-test is used to determine the effectiveness of the independent variable in explaining the variation of the dependent variable. The F statistic is the ratio of the regression mean square to the mean square error:

f0 = (SSR/1) / [SSE/(n − 2)] = MSR/MSE

It provides a statistic for testing H1: β1 ≠ 0 against the null hypothesis H0: β1 = 0. If the null hypothesis H0: β1 = 0 is true, the statistic follows the F(1, n−2) distribution, and we reject H0 if f0 > f_{α, 1, n−2}.

Results of the ANOVA procedure are set out in an ANOVA table:

Source of Variation    Sum of Squares        Degrees of Freedom    Mean Square    F0
Regression             SSR = β̂1·Sxy          1                     MSR            MSR/MSE
Error                  SSE = SST − β̂1·Sxy    n − 2                 MSE
Total                  SST                   n − 1

Note: MSE = σ̂².

Example: We will use the analysis of variance approach to test for significance of regression using the oxygen purity data from the earlier example. Recall that SST = 173.38, β̂1 = 14.947, Sxy = 10.17744, and n = 20. The regression sum of squares is

SSR = β̂1·Sxy = (14.947)(10.17744) = 152.13

and the error sum of squares is

SSE = SST − SSR = 173.38 − 152.13 = 21.25

The analysis of variance for testing H0: β1 = 0 is summarized in the ANOVA table (Table 11-1 of the output). The test statistic is f0 = MSR/MSE = 152.13/1.18 = 128.86, for which the P-value is P = 1.23 × 10⁻⁹, so we conclude that β1 is not zero.
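A minimal Python sketch of the ANOVA computation, reusing the quantities from the t-test example (the F critical value in the comment is taken from an F table):

# ANOVA for significance of regression on the oxygen purity data
SST = 173.38
SSR = 14.947 * 10.17744       # beta1_hat * Sxy, ~152.1
SSE = SST - SSR               # ~21.25
n = 20

MSR = SSR / 1                 # regression mean square
MSE = SSE / (n - 2)           # mean square error, ~1.18
f0 = MSR / MSE                # ~128.8
print(f0)
# Reject H0: beta1 = 0 if f0 > f_{0.05,1,18} ~ 4.41 (from an F table)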
E. PREDICTION OF NEW OBSERVATIONS

Linear regression models are commonly used to forecast and predict new observations based on the underlying modeled process. This entails developing a model that describes the linear relationship between a dependent variable Y (a future observation) and one independent variable X (the level of the regressor variable). The point estimator of the new or future value of the response Y0 is

ŷ0 = β̂0 + β̂1x0

where:
ŷ0 = the predicted value of the dependent variable or future observation
β̂0 = the constant or intercept
β̂1 = the slope or coefficient
x0 = the independent variable or level of the regressor variable

The error in prediction, defined as the difference between the measured and the predicted value, is

ep = Y0 − (β̂0 + β̂1x0)    or    ep = Y0 − ŷ0

where ep is the error in prediction, Y0 is the measured value, and ŷ0 is the predicted value.

The prediction error has mean zero, and its variance is

V(ep) = V(Y0 − ŷ0)
V(ep) = V(Y0) + V(β̂0 + β̂1x0)
V(ep) = σ² + σ²[1/n + (x0 − x̄)²/Sxx]
V(ep) = σ²[1 + 1/n + (x0 − x̄)²/Sxx]

because the future observation Y0 is independent of ŷ0. Using σ̂² to estimate σ², we can show that the standardized variable

T = (Y0 − ŷ0) / √(σ̂²[1 + 1/n + (x0 − x̄)²/Sxx])

has a t distribution with n − 2 degrees of freedom.

Prediction Interval
The prediction interval is of minimum width at x0 = x̄ and widens as |x0 − x̄| increases. It depends on both the error from the fitted model and the error associated with the future observation.

A 100(1 − α)% prediction interval on a future observation Y0 at the value x0 is given by

ŷ0 − t_{α/2, n−2} √(σ̂²[1 + 1/n + (x0 − x̄)²/Sxx]) ≤ Y0 ≤ ŷ0 + t_{α/2, n−2} √(σ̂²[1 + 1/n + (x0 − x̄)²/Sxx])

where ŷ0 = β̂0 + β̂1x0 is computed from the regression model and σ̂² is the estimate of σ².

Note (additional formulas):
Sxx = Σxᵢ² − (Σxᵢ)²/n
Sxy = Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n
β̂1 = Sxy/Sxx
β̂0 = ȳ − β̂1x̄
σ̂² = SSE/(n − 2)

Example 1. To illustrate the construction of a prediction interval, suppose we use the oxygen purity and hydrocarbon level data of Table 11.1 and find a 95% prediction interval on the next observation of oxygen purity at x0 = 1.00%.

From the data:
Σxᵢ = 23.92, Σyᵢ = 1,843.21, Σxᵢ² = 29.2892, n = 20, x̄ = 1.1960, ȳ = 92.1605

Sxx = Σxᵢ² − (Σxᵢ)²/n = 29.2892 − (23.92)²/20 = 0.68088
Sxy = Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n = 2,214.6566 − (23.92)(1,843.21)/20 = 10.17744
σ̂² = SSE/(n − 2) = 21.25/(20 − 2) = 1.18
β̂1 = Sxy/Sxx = 10.17744/0.68088 = 14.94748
β̂0 = ȳ − β̂1x̄ = 92.1605 − (14.94748)(1.196) = 74.28331

ŷ0 = β̂0 + β̂1x0 = 74.283 + 14.947x0, and at x0 = 1.00%:
ŷ0 = 74.283 + 14.947(1.00) = 89.23
t_{α/2, n−2} = t_{0.025,18} = 2.101

Substituting all of the values into the formula:

89.23 − 2.101 √(1.18[1 + 1/20 + (1.00 − 1.1960)²/0.68088]) ≤ Y0 ≤ 89.23 + 2.101 √(1.18[1 + 1/20 + (1.00 − 1.1960)²/0.68088])

The prediction interval is

86.83 ≤ Y0 ≤ 91.63
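The interval can be computed directly from the quantities above; a minimal Python sketch (plain Python; the inputs are the rounded values from this example) is:

import math

# 95% prediction interval for oxygen purity at x0 = 1.00%
n, x_bar = 20, 1.1960
Sxx, sigma2 = 0.68088, 1.18
b0, b1 = 74.283, 14.947
t_crit = 2.101                       # t_{0.025,18} from a t table

x0 = 1.00
y0_hat = b0 + b1 * x0                # point prediction, ~89.23
half_width = t_crit * math.sqrt(sigma2 * (1 + 1 / n + (x0 - x_bar) ** 2 / Sxx))
print(y0_hat - half_width, y0_hat + half_width)    # ~86.83 to ~91.63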
F. ADEQUACY OF REGRESSION MODELS

Assessing a regression model and analyzing its residuals focuses on the essential steps and techniques used to determine the validity and reliability of the model. The coefficient of determination R², discussed below, is a critical measure used to judge the adequacy of a regression model: it represents the proportion of the variance in the dependent variable that is predictable from the independent variables. For instance, an R² value of 0.877 in the oxygen purity regression model implies that 87.7% of the variability in the data is accounted for by the model.

Assumptions in Regression
I. Linearity: the relationship between the dependent and independent variables should be linear.
II. Independence: the residuals should be independent.
III. Homoscedasticity: the residuals should have constant variance.
IV. Normality: the residuals should be normally distributed.

I. RESIDUAL ANALYSIS

Residual analysis involves examining the differences between observed and predicted values to assess the model's assumptions and detect potential issues. Residuals should ideally be randomly distributed with a mean of zero and constant variance. The residuals from a regression model are

eᵢ = yᵢ − ŷᵢ

where yᵢ is an actual observation and ŷᵢ is the corresponding fitted value from the regression model. Analysis of the residuals is frequently helpful in checking the assumption that the errors are approximately normally distributed with constant variance, and in determining whether additional terms in the model would be useful. We may also standardize the residuals by computing

dᵢ = eᵢ / √σ̂²

If the errors are normally distributed, approximately 95% of the standardized residuals should fall in the interval (−2, +2). Residuals that are far outside this interval may indicate the presence of an outlier, that is, an observation that is not typical of the rest of the data. Various rules have been proposed for discarding outliers; however, outliers sometimes provide important information and should not be discarded without careful consideration.

EXAMPLE (oxygen purity data):

Observation    Hydrocarbon Level, x    Oxygen Purity, y    Predicted Value, ŷ    Residual, e = y − ŷ
1              0.99                    90.01               89.081                0.929
2              1.02                    89.05               89.530                -0.480
3              1.15                    91.43               91.473                -0.043
4              1.29                    93.74               93.566                0.174
5              1.46                    96.73               96.107                0.623
6              1.36                    94.45               94.612                -0.162
7              0.87                    87.59               87.288                0.302
8              1.23                    91.77               92.669                -0.899
9              1.55                    99.42               97.452                1.968
10             1.40                    93.65               95.210                -1.560
11             1.19                    93.54               92.071                1.469
12             1.15                    92.52               91.473                1.047
13             0.98                    90.56               88.932                1.628
14             1.01                    89.54               89.380                0.160
15             1.11                    89.85               90.875                -1.025
16             1.20                    90.39               92.220                -1.830
17             1.26                    93.25               93.117                0.133
18             1.32                    93.41               94.014                -0.604
19             1.43                    94.98               95.658                -0.678
20             0.95                    87.33               88.483                -1.153

The adequacy of a regression model can be effectively judged using the coefficient of determination, residual analysis, and plots. It is crucial to verify the assumptions of normality, independence, linearity, and homoscedasticity to ensure the reliability of the model. By examining residuals through various plots and statistical tests, one can identify potential violations of these assumptions and make necessary adjustments to improve the model.

II. COEFFICIENT OF DETERMINATION

The coefficient of determination is a key statistic in regression analysis that quantifies how well the regression model explains the variability of the dependent variable: the proportion of the variance in the dependent variable that is predictable from the independent variable(s). It is denoted R² (or r²) and ranges from 0 to 1.

Interpretation

R² Value            Interpretation
0                   The model explains none of the variability in the response data.
0 < R² ≤ 0.1        The model explains a very small portion of the variability.
0.1 < R² ≤ 0.3      The model explains a small portion of the variability.
0.3 < R² ≤ 0.5      The model explains a moderate portion of the variability.
0.5 < R² ≤ 0.7      The model explains a good portion of the variability.
0.7 < R² ≤ 0.9      The model explains a large portion of the variability.
0.9 < R² < 1        The model explains nearly all the variability.
1                   The model perfectly explains all the variability in the response data (usually a sign of overfitting with real-world data).

More technically, R² is a measure of goodness of fit: it evaluates the scatter of the data points around the fitted regression line. Graphing your linear regression data usually gives a good clue as to whether its R² is high or low. Consider three different scatter plots of bivariate data in which the y-values vary significantly, indicating high variability. In the first plot (a), all points lie exactly on a straight line, meaning 100% of the variation in y is explained by its linear relationship with x. In the second plot (b), the points do not fall exactly on a line, but the deviations from the line are small compared to the overall y variability; this suggests that much of the y variation can be explained by the approximate linear relationship between x and y. In the third plot (c), there is a lot of variation around the least squares line relative to the overall y variation, indicating that the simple linear regression model does not adequately explain the variation in y in relation to x.
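A short Python sketch of the residual check, reusing the fitted oxygen purity model (the coefficients are the rounded values from the earlier examples):

import math

b0, b1, sigma2 = 74.283, 14.947, 1.18

x = [0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
     1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95]
y = [90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
     93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33]

for xi, yi in zip(x, y):
    e = yi - (b0 + b1 * xi)            # residual, observed minus fitted
    d = e / math.sqrt(sigma2)          # standardized residual
    flag = "  <-- check as a possible outlier" if abs(d) > 2 else ""
    print(round(e, 3), round(d, 2), flag)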
Calculating the Coefficient of Determination

Choose between two formulas to calculate the coefficient of determination (R²) of a simple linear regression. The first formula is specific to simple linear regression, and the second can be used to calculate the R² of many types of statistical models.

Formula 1: Using the correlation coefficient

R² = (r)²

where r = Pearson correlation coefficient.

Sample Problem 1: You are studying the relationship between the number of construction workers who regularly work on site and the amount of time needed to finish constructing a new commercial building, and you find that the two variables have a negative Pearson correlation, r = −0.35.

Given: r = −0.35
Formula: R² = (r)²
Solution: R² = (−0.35)² = 0.1225

Therefore, the linear regression model explains a small portion of the variability.

Formula 2: Using the regression outputs

R² = 1 − SSE/SST = (SST − SSE)/SST = SSR/SST

where:
SSE = Error Sum of Squares
SST = Total Sum of Squares
SSR = Regression Sum of Squares

Error Sum of Squares
The Error Sum of Squares (SSE), also known as the Residual Sum of Squares (RSS), is a measure of the discrepancy between the data and an estimation model. In regression analysis, it quantifies the difference between the observed values and the values predicted by the model. A smaller SSE indicates a better fit of the model to the data.

Formula: For a set of observations yᵢ and their corresponding predicted values ŷᵢ, the SSE is calculated as

SSE = Σᵢ (yᵢ − ŷᵢ)²

where yᵢ are the observed values, ŷᵢ are the predicted values from the model, and n is the number of observations.

Calculation Steps
1. Collect Data: gather the observed values yᵢ and the corresponding predicted values ŷᵢ.
2. Compute Residuals: calculate the residual (yᵢ − ŷᵢ) for each observation.
3. Square the Residuals: square each residual to get (yᵢ − ŷᵢ)².
4. Sum the Squared Residuals: sum all the squared residuals to get the SSE.

Total Sum of Squares
The Total Sum of Squares (SST) is a measure of the total variability in the observed data. It quantifies how much the observed values deviate from their mean.

Formula:

SST = Σᵢ (yᵢ − ȳ)²

where yᵢ are the observed values, ȳ is the mean of the observed values, and n is the number of observations.

Calculation Steps
1. Calculate the Mean of the Observed Values.
2. Compute Deviations from the Mean: for each observation, calculate (yᵢ − ȳ).
3. Square the Deviations: square each deviation to get (yᵢ − ȳ)².
4. Sum the Squared Deviations: sum all the squared deviations to get the SST.

Regression Sum of Squares
The Regression Sum of Squares (SSR), also known as the Explained Sum of Squares (ESS), measures the variability in the observed data that is explained by the regression model. It quantifies how much the predicted values ŷᵢ deviate from the mean of the observed values ȳ.

Formula:

SSR = Σᵢ (ŷᵢ − ȳ)² = SST − SSE

where ŷᵢ are the predicted values, ȳ is the mean of the observed values, and n is the number of observations.

Calculation Steps
1. Calculate the Mean of the Observed Values ȳ.
2. Compute Deviations of Predicted Values from the Mean: for each predicted value, calculate (ŷᵢ − ȳ).
3. Square the Deviations: square each deviation to get (ŷᵢ − ȳ)².
4. Sum the Squared Deviations: sum all the squared deviations to get the SSR.
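Formula 2 translates into a small helper function; the sketch below (plain Python; the data arrays anticipate Sample Problem 2 that follows) shows one way to compute it:

def r_squared(y_obs, y_pred):
    """Coefficient of determination: R^2 = 1 - SSE/SST."""
    y_bar = sum(y_obs) / len(y_obs)
    SSE = sum((yo - yp) ** 2 for yo, yp in zip(y_obs, y_pred))   # error sum of squares
    SST = sum((yo - y_bar) ** 2 for yo in y_obs)                 # total sum of squares
    return 1 - SSE / SST

# Tensile-strength data of Sample Problem 2 below (observed vs. predicted, MPa):
y_obs  = [210, 220, 215, 230, 240, 225, 235, 245, 250, 255]
y_pred = [200, 215, 210, 225, 235, 220, 230, 240, 245, 250]
print(r_squared(y_obs, y_pred))    # ~0.84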
Sample Problem 2: You are an engineer tasked with predicting the tensile strength of a new alloy based on its composition. You conduct an experiment with 10 samples and obtain the following tensile strength measurements (in MPa) and the corresponding predictions from your model:

Sample    Observed Strength (yᵢ)    Predicted Strength (ŷᵢ)
1         210                       200
2         220                       215
3         215                       210
4         230                       225
5         240                       235
6         225                       220
7         235                       230
8         245                       240
9         250                       245
10        255                       250

1.) Get the SSE

Sample    Observed (yᵢ)    Predicted (ŷᵢ)    (yᵢ − ŷᵢ)    (yᵢ − ŷᵢ)²
1         210              200               10           100
2         220              215               5            25
3         215              210               5            25
4         230              225               5            25
5         240              235               5            25
6         225              220               5            25
7         235              230               5            25
8         245              240               5            25
9         250              245               5            25
10        255              250               5            25
                                             SSE:         325

2.) Get the SST

A. Calculate the mean ȳ:
ȳ = (210 + 220 + 215 + 230 + 240 + 225 + 235 + 245 + 250 + 255)/10 = 2325/10 = 232.5

B. Compute the deviations from the mean, square them, and take the sum:

Sample    Observed (yᵢ)    (yᵢ − ȳ)    (yᵢ − ȳ)²
1         210              -22.5       506.25
2         220              -12.5       156.25
3         215              -17.5       306.25
4         230              -2.5        6.25
5         240              7.5         56.25
6         225              -7.5        56.25
7         235              2.5         6.25
8         245              12.5        156.25
9         250              17.5        306.25
10        255              22.5        506.25
                           SST:        2062.5

3.) Get the SSR
SSR = SST − SSE = 2062.5 − 325 = 1737.5

4.) Get the Coefficient of Determination (R²)
R² = 1 − SSE/SST = (SST − SSE)/SST = SSR/SST
R² = 1 − 325/2062.5 = 1737.5/2062.5
R² = 0.84

Therefore, the R² of 0.84 indicates that approximately 84% of the variability in the observed tensile strengths is explained by the regression model. This suggests that the model provides a good fit to the data.

Pitfalls of Using R²

While R² is a useful measure of how well a regression model fits the data, it has several limitations that can lead to potential pitfalls:

1. Does Not Indicate Causation:
   a. A high R² value indicates a strong correlation between the independent and dependent variables, but it does not imply that the independent variable causes the changes in the dependent variable.
   b. Simple Example: Ice cream sales and drowning incidents might have a high R², but eating ice cream does not cause drowning; both are related to hot weather.
2. Not Useful for Comparing Models with Different Dependent Variables:
   a. R² is specific to the dataset and the model, and it cannot be used to compare models with different dependent variables.
   b. Simple Example: Comparing the R² values of a model predicting house prices and a model predicting car prices is meaningless because they are different problems.
3. Does Not Reflect Model Complexity:
   a. A higher R² does not always mean a better model. Adding more independent variables can increase R² even if those variables do not have meaningful relationships with the dependent variable.
   b. Simple Example: Including unrelated factors like shoe size in a model predicting salary might increase R² but does not improve the model's usefulness.
4. Can Be Misleading with Non-Linear Relationships:
   a. R² assumes a linear relationship between the variables. It can be low for models with a non-linear relationship, even if the model fits the data well.
   b. Simple Example: In a scenario where the relationship between hours studied and exam scores is exponential, a linear model might have a low R² despite good predictions.
5. Does Not Provide Information About Model Bias or Variance:
   a. R² does not indicate whether the model is overfitting or underfitting the data.
   b. Simple Example: A model with a very high R² on training data but poor performance on new data is overfitting, capturing noise instead of the underlying pattern.
6. Insensitive to Changes in the Scale of Data:
   a. R² remains the same if you multiply all dependent variable values by a constant factor.
   b. Simple Example: If you change the units of measurement from meters to centimeters, R² stays the same, even though the scale of the data has changed.

While R² is a useful metric for understanding the fit of a regression model, relying solely on it can be misleading. It should be used in conjunction with other metrics and domain knowledge to fully assess the model's performance and validity.

G. CORRELATION

Correlation, derived from 'co' meaning together and 'relation' meaning connection, refers to the statistical association between variables. It indicates that a change in one variable is accompanied by a corresponding change in another variable, either directly or indirectly. This statistical technique measures the strength of the connection between pairs of variables, providing insight into how they relate to one another.

Correlation can be positive or negative. When two variables move in the same direction (an increase in one variable results in a corresponding increase in the other, and vice versa), they are said to be positively correlated. Conversely, when two variables move in opposite directions (an increase in one variable leads to a decrease in the other, and vice versa), this is known as negative correlation.

Correlation Analysis
Correlation analysis examines the linear relationships between variables. The degree of association is measured by a correlation coefficient, denoted by r, which quantifies the strength and direction of the linear relationship.

Uses of Correlation
Prediction - If there is a relationship between two variables, we can make predictions about one from the other.
Validity - Concurrent validity (correlation between a new measure and an established measure).
Reliability - 1. Test-retest reliability (are measures consistent?). 2. Inter-rater reliability (are observers consistent?).
Theory verification - Predictive validity.

Pearson's correlation coefficient r
It is used to determine the strength of the correlation, and it varies between −1 and +1:

r = [n·ΣXY − ΣX·ΣY] / √[(n·ΣX² − (ΣX)²)·(n·ΣY² − (ΣY)²)]

Interpretation of the Correlation Coefficient
r = +1: perfect positive linear relationship.
r = −1: perfect negative linear relationship.
r = 0: no linear relationship.
Values of r close to +1 or −1 indicate a strong linear relationship, while values close to 0 indicate a weak or no linear relationship.

3 TYPES OF CORRELATION SCATTER DIAGRAMS
Zero Correlation (r = 0) - If two variables X and Y have no correlation, then ρ = 0. This means that the two variables have no linear relationship at all.
Positive Correlation (0 < r ≤ +1) - Two variables X and Y have a positive correlation if higher values of X tend to be associated with higher values of Y, and lower values of X tend to be associated with lower values of Y.
Negative Correlation (−1 ≤ r < 0) - When two variables X and Y have a negative correlation, high values of X tend to coincide with low values of Y, and low values of X tend to coincide with high values of Y.

A frequently asked question is, "When can it be said that there is a strong correlation between the variables, and when is the correlation weak?" Here is an informal rule of thumb for characterizing the value of r:

Weak:      −0.5 ≤ r ≤ 0.5
Moderate:  either −0.8 < r < −0.5 or 0.5 < r < 0.8
Strong:    either r ≥ 0.8 or r ≤ −0.8

It may surprise you that an r as substantial as 0.5 or −0.5 goes in the weak category.
The rationale is that if r = 0.5 or −0.5, then r² = 0.25 in a regression with either variable playing the role of y. A regression model that explains at most 25% of the observed variation is not, in fact, very impressive.

Example 1: An accurate assessment of soil productivity is critical to rational land-use planning. Unfortunately, as the author of the article "Productivity Ratings Based on Soil Series" (Prof. Geographer, 1980: 158-163) argues, an acceptable soil productivity index is not so easy to come by. One difficulty is that productivity is determined partly by which crop is planted, and the relationship between the yields of two different crops planted in the same soil may not be very strong. To illustrate, the article presents the accompanying data on corn yield x and peanut yield y (mT/Ha) for eight different types of soil.

X    2.4    3.4    4.6    3.7    2.2    3.3    4      2.1
Y    1.33   2.12   1.8    1.65   2      1.76   2.11   1.63

Step 1: Find XY, X², and Y² and their sums, as in the table below.

X      Y      XY       X²       Y²
2.4    1.33   3.192    5.76     1.7689
3.4    2.12   7.208    11.56    4.4944
4.6    1.8    8.28     21.16    3.24
3.7    1.65   6.105    13.69    2.7225
2.2    2      4.4      4.84     4
3.3    1.76   5.808    10.89    3.0976
4      2.11   8.44     16       4.4521
2.1    1.63   3.423    4.41     2.6569
ΣX = 25.7   ΣY = 14.4   ΣXY = 46.856   ΣX² = 88.31   ΣY² = 26.4324

Step 2: Use the following formula to work out the correlation coefficient, knowing that n = 8.

r = [n·ΣXY − ΣX·ΣY] / √[(n·ΣX² − (ΣX)²)·(n·ΣY² − (ΣY)²)]
r = [(8)(46.856) − (25.7)(14.4)] / √[(8·88.31 − 25.7²)·(8·26.4324 − 14.4²)]
r ≈ 0.3473

Knowing the approximate value of r, the informal rule of thumb for characterizing the value of r suggests that the correlation between corn yield and peanut yield would be described as weak.

Example 2: Suppose a civil engineer collected data on 10 different road segments. For each segment, they rated the quality of the construction materials on a scale from 1 to 10 (with 10 being the highest quality) and measured the longevity of the pavement in years. The engineer wants to determine if there is a relationship between the quality of construction materials and the longevity of the pavement.

Question: Is there a significant correlation between the quality of construction materials and the longevity of the pavement? If so, describe the nature and strength of this relationship.

Step 1: Find XY, X², and Y² and their sums, as in the table below.

Segment    Quality of Materials (X)    Longevity (Years, Y)    XY      X²      Y²
1          8                           15                      120     64      225
2          6                           12                      72      36      144
3          9                           18                      162     81      324
4          7                           13                      91      49      169
5          5                           10                      50      25      100
6          8                           16                      128     64      256
7          6                           11                      66      36      121
8          7                           14                      98      49      196
9          9                           19                      171     81      361
10         10                          20                      200     100     400
           ΣX = 75                     ΣY = 148                ΣXY = 1158   ΣX² = 585   ΣY² = 2296

Step 2: Use the following formula to work out the correlation coefficient, knowing that n = 10.

r = [n·ΣXY − ΣX·ΣY] / √[(n·ΣX² − (ΣX)²)·(n·ΣY² − (ΣY)²)]
r = [(10)(1158) − (75)(148)] / √[(10·585 − 75²)·(10·2296 − 148²)]
r ≈ 480/487.44
r ≈ 0.98

Interpretation: A Pearson correlation coefficient of approximately 0.98 indicates a very strong positive correlation between the quality of construction materials and the longevity of the pavement. This suggests that higher-quality materials tend to be associated with longer-lasting pavements.
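The computation for Example 2 can be verified with a few lines of Python (plain Python, using the raw-sums form of Pearson's r given above):

import math

# Example 2 data: material quality rating (X) and pavement longevity in years (Y)
X = [8, 6, 9, 7, 5, 8, 6, 7, 9, 10]
Y = [15, 12, 18, 13, 10, 16, 11, 14, 19, 20]
n = len(X)

num = n * sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y)
den = math.sqrt((n * sum(x * x for x in X) - sum(X) ** 2) *
                (n * sum(y * y for y in Y) - sum(Y) ** 2))
r = num / den
print(round(r, 2))    # ~0.98, a very strong positive correlation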
Testing for the Significance of the Pearson Correlation Coefficient

The Pearson correlation coefficient can also be used to test whether the relationship between two variables is significant. The Pearson correlation of the sample is r; it is an estimate of rho (ρ), the Pearson correlation of the population. Knowing r and n (the sample size), we can infer whether ρ is significantly different from 0.

Hypotheses:
H0: There is no correlation between the quality of construction materials and the longevity of the pavement; ρ = 0.
H1: There is a correlation between the quality of construction materials and the longevity of the pavement; ρ ≠ 0.

To test these hypotheses, we use the Pearson correlation coefficient calculated earlier. Given that the correlation coefficient r is approximately 0.98, we need to determine if this value is statistically significant.

Calculate the Test Statistic: the test statistic for the Pearson correlation coefficient is

t = r·√(n − 2) / √(1 − r²)

Using the calculated r and n = 10:
t = 0.98·√(10 − 2) / √(1 − 0.98²)
t = 0.98·√8 / √0.0396
t ≈ 13.93

Determine the Degrees of Freedom: the degrees of freedom for the t-test in this case are n − 2:
df = 10 − 2 = 8

Determine the Critical Value: we compare the test statistic to the critical value from the t-distribution table at a given significance level (commonly α = 0.05). For df = 8 and α = 0.05 (two-tailed test), the critical value is approximately 2.306.

Compare and Make a Decision:
If |t| > t_critical, we reject the null hypothesis H0.
If |t| ≤ t_critical, we fail to reject the null hypothesis H0.
Given t ≈ 13.93 and t_critical ≈ 2.306, |t| = 13.93 is much greater than 2.306.

Conclusion: Since the test statistic |t| ≈ 13.93 is significantly greater than the critical value t_critical, we reject the null hypothesis H0. This means there is significant evidence of a correlation between the quality of construction materials and the longevity of the pavement. The very high Pearson correlation coefficient of about 0.98 indicates a strong positive correlation.
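A minimal Python sketch of this significance test (the critical value 2.306 is the tabled t_{0.025,8} quoted above):

import math

r, n = 0.98, 10
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # test statistic, ~13.9
df = n - 2
t_crit = 2.306                                     # t_{0.025,8} from a t table
print(t, abs(t) > t_crit)                          # True: reject H0, rho differs from 0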
IV. LIST OF TECHNICAL TERMS

Empirical Model - a statistical model built on the foundation of observation rather than established theory.
Non-deterministic Relationship - a relationship in which y cannot be predicted perfectly from knowledge of the corresponding x.
Regression Analysis - a statistical method used to determine the relationship between a dependent variable and one or more independent variables.
Regression Model - predicts the value of a dependent variable based on the values of one or more independent variables.
Correlation - a statistical measure that expresses the extent to which two variables are linearly related (meaning they change together at a constant rate). It is a common tool for describing simple relationships without making a statement about cause and effect.
Coefficient of Determination - denoted R², measures the proportion of the variance in the dependent variable that is predictable from the independent variables in a regression model.
Error Sum of Squares - measures the total deviation of the observed values from the values predicted by a regression model, quantifying the model's prediction error.
Total Sum of Squares - measures the total variation in the observed data, representing the sum of the squared differences between each observation and the overall mean.
Regression Sum of Squares - measures the variation in the observed data that is explained by the regression model, representing the sum of the squared differences between the predicted values and the overall mean.

V. REFERENCES

Bevans, R. (2024, May 9). One-way ANOVA | When and how to use it (with examples). Scribbr. Retrieved July 2, 2024, from https://www.scribbr.com/statistics/one-way-anova/
Montgomery, D. C., & Runger, G. C. (2010). Applied Statistics and Probability for Engineers (5th ed.). Wiley.
Turney, S. (2022, April 22). Coefficient of determination (R²) | Calculation & interpretation. Scribbr. https://www.scribbr.com/statistics/coefficient-of-determination/

UNIT 10: SIMPLE LINEAR REGRESSION AND CORRELATION
FINAL TERM FORMAL REPORT
ES23: ENGINEERING DATA ANALYSIS
College of Engineering, Central Mindanao University
University Town, Musuan, Maramag, Bukidnon

By:
Astronomo, Constantine Nichoe
Serquiña, Raya Reyna J.
Lopez, Roxshan Gale L.
Bulahan, Carl Anthon T.
Sibay, Febbi Akizah M.
Caling, Justine Paul A.
Delfin, Allyza Mae B.
Francisco, Kristel C.
Lagno, Juanito A.

July 03, 2024
