Correlation and Regression PDF
Document Details
P. Nyenje
Summary
This document details correlation and regression analysis techniques, including Pearson correlation and linear regression. It covers definitions, explanations and worked examples of the techniques. Equations and diagrams associated with the techniques are included.
Full Transcript
3. Correlation and Regression

3.1 Correlations

Correlations help to determine the inter-relationship between two or more variables. A correlation coefficient, ρ, expresses the degree of the association as a linear dependence. The symbol r is used for sample correlations, and the population coefficient ρ is inferred from r.

For example, suppose EC and nitrate concentrations in shallow groundwater are measured in springs in an urban area. For each sample, the concentration of one is plotted against the concentration of the other. As EC concentrations increase, so do nitrate concentrations. How might the strength of this association be measured and summarized? Correlations can be used for this purpose.

Range: −1 ≤ ρ ≤ 1
- ρ = 0: no correlation between the variables
- ρ = ±1: perfect correlation between the variables
- ρ positive: one variable increases as the second increases
- ρ negative: the variables vary in opposite directions

Monotonic vs linear correlations

Data can be correlated either in a linear or a non-linear way. A monotonic correlation is one where:
- y increases or decreases as x increases;
- the correlation can be non-linear (e.g. exponential) or linear.

Linear correlation measures are not good for non-linear datasets.

Common measures of correlation

There are several types of correlation coefficients used in statistics. These include:
a. Kendall's tau
b. Spearman's rho
c. Pearson's r

The first two can measure all monotonic relationships. The last (Pearson's) is the most commonly used, but it measures only linear correlation.

Pearson correlation coefficient

For hydrologic purposes and other engineering disciplines, the most commonly used correlation coefficient is the Pearson correlation coefficient. This coefficient of linear correlation (−1 ≤ ρ ≤ +1) between two variables X and Y is defined as

$$\rho = \frac{\tfrac{1}{n}\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\tfrac{1}{n}\sum_i (x_i - \bar{x})^2 \cdot \tfrac{1}{n}\sum_i (y_i - \bar{y})^2}} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \,\sum_i (y_i - \bar{y})^2}}$$

where
n = total number of observations
i = 1, 2, 3, …, n
x_i, y_i = ith observations of series x and y

The larger ρ is (ignoring the sign), the stronger the correlation. When ρ is 1 or −1, the variables are perfectly correlated. A positive sign shows positive correlation and a negative sign indicates negative (or inverse) correlation. If ρ is near 0, the correlation between the two variables is poor.

The significance of ρ (how strong r is) can be tested by determining whether ρ differs from zero. The test statistic t_r is computed using the equation below and compared with a table of the t-distribution with n − 2 degrees of freedom:

$$t_r = \frac{\rho\sqrt{n-2}}{\sqrt{1-\rho^2}}$$

Example
If the Pearson coefficient is ρ = 0.174 for a sample size of 8, then to test whether ρ is significantly different from 0 (and therefore whether y is linearly dependent on x):

$$t_r = \frac{0.174\sqrt{8-2}}{\sqrt{1-0.174^2}} \approx 0.43$$

Does this represent a strong or weak relationship? For these tests, the null hypothesis is that there is no significant relationship. A test statistic of about 0.43 with 6 degrees of freedom has a two-sided p-value of about 0.68 from the t-distribution. Since the p-value is greater than 0.05, we fail to reject the null hypothesis of no significant relationship. Hence, y is not statistically related to x.
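To make the two formulas above concrete, here is a minimal Python sketch (not part of the original notes) that computes Pearson's r and its t-based significance test by hand and cross-checks them against SciPy. The EC/nitrate values are invented purely for illustration.

```python
# Minimal sketch: Pearson's r and its significance test, using
# hypothetical EC and nitrate measurements (illustrative values only).
import numpy as np
from scipy import stats

ec = np.array([310.0, 450.0, 520.0, 610.0, 700.0, 820.0, 900.0, 1010.0])
nitrate = np.array([1.2, 2.0, 1.8, 3.1, 2.9, 4.0, 4.4, 5.1])
n = len(ec)

# Pearson's r: sum of cross-deviations scaled by both spreads.
x_dev = ec - ec.mean()
y_dev = nitrate - nitrate.mean()
r = np.sum(x_dev * y_dev) / np.sqrt(np.sum(x_dev**2) * np.sum(y_dev**2))

# Test statistic t_r = r*sqrt(n-2)/sqrt(1-r^2), compared with a
# t-distribution with n-2 degrees of freedom (two-sided test).
t_r = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
p_value = 2 * stats.t.sf(abs(t_r), df=n - 2)
print(f"r = {r:.3f}, t_r = {t_r:.3f}, p = {p_value:.4f}")

# Cross-check against SciPy's built-in implementation.
r_check, p_check = stats.pearsonr(ec, nitrate)
print(f"scipy: r = {r_check:.3f}, p = {p_check:.4f}")
```

As in the worked example, a p-value above 0.05 would mean we fail to reject the null hypothesis of no significant linear relationship.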
3.2 Regression

Regression is a statistical tool for the investigation of relationships between variables. It involves developing a mathematical model or equation to describe the association between two or more variables. The model is purely empirical. It can also be used to estimate or predict values of a variable given values of other variables.

In regression, we assume that a random variable y_i (the dependent variable) is affected by an independent variable x_i (simple regression) or by multiple independent variables x_1i, x_2i, … (multiple regression analysis). The independent variable may or may not be random. The most common form of regression is simple linear regression analysis, in which a linear relationship is assumed between two variables, say x and y.

3.2.1 Simple linear regression

A simple linear regression is a model or equation in which the mean of the random variable y is assumed to be linearly dependent on x. The regression line is derived to pass through the mean values of the distribution, so that for any given value of x, the mean value of y is obtained. Hence, the model can be stated as:

$$E[y] = a + bx$$

where E[y] is the expected value of the variable y_i. The values of y_i can be expressed as:

$$y_i = a + bx_i + \varepsilon$$

where
y_i = the dependent variable
x_i = the independent variable
ε = error term, or residual (a random variable with a mean of 0 and constant variance σ²)
a, b = model parameters

In practice, the true regression line (see figure above) is not known, because each new sample of data yields a new regression close to, but not identical with, the true regression. The resulting error between the estimated model and the actual value of y usually follows a normal distribution with a mean of zero. Hence, the ideal case is one where the errors in y are independent of x (i.e. the variability in y is similar for different values of x).

Since the true line is never known, the fitted parameters (denoted $\hat{a}$ and $\hat{b}$) are only estimates of the true parameters a and b. Linear regression therefore involves estimating $\hat{a}$ and $\hat{b}$ as the best estimates of the model parameters.

3.2.2 Estimating model parameters

The parameters of the regression model are estimated by the least-squares method: $\hat{a}$ and $\hat{b}$ are chosen such that the sum of squared differences between the observed values y and the expected values $\hat{a} + \hat{b}x$ is minimized. The least-squares method minimizes the mean square error (MSE) by taking the partial derivatives of the MSE with respect to $\hat{a}$ and $\hat{b}$ and setting them to zero. Solving the resulting equations gives:

$$\hat{b} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{a} = \bar{y} - \hat{b}\,\bar{x}$$

where $\hat{a}$ and $\hat{b}$ are the best parameters of the linear model. The resulting equation, $\hat{y} = \hat{a} + \hat{b}x$, is called the least-squares regression equation. The error term ε, also called the residual, is therefore given as:

$$\varepsilon_i = y_i - (\hat{a} + \hat{b}x_i)$$

The residual is the vertical distance between the observation y_i and $\hat{y}$, an estimate of the true line y = a + bx. In practice, the true line is never known; instead, we estimate it from the available data.

3.2.3 Building a good regression model

First plot the data and check:
- Does the relationship look non-linear?
- Is the variability of y markedly different for different levels of x?

If the relationship looks non-linear:
- Transform x so that the relationship looks linear (see the previous section).

If the variability in y differs for different values of x:
- We need to ensure uniform variance of y (i.e. homoscedasticity).
- Transform y, or both x and y.

Then compute the least-squares regression statistics to obtain the best estimates of the linear regression parameters, and check the residual plot:
- Plot the residuals against the predicted values to check for heteroscedasticity or curvature (see the sketch after the figure note below).

Figure: residual plot showing curvature and changing variance (non-uniform variance, heteroscedasticity), alongside an example of a residual plot for a good regression model (uniform variance, homoscedasticity).
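The workflow above can be illustrated with a short Python sketch (not part of the notes). The data are synthetic, and the fitted line uses the closed-form least-squares estimates from Section 3.2.2.

```python
# Minimal sketch: fit a least-squares line by hand, then inspect the
# residuals for curvature or changing variance. Synthetic data only.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
x = np.linspace(1.0, 10.0, 30)
y = 2.0 + 0.8 * x + rng.normal(0.0, 0.5, size=x.size)

# Least-squares estimates: b_hat = Sxy / Sxx, a_hat = y_bar - b_hat * x_bar
x_dev = x - x.mean()
b_hat = np.sum(x_dev * (y - y.mean())) / np.sum(x_dev**2)
a_hat = y.mean() - b_hat * x.mean()

y_hat = a_hat + b_hat * x   # fitted values
residuals = y - y_hat       # vertical distances to the fitted line
print(f"a_hat = {a_hat:.3f}, b_hat = {b_hat:.3f}")

# Residual plot: points should scatter randomly around zero with
# roughly constant spread (homoscedasticity); curvature or a funnel
# shape suggests transforming x and/or y.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(x, y, s=15)
ax1.plot(x, y_hat, color="red")
ax1.set(xlabel="x", ylabel="y", title="Data and fitted line")
ax2.scatter(y_hat, residuals, s=15)
ax2.axhline(0.0, linestyle="--", color="gray")
ax2.set(xlabel="fitted values", ylabel="residuals", title="Residual plot")
plt.tight_layout()
plt.show()
```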
Residual plots help to identify characteristics or patterns still apparent in the data even after fitting a linear regression model. The figure below shows three scatterplots with fitted linear models in the first row and the corresponding residual plots in the second row. Can you identify any patterns remaining in the residuals?

Figure: three scatterplot/residual-plot pairs. Left: no obvious patterns in the residuals, which appear scattered randomly around the dashed line representing 0. Middle: some curvature in the scatterplot, more obvious in the residual plot; simple linear regression is not applicable. Right: very little upward trend, and the residuals show no obvious patterns, but the slope may not be significantly different from zero, implying no regression.

3.2.4 Measure of goodness of fit of a linear regression

In the model y_i = a + bx_i + ε, the residual ε also expresses the uncertainty in the model. The total uncertainty in the model is a combination of the uncertainty in the model parameters (a and b) and the uncertainty in the model structure:

$$S_y^2 = S_{total}^2 = S_E^2 + S_{mod}^2$$

where
S²_y = S²_total = total variance (SSY)
S²_E = error variance (SSE)
S²_mod = model variance (SSR)
ŷ_i = model estimate of y_i

The ratio S²_mod / S²_total is the proportion of the total variance explained by the model and is therefore a measure of the goodness of fit of the model. This ratio is also called the coefficient of determination, R²:

$$R^2 = \frac{S_{mod}^2}{S_{tot}^2} = 1 - \frac{S_E^2}{S_{tot}^2}$$

R² ranges between 0 and 1:
- R² = 1: perfect regression and a perfect model, i.e. a strong linear relationship between x and y.
- R² = 0: no regression, i.e. no significant relationship between x and y.
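As a minimal illustration of this variance decomposition (synthetic data, not from the notes), the following sketch computes SSY, SSE, SSR and R² for a least-squares fit:

```python
# Minimal sketch: variance decomposition SSY = SSR + SSE and R^2 for a
# line fitted by least squares. Data values are illustrative only.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.9, 3.9, 4.2, 5.4, 6.1, 6.6, 7.8, 8.1])

x_dev = x - x.mean()
b_hat = np.sum(x_dev * (y - y.mean())) / np.sum(x_dev**2)
a_hat = y.mean() - b_hat * x.mean()
y_hat = a_hat + b_hat * x

ssy = np.sum((y - y.mean()) ** 2)      # total sum of squares (S^2_total)
sse = np.sum((y - y_hat) ** 2)         # error sum of squares (S^2_E)
ssr = np.sum((y_hat - y.mean()) ** 2)  # model sum of squares (S^2_mod)

r_squared = 1.0 - sse / ssy            # equivalently ssr / ssy
print(f"SSY = {ssy:.3f} = SSR + SSE = {ssr:.3f} + {sse:.3f}")
print(f"R^2 = {r_squared:.3f}")
```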
3.2.5 Conditions of least-squares regression

Below are the conditions for performing a least-squares regression:
1. The data should show a linear trend (check using a scatter plot).
2. The residuals should be nearly normal.
3. The data should show constant variability.
4. The observations should be independent.

Outliers tend to influence the slope of the regression line, so there is a need to examine whether an outlier is an influential point or not. It may therefore not be advisable to remove outliers without a strong reason.

3.2.6 Multiple linear regression and logistic regression

(a) Multiple linear regression
This is a linear model with two or more explanatory variables. It is usually computed using software such as Excel, R or SPSS.

(b) Logistic regression
Logistic regression is used to build regression models when there is a categorical response variable with two levels. For example, suppose we want to determine the relationship between the number of stances of a toilet and its performance, classified as either dirty or clean. The response is either clean (assigned a value of 0) or dirty (assigned a value of 1); this is a binary response, so a model of the form y_i = a + bx_i + ε cannot be fitted directly. Instead, the modelling is done in two steps:
- First, model the response variable using a probability distribution.
- Then, model the parameters of that distribution using a collection of predictors and a special form of regression equation.

The outcome/response takes on a value of 0 or 1. If a response of 1 has probability p_i = P(y = 1), then a response of 0 has probability 1 − p_i.

The logistic regression model relates the probability that the response is 1 or 0 to the predictors x_1, x_2, x_3, …, x_k using a transformation:

$$\hat{y} \approx \text{transformation}(p_i) = \hat{a} + \hat{b}x$$

Basically, we choose a transformation of the probability such that the values on the left-hand side (which are now continuous) can be equated to the values on the right. Without a transformation, the left-hand side would take on only the values 0 and 1, which is difficult to model. The transformation commonly used is the logit:

$$\mathrm{logit}(p_i) = \log_e\left(\frac{p_i}{1 - p_i}\right)$$

Hence,

$$\log_e\left(\frac{p_i}{1 - p_i}\right) = a + bX_i$$

For many predictor variables,

$$\log_e\left(\frac{p_i}{1 - p_i}\right) = a + b_1 X_{1i} + b_2 X_{2i} + \cdots$$

where a is the intercept, the b's are the slope coefficients and the X's are the explanatory variables. For a single predictor variable, the probability of obtaining a response of 1 can be expressed as:

$$p = \frac{\exp(a + bX)}{1 + \exp(a + bX)}$$

Below is an example of a logistic response function used to model the relationship between the performance of toilets (a binary response: 1 for dirty, 0 for clean) and the number of stances. The figure is a scatter plot of the data; note that toilets tend to become dirty at higher numbers of stances. The logit model fit, computed using software, is also shown.

Figure: example of the logistic response function. Instead of fitting a straight line, the logit model fits an S-shaped logistic function.
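To show how such a logit fit could be computed, here is a minimal Python sketch that estimates a and b by maximum likelihood. The stance counts and dirty/clean labels are invented for illustration; they are not the data behind the figure.

```python
# Minimal sketch: fitting log(p/(1-p)) = a + bX by maximum likelihood
# for a hypothetical toilet-cleanliness dataset (dirty=1, clean=0).
import numpy as np
from scipy.optimize import minimize

stances = np.array([2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 8, 8, 9, 10, 10])
dirty   = np.array([0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1])

def neg_log_likelihood(params):
    a, b = params
    logit = a + b * stances
    p = 1.0 / (1.0 + np.exp(-logit))   # logistic response function
    eps = 1e-12                        # guard against log(0)
    return -np.sum(dirty * np.log(p + eps)
                   + (1 - dirty) * np.log(1 - p + eps))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0], method="Nelder-Mead")
a_hat, b_hat = result.x
print(f"a_hat = {a_hat:.3f}, b_hat = {b_hat:.3f}")

# Predicted probability that a toilet with 6 stances is dirty,
# using p = exp(a + bX) / (1 + exp(a + bX)):
p6 = np.exp(a_hat + b_hat * 6) / (1 + np.exp(a_hat + b_hat * 6))
print(f"P(dirty | 6 stances) = {p6:.2f}")
```

Plotting the fitted p against the number of stances would reproduce the S-shaped logistic function shown in the figure, rather than a straight line.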