Linear Regression Framework (2023) PDF

5.1 Linear Regression Framework
Modelling Data | Linear Model
© University of Sydney DATA1001/1901
27 August 2023

Unit Overview (diagram labels: Population, Sample)
1 Exploring Data
2 Modelling Data
3 Sampling Data
4 Decisions with Data

Module 2: Modelling Data
Normal Model: What is the Normal Curve? How can we use it to model data?
Linear Model: How can we describe the relationship between 2 variables? When is a linear model appropriate?

5.1 Linear Regression Framework
· Data Story | How can we model the air quality in Sydney?
· Quick taster
· Scatter Plot
· Linear Correlation
· Regression Line
· Residual Plot
· Extension

Data Story
How can we model the air quality in Sydney?

AQI data
Air quality data is updated hourly at measuring stations, and a daily air quality forecast is then made for the Greater Sydney Metropolitan Region at 4 pm each day.
At each site, readings are taken of pollutants (Ozone (O3), Nitrogen dioxide (NO2), Visibility (NEPH), Carbon monoxide (CO), Sulfur dioxide (SO2) and Particles (PM10, PM2.5)), which are then combined into the air quality index (AQI).

Who is the AQI useful for? What might be confounding variables?

Currently, the data released are the separate pollutants, not the AQI. So here we consider some historical data from July 2015 for two regions: Sydney's central-east (CE) and Sydney's north-west (NW).

We upload the AQIJuly2015.csv data into RStudio.

    head(AQIJuly2015,1)
    ##         Date SydneyCEAQI SydneyNWAQI
    ## 1 01/07/2015          99          92

    str(AQIJuly2015)
    ## 'data.frame':    31 obs. of  3 variables:
    ##  $ Date       : Factor w/ 31 levels "01/07/2015","02/07/2015",..: 1 2 3 4 5 6 7 8 9 10 ...
    ##  $ SydneyCEAQI: int  99 32 70 74 95 71 31 58 108 82 ...
    ##  $ SydneyNWAQI: int  92 44 82 96 100 98 65 71 74 67 ...

Quick taster

Linear Regression Framework
Given bivariate data (x, y) and the research question (Is y linearly related to x?):
Step 1. Produce a scatter plot
Step 2. Produce a regression line
Step 3. Calculate the correlation coefficient
Step 4. Produce a residual plot
Step 5. Check assumptions
Step 6. Perform predictions

Applying to the AQI data

1. Produce a scatter plot
(Scatter plot of SydneyNWAQI against SydneyCEAQI.)
Is this linear?

2. Produce a regression line

    model = lm(AQIJuly2015$SydneyNWAQI ~ AQIJuly2015$SydneyCEAQI)
    summary(model)
    ##
    ## Call:
    ## lm(formula = AQIJuly2015$SydneyNWAQI ~ AQIJuly2015$SydneyCEAQI)
    ##
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max
    ## -25.8556  -8.5835   0.5649   9.6374  27.4342
    ##
    ## Coefficients:
    ##                         Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)              19.8874     6.2915   3.161  0.00367 **
    ## AQIJuly2015$SydneyCEAQI   0.7138     0.1141   6.257  7.9e-07 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 13.67 on 29 degrees of freedom
    ## Multiple R-squared:  0.5744, Adjusted R-squared:  0.5598
    ## F-statistic: 39.15 on 1 and 29 DF,  p-value: 7.897e-07

So the linear regression line is approximately NWAQI = 20 + 0.7 × CEAQI.

    ggplot(AQIJuly2015, aes(x = SydneyCEAQI, y = SydneyNWAQI)) +
      geom_point() +
      geom_point(aes(x = mean(SydneyCEAQI), y = mean(SydneyNWAQI)), colour = "indianred", size = 5) +
      stat_smooth(method = "lm", formula = y ~ x, geom = "smooth")

3. Calculate the correlation coefficient

    cor(AQIJuly2015$SydneyCEAQI, AQIJuly2015$SydneyNWAQI)
    ## [1] 0.757917

How strong is the linear association?

4. Produce a residual plot

    ggplot(model, aes(x = .fitted, y = .resid)) +
      geom_point() +
      geom_hline(yintercept = 0)

Is this a random scatter?

5. Check assumptions
The scatter plot looks linear (Step 1) and the residual plot looks random (Step 4). Hence the linear model seems to be appropriate.

6. Perform predictions
Example: if CE Sydney had an AQI of 70, we would predict NW Sydney to have an AQI of about 20 + 0.7 × 70 = 69. Using the unrounded coefficients:

    model$coefficients[1] + model$coefficients[2]*70
    ## (Intercept)
    ##    69.85204

Scatter plot

Graphical Summary for Bivariate Data
Bivariate data involves a pair of variables.
We are interested in the relationship between the 2 variables. Can one variable be used to predict the other?

Formally, we have $(x_i, y_i)$ for $i = 1, 2, \dots, n$, where X is called the independent variable (or explanatory variable, predictor or regressor), and Y is called the dependent variable (or response variable).

What are common examples of bivariate data?

Scatter Plot
A scatter plot is a graphical summary of 2 variables on the same 2D plane, resulting in a cloud of points.

Linear Correlation

Linear association
The linear association (or association) between 2 variables describes how tightly the points cluster around a line.
If there is a strong association, the points are tightly clustered around a line, and this allows for good predictions from 1 variable to the other.
If one variable tends to increase with the other, then we have positive association.

How do we measure linear association?

Summarising a Scatter Plot
A scatter plot can be summarised by the following 5 numerical summaries:
· mean and SD of X ($\bar{x}$, $SD_x$)
· mean and SD of Y ($\bar{y}$, $SD_y$)
· correlation coefficient ($r$).
Source: Freedman et al, Statistics p125

· The centre of the cloud is represented by the point of averages $(\bar{x}, \bar{y})$.
· The horizontal spread of the cloud is measured by $SD_x$. We expect most of the points to fall within 2 SDs of $\bar{x}$.
· The vertical spread of the cloud is measured by $SD_y$. We expect most of the points to fall within 2 SDs of $\bar{y}$.
Source: Freedman et al, Statistics p125

Note that both clouds have the same centre and the same horizontal and vertical spread. However, they have different clustering around a line (linear association).

The correlation coefficient

Correlation coefficient
The correlation coefficient $r$ is a numerical summary which measures the clustering around the line.
- It indicates both the sign and strength of the linear association.
- The correlation coefficient is between -1 and 1.
- If $r$ is positive: the cloud slopes up.
- If $r$ is negative: the cloud slopes down.
- As $r$ gets closer to $\pm 1$: the points cluster more tightly around the line.

Examples
Source: Freedman et al, Statistics p127

Guessing the correlation coefficient
We could compare our scatter plot to other data. Try the games: 1 2 3 4

Definition of the correlation coefficient

Population correlation coefficient
The population correlation coefficient ($r_{pop}$) is the mean of the product of the variables in standard units.

    head(AQIJuly2015, 4)
    ##         Date SydneyCEAQI SydneyNWAQI
    ## 1 01/07/2015          99          92
    ## 2 02/07/2015          32          44
    ## 3 03/07/2015          70          82
    ## 4 04/07/2015          74          96

Calculation by hand

| x  | y  | x in standard units, (x − 50.77)/21.88 | y in standard units, (y − 56.13)/20.61 | product | quadrant    |
|----|----|----------------------------------------|----------------------------------------|---------|-------------|
| 99 | 92 | 2.20                                   | 1.74                                   | 3.84    | upper right |
| 32 | 44 | -0.86                                  | -0.59                                  | 0.51    | lower left  |
| 70 | 82 | 0.88                                   | 1.26                                   | 1.10    | upper right |
| 74 | 96 | 1.06                                   | 1.93                                   | 2.05    | upper right |
| ⋮  | ⋮  | ⋮                                      | ⋮                                      | ⋮       |             |

Mean of the products = +0.76

Calculation in R

Long way

    library(rafalib)
    SU_x = (AQIJuly2015$SydneyCEAQI - mean(AQIJuly2015$SydneyCEAQI))/popsd(AQIJuly2015$SydneyCEAQI)
    SU_y = (AQIJuly2015$SydneyNWAQI - mean(AQIJuly2015$SydneyNWAQI))/popsd(AQIJuly2015$SydneyNWAQI)
    mean(SU_x*SU_y)
    ## [1] 0.757917

Short way

    cor(AQIJuly2015$SydneyCEAQI, AQIJuly2015$SydneyNWAQI)
    ## [1] 0.757917

Population vs Sample
Note that there are two slightly different formulas depending on whether we have a population or sample. However, they give the same result!
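A short derivation shows why the population and sample versions agree. It is a sketch under the usual convention, assumed here, that $SD_x$ is the population SD (dividing by $n$) and $s_x$ is the sample SD (dividing by $n-1$), so that $s_x = \sqrt{\tfrac{n}{n-1}}\, SD_x$ and likewise for $y$:

$$
\begin{aligned}
r_{sample}
&= \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right) \\
&= \frac{1}{n-1} \cdot \frac{n-1}{n} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{SD_x}\right)\left(\frac{y_i - \bar{y}}{SD_y}\right)
\qquad \text{since } s_x s_y = \frac{n}{n-1}\, SD_x SD_y \\
&= \frac{1}{n} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{SD_x}\right)\left(\frac{y_i - \bar{y}}{SD_y}\right)
= r_{pop}.
\end{aligned}
$$

The factor of $n-1$ in front of the sum cancels against the factor hidden inside the two sample SDs, which is why `cor()` returns the same number whichever convention is used.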
Population Correlation Coefficient:

$$r_{pop} = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{SD_x}\right)\left(\frac{y_i - \bar{y}}{SD_y}\right)$$

    SU_x = (AQIJuly2015$SydneyCEAQI - mean(AQIJuly2015$SydneyCEAQI))/popsd(AQIJuly2015$SydneyCEAQI)
    SU_y = (AQIJuly2015$SydneyNWAQI - mean(AQIJuly2015$SydneyNWAQI))/popsd(AQIJuly2015$SydneyNWAQI)
    mean(SU_x*SU_y)
    ## [1] 0.757917

Sample Correlation Coefficient:

$$r_{sample} = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

    n = length(AQIJuly2015$SydneyCEAQI)
    SU_x = (AQIJuly2015$SydneyCEAQI - mean(AQIJuly2015$SydneyCEAQI))/sd(AQIJuly2015$SydneyCEAQI)
    SU_y = (AQIJuly2015$SydneyNWAQI - mean(AQIJuly2015$SydneyNWAQI))/sd(AQIJuly2015$SydneyNWAQI)
    sum(SU_x*SU_y)/(n-1)
    ## [1] 0.757917

Why does r measure association?
The point of averages (centre) divides the scatter plot into 4 quadrants. Points in the upper right and lower left quadrants have a positive product of standard units (+), while points in the upper left and lower right quadrants have a negative product (−). Hence a majority of points in the upper right and lower left quadrants gives an overall positive value of $r$.
Source: Freedman et al, Statistics p127

Properties of the Correlation Coefficient

1. Value
The correlation coefficient ($r$) is a pure number (no units). It lies between -1 and 1 (inclusive).
When $r = \pm 1$, all the points lie on a line (no cloud; perfect correlation).
Note: $r = 0$ occurs when the points don't cluster around a line. But beware: this can happen in many different ways!

2. Symmetry
The correlation coefficient is not affected by interchanging the variables.

    cor(AQIJuly2015$SydneyCEAQI, AQIJuly2015$SydneyNWAQI)
    ## [1] 0.757917
    cor(AQIJuly2015$SydneyNWAQI, AQIJuly2015$SydneyCEAQI)
    ## [1] 0.757917

3. Scaling
The correlation coefficient is shift and scale invariant (for positive scale factors; multiplying one variable by a negative number flips the sign of $r$).

    CE = AQIJuly2015$SydneyCEAQI
    NW = AQIJuly2015$SydneyNWAQI
    cor(2*CE+2, 3*NW-3)
    ## [1] 0.757917

Regression Line

How do we find the optimal line?
What line are the points clustered around? Experiment with 2 main options.

1st Option: SD Line
The SD line connects the point of averages $(\bar{x}, \bar{y})$ to $(\bar{x} + SD_x, \bar{y} + SD_y)$ (for $r > 0$), or $(\bar{x}, \bar{y})$ to $(\bar{x} + SD_x, \bar{y} - SD_y)$ (for $r < 0$).
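As a rough illustration of the SD line just defined, the slope and intercept can be computed directly from the means and SDs. This is only a sketch: it uses the first 10 days of the data (the values visible in the `str()` output earlier), so its numbers will differ from the full 31-day results, and the names `ce`, `nw` and `pop_sd` are made up for this example.

```r
# First 10 days of the July 2015 AQI data, copied from the str() output.
# Assumption: a 10-day subset, not the full 31-day data set.
ce <- c(99, 32, 70, 74, 95, 71, 31, 58, 108, 82)   # Sydney CE AQI
nw <- c(92, 44, 82, 96, 100, 98, 65, 71, 74, 67)   # Sydney NW AQI

# Population SD (divide by n), written in base R so rafalib is not required.
pop_sd <- function(v) sqrt(mean((v - mean(v))^2))

# The SD line passes through the point of averages with slope +/- SDy/SDx,
# where the sign is taken from the correlation coefficient.
r <- cor(ce, nw)
sd_slope <- sign(r) * pop_sd(nw) / pop_sd(ce)
sd_intercept <- mean(nw) - sd_slope * mean(ce)

cat("SD line: NW =", round(sd_intercept, 2), "+", round(sd_slope, 2), "* CE\n")
```

Note that `r` is used only for its sign here, which is exactly why the SD line cannot reflect how tightly the points cluster.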
Visually, the SD line looks like a good candidate, as it includes the point of averages and the data points generally seem to cluster around it. For example, an AQI pair where both values are 0.5 SDs above the mean would lie on the SD line.
Source: Freedman et al, Statistics p131

However, it does not use the correlation coefficient, so it is insensitive to the amount of clustering around the line. So it underestimates (LHS) and overestimates (RHS) at the extremes.

Best Option: Regression Line
To fully describe the scatter plot, we need to use the 5 summaries: $\bar{x}$, $\bar{y}$, $SD_x$, $SD_y$, $r$.
The regression line connects $(\bar{x}, \bar{y})$ to $(\bar{x} + SD_x, \bar{y} + r\,SD_y)$.
· Note the improvement at the extremes.

Comparing the Regression Line and the SD Line

| Feature       | SD Line | Regression Line |
|---------------|---------|-----------------|
| Connects      | $(\bar{x}, \bar{y})$ to $(\bar{x} + SD_x, \bar{y} + SD_y)$ for $r \ge 0$; $(\bar{x}, \bar{y})$ to $(\bar{x} + SD_x, \bar{y} - SD_y)$ for $r < 0$ | $(\bar{x}, \bar{y})$ to $(\bar{x} + SD_x, \bar{y} + r\,SD_y)$ |
| Slope (b)     | $\frac{SD_y}{SD_x}$ for $r \ge 0$; $-\frac{SD_y}{SD_x}$ for $r < 0$ | $r\,\frac{SD_y}{SD_x}$ |
| Intercept (a) | $\bar{y} - b\bar{x}$ | $\bar{y} - b\bar{x}$ |

We can derive the (least-squares) regression line using calculus.

The graph of averages

Graph of averages
The graph of averages plots the average y for each x.
The regression line is a smoothed version of the graph of averages. If the graph of averages is a straight line, that line is the regression line.

Residual Plot

Residuals
A residual is the vertical distance (or 'gap') of a point above or below the regression line.
A residual represents the error between the actual value and the prediction.
Formally, a residual is $e_i = y_i - \hat{y}_i$, given the actual value ($y_i$) and the prediction ($\hat{y}_i$).
Source: Freedman et al, Statistics p182

Residual plot
A residual plot graphs the residuals vs x.
If the linear fit is appropriate for the data, it should show no pattern, i.e. be random about 0.

Does this residual plot look random?
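The slope and intercept formulas for the regression line, and the definition of a residual, can be checked numerically against `lm()`. The sketch below assumes only the first 10 days of the data (the values shown in the `str()` output), so the coefficients will not match the 31-day fit; the names `ce`, `nw`, `a`, `b` and `e` are illustrative.

```r
# First 10 days of the July 2015 AQI data, copied from the str() output.
# Assumption: a 10-day subset, not the full 31-day data set.
ce <- c(99, 32, 70, 74, 95, 71, 31, 58, 108, 82)   # x: Sydney CE AQI
nw <- c(92, 44, 82, 96, 100, 98, 65, 71, 74, 67)   # y: Sydney NW AQI

# Regression line from the 5 summaries: b = r * SDy / SDx, a = ybar - b * xbar.
# The ratio SDy/SDx is the same for sample and population SDs, so sd() is fine.
r <- cor(ce, nw)
b <- r * sd(nw) / sd(ce)
a <- mean(nw) - b * mean(ce)

# Least-squares fit for comparison.
fit <- lm(nw ~ ce)
print(rbind(by_hand = c(a, b), lm = unname(coef(fit))))

# Residuals: actual value minus prediction, e_i = y_i - yhat_i.
e <- nw - (a + b * ce)

# For a least-squares line with an intercept, the residuals sum to 0
# and are uncorrelated with x (up to floating-point error).
print(c(sum_of_residuals = sum(e), cor_with_x = cor(ce, e)))
```

Both checks hold for any data set, which is why a residual plot is read for patterns such as curvature or fanning rather than for its average level.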
Extension

Transformation
If the data is very spread out, then we can try transforming 1 or both of the original variables. For example, we could take ln(y) or ln(x) as the new variables.

Multiple regression
· The natural extension to linear regression is multiple regression, in which we look at the connection between y and 2+ x variables.
· The equation becomes $\hat{y}_i = a + b_1 x_{i,1} + b_2 x_{i,2} + \dots + b_n x_{i,n}$
· The coefficient $b_j$ represents the association between variables $x_{i,j}$ and $y_i$. The sign of $b_j$ is the direction of the association.
· Changing the set of variables can change the model surprisingly.
· Multicollinearity occurs when 2 of the x variables are highly correlated with each other.
· A binary qualitative variable can be added to a multiple regression by coding a "dummy variable" as 0 and 1.

Summary
The scatter plot is a cloud of points which represents bivariate data (a pair of variables).

The scatter plot is summarised by the point of averages, the SDs of the 2 variables and the correlation coefficient.

The population correlation coefficient is the mean of the product of the variables in standard units. The sample correlation coefficient can be found using cor().

For prediction, the regression line is better than the SD line, as it uses all 5 numerical summaries for the scatter plot. It is a smoothed version of the graph of averages.

It is important to check the scatter plot before making any predictions. Fitting a linear model is easy in R, but requires careful thought to make sure it is appropriate. Otherwise any predictions are invalid.

The residual plot is a diagnostic for seeing whether a linear model was appropriate: if it is random, then a linear model seems appropriate.
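The six steps of the framework can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not the unit's own script: it uses only the first 10 days of the data (the values shown in the `str()` output), so its numbers differ from the 31-day results; it swaps in base R graphics for ggplot2; and the data frame name `aqi` is made up for the example.

```r
# Assumption: first 10 of the 31 days only, copied from the str() output.
aqi <- data.frame(
  SydneyCEAQI = c(99, 32, 70, 74, 95, 71, 31, 58, 108, 82),
  SydneyNWAQI = c(92, 44, 82, 96, 100, 98, 65, 71, 74, 67)
)

# Step 1: scatter plot (base graphics here, in place of ggplot2).
plot(aqi$SydneyCEAQI, aqi$SydneyNWAQI,
     xlab = "Sydney CE AQI", ylab = "Sydney NW AQI")

# Step 2: regression line. Passing data = aqi (rather than aqi$... in the
# formula) lets predict() accept new data in Step 6.
model <- lm(SydneyNWAQI ~ SydneyCEAQI, data = aqi)
abline(model)

# Step 3: correlation coefficient.
r <- cor(aqi$SydneyCEAQI, aqi$SydneyNWAQI)

# Step 4: residual plot, with a horizontal reference line at 0.
plot(fitted(model), resid(model),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0)

# Step 5: check assumptions by eye: is the scatter plot roughly linear,
# and is the residual plot a random scatter about 0?

# Step 6: predict the NW AQI when the CE AQI is 70.
pred <- predict(model, newdata = data.frame(SydneyCEAQI = 70))
print(c(r = r, prediction = unname(pred)))
```

Using `predict()` with `newdata` gives the same value as adding the intercept to 70 times the slope, but it scales more easily to several new x values at once.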
