PSYC 3910_Lec5_Regression_post.pptx

Full Transcript


Advanced Data Analysis: PSYC 3910
Lecture 4: Bivariate Regressions
Instructor: Bobby Stojanoski

Today's aims
• Understand linear regression with one or more predictors
• Understand how we assess the fit of a regression model
• Total sum of squares
• Model sum of squares
• Residual sum of squares
• F
• R²
• Know how to do regression using R
• Interpret a regression model

Required packages
• install.packages("car")
• install.packages("QuantPsyc")
• library(QuantPsyc)
• library(car)
• library(boot)
• library(ggplot2)
• library(dplyr)

What is a regression?
• A way of predicting the value of one variable from another.
• It is a hypothetical model of the relationship between two variables.
• The model used is a linear one.
• Therefore, we describe the relationship using the equation of a straight line.
• A slightly fancier version of a Pearson correlation.

Describing a straight line
y = mx + b
Yᵢ = b₀ + b₁Xᵢ + εᵢ
• m: gradient (slope) of the line; the direction/strength of the relationship
• b: intercept of the line (the value of Y when X = 0)
• b₁: regression coefficient for the predictor; the gradient (slope) of the regression line; the direction/strength of the relationship
• b₀: intercept (the value of Y when X = 0); the point at which the regression line crosses the Y-axis (ordinate)

Describing a straight line
• Y is a line w.r.t. X plus some error
• The error is assumed to be independent, identically distributed Gaussian noise

Intercepts and gradients
[Figure: a straight line plotted on X and Y axes, with the intercept b₀ marked where the line crosses the Y-axis. The slope b₁ indicates the amount of increase in Y for a one-unit increase in X.]

Describing a straight line
Bivariate regression model: Yᵢ = b₀ + b₁Xᵢ + εᵢ
Bivariate prediction equation: Y′ = a + bX
• Y′: predicted value
• a: intercept

Intercepts and gradients
• We want to predict Y given X
• We are modelling Y using a linear equation

Height  Weight
55      140
61      150
67      152
83      220
65      190
82      195
70      175
58      130
65      155
61      160

Intercepts and gradients
• β₀ = 120, β₁ = 2.6
• The slope means that every inch of height is associated with 2.6 additional pounds of weight

How good is the model?
• Like the mean, you can think of the regression line as a model based on the data.
• Also like the mean, this model (the line of best fit) might not reflect reality (the data in your sample).
• Like before, we need some way of testing how well the model fits the observed data.
• How?

The method of least squares
[Figure: a scatterplot of some data with a line representing the general trend. The vertical dotted lines represent the differences (residuals) between the line and the actual data; points above the line give positive (+) deviations, points below give negative (−) deviations.]

Least squares: Sums of squares
• Least squares estimate: the line that minimizes the sum of squared errors
• Sum of squared error = 3174
• If we don't get to vary the slope from 0, the line minimizing the squared error is the horizontal line that passes through ȳ. What value is this?

Least squares: Sums of squares
• What if we shift the intercept and slope in this direction?
• Sum of squared error = 7050
• What happens to the sum of squared error?

Least squares: Sums of squares
• What if we shift the intercept and slope in this direction?
• Sum of squared error = 885
• What happens to the sum of squared error?

Least squares: Sums of squares
• What about this?
• Sum of squared error = 93
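To make the least-squares idea concrete, here is a small R sketch (not from the slides) that computes the sum of squared errors for any candidate line using the height/weight data above, and then confirms that the line lm() finds cannot be beaten:

# Height/weight data from the slides
height <- c(55, 61, 67, 83, 65, 82, 70, 58, 65, 61)
weight <- c(140, 150, 152, 220, 190, 195, 175, 130, 155, 160)

# Sum of squared errors for a line with intercept b0 and slope b1
sse <- function(b0, b1) {
  predicted <- b0 + b1 * height
  sum((weight - predicted)^2)
}

# SSE for the flat line through the mean of Y (slope forced to 0)
sse(mean(weight), 0)

# SSE for the least-squares line found by lm(): no other line does better
fit <- lm(weight ~ height)
sse(coef(fit)[1], coef(fit)[2])

Trying other intercept/slope pairs in sse() and watching the value rise is exactly the search the least-squares estimate wins.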
Types of sum of squares
• SST: total variability (the variability between the scores and the mean). SST uses the differences between the observed data and the mean value of Y.
• SSM: model variability (the difference in variability between the model and the mean). SSM uses the differences between the mean value of Y and the regression line.
• SSR: residual/error variability (the variability between the regression model and the actual data). SSR uses the differences between the observed data and the regression line.

Linear Regression: R
• We run linear regression analyses using the lm() function
• lm stands for 'linear model'
• The lm() function can feel, and is, complicated:
• Type ?lm in the console and look at the help documentation
• There are lots of arguments you can specify; most won't make sense to you now
• There are only two you need to worry about: formula and data

Regression in R
• The lm() function takes the general form:
yourmodel = lm(formula = X, data = X)
• formula specifies the regression model. For bivariate regression (a single predictor), the formula takes the form: outcome ~ predictor ("predicted from")
• data is the data frame containing the variables of interest
• yourmodel = lm(outcome ~ predictor, data = dataFrame)

Linear Regression: An example
• Your friend is a new parent who finds that they always feel grumpy, and they are not sure why. They collected some data on how long they slept and how long their child slept.
• Data: sleep duration for parent and child for 100 nights, plus ratings of grumpiness
• Outcome variable: grumpiness
• Predictor variable: parent sleep duration

Linear Regression: An example
Load the data:
parenthood = read.csv("/your path/Parenthood.csv")
What does the data look like? Plot the data. Find the model that best explains the data.

Linear Regression: An example
Load the data:
parenthood = read.csv("/your path/Parenthood.csv")
Set up the model:
parenthood.1 <- lm(grump ~ parent_sleep, data = parenthood)   # "predicted from"
OR
parenthood.1 = lm(parenthood$grump ~ parenthood$parent_sleep)

Output of a simple regression
• We have created an object called parenthood.1 that contains the results of our analysis.
• If you type parenthood.1 into the console, it will output the two coefficients of the regression: 1) the intercept and 2) the slope.
• Note: this output does not show us whether the regression is significant. We need to run hypothesis tests for the regression model.

Testing the model
• To run hypothesis tests, we need hypotheses
• One model might be that there is no relationship between the predictor and the outcome (null)
• Formally, this tests the model when there are no predictors and we include only the intercept (what do you think this is?)
• The other is that there is a relationship between the predictor and the outcome, and the model does a good job of predicting that relationship (alternative)
• How do we test this? We analyze differences in sources of variance. In other words, it's an analysis of variance. Where have you heard this before?
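As a minimal sketch of this testing logic, assuming Parenthood.csv has the grump and parent_sleep columns used in the slides, you can fit both the null (intercept-only) model and the one-predictor model and compare them with R's anova() function:

# Load the data (the path is a placeholder, as in the slides)
parenthood <- read.csv("/your path/Parenthood.csv")

# Null model: intercept only, i.e. predict the mean grumpiness every night
parenthood.0 <- lm(grump ~ 1, data = parenthood)

# Alternative model: grumpiness predicted from parent sleep
parenthood.1 <- lm(grump ~ parent_sleep, data = parenthood)

# Compare the two models by analyzing their sources of variance
anova(parenthood.0, parenthood.1)

The F-test this comparison produces is the same one reported at the bottom of summary(parenthood.1).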
Testing the model: Analysis of variance (ANOVA)
• SST: total variance in the data
• SSM: improvement due to the model
• SSR: error in the model
• If the model results in better prediction than using the mean, then we expect SSM to be much greater than SSR

Testing the model: Analysis of variance (ANOVA)
[Figure: three panels showing the deviations behind each sum of squares. SST uses the differences between the observed data and the mean value of Y; SSM uses the differences between the mean value of Y and the regression line; SSR uses the differences between the observed data and the regression line.]

Testing the model: R²
• R²: the proportion of variance accounted for by the regression model
• It is the Pearson correlation coefficient squared
R² = SSM / SST = Σ(Ŷᵢ − Ȳ)² / Σ(Yᵢ − Ȳ)²

Testing the model: Analysis of variance (ANOVA)
• Mean squared error
• Sums of squares are total values.
• They can be expressed as averages.
• These are called mean squares, MS.
F = MSM / MSR
• Divide each sum of squares by its degrees of freedom: for MSM that is k (the number of predictors); for MSR that is N (the number of participants) minus k minus 1.

Testing the model: R
• parenthood.1 = lm(grump ~ parent_sleep, data = parenthood)
• To determine if the regression is significant, output the contents of the parenthood.1 object by executing: summary(parenthood.1)

Interpreting results
• Overall model fit: R² = .82
• The proportion of Y (grumpiness) explained by the predictor X (parent's sleep duration)
• √R² = Pearson r! Run a correlation between the two variables to test this for yourself
• Model estimate: −8.94
• This is the slope
• It also represents the unit change in the outcome associated with a unit change in the predictor.
• What does this mean? For every unit change in the parent's sleep (we measured this in hours) there is a corresponding change of −8.94 units of grumpiness.
• Every lost hour of sleep will result in an 8.94-unit increase in grumpiness

Using the model to make predictions
Bivariate regression model: Yᵢ = b₀ + b₁Xᵢ + εᵢ
Bivariate prediction equation:
Grumpᵢ = b₀ + b₁ × ParentSleepᵢ
= 125.96 + (−8.94 × sleep duration)
= 125.96 + (−8.94 × 12)
= 18.68
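To tie these formulas to R output, here is a rough sketch (using the same assumed column names and the parenthood.1 model from above) that computes SST, SSM, SSR, R², and F by hand, checks them against summary(), and reproduces the 12-hour prediction with predict():

y <- parenthood$grump
y.hat <- fitted(parenthood.1)        # the model's predicted values

SST <- sum((y - mean(y))^2)          # total variability
SSM <- sum((y.hat - mean(y))^2)      # improvement due to the model
SSR <- sum((y - y.hat)^2)            # error left in the model

SSM / SST                            # R-squared

k <- 1                               # number of predictors
N <- nrow(parenthood)                # number of nights
(SSM / k) / (SSR / (N - k - 1))      # F = MSM / MSR

summary(parenthood.1)                # should match the values above

# Predicted grumpiness after 12 hours of sleep: 125.96 + (-8.94 * 12)
predict(parenthood.1, newdata = data.frame(parent_sleep = 12))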