Regression Analysis PDF
Macquarie University
Summary
This document contains lecture notes for PSYU2248/PSYX2248 (Design and Statistics II). It revises correlation and scatterplots, explains the link between statistics and research design, outlines the steps in the research process, and includes a Stata overview for data analysis. The notes cover covariance, correlation coefficients for different variable types (numeric, dichotomous), confidence intervals around a correlation, and an introduction to simple linear regression and its assumptions.
Full Transcript
Week 1 Part 2: Revision (plus) of Correlation PSYU2248/PSYX2248: DESIGN AND STATISTICS II Week 1 Part 2 Overview WHAT WILL WE COVER TODAY? This part of the lecture will cover: - Reminder of link between stats and research design - Stata walk-through - Revision (plus) of correlation and scatterplots - Confidence intervals around a correlation - A manual example: heart disease and pace of life - A Stata example: wealth inequality 2248 Week 1 Research Process + Design Research Context of Statistics Remember: statistical analysis always exists within a research context ― We rarely do stats just because (despite the fun!) Before running a statistical analysis, we need to remember the research context: ― What is the RQ / hypothesis? ― What is the sample and broader population? ― What is the unit of measurement? We need to understand the stat details (e.g. alternate hypothesis, null hypothesis, alpha level, etc.) to properly apply and understand the analyses, but… After running a statistical analysis, we need to relate it back to the research context: ― How does this answer our RQ/hypothesis? ― What does this mean? 2248 Week 1 Research Steps (aka scientific method) RESEARCH PROCESS + DESIGN 1. Make an observation 2. Review the literature, identify the theory (/ theories) that govern 3. Generate aims, RQ, and hypotheses 4. Design the study ― Select the appropriate study design ― Identify the population, sample, and sampling method ― Identify how to measure your constructs or phenomena (operationalise) 5. Obtain ethical approval for the study 6. Run the study, collect the data ― Recruit participants Most of 2248 is focused on this, but never forget it occurs ― Disseminate the survey / run the experiment / do the thing… in the broader research 7. Analyse the data context!! 8. Write up and disseminate the findings (e.g. academic paper) 2248 Week 1 Design + Statistics Steps For the rest of the statistical analyses in this unit (regression, ANOVA, non-parametric analyses), we’ll follow a standard process: Before getting into the data, we must understand (design steps): 1. Our research questions and hypotheses we are trying to answer with our data 2. Our sampling population 3. How our variables measured (type and scale) Then, getting into the data analysis, we then (statistics steps): 4. Describe variables using appropriate UNIVARIATE numeric and graphical summaries 5. Describe variables using appropriate BIVARIATE numeric and graphical summaries 6. Fit appropriate statistical model(s) 7. Formally test assumptions 8. Interpret results + draw conclusions 2248 Week 1 Stata Recap Stata Stata can be downloaded directly on your computer from MQ student website: Version 18 https://students.mq.edu.au/support/technology/software/stata Data files can be opened in Stata: download the file (.dta for Stata data files, or import an excel or other file) and open it directly in Stata Need more Stata info?! Very helpful Stata Youtube videos at this playlist: https://www.youtube.com/playlist?list=PLN5IskQdgXWnnIVeA_Y0OBGmnw21fvcmU 2248 Week 1 Things you should know how to do in Stata Open up a data file (Stata data file.dta as well as import Excel file) Look around the data file to identify variables, number of observations, etc. 
Run descriptive statistics for all kinds of variables (numeric summaries and graphical summaries) ― Univariate summaries and bivariate (two variable) summaries Create a new variable, attach value labels to categorical variables Run statistical analyses: one-sample t-tests, independent t-tests, paired t-tests, correlations, chi-square goodness of fit, chi-square test of independence Run assumption checks: Shapiro-wilk, Levene’s test, linearity 2248 Week 1 Revision (plus) of Correlations + Scatterplots Correlation + Design CORRELATION + SCATTERPLOTS Remember the distinction between experimental and non-experimental design Correlation analysis naturally* only follows non-experimental design (*almost always… 99.9%!) “Correlational design” means simply measuring pre-existing phenomena (no intervention or involvement or anything from the researcher) Remember correlation doesn’t equal causation! Criteria for a cause-and-effect (causal) relationship: 1. Covariance rule: there must be a relationship! 2. Temporal precedence: the cause must precede the effect! 3. Internal validity: excluding other potential causes of the effect Correlations can be true correlations or spurious correlations! 2248 Week 1 Dancing Statistics: Correlation CORRELATION + SCATTERPLOTS 2248 Week 1 https://www.youtube.com/watch?v=VFjaBh12C6s&ab_channel=TheBritishPsychologicalSociety Correlation vs Causation CORRELATION + SCATTERPLOTS Example: in US cities, infant mortality rate and number of doctors in population strongly correlated Could be… ― (X -> Y) Presence of doctors kills babies? ― (Y -> X) Higher mortality rate causes more doctors to be present ― (Z -> X and Z -> Y) another factor, e.g. population density, causes both Howell text, Fig. 9.2 2248 Week 1 Spurious Correlation CORRELATION + SCATTERPLOTS A correlation between two variables can be spurious or true ― True doesn’t mean causal, instead direct relationship Spurious correlations could be due to: ― A third factor, AKA common cause ▪ E.g. height and intelligence in children ― or random chance ▪ E.g. worldwide space launches and sociology PhDs https://www.tylervigen.com/spurious-correlations https://dx-doi-org.simsrad.net.ocs.mq.edu.au/10.4135/9781412984522.n1 2248 Week 1 Correlation Coefficient CORRELATION + SCATTERPLOTS Is there a linear relationship between two variables? ― Is there a positive relationship? Or a negative relationship? Pearson’s product-moment correlation ― Population correlation represented by 𝞀 (rho) ― Sample correlation represented by r ― H0: 𝞀 = 0 ― H1: 𝞀 0 Two numeric variables (could be IV and DV, but not necessarily) 2248 Week 1 Correlation and Covariance CORRELATION + SCATTERPLOTS Correlation measures co-variation ― Different statistic = covariance; ― Correlation coefficient is just covariance, standardised! Strength of correlation: ― From 0 to 0.10 = no real relationship ― 0.10 to 0.30 = weak relationship ― 0.30 to 0.50 = moderate relationship ― 0.50 to 1 = strong relationship 2248 Week 1 Strong vs weak correlations CORRELATION + SCATTERPLOTS Don’t confuse steepness of the slope with how tightly clustered the points are! 
Perfect correlation = data points making a straight [pos or neg] line
― Doesn't matter how steep the line is, as long as there is some positive/negative straight line trend
Weak correlation = data points more dispersed
― Still doesn't matter how steep the line is
No correlation = no [pos or neg] linear pattern to the data points
https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#/media/File:Correlation_examples2.svg
2248 Week 1
Different types of Correlation Coefficients CORRELATION + SCATTERPLOTS
Pearson's product-moment correlation: the normal one
Spearman's correlation rs
― Correlation on ranked data: use for violated assumptions (e.g. relationship monotonic but non-linear? Ordinal variables?)
― Non-parametric correlation: we'll cover towards the end of the unit
Point-biserial correlation rpb
― Use for one numeric and one dichotomous variable
― We'll come back to this concept in regression
Phi φ
― Use for two dichotomous variables
― Largely ignore this (we'll do Chi-square and Cramer's V instead)
2248 Week 1
Scatterplots CORRELATION + SCATTERPLOTS
For any scatterplot, we should consider 7 aspects…
1. Monotonic (does the trend keep in one direction?)
2. Linear (can it be best summarised by a straight line?)
3. Direction of association (positive or negative?)
4. Effect of X on Y (how steep is the slope?)
5. Correlation (how strong is the correlation?)
6. Gaps (are there any gaps?)
7. Outliers (are there any outliers?)
We need to check the scatterplot FIRST, to see if a formal correlation test is appropriate!
Forms part of our assumption testing: the association needs to
― Be linear (and monotonic – if it's linear, then it is also monotonic)
― Have no gaps or problematic outliers
for a correlation to be appropriate
2248 Week 1
Scatterplots CORRELATION + SCATTERPLOTS Positive relationship Negative relationship No relationship 2248 Week 1
Outliers CORRELATION + SCATTERPLOTS Outliers are only problematic if they're distorting the results 2248 Week 1
Assumptions of Correlation CORRELATION + SCATTERPLOTS
Blurry line between formal assumptions and important checks of the data…
Numeric data: both variables measured on a numeric (interval/ratio) measurement scale
Independence of observations
Monotonic, linearity of any given relationship (nothing non-linear present) NB: this DOESN'T mean that there HAS to be a linear relationship, just that there isn't anything non-linear
No major gaps or outliers (inspect the scatterplot)
We'll cover assumptions in more detail in Linear Regression (talk about normality + homoscedasticity)
2248 Week 1
Calculating Correlation and Covariance CORRELATION + SCATTERPLOTS
The correlation formula can be expressed multiple (equivalent) ways:
r = cov(x, y) / (s_x s_y)
r = [Σ(x − x̄)(y − ȳ) / (n − 1)] / (s_x s_y)
r = [nΣxy − (Σx)(Σy)] / √{[nΣx² − (Σx)²][nΣy² − (Σy)²]}
Test statistic for a correlation: t = r√(n − 2) / √(1 − r²) (breathe…)
― n = sample size (here, number of pairs – not observations)
― x = our x variable
― y = our y variable
― s = standard deviation
― Σ = sum
― cov = covariance
2248 Week 1
Covariance: A Simple Example CORRELATION + SCATTERPLOTS
Let's test the relationship between mood and eating
― Mood measured using a 9-point rating scale (higher = more positive)
― Eating measured by calories consumed
X (mood)   Y (eating)
6          480
4          490
7          500
4          590
2          600
5          400
3          545
1          650
M = 4      M = 531.875
Slide callouts: below-average mood corresponds with above-average eating; above-average mood corresponds with below-average eating
2248 Week 1
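Not part of the original slides, but a quick machine check of the covariance worked out on the next slide: a minimal Stata sketch that enters the eight mood/eating pairs by hand (the variable names mood and eating are illustrative only) and asks Stata for the covariance and the correlation.
* Illustrative only: enter the eight pairs from the slide, then compute covariance and correlation
clear
input mood eating
6 480
4 490
7 500
4 590
2 600
5 400
3 545
1 650
end
correlate mood eating, covariance   // covariance matrix: cov(mood, eating) comes out around -119.3
correlate mood eating               // correlation matrix: r comes out around -.74
pwcorr mood eating, sig obs         // same correlation with a p-value and the number of pairs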
Calculating Covariance CORRELATION + SCATTERPLOTS
Correlation = standardized covariance; Covariance = unstandardized correlation
covariance = Σ(x − x̄)(y − ȳ) / (n − 1) = −835 / 7 = −119.3
Covariance is an unstandardized measure: it's in the unit of measurement that our variables are in
― No standard interpretation for what is 'big' vs 'small'
― Not bounded by +/− 1 like correlation
― The correlation here [trust me!] is r = -.74 (v strong)
2248 Week 1
Confidence Intervals CORRELATION + SCATTERPLOTS
Remember – point estimates vs interval estimates
― Point estimates are more precise, but less likely to be accurate because of that precision
― I predict my dog will jump up on my bed at exactly 3am – vs – I predict my dog will jump up on the bed sometime between 1:30am and 4am
When making generalisations from our sample back to the wider population:
― Predictions based on the precise value we obtained from our sample are unlikely to be precisely accurate
― Point estimates will vary from sample to sample (sampling variability)
― Interval estimates are more realistic
Confidence intervals are interval estimates: if we performed the study many times, 95% of the results would contain the true population estimate (also expressed as: 95% confident the population [value of whatever] lies within those two bounds)
If you're interested: the Howell textbook briefly discusses these two different interpretations in 9.11; also see https://training.cochrane.org/handbook/current/chapter-15 section 15.3.1 for a good explanation
We can get 95% CIs around a correlation!
2248 Week 1
Confidence Intervals CORRELATION + SCATTERPLOTS
Basic formula for 95% CI: point estimate +/− 1.96 × SE
― 1.96 is the critical z score for a 5% error rate (alpha)
― SE (standard error) is a measure of variability; there is a different calculation of SE for different kinds of values (e.g. SE of mean vs SE of correlation vs SE of regression coefficient)
In correlation: SE = 1 / √(n − 3)
― The bigger the sample size, the smaller the SE, therefore the narrower the interval
― The smaller the sample size, the bigger the SE, therefore the wider the interval
Calculating a CI for a correlation involves a few steps…
― The correlation coefficient r needs to be transformed into a z
― We're not going to do it by hand – but you should understand the principle!
▪ The smaller the sample size, the wider the CI, the less precise the estimate
▪ The bigger the sample size, the narrower the CI, the more precise the estimate
Tip: if the 95% CI doesn't cross over 0, then the correlation is stat. significant, p <.05!
2248 Week 1
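We won't do the r-to-z transformation by hand, but as a hedged sketch of the principle (an addition, not from the slides): Stata's atanh() and tanh() functions can reproduce the Fisher-transformation steps that a package like ci2 performs, shown here with the pace-of-life correlation that appears in the next section (r = .3651, n = 36).
display atanh(.3651)                         // Fisher z transform of r: about 0.38
display 1/sqrt(36 - 3)                       // SE of z = 1/sqrt(n - 3): about 0.17
display tanh(atanh(.3651) - 1.96/sqrt(33))   // lower bound of the 95% CI: about .042
display tanh(atanh(.3651) + 1.96/sqrt(33))   // upper bound of the 95% CI: about .619
These bounds match the ci2 output reported for that example below.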
Stata Commands CORRELATION + SCATTERPLOTS
twoway scatter var1 var2 (or, can just use scatter var1 var2!) will get you a scatterplot
― Adding line of best fit via lfit var1 var2 afterwards
― DV goes on Y axis (variable specified first), IV goes on X axis (specified second)
The pwcorr command is the preferred correlation command: basic pwcorr var1 var2 will give you Pearson's correlation coefficient
― Options after a comma can be added for p-value, number of observations, etc: pwcorr var1 var2, sig obs is a good idea for you to do by default!
Confidence Interval package ci2 can be installed to produce 95% CI around correlation
― Install via net install ci2 (if that gives you a red error message, try this instead: ssc install ci2, and if it still doesn't work, type net search ci2 and click on the package called "ci2")
― ci2 var1 var2, corr to get the 95% CI around the correlation coefficient
2248 Week 1
Manual Example: Pace of Life and Heart Disease
Example: Pace of life and heart disease PACE OF LIFE AND HEART DISEASE
Example from Levine (1990) as used in Howell Fundamental Statistics 9.2
Background: are people who live fast-paced lives more prone to heart disease?
― Psych context: think experience of stress, stress appraisal, etc.
Research hypothesis: there will be a positive relationship between pace of life and heart disease (the faster the pace of life, the higher the incidence of heart disease)
Study design: non-experimental, observational
Unit of observation and sample: 36 individual US cities
Measurement of pace of life:
― Composite of 3 factors: "…surreptitiously used a stopwatch to record the time that it took a bank clerk to make change for a $20 bill, the time it took an average person to walk 60 feet, and the speed at which people spoke"
― Numeric variable
Measurement of heart disease:
― "…the age-adjusted death rate from ischemic heart disease for each city"
― Numeric variable
2248 Week 1
Example: Pace of life and heart disease PACE OF LIFE AND HEART DISEASE
Data used and referred to in Howell textbook
Find the data itself at https://www.uvm.edu/~statdhtx/fundamentals9/DataFiles/Fig9-5.dat
(Side note: excel can create scatterplots too!)
[Scatterplot: Pace of Life (x axis, roughly 14–30) against Heart Disease (y axis, roughly 5–35)]
2248 Week 1
Calculating Correlation PACE OF LIFE AND HEART DISEASE
Calculating the correlation by hand using the long formula (gasp):
r = [nΣxy − (Σx)(Σy)] / √{[nΣx² − (Σx)²][nΣy² − (Σy)²]}
= [36(16487.4) − (822.32)(713)] / √{[36(19101.7) − (822.32)²][36(15073) − (713)²]}
= (593546.4 − 586314.16) / √{(687661.2 − 676210.1824)(542628 − 508369)}
= 7232.24 / √{(11451.0176)(34259)}
= 7232.24 / 19806.57
= .36514
Correlation is r =.37: In US cities, the faster the pace of life, the higher the incidence of heart disease, with a moderate sized effect
2248 Week 1
Correlation in Stata PACE OF LIFE AND HEART DISEASE
Scatterplot with line of best fit in Stata: scatter heart pace || lfit heart pace
Assumptions for correlation are met
― Numeric data
― No non-linear trend present
― No gaps or outliers
― …etc
H0: ρ = 0
H1: ρ ≠ 0
Degrees of freedom: sample size minus 2
. pwcorr pace heart, sig obs
             pace    heart
    pace   1.0000
              36
   heart   0.3651   1.0000
           0.0285
              36       36
First row is the correlation coefficient, second row is the p-value, third row is the no. of observations
Reject the null hypothesis of no correlation as p <.05
2248 Week 1
Reporting Correlation PACE OF LIFE AND HEART DISEASE
Reporting in APA format!
There is a statistically significant, positive, moderate correlation between pace of life and heart disease, r(34) =.37, p =.028. In US cities, the faster the pace of life, the greater the incidence of heart disease tends to be.
. pwcorr pace heart, sig obs
pace heart
pace 1.0000
36
heart 0.3651 1.0000 0.0285 36 36 95% CI: from.042 to.619 ― Wide range! Not very precise estimate. ci2 heart pace,corr Confidence interval for Pearson's product-moment correlation of heart and pace, based on Fisher's transformation. Correlation = 0.365 on 36 observations (95% CI: 0.042 to 0.619) 2248 Week 1 Confidence Interval: Sample Size PACE OF LIFE AND HEART DISEASE Remember from prev: sample size affects width of CI Let’s demonstrate: same data points, same correlation coefficient, but bigger sample = narrower CI 95% CI 95% CI lower bound correlation upper bound n = 36 n=36 0.042 0.365 0.619 n=72 0.146 0.365 0.55 n=144 0.214 0.365 0.499 n = 72 n = 144 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Correlation Coefficient 2248 Week 1 Stata Example: Wealth Inequality An example using Stata: Wealth Inequality WEALTH INEQUALITY Example from Open Stats Lab https://sites.google.com/view/ope nstatslab/home Real psych paper: Dawtry et al. (2015) https://doi.org/10.1177%2F09567 97615586560 (our exercise is from Study 1a) 2248 Week 1 Dawtry et al. (2015) Wealth Inequality WEALTH INEQUALITY Dawtry, Sutton, and Sibley (2015) wanted to examine why people differ in their assessments of the increasing wealth inequality within developed nations. Previous research reveals that most people desire a society in which the overall level of wealth is high and that wealth is spread somewhat equally across society. However, support for this approach to income distribution changes across the social strata. In particular, wealthy people tend to view society as already wealthy and thus are satisfied with the status quo, and less likely to support redistribution. In their paper Dawtry et al. (2015) sought to examine why this is the case. The authors propose that one reason wealthy people tend to view the current system is fair is because their social- circle is comprised of other wealthy people, which biases their perceptions of wealth, which leads them to overestimate the mean level of wealth across society. 2248 Week 1 Study Methods + Hypotheses WEALTH INEQUALITY Design: cross-sectional online survey study Sample: 305 US adults recruited from an online survey pool Amazon’s Mturk Participants reported: ― their own annual household income household_income ― estimated income level of those within their own social circle social_circle_mean_income ― estimated income for the entire population population_mean_income ― their attitudes toward redistribution of wealth (measured using a four-item scale) redist1 – redist4 We’ll focus on the variables above (there are many more in the dataset!) All these are either numeric (income variables) or ordinal (redist items) 2248 Week 1 Study Methods + Hypotheses WEALTH INEQUALITY The hypotheses we’re testing: People with lower support of redistribution of wealth will tend to be wealthier, have wealthier friends and estimate population wealth as higher ▪ Pairwise correlations between 4 variables: attitudes towards redistribution of wealth (DV) and each of household income, social circle income, and population income (three IVs) 2248 Week 1 Redistribution Composite Variable WEALTH INEQUALITY 4 items measuring attitudes towards redistribution of wealth, adapted from a Gallup Organization (1998) poll, rated on a Likert-type scale 1 = strongly disagree, 6 = strongly agree 1. The government should redistribute wealth through heavy taxes on the rich. 2. The government should not make any special effort to help the poor, because they should help themselves. 3. 
Money and wealth in this country should be more evenly distributed among a larger percentage of people. 4. The fact that some people in the US are rich and others are poor is an acceptable part of our economic system. Higher scores on 1 & 3 = more in favour of redistribution Higher scores on 2 & 4 = less in favour of redistribution We need to reverse-score (AKA reverse-code) items 2 & 4 so that all items are consistently scored (going in the same direction) before averaging to make a composite score (one variable that represents attitudes towards redistribution) 2248 Week 1 Over to Stata now! WEALTH INEQUALITY Steps in our exercise are: 1. Opening data file and inspecting data 2. Reverse-scoring redist2 and redist4 3. Creating (generating) a new composite variable, support_for_redistribution 4. Running univariate descriptives on all key variables 5. Producing scatterplots 6. Running correlational analysis and confidence intervals 7. Interpreting results 2248 Week 1 Results WEALTH INEQUALITY The hypotheses we’re testing: People with lower support of redistribution of wealth will tend to be wealthier, have wealthier friends and estimate population wealth as higher ▪ Pairwise correlations between attitudes towards redistribution of wealth (DV) and each of household income, social circle income, and population income (each IVs) There were statistically significant, negative, weak correlations between attitudes towards redistribution of wealth and each of the three income variables. Individuals who reported lower support for redistribution of wealth had higher average household income, r(299) = -.21, p <.001, 95% CI [-.32, -.10], had higher social circle average income, r(303) = -.25, p <.001, 95% CI [-.35, -.14], and also estimated the population average income as higher, r(303) = -.18, p =.002, 95% CI [-.28, -.07]. This offers support for the theory that one reason wealthy people tend to view the current system is fair is because their social-circle is comprised of other wealthy people, which biases their perceptions of wealth, which leads them to overestimate the mean level of wealth across society. 2248 Week 1 Lecture Learning Outcomes THE END! After this week’s lecture, you know: ― What correlational research methods are, and when a correlational analysis is appropriate ― The concept of a correlation and what a correlation coefficient tests ― How to manually calculate covariance and correlation (…but won’t be asked to do it!!) ― How to interpret scatterplots, correlation coefficients + confidence intervals, and write them up In Stata, you should be able to: ― Open data files ― Recode and create new variables ― Run univariate descriptive statistics (numeric and graphical) ― Produce scatterplots, correlational analysis, and confidence intervals ― Create and save a.do file for your commands (syntax) 2248 Week 1 Week 2: Simple Linear Regression PSYU2248/PSYX2248: DESIGN AND STATISTICS II Week 2 Overview WHAT WILL WE COVER TODAY? 
This week’s lecture will cover: - Regression Line + Regression Model - Testing Statistical Significance in Regression - Running Regression in Stata - Standardized effect sizes in regression - Predicting scores from the regression equation - Stata Demonstration - Conclusions 2248 Week 2 2 Introduction to Regression Scatterplots CORRELATION AND REGRESSION Positive relationship Negative relationship No relationship 2248 Week 2 4 Equation of a straight line CORRELATION AND REGRESSION https://www.mathsisfun.com/equation_of_line.html 2248 Week 2 5 Correlation into Regression CORRELATION AND REGRESSION Correlation and regression are inter-related analyses: regression is an extension of correlation ― Regression: predict Y from X ― Correlation: relationship between X and Y Still a numeric DV and a numeric IV* (*caveat – we will see exceptions to this!) Non-experimental designs (see exceptions with categorical variables!) We want to explain variation in Y (our DV, our outcome) ― Why some people have higher scores, and some people have lower scores ― What can explain or predict this variation? ― What factors are reliably associated with this outcome? Regression-type RQs are phrased like this Regression models describe this predictive relationship 2248 Week 2 6 Scatterplot + Regression Line CORRELATION AND REGRESSION You’re already familiar with scatterplots (and line of best fit AKA regression line) Regression model produces the equation of that line! 𝑌 = 𝛼 + 𝛽𝑥 Line has two properties: ― Intercept (“a”) – the value of Y when X = 0 (AKA the point the line cuts the Y axis) ― Slope (“b”) – how steep (vs flat) the line is, and what direction (positive or negative) Regression model fits that line (defines it), and tells us how well it fits ― How well can we predict Y, given a value of X? ― How much variance in Y can we explain? 2248 Week 2 7 Scatterplot + Regression Line CORRELATION AND REGRESSION Line has two properties: Intercept (“a”) ― the value of Y when X = 0 ― the point the line cuts the Y axis Slope (“b”) ― how steep (vs flat) the line is, and what direction (positive or negative) ― average change in Y per unit increase in X Agresti, Fig 9.2 2248 Week 2 8 Regression + Research Design Why do we do research? REGRESSION RESEARCH DESIGN As psychological scientists, we want to… 1. Describe human behaviour (what is going on?) 2. Predict human behaviour (use information about one thing to make a decent guess at something else) 3. Explain human behaviour (understand why things are related) 4. Control human behaviour (make change!) These different levels of questions correspond to different research questions and study designs (and approaches to data analysis and presentation) 2248 Week 2 10 Prediction REGRESSION RESEARCH DESIGN In regression, we predict the DV from the IV: prediction isn’t causal!!! ― If we successfully predict one thing (DV) from another thing (IV), that could be because IV causes DV, but isn’t necessarily! 
― IVs are also called predictors in regression For example: ― If you’re a psych student, I predict you’re female ▪ Definitely not causal ― If you answer questions in class, I predict you’re high on extraversion ▪ Definitely not causal (could be reverse-causal) ― If you watch lectures and complete prac exercises, I predict you’ll do well in the final exam ▪ Could be causal, but also many other factors (potential confounds) Prediction isn’t causal – but, often we suspect the IV might cause the DV, we just cannot determine it without the right study design Each individual study is conducted within a context, builds upon other research: coupled with experimental research, it can provide supportive evidence for a causal relationship, but cannot determine it on its own 2248 Week 2 11 Regression Research Designs REGRESSION RESEARCH DESIGN Regression analyses (typically) apply to non-experimental designs For example: ― Cross-sectional survey studies where many factors are measured ― Longitudinal studies where something at one time point is used to predict something at a different time point Common to have a number of factors (IVs) to predict an outcome: multiple regression (next week!) 2248 Week 2 12 2248 Week 2 https://dx-doi-org.simsrad.net.ocs.mq.edu.au/10.1007/s00221-021-06182-w 13 2248 Week 2 https://dx-doi-org.simsrad.net.ocs.mq.edu.au/10.1093/geront/gnz054 14 2248 Week 2 https://dx-doi-org.simsrad.net.ocs.mq.edu.au/10.1007/s12144-021-02076-w 15 Regression Line + Model Regression concepts REGRESSION LINE + MODEL Remember: regression is about explaining variance in the DV, Y We’ll introduce some concepts: ― Residual (error): difference between predicted Y and actual Y ― Variance around the regression line (sums of squares residual) ― Variance explained by the regression line (sums of squares regression model) ― R2, a measure of effect size ― Conditional and marginal distribution Simple linear regression: a single independent variable Multiple linear regression: multiple independent variables We’ll talk about “the model” (regression model) ― Simple linear regression: “the model” = the one IV ― Multiple linear regression: “the model” = all the IVs 2248 Week 2 17 X and Y REGRESSION LINE + MODEL Simple example: x and y x y 0 1 Variability in y is what we’re trying to explain or predict 1 2 2 1 That degree of variability is represented by total sums of 3 2 squares (TSS): 4 3 ത 2 𝑇𝑜𝑡𝑎𝑙 𝑆𝑆 = (𝑌 − 𝑌) 5 5 6 4 7 5 Representing variability around the mean (Y-bar) 8 4 Regression aim: explain that variability by the model (represented by regression line) 2248 Week 2 18 Regression Line REGRESSION LINE + MODEL Simple example: x and y Regression line predicts a score of Y for any 6 given value of X (red triangle) 5 There are also actual scores of y for each 4 value of x (blue circles) 3 2 1 0 0 1 2 3 4 5 6 7 8 9 2248 Week 2 19 Regression Line REGRESSION LINE + MODEL Regression line predicts a score of Y for any given value of X 6 The closer the actual scores are to the predicted scores, the better the model predicts y, the less 5 variability around the line there is 4 The further away from the line (predicted scores) the actual scores are, the worse the model predicts 3 y, the more variability around the line there is 2 Difference between predicted Y and actual Y for 1 any given value of X is called a residual 0 0 1 2 3 4 5 6 7 8 9 2248 Week 2 20 Regression Line REGRESSION LINE + MODEL The closer the actual scores are to the predicted scores, the better the model predicts y, the less variability around the 
line there is 6 The further away from the line (predicted scores) the 5 actual scores are, the worse the model predicts y, the more variability around the line there is 4 Regression model minimises residual variance via 3 the method of least squares! 2 Aim: have less variability around the regression line than total variability in Y around the mean 1 (total sums of squares) 0 0 1 2 3 4 5 6 7 8 9 2248 Week 2 21 Total Variance (Sums of Squares) REGRESSION LINE + MODEL 6 ത 2 𝑇𝑜𝑡𝑎𝑙 𝑆𝑆 = (𝑌 − 𝑌) 5 4 Mean of y = 3 3 2 1 0 0 1 2 3 4 5 6 7 8 9 2248 Week 2 22 Residual or Error Variance (Sums of Squares) REGRESSION LINE + MODEL The regression model wants to minimize error variance around the regression line Residual variance, also called error variance (SSE) 6 2 𝑅𝑒𝑠𝑖𝑑𝑢𝑎𝑙 𝑜𝑟 𝐸𝑟𝑟𝑜𝑟 𝑆𝑆 = (𝑌 − 𝑌) 5 4 3 2 1 0 0 1 2 3 4 5 6 7 8 9 2248 Week 2 23 Residuals REGRESSION LINE + MODEL 2 𝑅𝑒𝑠𝑖𝑑𝑢𝑎𝑙 𝑜𝑟 𝐸𝑟𝑟𝑜𝑟 𝑆𝑆 = (𝑌 − 𝑌) 6 Each observation in the dataset has a residual (e): 5 e = (𝑌 − 𝑌) (difference between actual y and predicted y) 4 Positive residual = above the regression line 3 Negative residual = below the regression line Residuals sum to 0 (regression line sits in the middle 2 of data points!) 1 0 0 1 2 3 4 5 6 7 8 9 2248 Week 2 24 Regression or Model Variance (Sums of Squares) REGRESSION LINE + MODEL The regression model wants to minimize error variance around the regression line 𝑅𝑒𝑔𝑟𝑒𝑠𝑠𝑖𝑜𝑛 𝑜𝑟 𝑀𝑜𝑑𝑒𝑙 𝑆𝑆 = (𝑌 − 𝑌) ത 2 6 5 4 3 2 1 0 0 1 2 3 4 5 6 7 8 9 2248 Week 2 25 Sums of Squares Visualisation https://www.reddit.com/r/educationalgifs/comments/xf6mxr/visualizing_a_linear_regression_using_sum_of/?rdt=48635 2248 Week 2 26 Explaining Variance (Sums of Squares) REGRESSION LINE + MODEL “Explained” variance (regression or model SS) is the difference between total SS and residual/error SS Agresti, Statistical Methods for the Social Sciences, Fig. 9.12 2248 Week 2 27 Conditional and Marginal Distribution REGRESSION LINE + MODEL Marginal distribution of Y: spread or variance of scores around the mean Conditional distribution of Y: spread or variance of scores around the regression line, for any given value of X Remember the aim is to have less variation around the regression line compared to variation around the mean of Y We’ll revisit this in assumption testing 2248 Week 2 Agresti, Fig. 9.8 28 R-squared: Coefficient of Determination REGRESSION LINE + MODEL Regression aim: explain that variability by the model (represented by regression line) Regression model sums of squares is the difference between total SS and error/residual SS We use these SS values to represent that concept R2 AKA coefficient of determination: what proportion of total variance is explained by the regression model? 𝑆𝑆(𝑟𝑒𝑔𝑟𝑒𝑠𝑠𝑖𝑜𝑛) 𝑆𝑆(𝑡𝑜𝑡𝑎𝑙)−𝑆𝑆(𝑒𝑟𝑟𝑜𝑟) 𝑅2 = 𝑅2 = 𝑆𝑆(𝑡𝑜𝑡𝑎𝑙) 𝑆𝑆(𝑡𝑜𝑡𝑎𝑙) R2 ranges from 0 to 1 (often expressed as a percentage) ― Closer to 0 (or 0%), the weaker the effect, the less variance the model explains ― Closer to 1 (or 100%), the stronger the effect, the more variance the model explains R2 is literally the square of r, the correlation coefficient!* (*caveat: only simple regression. When we get to multiple regression, it isn’t!) The square root of R2 (i.e. R) is the correlation between the predicted values of Y and the actual values of Y 2248 Week 2 29 Regression Equation REGRESSION LINE + MODEL Regression equation can be written multiple (equivalent) ways: 𝑌 = 𝛼 + 𝛽𝑥 𝐸 𝑌 = 𝛼 + 𝛽𝑥 𝑌 = 𝛼 + 𝛽𝑥 + 𝜀 Y-hat / E(Y) is predicted or expected Y score (we cannot perfectly predict!) 
Alpha α is the value of the intercept in the population; a is the estimate from the sample Beta β is the value of the slope in the population; b is the estimate from the sample Epsilon ε is error (we cannot perfectly predict!) Intercept (“a”) ― the value of Y when X = 0 ― the point the line cuts the Y axis Slope (“b”) ― how steep (vs flat) the line is, and what direction (positive or negative) ― average change in Y per unit increase in X 2248 Week 2 30 Testing Statistical Significance in Regression Significance testing in Regression We have two kinds of effects that are tested in regression: ― Model-as-a-whole effects ― Individual variable or predictor effects (In simple regression, these are the same; in multiple regression, they are not!) Statistical significance is tested: ― with an F ratio and p-value for model-as-a-whole ― with a t statistic and p-value for an individual predictor We’ll still use.05 as our cut-off for statistical significance, “critical alpha” (…for now) 2248 Week 2 32 F ratio in Regression Significance of the model as a whole is tested via an F ratio 𝑆𝑆(𝑚𝑜𝑑𝑒𝑙) 𝑀𝑆(𝑚𝑜𝑑𝑒𝑙) ൗ𝑑𝑓(𝑚𝑜𝑑𝑒𝑙) 𝐹= 𝐹 = 𝑆𝑆(𝑟𝑒𝑠𝑖𝑑𝑢𝑎𝑙) 𝑀𝑆(𝑟𝑒𝑠𝑖𝑑𝑢𝑎𝑙) ൗ𝑑𝑓(𝑟𝑒𝑠𝑖𝑑𝑢𝑎𝑙) ― MS = mean square (sums of squares divided by degrees of freedom) ― df(model) = number of IVs/predictors in the model (always 1 for simple regression) ― df(residual) = N – df(model) – 1 H0: the model is no better than the intercept-only model (AKA the null model) H0: all regression coefficients = 0 H1: the regression model is significantly better than the null model H1: at least one regression coefficient ≠ 0 This will make more sense in multiple regression when we have multiple IVs!) Conceptually, testing the proportion of variance explained in the DV by the model – is it big enough to be statistically significant (likely reflecting a real effect in the population)? 2248 Week 2 33 t-statistic in Regression Significance of an individual IV or predictor is tested via a t-statistic 𝑏 𝑡= 𝑆𝐸(𝑏) b = beta coefficient SE(b) = standard error of beta coefficient H0: b = 0 H1: b ≠ 0 Conceptually, testing the effect of X on Y, the degree of change in Y per unit change in X – does it change enough to be statistically significant (likely reflecting a real effect in the population)? In simple regression, F = t2 (square root of the F-ratio is the t-statistic) 2248 Week 2 34 Statistical Significance vs Effect Size Remember: a statistically significant effect is not the same as a meaningful or large or useful effect! Statistical significance is assessed via p-values: p <.05 is stat. sig. (…for now) There are many effect sizes in regression: ― Model-as-a-whole: R-squared ▪ 2% - 12% = small effect ▪ 13% - 25% = medium effect ▪ 26% and above = large effect ― Individual IV: beta coefficient ▪ Beta coefficient = unstandardized beta, no standard cut-offs ▪ Need to understand the variables’ scales to appreciate whether it’s a big vs small effect For example: PAL attendance and final unit mark b = 5 For each additional PAL session attended, predict final mark to increase by 5 points ▪ To come: standardized beta (interpret similarly to correlation coefficient) 2248 Week 2 35 Running Regression in Stata Regression in Stata RUNNING REGRESSION IN STATA Regression is very simple in Stata! regress DV IV. 
regress y x Source | SS df MS Number of obs = 9 -------------+---------------------------------- F(1, 7) = 21.00 Model | 15 1 15 Prob > F = 0.0025 Residual | 5 7.714285714 R-squared = 0.7500 -------------+---------------------------------- Adj R-squared = 0.7143 Total | 20 8 2.5 Root MSE =.84515 ------------------------------------------------------------------------------ y | Coefficient Std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- x |.5.1091089 4.58 0.003.2419983.7580017 _cons | 1.5194625 1.93 0.096 -.2283336 2.228334 ------------------------------------------------------------------------------ 2248 Week 2 37 Regression Output RUNNING REGRESSION IN STATA ANOVA table Source SS df MS Number of obs = 9 F(1, 7) = 21.00 Model 15 1 15 Prob > F = 0.0025 Residual 5 7.714285714 R-squared = 0.7500 Adj R-squared = 0.7143 Total 20 8 2.5 Root MSE =.84515 y Coefficient Std. err. t P>|t| [95% conf. interval] x.5.1091089 4.58 0.003.2419983.7580017 _cons 1.5194625 1.93 0.096 -.2283336 2.228334 coefficients table 2248 Week 2 38 Total Variance (Sums of Squares) RUNNING REGRESSION IN STATA ത 2 𝑇𝑜𝑡𝑎𝑙 𝑆𝑆 = (𝑌 − 𝑌) Source SS df MS Number of obs = 9 F(1, 7) = 21.00 Model 15 1 15 Prob > F = 0.0025 6 Residual 5 7.714285714 R-squared = 0.7500 Adj R-squared = 0.7143 Total 20 8 2.5 Root MSE =.84515 5 4 y Coefficient Std. err. t P>|t| [95% conf. interval] 3 x.5.1091089 4.58 0.003.2419983.7580017 _cons 1.5194625 1.93 0.096 -.2283336 2.228334 2 1 0 0 1 2 3 4 5 6 7 8 9 2248 Week 2 39 Total Variance (Sums of Squares) RUNNING REGRESSION IN STATA 2 𝑅𝑒𝑠𝑖𝑑𝑢𝑎𝑙 𝑜𝑟 𝐸𝑟𝑟𝑜𝑟 𝑆𝑆 = (𝑌 − 𝑌) Source SS df MS Number of obs = 9 F(1, 7) = 21.00 Model 15 1 15 Prob > F = 0.0025 6 Residual 5 7.714285714 R-squared = 0.7500 Adj R-squared = 0.7143 5 Total 20 8 2.5 Root MSE =.84515 4 y Coefficient Std. err. t P>|t| [95% conf. interval] 3 x.5.1091089 4.58 0.003.2419983.7580017 _cons 1.5194625 1.93 0.096 -.2283336 2.228334 2 1 0 0 1 2 3 4 5 6 7 8 9 2248 Week 2 40 Total Variance (Sums of Squares) RUNNING REGRESSION IN STATA 𝑅𝑒𝑔𝑟𝑒𝑠𝑠𝑖𝑜𝑛 𝑜𝑟 𝑀𝑜𝑑𝑒𝑙 𝑆𝑆 = (𝑌 − 𝑌) ത 2 Source SS df MS Number of obs = 9 F(1, 7) = 21.00 Model 15 1 15 Prob > F = 0.0025 6 Residual 5 7.714285714 R-squared = 0.7500 Adj R-squared = 0.7143 5 Total 20 8 2.5 Root MSE =.84515 4 y Coefficient Std. err. t P>|t| [95% conf. interval] 3 x.5.1091089 4.58 0.003.2419983.7580017 _cons 1.5194625 1.93 0.096 -.2283336 2.228334 2 1 0 0 1 2 3 4 5 6 7 8 9 2248 Week 2 41 Stata Regression Output RUNNING REGRESSION IN STATA Sums of squares are additive Source SS df MS Number of obs = 9 F(1, 7) = 21.00 (15 + 5 = 20) Model 15 1 15 Prob > F = 0.0025 Degrees of freedom are additive Residual 5 7.714285714 R-squared = 0.7500 (1 + 7 = 8) Adj R-squared = 0.7143 Total 20 8 2.5 Root MSE =.84515 Total DF = sample size minus 1 Means squares are not additive! y Coefficient Std. err. t P>|t| [95% conf. interval] F = MS(model) / MS(residual) x.5.1091089 4.58 0.003.2419983.7580017 _cons 1.5194625 1.93 0.096 -.2283336 2.228334 R-sq = SS(model) / SS(total) 2248 Week 2 42 Model as a whole effects RUNNING REGRESSION IN STATA Source SS df MS Number of obs = 9 The regression model is F(1, 7) = 21.00 statistically significant, Model 15 1 15 Prob > F = 0.0025 F(1,7) = 21.00, p =.003 Residual 5 7.714285714 R-squared = 0.7500 Adj R-squared = 0.7143 Total 20 8 2.5 Root MSE =.84515 We reject the null hypothesis: the model with x is significantly better y Coefficient Std. err. t P>|t| [95% conf. 
interval] than the model with no predictors x.5.1091089 4.58 0.003.2419983.7580017 R2 = 0.75 (75%), a large amount of _cons 1.5194625 1.93 0.096 -.2283336 2.228334 variance explained 2248 Week 2 43 Effect of X: Coefficients Table RUNNING REGRESSION IN STATA Source SS df MS Number of obs = 9 The effect of x is statistically significant, F(1, 7) = 21.00 t(7) = 4.58, p =.003 Model 15 1 15 Prob > F = 0.0025 Residual 5 7.714285714 R-squared = 0.7500 Adj R-squared = 0.7143 We reject the null hypothesis: x has a Total 20 8 2.5 Root MSE =.84515 statistically significant effect on y (AKA the slope of x is significantly different from 0) y Coefficient Std. err. t P>|t| [95% conf. interval] x.5.1091089 4.58 0.003.2419983.7580017 For every one-point increase in x, y _cons 1.5194625 1.93 0.096 -.2283336 2.228334 increases by 0.5 points (b = 0.5) Need to understand the scale of x and y to properly appreciate that degree of change Note this is an unstandardized beta 2248 Week 2 44 Intercept: Coefficients Table RUNNING REGRESSION IN STATA Source SS df MS Number of obs = 9 The intercept (AKA constant term) is the Model 15 1 15 F(1, 7) Prob > F = = 21.00 0.0025 bottom row “_cons” Residual 5 7.714285714 R-squared = 0.7500 Adj R-squared = 0.7143 Total 20 8 2.5 Root MSE =.84515 a = 1 (the point the line cuts the y axis) y Coefficient Std. err. t P>|t| [95% conf. interval] The predicted score on y when x = 0 is 1 x.5.1091089 4.58 0.003.2419983.7580017 _cons 1.5194625 1.93 0.096 -.2283336 2.228334 6 5 4 3 2 1 0 0 2 4 6 8 10 2248 Week 2 45 Standardized Regression Coefficient Effect of X: Coefficients Table STANDARDIZED BETA Source SS df MS Number of obs = 9 The effect of x is statistically significant, F(1, 7) = 21.00 t(7) = 4.58, p =.003 Model 15 1 15 Prob > F = 0.0025 Residual 5 7.714285714 R-squared = 0.7500 Adj R-squared = 0.7143 We reject the null hypothesis: x has a Total 20 8 2.5 Root MSE =.84515 statistically significant effect on y (AKA the slope of x is significantly different from 0) y Coefficient Std. err. t P>|t| [95% conf. interval] x.5.1091089 4.58 0.003.2419983.7580017 For every one-point increase in x, y _cons 1.5194625 1.93 0.096 -.2283336 2.228334 increases by 0.5 points (b = 0.5) Need to understand the scale of x and y to properly appreciate that degree of change Note this is an unstandardized beta 2248 Week 2 47 Unstandardized vs standardized effects STANDARDIZED BETA Ordinary regression coefficient is an unstandardized coefficient: in the unit of measurement that the variables are in Unstandardized effect sizes are useful if we know (and appreciate) the ‘natural’ scale – e.g.: Each one additional child a parent has predicts one less hour sleep per night Each one additional year of teaching experience predicts 5 more memes used in lecture slides Often, using psychometric measurement tools, a “one-point” increase is not naturally meaningful – need information on the scale itself to appreciate the size One point on a 1-7 scale is much bigger than one point on a 0-100 scale Also, when comparing size of effects, we cannot make direct comparisons using unstandardized coefficients when variables are on different scales Solution? Standardized effect sizes!! 2248 Week 2 48 unstand. stand. beta beta Unstandardized beta is not comparable between different IVs on different scales! https://psycnet.apa.org/doi/10.1521/scpq.2006.21.2.148 2248 Week 2 49 Standardised regression coefficients STANDARDIZED BETA Enter the standardized regression coefficient (standardized beta)! 
𝑠𝑥 Standardized measure of the association between X and Y: 𝑠𝑡𝑎𝑛𝑑. 𝛽 = 𝑏 × 𝑠𝑦 Stand. beta will always have the same sign as unstand. beta, but differ in value In simple linear regression: standardized coefficient IS the correlation coefficient! 2248 Week 2 50 Stand. beta in Stata STANDARDIZED BETA pwcorr y x 𝑠𝑥 𝑠𝑡𝑎𝑛𝑑. 𝛽 = 𝑏 × 𝑠𝑦 | y x -------------+------------------ 1.49 y | 1.0000 = 1.35 × x | 0.8231 1.0000 2.45 = 0.823 regress y x, beta Source | SS df MS Number of obs = 8 0.8232 = 0.677 -------------+---------------------------------- F(1, 6) = 12.60 Model | 28.4516129 1 28.4516129 Prob > F = 0.0121 Residual | 13.5483871 6 2.25806452 R-squared = 0.6774 -------------+---------------------------------- Adj R-squared = 0.6237 Total | 42 7 6 Root MSE = 1.5027 ------------------------------------------------------------------------------ y | Coefficient Std. err. t P>|t| Beta -------------+---------------------------------------------------------------- x | 1.354839.3816826 3.55 0.012.8230549 _cons |.0967742 1.349452 0.07 0.945. ------------------------------------------------------------------------------ 2248 Week 2 51 Using the Regression Equation to Predict Scores Our simple X & Y example USING THE REGRESSION EQUATION = 𝜶 + 𝜷𝒙 𝒀 𝑬 𝒀 = 𝜶 + 𝜷𝒙 𝒀 = 𝜶 + 𝜷𝒙 + 𝜺 regress y x Source | SS df MS Number of obs = 9 -------------+---------------------------------- F(1, 7) = 21.00 Model | 15 1 15 Prob > F = 0.0025 E(Y) = 1 + 0.5(X) Residual | 5 7.714285714 R-squared = 0.7500 -------------+---------------------------------- Adj R-squared = 0.7143 Total | 20 8 2.5 Root MSE =.84515 ------------------------------------------------------------------------------ y | Coefficient Std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- x |.5.1091089 4.58 0.003.2419983.7580017 _cons | 1.5194625 1.93 0.096 -.2283336 2.228334 ------------------------------------------------------------------------------ We can substitute in values of X to find predicted score for Y… When X = 1 then Y = 1 + 0.5(1) = 1.5 When X = 2 then Y = 1 + 0.5(2) = 2 When X = 3 then Y = 1 + 0.5(3) = 2.5 2248 Week 2 53 Our simple X & Y example USING THE REGRESSION EQUATION Regression line predicts a score of Y for any given value of X (red triangle) 6 There are also actual scores of y for each value of x (blue circles) 5 4 We can substitute in values of X to find predicted score for Y… 3 When X = 1 then Y = 1 + 0.5(1) = 1.5 2 When X = 2 then Y = 1 + 0.5(2) = 2 When X = 3 then Y = 1 + 0.5(3) = 2.5 1 When X = 8 then Y = 1 + 0.5(8) = 5 0 0 1 2 3 4 5 6 7 8 9 2248 Week 2 54 Predicting with the Regression Equation USING THE REGRESSION EQUATION Using the regression equation (AKA prediction equation), we get predicted or expected scores on Y Remember the aim of regression is to predict an outcome (DV, Y): this lets us do that! For our sample data, the closer the actual score on Y is to the predicted score, the smaller the residual, the more precise the model This process becomes actually useful, though, when make predictions beyond our sample (generalize to the wider population) 2248 Week 2 55 Stata Example: Wealth Inequality An example using Stata: Wealth Inequality WEALTH INEQUALITY Example from Open Stats Lab https://sites.google.com/view/ope nstatslab/home Real psych paper: Dawtry et al. (2015) https://doi.org/10.1177%2F09567 97615586560 (our exercise is from Study 1a) 2248 Week 2 57 Dawtry et al. 
(2015) Wealth Inequality WEALTH INEQUALITY Dawtry, Sutton, and Sibley (2015) wanted to examine why people differ in their assessments of the increasing wealth inequality within developed nations. Previous research reveals that most people desire a society in which the overall level of wealth is high and that wealth is spread somewhat equally across society. However, support for this approach to income distribution changes across the social strata. In particular, wealthy people tend to view society as already wealthy and thus are satisfied with the status quo, and less likely to support redistribution. In their paper Dawtry et al. (2015) sought to examine why this is the case. The authors propose that one reason wealthy people tend to view the current system is fair is because their social- circle is comprised of other wealthy people, which biases their perceptions of wealth, which leads them to overestimate the mean level of wealth across society. 2248 Week 2 58 Study Methods + Hypotheses WEALTH INEQUALITY: PART 2 Design: cross-sectional online survey study Sample: 305 US adults recruited from an online survey pool Amazon’s Mturk Some different variables this week compared to last week! Participants reported: ― their attitudes toward redistribution of wealth (measured using a four-item scale) redist1 – redist4 (which we turned into support_for_redistribution last week) ― their political orientation (1 = extremely liberal, 9 = extremely conservative) political_preference ― perceived fairness of the distribution of household income across the US population, fairness 2248 Week 2 59 Study Methods + Hypotheses WEALTH INEQUALITY: PART 2 The hypotheses we’re testing (part 2): 1. Support for redistribution for wealth will be predicted by perceived fairness of wealth distribution (individuals will have less support for it if they think the current system is fair) ▪ SLR: IV = fairness, DV = support for redistribution ▪ Hypothesising a negative predictive relationship 2. Support of redistribution of wealth will be predicted by more liberal, less conservative, political orientation ▪ SLR: IV = political preference, DV = support for redistribution of wealth ▪ Hypothesising a negative predictive relationship 2248 Week 2 60 Over to Stata now! WEALTH INEQUALITY: PART 2 Steps in our exercise are: 1. Opening data file and inspecting data 2. Fitting regression model 3. Interpreting regression results 2248 Week 2 61 Results WEALTH INEQUALITY The hypotheses we’re testing: 1. Support for redistribution for wealth will be predicted by perceived fairness of wealth distribution (individuals will have less support for it if they think the current system is fair) Higher perceived fairness of wealth did statistically significantly predict less support for redistribution of wealth; F(1, 303) = 234.57, p <.001. Each one-point increase in perceived fairness of wealth distribution predicted a decrease of 0.36 points in support for redistribution, 95% CI [-0.41, -0.31]. 43.6% of the variance in support for redistribution of wealth was explained by perceived fairness, which is a large effect size. 2. Support of redistribution of wealth will be predicted by more liberal, less conservative, political orientation More conservative political orientation also statistically significantly predicted less support for redistribution of wealth, F(1, 299) = 144.17, p <.001. The effect size was also large, with 32.5% of the variance explained, although smaller than the effect of fairness. 
A one-point increase in political orientation (greater tendency towards conservative candidates) predicted a 0.29-point decrease in support for redistribution, 95% CI [-0.34, -0.24]. 2248 Week 2 62 Conclusions Regression-type Qs are all about explaining variance in the outcome, DV: understanding why some people have high scores vs low scores ― Does the IV help us to accurately predict the DV? Understanding the equation of the regression line, and the mechanics of the regression model, (hopefully) help you understand conceptually the regression analysis Simple linear regression: a single IV, predicting a DV Regression model gives us information about the model as a whole (ANOVA table + R-sq) plus the specific effect of the IV (beta coefficient) We have both standardized and unstandardized effect sizes (differently useful) 2248 Week 2 63 Lecture Learning Outcomes THE END! After this week’s lecture, you know: ― The link between correlation and regression, and what a regression line is estimating ― What a regression is and how it differs from correlation ― What sort of research questions and research designs are appropriate for regression ― Conceptually (and mechanically), what a regression model is doing ― How to interpret and write up results from regression output ― How to use the regression equation to predict scores ― What the difference is between an unstandardized and standardized beta coefficient (and how to interpret each) In Stata, you should be able to: ― Open data files ― Run a regression analysis ― Create and save a.do file for your commands (syntax) 2248 Week 2 64 Week 3: Simple + Multiple Linear Regression PSYU2248/PSYX2248: DESIGN AND STATISTICS II Week 3 Overview WHAT WILL WE COVER TODAY? This week’s lecture will cover: - Simple Linear Regression continued… - Assumption testing - Dichotomous independent variables - Multiple Linear Regression! - Concepts of Multiple Regression - A Multiple Regression Example 2248 Week 3 2 Recap Week 2 INTRODUCTION TO WEEK 3 Spend time to recap: do you remember from simple linear regression… What kinds of research questions / research hypotheses lead us to regression (vs correlation)? Formula for a regression equation? What a / α is? (Where it comes from and what it represents?) What b / β is? (Where it comes from and what it represents?) How to test statistical significance? (F vs t?) What R-squared is? 2248 Week 3 3 Recap Week 2: Practice https://doi.org/10.4103/jfmpc.jfmpc_189_17 7 simple linear regression analyses (each a different DV) Note: don’t report p-values as equal to 0!!! 2248 Week 3 4 Assumptions of Regression Regression Assumptions ASSUMPTIONS OF REGRESSION Regression has 4 (5) assumptions: 1. Independence of observations (residuals) 2. Normal distribution of residuals 3. Homoscedacity (AKA constant variance AKA homogeneity of variance) 4. Linearity 5. No collinearity (only applies to multiple regression) 2248 Week 3 6 Regression Assumptions: Independence ASSUMPTIONS OF REGRESSION Regression has 4 assumptions: 1. Independence of observations (residuals) Met through the study design and sampling procedure ― No people were sampled twice ― No one person’s scores affected anyone else’s scores ― Participants aren’t related to each other in any way (e.g. a paired / related groups design would violate this) 2248 Week 3 7 Regression Assumptions: Normality ASSUMPTIONS OF REGRESSION Regression has 4 assumptions: 1. Independence of observations (residuals) 2. 
Normal distribution of residuals No assumption of normality of our variables themselves, instead assumption of normality of the residuals Residuals can be saved through the regression analysis (saved as separate variables) Check the normality via histogram or normal probability plot (new concept!), and Shapiro-wilk test 2248 Week 3 8 Residuals REGRESSION LINE + MODEL 2 𝑅𝑒𝑠𝑖𝑑𝑢𝑎𝑙 𝑜𝑟 𝐸𝑟𝑟𝑜𝑟 𝑆𝑆 = (𝑌 − 𝑌) 6 Each observation in the dataset has a residual (e): 5 e = (𝑌 − 𝑌) (difference between actual y and predicted y) 4 Positive residual = above the regression line 3 Negative residual = below the regression line Residuals sum to 0 (regression line sits in the middle 2 of data points!) 1 0 0 1 2 3 4 5 6 7 8 9 2248 Week 3 9 Normal probability plot vs histogram of residuals ASSUMPTIONS OF REGRESSION Two ways of graphing the regression residuals to assess normality: ― Histogram (just like you already know!) ― Normal probability plot (P-P plot) Different ways of plotting the same data, used to test the same thing (whether the residuals are normal) ― Residuals have to be saved as a variable after running the regression analysis Histogram: looking for a normal(ish) bell curve ― Most common non-normal residuals are skewed or leptokurtic ― Remember: doesn’t have to be perfect, we’re looking for severe departure from normality Normal probability plot: looking for a straight diagonal line ― Most common non-normal residuals are snaking around the line ― Remember: doesn’t have to be perfect, we’re looking for severe departure from normality Interpretation of either graph can be supported by running Shapiro-Wilk normality test on the residuals 2248 Week 3 10 Normal probability plot ASSUMPTIONS OF REGRESSION pnorm residvarname Assumption is MET! Assumption is VIOLATED!. swilk r Shapiro–Wilk W test for normal data Variable Obs W V z Prob>z r 18 0.96743 0.716 -0.669 0.74829 2248 Week 3 11 Regression Assumptions: Homoscedacity ASSUMPTIONS OF REGRESSION Regression has 4 assumptions: 1. Independence of observations (residuals) 2. Normal distribution of residuals 3. Homoscedacity (AKA constant variance AKA homogeneity of variance) Variation (spread) of Y scores should be equal at different values of X Similar to assumption of equal variances in t-test! Can get a sense through the bivariate scatterplot More thorough check via residual-vs-fitted plot (rvf plot): fanning indicates violated assumption (heteroscedacity) 2248 Week 3 12 Regression Assumptions: Homoscedacity ASSUMPTIONS OF REGRESSION 2248 Week 3 https://dataaspirant.com/10-homoscedasticity-vs-heteroscedasticity/ 13 Regression Assumptions: Homoscedacity ASSUMPTIONS OF REGRESSION rvfplot Assumption is MET! Assumption is VIOLATED! 2248 Week 3 14 Regression Assumptions: Linearity ASSUMPTIONS OF REGRESSION Regression has 4 assumptions: 1. Independence of observations (residuals) 2. Normal distribution of residuals 3. Homoscedacity (AKA constant variance AKA homogeneity of variance) 4. Linearity Is there any clear non-linear trend? (this is linear regression!) Preliminary check via the bivariate scatterplot (see previous!), especially for simple regression More thorough check via residual-vs-fitted plot (rvf plot): clear trend indicates violation of linearity ― Are there approximately even numbers of points above and below ‘0’ on y axis as you move from left to right across the graph? 2248 Week 3 15 Regression Assumptions: Linearity ASSUMPTIONS OF REGRESSION rvfplot, yline(0) Assumption is VIOLATED! Assumption is MET! Assumption is VIOLATED! 
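Pulling the residual checks together (this sketch is an addition to the slides; dv, iv, and r are placeholder names): the post-estimation sequence in Stata is to fit the model, save the residuals, and then run the graphical and formal checks described above.
regress dv iv
predict r, residuals      // save the residuals as a new variable
histogram r               // normality: look for a rough(ish) bell curve
pnorm r                   // normal probability plot: look for a straight diagonal line
swilk r                   // Shapiro-Wilk test on the residuals
rvfplot, yline(0)         // residual-vs-fitted plot: fanning suggests heteroscedasticity, a clear trend suggests non-linearity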
2248 Week 3 16
Regression Assumptions ASSUMPTIONS OF REGRESSION
Valid interpretation of the results of the regression assumes that the assumptions are met:
1. Independence of observations (residuals)
2. Normal distribution of residuals
3. Homoscedasticity (AKA constant variance AKA homogeneity of variance)
4. Linearity
5. No collinearity (multiple regression to come)
Violated assumptions mean the results aren't telling you what you think they're telling you: biased or inaccurate or misleading
If assumptions are badly violated, you shouldn't be interpreting the results of the regression!
― Need to fit the regression model to be able to test assumptions (eg residuals are produced from the regression)
― However, should be confident assumptions are met before interpreting the regression results
2248 Week 3 17
Assumption demonstration
[Reused Week 2 slide: Study Methods + Hypotheses, Wealth Inequality part 2 – the two hypotheses being tested: (1) SLR with IV = fairness, DV = support for redistribution, hypothesising a negative predictive relationship; (2) SLR with IV = political preference, DV = support for redistribution of wealth, hypothesising a negative predictive relationship]
2248 Week 3 18
Dichotomous IV
Different types of Correlation Coefficients DICHOTOMOUS IV
Pearson's product-moment correlation: the normal one
Spearman's correlation rs
― Correlation on ranked data: use for violated assumptions (e.g. relationship monotonic but non-linear? Ordinal variables?)
― Non-parametric correlation: we'll cover towards the end of the unit
Point-biserial correlation rpb
― Use for one numeric and one dichotomous variable
― We'll come back to this concept in regression
Phi φ
― Use for two dichotomous variables
― Largely ignore this (we'll do Chi-square and Cramer's V instead)
2248 Week 3 20
Dichotomous IVs DICHOTOMOUS IV
Dichotomous (AKA binary) variables are common
Regression can handle dichotomous predictors!
If your main IV was dichotomous, you wouldn't use a regression (even though you could). (You would use...?)
Typically, dichotomous predictors are used as covariates in regression models – to control for extraneous and/or confounding variables (eg demographics)
You will see that using 0 and 1 as codes is special!
― "dummy coding" or "dummy variable"
2248 Week 3 21
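To spell out why 0/1 coding is special (a worked illustration added here, not from the slides): with a dummy-coded predictor the regression line only has two possible predicted values, so the intercept and slope become group means.
E(Y) = a + b·x, with x coded 0 or 1
when x = 0: E(Y) = a, so the intercept is the predicted (mean) score for the group coded 0
when x = 1: E(Y) = a + b, so the slope b is the difference between the two group means
The Love of Stats example that follows works exactly this way, with psychology students coded 1 and non-psychology students coded 0.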
Highschool maths? DICHOTOMOUS IV
https://doubleroot.in/lessons/straight-line/two-point-form/
2248 Week 3 22
Simple example: Love of Stats DICHOTOMOUS IV
Let's try a simple example: Love of statistics in Psychology vs non-Psychology students
"How much do you love stats?" 0 = it is my mortal enemy, 10 = it's the light of my life
2 groups: Psychology (coded 1) vs non-psychology (coded 0) students
2248 Week 3 23
Simple example: Love of Stats DICHOTOMOUS IV
Psychology student status and Love of stats for the ten students, shown via list psych love:
     +-------------------------------+
     |                  psych   love |
     |-------------------------------|
  1. | Non-psychology student      1 |
  2. | Non-psychology student      1 |
  3. | Non-psychology student      4 |
  4. | Non-psychology student      2 |
  5. | Non-psychology student      3 |
     |-------------------------------|
  6. | Psychology student          8 |
  7. | Psychology student          9 |
  8. | Psychology student          7 |
  9. | Psychology student          6 |
 10. | Psychology student          5 |
     +-------------------------------+
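The transcript cuts off at the data listing, but as a hedged sketch of where this example is heading (the commands below are an assumption about the next step, not shown in the transcript): regressing love on the 0/1 psych dummy recovers the two group means, and gives the same two-group comparison as an independent-samples t-test (the answer to the "(You would use...?)" hint above).
regress love psych
* _cons is the predicted love score when psych = 0: the non-psychology group mean (2.2 for the ten observations above)
* the psych coefficient is the difference between the group means (7 - 2.2 = 4.8 here)
ttest love, by(psych)     // the equivalent comparison run as an independent t-test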