Personality and Social Psychology Lecture 3: Correlation and Regression
Summary
This lecture covers correlation and regression in personality and social psychology. It discusses visualizing associations, correlation coefficients, and regression analysis. The lecture also touches upon measurement, reliability, and validity.
Full Transcript
Personality and Social Psychology
Lecture 3: Correlation and Regression

Overview of my lectures
- Introduction to personality (week 2): What is personality? History and measurement.
- Correlation and regression (now): Analysing data in personality and social psychology.
- Personality: development and change (week 4): Is our personality stable across the lifespan? Can we choose to change?
- Personality and consequential outcomes (week 5): The predictive power of personality; achievement, health, quality of life, social indicators.
- Persons and situations (week 6): The 'person-situation debate'.

Overview for today
- Visualising and quantifying associations (relations between variables)
- Correlation coefficients: Spearman and Pearson
- Regression analysis
- Measurement, reliability, and validity

"…the existing evidence broadly suggests that levels of Agreeableness and Conscientiousness are positively associated with age whereas levels of Extraversion and Openness are negatively associated with age"
Donnellan, M. B., & Lucas, R. E. (2008). Age differences in the Big Five across the life span: Evidence from two national samples. Psychology and Aging, 23(3), 558–566. https://doi.org/10.1037/a0012897

"The relationship between extraversion and happiness or subjective well-being (SWB) is one of the most consistently replicated and robust findings in the SWB literature."
Pavot, W., Diener, E., & Fujita, F. (1990). Extraversion and happiness. Personality and Individual Differences, 11(12), 1299–1306. https://doi.org/10.1016/0191-8869(90)90157-M

"Conscientiousness is the most potent noncognitive predictor of occupational performance…"
Wilmot, M. P., & Ones, D. S. (2019). A century of research on conscientiousness at work. Proceedings of the National Academy of Sciences, 116(46), 23004–23010. https://doi.org/10.1073/pnas.1908430116

Patterns of association
One of the most fundamental questions we ask in psychology is: is X related to Y?
- Are boys more aggressive than girls?
- Does our wellbeing rise in the summer and fall in the winter?
- Does extraversion predict leadership skills?
- Does anxiety "run in families"?
- Do people act a bit "crazy" when there is a full moon?
- Does lecture attendance predict higher grades?

Visualising these patterns: scatterplots
A scatterplot plots pairs of X-Y scores, with X on the horizontal axis and Y on the vertical axis. The scatterplot allows you to visualise the association between X and Y. [Figure: example scatterplots.]

Correlation
Associations or relations between two variables (X, Y) can be quantified in terms of a correlation coefficient (r). Correlation is a form of bivariate analysis. The correlation coefficient quantifies a linear relation in terms of:
- Direction of association
- Degree of association
It is also the building block for more sophisticated methods: multiple regression, factor analysis, structural equation modeling, and partial correlations.

Direction, degree, and form of association
- Direction: a correlation can be positive or negative. E.g., a baby's age is positively related to their weight, but negatively related to the time they spend crawling.
- Degree: correlation coefficients (r) range in size from -1 to 1. An r of +1 or -1 is a perfect correlation; an r of zero indicates no association between the variables.
- Form: linear or non-linear (see below).

Direction: positive, negative, or no relation?
- Positive association: increases in X are accompanied by increases in Y.
- Negative association: increases in X are accompanied by decreases in Y.
- No relationship: knowing something about X tells you nothing about Y, and vice versa.
[Figure: three scatterplots illustrating a positive relation, a negative relation, and no relation.]
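These direction patterns are easy to generate and inspect yourself. Below is a minimal sketch (not from the lecture, which uses JASP) that simulates a positive, a negative, and a null association and draws the three scatterplots; numpy and matplotlib are assumed to be available, and all variable names and parameter values are invented for illustration.

```python
# Simulate positive, negative, and null X-Y associations, then
# visualise each with a scatterplot and report Pearson's r.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.normal(size=100)
datasets = {
    "positive": 0.8 * x + rng.normal(scale=0.6, size=100),
    "negative": -0.8 * x + rng.normal(scale=0.6, size=100),
    "no relation": rng.normal(size=100),
}

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
for ax, (label, y) in zip(axes, datasets.items()):
    r = np.corrcoef(x, y)[0, 1]  # Pearson's r for this panel
    ax.scatter(x, y, s=10)
    ax.set(title=f"{label}: r = {r:.2f}", xlabel="X", ylabel="Y")
fig.tight_layout()
plt.show()
```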
[Figure: scatterplots of positive and negative correlations. Source: Statistics for the Behavioural Sciences, Sixth Edition: Teaching Resources.]

Degree: strong, moderate, weak?
[Figure: a perfect positive correlation (r = +1) and a perfect negative correlation (r = -1).] In a perfect linear relation, every change in the X variable is accompanied by a corresponding change in the Y variable.
[Figure: six scatterplots showing r = 0, .2, .4, .6, .8, and 1.0, with linear R-squared values of .002, .053, .165, .345, .659, and 1.]

A correlation is a measure of effect size. Traditional 'rules of thumb' (Cohen, 1988):
- Small effect: r ~ .10
- Medium effect: r ~ .30
- Large effect: r ~ .50
But the research context matters! In psychology/medicine, only 1/3 of correlations exceed r = .30 (Hemphill, 2003). In personality/social psychology, the average correlation is r = .21 (Richard et al., 2003). Alternative discipline-specific effect size guidelines have been proposed.

Form of the relation
Example variable pairs by form and direction:
- Linear, positive: ability and performance; happiness today and happiness tomorrow.
- Linear, negative: depression and positive mood.
- Linear, no relationship: internet speed and streaming quality.
- Non-linear: arousal and performance; practice and performance; length of sleep and tiredness upon waking.
What pairs of variables might have these kinds of non-linear associations? [Figure: scatterplots of a quadratic relation (quadratic R-squared = .779) and a cubic relation (cubic R-squared = .637).]

Correlation measures the linear relation between two variables. If there is a nonlinear relation, the correlation value may be misleading: a strong curvilinear pattern can still have a linear association of zero (r = 0).

Some more examples: how would you describe these correlations in terms of direction, form, and degree? [Figure: four scatterplots with r = -.30, r = .80, r = .00, and r = -1.00.]

Extreme scores
Extreme scores, or "outliers", can greatly influence the size of a correlation. [Figure: scatterplot of number of "close friends" (0-15) against extraversion score (0-20).] In the full sample, r = .31; without Juan (an extreme scorer), r = .42; also without Kiki, r = .51. Check out this simulation to better understand the effect of extreme scores: http://www.uvm.edu/~dhowell/SeeingStatisticsApplets/PointRemove.html

A single extreme data point can even reverse the picture. In the original data, the relation is positive in direction, linear in form, and strong in degree (r = .85); after adding one extreme data point, there is no clear direction or form and r = -.08. (Slide from Statistics for the Behavioural Sciences, Sixth Edition: Teaching Resources.)

As the number of observations increases, the influence of extreme scores on the correlation decreases. In other words, correlations stabilise as sample size (N) increases; for an illustration using simulated correlations, see Schönbrodt & Perugini (2013).
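To get a feel for this, here is a small sketch of how one off-trend observation can change r. The data are invented to loosely mimic the extraversion / "close friends" example; they are not the lecture's data, so the exact r values will differ from the slide's.

```python
# Demonstrate how a single extreme score ("outlier") changes Pearson's r.
import numpy as np

rng = np.random.default_rng(7)
extraversion = rng.uniform(5, 15, size=30)
friends = 0.5 * extraversion + rng.normal(scale=2.0, size=30)

def pearson_r(x, y):
    return np.corrcoef(x, y)[0, 1]

print(f"without outlier: r = {pearson_r(extraversion, friends):.2f}")

# One extreme, off-trend observation: very high extraversion, zero friends.
x_out = np.append(extraversion, 20.0)
y_out = np.append(friends, 0.0)
print(f"with outlier:    r = {pearson_r(x_out, y_out):.2f}")
```

With only 31 observations the single added point pulls r down noticeably; rerunning with a larger sample (say size=300) shows the same point mattering far less, which is the stabilisation result from Schönbrodt and Perugini (2013).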
Calculating the correlation coefficient (r)
Correlation is a function of two things:
1. Variability: how much a given variable (X) varies from observation to observation.
2. Covariability: how much two variables (X and Y) vary together.

Variability is captured by the sum of squares (SS): the difference between each observation and the mean, squared, and summed over all data points. More variability = a larger sum of squares.
SS_X = Σ(X - X̄)²
E.g., for observations 99, 85, 124, … around a mean of 100: (99 - 100)² = 1, (85 - 100)² = 225, (124 - 100)² = 576, and so on, summed together.

Covariability is captured by the sum of products (SP): the difference between each observation and the mean, for each variable, multiplied together and summed over all data points.
SP = Σ(X - X̄)(Y - Ȳ)
Note that if X = Y, SP = SS.

The correlation coefficient is
r = SP / √(SS_X × SS_Y)
In other words, r = covariability of X and Y / variability of X and Y separately. When SS = SP, the two terms cancel out and r = 1; when SP = 0, r = 0.

Worked example (the scatterplot data from earlier, with means X̄ = 3.66 and Ȳ = 3.33):
- Variability: for X, SS = (1 - 3.66)² + (1 - 3.66)² + (3 - 3.66)² + … = 31.33; for Y, SS = (1 - 3.33)² + (3 - 3.33)² + (2 - 3.33)² + … = 13.33.
- Covariability: SP = (1 - 3.66)(1 - 3.33) + (1 - 3.66)(3 - 3.33) + … = 15.66.
- Correlation: r = 15.66 / √(31.33 × 13.33) = 0.76.
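The same arithmetic is easy to check in code. In the sketch below, the six data points are a reconstruction chosen to reproduce the slide's sums (SS_X = 31.33, SS_Y = 13.33, SP = 15.66); the lecture's raw data are not shown on the slide, so treat the values as an assumption.

```python
# Pearson's r from sums of squares (SS) and the sum of products (SP).
import numpy as np

# Reconstructed data points (assumed); they reproduce the slide's sums.
x = np.array([1, 1, 3, 4, 6, 7], dtype=float)
y = np.array([1, 3, 2, 5, 4, 5], dtype=float)

ss_x = np.sum((x - x.mean()) ** 2)            # variability of X
ss_y = np.sum((y - y.mean()) ** 2)            # variability of Y
sp = np.sum((x - x.mean()) * (y - y.mean()))  # covariability of X and Y

r = sp / np.sqrt(ss_x * ss_y)
print(f"SS_X = {ss_x:.2f}, SS_Y = {ss_y:.2f}, SP = {sp:.2f}, r = {r:.3f}")
# Matches np.corrcoef(x, y)[0, 1]; the slide's 0.76 reflects rounding
# of the intermediate sums.
```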
Null hypothesis significance testing
Often our research questions are not framed in terms of direction, degree, or form; we want a yes/no answer. Examples:
1. Is personality associated with age?
2. Is extraversion associated with subjective well-being?
3. Is conscientiousness associated with occupational performance?
When can we say "yes"?

Suppose we ask "is X correlated with Y?"
1. We first assume the true correlation is 0 (the null hypothesis).
2. We then determine whether we have evidence that allows us to reject this assumption (reject the null).
The null hypothesis is H0: ρ = 0, where ρ ('rho') is the correlation in the population (the 'true' correlation). We then inspect r, the correlation in our sample. Our question then becomes: "What is the probability of finding an r this big, if the association in the population (ρ) is zero?"

How improbable should the observed r be under the null, H0, for us to reject the null? This is expressed as a probability value, or "p value":
- p = 1.00: under the null, we would expect to observe an r this big 100% of the time.
- p = 0.50: 50% of the time.
- p = 0.05: 5% of the time.
- p < .05 is the threshold ("alpha level") at which we usually reject the null.
If we reject the null, we say the correlation is "significant": r is significantly different from zero.

Why p < .05? Why not p < .01 or p < .10? It is arbitrary! Just a "rule of thumb" (like Cohen's effect size guidelines). p < .05 represents the frequency of times we would be comfortable being wrong in concluding that there was an association, i.e., five times out of 100. Some have argued for a stricter default alpha level of p < .005! (See Benjamin et al., 2018, Nature Human Behaviour.)

Note that the size of our obtained p value will depend on the size of the sample, N. More precisely, it is a function of the degrees of freedom (df); for correlation, df = N - 2. [Table: critical values of r for significance at p = .10, .05, .02, and .01, by df.] With 2 degrees of freedom, a correlation needs to be ≥ .95 to be significant; with df = 5, ≥ .75; with df = 40, ≥ .30.

Why does significance depend on sample size? Remember Schönbrodt & Perugini (2013): correlation estimates stabilise as sample size increases. I.e., at smaller samples, it is more likely for large correlations to emerge purely by chance. An illustration: suppose Kiki correctly guesses 75% of coin flips. Can we conclude she has ESP? 🤨 What if it was 75% of four coin flips? 🤯 What if it was 75% of 200 coin flips?

Example output from JASP
Can we confirm that age is significantly positively associated with conscientiousness and agreeableness? N = 980 adults assessed on the Big Five (data from https://osf.io/9wf2s/). The correlations are small (A) to average (C), and p < .05, so they are significant.

Example correlation matrix
[Table from Wilt et al., 202: various trait measures; their means and standard deviations (SD is a measure of variability); a measure of reliability (we'll come to that later!); and their correlations with each other.]

Spearman's correlation
So far we have examined the correlation coefficient proposed by Karl Pearson ("Pearson's r"). An alternative was proposed by Charles Spearman ("Spearman's rho"): the data are converted to ranks before calculating the correlation, e.g., the lowest value is scored 1, the next lowest value is scored 2, and so on, all the way up to the highest. This provides a solution for non-linear data and for dealing with outliers:
- Non-linear data: converting to ranks can linearise non-linear data. [Figure: the same curved data under Pearson's r and Spearman's rs.]
- Outliers: a sequence of data such as 15, 20, 2532 would be recoded as 1, 2, 3; the outlier ends up only one rank above the next data point. [Figure: the same data with and without the outlier ranked.]
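A brief sketch of the ranking idea, using the slide's 15, 20, 2532 sequence paired with an invented second variable (scipy is assumed to be available):

```python
# Spearman's rho is Pearson's r computed on ranks, which tames outliers.
import numpy as np
from scipy import stats

x = np.array([15.0, 20.0, 2532.0])  # the slide's outlier sequence
y = np.array([2.0, 4.0, 5.0])       # invented second variable

print("ranks of x:", stats.rankdata(x))  # [1. 2. 3.]: outlier is one rank up
print("Pearson r:   ", round(stats.pearsonr(x, y)[0], 3))
print("Spearman rho:", round(stats.spearmanr(x, y)[0], 3))
```

Because the rank orders of x and y agree perfectly here, Spearman's rho is 1.0 even though the raw-score Pearson r is pulled around by the extreme value 2532.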
Example using JASP
Research question: do anxious people tend to be less satisfied with their work? Operationalised RQ: is the anxiety facet of Big Five neuroticism (significantly) negatively correlated with self-reported job satisfaction? The data come from a sample of managers (n = 28) who completed scales measuring job satisfaction and anxiety (100-point scales). Important: this sample is REALLY SMALL and we could never trust the results in reality!

How to report: "There was a significant negative correlation between satisfaction and anxiety, r(26) = -.581, p = .001." (Remember, df for correlation = N - 2.)

Correlation versus causation
Correlation is not evidence of causation! A significant correlation does not mean that one variable causes the other. E.g., neuroticism is negatively correlated with job satisfaction, but:
- Perhaps low job satisfaction increases neuroticism.
- Perhaps low neuroticism increases job satisfaction.
- Perhaps a third variable is responsible for both: work stress, problems at home, …

From correlation to regression…
Correlations quantify patterns of association: r (and rho) describe a bivariate association between X and Y. Regression involves prediction: e.g., based on values of X, can I predict values of Y? If two variables are correlated, we can base our prediction on that correlation.

Consider a pay-as-you-go phone plan: $30 per month subscription, plus 20 cents per call.
Monthly bill = 30 + 0.20 × (number of calls)
This is a perfect correlation: we can predict the exact bill for any number of calls. The general form of a perfect linear relationship is
Y = a + bX

Now a less than perfect linear relation: extraversion and subjective well-being. What is the best linear relationship of the form Y = a + bX for these data? I.e., what is the line of best fit? [Figure: scatterplot of subjective well-being against extraversion.] The line of best fit (regression line) is
SWB = 0.65 + .65 × Extraversion
But prediction will not be exact; there will be error:
SWB = 0.65 + .65 × Extraversion + e
Sometimes Y will be perfectly predicted, sometimes not! The misses are our errors of prediction, or residuals.

The regression model
The general form of a perfect linear relationship:
Criterion variable = Intercept + Slope × Predictor variable
Y = a + bX
Does anxiety predict job satisfaction? What is the line of best fit that serves to model Y based on X? Here:
Satisfaction = 108 - .941 × Anxiety
For the actual data we add an error term:
Criterion variable = Intercept + Slope × Predictor variable + Error
Y = a + bX + e
Data in the population are dispersed randomly around this population regression line, where:
- a is the intercept parameter (sometimes called the constant),
- b is the slope parameter (the regression coefficient), and
- e is an error or residual term.
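Given fitted values for a and b, prediction is just plugging in X. Here is a minimal sketch using the slide's fitted model for satisfaction and anxiety; the coefficients come from the lecture, but the three observed scores are invented to illustrate residuals.

```python
# Predictions and residuals under the slide's fitted model:
# Satisfaction = 108 - .941 * Anxiety.
a, b = 108.0, -0.941  # intercept and slope from the lecture's JASP output

def predict_satisfaction(anxiety: float) -> float:
    """Predicted Y for a given X under Y = a + bX."""
    return a + b * anxiety

# Invented (anxiety, observed satisfaction) pairs for illustration.
for anxiety, observed in [(40, 75), (73, 39), (90, 30)]:
    predicted = predict_satisfaction(anxiety)
    residual = observed - predicted  # e = Y - Y_hat
    print(f"anxiety={anxiety:3d}  predicted={predicted:6.1f}  "
          f"observed={observed:3d}  residual={residual:+.1f}")
```

Note that at the mean anxiety score (73) the model predicts about 39, the mean satisfaction score: the least-squares line always passes through the point (X̄, Ȳ), which is exactly why a = Ȳ - bX̄ in the hand calculations below.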
Errors are assumed to be:
- Independent,
- Normally distributed (with a mean of zero), and
- Homoscedastic: equal error variance across levels of predicted Y.

The least squares parameter estimates
For each case, compare the observed Y score (the actual data) with the predicted Y score (the prediction from the regression line). We estimate the parameters (regression coefficients) by minimising the total squared error, e.g.:
(88 - 56)² = 32² = 1024
(73 - 40)² = 33² = 1089
(45 - 71)² = (-26)² = 676
…summed over all observations.

Satisfaction and anxiety: hand calculations
Mean anxiety: 73. Mean satisfaction: 39. Remember:
SS_X = Σ(X - X̄)² = ΣX² - (ΣX)²/n
SP = Σ(X - X̄)(Y - Ȳ) = ΣXY - (ΣX)(ΣY)/n
Here SS(Anxiety) = 6,040 and SP = -5,685, so
b = SP / SS_X = -5685 / 6040 = -.941
a = Ȳ - bX̄ = 39 - (-.941 × 73) ≈ 107.7
JASP output: a = 107.91, b = -.941. (Small differences between the hand calculations and the JASP output are due to rounding.)

Interpreting the parameters:
- The intercept, a (aka the constant), is the Y-intercept: the predicted value of Y when X is 0.
- The slope, b, is the rise over the run (e.g., -18.83 / 20 = -.941): the effect on Y for a unit (1) increase on the predictor.

Regression diagnostics
Remember the assumptions from earlier:
1. Independence of residuals: usually assumed based on independence of observations.
2. Normality of residuals: can be assessed by inspecting a histogram of the residuals. Where the predicted score is less than the observed score, the residual is positive; where the predicted score is greater than the observed score, the residual is negative; a residual of 0 means no error.
3. Homoscedasticity of residuals: a scatterplot of residuals against predicted values is used to check for heteroscedasticity. The absence of any systematic pattern supports the assumption of homoscedasticity.

Variance explained (R²)
When there is a perfect correlation, all Y scores fall exactly on the regression line. Phone bill example (Y = 30 + .20X): the variation in my pay-as-you-go phone bill is completely explained by the number of calls I make. Proportion of variance explained = 100%; R² = 1.00.
But in our other example, not all variation in job satisfaction can be explained by anxiety. Job satisfaction example (Y = 108 - .941 × Anxiety): the Y scores (job satisfaction) do tend to follow the regression line, but they also vary around it. Proportion of variance explained = 33.7%; R² = 0.337.

Null hypothesis significance testing (for the regression model)
Does the regression model help to explain the variance in Y? Can we reject the null hypothesis H0: R² = 0? This can be tested using an analysis of variance (ANOVA). In the JASP output, the F test is our test statistic for regression. Because our p value is < .05, we can reject the null: the proportion of variance explained by the regression model is significantly different from zero.

Null hypothesis significance testing (for regression parameters)
Regression parameters:
- Intercept, a: the estimated value of Y when X = zero.
- Slope, b: indicates the strength of the association between X and Y, the direction of that association (positive or negative, cf. correlation), and the estimated change in Y when X increases by 1.
The slope can be translated into standardised form, the standardised regression coefficient (called beta in SPSS or JASP), which can be compared with the correlation coefficient for bivariate data. We can also compute confidence intervals for regression coefficients.
Does the slope parameter, b, help us to predict the dependent variable Y? I.e., is the slope non-zero (H0: b = 0)? We can use a t-test to examine this null hypothesis (the t-test and F test are closely related!). If p < .05, reject the null hypothesis.
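The whole least-squares recipe (b = SP / SS_X, a = Ȳ - bX̄, then R² from the residuals) fits in a few lines. The sketch below runs it on simulated data shaped to resemble the anxiety/satisfaction example; the lecture's own 28 cases are not available, so the estimates will only be close to, not identical with, the slide's values.

```python
# Least-squares estimates and R-squared, computed by hand with numpy.
import numpy as np

rng = np.random.default_rng(3)
anxiety = rng.uniform(40, 100, size=28)                               # X
satisfaction = 108 - 0.94 * anxiety + rng.normal(scale=12, size=28)   # Y

ss_x = np.sum((anxiety - anxiety.mean()) ** 2)
sp = np.sum((anxiety - anxiety.mean()) * (satisfaction - satisfaction.mean()))

b = sp / ss_x                                  # slope
a = satisfaction.mean() - b * anxiety.mean()   # intercept

y_hat = a + b * anxiety
ss_residual = np.sum((satisfaction - y_hat) ** 2)
ss_total = np.sum((satisfaction - satisfaction.mean()) ** 2)
r_squared = 1 - ss_residual / ss_total  # proportion of variance explained

print(f"a = {a:.2f}, b = {b:.3f}, R^2 = {r_squared:.3f}")
```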
Example using JASP
Research question(s): Are people who experience more positive moods happier? (Correlation.) Can we predict happiness from positive mood? (Regression.) Or, can we model happiness based on positive mood?
[JASP output: scatterplot of happiness against positive mood, followed by the regression tables.]
R-squared gives the variance explained by the model, and the model (% variance explained) is significant.
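To close the loop, here is an end-to-end sketch of the same workflow in Python: correlation, then regression, then the significance tests. The data are simulated stand-ins for the lecture's (unavailable) happiness and positive-mood scores, so the printed numbers are illustrative only.

```python
# Correlation and simple regression for happiness ~ positive mood.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
positive_mood = rng.normal(50, 10, size=100)
happiness = 20 + 0.6 * positive_mood + rng.normal(scale=8, size=100)

# Correlation with its significance test.
r, p_r = stats.pearsonr(positive_mood, happiness)
print(f"r({len(happiness) - 2}) = {r:.3f}, p = {p_r:.4f}")

# Simple linear regression: happiness = a + b * positive_mood + e.
fit = stats.linregress(positive_mood, happiness)
print(f"happiness = {fit.intercept:.2f} + {fit.slope:.3f} * mood")
print(f"R^2 = {fit.rvalue ** 2:.3f}, slope t-test p = {fit.pvalue:.4f}")
```

With a single predictor, the t-test on the slope and the test of the correlation give the same p value, and the model F statistic is just t squared, which is the close relationship between the t and F tests mentioned above.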