3003PSY Survey Design & Analysis Tutorial 5 PDF


Summary

This document is a tutorial for a university course on survey design and analysis. It covers tasks, data, and methods related to statistical analysis such as regression. The tutorial is presented in a slide format.

Full Transcript


3003PSY SURVEY DESIGN & ANALYSIS Tutorial 5

TODAY'S TASKS
Task 1: Multiple regression revision & recap
Task 2: Regression diagnostics
Task 3: Lots & lots & lots of running regression!

You will need: ATSTute4.sav and the tutorial worksheet.

THE DATASET: ATSTute4.sav
Looks at attitudes toward doing statistics courses (DV), with age, gender, and level of maths education as predictors:
- Course (DV) → higher scores = more positive attitudes
- Gender (predictor) → dummy coded: male = 0, female = 1
- Maths education (predictor) → dummy coded: did not do Maths B/C = 0, did Maths B/C = 1
- Age (predictor) → participants' age in years

WHY DO WE NEED TO CHECK ASSUMPTIONS?
Regression is a statistical procedure based on the theory of the normal curve.
- We assume that various aspects of the data are normal; the most important indicator of this is the set of residuals we obtain after running the multiple regression.
- Severe departures from normality violate the assumptions our modelling is based on, so the inferences we draw from the results may be faulty.
- Some data problems can even prevent the analysis from running at all.
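The residual logic above (e = Y − Y′) can be sketched outside SPSS. Below is a minimal Python illustration using a tiny hypothetical stand-in for ATSTute4.sav (these numbers are invented for demonstration, not the course data):

```python
import numpy as np

# Hypothetical miniature stand-in for ATSTute4.sav (not the real data):
# course (DV), with age and mathsBC as predictors.
age     = np.array([18, 21, 25, 30, 34, 40], dtype=float)
mathsBC = np.array([0, 1, 0, 1, 1, 0], dtype=float)
course  = np.array([3.2, 5.1, 3.8, 5.6, 6.0, 4.1])

# Design matrix with an intercept column.
X = np.column_stack([np.ones_like(age), age, mathsBC])

# Ordinary least squares: b minimises the sum of squared residuals.
b, *_ = np.linalg.lstsq(X, course, rcond=None)

predicted = X @ b               # Y' (predicted scores)
residuals = course - predicted  # e = Y - Y'

# With an intercept in the model, OLS residuals sum to (numerically) zero;
# it is their *shape* (the histogram) that the normality check inspects.
print(abs(residuals.sum()) < 1e-8)  # -> True
```

Because the residuals always centre on zero when an intercept is fitted, the diagnostic question is whether their distribution around zero looks normal, not whether their mean is zero.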
Checking assumptions is a tedious but extremely crucial part of data analysis:
- You need to go through this process for any data you collect
- It is not limited to survey datasets

RUNNING PRELIMINARY ANALYSES & DIAGNOSTICS
Involves:
- Running the primary analysis → take note of the major findings
- "Cleaning" the data → check for data entry errors
- Checking whether the assumptions are met:
  - If YES: re-run the primary analysis & write up the results
  - If NO: screen for outliers & possibly transform the data
At each stage, re-run the primary analysis to check whether your actions have altered the results. In the write-up, justify why certain modifications were or were not made.

PRELIMINARY ANALYSIS
Run this syntax and interpret the output:

regression var = course age mathsBC
 /statistics = defaults zpp
 /dep = course
 /enter.

Take note of:
- The value of R²: is a significant amount of variance explained?
- The significant predictor(s): which is strongest/weakest?
- The amount of unique variance accounted for by each predictor

CHECKING FOR ERRORS
Step 1: Check for outliers & data entry errors
Run range & logic checks using descriptives, e.g.:
- 7-point scale: any negative numbers? any numbers above 7?
- Year of birth: any ridiculously old people? anyone from the future?
- Time limits: superhuman ability to watch 42 hrs of Netflix in a 24-hr day?

Run descriptives for each variable of interest:

*Run descriptive statistics for the variables 'course', 'age' and 'mathsBC'.*
descriptives var = course age mathsBC.

Tracking table (filled in as the diagnostics proceed):
Step | Action | Observation | Decision | Highlighted Case(s) | Consequences of Exclusion/Change
1 | Check for data entry errors | | | |
2 | | | | |
3 | | | | |
4 | | | | |
5 | | | | |
6 | | | | |

CHECKING ASSUMPTIONS
Step 2: Normality check
Normality of residuals (e = Y − Y′): are the residuals normally distributed and centred around zero? Inspect a histogram of the standardised residuals to decide whether the assumption is MET or VIOLATED.
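The standardised residuals that SPSS histograms (ZRESID) can be sketched in Python. This sketch uses hypothetical residual values and the sample standard deviation; SPSS's exact standardisation uses the standard error of the estimate, so treat this as an approximation:

```python
import numpy as np

# Hypothetical residuals from a fitted regression (illustrative values only).
# With an intercept in the model they sum to zero.
residuals = np.array([-1.2, 0.4, 0.8, -0.3, 0.1, 0.2])

# Standardised residuals: each residual divided by the residuals' SD,
# so the set is centred on zero with unit spread. A histogram of these
# values is what the Step 2 normality check inspects.
zresid = (residuals - residuals.mean()) / residuals.std(ddof=1)

print(abs(zresid.mean()) < 1e-9)  # -> True (centred on zero)
```

If the histogram of these standardised values is roughly bell-shaped around zero, the normality assumption is judged MET; strong skew or lumps suggest it is VIOLATED.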
*Standardises each case's residual score and plots them on a histogram; also plots the standardised residuals against the standardised predicted values.*
regression var = course age mathsBC
 /statistics = defaults zpp
 /dep = course
 /enter
 /residuals = histogram(zresid) id(ID)
 /scatterplot (*zresid, *zpred).

Tracking table so far:
Step | Action | Observation | Decision | Highlighted Case(s) | Consequences of Exclusion/Change
1 | Check for data entry errors | No errors found | - | - | -
2 | Check normality assumption | | | |

CHECKING ASSUMPTIONS
Step 3: Homoscedasticity check
- Similar distribution of residuals across values of the predictors
- Ensures the regression predicts well at both low and high levels of the predictor
- Do the residuals have a constant variance? Inspect the scatterplot: a rectangle indicates homoscedasticity (assumption MET); a fan shape indicates heteroscedasticity (assumption VIOLATED).

*Plots the residual scores against the predicted values in a scatterplot.*
regression var = course age mathsBC
 /statistics = defaults zpp
 /dep = course
 /enter
 /residuals = histogram(zresid) id(ID)
 /scatterplot (*zresid, *zpred).

Tracking table so far:
Step | Action | Observation | Decision | Highlighted Case(s) | Consequences of Exclusion/Change
1 | Check for data entry errors | No errors found | - | - | -
2 | Check normality assumption | | | |
3 | Check homoscedasticity assumption | | | |

REGRESSION DIAGNOSTICS
What happens when our assumptions aren't met?
- Test for univariate outliers
- Test for multivariate outliers
- Transform the data

TESTING FOR UNIVARIATE OUTLIERS
To check for statistical outliers, we examine, for each variable of interest:
- descriptives (skewness, kurtosis)
- histograms
- boxplots
- p-plots

examine var = course age mathsBC
 /id = ID
 /plot = histogram boxplot npplot
 /statistics.
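The tutorial identifies univariate outliers visually from boxplots and histograms. A common numeric analogue of the boxplot inspection, shown here as an assumption rather than the tutorial's own method, is the 1.5 × IQR rule that defines boxplot whiskers (the data below are invented):

```python
import numpy as np

# Hypothetical 'age' values with one suspicious case (illustrative only).
age = np.array([18, 19, 20, 21, 21, 22, 23, 24, 25, 67], dtype=float)
ids = np.arange(1, len(age) + 1)

# The 1.5 * IQR rule behind boxplot whiskers: values beyond
# Q1 - 1.5*IQR or Q3 + 1.5*IQR are flagged as potential outliers.
q1, q3 = np.percentile(age, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

flagged = ids[(age < low) | (age > high)]
print(flagged.tolist())  # -> [10]
```

Flagged cases are candidates for de-selection (never deletion), exactly as the next slide describes with the filter variable.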
Tracking table so far:
Step | Action | Observation | Decision | Highlighted Case(s) | Consequences of Exclusion/Change
1 | Check for data entry errors | No errors found | - | - | -
2 | Check normality assumption | | | |
3 | Check homoscedasticity assumption | | | |
4 | Test for univariate outliers | | | |

TESTING FOR UNIVARIATE OUTLIERS
NEVER (ever, ever, ever, ever) delete data. Instead, de-select outlying cases from the analysis by creating a filter variable (you only need to do this once).

*Create a new variable called varFilter: each participant with an ID greater than 0 receives the value 1. Add value labels (1 = selected, 0 = not selected). When running analyses, only selected cases (value 1) are used and non-selected cases (0) are ignored. Don't forget to write these numbers into the dataset (execute).*
compute varFilter = ID > 0.
execute.
value labels varFilter 0 "not selected" 1 "selected".
filter by varFilter.

(Note: in SPSS, the logical expression ID > 0 already evaluates to 1 for true and 0 for false, so no extra "= 1" is needed; to exclude a specific case, use an expression that is false for that case's ID.)

Re-run the regression analysis. Have the results changed? If not, return the excluded case(s) to the analysis.

TESTING FOR MULTIVARIATE OUTLIERS
If the data still violate assumptions, we may need to look at multivariate outliers.
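The filter-variable idea above, select rather than delete, can be mirrored with a boolean mask in Python (case IDs and values here are hypothetical):

```python
import numpy as np

# Hypothetical cases; suppose case ID 3 was flagged as a univariate outlier.
ids    = np.array([1, 2, 3, 4, 5])
course = np.array([3.2, 5.1, 9.9, 5.6, 6.0])

# Mirror SPSS's filter variable: 1 = selected, 0 = not selected.
# The data are never deleted; the flagged case is only de-selected.
var_filter = (ids != 3).astype(int)

# Analyses then use only the selected cases.
selected = course[var_filter == 1]
print(var_filter.tolist())  # -> [1, 1, 0, 1, 1]
print(selected.tolist())    # -> [3.2, 5.1, 5.6, 6.0]
```

Re-running the analysis on `selected` versus the full array makes it easy to compare results with and without the case, which is exactly the comparison the tutorial asks for before deciding whether to return the case to the dataset.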
- Multivariate outliers → outliers when considered against all variables simultaneously. E.g., an 18-year-old who earns $100,000/year:
  - an 18-year-old is not necessarily an outlier in a sample from the general public
  - someone earning $100,000/year is not necessarily an outlier in a sample from the general public
  - but TOGETHER, this combination is likely to be a multivariate outlier.

We use Mahalanobis' distance to identify multivariate outliers:
- SPSS lists the 10 most extreme points in the dataset (these aren't necessarily outliers)
- Compare these points against the chi-square (χ²) cut-off value at the p < .001 level
- The cut-off value differs based on the number of predictors
- Any value exceeding the cut-off is considered a multivariate outlier; filter out any such cases.

Multivariate outliers are particularly problematic, so we ALWAYS remove them.

regression var = course age mathsBC
 /statistics = defaults zpp
 /dep = course
 /enter
 /residuals = outliers(mahal) id(ID).
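Mahalanobis' distance and the chi-square cut-off can be sketched in Python. The (age, income) values below are invented to echo the slide's 18-year-old-on-$100,000 example; note that with a demo sample this small no case can mathematically exceed the cut-off, so the sketch just shows the computation and identifies the most extreme case:

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical (age, income) data echoing the slide's example: an 18-year-old
# and a $100,000 earner are each unremarkable alone, but extreme together.
X = np.array([
    [25, 30_000], [30, 45_000], [45, 60_000], [50, 70_000],
    [35, 50_000], [40, 55_000], [28, 38_000], [55, 80_000],
    [33, 48_000], [22, 26_000], [48, 65_000], [18, 100_000],
], dtype=float)

# Squared Mahalanobis distance of each case from the multivariate centroid.
diff = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)

# Chi-square cut-off at p < .001 with df = number of predictors (here 2).
cutoff = chi2.ppf(1 - 0.001, df=X.shape[1])
print(round(cutoff, 2))    # -> 13.82
print(int(np.argmax(d2)))  # -> 11 (the 18-year-old on $100,000)
```

With two predictors the p < .001 cut-off is 13.82, matching the df-dependent value the slide tells you to look up in the χ² table; any case whose distance exceeds it would be filtered out.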
Tracking table so far:
Step | Action | Observation | Decision | Highlighted Case(s) | Consequences of Exclusion/Change
1 | Check for data entry errors | No errors found | - | - | -
2 | Check normality assumption | | | |
3 | Check homoscedasticity assumption | | | |
4 | Test for univariate outliers | | | |
5 | Test for multivariate outliers | | | |

DATA TRANSFORMATIONS
If, after the univariate and multivariate checks, the data still violate assumptions, we may need to transform the data. Alternatively, linear regression may not be the most appropriate technique for these data.

First, examine skewness & kurtosis:
- For normally distributed data, skewness = 0 and kurtosis = 0
- Divide the skewness and kurtosis values by their respective standard errors
- If |value| > 3.29, this indicates significant skewness or kurtosis (at the p < .001 level)
- If there is significant positive skew, we can apply a logarithmic transformation.

Which variable(s) might we need to transform? To transform a positively skewed variable, use the following syntax:

*Create a new variable called lnage, the logarithmic transformation of the original variable 'age'. Write the transformed values into the dataset (execute), then examine plots and descriptives for lnage.*
compute lnage = ln(age).
execute.
examine var = lnage
 /id = ID
 /plot = histogram boxplot npplot
 /statistics.

Did this help with normality? Re-run the regression using the transformed variable. Did this change the results at all? Should we keep the transformed variable or go back to the original?

Tracking table so far:
Step | Action | Observation | Decision | Highlighted Case(s) | Consequences of Exclusion/Change
1 | Check for data entry errors | No errors found | - | - | -
2 | Check normality assumption | | | |
3 | Check homoscedasticity assumption | | | |
4 | Test for univariate outliers | | | |
5 | Test for multivariate outliers | | | |
6 | Transform data | | | |

DATA TRANSFORMATIONS
Positively skewed variables → logarithmic transformation:
compute lnVarName = ln(originalVar).
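The "divide skewness by its standard error, flag if |z| > 3.29" rule above can be sketched in Python. The standard errors here use the common large-sample approximations sqrt(6/n) and sqrt(24/n), an assumption on my part; SPSS reports exact standard errors, which differ slightly. The data are invented to be strongly positively skewed:

```python
import numpy as np
from scipy.stats import skew, kurtosis

# Hypothetical positively-skewed variable (illustrative values only).
x = np.array([1]*10 + [2]*5 + [3, 4, 5, 50], dtype=float)
n = len(x)

# Large-sample approximations to the SEs of skewness and kurtosis
# (SPSS prints exact SEs; these are close for larger n).
se_skew = np.sqrt(6.0 / n)
se_kurt = np.sqrt(24.0 / n)

z_skew = skew(x) / se_skew
z_kurt = kurtosis(x) / se_kurt  # excess kurtosis: 0 for a normal curve

# |z| > 3.29 indicates significant skew/kurtosis at p < .001.
print(z_skew > 3.29)                    # -> True (significant positive skew)
print(abs(skew(np.log(x))) < skew(x))   # -> True (log transform reduces skew)
```

The second print shows why the log transformation is the slide's remedy for positive skew: compressing the long right tail pulls the skewness statistic back toward zero.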
E.g., if my variable age is positively skewed:
compute lnAge = ln(Age).

Negatively skewed variables → reflective transformation, THEN log.
Take the maximum score the variable can take and add 1 to it (maxPlusOne):
compute reflectedVarName = maxPlusOne - originalVar.

E.g., if my variable Netflix is negatively skewed and has a maximum score of 24:
compute reflectedNetflix = 25 - Netflix.

Note: to transform a negatively skewed variable, we must first reflect it (making it positively skewed); THEN we can apply the log transformation to the now positively skewed variable. Make sure to log-transform the reflected variable, not the original (negatively skewed) one!

SUMMARISE & WRITE UP
- Make a note of everything you do during data screening & cleaning
- You will need to write it up in the results section
- It only needs to be brief: state what you did, and whether it made a difference to the findings.

This is a repetitive process that involves a lot of running and re-running analyses. Remember:
- Univariate outliers → if excluding cases did not change the nature of the results, return the outlier to the dataset
- Multivariate outliers → any identified outliers should always be removed
- Transformations → if the transformation did not change the nature of the results, use the original variable
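The reflect-then-log recipe above can be sketched in Python, using the slide's own Netflix example with invented hours (maximum possible score 24, so maxPlusOne = 25):

```python
import numpy as np
from scipy.stats import skew

# Hypothetical negatively-skewed variable: hours of Netflix watched in a day;
# the maximum possible score is 24, so maxPlusOne = 25 (as on the slide).
netflix = np.array([20.0, 22.0, 23.0, 23.5, 24.0, 10.0, 18.0, 21.0])

# Step 1: reflect, which turns the negative skew into a positive one.
reflected = 25.0 - netflix   # all values are now > 0, so log is safe

# Step 2: log-transform the REFLECTED variable, not the original.
ln_reflected = np.log(reflected)

print(skew(netflix) < 0)    # -> True (original: negative skew)
print(skew(reflected) > 0)  # -> True (reflection flips the skew's sign)
```

One interpretive caution: after reflection, high scores on the transformed variable correspond to low scores on the original, so the signs of regression coefficients involving it reverse direction.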
