3003PSY Survey Design & Analysis Week 8 2024 Student Lecture Slides (PDF)

Summary

These lecture slides present an overview of regression diagnostics: the assumptions of regression, how violations arise, how to check for them, and how to handle them. They cover data entry errors, linearity, residuals, univariate and multivariate outliers, skew and kurtosis, and transformation techniques, as taught in the 3003PSY Survey Design & Analysis course.

Full Transcript

Regression assumptions

Like many tests we use, regression is subject to a set of assumptions.
We have assumptions because the math assumes certain characteristics about our data:
Linearity
Normality of residuals
Homoscedasticity
Independence of residuals
We test these assumptions by requesting information in our syntax.
Violations of these assumptions may reflect skew, outliers, and/or non-linear relationships and may impact our ability to make inferences with our data!

Violations to assumptions

When assumptions are violated, short of creating an entirely new model (and conducting an entirely new study), we investigate the nature of our variables.
It may be that addressing things like normality, the presence of outliers, etc., will allow us to meet our assumptions.

Handling violations

Variables are rarely perfectly bell shaped: we frequently have nonnormal data (e.g., skewed/kurtotic).
We need to ask ourselves why our data is the way it is: sampling issue, measurement issue, small N, did we recruit members of the wrong population?
An older approach is to check all variables blindly and routinely transform any data that is nonnormal.
But a more “active” approach is better: looking at your data, trying to understand why it presents how it does, and using targeted solutions to address the issues.
There is some subjectivity to this process; we don’t have clear cut-offs for some of the checking processes.
For the assignment, check skew, outliers, etc. even if you think the residual plots “look good”.
To really understand these issues and address them, you have to look at your data and consider your entire research process thus far (remember, the data are just numbers; you as the researcher have the theory and context from which this data arises).

Checking the assumptions & diagnosing the problems

1. Check for data entry errors (often easy to identify, can have an immense effect on the data)
2. Generate scatterplots to examine relationships between variables
3. Check your residuals
4. Assess data for univariate outliers
5. Assess data for multivariate outliers
6. Assess data for skew and kurtosis (apply transformations as required)
7. Check transformations

Data entry errors

Data entry errors are less common with online surveys, but still occur.
Ensure you pay close attention to variables where Ps responded via numerals or text (e.g., what is your age?).
We simply delete the incorrect response.
In this case, I would simply delete 118 and 555.
Thankfully, data entry errors are often very obvious, especially when you know the measures you used very well.

Linearity Assumption - Scatterplots

Check your residuals

If you have removed your data entry errors, and your scatterplots show linear relationships, it’s time to run your regression & look at your residual plots.
We get a histogram and a scatterplot:
(1) Histogram: normally distributed residuals
(2) Scatterplot: equal variance of scores across the entire plot (homoscedasticity)

Figure: Histogram of the distribution of the residuals
Figure: Scatterplot of the residuals

Assumptions violated: what do I do?

The data looks quite normal when the outlier doesn’t skew the distribution.
So we have discovered that our residuals are not normal!
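The residual check described above can be sketched outside SPSS as well. The following is a minimal Python illustration, not the course's SPSS procedure: the data are made up, the function names `fit_line` and `standardized_residuals` are hypothetical, and the residuals are simply divided by the residual standard error (a rough standardisation, not what SPSS necessarily reports). It fits a one-predictor regression and prints a crude text plot so that a single badly fitting case stands out.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def standardized_residuals(x, y):
    """Residuals divided by the residual standard error (n - 2 df)."""
    a, b = fit_line(x, y)
    res = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    se = math.sqrt(sum(r ** 2 for r in res) / (len(x) - 2))
    return [r / se for r in res]

# Made-up data: y is roughly 2 * x, except the last case, which is far off.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 16.2, 18.0, 35.0]

z = standardized_residuals(x, y)
for xi, zi in zip(x, z):
    # Crude text "plot": one row per case, bar length = size of the residual.
    print(f"x={xi:>2}  z={zi:+.2f}  {'#' * round(abs(zi) * 5)}")
```

In this toy example the last case has by far the largest residual, and because it also pulls the fitted line towards itself, the residuals of the neighbouring cases are pushed negative. That is the kind of pattern that can make a residual histogram look skewed when the real issue is one outlier.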
Time to examine the data and see if we can’t figure out the issue.

Univariate outliers

Univariate outlier: a data point that is extreme on one variable.
Need to investigate this case: why is it so far away from the rest of the distribution?
Sometimes, you may look at the histogram for a variable and think it is skewed, but it is actually the effect of an outlier causing the skew!
The M/SD seem reasonable (I generated this data to have a M = 0 & SD = 1, so that fits). But my maximum value is 5!
It is possible that this case comes from an individual that doesn’t belong to our population of interest; we will need to make a decision about retaining this case in our model.
After identifying that we have a univariate outlier, we remove it from our analysis & decide if that makes a difference. Follow the flow chart available on L@G.

Multivariate outliers

At this stage, we have (1) removed data entry errors, (2) established linear relationships between our variables, (3) determined our residuals are nonnormal, & (4) investigated univariate outliers.
Now we must handle multivariate outliers: cases with unusual/extreme scores on at least two variables.
There are many ways to do this, but for 3003PSY we use Mahalanobis’ Distance.
See the tutorial materials & screencasts!

Don’t get stuck in a perpetual loop of outlier removal!

We only check for uni/multivariate outliers a single time.
If we removed 3 outliers & re-ran our identification protocol, we may well find that 3 new outliers have appeared!
Sometimes, people get stuck in a perpetual state of removing “outliers”, rerunning analyses, removing more “outliers”, etc. You want to avoid this!
In an ideal world, we don’t delete data unless we know for certain it comes from a population we are not interested in.
But we almost never have that certainty, so we tend to remove outliers to see if their influence is high.
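To make Mahalanobis’ Distance less of a black box, here is a minimal Python sketch of what the statistic measures for two variables. It is an illustration under assumptions, not the 3003PSY procedure (that is in the tutorial materials): the data are made up, the function name `mahalanobis_sq` is hypothetical, and no cutoff is applied because the slides do not state one.

```python
def mahalanobis_sq(x1, x2):
    """Squared Mahalanobis distance of each case from the centroid of two variables."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    # Sample variances and covariance (n - 1 in the denominator).
    s11 = sum((a - m1) ** 2 for a in x1) / (n - 1)
    s22 = sum((b - m2) ** 2 for b in x2) / (n - 1)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / (n - 1)
    det = s11 * s22 - s12 ** 2
    out = []
    for a, b in zip(x1, x2):
        d1, d2 = a - m1, b - m2
        # d' S^-1 d written out by hand for the 2 x 2 covariance matrix S.
        out.append((s22 * d1 ** 2 - 2 * s12 * d1 * d2 + s11 * d2 ** 2) / det)
    return out

# Made-up data: two strongly related variables, plus a final case that is
# within range on each variable but breaks the pattern (high x1, low x2).
x1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 9]
x2 = [1.2, 2.1, 2.9, 4.2, 5.1, 5.8, 7.1, 8.0, 9.2, 9.9, 2.0]

d2 = mahalanobis_sq(x1, x2)
for case, value in enumerate(d2, start=1):
    print(f"case {case:>2}: D2 = {value:.2f}")
```

The last case gets the largest distance even though neither of its scores is extreme on its own. The distance takes the relationship between the variables into account, which a variable-by-variable check cannot do.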
Skewness & Kurtosis

The final aspect of regression diagnosis involves looking at the skew and kurtosis of the variables.
We will decide if we need to transform anything now!
We check the shape of each distribution, and we get our skewness & kurtosis statistics.
We take the skewness & kurtosis statistics and divide each by its standard error to get a test statistic.
If this value exceeds 3.29 in absolute value, we consider the variable skewed or kurtotic!
As a final attempt to rectify the residuals, we transform our skewed variables.
Example: skewness / standard error = .722 / .241 = 2.99

Tabachnick, Barbara G. & Fidell, Linda S. 2007. Using multivariate statistics. 4th ed. Boston; London: Allyn and Bacon.

Transforming Moderate Skew

Tabachnick & Fidell recommend square-root transformations; in SPSS we would use syntax like this:
compute Dep_SQ = SQRT(Dep).
execute.

Transforming Severe Skew

Sometimes you get severely skewed data, and you will need to apply a “stronger” transformation.
In 3003PSY we use the natural logarithmic transformation:
compute Anh_LN = ln(Anh).
execute.
Example: skewness / standard error = 1.18 / .221 = 5.34

Transforming Negative Skew

If you have negatively skewed data, you need to reflect the raw data before applying the transformation:
compute ref_Y = 20 - Y.
execute.
compute Y_SQ = SQRT(ref_Y).
execute.
Then you just need to be mindful of your interpretations, as the scale is going to be in the inverse order:
High scores for the raw data = high depression
High scores on the reflected + transformed data = low depression
Interpreting the coefficients accurately is important, so you need to keep this in mind!
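The skewness check and the three transformations above can be tried on toy data with a short Python sketch. This is an illustration under assumptions, not the SPSS workflow used in the course: the scores are made up, the function names are hypothetical, and the formulas are the usual bias-adjusted sample skewness and its standard error, which should correspond to what SPSS reports but are not confirmed by the slides. The default reflection constant (maximum score plus one) is also an assumption about where the 20 in the syntax above comes from.

```python
import math

def skewness(values):
    """Bias-adjusted sample skewness (assumed to match the SPSS statistic)."""
    n = len(values)
    m = sum(values) / n
    s = math.sqrt(sum((v - m) ** 2 for v in values) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((v - m) / s) ** 3 for v in values)

def se_skewness(n):
    """Standard error of skewness; it depends only on the sample size."""
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

def skew_z(values):
    """Skewness divided by its standard error (the test statistic from the slides)."""
    return skewness(values) / se_skewness(len(values))

def reflect(values, constant=None):
    """Reflect a negatively skewed variable: constant - value (default: max + 1)."""
    c = max(values) + 1 if constant is None else constant
    return [c - v for v in values]

# Made-up, positively skewed scores (all > 0, so the natural log is defined).
scores = [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 8, 12, 20, 35]

for label, data in [
    ("raw", scores),
    ("SQRT", [math.sqrt(v) for v in scores]),
    ("LN", [math.log(v) for v in scores]),
]:
    z = skew_z(data)
    # 3.29 is the cutoff given in the slides.
    print(f"{label:>4}: skew z = {z:.2f}  flagged = {abs(z) > 3.29}")
```

Each step down the list (raw, square root, natural log) pulls the long right tail in further, which is why the log is described as the “stronger” transformation. As a sanity check, `se_skewness(100)` comes out at about .241, the standard error in the first example above, which would be consistent with that example having N = 100.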
Check to see if the transformations fixed your problem

Run your model again.
Check your residuals to see if they are now normal.
Check the specific predictors’ performance in the model (e.g., check if the transformed variable has become significant when it was previously non-significant).
See the tutorial notes + the screencasts.
If the transformations have no significant impact on the model, use the raw data.

Limitations to outlier identification & transformations

Outlier identification is a contentious area: there is a risk that researchers begin tailoring their samples to meet their expectations (over-fitting their model).
The outlier detection techniques are innately tied to the outlier.
We use the M/SD to determine if individual cases are outliers, but the M/SD are, by their very nature, influenced by the outliers.
Transformations fundamentally alter the data: they are no longer the raw scores.
The data now does not necessarily represent the raw scores we originally worked with.
Drawing inferences with the altered data may not be reflective of what is happening in the population.
We need to be mindful of the limitations to our approaches; we are trying to strike a balance between (i) fitting a model that makes sense given the limitations of our analytical approach & (ii) fitting a model that has utility in the world.

Summary of the decision rules: Univariate outliers

Do individual cases stand apart from the rest of the data or are they just extreme?
Does dropping the uni outlier make a difference to the nature of the results (e.g., X1 is no longer significant)?
If so, note the outlier(s) to be removed and report that the case was dropped.

Summary of the decision rules: Multivariate outliers

Are there any multivariate outliers present using Mahalanobis’ Distance?
If so, remove them (we always remove multivariate outliers) and make a note if the model changes.

Summary of the decision rules: Non-normality (skew/kurtosis)

Are any distributions skewed/kurtotic?
Does applying a transformation “fix” the residuals of the model and/or change the nature of the results?
If yes, note:
  The variable(s) that was (were) transformed
  How they were transformed (e.g., SQRT vs LN)
  Report the results with the transformation
If no, note:
  Report the results of the raw (untransformed) variable, note that a transformation was checked

Summary of decision rules

Ideally, we want to report results with as few alterations as possible.
We only make alterations when we absolutely need to!
It is hard to interpret transformed data: it loses some meaning because it isn’t the raw data!
Example write up of assumption checking is available on L@GU. Make sure you use it!
