Regression Assumptions Lecture PDF
Summary
This lecture covers the topic of regression assumptions, and how they can be assessed, in the context of a psychology research project. It discusses methods to troubleshoot and work through potential violations of those assumptions.
Full Transcript
So, regression assumptions. You're obviously familiar with them: you did them in your t-tests, in your ANOVAs, correlations and so on. So, regression assumptions. I don't really need to tell you what they are, but for those who need a little reminder, they're really a set of mathematical formulas that we put our data into. They assume certain things about our data, and those assumptions usually need to be met in order for us to run certain analyses. For regressions, which is obviously what we're looking at this trimester, we've got linearity, normality of residuals, homoscedasticity, and independence of residuals. We're mostly focusing on the second and third ones today, normality of residuals and homoscedasticity, mostly because they're the ones that can be more formally tested. OK, and all we do to get this is just request more information from syntax. When we do have violations in our data, or an assumption isn't met, this can be due to multiple different reasons, so we're going to look at some of those. It could be entry errors; there could be outliers causing the skew. There are quite a few things. So we're going to look at some of those and how we can address them, because sometimes there are things we can do to, not fix the data, but tinker with it a little bit, rather than just having to throw the whole thing in the bin. So, when assumptions are violated: I mean, like I just said, we could just throw the whole thing in the bin. But the first thing we really want to do is investigate the nature of our variables. Where is it coming from? Why is the assumption not being met? This is something I don't think we've really looked at when you've done assumptions in previous years. For example, in ANOVAs in stats 2, one of the assumptions is homogeneity. You would just say it's either met or not met. If it wasn't met, we didn't expect you to go into the data file, find the participants, and look at why, and so on. That's what you will be doing in this course: what is causing assumptions to be violated? OK. And this is also the first time you're really getting to see data that looks like actual data. In first year and second year, it's just a couple of columns, everything's fine, and the assumptions are met, so you can kind of skim past that and then do the ANOVA. Real data is very, very messy. Especially, who wants to do honours next year, or do honours at some point? OK, so you will have the magical experience of accessing the first-year psychology research participant pool. Luckily for you, they really, really care about your data. They really, really care about your research, and they're going to take a lot of time to do your study, and your data is going to be perfectly normal and fine. Not. It's going to be a nightmare. It's going to be an absolute mess. They don't care about your study; they just want their credit points. Most of them are just going to click "strongly agree" to everything, or start to make pretty pictures. That is going to be really messy data. So think of your third-year data set as a bridge between what you've been exposed to in first and second year and a real data set. OK, it's starting to get messier, but it's also starting to get much more realistic. And the first years, you know, they don't care, because you were at that point once and you didn't care either. You just wanted the credit point. OK, it's karma; it comes around.
It comes full circle, so welcome. But sometimes it's not something like, you know, "we need to throw the data away." Sometimes it is just a couple of first-year students who didn't care, and they're just outliers in your data set. And that can be fortunate, because we can just exclude them, filter them out, and then our data's normal. OK, so sometimes it's not the entire data set. But that's what I mean by starting to investigate what's causing assumptions to be violated. So, things like sampling issues, measurement issues, small sample size. Did you recruit members of the wrong population? Things like that can also cause violations. Traditionally, or with an older approach, when assumptions had been violated, the go-to used to be "I'll just transform it." For those of you who did your tutorials this week, you would have looked at transformations and reflecting. For those of you who have also looked at the module content, it used to go straight to that: let's just transform it. But a more active approach, like I said, is better: looking at the data, trying to understand what it represents and why, and using targeted solutions to address the issues. OK. The other thing I want to point out here is that a lot of people say they like statistics because it's black and white, at least relative to the other psychology courses. But you're also going to see today, and hopefully throughout this course, that there is actually much more grey area than you were originally led to believe. For example, if you just think about a bell curve and I ask you if it's skewed or not: sometimes it's kind of skewed, kind of not. There's a subjectivity to it. OK, so there's much more grey coming into the scene here. And the other thing to think about is that SPSS doesn't have context for your research. It doesn't know what you're trying to do. It doesn't know what your population is. It just gives you the numbers. You can ask it for anything, and it will give you the numbers. Whether they're meaningful, whether they're helpful, whether they have important implications for society, that's something you infer from those numbers. OK, so these are the guidelines and the order in which you would check assumptions, and obviously these are all things you should be doing in your assignment, which is why this is going to be a really helpful class today. First, we check for data entry errors. Then we check scatter plots to examine the relationships between variables; that's going to help us look at the assumption of linearity. Then residuals, univariate outliers, multivariate outliers, and skew and kurtosis. That's where we may or may not apply transformations, and reflect the data if that's necessary, and then check the transformations. Yes, I will be uploading these slides at the end of the day. So, we're going to go through these one by one. The first one is data entry errors. This is pretty obvious. For example, in this picture here, there's someone who's 118 years old. This is obviously an error. I mean, I guess some people have lived to 118, but this is most likely an error, and you can't go in and fix it. You can't assume "oh, they must have meant 18," even though it does kind of seem like that's what the age demographic was and they've just accidentally clicked the one twice.
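[Editor's note: that kind of screening, flagging impossible values without editing them, is easy to script. A minimal sketch in Python follows (the course itself works in SPSS syntax); the file name and the column names (age, and five co1 to co5 items on a 1-to-7 scale) are hypothetical stand-ins, not the assignment's real variables.]

import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical file name

# Flag implausible ages rather than silently "fixing" them.
print(df.loc[(df["age"] < 16) | (df["age"] > 100), "age"])

# Flag impossible scores on a 1-to-7 scale.
items = ["co1", "co2", "co3", "co4", "co5"]
bad = (df[items] < 1) | (df[items] > 7)
print(df.loc[bad.any(axis=1), items])

# Treat an impossible entry as missing; keep the rest of the participant's row.
df.loc[df["age"] > 100, "age"] = float("nan")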
We can't make inferences like that and go in and fix their data for them. You would remove this, not the entire participant, but you would remove this entry, as if they had forgotten to answer the question. OK, so you wouldn't delete the entire participant in this situation, but you wouldn't go in and fix it yourself either. You would just remove that entry. Some students have noticed, by the way, I'm just going to give heaps of tips in this lecture, that there is a data entry error for year of birth. I think one of them is 2022, so obviously the participant was apparently two years old. That's not possible. And my response was that age is not one of the variables in your research question. It's great that they found it; it means they're really looking at the data and examining it thoroughly. But it wasn't one of the research questions or the variables we were interested in, so you wouldn't need to touch that at all. However, maybe if, for some random reason, that two-year-old participant also said "prefer not to say" for education (when you're coding that, it would be listed as system missing, so now they have missing data in a variable we are interested in), you might delete them. I don't know. But anyway, something like that could come up in your data. The other place where entry errors can occur is in your actual scales, and this is something you could and should check. For example, if I had a scale measured on a 1-to-7 Likert scale, and I found a participant with a score of eight, that would obviously be an impossible score if the maximum was seven. I would go in and check that, and probably delete it. It is pretty rare, especially if they're doing an online survey where you click the options; it wouldn't be possible to get an eight. But it could have been an accident by the researcher or something. So, for the variables we are interested in, you should be checking that there are no impossible scores, and then checking for entry errors in the other demographics we were interested in. That's kind of the easiest one to check. Our next assumption is linearity. This requires your predictor variables to be linearly, that's the word, associated with the outcome variable. So that means you should be checking five of these; well, actually four, sorry, because one of your variables is dichotomous. So: your four continuous predictor variables, and whether they have a linear relationship with the continuous outcome variable. OK, this is an assumption because regressions are based on correlations. Obviously you've been looking at the correlation coefficient, and correlations are inherently related to the assumption of whether there's a linear relationship between the two variables. Like I was saying before, it's important that you apply context to this. SPSS will let you put a line through anything. OK, it will let you do that, and it will go "yep, there you go, here's a line through it." That obviously does not mean the relationship is linear. This one is very clearly curvilinear, where participants on this side have a negative correlation between the two variables and participants on this higher end have a positive correlation. OK, so this is a curvilinear relationship. You wouldn't be able to do a regression if you had this type of relationship; you would need to do a more complex analysis.
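[Editor's note: the linearity eyeball check is one scatter plot per continuous predictor against the outcome. A minimal sketch, with hypothetical variable names standing in for the assignment's predictors and outcome:]

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("survey.csv")
predictors = ["risk_perception", "egoistic", "biospheric", "identity"]

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, name in zip(axes.flat, predictors):
    ax.scatter(df[name], df["pro_env_behaviour"], s=10)
    ax.set_xlabel(name)
    ax.set_ylabel("pro_env_behaviour")
fig.tight_layout()
plt.show()  # eyeball each panel: straight-line trend, or curvilinear?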
OK, so this is the second thing, sorry, that you would check after data entry errors, for your four continuous variables. You cannot check it for education, because it's a dichotomous variable. Everyone OK so far? Excellent. Residuals. So, we've checked data entry errors, and our scatter plots show linear relationships. It's time to run your regressions and look at your residual plots. There are two things we need to look at with our residuals: the histogram, where we need a normal distribution of our residuals, and homoscedasticity, which is equal variance of the scores across the entire plot. Has anyone actually started doing this with the assignment yet? Yep? OK, cool. So we want our residuals to be normal; obviously, we want it to look normal. That should be obvious. This is something you're going to have to eyeball. The main thing we're looking for in your assignment is, when you make decisions, can you justify them? Are you providing a justification for the decisions you came to? So, for example (this is obviously Course, which was one of the subscales of the attitudes-towards-statistics scale you've been looking at): hands up, who would say this is normal? Hands up, who'd say it's skewed? OK, I'd say a third of you said yes, a third of you said no, and a third of you did not reply to me. The point is, I would actually be happy with either, but it's the justification: the presence of some type of remark as to why it looks fine overall. And this again goes back to the point that sometimes there's just grey in statistics. There just is. The next thing we have when looking at our residuals is the scatter plot of them. This is assessing the assumption of homoscedasticity. We want it to look like that one; that one looks pretty good. We don't want it to look like that. OK, there are two different things you should be checking here. One is the concentration of the residuals, and the other is the width of the spread. As you can see on this left one, there's a high concentration of residual points over on the left side, and there's also a really wide spread. What this is actually telling you is whether your variables can explain your outcome variable consistently at each point of your dependent variable. So, in the context of your assignment with pro-environmental behaviour: if I had heteroscedasticity, it would mean my predictor variables (risk perception, identity, blah blah blah) can explain someone who is very frequently engaging in pro-environmental behaviour better than they can explain someone who is not, or only very seldom, engaging in pro-environmental behaviour. So: can it explain variance in your outcome variable equally across its continuous scale? That's a little bit confusing, but does that kind of make sense? Because it might explain high engagement better than it explains low engagement, basically, and you would want it to explain them equally. When it's not equal, you can see there's this fanning effect, so that's something else you would need to note. And again, it's about justification, using words like "a fanning effect," "concentrated," "a wide spread on this side; therefore the data was heteroscedastic." I can't say that word. "There was heteroscedasticity." Sorry.
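[Editor's note: for reference, producing those two residual plots takes only a few lines. A minimal sketch, assuming hypothetical variable names in the regression formula:]

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

df = pd.read_csv("survey.csv")
model = smf.ols(
    "pro_env_behaviour ~ risk_perception + egoistic + biospheric"
    " + identity + education",
    data=df,
).fit()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.hist(model.resid, bins=20)                      # want a rough bell shape
ax1.set_title("Histogram of residuals")
ax2.scatter(model.fittedvalues, model.resid, s=10)  # want an even band, no fanning
ax2.axhline(0, linestyle="--")
ax2.set_title("Residuals vs predicted")
fig.tight_layout()
plt.show()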
OK, so let's just say our residuals were violated; say our histogram, like this one going back here, was really skewed. Obviously that one's fine, but let's just say it was really skewed. What would you do? The first thing is that we're going to look at our univariate outliers, because sometimes it can just be one or two pesky participants skewing the whole thing. So, for example, in this data file (can everyone see him? Everyone can see him? Cool), this could be a participant skewing our entire data set, and if we removed them, our data would then be normal. But it's not that simple, for a couple of reasons. First, I want you to start thinking about the ethics of removing participants, because that can be a very slippery slope into model overfitting and deleting participants until you get the significant result you want: "that person doesn't match what I want," or "they said my intervention didn't work, I'm going to delete them, I'm going to delete them." OK, so there's an ethics to it. The main reason you would remove a participant, or a good question to ask yourself, is: is this person representative, and do they actually belong to the population I'm trying to measure? Because if they don't, that would be an ethical justification for removing them. So, as you can see here, it's normal, and we've got this one person. We need to investigate this case: why is it so far away from the rest of the distribution? If this was standardised, where the mean was set to zero and we had a standard deviation of one, you can see they're not too far off that. The other thing we can look at, though, is the frequencies of the values. It's sitting approximately around zero; like I said, we've standardised it. A minimum of negative 0.255 is fine, but you can see we have a maximum of five, which is putting it all the way out here. And if we look at our frequency distribution table, we've got that one participant sitting at five, but the next one down is at 2.41. That is a huge jump in values, from 2.41 up to five. So this participant is really sitting further away from the rest of the sample, and likely causing the skew. So we could say it's possible they're simply someone I'm not interested in, and there's nothing inherently wrong with that. It might not be a data entry error. For example, say I was looking at people aged 18 to 25, and a person didn't read that, and they're 60 years old and took my survey. There's nothing inherently wrong with being that age; it's just that they're not the population I was interested in, and that's why they're sitting out there. Is that clear for everybody? Excellent, thank you for the feedback. So it's possible the case, like I said, doesn't belong. After identifying them, we decide if they make a difference. This is what we're going to do, and this is what I want you to do in your assignment: once you have identified someone as a univariate outlier, run your regressions and check your normality with and without them. Don't just look at this; actually run your regression without them. That's going to be a really big thing here. If I run my regression with them and my result is not significant, and then I remove them and it becomes significant, or the other way around (it's significant with them, and then you take them out and it becomes non-significant when you wanted it to be significant), that can be annoying.
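[Editor's note: the with-and-without comparison itself is mechanical. A minimal sketch for a single flagged case; the row label (42) and variable names are hypothetical:]

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("survey.csv")
formula = ("pro_env_behaviour ~ risk_perception + egoistic"
           " + biospheric + identity + education")

with_case = smf.ols(formula, data=df).fit()
without_case = smf.ols(formula, data=df.drop(index=42)).fit()  # 42 = flagged case

# Compare the overall model p-value and the per-predictor p-values.
print("overall:", with_case.f_pvalue, "->", without_case.f_pvalue)
print(pd.concat([with_case.pvalues, without_case.pvalues],
                axis=1, keys=["with", "without"]))
# If anything crosses the .05 line in either direction, filter the case out;
# otherwise keep them in and move on to the next flagged outlier.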
But if there is a change where it crosses over that threshold of 0.05 based on whether they are included or not, remove them. That's also been a frequent question in the inbox: what if I have lots of univariate outliers, three or four? Do I just remove them all at once? No, because it's possible that some of them aren't causing that significant change in your data and some of them are, and if you removed them all at once, you wouldn't be able to tell. OK, so take one, run your regressions with and without them; if it makes no change (if it's significant with and without them), keep them in. Take your next univariate outlier, run your regression with and without them; if it makes a change, remove them. One at a time. OK, that's been a question we've been getting a bit, and that's why you wouldn't just remove them all at once: one at a time, check if it changes your regression. Yep? SPEAKER 1 [inaudible: asks about a result that wasn't significant and then became significant] SPEAKER 0 Yeah, if there's a change in either direction, remove them: from significant to not significant, and also from non-significant to significant. Remove them. Any other questions? SPEAKER 2 Yes. So say we have the first one, it has an effect, we remove it. Then we come across a second one. Would you then include that first outlier again when testing the second one? SPEAKER 0 Say it again? SPEAKER 2 So, let's say you've identified the first one, it changed the results, you removed it. You get to your second univariate outlier. When you're running the with-and-without for that one, are you including the first one? SPEAKER 0 I know what you're saying. No. Once you've removed them, you move on to your next one, because otherwise there would be hundreds of possible combinations. So I get what you're saying: no. Make a decision, remove it, and look at your next outlier with this modified data set. The thing is, what I want you to do is just look at the univariate outliers that were originally univariate outliers. If you run it and you have four, they're the four you're going to examine. Because (and there is a slide on this coming up) you can get stuck in a loop where, OK, now I've removed those two participants, and now when I look, it could create new outliers based on this new data set, and now I've created more. And then you're stuck in an iterative process: every time I remove someone, there are new people now on the edges. Just do it once. Take your outliers; don't add outliers as you're removing them. Does that make sense? It's a very good question, and there is a slide on that. Otherwise, you're just going to get stuck in this continuous loop of finding new ones, deleting them, finding new ones, deleting them. Cool. Everyone OK with univariate outliers? Awesome. The next thing we're going to look at is multivariate outliers. What's the time? Multivariate outliers. Multivariate outliers are people where it's not always just that they have an extreme score on multiple scales; it can also be that their combination across multiple variables is unusual. That's why it's different. It's not just that they are a series of univariate outliers; sometimes it's the combination that makes them weird.
OK, so at this stage, moving through it: we've checked if we have data entry errors, we've checked linearity, we've checked residuals, we've checked univariate outliers. Now we're going to look at our multivariate outliers. We're going to be checking this through Mahalanobis distance. Your tutors may have mentioned this to you, but there are quite a few ways to check for multivariate outliers: there's Cook's distance, there are studentized residuals, and some other ones. In this course, you're just learning Mahalanobis, OK? Oh, and that's Mahalanobis on the right, because, you know, these very smart people existed. The person on the left (I found this photo and wanted to add it to the slides) is Sir Ronald Fisher. He popularised the p-value; he didn't invent it, but he popularised it. He's also responsible for a test called Fisher's exact test, an extension of the chi-square test, which you'll be looking at next week. But he was kind of a racist; he researched eugenics a lot. But yeah, Mahalanobis is on the right. Smart guy. Sorry, I just like the photo. OK, so this would be an example of an unusual case. Let's say someone was interested in the correlation between age and yearly income. Can everyone see this dude up here? OK, so there's nothing inherently wrong with being 10 years old, and it might not be that they weren't interested in 10-year-olds. So it's not like "oh, that person wasn't part of the population I was interested in" in terms of age; they could have been looking at people aged 10 to 50. And again with income: there's nothing inherently wrong with having an income of $80,000. Some people here are getting close to that. It's the combination that makes it weird: they're a 10-year-old making $80,000 a year. And the other thing is that it might not be an error. It's not impossible; you know, there are children of famous people, and famous YouTube kids, the ones that review toys and stuff. So it's not an error. They weren't a mistake, they didn't misread the study's age group, and they are part of the age range we're interested in. But they're clearly an outlier. There is some debate as to whether you should just flat-out always delete multivariate outliers. In this course, we are of the argument: yes. We spoke about this a little bit yesterday, because with the univariate outliers I said delete them, then run your regression, and if it changes, keep them out, or don't. If you have somebody who exceeds that Mahalanobis cut-off, so they're a multivariate outlier: delete them. You don't need to run your regression with and without them and see if it changes. Delete them. You would have learned in your tutorials this week how we determine if someone is a multivariate outlier. It's that equation: it's based on the number of predictor variables you have, so in this context it's five, and you can also adjust the p-value; I think we're all using 0.001 in this course. And it gives you a cut-off of twenty-point-something-something. I did hear it, but I just forgot immediately. So, if somebody exceeds that value, delete them immediately, OK?
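[Editor's note: a minimal sketch of the Mahalanobis check: squared distances for each case across the predictors, compared against a chi-square critical value with df = number of predictors at p = .001. The predictor names are hypothetical.]

import numpy as np
import pandas as pd
from scipy.stats import chi2

df = pd.read_csv("survey.csv")
X = df[["risk_perception", "egoistic", "biospheric",
        "identity", "education"]].dropna()

centred = (X - X.mean()).to_numpy()
inv_cov = np.linalg.inv(np.cov(X.to_numpy(), rowvar=False))
d2 = np.einsum("ij,jk,ik->i", centred, inv_cov, centred)  # squared distances

cutoff = chi2.ppf(1 - 0.001, df=X.shape[1])  # about 20.5 for five predictors
print("cut-off:", round(cutoff, 2))
print("flagged cases:", list(X.index[d2 > cutoff]))  # filter these out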
You don't need to do as much exploration as you do with your univariate outliers. Some people think you should, but in this course, just delete them. They're usually, almost always, problematic, and it would be really difficult because, in this example, they're unusual on just two variables, but they can be unusual on at least two: they could be on $80,000, 10 years old, and five foot seven. You would have to go through and find which variables they're behaving weirdly on, and what the combination is. It can just get a bit too complex; it's not expected of you. Just remove them, OK? Moral of the story: yes, they're just problems. They're just going to cause difficulties in your data, and we're going to assume we're not interested in them. OK, any other questions? Cool, awesome. And this is obviously what it would look like; that would be a Mahalanobis value. Like I said, yours is going to be twenty-point-something-something, and it would have to be greater than that, but in this example theirs was one, and then you can see that that person's was four. Their Mahalanobis was four, so there was a very large gap between them and the next person at 1.5. OK, and you would just use the filter. When I say delete or remove, I mean filter them out; don't actually delete them. This goes back to something we mentioned earlier: do not get stuck in a perpetual loop of outlier removal. If we remove three and then rerun our entire identification protocol again, we'll find that three more have appeared. Sometimes people get stuck in that loop. Don't do that. In an ideal world, we don't delete data unless we know for certain it comes from a population we are not interested in, but we almost never have that certainty, and this is where it comes back to ethics and model overfitting and things like that. So don't get into that loop; just work on the original outliers you identify in your first pass. OK, the final aspect is skew and kurtosis. This is for all of our variables, all of them except education; you obviously can't check the normality of a distribution for a dichotomous variable. So that includes your four continuous predictor variables and your outcome variable of pro-environmental behaviour. You want to check the normality of them. In this example they're looking at depression, and you can see there's maybe a little bit of skew, but we're going to look at this one a bit more formally with that value we can calculate. Did you guys do this in class this week, calculate the skewness? You've definitely done it before. Yep? OK, awesome. This is also where transforming comes in. So, for those who need a recap: you divide. Please divide; a lot of people forget to divide and just report this value, for example 0.722. It's 0.722 divided by the standard error, and that gives you your skewness value. Please also note that your assignment doesn't ask you about skewness and kurtosis; it just asks about skew, namely because skewness and kurtosis are inherently related. If your distribution is skewed, it's also going to be leptokurtic. So we're just interested in skew. Also, in this course, sorry, we're using a cut-off of 3.29. This is another area in research that's a little bit grey. A lot of people do use 3.29; it's not something we just made up for this course or for Griffith.
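[Editor's note: a minimal sketch of the calculation: the skewness statistic divided by its standard error, compared against 3.29. The 0.722 figure echoes the lecture's example; the sample size is hypothetical. The SE formula is the usual large-sample one reported alongside skewness in descriptives output.]

import math

def se_skewness(n: int) -> float:
    """Standard error of skewness for a sample of size n."""
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

skew_stat = 0.722   # from the descriptives output
n = 155             # hypothetical sample size
z = skew_stat / se_skewness(n)
print(round(z, 2), "significant skew" if abs(z) > 3.29 else "normal enough")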
There's a lot of literature that uses 3.29, but there are other cut-offs; even 3, some people suggest. So again, it's a bit of a grey area, but for your assignment we want you to use an absolute cut-off value of 3.29. If you calculated it and it was under this, it means your variable is normal enough and you do not need to transform it. For example, if it was 2.99, it's obviously fine. However, let's say you did have a variable that was very skewed, or multiple of your variables were skewed. What do we do? Well, we can try to transform it. So this paper here, sorry, is a very widely cited one. If you look it up on Google Scholar, it has something like 26,000 citations; it's a very widely cited textbook, kind of like a Bible. You can see what they suggest doing for skews, so we're going to look at that a bit more. When you have a moderate skew, we would request, like they do here, that you attempt a square root transformation and see what it does. You learned this in the tutorial. Yes, there is ambiguity; there are no cut-offs as to what constitutes moderate versus severe skew. Use your judgement. What the transformation does is squish all of the values closer together; that's kind of what it does when it transforms it. So you can see in this one, where the square root transformation has been applied, that the values on the y-axis are much smaller: here it went from about 7 up to 17; here it's 2 or 3, all the way up to 4. So the values are much smaller. But as you can also see, it didn't really do much; it still has a bit of a skew to it. So you would simply recalculate: go back, look at the descriptive statistics for your transformed variable, recalculate the skewness value, and if it's now under 3.29, that helps you determine whether it's normal. But hold that thought for a second. If your variable is severely skewed (and again, there's no strict cut-off for that, but the example here is 5.34, so that's a good approximate figure to think about), we apply a stronger transformation, the log transformation. OK, as you can see here (this one is measuring anhedonia, the inability to feel pleasure), it's been transformed, the numbers are smaller, but the curve still looks pretty similar. What we were talking about in class yesterday, and this is obviously a very big question, is: do I just use the transformed variable? What some of you may have found is that you have a skewed variable (completely hypothetically, let's just say you had a skewed variable), you transformed it, and when you recalculated your skewness it was now under 3.29. Would I now use the transformed variable? I do have an opinion on this, but it's up to you. Obviously, normality is an assumption, so the fact that your transformed variable is now normal is good, because it's meeting that assumption. However, there are quite a few other things to consider. First (and we'll talk about this in a little bit), we like to keep the data as representative of and reflective of the participants as possible. If you're transforming it, this person's seven is now a two.
You're getting further and further away from what the participant actually reported, which is not really what we like to do. The other reason is that the numbers are much harder to interpret. For example, let's say this was age rather than depression, and I was interested in people aged 18 to 50. If I applied a transformation and looked at my mean, my mean is going to be something like three. So it's much more difficult to interpret the values of a transformed variable; that makes sense. The main thing I want you to look at and ask yourself is: does it actually change the regression results? So, let's say you have a skewed variable, you transform it, you calculate the skew, and it's now normal. That's great; it's good that it's normal. But let's say your regression results are the exact same: it's significant whether it's transformed or not, or it's non-significant whether it's transformed or not. Considering that it's now harder to interpret, and you're further away from what the participants reported, their raw scores, was it worth it? Was meeting the assumption of normality worth it when the regression output is the same (your significance, or lack of significance, is still the same), your data is now harder to interpret, and you've kind of messed with their data a little bit? Now, if the regression results changed with the transformation (if you used the raw data and it was, for example, non-significant, and then you ran the regression with the transformed data and it was significant), that would be fine; you can use it. And we've spoken about it: we're not going to mark down people who do use transformed data. Again, going back to what I said at the very start, it's about justification. If you said "I applied it, then I checked my skewness, and now my skewness cut-off is met, so I'm meeting the assumption of normality," you've provided some justification. OK, we're not going to mark you down either way; it's really about whether or not you're providing that justification to us. Hopefully that's fair and makes sense, and I've also now given you justification for whatever you choose. The other thing you would have learned in your class is that you cannot transform a negatively skewed variable. It just doesn't work. So you need to do this lovely thing called reflecting. That's the code for it, obviously, and that's the variable. The number 20 there, which we would have discussed in tutes, is the maximum possible score plus one. So if I had a 1-to-7 scale, and it was an average (so the minimum someone could have got was one and the maximum was seven), then I would put eight: maximum score plus one. So it's tricky, because what you're going to do is take the raw composite you've made, reflect it, and then make a third variable that's the transformation of the reflected variable. Then run your regression with it, and if your results are the same, just use the raw one you made at the start. But your marks are for showing us that you did that: you thought about this, you tried this, you reflected it, you made the decision. It was normal, it was not normal, the regression changed, the regression didn't change, therefore I did this. OK, sometimes it's all for nothing. Well, maybe it did change your regression results, and then it's not for nothing; but sometimes it is, and whether it is or not, you're going to have to find out yourself.
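[Editor's note: a minimal sketch of reflect-then-transform for a negatively skewed composite on a 1-to-7 scale. The column name "course" is hypothetical; the constant 8 is the maximum possible score plus one, as described above.]

import numpy as np
import pandas as pd

df = pd.read_csv("survey.csv")

df["course_ref"] = 8 - df["course"]                # reflect: 7 -> 1, 1 -> 7
df["course_ref_sqrt"] = np.sqrt(df["course_ref"])  # moderate skew
df["course_ref_log"] = np.log10(df["course_ref"])  # severe skew

# Re-check the skew of each version before deciding which to carry forward.
print(df[["course", "course_ref", "course_ref_sqrt", "course_ref_log"]].skew())
# Remember: after reflecting, high scores mean the opposite of the raw scale.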
Yes? SPEAKER 2 So, to show that we followed all that process and it didn't change anything with any sort of transformation, we're still writing that down step by step so you're able to see it? SPEAKER 0 Yeah, yeah. The other thing, why this is bad (not bad, but tricky, and you need to be mindful of it): when you do reflect, the interpretation of the values is the opposite. So, for example, if I was measuring depression or anxiety, where a higher score would usually indicate more depression or more anxiety, a higher score would now indicate less depression or less anxiety if you've reflected it. Because that's what reflecting is: you're making the inverse. If it's a negative skew, you've inverted it, so now it's a positive skew and you can transform it. But now all your values mean the inverse thing. Cool. Any questions? I feel like there are no questions because it's just a lot. It's confusing. Are we OK? You're doing great. You're here; that's the main thing. You showed up. Actually, it's 2:50. Do you guys... OK, 3:01. You guys look about 2% better than you did before the break. Someone kind of just wanted to clarify something, and I'm happy to clarify it. When you're looking at your dependent variable, there are really going to be two histograms, because you're going to have your residuals (that's not the one we're doing transformations on; that's your residuals, which is error), and then you're also going to have your normal one, which is the actual data points. OK: you've got your residual histogram, and that's something you just eyeball. You don't transform or do anything with the residuals one. And then you have your normal one, the actual data points. That's the one where you can use the cut-off and then apply transformations, reflect, and so on. OK, not the residuals. Sorry if that confused anyone even further. So: run your model again and check your residuals to see if they're now normal. You can't transform the residuals themselves; transforming the normal data points will change the residual histogram. That's what I'm saying: you transform the normal data points, and that may or may not change the histogram of the residuals, and the scatter plot as well. Then check how the specific predictors performed in the model. This is kind of what I was saying before: check whether the transformed or untransformed version changed your regression results. Another question I got in the inbox was: does that mean the overall model significance, or the coefficient of that specific predictor? We mean both. If it changed the overall regression significance, or just that specific variable's significance: changes to either of those would constitute deleting the participant or using the transformed data. OK, so there are kind of two p-values you would look at. Does that make sense? Yes? SPEAKER 1 [inaudible: asks about rerunning the regression when you transform a variable] SPEAKER 0 Yes, and no. I get what you're saying. I wouldn't transform a variable and then look at the regression with everything except it. That doesn't matter; transform all the variables you need to change, if it's more than one, and run the regression with your new model. No? What do you mean, how do you find it? SPEAKER 1 Like, you mean, like, the ANOVA?
SPEAKER 0 In the regression: you know how you have the ANOVA table, and it tells you, and then you have the coefficients table, and it gives you a significance value. So, for example, if I transformed egoistic values and it changed the ANOVA p-value, or it changed its p-value in the regression coefficients table: either of those ones. Cool. And like I said, if the transformations have no significant impact on the model, use the raw data. But as I also said, I'm not too fussed if you do use the transformed data and make the argument that it's because it made the distribution normal. Yeah, as long as you justify it. As you can see, there are quite a few limitations. I don't want to spend too much time on this, because I feel like I've pointed them out a few times throughout the lecture. Outlier identification is a contentious area. Like I said, you can start to get stuck in this loop where, when you delete some, new outliers can appear; you don't want to get stuck in that loop. The other issue is that outliers are calculated based on the mean and the standard deviation, which were themselves calculated in the presence of those outliers. So you're determining which ones are outliers based on a calculation that included them; they're inherently connected, which can be a bit confusing and is often cited as a limitation. Transformations: I've clearly gone over the limitations of those. The data no longer necessarily represents the raw scores, what the participants reported. Reflecting the data can be a bit confusing. And obviously it can be more difficult to draw inferences if, like I said, you were interested in a population aged 18 to 50, you had to transform it, and now your mean age is three. So it can be trickier to draw inferences. We need to be mindful of the limitations. It's really a balance between getting our model to fit (we don't want to overfit it) and not making our data so far away, so transformed and reflected and it did three cartwheels, that it's not really representative of what the participants actually said at the start. So it's a balancing act between those two things; that makes sense. Right. This is a decision tree that Natalie Loxton made for you guys. It's in your tutorials folder, I think with your tutorial PowerPoint and PDF stuff. Please use it when you're working through your assignment and you're not sure what to do. You have an outlier, you don't know what to do: use it. It's very, very helpful. OK, so this is kind of your path through it. Summary of the decision rules. Univariate outliers: do the individual cases stand apart from the rest of the data, or are they just extreme? We can't, in good ethics, just delete someone because we don't like them or they don't coincide with what we wanted to find. We're asking ourselves if they belong to the population we're interested in, or if they're causing undue influence in our data set. Does dropping them make a difference to the nature of the results? If so, note the outliers to be removed, and report whether the case was dropped or not dropped. Multivariate outliers: are there any, using Mahalanobis distance? If so, remove them. Natalie Loxton was very much of the belief that we always remove them, so in this course we're always going to remove our multivariate outliers, and make a note if the model changes. And some more decision rules, for normality: are there any skewed distributions?
So, skew and kurtosis: does applying a transformation, quote-unquote, "fix" the residuals of the model and/or change the nature of the results? Those are really the things we're interested in. If yes, then note the variable or variables that were transformed and how they were transformed (so whether you used a square root or a log), and report the results with or without the transformation, depending on whether it actually made a difference. If no, report the results for the raw scores and note that a transformation was checked, and why you did the transformation, which would have been the non-normality. I would expect you to report the skewness value that originally prompted you to decide to do a transformation in the first place. OK, is everyone OK? Kind of, but not really? That's fine. Summary of the decision rules: ideally, we want to report results with as few alterations as possible. Hopefully I've made that very, very clear by now. We only make alterations when we absolutely need to. It is hard to interpret transformed data; it loses meaning because it isn't the real, raw data. Ah, this is a write-up. You already have access to it; it's in your tutorial content. It's an example of assumption checking. This is, I think, also on your tutorial files page somewhere. OK, your assignment is due on the 17th, so two weeks and one day, at 5 p.m. There's a huge FAQ; please look at it. I'm already starting to just refer people to it when they send emails that are already answered in the FAQ. Please read it before you send an email, just to make sure we haven't already answered your question, because I'm just going to refer you to it. For formatting, I've also put an answer in the FAQ, but you'll note that there are marks allotted to APA. Even though it's not a lab report, obviously things like font size matter. You have to do a correlation table, so I would expect that to be in APA format as well. There are some questions that ask you to use in-text citations, and I would expect those to be in APA format. You know, there are quite a few things we would still expect you to have done correctly. OK, so these are some (I'm just going to sit; I'm really not feeling well today, and I made these slides late Tuesday night, so I'm sorry if they're just not funny) FAQs that are more commonly asked. Some of them I've pulled from the FAQ but expanded upon; some are questions we're getting in the inbox a lot; and some are questions brought up at yesterday's lecture. So, one question we're getting is: do I report the alphas, correlations, and means before the data is cleaned, or after? The answer is after. OK (it doesn't make sense... I'm going to stand; I missed you), it doesn't make sense to be like, "these are my means, the average of risk perception was six-point-something," then "I deleted 10 people, and now these are the regression results." Does that make sense, why that would read weird? You would clean the data and then report your descriptive statistics and your inferential statistics with the exact same people. It would be weird to have descriptive statistics that used 155 people when you've since cleaned your data and now have however many remain. Clean the data, and then report your Cronbach's alphas, your correlations, and your means with that final data set. The other reason we do this, and this is quite important, is that sometimes your Cronbach's alpha, for example, can look unreliable.
You could get an alpha of, like, 0.65, which is obviously below the reliability threshold of 0.7. But it could just be an outlier causing it; they answered weirdly. And so now you think your questionnaire isn't good: "oh, I'm going to need to delete some items, I need to rewrite the items, maybe they were unclear." But really it was an outlier, and if you remove them and then run your alpha, now it's acceptable. OK, so that's another reason why it wouldn't make sense to report things like that before you've cleaned the data. So, some of you might need to go back now and redo your alphas and the whole correlation table. What do I mean by composite? I mean average. OK, I mean average, not the sum. I've just always kind of assumed, and all the tutors did as well, that when someone says composite, we mean average. One of the main reasons we want you to use the average, though, is because when you create a mean in syntax in SPSS, it will give a participant a mean even if they didn't answer every item. For example, egoistic has five questions; if someone missed the first one, it would still make an average for that person based on their responses to items 2, 3, 4 and 5. However, if you use the sum, it's only going to give you a value for the participants who answered every item. So in that case, I wouldn't have a score for the participant who didn't answer the first egoistic question. Obviously, more participants is better. It's fine if they didn't answer one; I can still create a meaningful average from the other four questions they did respond to. If you use the sum, you're going to have way fewer participants than someone who uses the average. OK? Clear? Good, good, good. Come on. Why is the Cronbach's alpha different in the code book from what I'm getting? That's because those are the ones from the authors. The Cronbach's alphas in the code book are the ones the authors found when they made the questionnaire. We didn't give you the alphas from your data; you need to do that yourself. That's why, if you're getting values different to the ones in the code book, it's because the ones in the code book are from the paper that's referenced, not from your data. They're included for a couple of reasons: in research, it is good practice to include that information. It shows the reader you didn't just grab the first scale off Google. You said, "I'm going to do a study, I want to measure risk perception, and I'm going to make sure I find a good, reliable scale to use for my research." So we write things like: this scale was made by this person; it has seven items on a 1-to-7 scale from strongly agree to disagree; when they made it, the alpha was 0.78; in my study, it was also reliable, at 0.75. OK, question six is going to be similar to what I just said: it asks you to cite the alpha in the code book and your own, in phrasing kind of like I just used. Does that make sense? It says "where applicable" because I know one of the scales, access to resources, doesn't have an alpha, because I wrote those questions. "None of the items in the data set have R on them; is this an error?" No, this is not an error. This is because, in the code book, the R flags the items that are negatively phrased. You need to make a variable that reverse-scores them. We've given you the raw data; we haven't reversed any items for you.
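[Editor's note: a minimal sketch of reverse-scoring and compositing. The item names (gsi1 to gsi6 on a 1-to-7 scale, with gsi6 negatively phrased) are hypothetical stand-ins for the code book's R-flagged items; the mean/sum contrast mirrors the point made above.]

import pandas as pd

df = pd.read_csv("survey.csv")

df["gsi6r"] = 8 - df["gsi6"]  # reverse a 1-to-7 item: max score + 1, minus score

items = ["gsi1", "gsi2", "gsi3", "gsi4", "gsi5", "gsi6r"]
# Mean composite: averages whatever the participant answered.
df["gsi_mean"] = df[items].mean(axis=1)
# Sum composite: goes missing if any single item is missing.
df["gsi_sum"] = df[items].sum(axis=1, skipna=False)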
And if I'd already called the items GSI6R, then what would you call them when you need to go and reverse them? So I've called it GSI6, so that you can reverse it and call your new variable GSI6R. That makes sense, right? So that's why there are no Rs in the syntax or the output (sorry, in the data file) until you do it yourself. And remember to do that before you make the composites, obviously, so that you're using the right ones in your composites. I've already mentioned this one: "Applying the transformation to my skewed data makes my distribution normal. Does this mean I should use the transformed variable in my analysis?" It's up to you. This page just lists the pros and cons of each, which I've discussed, and you're welcome to use those in your justification for whatever decision you come to. We're not going to deduct marks either way, as long as there is some type of justification provided. Hopefully that's very clear. Oh, another question that came up quite a bit: in question two, what does it mean by "operationalised"? This should be, like, max two sentences. Question one should be one sentence; question two should be two sentences. I want you to think (because I can't give you the answer straight; it'd be too easy) about all the different ways someone could measure anxiety. OK? Think about all of the different ways I could operationalise anxiety. I could measure it through heart rate, like putting sensors on for heart rate. I could measure blood pressure. I could measure sweat. I could measure observational things, like they're facing away from me. I could measure it through self-report. That's how it would be operationalised; those are all the different ways anxiety could be operationalised. How are your variables operationalised? It doesn't need to be paragraphs. You can also use dot points for some of the questions. I wouldn't use them for the larger ones, like the question (I think it's six) that asks you to summarise your whole set of regression assumption diagnostics. But for a question like "what are the alphas?", you could just say: dot point, risk perception, alpha equals this, blah blah blah. So for some you can use dot points; just use common sense about which ones. OK, I've left a little bit of time now before the quiz for other questions. Hopefully there are fewer now that I've covered the very frequently occurring ones. But please ask questions; please don't think they're silly questions or stupid questions. There's no such thing. Yes? SPEAKER 1 What were you saying about headings, with the APA? Do we use a table? SPEAKER 0 Yeah, don't use a table. You should have a title, which would be your level-one heading, and that's in bold. And then, in APA, your level-two heading, which is left-justified, in bold. That's going to be your question, and then the text starts on the next line, indented. Yes, but it doesn't count towards... it won't count, as you were saying. SPEAKER 1 [inaudible: asks whether the whole of question 11 is a level-two heading] SPEAKER 0 A level-two heading? Yes. Yes. Good question. Yes? SPEAKER 1 [inaudible: asks about composites when a participant skipped an item] SPEAKER 0 So it will still create a composite for them based on the items that they did answer. No, no, no: a composite for that person.
So if ego has five questions and they didn't answer one, it's still going to create an average for that person based on the questions they did answer. So if you look at your composite column, they're going to have a number there. SPEAKER 1 [inaudible: asks what happens to the mean when a participant skipped a question] SPEAKER 0 Sorry, what did you say? Yes, it will exclude the dot. It won't count it as, like, a zero or something and give them a weird number; it will just exclude the dot. Did that answer your question? OK, cool. And again, this is a little hint: I don't think there are any, because the only reason someone would not have a composite is if they didn't answer any of the questions in a scale, which I don't think happened. It didn't happen. The only one would maybe be your dichotomous variable, because it was only assessed with one question, and there were three options: yes, no, prefer not to say. And obviously, if you're making it a dichotomous variable, you're only really including the yeses and the nos. I think three was "prefer not to say"; three would get coded as system missing, and then that entire participant would get excluded. Is that helpful? Possibly too helpful. I'm going to regret this, but it's OK. Any other questions? I can't tell if you guys are tired, or just a little broken, or all over it. Yes, totally fine, I love it: consult, consult. And I have a consult as well; the tutors and myself have consults. My consult is on Fridays at two, so if you want to come to my consult, we can chat on Friday about that. You're very welcome. I don't know if I'm, like, worried or happy that you don't have questions. I think I'm worried. No, I'm happy. I'm worried. I'm both. Thank you, thank you, thank you. I really appreciate it. SPEAKER 0 OK, do you guys want to do a fun little quiz now? It's not very long, because you've only really been given two weeks of content since the mid-trimester exam: the multiple regression week, and then this week on regression and diagnostics. For those who don't want to play, or don't have a smartphone or a computer to do it on (and even if you do), I'm going to upload these questions as a PDF when I upload the PowerPoint anyway, OK?