Logistic Regression
Summary
This document discusses logistic regression, explaining the relationship between odds and probability and how to transform probability data. It outlines the model's functional form, describes how to interpret a logistic regression, and shows how to replicate the results from a real-world example using Python/Jupyter notebooks.
Full Transcript
We discussed the major shortcomings of the LPM due to the limited nature of the binary outcome variable. Logistic regression fits binary data better than the LPM because it provides a more realistic model for probabilities. As I mentioned, the most serious problem of the LPM was its functional form. We need to transform the functional form of the binary dependent variable on the left side of the linear equation one way or another. Here you can see the functional form of the logistic regression model, in which the dependent variable on the left-hand side of the equation has been transformed. The transformed dependent variable is referred to as the logit function. The logit function is just the log of the odds, and that's why we often call the logistic regression model the logit model.

Then what are the odds in this functional form? The concept of the odds is very similar to the concept of probability, as each describes the chance that a certain event will occur. In everyday speech we use the odds and the probability interchangeably in a vague and informal way, but we can tell the two concepts apart by how each is calculated. As you can see here, the probability is the ratio of something happening to everything that could happen. The odds, in contrast, are the ratio of something happening to something not happening. Both concepts deal with the chance of a specific event occurring, just from slightly different perspectives.

Let's take a look at this simple example. Visually, you can see that there is a total of five games, and my team will win only one out of the five. In other words, my team will lose four games. In this case, the probability of my team winning is 20 percent, which can be calculated by dividing the one winning game by the total number of games, which is five. Now, what are the odds in favor of my team winning? The chance of not winning is the same as the chance of losing, which is four, so we can say that the odds in favor of my team winning the match are one to four. Alternatively, we can write this as a fraction, one over four, which is the same as 0.25.

Here's another example in which my team performs much better than in the previous example. In this case, the probability of my team winning is 60 percent, which is equivalent to three over five. We can also say that the odds in favor of my team winning the match are three to two. Alternatively, we can write this as a fraction, three over two, which is the same as 1.5.

From these examples we can extract some insights regarding the major features of the odds in comparison to the probability. Basically, we can see that the odds in favor of my team winning get closer to zero as my team loses more often than not. Now, let's consider the situation where my team wins more than it loses, meaning that the numerator is larger than the denominator. The better my team is, that is, the more it wins more than 50 percent of its games, the odds of my team winning start at one and grow from there. So the range of the odds varies between zero and positive infinity, while the range of the probability varies between 0 and 1. Speaking of scale, I'd like to highlight another important issue about the scale of the odds, as it is the reason for taking the log of the odds as the functional form of the logistic regression.
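As a quick check on the arithmetic in the two win/loss examples above, here is a minimal Python sketch; the helper functions probability and odds are just illustrative names, not anything defined in the lecture.

```python
# A minimal sketch of the probability vs. odds calculation from the examples above.

def probability(wins, losses):
    """Probability of winning: wins out of everything that could happen."""
    return wins / (wins + losses)

def odds(wins, losses):
    """Odds in favor of winning: wins relative to losses."""
    return wins / losses

# Scenario 1: my team wins 1 of 5 games
print(probability(1, 4))  # 0.2  -> 20 percent
print(odds(1, 4))         # 0.25 -> odds of 1 to 4

# Scenario 2: my team wins 3 of 5 games
print(probability(3, 2))  # 0.6  -> 60 percent
print(odds(3, 2))         # 1.5  -> odds of 3 to 2
```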
This chart gives us a side-by-side comparison of three values, showing how each is scaled for the same record. For instance, take a look at the first row, in which my team wins one out of 100 games. It should be noted that the numerator in the odds is the number of wins while the denominator is the number of losses. As we observed previously, if my team loses more often, the odds of my team winning vary from 0 to 1, whereas if my team wins more often, the odds of my team winning vary from one to positive infinity, which makes the odds asymmetrical. If that isn't clear, take a look at the difference between 1 over 99 and 99 over 1 in the odds column. As you may remember, we didn't transform the right-hand side of the regression model, so it still takes the form of a linear equation. Without getting into too much mathematical detail, we can solve this scaling issue with the odds by taking the log of the odds, as the log makes everything symmetrical, as you can observe here. By taking the log of the odds, the logit function varies from negative infinity to positive infinity. To summarize: odds are just the ratio of something happening to something not happening, the logit is just the log of the odds, and the logit makes everything symmetrical. Once we transform the probability into the logit, we can fit the linear model.

Let's talk about the ideas behind logistic regression. The logit function can be treated as a continuous variable, as it varies from negative infinity to positive infinity, so we can basically fit a linear model once the binary dependent variable is rescaled into the logit function. In this regard, it is important to highlight that the useful concept of linear thinking isn't completely gone in logistic regression. However, we have one issue after rescaling the original probabilities into the logit function. Think about how you would interpret the regression coefficient of the logit model. While we can figure out whether the independent variable has a positive or a negative impact on the dependent variable, it's very hard to interpret the meaning of a one-unit change in the logit function. For the sake of interpretability, it's often recommended to transform the logit back to probabilities once we obtain all the desired parameters.

Now, let's take a look at the structural model of the logistic regression from the right-hand side. If you solve the logit function for the probability, you end up with a formula like this. If you love mathematics, I recommend solving the equation on your own and checking that you get the same result. However, I'd like to focus more on the conceptual insights of the logistic regression at this point. I noted that logistic regression solves the issues with the LPM, and if that is the case, it means we will get reasonable values for the probabilities once we transform the values from the logit function back to probabilities. For the probability to be used as an outcome variable for regression, we need to somehow constrain the probabilities so that they vary between 0 and 1. In that regard, we can take the exponential of the linear model, as e^(alpha + beta x) is always going to be positive. Then we need to restrict the range of e^(alpha + beta x) by dividing it by something slightly greater, which is e^(alpha + beta x) + 1.
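The symmetry of the log of the odds and the transform back to probabilities can be illustrated with a short sketch; the function names logit and inv_logit below are placeholders of my own, and the 1-out-of-100 and 99-out-of-100 values mirror the first and last rows of the chart described above.

```python
import numpy as np

def logit(p):
    """Log of the odds: maps probabilities in (0, 1) to (-inf, +inf)."""
    return np.log(p / (1 - p))

def inv_logit(z):
    """Solving the logit for the probability: e^z / (1 + e^z)."""
    return np.exp(z) / (1 + np.exp(z))

# The asymmetric odds from the chart: 1 win out of 100 vs. 99 wins out of 100
print(np.log(1 / 99))   # about -4.595
print(np.log(99 / 1))   # about +4.595 -> the log makes the scale symmetric

# Round trip: the logit and its inverse undo each other
print(inv_logit(logit(0.6)))  # 0.6
```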
Once again, this is basically the same as the logit model presented in the previous slide if you solve the logit function for the probability. As it turns out, we can cross-check that the logit model handles the issues with the LPM more effectively, as it produces reasonable values for the binary dependent variable in terms of the probability. Visually, we can also see how the logit model solves the major shortcomings of the LPM. Y in this chart refers to the conditional probability of the binary outcome variable given x. As you can see here, once we fit the logistic regression, the curved line from the logistic regression fits the binary data better than the LPM, and it gives you reasonable values for the probabilities.

Here we can see the results from fitting the logistic regression model using Python code. I will demonstrate how to fit the logit model later, but for now I just want you to focus on how to interpret the model. It should be noted that the model is basically the same as the LPM that we ran previously using the NHL 2017 regular-season record; in this case, however, we use the logit function as the outcome variable. At the top of the results table, you see that the table is titled Generalized Linear Model, or GLM. Without getting into too much statistical detail, this is because we fit the linear model by transforming the binary dependent variable into the logit function, which can then be fitted with a linear regression approach.

Let's dig further into the logit model. In essence, it looks very similar to the linear model, as it consists of alpha, which is a constant, and a regression coefficient beta for the given independent variable. Now, how would you interpret the model? Knowing that this is a linear model, it can be interpreted as follows: for every one percent increase in Pythagorean win percentage, the log of the odds of my team winning increases by 5.48. Then what do we mean by the log of the odds? It is very difficult to get a sense of the meaning of a 5.48 increase in the log of the odds, and that's the reason why we need to transform the logit function back to the probability, so that the interpretation makes more sense. Once we do that, we can basically plug in the respective Pythagorean win percentage to obtain the chance of winning, as you see here. That's pretty much all there is to logistic regression, and now let's go to the Jupyter notebook to replicate the results.
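Before moving to the notebook, here is a hedged sketch of fitting a logit model as a GLM with statsmodels, in the spirit of the NHL example described above. The data frame, the column names ('win', 'pyth_pct'), and the simulated values are assumptions made for illustration, not the actual NHL 2017 regular-season data or the lecture's code, so the fitted coefficient will differ from the 5.48 quoted above.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulated stand-in data: a binary outcome (win = 1/0) and a
# Pythagorean-win-percentage-style predictor (hypothetical column names).
rng = np.random.default_rng(0)
pyth_pct = rng.uniform(0.3, 0.7, size=200)
win = rng.binomial(1, 1 / (1 + np.exp(-(-3 + 6 * pyth_pct))))
df = pd.DataFrame({"win": win, "pyth_pct": pyth_pct})

# Binomial GLM with the default logit link: fits alpha + beta * pyth_pct
# on the log-odds scale, which is why the results table is headed
# "Generalized Linear Model".
model = smf.glm("win ~ pyth_pct", data=df, family=sm.families.Binomial()).fit()
print(model.summary())

# Transform the fitted log-odds back to a probability for interpretability,
# plugging in an example Pythagorean win percentage of 0.55.
alpha, beta = model.params["Intercept"], model.params["pyth_pct"]
log_odds = alpha + beta * 0.55
print(np.exp(log_odds) / (1 + np.exp(log_odds)))  # predicted chance of winning
```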