Ordered Logit Regression in Python
Full Transcript
Now we're ready to move on to estimation. As we did in week one of this course, we use an ordered logit regression in Python. We run this code to access the ordered logit, and then this command fits the regression based on the win value. Bear in mind what ordered logit means: there are three possible outcomes, and from the model we generate a likelihood of each outcome based on the ratio of the team wage values.

When we run the regression, we find the coefficient on the team wage ratio. The beta is 0.76, and the standard error is 0.03, which is very small, so we have a p-value that is essentially zero. This is a regression that fits very well, and that shouldn't surprise us, because we've already seen that wages are generally a very, very good predictor of outcomes in English soccer, and it turns out probably in many other leagues as well.

In an ordinary least squares regression, the constant would represent home advantage, since we're viewing everything in our data from the perspective of the home team. What we get with the ordered logit regression is really three possible regions: a region where the likely outcome is a home win, a region where it's a draw, and a region where it's an away win. That's what the intercepts do: these constants tell us the boundaries between those regions, and of course the actual prediction is a combination of the impact of those boundaries plus the impact of the wage ratio, the team ratio values.

We can obtain those coefficients here: the beta, which is the effect of the team ratio, and the two intercepts, which define the boundaries between the possible outcomes. These boundaries would define the probabilities exactly if both teams spent the same amount on wages; otherwise, imagine the prediction shifting up or down according to how much each team spends relative to the other. We can also obtain the standard errors for these if we want, and you can see that in each case they are very small, so our coefficients are all statistically very significant, which is reassuring.

We can now generate the predictions, based on the formula implied by the ordered logit regression. These lines of code reflect the definition of the probability from an ordered logit; we plug in the values of the coefficients, which have names in Python that we can use here, and when we run this you can see the predictions derived from our model. If you look, for example, at the first row, the game with Aston Villa as the home team and Arsenal visiting, our predictions are not quite the same as the bookmakers', but they're not too dissimilar. For a draw, the bookmakers had just under a 29 percent probability, and we've got slightly lower, 28.6 percent, but pretty close. For an away win, the bookmaker had 43 percent, and our prediction is just under 40 percent. The bookmakers' prediction for a home win is essentially 28 percent, and our prediction for a home win is 31.5 percent.
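For readers following along in code, here is a minimal sketch of this estimation step, assuming statsmodels' OrderedModel. The DataFrame `games` and the column names `win_value` (0 = away win, 1 = draw, 2 = home win) and `wage_ratio` are illustrative stand-ins, not the course's actual variable names:

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# games: one row per match, seen from the home team's perspective.
# win_value codes the outcome (0 = away win, 1 = draw, 2 = home win);
# wage_ratio is the home team's wage bill relative to the away team's.
y = games['win_value'].astype(pd.CategoricalDtype([0, 1, 2], ordered=True))
model = OrderedModel(y, games[['wage_ratio']], distr='logit')
result = model.fit(method='bfgs', disp=False)
print(result.summary())  # slope ~0.76 with s.e. ~0.03 in the lecture

# The slope plus the two cut points that bound the three outcome regions.
beta = result.params['wage_ratio']
cuts = model.transform_threshold_params(result.params)  # [-inf, c1, c2, +inf]
c1, c2 = cuts[1], cuts[2]

# The probability formula implied by the ordered logit: with xb = beta * x,
#   P(away win) = L(c1 - xb)
#   P(draw)     = L(c2 - xb) - L(c1 - xb)
#   P(home win) = 1 - L(c2 - xb)
# where L is the logistic CDF.
def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

xb = beta * games['wage_ratio']
games['p_away'] = logistic(c1 - xb)
games['p_draw'] = logistic(c2 - xb) - logistic(c1 - xb)
games['p_home'] = 1.0 - logistic(c2 - xb)
```

Calling `result.predict(games[['wage_ratio']])` should return the same three probabilities directly; the manual version just makes the role of the two boundaries explicit.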
You can see that these differences are not actually that great for those particular observations, but of course we want to see how close they are in general, on average, taking into account all of the games that were played. We're going to do two things. First, as we did with the bookmaker data, we ask how many correct predictions there are, in the sense that the outcome was the same as the outcome predicted with the highest probability. To do that, we first need to identify the highest-probability outcome, which we do in this line: we find the highest probability, maxprob, in our data. Then we identify the result as the outcome whose probability equals maxprob and assign the value H, D, or A based on that match. So we've got what we've called logitpred, which gives the prediction from our ordered logit model.

As we did with the bookmakers' data, we can now look at when our logitpred matches FTR, the actual result of the game. We assign a value of 1 when they match and 0 otherwise. Having recorded how often our model picked the correct outcome, we take the mean across all of our predictions to measure the model's accuracy. The mean of logittrue is 0.528; in other words, 52.8 percent of the time our model got it right. How does that compare with the bookmakers? Going back, we found the bookmakers got it right 54 percent of the time, so our model is really very close to the bookmakers in terms of predictive accuracy.

What about the Brier score? We can calculate it for our model just as we did for the bookmakers. The Brier score for our model is 0.584; the bookmakers' Brier score was 0.569, so again, pretty close. That should give us reasonable confidence that ours is a pretty good model, in the sense that we can get close to the bookmakers: not quite as good, but close.

How close is close? One way to get a sense of it is this: suppose you took away two percent from the probability of a home win for each game, added one half of that (one percent) to the probability of a draw and the other half to the probability of an away win, and then compared Brier scores. The difference in Brier scores from that small shift would be greater than the difference we actually see between our model and the bookmaker odds. It's equivalent to saying we are within one or two percent of the bookmaker predictions, which is pretty impressive.

Of course, when I say impressive, it doesn't mean you could make money on this; you'd lose. You're still not as good as the bookmakers, so on average you'd lose money, though not very much, and certainly a lot less than if you just picked at random. You would definitely lose money, because you still have to pay the overround to the bookmakers, who will still make their money on the bets, and you'd have to pay any tax on your winnings, so overall you'd still be down.
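As a rough sketch, the accuracy and Brier score steps described above might look like the following, reusing the hypothetical p_home/p_draw/p_away columns from the earlier snippet; FTR ('H', 'D', or 'A') is the full-time result column mentioned in the lecture:

```python
# Highest-probability pick for each game, mapped to H / D / A.
probs = games[['p_home', 'p_draw', 'p_away']]
games['maxprob'] = probs.max(axis=1)
games['logitpred'] = probs.idxmax(axis=1).map(
    {'p_home': 'H', 'p_draw': 'D', 'p_away': 'A'})

# 1 when the model's pick matches the actual full-time result, else 0.
games['logittrue'] = (games['logitpred'] == games['FTR']).astype(int)
print(games['logittrue'].mean())  # ~0.528 in the lecture

# Brier score: mean squared distance between the predicted probabilities
# and the realised outcome (1 for what happened, 0 otherwise).
outcome = pd.get_dummies(games['FTR']).reindex(
    columns=['H', 'D', 'A'], fill_value=0).astype(float)
brier = ((probs.to_numpy() - outcome.to_numpy()) ** 2).sum(axis=1).mean()
print(brier)  # ~0.584 for the model vs ~0.569 for the bookmakers

# The lecture's sensitivity check: move 2% of probability off the home win,
# split it between draw and away win, and see how much the Brier score moves.
shifted = probs.to_numpy() + np.array([-0.02, 0.01, 0.01])
brier_shifted = ((shifted - outcome.to_numpy()) ** 2).sum(axis=1).mean()
print(brier_shifted - brier)
```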
But you will certainly have a model that gets you close to what the bookmakers are doing. You might wonder why we're able to get this close; one reason might be that this is the way the bookmakers themselves arrive at their betting odds. Bear in mind that if we know the ratio of the team wage values is a good predictor of the outcome, we should expect the bookmakers to know that as well; after all, they've spent a lot of time trying to figure these sorts of things out. One way to test that is to see how often our predictions coincide with the bookmakers', and that's what we've done here: we calculate the percentage of times our prediction is the same as the bookmakers' prediction, and it turns out that 88.5 percent of the time our choice was the same as theirs. That doesn't prove it, but it does suggest quite strongly that the way the bookmakers think about this is very similar to the way we've been thinking about it in our modeling.

That shows that we have a fairly good model for within-sample prediction. The next step, of course, is to go outside of our sample and predict something that hasn't yet happened, and we'll work through an exercise for doing that in the next session.
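The agreement check could be computed along these lines; the bookmaker probability columns (b_home, b_draw, b_away) are hypothetical names standing in for probabilities already derived from the odds earlier in the analysis:

```python
# Bookmakers' highest-probability pick, built the same way as logitpred.
bookprobs = games[['b_home', 'b_draw', 'b_away']]
games['bookpred'] = bookprobs.idxmax(axis=1).map(
    {'b_home': 'H', 'b_draw': 'D', 'b_away': 'A'})

# Share of games where the model and the bookmakers pick the same outcome.
print((games['logitpred'] == games['bookpred']).mean())  # ~0.885 in the lecture
```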