Ordered Logistic Regression in Python


Tags

ordered logistic regression, python programming, jupyter notebook, data analysis

Summary

This document describes the application of ordered logistic regression in Python, using a Jupyter Notebook. It covers data preparation with 2016 NHL data and the creation of a home dummy variable. The code shows how to fit an ordered logit model and interpret its results, including the thresholds between loss, draw, and win.

Full Transcript


Let's go through the ordered logistic regression in the Jupyter Notebook with me. As you can see in the notebook, the basic data preparation is very similar to the logit model, because, like I said, we are going to use the same independent variable, which is the Pythagorean win percent. We are also going to incorporate the home-field advantage as an extra independent variable. Basically, the data preparation process is the same as in the previous examples, but now we're going to use the 2016 NHL regular season records. If you want to challenge your coding skills, please pause the video for a while, do the following analytic tasks on your own, and then compare your version of the code with mine later on. I will go rather quickly.

As always, we need to import all the libraries, and then we're going to import the same dataset, which is the NHL dataset. It's always good practice to display the raw data and check that you have all the information. Again, we are going to use the 2016 regular season records to fit the ordinal regression model. Now, let's check whether or not you got the right result: you can take a look at the size of the raw dataset, get the types of the data you're reading, and obtain the simple descriptive statistics as we did before. The data is loaded without any issues, so we can move on to the next stage of the data analysis.

Again, we are going to create the home dummy variable so that we can incorporate it to improve the performance of our model later on. We're going to use the same command, get_dummies, on the home_away column to create the home dummy variable, which indicates whether the team played at home or away. Then we merge it into the raw dataset, and we are going to calculate the Pythagorean winning percentages. We have to order the dataset sequentially, get the cumulative statistics for goals for and goals against at the team level, and then compute the Pythagorean winning percentages. That's basically the code we have practiced a lot; I hope you remember all of it. Then we can take a look at the resulting DataFrame as always.

Now our dataset is ready for the regression analysis. To run an ordered logit regression model in Python, we need to install a new library called bevel, which provides the functions to fit the ordered logistic regression model, and it also gives us the options to obtain the desired parameters. We can import the bevel library here. Once you have imported the libraries for the ordinal regression, you can use ol.fit. The basic structure of the model is very similar to the way we fit the logistic regression: we use ol.fit and specify the independent variable and the dependent variable, and then we can run this command. As you can see, the resulting table does not give you a clear picture of the ordered logit model; it's hard to figure out the meaning of beta and the thresholds. Actually, there's a better way to present the parameters of the model. Here we can obtain all the parameters, such as the regression coefficient for the given independent variable and the intercepts for each ordinal outcome. Basically, the intercepts define the thresholds between a loss and a draw, and between a draw and a win.
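A minimal sketch of this preparation and fitting step, assuming a dataset with columns named home_away, team, date, goals_for, goals_against, and win_ord (0 = loss, 1 = draw, 2 = win), the usual exponent of 2 in the Pythagorean formula, and bevel's OrderedLogit with fit and print_summary as referenced in the lecture; the file name, column names, and import path are assumptions and may differ from the actual notebook:

```python
import pandas as pd
from bevel.linear_ordinal_regression import OrderedLogit  # pip install bevel; import path may differ by version

# Load the 2016 NHL regular-season records (file and column names are assumptions).
nhl = pd.read_csv("nhl_2016.csv")
print(nhl.shape)       # size of the raw dataset
print(nhl.dtypes)      # data types of each column
print(nhl.describe())  # simple descriptive statistics

# Home dummy: 1 if the team played at home, 0 if away
# (assumes home_away holds the strings "home" / "away").
nhl["home"] = pd.get_dummies(nhl["home_away"])["home"].astype(int)

# Order the games sequentially, then build cumulative goals for / against per team
# and the Pythagorean win percent (assuming the usual exponent of 2).
nhl = nhl.sort_values(["team", "date"])
nhl["cum_gf"] = nhl.groupby("team")["goals_for"].cumsum()
nhl["cum_ga"] = nhl.groupby("team")["goals_against"].cumsum()
nhl["pyth_pct"] = nhl["cum_gf"] ** 2 / (nhl["cum_gf"] ** 2 + nhl["cum_ga"] ** 2)

# Fit the one-predictor ordered logit: win_ord is 0 = loss, 1 = draw, 2 = win.
ol = OrderedLogit()
ol.fit(nhl[["pyth_pct"]], nhl["win_ord"])
ol.print_summary()  # beta plus the two thresholds (loss/draw and draw/win)
```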
Again, the beta indicates the regression coefficient for the Pythagorean win percent, and we have two thresholds: one between a loss and a draw, and one between a draw and a win. You can also get the standard error for each parameter.

As we did in the previous example with logistic regression, we can always obtain the linear predictor from the fitted model. First, we get the linear predictor by using the Pythagorean win percent: we take the equation given by the parameters we just obtained and plug in the respective Pythagorean win percent to get the linear predictor for each game. Again, it's hard for us to interpret the meaning of the log of the odds, so we transform the logit back to probabilities to make sense of the results. Here we have three different outcomes: win, draw, and loss. Based on the probabilities attached to each outcome, we take the one with the highest probability as the predicted game result. We can use the predict_class function to get the fitted ordinal outcomes. It should be noted that the value assigned to each outcome is the same as the way the actual ordinal outcome is encoded in our dataset in the win.ord column: a loss is encoded as zero, a draw as one, and a win as two. This way we can compare the fitted outcomes with the actual outcomes easily.

Now we create a new DataFrame containing the fitted ordered outcome, obtained from the highest of the three outcome probabilities. Then we attach the fitted results to the existing DataFrame so that you can compare the fitted probabilities with the actual outcomes in the win.ord column. We compare the fitted results with the actual outcomes and see whether or not we got the right result. You can see the true values at the very right end of the DataFrame, and then we can obtain the success rate. The success rate of our fitted ordinal regression model is about 60.3 percent.

Now, the bevel library gives us a very easy way to obtain the fitted probabilities and fitted results, but we can also obtain all the results manually. As we went through previously, we can directly apply the model to obtain the fitted probabilities and fitted outcomes. Here you see the Python code, which is basically the Python version of the mathematical formula we obtained from fitting the ordinal regression model, covered previously. This way, we can obtain the fitted probability of each outcome. As we run this line of code, we obtain the fitted probabilities for each outcome of the dependent variable, and then we can classify the game result based on the probabilities attached to each outcome. We can also compare the fitted results with the actual outcomes and see how accurate our model is. Basically, we get the same result, because this is the same model we fitted previously, so we get the same success rate. But I just wanted to show you that there is a way for you to apply the model parameters directly to obtain the fitted results.
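A sketch of that manual calculation, using the standard cumulative-logit formula and continuing from the earlier sketch's nhl DataFrame; the beta and threshold values below are hypothetical placeholders, not the actual estimates, which you would read off the summary table:

```python
import numpy as np
import pandas as pd

def ordered_logit_probs(xb, alpha_1, alpha_2):
    """Cumulative-logit probabilities for a three-level outcome (0 = loss, 1 = draw, 2 = win)."""
    p_le_loss = 1.0 / (1.0 + np.exp(-(alpha_1 - xb)))  # P(y <= loss)
    p_le_draw = 1.0 / (1.0 + np.exp(-(alpha_2 - xb)))  # P(y <= draw)
    return pd.DataFrame({
        "p_loss": p_le_loss,
        "p_draw": p_le_draw - p_le_loss,
        "p_win": 1.0 - p_le_draw,
    })

# Hypothetical placeholder values: substitute the fitted beta and thresholds.
beta, alpha_1, alpha_2 = 8.0, 3.5, 4.5

xb = beta * nhl["pyth_pct"]                        # linear predictor for each game
probs = ordered_logit_probs(xb, alpha_1, alpha_2)  # fitted probabilities per outcome

fitted = probs.values.argmax(axis=1)               # most probable outcome: 0, 1, or 2
success_rate = (fitted == nhl["win_ord"]).mean()   # share of games classified correctly
print(success_rate)
```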
Now, we can always try to improve the model performance by incorporating the home-field advantage, so now we have two independent variables, and we are going to use both of them to explain the outcome variable.

We can obtain all the parameters again: the regression coefficient for the Pythagorean win percent, and another regression coefficient for the home dummy variable, which is basically the fixed effect of playing at home. We also have the two thresholds between the ordered outcomes. We can directly apply the model parameters to obtain the fitted probabilities for each outcome of the dependent variable, classify the game result based on those fitted probabilities, compare the fitted results with the actual outcomes to see how accurate our model is in predicting the ordered outcomes, and then get the success rate. The first model predicted roughly 60 percent of the games correctly; with our second model, which incorporates the home-field advantage, the success rate has improved slightly. The latter model explains the data slightly better than the previous one, which makes sense because, again, home-field advantage is one of the most reliable predictors of game results. That's basically all about the ordinal regression model. Now, in the last section of this week, I'm going to talk about the basics of forecasting in a real-world setting.
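For reference, a minimal sketch of the two-predictor model discussed above, continuing from the earlier sketches and assuming bevel exposes a predict_class method as referenced in the lecture (method name and column names are assumptions):

```python
# Refit the ordered logit with the home dummy added as a second predictor.
ol2 = OrderedLogit()
ol2.fit(nhl[["pyth_pct", "home"]], nhl["win_ord"])
ol2.print_summary()  # beta for pyth_pct, beta for home, and the two thresholds

# Compare in-sample accuracy of the one- and two-predictor models.
acc_1 = (ol.predict_class(nhl[["pyth_pct"]]) == nhl["win_ord"]).mean()
acc_2 = (ol2.predict_class(nhl[["pyth_pct", "home"]]) == nhl["win_ord"]).mean()
print(f"Pythagorean only: {acc_1:.3f}  |  with home dummy: {acc_2:.3f}")
```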
