Logistic Regression Model Replication PDF
Princess Nourah Bint Abdulrahman University
Summary
This document describes the replication of a logistic regression model using Python. The code and methodology used for model creation and evaluation are provided, with Python libraries like scikit-learn being highlighted.
Full Transcript
Now we will work in the Jupyter Notebook to replicate the logistic regression model that I just explained. Before you run the logistic regression, we need to import a new library, which is called scikit-learn, to fit the logistic regression. We need to import these libraries first, and then we're all set. We already manipulated all the variables and organized the dataset, so the only thing left to do is to run the logistic regression.
On the first line of the code, you can see that the model is specified so that the binary win variable is our dependent variable and Pythagorean win percent is our independent variable. That is basically the same as the LPM that we practiced together, but now we are going to fit the logistic regression model. As you can see on the second line of the code, the basic structure is simple and very similar to the linear regression, but now we use GLM, which stands for generalized linear model. Basically, we use the GLM command, specify the model, and pass the data. Then we specify the type of distribution, which is binomial, and fit the logistic regression. Then we can print the result, so we just run this line of code.
Here you can see the coefficients for the constant and the regressor, and you can also see the number of observations, the p-values, the standard errors, and so forth. Here's another way to obtain all the parameters from the logistic regression model. Now we can use the fitted model to calculate the probability of winning each game. You can just run this line of code, and the values you see as a result are the fitted probabilities from the logistic regression.
Based on the probabilities of winning, we can create a binary win variable, with 1 indicating a win and 0 indicating a loss. We are going to use the fitted probabilities to create the fitted win variable. Once we have obtained all the fitted probabilities and fitted values, we can evaluate the accuracy of our fitted outputs in comparison to the actual outputs. There are several ways to do that, and we will cover a couple of handy pieces of code to evaluate the performance of the fitted model.
First, we can use the confusion matrix to evaluate the accuracy of the model. We need to import confusion_matrix from the scikit-learn library. To create the confusion matrix, you call confusion_matrix and provide the actual and fitted outputs as the arguments. Run this line of code. In this case, the resulting confusion matrix is two-by-two, with the actual outcomes arranged along the rows and the fitted values arranged along the columns, so that we can cross-check the outputs simultaneously. We can look at the values on the diagonal of the matrix: the values in the upper left and lower right positions are the numbers of correctly predicted games in each category. So our model predicted 736 wins and 745 losses correctly. By using the information from the confusion matrix, we can get the success rate as well. Our logistic regression model predicted roughly 60.3 percent of the results correctly.
We can get a more comprehensive report on the classification with the classification_report command. You can use this line of code, and you can see the column under precision here. This is the success rate for the losing games, and this is the success rate for the winning games. Our model predicted roughly 60.4 percent of the losing games correctly.
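As a small illustration of the evaluation steps, here is a sketch with eight made-up games. The arrays actual_win and fitted_prob are hypothetical, and the 0.5 cutoff for turning a probability into a fitted win is an assumption, since the transcript does not state the threshold used in the notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical actual outcomes and fitted probabilities for eight games.
actual_win = np.array([1, 0, 1, 1, 0, 0, 1, 0])
fitted_prob = np.array([0.81, 0.35, 0.42, 0.66, 0.58, 0.20, 0.73, 0.49])

# Fitted win variable: 1 when the fitted probability is at least 0.5 (assumed cutoff).
fitted_win = (fitted_prob >= 0.5).astype(int)

# Rows are actual outcomes, columns are fitted outcomes.
# With 0/1 labels, scikit-learn lists class 0 (loss) first, then class 1 (win).
cm = confusion_matrix(actual_win, fitted_win)
print(cm)

# The diagonal holds the correct predictions for each class.
correct_losses, correct_wins = cm[0, 0], cm[1, 1]
success_rate = (correct_losses + correct_wins) / cm.sum()
print(success_rate)

# Precision, recall and F1 score for each class.
print(classification_report(actual_win, fitted_win))
```

In this toy example six of the eight games sit on the diagonal, so the success rate is 0.75.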
Meanwhile, our logistic regression model predicted about 60.3 percent of the winning games correctly. Now, we can improve the model performance by incorporating another explanatory variable, which is the home-field advantage. But before we fit the model, let's visualize the data and see whether it's plausible to incorporate the home dummy variable into the logistic regression model. Let's run this code. The resulting graph is a mirror image for the two categories, home and away, and it seems that teams win more when they play at home than when they play away. So it seems plausible to add another explanatory variable to improve the model performance.
Then we will incorporate the home team advantage into the logit model and evaluate the performance of our model again. The Python code is very similar to the previous code that I just demonstrated. If you would like to do this on your own, I want you to pause the video and do the following tasks on your own. Basically, the other Python code is the same as the code we used to fit the logistic regression, except that we have one additional independent variable here. We can simply run this piece of code. As a result, you can see the coefficient for the home dummy variable here as well. We can also obtain all the parameters separately by using the print function, and, as we did previously, we can obtain the predicted probabilities from the model that we just fitted. Then we can also obtain the fitted binary win variable, with 1 indicating a win and 0 indicating a loss. We can create the confusion matrix to evaluate the performance of the model. Here you can see that the model with the additional explanatory variable, the home team advantage, predicted 751 winning games correctly and 769 losing games correctly.
We can also use classification_report to get the specific rate for each category. When we calculate the success rate, it seems that model two worked slightly better than model one, as it predicted about 61.9 percent of the total games correctly. That makes sense, because home-field advantage is one of the most reliable predictor variables in the world of sports.
We have one more topic to cover before you move on. As I mentioned at the very beginning of the course, the most important theme of this course is to develop a reliable forecasting model. We will discuss this topic in great detail later this week, but I just want to draw your attention to the meaning of forecasting in a real-world setting. Here I'm going to demonstrate the standard steps often used in forecasting. The purpose is to fit the logit model on the first half of the regular season data to predict the later half of the regular season data. In other words, we will fit the regression model and use the parameters obtained from the training dataset as a basis for forecasting the second half of the regular season. If you are familiar with the terminology often used in the machine learning community, the subset of data used for the regression is called the training data, and the data used to validate the trained model is called the test dataset.
Let's practice together. First of all, we will split the data into two subsets, the training dataset and the test dataset. We know that the NHL regular season generally runs from early October through early April, so we can use the games played in the calendar year 2017 as a training dataset for forecasting the games played in the calendar year 2018. In order to do that, we can extract the year from the date column. As a result of this code, you will see the year column extracted from the date column.
Then we're going to use the year column to extract the first half of the regular season, which is all the games played in the calendar year 2017. Then we're going to use that data to forecast the later half of the season, I mean, the games played in the calendar year 2018. Here is the way to extract the games played in 2017, and here is the way to extract the games played in 2018. As a result of the previous code, we get the datasets here. The two datasets are roughly similar in size, so we can make a meaningful forecast.
Then we're going to fit the logit model using the subset of observations corresponding to the games played in the calendar year 2017. Here you can see all the parameters used to obtain the linear predictor behind the fitted probabilities. We are going to fit the logistic regression model by using the training dataset. Once you run this model, you will see all the parameters. Then, by using the parameters, we obtain the fitted values, and we can get the fitted results here. By using the parameters obtained from the logistic regression, we then obtain the fitted probabilities for the later section of the dataset, which is the test dataset. We can obtain the fitted binary variable here as well. Then we can create the confusion matrix. Using the model from the training dataset, we predicted 402 winning games and 383 losing games correctly. You can get a similar result by using classification_report as well, and then we can calculate the success rate. Our model predicted roughly 59.9 percent of the game results correctly.
Now, before we move on, I just want you to think about how practical our model is. The everyday meaning of prediction is to state the expected outcome of an event before it happens. It means the same as forecast.
In statistics, by contrast, prediction is often used to mean simply the fitted value from a model. In our example, the model using Pythagorean win percent cannot give us a prediction in the everyday sense. Why? Because we need to know the goals for and against, which happen at the same time as the game. In other words, in real time, it's not possible to use Pythagorean win percent to predict the game results, even if we use the model obtained from the previous dataset. For our model to be predictive of future events, we need a plausible independent variable that can be obtained beforehand. We'll extend our discussion of forecasting from this takeaway.
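To make the timing problem concrete, here is a small sketch on made-up numbers. It assumes the common form of the Pythagorean win percent, goals for squared divided by the sum of goals for squared and goals against squared, computed on cumulative season totals. The transcript does not show the exact definition used in the course, so both the exponent and the cumulative treatment are assumptions, as are the column names.

```python
import pandas as pd

# Hypothetical goals for and against in one team's first five games.
team = pd.DataFrame({"goals_for": [3, 1, 4, 2, 5],
                     "goals_against": [2, 3, 1, 2, 4]})

# Season totals up to and including each game.
cum_gf = team["goals_for"].cumsum()
cum_ga = team["goals_against"].cumsum()

# Assumed Pythagorean form: GF^2 / (GF^2 + GA^2).
# This version includes the goals scored in the game being "predicted".
team["pyth_pct"] = cum_gf**2 / (cum_gf**2 + cum_ga**2)

# Shifting by one game keeps only information available before the game starts.
team["pyth_pct_before"] = team["pyth_pct"].shift(1)

print(team)
```

The first game has no earlier information, so its pyth_pct_before is missing. That is exactly the sense in which a variable has to be obtainable beforehand to support a forecast.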