NHL Forecasting Model PDF
Document Details
Uploaded by TruthfulRealism2101
Princess Nourah Bint Abdulrahman University
Summary
This document describes a methodology for constructing a forecasting model focused on predicting NHL game outcomes. The model uses Pythagorean winning percentages calculated from the first half of the 2016 regular season to predict second-half performance. It compares a model based on win percentage with a model based on Pythagorean winning percentage, using regression analysis.
Full Transcript
So far, we've talked about the logistic and ordered logistic regression models for dealing with categorical outcome variables. These regression models will be used intensively for the remainder of the course as a basis for forecasting game results. In this video, we will discuss some fundamental issues in building a forecasting model. The purpose of this notebook is to introduce a very simple forecasting model to give you a sense of how to build a reliable forecasting model in a real-world setting.
Before we go over the Python code, let's talk about what forecasting means in a real-world setting. In the previous exercise, we used the Pythagorean winning percent as an independent variable to explain the game outcome variable, and the model worked quite well in the sense that the predicted outcomes from the model matched the actual outcomes quite well. Now, I just said that the model predicted the outcomes. Does it make sense to use the word predict in this context? Think about the everyday meaning of prediction. When we say that we predict something, we mean that we state the expected outcome of an event before it happens, which is the same as a forecast. In statistics, however, the word is often used to mean simply the fitted value from a model.
In that sense, if we use the Pythagorean winning percent as an independent variable to predict the outcome of a match, it cannot give us a prediction in the everyday sense. Why? Because we need to know the scores of the match to calculate the Pythagorean winning percent, and those values are not available beforehand; they are generated as the game progresses. In other words, it's not possible to use the Pythagorean winning percent of a game to predict that game's result in a real-world setting. The goal of forecasting is therefore to predict an event that has not yet occurred by using predictor variables that are available beforehand.
In this Jupyter Notebook, we'll build a forecasting model in its simplest form, while providing some useful insights about how to build a reliable forecasting model. To start off, let's think about a way to use a variable such as the Pythagorean winning percent as a basis for predicting the future performance of a team. A very simple approach is to take the Pythagorean winning percent from the first half of the regular season and use that value as the indicator of the team's performance for the remainder of the season. That's basically what we will do together in this notebook.
To do that, we'll first split the data into two subsets. We will then use performance indicators from the first half of the regular season, namely the win percent and the Pythagorean winning percent, to predict each team's performance in the later half of the season. Basically, we are going to compare the performance of two models: the forecasting model using the first-half Pythagorean winning percent and the forecasting model using the first-half winning percent. In both cases, the quantity we want to predict is the winning percentage over the remaining season.
Let's get started. First of all, we have to import all the libraries and the dataset, which is the NHL dataset. We are going to use the 2016 NHL regular season record. Next, we are going to drop all the unnecessary columns from the original DataFrame.
The next step is to split the data in half, so that we can obtain the winning percent and the Pythagorean winning percent from the first half of the regular season. One thing you have to bear in mind when splitting the data is that order matters in this type of data. In this case, we have to extract the data based on the schedule of the regular season. Given that the NHL regular season runs from early October to early April, we can split the data by using the date column. Let's run this line of code and see what happens. The date column refers to the game schedule of the regular season. Because the season runs from early October to early April, we can identify the first half of the regular season by the calendar year.
In a nutshell, we're going to use the games played in calendar year 2016 to build the predictors for the regression model, and the games played in calendar year 2017, the later half of the regular season, to validate our model. We split the data by using the Year column that we just created. First we extract all the games played in 2016; this is the dataset we use as our basis for forecasting. Then we extract the second half of the regular season, which is all the games played in 2017; this is the data we use to validate our forecasting model.
Next, we generate a team-level dataset by aggregating values such as wins, goals for, and goals against, which are needed to calculate the winning percent and the Pythagorean winning percent for each team. By using the groupby command, we create the team-level statistics, grouping on the team code column.
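
The chronological split and team-level aggregation described above can be sketched roughly as follows. This is a minimal illustration on made-up data, not the notebook's actual code: the column names (`team`, `date`, `win`, `goals_for`, `goals_against`, `year`) and all values are assumptions chosen for readability.

```python
import pandas as pd

# Toy game-level records, one row per team per game.
# All column names and values here are assumptions for illustration only.
games = pd.DataFrame({
    "team": ["BOS", "BOS", "BOS", "BOS", "NYR", "NYR", "NYR", "NYR"],
    "date": ["2016-10-13", "2016-12-01", "2017-01-05", "2017-03-20"] * 2,
    "win": [1, 0, 1, 1, 0, 1, 0, 0],
    "goals_for": [4, 1, 3, 5, 2, 3, 2, 1],
    "goals_against": [2, 3, 2, 1, 4, 1, 3, 5],
})

# Parse the dates and derive the calendar year of each game.
games["date"] = pd.to_datetime(games["date"])
games["year"] = games["date"].dt.year

# Split chronologically: 2016 games feed the predictors,
# 2017 games are held out for validation.
first_half = games[games["year"] == 2016].copy()
second_half = games[games["year"] == 2017].copy()

# Aggregate the first-half games to one row per team.
first_totals = (
    first_half.groupby("team")[["win", "goals_for", "goals_against"]]
    .sum()
    .reset_index()
)
print(first_totals)
```

Splitting on the calendar year rather than sampling rows at random keeps every predictor game strictly earlier than every validation game, which is the point of the "order matters" remark.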
This way, we obtain the total number of wins, total goals for, and total goals against for each team in the first half of the regular season. As you can see in the resulting data frame, what we are still missing is the total number of games played by each team in the first half of the regular season, which we need in order to obtain each team's winning percentage. By running this line of code, we get the total number of games played by each team in calendar year 2016, and we merge this data frame into the previous one. The resulting data frame contains all the variables needed to calculate the winning percentages and the Pythagorean winning percentages for the first half of the regular season, which we will use for forecasting the later half.
We use this line of code to compute our predictor variables for the first half of the regular season. We now have the winning percentage and the Pythagorean winning percentage for each team in the first half of the regular season. Then we drop all the unnecessary variables. The resulting data frame contains three columns: the team, the team's win percent, and the team's Pythagorean win percent in the first half of the regular season. Our dataset is ready for forecasting.
Now we are going to prepare the later half of the regular season. One thing you have to know is that we don't need the Pythagorean winning percentages for the second half of the dataset, because we'll be using the winning percentages for the second half of the regular season as the dependent variable in the regression models.
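
The first-half predictor calculation described above might look something like the sketch below. The transcript does not state the Pythagorean formula, so the classic form with exponent 2, goals for squared divided by the sum of goals for squared and goals against squared, is an assumption here, as are the column names and the toy values.

```python
import pandas as pd

# Toy first-half game records (assumed columns and values, for illustration).
first_half = pd.DataFrame({
    "team": ["BOS", "BOS", "BOS", "NYR", "NYR", "NYR"],
    "win": [1, 0, 1, 0, 1, 0],
    "goals_for": [4, 1, 3, 2, 3, 1],
    "goals_against": [2, 3, 2, 4, 1, 3],
})

# Team totals: wins, goals for, goals against.
totals = (
    first_half.groupby("team")[["win", "goals_for", "goals_against"]]
    .sum()
    .reset_index()
)

# Games played per team: the number of rows in each group.
games_played = first_half.groupby("team").size().reset_index(name="games")

# Merge the game counts into the totals.
first = totals.merge(games_played, on="team")

# Winning percentage: wins divided by games played.
first["win_pct_1"] = first["win"] / first["games"]

# Pythagorean winning percentage, assuming the classic exponent of 2.
gf_sq = first["goals_for"] ** 2
ga_sq = first["goals_against"] ** 2
first["pyth_pct_1"] = gf_sq / (gf_sq + ga_sq)

# Keep only the team identifier and the two predictors.
first = first[["team", "win_pct_1", "pyth_pct_1"]]
print(first)
```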
I will leave this as your self-test. If you would like to challenge your coding skills, pause the video and do the analytic task as instructed here. Then I'll go over the code, and you can compare your version with mine and see if there are any differences.
All right. First of all, we calculate the total number of wins by each team in the later half of the regular season. Once you run this line of code, you can also obtain the total number of games played by each team in the later half of the season. Then you merge this record into the existing data frame and compute the winning percentage for each team in the later half of the regular season. Then we drop all the unnecessary variables. The resulting data frame contains the winning percent for each team in the later half of the regular season.
Now we are ready to make a prediction. We can compare two models with different performance indicators by using the winning percent from the second half of the regular season as the dependent variable in both regression models. To do that, we first merge the two datasets. The resulting DataFrame contains four columns: the team; the winning percentage in the later half of the season, which will be the dependent variable in our regression models; the winning percentage for the first half of the regular season; and the Pythagorean winning percentage for the first half of the regular season. Before we fit the regression model, let's plot the variables first.
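
For readers attempting the self-test, one possible way to prepare the second-half data and merge it with the first-half predictors is sketched below. It is only one of several reasonable solutions; the column names and toy values are assumptions, and the notebook's own code may differ.

```python
import pandas as pd

# First-half predictors per team (toy values, assumed column names).
first = pd.DataFrame({
    "team": ["BOS", "NYR"],
    "win_pct_1": [0.60, 0.45],
    "pyth_pct_1": [0.58, 0.47],
})

# Second-half game records: only the win indicator is needed here.
second_half = pd.DataFrame({
    "team": ["BOS", "BOS", "BOS", "BOS", "NYR", "NYR", "NYR", "NYR"],
    "win": [1, 1, 0, 1, 0, 1, 0, 0],
})

# Wins (sum of the indicator) and games played (row count) per team.
second = second_half.groupby("team")["win"].agg(["sum", "count"]).reset_index()
second = second.rename(columns={"sum": "wins", "count": "games"})

# Second-half winning percentage: the dependent variable.
second["win_pct_2"] = second["wins"] / second["games"]
second = second[["team", "win_pct_2"]]

# One row per team: outcome plus the two first-half predictors.
merged = second.merge(first, on="team")
print(merged)
```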
First of all, we look at the relationship between the winning percentages for the first half of the regular season and the winning percentages for the second half. In the scatter plot, we can see that the relationship between these two values is positive and linear. Now, let's fit the regression model, using the winning percentages for the first half of the regular season as the independent variable and the winning percentages for the second half as the dependent variable. The result table shows the regression coefficients and the accompanying statistics, such as t-statistics and p-values. We are interested in the regression coefficient attached to the independent variable, the winning percentage from the first half of the regular season, which is 0.5. The coefficient is statistically significant at the 0.005 alpha level, and we also get a fairly high R-squared value. The R-squared value is 0.246, meaning that our independent variable, the winning percentage from the first half of the regular season, explains about 24.6 percent of the total variance in winning percentages in the later half of the regular season.
Now, let's move on to the second model. Here we use the Pythagorean winning percentages from the first half of the regular season as our predictor variable, and the winning percentages from the second half of the regular season as the dependent variable. Before we fit the OLS model, we plot those two variables to see the relationship between them. As we can expect, there is a strong positive relationship between the two variables as well.
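
To make the slope and R-squared figures above more concrete, here is a small sketch of a one-predictor least-squares fit. It uses NumPy rather than whatever regression routine the notebook calls, so it produces only the slope, intercept, and R-squared, not the full table with t-statistics and p-values. The helper name `simple_ols` and the data are hypothetical; the numbers are toy values, not NHL results.

```python
import numpy as np

def simple_ols(x, y):
    """Fit y = intercept + slope * x by least squares.

    Returns (slope, intercept, r_squared).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # polyfit returns coefficients from the highest degree down.
    slope, intercept = np.polyfit(x, y, 1)
    fitted = intercept + slope * x
    ss_res = np.sum((y - fitted) ** 2)      # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)    # total variation
    return slope, intercept, 1.0 - ss_res / ss_tot

# Toy first-half and second-half winning percentages for six teams.
win_pct_1 = [0.40, 0.45, 0.50, 0.55, 0.60, 0.65]
win_pct_2 = [0.44, 0.43, 0.52, 0.50, 0.58, 0.57]

slope, intercept, r_squared = simple_ols(win_pct_1, win_pct_2)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r_squared:.3f}")
```

R-squared here is one minus the ratio of residual to total variation, which is the sense in which the transcript says the predictor "explains" a share of the variance in second-half winning percentage.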
Lastly, we fit the regression model. Here we can see that the regression coefficient is 0.5996, which is larger than the coefficient on the winning percentage in the previous model, and it is also statistically significant. The R-squared value is 0.326, meaning that the regression model with the Pythagorean winning percentages explains about 32.6 percent of the total variance in the dependent variable, which is also larger than in the previous model.
We can combine the regression results from each model in a table and make a side-by-side comparison of the major parameters. Again, the coefficient on the Pythagorean winning percentage from the first half of the regular season is about 0.5996, which is greater than the coefficient on the winning percentage from the first half of the regular season. As you can see from the R-squared values as well, the model with Pythagorean winning percentages predicted team performance better than the model with winning percentages.
To summarize, the forecasting model using the Pythagorean winning percentage as the independent variable fit the data better than the regression using the first-half winning percentage as the predictor variable. That makes sense if you think about the calculation process. The Pythagorean winning percentage uses more information: the total number of goals for and the total number of goals against. That tells us more about how much better or worse a team is than the rest of the league. The winning percentage, by contrast, uses only the total number of wins and the total number of games played. In that sense, we can expect the model with the Pythagorean winning percentages to predict game results better than the model with winning percentages.
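
A side-by-side comparison like the one described above could be assembled along these lines. This is a sketch under stated assumptions: the data frame, its column names, and its values are invented for illustration, and NumPy is used in place of the notebook's regression routine. The toy numbers say nothing about which model wins on the real NHL data; they only show how the table is built.

```python
import numpy as np
import pandas as pd

# Toy merged data: outcome plus two first-half predictors (assumed names).
merged = pd.DataFrame({
    "team": list("ABCDEF"),
    "win_pct_2": [0.44, 0.43, 0.52, 0.50, 0.58, 0.57],
    "win_pct_1": [0.40, 0.50, 0.45, 0.55, 0.65, 0.60],
    "pyth_pct_1": [0.42, 0.44, 0.51, 0.52, 0.60, 0.59],
})

rows = []
y = merged["win_pct_2"].to_numpy()
for predictor in ["win_pct_1", "pyth_pct_1"]:
    x = merged[predictor].to_numpy()
    # One-predictor least-squares fit.
    slope, intercept = np.polyfit(x, y, 1)
    # With a single predictor, R-squared equals the squared correlation.
    r_squared = np.corrcoef(x, y)[0, 1] ** 2
    rows.append({
        "predictor": predictor,
        "coefficient": slope,
        "intercept": intercept,
        "r_squared": r_squared,
    })

# One row per model, so the key parameters can be read side by side.
comparison = pd.DataFrame(rows).set_index("predictor")
print(comparison.round(3))
```

Both models share the same dependent variable and the same teams, so their R-squared values are directly comparable.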