Baseball Run Expectancy Analysis - 2018 PDF
Document Details
Uploaded by TruthfulRealism2101
Princess Nourah Bint Abdulrahman University
2018
Summary
This document describes the process of preparing 2018 MLBAM data for a run expectancy analysis. It uses the pandas and NumPy libraries, and the analysis is presented in a Jupyter notebook. The document also points to the book Analyzing Baseball Data with R for further reading.
Full Transcript
In this video, we're going to create a couple of run expectancy variables that we need for our analysis. For anybody who's interested in some further reading about run expectancy, I'm pointing to the markdown here: there's a book called Analyzing Baseball Data with R, written by Marchi and Albert in 2013. This week we'll be coding the run expectancy process in our Jupyter notebook using MLBAM data, which is Major League Baseball Advanced Media data. The Analyzing Baseball Data with R book uses Retrosheet data and, as the title implies, uses the R language to complete the analysis, but the general process in the book is pretty similar to what we will be doing this week. So for anybody interested in some further reading, this is a great book to take a look at.
Now, jumping back into our notebook, we're going to start out as usual by loading our packages. We're importing pandas as pd and numpy as np, so we'll go ahead and run that, and our packages have loaded. Next we read in our MLBAM data for 2018. First we read the data into MLBAM18, and then we drop one column that's irrelevant: it's called "Unnamed: 0" and doesn't contain any real information, so we drop it right away. Then you can see we have a pd.set_option call. We set the maximum number of displayed columns to 100 because there are a lot of variables in this data set. With that, we can run the cell and see what the data looks like. If you take a brief look, you can see we have event-by-event data for basically every event that occurred in the 2018 Major League Baseball season, so there's a lot of information in here. In terms of sample size, we have 185,771 rows and 62 columns.
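The loading step described above can be sketched roughly as follows. The video does not give the file name or path, so the in-memory CSV and the name "MLBAM18.csv" mentioned in the comments are stand-ins, and the toy columns are only there so the example runs on its own.

```python
import io

import numpy as np
import pandas as pd

# Tiny in-memory stand-in for the real file. The actual file name and path are
# not stated in the video; "MLBAM18.csv" would be a hypothetical name.
sample_csv = io.StringIO(
    "Unnamed: 0,batterName,startOuts\n"
    "0,Batter A,0\n"
    "1,Batter B,1\n"
)

# With the real data this would be something like pd.read_csv("MLBAM18.csv").
MLBAM18 = pd.read_csv(sample_csv)

# Drop the leftover index column that carries no information.
MLBAM18 = MLBAM18.drop(columns=["Unnamed: 0"])

# Show up to 100 columns when a DataFrame is displayed.
pd.set_option("display.max_columns", 100)

print(MLBAM18.shape)             # (rows, columns)
print(MLBAM18.columns.tolist())  # list every column name
```

On the real 2018 file, the shape printed here is what the video reports as 185,771 rows and 62 columns.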
To take a closer look at what those columns are in the raw data, we can run print(MLBAM18.columns.tolist()). We can see that there are several variables included in this data set, 62 in total, and we only need a subset of them to complete our analysis. So in the next portion of code we're just going to keep the variables that we need, to clean the data set up a bit.
Going through the variables we'll be keeping: we start out by calling the new data set RE18, for run expectancy 18, and it's based on the MLBAM data for 2018. The variables we're keeping include batterName, batterID, and event type. Then we have the variables start1B, start2B, and start3B. These three variables represent whether or not first base, second base, and third base are occupied prior to the plate appearance that the observation represents. If a base is occupied, the variable is populated with the player ID of the player occupying the base. If the base is not occupied, the data point is a null value, which shows up as NaN, as we'll see in a second. In a similar way, end1B, end2B, and end3B represent whether first base, second base, and third base are occupied after the plate appearance takes place. We have startOuts, which is the number of outs before the plate appearance, and endOuts, which is the number of outs after the plate appearance. runsFuture represents the number of runs that will be scored in the remainder of the inning. Then we have runs on the play, and outsInInning, the total number of outs in the inning. And then we have stand, which represents whether the batter is right-handed or left-handed, throws, which represents whether the pitcher is right-handed or left-handed, venueId, stadium, and batter position. Those are all the variables we'll be keeping to continue on with our run expectancy analysis.
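As a minimal sketch of the column selection just described, the snippet below keeps a handful of columns from a toy frame. Only some of the names are spelled out in the video; runsOnPlay is an assumed spelling of "runs on the play", extraColumn is hypothetical, and all values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for MLBAM18; the values are made up for illustration.
MLBAM18 = pd.DataFrame({
    "batterName": ["Batter A", "Batter B"],
    "start1B": [np.nan, 12345.0],   # NaN means first base is empty
    "startOuts": [0, 1],
    "endOuts": [1, 1],
    "runsOnPlay": [0, 0],           # assumed column spelling
    "extraColumn": ["x", "y"],      # hypothetical column we do not need
})

# Keep only the variables needed for the run expectancy work.
keep = ["batterName", "start1B", "startOuts", "endOuts", "runsOnPlay"]
RE18 = MLBAM18[keep].copy()  # .copy() so later edits do not touch MLBAM18

print(RE18.columns.tolist())
```

Taking a copy is a design choice here: new columns added to RE18 later will not trigger pandas warnings about modifying a view of the original frame.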
We can see now that our data set is a bit cleaner after we've filtered out all the variables we don't need. In the next stage, we can start to construct the base-out states that we discussed in our first intro video for this week. We start by creating the variable start1 in RE18, and you can see it's defined by a where statement: np.where with pd.isnull applied to RE18's start1B for first base. To put this in common terms, it checks whether the variable start1B is null. If you scroll up, null is equivalent to those NaN values. NaN stands for "not a number", and here it means the value doesn't exist at that point. So if start1B is null, we code start1 as 0, which means the base is not occupied. Otherwise, we code the value as 1, which means the base is occupied. We do the same thing for start2 and start3: start2 represents whether second base is occupied, and start3 represents whether third base is occupied, all prior to the plate appearance. So let's go ahead and run that. If we scroll over to the right-hand side, we can see we've created three variables, start1, start2, and start3, which represent whether first base, second base, and third base respectively are occupied prior to the plate appearance.
The next step is to create the actual state the game is in prior to the plate appearance. We're going to create a variable called start_state by concatenating start1, start2, and start3, each with .astype(str). We need the astype after each one because we want to concatenate these variables as strings, not sum them as numbers. So we're concatenating these three variables, and then we're adding a space.
We see this pair of quotation marks with a space in between, which creates a space in our variable. Finally, we concatenate the variable startOuts as a string, which is the number of outs in the inning prior to the plate appearance. So let's go ahead and run that and see what it looks like. If we scroll to the right, we see the variable start_state, and we can now determine the starting state prior to every plate appearance: we know whether or not each base is occupied and how many outs there are in the inning.
Now we follow the exact same process for the end state, which is the state after the plate appearance occurs. We create the variable end1 for first base, again with a where statement and pd.isnull. If the variable end1B is null, we code end1 as 0, which means first base is not occupied. Otherwise, we code it as 1, which means it is occupied after the plate appearance takes place. We do the same thing for second base with end2 and third base with end3, and we can run that. Finally, we do the same thing we did earlier with start_state, but creating the end state variable: we concatenate our end1, end2, and end3 variables as strings, again add a space, and concatenate the end outs after the plate appearance as a string. So let's run that. We can scroll over to the right-hand side of the data set, and we have the end state. So now we have the state the game is in after the plate appearance occurs, and we know which bases were occupied and how many outs there were in the inning, just as we did with start_state. The last step for this video is to exclude a couple of events that could potentially skew our run expectancy analysis.
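The start and end states described above can be sketched like this. The video writes each np.where line out separately; the add_state helper below is a hypothetical convenience that applies the same logic to both prefixes, the column name end_state is assumed by analogy with start_state, and the player IDs are made up.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for RE18; player IDs and outs are made up.
RE18 = pd.DataFrame({
    "start1B": [np.nan, 111.0, 111.0],
    "start2B": [np.nan, np.nan, 222.0],
    "start3B": [np.nan, np.nan, 333.0],
    "startOuts": [0, 1, 2],
    "end1B": [np.nan, np.nan, 111.0],
    "end2B": [np.nan, 111.0, 222.0],
    "end3B": [np.nan, np.nan, np.nan],
    "endOuts": [1, 1, 2],
})


def add_state(df, prefix):
    """Hypothetical helper: add <prefix>1..3 flags and <prefix>_state."""
    out = df.copy()
    for base in ("1", "2", "3"):
        # 0 if the base column is null (empty base), otherwise 1 (occupied).
        out[prefix + base] = np.where(pd.isnull(out[prefix + base + "B"]), 0, 1)
    # Cast to str so "+" concatenates the flags instead of summing them.
    out[prefix + "_state"] = (
        out[prefix + "1"].astype(str)
        + out[prefix + "2"].astype(str)
        + out[prefix + "3"].astype(str)
        + " "
        + out[prefix + "Outs"].astype(str)
    )
    return out


RE18 = add_state(RE18, "start")
RE18 = add_state(RE18, "end")
print(RE18[["start_state", "end_state"]])
```

A state such as "100 1" reads as runner on first, second and third empty, one out.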
What we're doing here is overwriting the RE18 data: we subset it so that the only observations included are those in which start_state is different from the end state or the runs on the play are greater than zero. This portion of the code still keeps the vast majority of observations. The reason we put this condition in is that there are a couple of rare, random events that are not really related to batting. One of them is a foul error. This is when a batter pops the ball up into foul territory, the fielder should catch it, but he drops it instead and is given an error. This doesn't change the state the game is in at all, and it's not really related to the batter's ability. So we exclude an event like that from our analysis, because we don't want it taken into account in our run expectancy calculation.
Finally, we also make sure that outsInInning is equal to 3. The only real case where this wouldn't be true is if the home team is up last and wins the game in its last at-bat, so there won't be three outs in the inning. We do this because in that situation the potential to score runs through the end of the inning is reduced, which could skew our run expectancy results. So we make those changes, but again, they won't remove many observations from our analysis. In total, after doing this, we have 184,949 rows and 27 columns. So now we've prepared the data, cleaned it, and coded the variables necessary to complete our analysis. In the next video, we will start to actually calculate run expectancy, and later this week we'll see how run expectancy can be used to evaluate players.
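A rough sketch of the two filters just described, on made-up rows. The column names end_state and runsOnPlay are assumed spellings, and the four rows are invented to show one kept and one dropped case for each condition.

```python
import pandas as pd

# Toy rows standing in for RE18 after the state variables have been built.
RE18 = pd.DataFrame({
    "start_state": ["000 0", "000 1", "100 1", "010 1"],
    "end_state":   ["000 1", "000 1", "000 1", "000 1"],
    "runsOnPlay":  [0, 0, 1, 1],       # assumed column spelling
    "outsInInning": [3, 3, 3, 1],      # last row: inning ended early (walk-off)
})

# Keep plays where the state changed or at least one run scored.
# Row 1 (same state, no runs, e.g. a foul error) fails this condition.
changed = (RE18["start_state"] != RE18["end_state"]) | (RE18["runsOnPlay"] > 0)
RE18 = RE18[changed]

# Keep only innings that ran to three outs; row 3 fails this condition.
RE18 = RE18[RE18["outsInInning"] == 3]

print(RE18.index.tolist())  # [0, 2]
```

The parentheses around each comparison matter, because `|` binds more tightly than `!=` and `>` in Python.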