Baseball Regression Analysis PDF
Document Details
Uploaded by TruthfulRealism2101
Princess Nourah Bint Abdulrahman University
Summary
This document analyzes baseball statistics using regression models, focusing on individual performance metrics like home runs and walks in various eras. The analysis explores the relationship between these metrics and player salaries, specifically examining a post-Moneyball effect on the significance of specific performance indicators.
Full Transcript
So now we're going to run our regressions again for the three eras that we identified before, but this time using the disaggregated statistics: the individual statistics for singles, extra base hits, home runs and walks. In order to set that up, the first thing we have to do is write our functions again, and these really are the same as before. The only difference is that we're going to redefine the regression. So here, in the first one, you can see, if you compare it back to the formula we ran before, that this is the same formula to define the regression, except that we now have our individual batting statistics included separately. Then the next two lines of code are the same ones as before. These tell Python to run the regressions, store the results in a list, and associate the list with the particular regressions for each of the seasons. We don't need to create the header list again because we can reuse the one that we generated before. So with that simple repetition, we can now run our regressions and produce the same kinds of tables as we produced before, only this time with the individual batting performance statistics included, so let's go ahead and do that. You can see here it looks the same. The only difference is that at the top we have these individual batting statistics. Let me just run through each of these regressions quickly. You can see the output here. Now, it's well worth spending some time looking through these individual outputs to see what patterns you can observe in the relationship between the coefficients. But what I've done here is actually combine the relevant rows of the data for each of the three eras into a single table, so we can review the results for each era simultaneously. So here the rows are the base on balls percentage, so the capacity to draw a walk, the singles percentage, the extra base hits percentage and the home run percentage. The first block, with the red border, is the pre Moneyball era.
Then the block with the blue border is the Moneyball era, and the final block, with the green border, is the post Moneyball era. And so one thing that's striking about this, I think, is to look at the last row of each block, which is the home run percentage. Home run percentage is fairly consistently statistically significant in determining the salaries of free agents. And that, perhaps, is not surprising: the capacity to hit home runs is not just a valuable batting statistic for winning games, it also makes for a very attractive game. Players who can generate home runs bring in the audiences, and so it's not surprising that this is significant. And if you look in the different eras, if you compare going through from 1994 to at least 2007, you see the size of the coefficient remains relatively stable. What happens in what we have defined as the post Moneyball era, after 2008, is that the stability of that coefficient declines somewhat, and in some years, as in 2012 or 2015, home run percentage is not even statistically significant. Now, if you compare that with the capacity to draw a walk, you will see that in most years prior to the publication of Moneyball, this variable is statistically insignificant. It's significant in 1995, in 1998 and in 1999, and those are the only years it's significant prior to 2004, the year after the publication of Moneyball. But you can see that from 2004 until 2007, the capacity to draw a walk is statistically significant. The size of the coefficient is much larger, and we also see this very large effect specifically in 2004 that we identified before. It does seem as if in 2004 there was a rush to hire free agents who were capable of drawing walks. Then, in the post Moneyball era, we see a somewhat more mixed pattern, but we still see statistically significant effects of the capacity to draw a walk in 2009, 2010, 2011, 2014 and 2015. So in most of the seasons we're seeing a statistically significant effect.
And the size of the coefficient once again is generally larger than it was in any of the pre Moneyball years. So our data does seem to suggest that there was some significant change around 2004 in the perception of the value of the capacity to draw a walk, and that this was a change relative to the value of the capacity to hit home runs. What's also notable in passing, though, is that singles and extra base hits show no consistent pattern of statistical significance. Indeed, in more years than not, these variables are statistically insignificant. Which suggests that perhaps the limited reliability of the regressions, given the small number of observations, makes it difficult to pick up some of these effects. It's picking up the two main effects that we're interested in, but it's not picking up other effects that we think might matter. Which suggests we should do what we did before with the on base percentage and slugging percentage data, and that is to pool our data set and see whether there is a specific post Moneyball effect, starting from 2004 onwards. So this is exactly the same process as we carried out before, but this time we're including our individual batting performance statistics in our regression. And so we define the data frame for the post Moneyball regression, or the pooled regression, and then we run that regression and again we compare coefficients. And once again you notice that a couple of things are going to stand out. If we consider the entire period, we can see the effect of the capacity to draw a walk. BBPCT is statistically significant, as are singles, extra bases and home runs. And clearly the home run coefficient is far larger than the coefficients on all of the other variables. So that suggests that all of these capacities were significant; even the capacity to draw a walk was significant throughout the period that we've covered, the 22 seasons that we've included.
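To make the two setups described so far a little more concrete, here is a minimal sketch of the per-season regressions and the pooled regression with post Moneyball interaction terms. It is not the code from the lecture. The column names `lnSal`, `season` and `post_MB`, the inclusion of a `post_MB` main effect, the 1994 to 2015 season range and the synthetic data are all assumptions made purely so the example runs, and any other control variables in the original formula are left out; only `BBPCT`, `SinglePCT`, `XBHPCT` and `HRPCT` come from the discussion above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# The four batting statistics named in the transcript.
STATS = ["BBPCT", "SinglePCT", "XBHPCT", "HRPCT"]


def make_synthetic_free_agents(seed=0, per_season=60):
    """Build made-up data so the sketch runs; these are NOT real salaries."""
    rng = np.random.default_rng(seed)
    frames = []
    for season in range(1994, 2016):  # assumed range: 22 seasons
        frames.append(pd.DataFrame({
            "season": season,
            "BBPCT": rng.normal(0.09, 0.03, per_season),
            "SinglePCT": rng.normal(0.16, 0.03, per_season),
            "XBHPCT": rng.normal(0.05, 0.015, per_season),
            "HRPCT": rng.normal(0.03, 0.015, per_season),
        }))
    data = pd.concat(frames, ignore_index=True)
    # Hypothetical dummy: 1 from 2004 onwards, 0 before.
    data["post_MB"] = (data["season"] >= 2004).astype(int)
    # Hypothetical log salary: arbitrary coefficients plus noise.
    data["lnSal"] = (
        13 + 3 * data["BBPCT"] + 2 * data["SinglePCT"]
        + 4 * data["XBHPCT"] + 12 * data["HRPCT"]
        + rng.normal(0, 0.5, len(data))
    )
    return data


def season_regressions(data):
    """Fit one OLS regression per season and key the results by season."""
    formula = "lnSal ~ " + " + ".join(STATS)
    results = {}
    for season, group in data.groupby("season"):
        results[int(season)] = smf.ols(formula, data=group).fit()
    return results


def pooled_regression(data):
    """Fit one OLS regression on all seasons with post_MB interactions."""
    interactions = " + ".join(f"post_MB:{stat}" for stat in STATS)
    formula = "lnSal ~ " + " + ".join(STATS) + " + post_MB + " + interactions
    return smf.ols(formula, data=data).fit()


data = make_synthetic_free_agents()
by_season = season_regressions(data)
pooled = pooled_regression(data)

# Coefficients on the dummy and on the "post_MB:" interaction terms.
print(pooled.params.filter(like="post_MB"))
```

In the pooled model, each `post_MB:` coefficient measures how much the salary effect of that statistic differs from 2004 onwards compared with the earlier seasons, which is the comparison the discussion turns to next. Because the data here is random, the printed numbers carry no meaning about baseball.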
But now we want to see if there was a change following the publication of Moneyball. And so we look at the post MB colon variables, that is, the interaction terms, for BBPCT (walks), SinglePCT (singles), XBHPCT (extra bases) and HRPCT (home runs). And again, what is very striking about those coefficients is that the post Moneyball walk variable is statistically significant and positive, and the post Moneyball home run variable is statistically significant and negative. Which suggests, if anything, a revaluation of the value of drawing a walk relative to the value of being able to hit home runs. So this again seems to be confirmation of the basic Moneyball story when we extend our data analysis over a longer period of time. So what have we found out as a result of all of this? Well, we've looked at the data over a longer period, and we've broken down the batting statistics into their component parts. And we've found broad confirmation, at least, of the Moneyball hypothesis: there does seem to be a change around the time of the publication of Moneyball in the valuation of the capacity to draw a walk. And that does seem to have led to a downgrading, at least in the salary rewards, of other batting capabilities, such as the ability to hit home runs. That said, there are a couple of caveats that one should make to these results. The first is the very striking effect of the 2004 year: no other season looks as significant as 2004 in relation to the effect of on base percentage or the capacity to draw a walk. So it does appear that there's a reaction in that season, but it's not clear to what extent that reaction is sustained, and it does seem that other adjustments have been made since. The second thing to say is that there are other potentially confounding factors which could have intervened to cause some of these results. And one factor which stands out for anyone who is familiar with the story of baseball in these periods is the effect of the steroid era.
The data we've been talking about coincides with the steroid era, which broadly means a period where players appear to have been taking steroids in order to increase their capacity to hit home runs. That era is thought to have been at its peak during the 1990s and early 2000s. And around the time of the publication of Moneyball, and soon after, came the well known BALCO scandal. In the BALCO scandal, there were a number of prosecutions related to taking steroids, which exposed the fact that a number of players had been taking steroids systematically. And this in turn provoked Major League Baseball to adopt a rather stricter policy in relation to drug testing, which meant that the ability of players to bulk up was probably restricted. And so the ability to hit home runs probably fell as well. How those changes, how that whole steroid era and its consequences, interacted with the valuation of player skills, walks, home runs and so on, is a little complicated and not entirely clear from our data. And therefore, even if our results are broadly supportive of the Moneyball hypothesis, one should always bear in mind that these caveats exist, and there's probably scope to do further analysis in order to isolate these effects more fully. So that concludes our analysis of the Moneyball story. We've looked at the data during the period when Moneyball actually was published. We've extended the data over a longer period of time, and we've broken down the data in a little bit more detail. And by and large, our analysis has confirmed the Moneyball story, notwithstanding the caveats that we mentioned at the end. Now we're going to look at some more advanced statistical analysis in baseball. And in particular, we're going to focus on something called run expectancy, which defines player performance in a more contextualized situation.
And then we're going to use that to generate statistics such as wins above replacement, which have become popular metrics in baseball today for evaluating player performance.