Regression Analysis PDF
Uploaded by TruthfulRealism2101
Princess Nourah Bint Abdulrahman University
Summary
This document details a regression analysis of player salaries, focusing on variables such as on-base percentage, slugging percentage, and playing positions. The analysis covers data from multiple years and explores how these factors influence player salaries. The results are presented in tables showcasing changes in coefficients over time.
Full Transcript
So now we're going to set up the basic regression: import our packages, load the data, and start our first exploratory analysis. As always, we begin by importing the packages. We need pandas, matplotlib in case we want to draw any diagrams, NumPy for calculations, and statsmodels for running our regressions. So we run that line of code and import those packages.
The next thing we're going to do is import the data file that we need. We already saved, from last week, the master data file which we constructed, so we're going to use that now. Importing it gives us all the data that we have already created. And remember, when we did that we didn't limit ourselves to the years 1999 to 2004 that were used in Hakes and Sauer. We actually constructed that data for all of the years that were in the data, which run up to 2015. That lets us use a larger data set now, so it's useful just to see what variables we've got in the data. These are the variables, most of which we either created or added to our analysis.
Let's go ahead and create the squared experience term that we want to use. We've already got the experience variable, EXP, so we create EXP2, which is defined as EXP squared. We run that, and we've created that variable.
So let's begin by running just one regression for one season to see what it looks like. Let's look at the first year of data we're going to use. That's 1994, and remember we're only going to be looking at free agents. So this command here defines that subset of data for us, and we can now run the regression. You can see the regression looks pretty similar to the one we ran before. On the left-hand side we have the log of player salaries; on the right-hand side we have on-base percentage, slugging percentage, and plate appearances. Now we also have experience and experience squared, which depend on the number of years you've been in the majors. And then finally we have the expression C(POS).
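The data preparation steps described so far (adding the squared experience term, then keeping one season of free agents) might look roughly like the sketch below. The master file itself is not shown in the transcript, so a small synthetic DataFrame stands in for it, and the column names `yearID` and `freeagent` are assumptions made for illustration; only `EXP` and `EXP2` are named in the lecture.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the master data file (the real file is not shown).
# 'yearID' and 'freeagent' are hypothetical column names.
rng = np.random.default_rng(0)
n = 200
master = pd.DataFrame({
    "yearID": rng.choice([1994, 1995, 1996], size=n),
    "freeagent": rng.choice([0, 1], size=n),
    "EXP": rng.integers(1, 15, size=n),
})

# Squared experience term, as described in the lecture.
master["EXP2"] = master["EXP"] ** 2

# Keep only the 1994 season and only free agents.
fa_1994 = master[(master["yearID"] == 1994) & (master["freeagent"] == 1)]
print(fa_1994.shape)
```

The boolean mask combines both conditions with `&`, so the subset can be passed straight to a regression call as its `data` argument.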
In C(POS), POS is the playing position of the players, which we got from the original dataset. What happens when you wrap a variable in C( ) is that it creates individual dummy variables for each value of that variable in the data set. So here we have several different positions in the data, and for each position it creates a variable which has a value equal to one if the player has that position and zero otherwise, and it does that for every possible position in the data. And so we then get an individual estimate of the effect of playing in each of the particular positions in our data.
So if we run this now and generate the regression output, we can see what we get. This actually looks rather like the regression output for any one of the years that we were looking at before in Hakes and Sauer's Table 3. In particular, you can see all the individual positions separately: second base, third base, catcher, designated hitter, outfielder and shortstop. You can see the impact of on-base percentage, which is actually negative in this period, although not statistically significant. You can see the impact of slugging percentage, about 3.5, positive and statistically significant. Again, plate appearances is statistically significant, and the experience variables, interestingly, are not significant at all in this regression. So that's an example of one regression that we produced for one year using the model we've adapted, focusing on free agents and using a slightly different set of variables.
What we can also do, as we did before, is use the summary_col function, which we import from statsmodels, to produce the regression coefficients in a single column and print them out. And if we run that now we can do that.
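As a rough illustration of the regression with position dummies, here is a minimal sketch using the statsmodels formula interface. The data are randomly generated, and the column names `lnsal`, `OBP`, `SLG` and `PA` are assumptions for the example (the transcript only names `EXP`, `EXP2` and `POS`), so the estimates mean nothing in themselves. One detail worth knowing: when the model has an intercept, one position is dropped as the baseline, which would explain why six positions appear in the output listed above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data; column names other than EXP, EXP2 and POS are hypothetical.
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "OBP": rng.normal(0.33, 0.03, n),
    "SLG": rng.normal(0.42, 0.06, n),
    "PA": rng.integers(130, 700, n),
    "EXP": rng.integers(6, 18, n),
    "POS": rng.choice(["1B", "2B", "3B", "C", "DH", "OF", "SS"], n),
})
df["EXP2"] = df["EXP"] ** 2
# Made-up log salary so that the regression has something to fit.
df["lnsal"] = (10 + 1.0 * df["OBP"] + 3.0 * df["SLG"]
               + 0.002 * df["PA"] + rng.normal(0, 0.5, n))

# C(POS) expands POS into 0/1 dummy variables, one per position,
# leaving out the first level ("1B") as the reference category.
model = smf.ols("lnsal ~ OBP + SLG + PA + EXP + EXP2 + C(POS)", data=df).fit()

# Show only the position coefficients, e.g. C(POS)[T.2B].
pos_coefs = model.params.filter(like="C(POS)")
print(pos_coefs)
```

Each position coefficient is read relative to the omitted baseline position, not as an absolute effect.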
We can see our output here, and this will be a model for what we want to do next, which is to do this with many years summarized in one simple table. Finally, what we also want to do is include the number of observations and the R-squared, and that entails creating this info_dict, which enables us to specify the R-squared and the number of observations. If we run that regression again with the info_dict, we can see that not only do we have all the regression coefficients in a single list, but we have the R-squared and the number of observations listed at the bottom. And now what we want to do is go on and create tables which consist of many years' regressions all combined together, so that we can see the change in the profile of the coefficients over time.
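A single-column summary with extra rows for the fit statistics could be sketched as follows. This uses `summary_col` from `statsmodels.iolib.summary2` on a small synthetic regression; the data, the column names, and the row labels `R2` and `N` are all assumptions chosen for the example, not the lecture's own code.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.iolib.summary2 import summary_col

# Small synthetic regression so the example runs on its own.
rng = np.random.default_rng(2)
n = 150
df = pd.DataFrame({
    "OBP": rng.normal(0.33, 0.03, n),
    "SLG": rng.normal(0.42, 0.06, n),
})
df["lnsal"] = 12 + 1.0 * df["OBP"] + 3.0 * df["SLG"] + rng.normal(0, 0.5, n)
reg = smf.ols("lnsal ~ OBP + SLG", data=df).fit()

# Each entry maps a row label to a function of the fitted results object.
info_dict = {
    "R2": lambda res: f"{res.rsquared:.2f}",
    "N": lambda res: f"{int(res.nobs)}",
}

# One column per model; here there is just one, labelled with its season.
table = summary_col([reg], float_format="%0.3f", stars=True,
                    model_names=["1994"], info_dict=info_dict)
print(table)
```

Passing several fitted models in the list instead of one would give one column per model, which is the shape needed for a table covering many seasons. Depending on the statsmodels version, `summary_col` may also print its own R-squared rows alongside the ones from `info_dict`.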