Data Analysis for Sports Club Comparison
Document Details
Uploaded by TruthfulRealism2101
Princess Nourah Bint Abdulrahman University
Tags
Summary
This document details the merging of two data files to compare sports club wages and TM(transfer market) values. Using Python, the author demonstrates a matching process and a regression analysis, exploring the relationship between these variables across multiple years. The output shows strong correlations and regression coefficients, highlighting the impact of wages on team value.
Full Transcript
Now, what we want to do is merge the two files so that we can compare the wage value from the financial statements to the TM valuation of the players. In order to do that, we need some means to match the two variables, the wage variable and the TM value variable. To do that, we need a unique index f...
Now, what we want to do is merge the two files so that we can compare the wage value from the financial statements to the TM valuation of the players. In order to do that, we need some means to match the two variables, the wage variable and the TM value variable. To do that, we need a unique index for each club. Because clubs can appear in multiple years, just the name of the club is not enough. We need to create an index which reflects both the name of the club and the year in which the club appears. What we're going to do is we're going to merge two of our variables in the data to create an index, which we can use as the basis for matching. The natural way to do that is to merge the name of the club with the season in which they were playing, and that will be a unique identifier for that club in that season. This is how we write that, we create this variable team ID and we just add together the club name with the season year end. The only thing we need to tell it is though, that when it's treating the season year end which is a number, 2005 for example, then we need to tell Python that we're treating it as a string, not as an integer and therefore we have this parenthesis str at the end, in order to indicate that it's being treated as a string. We create the identical variable for each data frame and that way, when we try to match them, they will match exactly. Now, one thing to remember about if you're doing this with new data sets, the big problem with data sets is often that the names are not identical. For example, in one data set you might refer to a club as Manchester City whereas in another data set that might be abbreviated to Man City, that's quite a common abbreviation that's used. Often, if you're doing this, you need to check first that the names are exactly the same and that will go down to checking that there aren't extra spaces in the data frame as well, which sometimes happen. That can get fiddly at times, but here, this data has been pre-checked to make sure that the names are exactly the same in each data frame. If we run that, you can see now here team ID in the wagedat data frame, you can see Arsenal '97, Arsenal 1998, Arsenal 1999, and so on. We do exactly the same thing in the TMdat data and you can see there again, Arsenal 2011, Aston Villa 2011, and so forth. We see all the teams lined up there. Now, we just do a merge for these two data frames, merging them using the team ID. That's on team ID, is the way that we say that. We emerging TMdat with the wagedat, and we're specifying that we're going to just use the team ID and the wages. It's really only the wages were interested in. When we do that, we will now see in our TMdatwage data frame, we will see the wages from our wagedat file alongside the TM values. Now these are the two things that we want to compare. One thing you'll notice about this is that the currency is in pounds, and it's just the absolute number of pounds whereas the TM values are actually in much smaller, that's because these refer to millions of pounds, not pounds. In order to make it look a bit clearer, it's not necessary, but convenient to divide the wages by 1 million just to make the numbers n the same scale and if we do that, it doesn't change anything about the data fundamentally, but now it's much easier to see and compare. If you just eyeball in this, you can see that for many clubs the wage value and the TM value look fairly similar, but they're not identical by any means. The first thing we could do is, do a plot to see how closely correlated they look, so just let's take a look. We'll use.relplot to do this and we'll have wages on the x-axis and the TM value on the y-axis, and let's look at what that comes out like. We've used this command hue so you can see it uses the same color for different seasons and it's not particularly important to do that. But what you can see here is that there is a fairly strong correlation between the two. One thing that the hue is useful for is, to show you that it's not all about one year at this end and another year at that end, that they are actually spread out. Even though, bear in mind that wages were tending to rise over these periods, these valuations were changing over time. A player of the same ability in 2020 would have been paid more than a player in 2011. The next thing we can do is run a regression to see how closely related they are. If we run a regression, we'll get a precise statistical lesson, but a numerical lesson of the relationship between the wages and the TM value. If we do that we've added the regression line onto our chart, onto our plot, and you can see there the regression line going through that. But we can actually show the regression itself using this smf.ols command, regress TM value on wages and get a precise estimate of the coefficients. You can see here what the regression output looks like. Now a couple of things, so first we notice the R-squared is 0.909. Remember the R squared goes between 0-1, and one implies an exactly perfect fit between the variables. An R-squared of 0.99 is very high by any standards and suggests a very close fit. Then if you look at the relationship of the coefficient, the coefficient on wages is 2.12 and the standard error of that coefficient is 0.054 which gives you a t-statistic of almost 40, which suggests that this is statistically significant at almost any level, the chances that there's no relationship between wages and the TM value is essentially nil. That's not surprising given the correlation chart that we showed before. It's not surprising to see that these are very close. Now, we might be concerned though, because of the point I just mentioned that wages are going up over time, that looking across all of the years could be misleading if there are changes within years. What we'd like to do really is to look at the regression for each individual year and see what that correlation looks like. We can do that. We can run a regression for each year in the data and see what's the relationship between the TM value and the wage value. In order to do that though, we're not going to run these regressions one by one. We're going to do what we did in course 2 baseball, where we wanted to have multiple regressions on the same table, so we organized the regressions in a vertical column. To do that, we use the summary_call command. We can add an info_dict, which enables you to include the R-squared of the regression and the number of observations on that. If we now run that, we can see just if we run, that's just the regression we looked at above. It's the regression for all of the years, 160 observations or square root 0.91, same coefficients as we saw there. So this output is based on exactly the same regression as the one we just saw. But now we're going to do this season by season. So we're going to add to this table a column for each of the seasons in the data and see what those regressions look like. We can see now, for each year we can see the relationship between the wages and the TM value. In each year we see the same story. It's a very, very high correlation again, almost exactly the same as it for the data as a whole. The reassuring factor is that the value of the coefficient on wages is almost identical for across all of the years, it's around two essentially. Basically the TM value is roughly twice the value of the wages. That's a fairly reliable estimate over time. It's not exact. You wouldn't expect it to be exact, but it's going to be close enough to be for reliable basis. We can be confident that if we use the TM values, that's almost as good as using the audited wage data that we get from the financial statements. It's worth reflecting for a minute that it's interesting to note, it's a very good example of the wisdom of the crowds. How actually what people believe is quite close to what the market dictates in terms of the value of players to a club. You may or may not find that surprising, but it's a pretty powerful result. It's not something that you get in every possible situation, but it's interesting to think about why the wisdom of the crowds works in this particular case. Well, so that concludes this section, showing that we can rely on the TM values in order to think about the influence of spending on performance of teams. In the next video, we're going to go on and look at how to make predictions within an unknown sample of data.