Fundamental Concepts in Machine Learning
Econ 471: Data Science for Economists
Manu Navjeevan
August 8, 2024

1 Dealing with Complex Data

So far we have covered basic modeling techniques such as linear regression and logistic regression. These modeling techniques can be very powerful and useful, especially when our data is "low dimensional" and we care about the interpretability of our statistical models. However, in many situations we do not necessarily care about the interpretability of our models and/or are dealing with data that is "high dimensional." That is, we are dealing with a number of right-hand-side variables that is (potentially very) large compared to our sample size.

Amazon may want to predict how likely someone is to purchase paper towels next month. To make this prediction, they have access to a large amount of information about a person, including their purchasing decisions in the last month, demographic characteristics, and even their activity on sites like Facebook.

Electric grids need to accurately forecast demand for electricity so that they know how much energy to produce ahead of time. In making this forecast, they do not care about the interpretability of their prediction and may have access to many variables to help them, including past energy usage, weather, and knowledge of the quantities and scales of events around the region.

Note: not caring about the interpretability of our model and having many right-hand-side variables can go hand in hand. In simple linear regression we often avoided adding terms such as X₁ · X₂³, as it is unclear how to interpret the coefficient attached to such an interaction. However, if we are only concerned with achieving superior prediction, we are free to add these terms.

Moreover, as we touched upon in the last set of notes, researchers may have a large number of potential models that could be used to make these predictions. We discussed some versions of this, such as choosing between linear and Poisson regression, or between a logit and a probit model for a conditional probability. As we continue on in this class, the number of options available to you as researchers will grow even more; in practice, practitioners have to decide not just between various generalized linear models but also between techniques such as Lasso, random forests, or neural networks. Even within a class of models, such as linear regression, the researcher has many options available to her, such as the number and type of terms to include in the regression model. While careful thinking is always important, it can still be unclear at the outset which model is best for our exact setting. In particular, the standard methods for model evaluation and the criteria that we looked for in estimators in low-dimensional settings may not be well suited to settings with complex models or high-dimensional data.

1.1 In Sample vs. Out of Sample Fit

When evaluating regression models, we used the R² criterion to evaluate the goodness-of-fit of our predicted model:

R² = 1 − [ Σ_{i=1}^{n} (Yᵢ − Ŷᵢ)² ] / [ Σ_{i=1}^{n} (Yᵢ − Ȳ)² ]
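Since the same formula is used again below to evaluate out-of-sample fit, here is a minimal sketch of how it could be computed by hand in R. The helper name r_squared and the commented example objects (train, test, fit) are assumptions made for illustration, not part of the original notes.

```r
# Compute R^2 from outcomes y and predicted values y_hat.
# The same function can be applied in sample (predictions on the estimation
# sample) or out of sample (predictions on held-out data).
r_squared <- function(y, y_hat) {
  1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)
}

# Illustrative usage (hypothetical data frames):
# fit <- lm(Y ~ X, data = train)
# r_squared(train$Y, fitted(fit))                   # in-sample R^2
# r_squared(test$Y, predict(fit, newdata = test))   # out-of-sample R^2
```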
The R² measures how well the model fits the data in the observed sample. Of course, we do not care directly about the goodness of fit in our sample, but rather the goodness of fit out of sample, or in the overall population. That is, for a model ĝ(X) that is fit on observations i = 1, ..., n, we care about the expected mean squared error,

MSE(ĝ) := E[(Y − ĝ(X))²],

where the expectation is taken over the joint distribution of (Y, X) while leaving the estimate ĝ(X) fixed. This is the criterion that measures how well we can expect our model to predict on new data that it has not seen before. Of course, the problem is that we do not know the joint distribution of Y and X, so we cannot evaluate this mean squared error criterion directly. Instead we must find ways of measuring the mean squared error using our data.

When the models we are considering are simple, such as a linear regression with a few right-hand-side variables, the in-sample fit R² is a good measure of the out-of-sample fit: we can expect simple models with a high R² to generalize well out of sample and have a low MSE. This is not the case when we are considering complex models or when our data is high-dimensional. The problem is that we are evaluating goodness-of-fit on the same sample that we use to estimate the model. This means that we can estimate a very complex model that fits well in sample but fails to generalize out of sample. Formally, we call this problem overfitting.

To demonstrate, we can consider the case of simple linear regression. Suppose we have an outcome variable Y and a single explanatory variable X, with a sample of observations (Yᵢ, Xᵢ), i = 1, ..., n. As a first pass, we can fit a linear regression model that uses just X. This single regression model leaves us with an R² ≈ 0.5243. In an attempt to improve this, we consider adding another term to the model, now including both X and X² in our regression model. This model visually seems to fit the data better and indeed has a much higher R² ≈ 0.6897. Encouraged by this, we consider adding even more terms and try a linear regression model that contains all polynomial terms up to X²⁰.

This complex model does seem to fit our sample quite well, and indeed that is confirmed by the high R² ≈ 0.841. However, it will probably not generalize well out of sample: the fitted function behaves very erratically in regions where there is not much data. Indeed, when we try to use this model for out-of-sample prediction, it behaves very poorly. In fact, when we calculate the out-of-sample R², using the same R² formula as before but generating predicted values from the regression model fit on the original sample, we get an R² of −1795.603. This suggests that our model is doing much worse at predicting Y out of sample than the sample mean alone.

While the in-sample R² is always positive when using OLS with a constant term, in general both the in-sample and out-of-sample R² can be negative. This simply tells us that the estimator is predicting worse than the sample mean. In sample, this can occur if we fit an OLS model without a constant term. Out of sample, this can occur, as above, if we overfit a model so that it does not generalize well.

Of course, this example seems contrived, and indeed we should be able to see from the visualization that the model with 20th-degree polynomial terms looks funky and may not generalize well out of sample. However, when we deal with complex models and higher-dimensional data, such easy visualizations are not available to us and it can be easy to overfit a model without realizing it.
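To make the overfitting example concrete, here is a hedged R sketch in the spirit of the figures described above. The data-generating process, sample size, seed, and train/test construction are illustrative assumptions; the exact R² values quoted in the notes come from the author's own simulation, so the numbers below will differ.

```r
set.seed(471)  # illustrative seed

# Simulated data; the notes state the true conditional mean is a degree-3 polynomial
n     <- 100
x     <- runif(n, -2, 2)
y     <- 1 + 2 * x - x^2 + 0.5 * x^3 + rnorm(n, sd = 2)
train <- data.frame(Y = y, X = x)

# A fresh draw from the same population plays the role of "out of sample" data
x_new <- runif(n, -2, 2)
test  <- data.frame(Y = 1 + 2 * x_new - x_new^2 + 0.5 * x_new^3 + rnorm(n, sd = 2),
                    X = x_new)

r_squared <- function(y, y_hat) 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)

# In-sample R^2 rises mechanically with the polynomial degree,
# while out-of-sample R^2 eventually collapses: overfitting.
for (deg in c(1, 2, 3, 20)) {
  fit <- lm(Y ~ poly(X, deg, raw = TRUE), data = train)
  cat(sprintf("degree %2d: in-sample R2 = %7.3f, out-of-sample R2 = %9.3f\n",
              deg,
              r_squared(train$Y, fitted(fit)),
              r_squared(test$Y, predict(fit, newdata = test))))
}
```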
1.2 Cross-Validation

Given the problem of overfitting, how should practitioners approach model selection when dealing with complex data and statistical models? One popular approach to combat overfitting is cross-validation. The basic idea is to emulate calculating the out-of-sample MSE by leaving part of the data out when estimating the model and then using that left-out part of the data to evaluate the out-of-sample fit. To get a better estimate of the generalizability of the statistical modeling technique, we can then switch the roles of the holdout split (testing split) and the split that we use to estimate the model (training split).

More rigorously, we can define the K-fold cross-validation algorithm as follows (an R sketch of this procedure appears at the end of this subsection).
1. Pick a small number K. Typical values of K are 5 or 10.
2. Randomly split the data into K evenly sized pieces. Label the K index sets I₁, I₂, ..., I_K, where I₁ ∪ ... ∪ I_K = {1, ..., n} and Iⱼ ∩ Iₖ = ∅ whenever j ≠ k.
3. Start with k = 1. Estimate a conditional mean model ĝ₁(·) using all of the data except that in the first split, I₁.
4. Evaluate the mean squared error on the data in the first left-out piece, I₁:

   MSE₁ = (1/|I₁|) Σ_{i ∈ I₁} (Yᵢ − ĝ₁(Xᵢ))²

   Notice that, since the data in I₁ is not used to estimate ĝ₁, we can expect MSE₁ to be an unbiased estimate of the true mean squared error.
5. Repeat the process for holdout splits k = 2, ..., K, each time estimating the model ĝₖ(·) on the data not in Iₖ and calculating MSEₖ.
6. Report the average out-of-sample mean squared error over all K splits,

   MSE_CV = (1/K) Σ_{k=1}^{K} MSEₖ

This average out-of-sample mean squared error, MSE_CV, is often called the cross-validation (CV) score. We would like to pick the model with the lowest CV score.

To see how this would work in the linear regression example from above, let us try choosing a regression model based on the cross-validation score rather than the R². In that example, the CV score decreases until we reach a polynomial model of degree three and then increases afterward. We can compare this to the results we got using the in-sample R², which kept improving as we considered more complex models. Indeed, the CV criterion suggests picking a regression model of degree three, which turns out to be the "correct" choice in this situation: the true conditional expectation in the simulation is a polynomial of degree three.
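As a hedged illustration of the algorithm above, here is a minimal K-fold cross-validation sketch in R for choosing the polynomial degree. The simulated data, the fold construction, the candidate degrees, and the helper name cv_score are assumptions made for the example rather than the author's original code.

```r
set.seed(471)  # illustrative seed

# Simulated data with a degree-3 conditional mean, echoing the notes' example
n  <- 100
x  <- runif(n, -2, 2)
y  <- 1 + 2 * x - x^2 + 0.5 * x^3 + rnorm(n, sd = 2)
df <- data.frame(Y = y, X = x)

K     <- 5
folds <- sample(rep(1:K, length.out = n))  # random assignment of observations to folds

cv_score <- function(degree) {
  fold_mse <- numeric(K)
  for (k in 1:K) {
    train <- df[folds != k, ]   # estimate the model on the other K - 1 folds
    test  <- df[folds == k, ]   # evaluate on the held-out fold
    fit   <- lm(Y ~ poly(X, degree, raw = TRUE), data = train)
    fold_mse[k] <- mean((test$Y - predict(fit, newdata = test))^2)
  }
  mean(fold_mse)                # average out-of-sample MSE across the K folds
}

degrees <- 1:10
data.frame(degree = degrees, cv_score = sapply(degrees, cv_score))
# Pick the degree with the lowest CV score (degree 3 in the notes' simulation).
```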
2 Mean Squared Error and the Bias-Variance Trade-off

When studying estimators in a low-dimensional context, a property that we desired was unbiasedness of the estimator. That is, given an estimator θ̂ of an underlying parameter θ, we wanted estimators that satisfied

E[θ̂] = θ.

Examples of unbiased estimators include the sample mean X̄,

E[X̄] = (1/n) Σ_{i=1}^{n} E[Xᵢ] = (1/n) Σ_{i=1}^{n} E[X] = E[X],

and the OLS estimator, E[β̂₁] = β₁. Unbiasedness is an attractive property when we want to interpret our parameter estimates, as it means that the (sampling) distribution of the estimator is, in some sense, "centered" at the true underlying parameter. If, say, β₁ represents the effect of some policy, unbiasedness assures us that we can interpret β̂₁ as the true effect plus a mean-zero noise term. However, when we are interested only in prediction and when our data is complex, it is less clear that unbiasedness is a desirable property.

To illustrate, we consider the conditional mean function, which relates our outcome, Y, to the explanatory variable, X, via

Y = g(X) + ϵ,  E[ϵ | X] = 0.

We are interested in learning the conditional expectation at a particular test point x₀, g(x₀), and would like to pick an estimator of the conditional mean, ĝ, that has a low mean squared error,

MSE(ĝ(x₀)) = E[(g(x₀) − ĝ(x₀))²],

where the expectation is taken over the sampling distribution of ĝ(·). It turns out that we can always decompose this into two terms,

MSE(ĝ(x₀)) = Var(ĝ(x₀)) + [Bias(ĝ(x₀))]².

The first term, Var(ĝ(x₀)) = E[(ĝ(x₀) − E[ĝ(x₀)])²], describes the "variance" of our estimator, in other words how far or close we expect ĝ(x₀) to be to its mean value. If we assume a homoskedastic linear regression model, Y = β₀ + β₁X + ϵ, where E[ϵ | X] = 0 and E[X] = 0, then

Var(β̂₀) = Var(ϵ)/n,  Var(β̂₁) = Var(ϵ)/(n·Var(X)),  and  Cov(β̂₀, β̂₁) = 0.

Thus the variance of our conditional mean estimator, Var(ĝ(x₀)) = Var(β̂₀ + β̂₁x₀), can be expressed as

Var(ĝ(x₀)) = Var(ϵ)/n + [Var(ϵ)/(n·Var(X))]·x₀².

The second term describes the squared bias of the estimator, where the bias, Bias(ĝ(x₀)) = E[ĝ(x₀)] − g(x₀), is how far we expect the estimator to be from the truth on average.

So far we have only studied estimators that are unbiased. However, a fundamental idea in machine learning is that of a bias-variance trade-off: we can often improve the MSE of our estimator by introducing bias if doing so sufficiently reduces the variance of the estimator.
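To make the decomposition concrete, here is a hedged Monte Carlo sketch in R that approximates the bias, variance, and MSE of ĝ(x₀) for polynomial regressions of different degrees. The data-generating process, the test point x₀, the noise level, and the number of replications are illustrative assumptions, not taken from the notes.

```r
set.seed(471)  # illustrative seed

g  <- function(x) 1 + 2 * x - x^2 + 0.5 * x^3  # assumed true conditional mean (degree 3)
x0 <- 1.5    # test point at which we evaluate bias, variance, and MSE
n  <- 100    # sample size per simulated data set
R  <- 2000   # number of Monte Carlo replications

for (deg in c(1, 3, 10)) {
  # Sampling distribution of ghat(x0): refit the model on R independent samples
  ghat_x0 <- replicate(R, {
    x   <- runif(n, -2, 2)
    y   <- g(x) + rnorm(n, sd = 2)
    fit <- lm(y ~ poly(x, deg, raw = TRUE))
    predict(fit, newdata = data.frame(x = x0))
  })
  bias2 <- (mean(ghat_x0) - g(x0))^2  # squared bias at x0
  vr    <- var(ghat_x0)               # variance of ghat(x0) across samples
  cat(sprintf("degree %2d: Bias^2 = %.3f, Var = %.3f, MSE = %.3f\n",
              deg, bias2, vr, bias2 + vr))
}
```

A typical pattern under this setup is that the too-simple model (degree 1) carries a large squared bias, the too-flexible model (degree 10) carries a large variance, and the degree-3 model attains the smallest MSE, which is exactly the trade-off described above.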
To illustrate this in a simplified context, we start with a famous example due to the statisticians Willard James and Charles Stein.

2.1 Bias-Variance Trade-off and the James-Stein Mean Estimate

To illustrate the potential benefits of introducing bias into our estimator, let us consider a simplified model where we are interested in estimating a 3-dimensional mean. That is, we have variables X₁, X₂, X₃ that are independently and normally distributed with the same known variance σ²:

X₁ ∼ N(θ₁, σ²),  X₂ ∼ N(θ₂, σ²),  X₃ ∼ N(θ₃, σ²).

While σ² is known to us, the means θ₁, θ₂, and θ₃ are unknown. In order to estimate these means we have access to n i.i.d. observations, {X₁ᵢ, X₂ᵢ, X₃ᵢ}, i = 1, ..., n. The standard approach would be to estimate θ₁, θ₂, and θ₃ using their familiar sample means, that is, θ̂₁ = X̄₁, θ̂₂ = X̄₂, θ̂₃ = X̄₃. We know that these estimators are each unbiased for the true values θ₁, θ₂, and θ₃. Moreover, we know that each estimator has Var(θ̂ₖ) = σ²/n. Thus, using the bias-variance decomposition from above, we can write the total mean squared error over the three estimation procedures as

Σ_{k=1}^{3} MSE(θ̂ₖ) = Σ_{k=1}^{3} ( Var(θ̂ₖ) + [Bias(θ̂ₖ)]² ) = Σ_{k=1}^{3} ( σ²/n + 0 ) = 3σ²/n.

Within the class of unbiased estimators of (θ₁, θ₂, θ₃), it turns out that this is the lowest total MSE we can achieve. However, given what we know about the bias-variance trade-off, we may ask whether a lower-MSE estimator of the population means θ₁, θ₂, and θ₃ is possible if we introduce some bias.

In a famous paper, James and Stein showed that such an improved estimator is indeed possible. They started off with some preliminary guess of the true parameters θ₁, θ₂, and θ₃; let us call this preliminary guess v = (v₁, v₂, v₃). The idea behind the James-Stein estimator is to "shrink" the sample means towards this preliminary guess. For each k = 1, 2, 3, the James-Stein estimator is defined as

θ̂ₖ^JS = ( 1 − σ² / Σ_{j=1}^{3} (X̄ⱼ − vⱼ)² ) (X̄ₖ − vₖ) + vₖ.

The total MSE of the James-Stein estimator is given by

Σ_{k=1}^{3} MSE(θ̂ₖ^JS) = 3σ²/n − E[ σ² / (n(1 + 2W)) ] < 3σ²/n,

where W follows a Poisson distribution with mean Σ_{k=1}^{3} θₖ²/2. Despite the fact that the James-Stein estimator is biased, this total MSE is strictly lower than the MSE of the standard sample means. Interestingly, this is the case regardless of which preliminary guess (v₁, v₂, v₃) we choose to shrink towards. It is also notable that this result holds only when we are estimating three or more means, and that the optimality is only for the total MSE. Indeed, the genius of the James-Stein estimator is that it decreases the MSE of some of the components while increasing the MSE of others; however, without knowing (θ₁, θ₂, θ₃) a priori, we cannot know in which directions we improve and in which directions we degrade.
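As a hedged illustration, the following R sketch simulates the total MSE of the plain sample means and of the James-Stein estimator defined above. The true means, the guess v, the seed, and the number of replications are assumptions chosen for the example; for simplicity it uses a single observation per variable (n = 1), so that X̄ₖ is just Xₖ and the shrinkage factor matches the display above with Var(X̄ₖ) = σ².

```r
set.seed(471)  # illustrative seed

theta <- c(1.0, -0.5, 2.0)  # true means: unknown in practice, fixed here for simulation
sigma <- 1                  # known standard deviation
v     <- c(0, 0, 0)         # preliminary guess we shrink towards
R     <- 10000              # Monte Carlo replications

se_mean <- numeric(R)  # total squared error of the sample means, per replication
se_js   <- numeric(R)  # total squared error of the James-Stein estimator, per replication

for (r in 1:R) {
  xbar   <- rnorm(3, mean = theta, sd = sigma)   # the three sample means (n = 1)
  shrink <- 1 - sigma^2 / sum((xbar - v)^2)      # James-Stein shrinkage factor
  js     <- shrink * (xbar - v) + v              # shrink towards the guess v

  se_mean[r] <- sum((xbar - theta)^2)
  se_js[r]   <- sum((js   - theta)^2)
}

cat("Total MSE, sample means :", mean(se_mean), "\n")  # close to 3 * sigma^2
cat("Total MSE, James-Stein  :", mean(se_js),   "\n")  # smaller on average
```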
3 Regularization

In high-dimensional settings, where we have many possible right-hand-side variables that we could use in our model, we want to be careful about how we select which statistical model to use and about how we estimate these models, in order to avoid overfitting. An "overfitted" estimate depends heavily on the exact sample we draw, so we can think of overfitting as an example of choosing an estimator whose variance is too high. In the example above, if we were to draw different samples of (Yᵢ, Xᵢ), i = 1, ..., n, and fit a 20th-degree polynomial to each sample, we would get drastically different estimates each time, because the 20th-degree polynomial tries to fit the data we observe almost exactly. Comparatively, because there are only two parameters in a simple linear regression model, our estimates of β̂₀ and β̂₁ would look very similar across samples. However, a simple linear regression model may not accurately predict the outcome Y if the true conditional mean is not linear.

This contrast represents a trade-off between bias and variance. By choosing a simpler model we may be biased, as we cannot capture the true conditional mean function, but our estimator will have a low variance. Conversely, by choosing a more complicated model we will be less biased, but we may have a high variance. Ideally, we would like to be somewhere between these two extremes.

To prevent overfitting, we may want to use some principled methods to tame the complexity of the fitted model. These methods are called regularization methods, and in terms of the bias-variance trade-off we can think of them as introducing some bias (we are "biasing towards" a less complicated model) in order to greatly reduce the variance, and thus the MSE, of our estimated model. It is important that these methods are principled, i.e., that they do not rely too heavily on user input. For example, a naive approach to model selection might be to run a linear regression with all variables to begin with, calculate p-values on the estimated coefficients, and then remove insignificant variables one by one until you have a parsimonious and statistically significant model. This type of approach is a bad idea for a few reasons.
1. When you have multiple highly correlated regressors, the p-values for each of them may look large (insignificant) even if any one of the variables provides useful information about the response variable. Thus you may mistakenly remove a useful variable.
2. The p-values themselves may not be interpretable when testing many coefficients at a time, similarly to the issues with multiple hypothesis testing that we discussed for sample means.

This sort of approach, where we start with a complex model and then pare it down to a simpler model, is called backwards stepwise regression. For the reasons listed above it should be avoided. What about the other direction, however? What if we start with a simple model and keep building it up until we reach a reasonable, more complicated model?

3.1 Forward Stepwise Regression

This is called forward stepwise regression. Suppose we have an outcome variable Y and many possible regressors, X₁, ..., X_d. A forward stepwise regression would be implemented as follows.
1. Fit all univariate models,

   Y = β₀ + β₁Xⱼ + ϵ,  j = 1, ..., d,

   and choose the model with the highest in-sample R² value, say the one that uses covariate Xₖ.
2. Fit all bivariate models that include Xₖ,

   Y = β₀ + β₁Xₖ + β₂Xⱼ + ϵ,  j ≠ k.

   Choose the one with the highest in-sample R² value, say the one adding Xₛ, and add Xₛ to your inclusion set.
3. Repeat: given an inclusion set of variables S = {Xₖ, Xₛ, ...}, fit all models

   Y = β₀ + β₁Xₖ + β₂Xₛ + ··· + βₘXⱼ + ϵ,  j ∉ S,

   and again add the variable Xⱼ that maximizes the in-sample R².

This process keeps going until you reach some preset level of complexity or until you stop based on some model selection rule. Regardless of what exact form it takes, this stopping rule is our first example of a regularization technique. By not considering the full set of possible models, we are potentially introducing bias, since our simpler model may not be complex enough to capture the true underlying conditional mean function. However, by selecting a smaller set of coefficients to include, we ensure that our model can be estimated precisely, which prevents overfitting and reduces the variance of our estimator.

In R, this can be implemented using the "step" function as follows. Suppose that our variables Y, X₁, ..., X_d are contained in the data frame "dFrame." reg
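The code listing is cut off at this point in the transcript. As a rough sketch, not the author's original code, a forward stepwise fit with R's built-in step() function might look like the following. The data frame name dFrame and the response name Y come from the text; the object names null_model, full_model, and reg are assumptions, and note that step() adds variables based on AIC rather than the in-sample R² used in the description above.

```r
# Hedged sketch of forward stepwise selection with R's built-in step() function.
# Assumes dFrame contains the response Y and the candidate regressors X1, ..., Xd.
null_model <- lm(Y ~ 1, data = dFrame)   # intercept-only starting model
full_model <- lm(Y ~ ., data = dFrame)   # largest model we are willing to consider

reg <- step(null_model,
            scope     = formula(full_model),  # variables step() is allowed to add
            direction = "forward")            # add one variable at a time (by AIC)

summary(reg)  # the selected model
```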