Chapter 7: Supervised Learning - Model Estimation

Learning Objectives

This chapter builds on concepts learned in Chapter 3, with a focus on the estimation of linear regression models using ordinary least squares and maximum likelihood methods, the estimation of model parameters when data with nonlinear characteristics are used, and the optimization of model parameters using the gradient descent method. Initial insight into the predictive value of models and techniques for improving model output is also covered, including over- and under-fitting, the bias-variance tradeoff, and methods for adjusting models with highly correlated features. After completing this chapter, you should be able to:

- Compare and contrast the Ordinary Least Squares (OLS) and Maximum Likelihood methods.
- Explain how the gradient descent method is used to optimize parameter estimates.
- Explain how backpropagation is used to determine the weights in neural networks.
- Discuss the differences between underfitting and overfitting and potential remedies for each.
- Describe the tradeoff between bias and variance.
- Explain the use of regularization techniques to simplify models.
- Describe cross-validation and its uses.
- Describe the accuracy-interpretability tradeoff.
- Describe how grid search and bootstrapping can be used to optimize hyperparameter estimation.

Estimating the parameters (often called weights in machine learning parlance) is an essential step in the process of building a model. There are three families of techniques that could potentially be used for this purpose:

1. Least squares – we choose the parameter values that minimize the residual sum of squares.
2. Maximum likelihood – we form a likelihood function and choose the parameter values that maximize it. This provides parameter estimates that maximize the likelihood that we would observe the data that occurred.
3. The method of moments – we construct a set of "moment restrictions" based on an assumed distribution for the data, and we solve them to choose the parameters.

In fact, under certain conditions, both the least squares and the maximum likelihood techniques are nested within the method of moments framework. Consequently, for simple models, the three approaches will yield identical parameter estimates. The third approach is not very useful in a machine learning context, is rarely used for such applications, and will not be discussed further here. The remainder of this chapter therefore concentrates on least squares and maximum likelihood. These two methods, and their various extensions, constitute virtually the only tools we might need to estimate machine learning models. In all cases, we form an objective function and either maximize it (for maximum likelihood) or minimize it (for least squares). For least squares, the objective function is also known as the loss function.

Optimization is the process of finding the parameter estimates that best fit the data. An essential distinction we need to draw is between analytical and numerical methods for optimization. Analytical methods are those where there is a closed-form solution to the optimization problem, so the optimal parameter values will be unique and can be calculated using a formula or set of formulae. Analytical solutions will usually be available when the objective function can be differentiated with respect to the parameters.
On the other hand, when such a differentiation cannot be conducted or is too complex to evaluate (for example, because of high dimensionality), a numerical procedure can be employed to estimate the parameters. Numerical procedures usually start with an initial guess for each of the parameters; the optimizer then tries to improve on these initial values, gradually moving towards a set of estimates that fits the data well. Both analytical and numerical approaches are discussed below.

Many of the estimation methods we discuss in this chapter involve the use of hyperparameters. These are parameters that specify the configuration of a model or the learning process while training the model. Examples are the number of hidden layers in a neural network and the learning rate used in the optimization methods discussed later in the chapter.

This chapter is organized as follows. Section 7.1 begins by examining how linear regression models can be estimated using ordinary least squares, which finds a line or hyperplane in the feature space that best fits the training sample data. We also examine how the model parameters are estimated in nonlinear models using nonlinear least squares. The method for determining the model parameters numerically using gradient descent is also discussed in detail in that section. Section 7.2 discusses the maximum likelihood method. We then discuss over- and under-fitting and the bias-variance tradeoff in Section 7.3. This is followed in Section 7.4 by a presentation of regularization methods that are used to "trim" large models where the features are highly correlated. Section 7.5 explains how cross-validation with a grid-search procedure can be used to make more efficient use of the data and allow for effective hyperparameter tuning when the overall sample size is small.

7.1.1 Ordinary Least Squares

Ordinary least squares (OLS) is the most straightforward of the available estimation methods and is an analytical approach used to estimate the parameters in linear regression models of the type discussed in Chapter 3. To see how it works, let us return to the example used in that chapter of the relationship between bank workers' years of experience and salary. As a reminder, the 14 data points are repeated in Table 7.1.

Table 7.1: Salary and experience for a notional sample of bank employees

Observation, i   Experience, xi   Salary, yi
 1                1               15.46
 2                3               22.40
 3                7               27.47
 4               12               34.31
 5                9               33.08
 6                3               18.70
 7               20               35.06
 8               22               35.78
 9                0               14.81
10                4               14.84
11                6               25.78
12                8               28.17
13                4               26.36
14                2               11.12

In this case, suppose that we use a simple linear regression, so that there is one feature, xi (experience), and one label or target, yi (salary). We would like to identify the intercept and slope parameters in the model that best describe the relationship between the variables. This line of best fit was shown in Figure 3.1, but we will now give more detail on where it came from. OLS finds this line by forming the residual sum of squares as a function of the parameters and minimizing it. We can write the regression equation as:

yi = b0 + b1 xi + ui   (7.1)

Here, xi and yi are the variables, and we have 14 observations for each; b0 and b1 are the intercept and slope parameters, respectively, to be estimated; ui is a random error term (also known as a disturbance term), which is assumed to have a zero mean and constant variance. The error term allows for the fact that it is very unlikely there would be a perfect linear relationship between the two variables.
We define the residual, ûi, as the difference between the actual value of the series, yi, and the fitted value from the model for that data point, ŷi:

ûi = yi − ŷi

Then we can construct the residual sum of squares, RSS, which is obtained by taking each of the residuals, squaring them, and then adding them over all N data points:

RSS = Σi ûi² = Σi (yi − ŷi)²   (7.2)

Note that we square the residuals first rather than simply summing them; otherwise, the positive and negative residuals (corresponding to points above and below the fitted line, respectively) would cancel each other out, and squaring makes them all positive. The mean squared error (MSE) is another commonly used metric. It is the average of the residual sum of squares:

MSE = RSS / N = (1/N) Σi (yi − ŷi)²   (7.3)

We want to find the values of the parameters that minimize this RSS, and so we substitute with the fitted equation for the line, ŷi = b̂0 + b̂1 xi, where the hats above the parameters denote the estimated values:

RSS = Σi (yi − b̂0 − b̂1 xi)²   (7.4)

We partially differentiate this expression for the RSS separately with respect to each of the parameters, set the derivatives to zero, and then rearrange the formulae to obtain expressions for the optimal intercept and slope estimates. The resulting formulae are, for the intercept:

b̂0 = ȳ − b̂1 x̄   (7.5)

And for the slope:

b̂1 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)²   (7.6)

where x̄ and ȳ denote the means of the observations on x and y, respectively. If we plugged the 14 observations for the experience and salary data of Table 7.1 into these formulae, we would end up with the estimates given in Chapter 3, which were b̂0 = 16.88 and b̂1 = 1.06. Any other regression line having a different intercept or slope coefficient would fit the data less well and would lead to a higher RSS, as shown for the impact of changing the slope in Figure 7.1. The same results will be obtained if we use MSE instead of RSS.

Figure 7.1 How RSS changes as the parameter moves away from its optimal value

OLS minimizes the sum of the squares of the distances from all the points together to the fitted line. Figure 7.2 shows the fitted line, the actual data point, the fitted value, and the residual for just a single data point. In this case we plot data point #5 in Table 7.1, but we could have done this for any of the points in the table. We can see here that the actual value of the salary is $33.08 per hour, but the model would have predicted a value of $26.42 (calculated as 16.88 + 9 × 1.06). Thus, for this data point, the value of the residual is 33.08 − 26.42 = 6.66.

Figure 7.2 The residual, actual and fitted values for data point number 5

Once the parameters are estimated, we can similarly calculate the fitted values and residuals for all the data points, as shown in Table 7.2.

Table 7.2: Salary and experience for a notional sample of bank employees with calculated fitted values and residuals

Observation, i   Experience, xi   Salary, yi   Fitted value, ŷi   Residual, ûi
 1                1               15.46        17.94              −2.48
 2                3               22.40        20.06               2.34
 3                7               27.47        24.30               3.18
 4               12               34.31        29.59               4.71
 5                9               33.08        26.42               6.66
 6                3               18.70        20.06              −1.36
 7               20               35.06        38.07              −3.01
 8               22               35.78        40.19              −4.40
 9                0               14.81        16.88              −2.07
10                4               14.84        21.12              −6.28
11                6               25.78        23.24               2.54
12                8               28.17        25.36               2.81
13                4               26.36        21.12               5.24
14                2               11.12        19.00              −7.88

As Table 7.2 shows, although OLS will minimize the collective distances from the data points to the line, none of the residuals is very close to zero, and some points are well above the line (positive residual), while others are well below it (negative residual).
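For concreteness, here is a minimal Python sketch (standard library only) that applies equations (7.5) and (7.6) to the data in Table 7.1 and reproduces the fitted values and residuals shown in Table 7.2:

```python
# OLS estimates for the salary-experience data in Table 7.1,
# computed from the closed-form formulae (7.5) and (7.6)
x = [1, 3, 7, 12, 9, 3, 20, 22, 0, 4, 6, 8, 4, 2]
y = [15.46, 22.40, 27.47, 34.31, 33.08, 18.70, 35.06, 35.78,
     14.81, 14.84, 25.78, 28.17, 26.36, 11.12]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)          # slope, equation (7.6)
b0 = y_bar - b1 * x_bar                            # intercept, equation (7.5)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")             # 16.88 and 1.06

# Fitted values and residuals, as in Table 7.2
for i, (xi, yi) in enumerate(zip(x, y), start=1):
    fitted = b0 + b1 * xi
    print(f"{i:2d}  x = {xi:2d}  y = {yi:6.2f}  fitted = {fitted:6.2f}  residual = {yi - fitted:6.2f}")
```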
If we believe the noise level is low, then we may suspect that the model fit is not very good, which might motivate us to consider alternative or additional predictors, such as the square of the experience term discussed in Chapter 3.

A similar approach would be used in the multiple linear regression context, where we have more than one explanatory variable. If there were two explanatory variables (i.e., three parameters to estimate: an intercept and two slopes), we could derive a set of three equations like the two above. But with more than two explanatory variables, although the formulae can still be derived, they become increasingly long and unwieldy. Therefore, it is more common to use matrix notation for the model and the estimation formula, which extends naturally and straightforwardly to however many explanatory variables there are.

7.1.2 Nonlinear Least Squares

OLS can be used in situations where the underlying model is linear in the parameters, but this does not apply to many machine learning models, such as neural networks. In these cases, a more flexible approach is needed. Nonlinear least squares (NLS) is an approach that can be used when the model is nonlinear, and it works using the same principles as OLS – i.e., by minimizing the residual sum of squares. This would still be written as:

RSS = Σi (yi − ŷi)²,  with  ŷi = f(x1i, x2i, …, xmi; w1, w2, …)   (7.7)

But in this case, f can be any nonlinear function of the m explanatory variables or features, which are denoted xi, and the corresponding parameters are denoted wi (also known as weights in the case of neural networks). Similarly, the mean squared error (MSE) can also be calculated. Because the relationship between the features and the output could in principle take any form, it is often not possible to derive a set of closed-form solutions to this minimization problem. Therefore, NLS usually uses a numerical approach to finding the optimal parameter estimates, and it proceeds with the following steps:

1. Begin with a set of initial values for the parameters – these could either be randomly generated or 'best preliminary guesses.'
2. Evaluate the objective function (RSS or MSE).
3. Modify the parameter estimates and re-evaluate the objective function.
4. If the improvement in the objective function is below a pre-specified threshold, stop and report the current parameter values as the chosen ones. If not, return to step 2.

The third of these steps is the crucial aspect, and usually a gradient descent algorithm is employed, which is discussed in a following sub-section after a preliminary discussion of hill climbing.1

1 Genetic algorithms are one of the alternative approaches for machine learning model parameter optimization. They are discussed in Appendix 7.A.

7.1.3 Hill Climbing

A simple form of optimizer for estimating the parameters of a nonlinear model is hill climbing. This involves starting with initial guesses of each parameter and then making small changes in both directions to each parameter, one at a time. The aim is to maximize the value of an objective function (for instance, increasing the value of a likelihood function or increasing the negative of the RSS) until no further improvement in its value is observed. Hill climbing is very straightforward because it does not require the calculation of derivatives, and therefore it can be applied to non-differentiable functions.
It is also simple to implement, and for this reason it is sometimes termed a "heuristic optimizer." However, it has several disadvantages:

- Of all the optimization techniques available, hill climbing is the most susceptible to getting stuck in local optima.
- Convergence to the optimal solution can be very slow.
- Only one parameter can be adjusted at a time, meaning that it is easy for the algorithm to miss optimal parameter combinations, particularly for complex and highly interconnected models.

Given these limitations of hill climbing, an approach based on gradient descent instead usually forms the workhorse of parameter optimization for machine learning models.

7.1.4 The Gradient Descent Method

A popular numerical procedure for parameter estimation is known as gradient descent, which is illustrated in Figure 7.3. In this method, the objective function, for example the residual sum of squares, is minimized. Suppose that all the parameters to be estimated are stacked into a single vector W; the objective function in this case is known as the loss function and is denoted L(W). At each iteration, the algorithm chooses the path of steepest descent ("slope"), which is the one that will reduce the value of the loss function the most. The method works similarly when the maximum likelihood method is used, in which case the negative of the log-likelihood function is minimized.

Figure 7.3 Gradient descent. The left panel demonstrates how gradient descent works; the center panel shows how an overly high learning rate leads to overshooting; and the right panel shows a loss function containing a local optimum or plateau.

Figure 7.3 demonstrates how the technique works for a single parameter, wi. Starting with an initial guess, wi(0), the aim is to move towards the optimal value, wi*, which occurs where the loss function is minimized. The gradient of the function is given by its partial derivative with respect to the parameter, ∂L/∂wi, evaluated at the initial guess. The guess of the weight is then updated to a revised value according to the formula:

wi(t+1) = wi(t) − η (∂L/∂wi)   (7.8)

Here, ∂L/∂wi is the partial derivative of the loss function with respect to the weight wi, and η is called the learning rate, usually in the (0,1) range. Using the chain rule, we can also write this partial derivative of the loss function with respect to the weight wi as:

∂L/∂wi = (∂L/∂ŷ) × (∂ŷ/∂wi)   (7.9)

If mean squared error is selected as the loss function, the partial derivative of the loss function with respect to the weight wi is:

∂L/∂wi = −(2/N) Σj (yj − ŷj)(∂ŷj/∂wi)   (7.10)

Determining the gradient of the loss function with respect to the weight therefore involves calculating the partial derivative of the output with respect to that weight. Note that the adjustment to weight wi takes place in the opposite direction to the gradient (or "slope"), hence the negative sign in equation (7.8) for updating the weight. When the gradient is positive, as it is in the illustrative example of the curve in the left panel of Figure 7.3, the estimate of the weight wi will be reduced at the next step. Conversely, if the gradient had been negative, the current value of wi would be too small, and hence it would be increased at the next step. As described earlier, the process of calculating the gradient and updating the weights continues until the value of the loss function can no longer be improved, indicating that convergence has been achieved. To avoid the iterations going into infinite cycles, a maximum number of iterations is assigned to the algorithm.
In machine learning parlance, each iteration – comprising calculating the loss function and gradient and then adjusting the weights using the whole training data sample – is known as an epoch. We usually use the entire training data sample and minimize the loss function with respect to all of it, which is known as batch gradient descent. A slightly less common alternative is to apply gradient descent to one data point at a time, individually selected at random from the training set, which is known as stochastic gradient descent, or to apply it to subsets of the training data, which is known as mini-batch gradient descent. The benefit of stochastic gradient descent is that it does not require the entire database to be loaded into memory simultaneously, which reduces the required computational resources compared with the batch approach. However, because the weights are updated after the algorithm sees each new data point, convergence on the optimal values will require more iterations and will be less smooth.

The learning rate η is a hyperparameter that defines how much adjustment to the weights takes place. This hyperparameter must be chosen judiciously: if it is too small, each iteration will yield only a modest improvement in the loss function, and movement towards the optimum will be slow, requiring many iterations in total. On the other hand, if η is too large, there is a danger of overshooting and an erratic path towards the optimal wi*. The algorithm can overshoot the minimum, oscillate around it, or fail to converge, as illustrated in the center panel of Figure 7.3.

For batch gradient descent, η is fixed a priori, but for stochastic gradient descent, the learning rate can be made a diminishing function of the number of epochs that have already occurred, so that the weight updating slows down as the algorithm comes closer to the optimum. In other words, we could employ dynamic learning, which entails starting with a larger η to get close to the optimal solution faster, but then reducing η as the learning proceeds to avoid overshooting. Technically, this means that η is subject to a decay function. Popular choices are the exponential decay:

η(t) = η(0) e^(−kt)   (7.11)

or the inverse decay:

η(t) = η(0) / (1 + kt)   (7.12)

where the parameter k controls the rate of the decay, t is the epoch number, and η(0) is the initial learning rate.

Gradient descent is a neat technique that usually works well, but it can run into problems. One such issue occurs when the function does not have a single, global optimum but rather a series of local optima or an extended plateau away from the true optimum, as illustrated in the right panel of Figure 7.3. In such circumstances, the optimizer can get stuck at the local optimum or plateau and never reach the optimal solution wi*.

7.1.5 Illustration of the Gradient Descent Method

Although we know that the OLS estimator can be derived analytically as shown above, it is useful for pedagogical purposes to illustrate how gradient descent would work if we had used it to derive the estimates for the regression of salary on experience. To keep complexity to a minimum and focus on illustrating how the optimization works, assume that we already know b0 = 16.88, so that we only need to optimize over b1. The function that we are trying to minimize is:

MSE(b1) = (1/14) Σi (yi − 16.88 − b1 xi)²   (7.13)

We also know from above that the gradient of this function is given by:

dMSE/db1 = −(2/14) Σi xi (yi − 16.88 − b1 xi)   (7.14)

Suppose that we start from an initial value of 0.8 and we use a learning rate equal to 0.005.
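The following short Python sketch carries out these updates using the data in Table 7.1 and reproduces the figures reported in Table 7.3 below:

```python
# Gradient descent for the slope b1 in the salary-experience regression,
# holding the intercept fixed at b0 = 16.88 (a pedagogical sketch)
x = [1, 3, 7, 12, 9, 3, 20, 22, 0, 4, 6, 8, 4, 2]
y = [15.46, 22.40, 27.47, 34.31, 33.08, 18.70, 35.06, 35.78,
     14.81, 14.84, 25.78, 28.17, 26.36, 11.12]

b0, b1, eta = 16.88, 0.8, 0.005          # fixed intercept, initial guess, learning rate

for iteration in range(6):
    # Gradient of the MSE with respect to b1, as in equation (7.14)
    grad = -(2 / len(x)) * sum(xi * (yi - b0 - b1 * xi) for xi, yi in zip(x, y))
    change = -eta * grad                  # move against the gradient, equation (7.8)
    print(f"{iteration}: b1 = {b1:.3f}, gradient = {grad:.3f}, change = {change:.3f}")
    b1 += change
```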
Table 7.3 illustrates the iterations that lead the algorithm to converge to the value of b1 for which the MSE is minimized. Using the initial value, we compute the gradient, which is equal to −48.66. Given the learning rate of 0.005, the change that we need to apply to b1 is −0.005 × (−48.66) = 0.243 (as discussed above, we move in the opposite direction to the gradient). Therefore, the new value of the parameter is 1.043. We continue in this fashion for a number of iterations, as shown in the table below.

Table 7.3: Value of b1 in successive iterations when the learning rate is 0.005

Iteration   b1      Gradient    Change in b1   New b1
0           0.800   −48.664     0.243          1.043
1           1.043    −3.024     0.015          1.058
2           1.058    −0.188     0.001          1.059
3           1.059    −0.012     0.000          1.059
4           1.059    −0.001     0.000          1.059
5           1.059     0.000     0.000          1.059

After five iterations, the algorithm converges to 1.059. If we calculate the gradient for this parameter value, it is approximately zero: we have reached the minimum point, and no further adjustments are needed.

7.1.6 Backpropagation

Although the above discussion illustrated the technique by focusing on one parameter, wi, in practice the gradients are calculated for all the weights at each iteration. Determining the optimal weights in a neural network model is particularly challenging because, even with a single hidden layer, the output is a function of a function – it is like running a logistic regression on the output from another logistic regression. Therefore, using a technique such as gradient descent requires repeated use of the chain rule of differentiation. As outlined in Chapter 4, a technique known as backpropagation is used along with gradient descent to determine the weights in neural network models.

The backpropagation algorithm involves starting on the right-hand side of the neural network (i.e., beginning with the output layer) and then successively working backward through the layers to update the weight estimates at each iteration. This begins by calculating the errors (actual minus fitted values) for each target data point; these errors are then "assigned" to each of the weights in the layer before it. Gradient descent can then be applied to the weights to calculate improved values. The output layer error in the target is determined via a feedforward of the feature values with the updated weights, and the process continues again. The derivatives are computed starting from the output layer and moving backward, with an application of the chain rule (hence the name backpropagation). The algorithm stops when convergence is achieved – that is, when updating the weights no longer reduces the cost function (or reduces it by a trivially small amount). The key to backpropagation is to consider each layer separately rather than trying to do all the computation in a single step, because breaking it down in this way greatly simplifies the mathematics.

In summary, the steps to implement the backpropagation approach to learning the optimal ANN weights are:

1. Generate initial guesses (usually at random) for all the weights, including the biases.
2. Given these weights, feedforward the values of the inputs to calculate the values at each neuron and then finally the value of the outputs. This is done separately for each of the N data points in the training sample.
3. Once the fitted values of the outputs are determined, the error, which is the difference between the network output and the actual value, can be calculated for each observation.
   a. If this is the first iteration, proceed to step 4.
   b. If the residual sum of squares is below a particular threshold, or has not improved much since the previous iteration, or the number of iterations has reached a pre-specified maximum value, stop and fix the weights at their current values. Otherwise, proceed to step 4.
4. During the backward pass, the gradient descent method is used to calculate improved values of the weights. In this process, the error is propagated through the network, and the weights are updated to minimize the loss function. Return to step 2 and run through a further iteration.2

A solution is obtained after several iterations (epochs). The required number of iterations depends on the problem at hand and the criteria used to determine convergence of the solution, as described in step 3(b).

2 The gradient of the loss function is calculated for each data point by working backward using the chain rule of partial differentiation.

7.1.7 Computational Issues

In the previous section, we discussed how to find the weights such that some cost function is minimized. Clearly, we hope to find the global minimum of such a function. However, a risk exists that the algorithm will fail to find the global minimum and will instead get trapped in a local optimum (see the right-hand panel of Figure 7.3 for an illustration of this). Sometimes a momentum term is added to the optimizer, which increases the effective step size if the previous change in the weights and the current change are in the same direction but reduces it if they are in opposite directions. The benefit of this approach is that it speeds up convergence while at the same time reducing the probability of overshooting the optimal parameter values. Incorporating momentum involves modifying the updating scheme discussed above to:

wi(t+1) = wi(t) − η (∂L/∂wi) + μ (wi(t) − wi(t−1))   (7.15)

where wi(t−1) is the weight from the iteration before and μ is the momentum rate, which can be chosen between 0 and 1. The parameter μ controls how much of the previous weight change we keep in the next iteration. This works by 'overshooting' the target, which helps prevent the algorithm from getting stuck in local minima. Generally, the problem of local minima is less critical in networks with many hidden layers (deep networks).

However, deep networks are plagued by another computational issue: the so-called vanishing gradient. Loosely, the problem can be understood as follows. Backpropagation is an application of the chain rule, which entails the multiplication of several derivatives (as many as there are layers in the network). When the derivatives are numbers between 0 and 1 (which is always the case for the logistic function), their product tends to become very small very quickly. The opposite problem is an exploding gradient, where the product of the derivatives becomes larger and larger. When one of these problems emerges, the only way to find the optimum is by using an extremely large number of small updates, which of course makes the learning very slow. Vanishing or exploding gradients are by far the most relevant problems for deep network structures such as convolutional and recurrent neural networks (see the discussion below). Solutions to these issues include:

- An appropriate choice of activation function. In recent years, the use of the logistic function as an activation function for hidden layers has been abandoned in favor of, for instance, the ReLU function, which is less prone to the vanishing gradient problem.
- Batch normalization. This consists of adding 'normalization layers' between the hidden layers.
  The normalization works in a fashion similar to what we discussed in Chapter 1, where the features were normalized by subtracting the mean and dividing by the standard deviation. Here, it is the new inputs originating from the hidden layers that are normalized.
- Specific network architectures. Some network architectures have been developed to be resistant to the vanishing or exploding gradient problem. An example is the long short-term memory (LSTM) network (see the discussion in Chapter 10).

7.2 Maximum Likelihood

Maximum likelihood (ML) is another estimation technique that can be used for nonlinear models as an alternative to NLS. ML can also be used for linear models, in which case it will give identical estimates for the intercept and slope parameters as OLS if the random errors are independently normally distributed with zero mean and constant variance. ML is, to a certain degree, a more flexible framework, which enables it to be used to estimate the parameters of a wide range of specifications provided that the distribution of the data-generating mechanism is known. For instance, it is used for numerous models from the GARCH (generalized autoregressive conditionally heteroskedastic) family that are commonly used for forecasting volatility. More relevant here, ML is also used to estimate logit and probit models, which, like GARCH models, are nonlinear.

Recall from Chapter 3 that for a logit model, we have two possible outcomes: yi = 0 or yi = 1, which might, for example, correspond to a customer not being granted a loan and being granted a loan, respectively, or not defaulting on a mortgage payment and defaulting, etc. The probability that yi = 1 is given by:

Pi = 1 / (1 + e^(−(b0 + b1 xi))) = F(b0 + b1 xi)   (7.16)

and the probability that yi = 0 is 1 − Pi. The function F denotes the logistic function, and it is just a short-hand way of writing the middle part of this equation. We could also call Pi the likelihood that yi = 1, and therefore the likelihood that yi = 0 is 1 − Pi. The probability distribution for yi given the data on the explanatory variables can be written:

P(yi) = Pi^yi (1 − Pi)^(1−yi)   (7.17)

Here, Pi is the likelihood that yi = 1 for just one observation i, but because ML works by selecting the parameters that maximize the chances of the training data occurring, we want the joint likelihood of all N data points taken together, not just a single observation. This joint likelihood, denoted L, is obtained by multiplying all the individual probability distributions together:

L = ∏i Pi^yi (1 − Pi)^(1−yi)   (7.18)

Notice that the ∏ notation is used here to denote that the functions are multiplied: as long as the observations are independent, the joint probability of all N data points is the product of Pi across the positive outcomes (yi = 1) in the training set multiplied by the product of (1 − Pi) across the negative outcomes (yi = 0). If yi is 1, the ith term reduces to Pi; if it is zero, the ith term reduces to (1 − Pi). It is easier to maximize the log-likelihood function, log(L), than to maximize the likelihood function itself, because the logarithmic transformation turns a multiplication into a sum.
The log-likelihood is obtained by taking the natural logarithm of the above expression:

log L = Σi [yi log(Pi) + (1 − yi) log(1 − Pi)]   (7.19)

This can also be written as a sum over the positive and negative outcomes separately:

log L = Σ{i: yi=1} log(Pi) + Σ{i: yi=0} log(1 − Pi)   (7.20)

Once the parameters that maximize this expression have been estimated, predictions can be constructed from the model by setting a threshold, Z, estimating the value of Pi using the equation above, and then specifying the category that observation i is predicted to belong to as follows:

ŷi = 1 if Pi > Z;  ŷi = 0 if Pi ≤ Z   (7.21)

If the costs of being wrong are the same for the two categories (i.e., if incorrectly classifying a value of y as one when it should be zero is just as bad as classifying y as zero when it should be one), we might set Z = 0.5. But in other cases, a different threshold is more useful (see also the discussion in Chapter 8). Consider classifying loans according to the probability that they will default (yi = 1) or be fully paid back (yi = 0). In this case, we might set Z equal to a low value, such as 0.05, for decision making. This is because the cost of predicting that a loan will pay back when it defaults (i.e., the cost of making a bad loan) is much greater than the cost of predicting that a loan will default when it turns out to be fine (i.e., the profit foregone because the loan was not made).

7.3 Overfitting, Underfitting, and the Bias-Variance Tradeoff

Overfitting was only briefly discussed in Chapter 1. We return to the issue here and discuss it in more detail. Ideally, we would always specify the "correct" model containing the "correct" variables, which would lead us to estimate the most appropriate number of parameters given the features and the outputs. However, we will never know the true process generating the data, and we will only have a sample of data upon which to select an appropriate model and estimate the parameters. Consequently, the size of the model we estimate is an empirical choice, which leads to the possibility that the model contains too many parameters (overfitting) or too few (underfitting). Each of these problems and their implications are discussed below.

7.3.1 Overfitting

Overfitting is a situation in which a model is chosen that is "too large" or excessively parameterized. A simple example is when a high-degree polynomial is used to fit a data set that is roughly quadratic. The most obvious sign of an overfitted model is that it performs considerably worse on new data. When building a model, we use a training data set and a validation data set. The training set is used to estimate the model parameters, and the validation set is used to evaluate the model's performance on a separate data set. An overfitted model captures excessive random noise in the training set rather than just the relevant signal. Overfitting gives a false impression of an excellent specification because the RSS on the training set will be very low (possibly close to zero). However, when applied to other data not in the training set, the model's performance will likely be poor, and the model will not be able to generalize well. Overfitting is usually a more severe issue with machine learning than with conventional econometric models due to the larger number of parameters in the former. For instance, a standard linear regression model generally has a relatively small number of parameters, whereas it is not uncommon for neural networks to have several thousand.

7.3.2 Underfitting

Underfitting is the opposite problem to overfitting and occurs when relevant patterns in the data remain uncaptured by the model.
For instance, we might expect the relationship between the performance of hedge funds and their size (measured by assets under management) to be quadratic: funds that are too small have insufficient access to resources and cannot spread their costs thinly, while funds that are too big may struggle to implement their strategies in a timely fashion without causing adverse price movements in the market. A linear model would not be able to capture this phenomenon: it would estimate a monotonic relationship between performance and size and so would be underfitted. A more appropriate specification would allow for a nonlinear relationship between fund size and performance. Failure to include relevant interaction terms, as described in Chapter 3, would be a further example of underfitting.

It is clear from these examples that underfitting is more likely in conventional models than in some machine-learning approaches, where only a minimal assumption (such as smoothness) on the signal is imposed. However, it is also possible for machine-learning approaches, as well as econometric models, to underfit the data. This can happen either when the number or quality of inputs is insufficient or when the steps taken to prevent overfitting are excessively stringent. In such cases, the model fit to the training data will be poor, and there will be characteristics of the output variable that remain uncaptured by the postulated model. This will also likely lead to biased estimates of the parameters on the variables that are included in the model.

7.3.3 The Bias-Variance Tradeoff

The choice of the "size" of the machine-learning model, which will determine whether the data are overfitted, underfitted, or appropriately fitted, involves what is termed a bias-variance tradeoff. If the model is underfitted, the omission of relevant factors or interactions will lead to biased predictions but with low variance. On the other hand, if the model is overfitted, there will be low bias but high variance in the predictions. This is illustrated in Figure 7.4.

Figure 7.4 Illustration of how bias and variance are affected by model complexity.3

An example can help in understanding the problem of overfitting discussed above and how an analyst can strike a balance between it and the opposite problem, underfitting. Consider an analyst assigned the task of predicting the price at which a house will sell using the age of the house as a predictor. A simple approach would be to estimate a linear regression of the house prices on their ages:

Pricei = b0 + b1 Agei + ui   (7.22)

The regression line fitted using a training sample of 201 data points is depicted in Figure 7.5 (Panel A, blue line). Visibly, a linear regression is insufficient to fully capture the relationship between house prices and their ages. In fact, both very new and very old houses (those with a historic value) seem to be more expensive, while the linear regression forces the predicted prices to be strictly decreasing in age. Fortunately, the linear regression model can accommodate polynomials of higher degree, as the linearity requirement concerns the parameters, not the explanatory variables (see Chapter 3). Therefore, the analyst could add a quadratic term to the regression above and estimate:

Pricei = b0 + b1 Agei + b2 Agei² + ui   (7.23)

The red line in Figure 7.5, Panel A, is the estimated quadratic regression. This looks like a much more accurate representation of the data. Finally, the green line in Figure 7.5 (Panel A) is the result of fitting a polynomial of the ninth degree to the data.
This highly complex function can capture some highly nonlinear patterns in the data. However, it also captures some data-specific noise (e.g., see the curved shape between 20 and 30 years of age), and it does not generalize well when we deploy the estimated model to predict unseen data, as shown in Figure 7.5 (Panel B).4 A highly complex model minimizes the MSE in the training sample but may fail to do so when deployed on new data. In the example illustrated in Figure 7.5, the MSE over the training sample is 157.67 for the linear regression, 130.34 for the quadratic regression, and 121.73 for the ninth-order polynomial regression. However, when we use the trained model to predict observations that were not part of the training set, the MSE is 156.37 for the linear regression, 137.08 for the quadratic regression, and 139.82 for the ninth-order polynomial regression. Therefore, in this example, the quadratic model strikes the right balance between fitting the training data accurately and generalizing well to new data.

Figure 7.5 A scatter plot of house prices (y-axis) and house ages (x-axis) for a training and a test sample of 201 observations each. Notes: The data have been fitted using a linear regression (blue line), a quadratic regression (red line), and a ninth-degree polynomial (green line). Panel A shows the training sample that has been used to estimate the parameters of the linear, quadratic, and polynomial regressions. Panel B plots the estimated regression lines against the data belonging to the test sample.

Figure 7.6 illustrates another example of how underfitting and overfitting can manifest themselves. Here, a single feature is plotted on the x-axis and an output variable on the y-axis. The left panel shows a linear regression fit to the data, which is clearly insufficient to describe the series and will give rise to predictions that are highly biased. The center panel shows the result of applying a high-order polynomial fit. This line passes almost perfectly through the training points but is evidently overfit. The right panel shows a quadratic polynomial, which strikes a better balance between over- and underfitting.

Figure 7.6 Three line fits to a set of points. The left panel shows a linear regression, the center panel shows a 20th-order polynomial, and the right panel shows a quadratic equation.

Overfitting is a particularly likely and severe problem with neural networks. The very basic neural network in the example presented in Figure 4.10 in Chapter 4 has 10 parameters, but often there are several hidden layers and many more nodes per layer. This leads to an enormous number of parameters and the likelihood of overfitting unless specific steps are taken to avoid it. One approach to limiting the chances of overfitting is to carry out calculations for the validation data set at the same time as the training data set. As the algorithm steps down the multi-dimensional valley, the objective function will improve for both data sets, but at some stage, further steps down the valley will start to worsen the value of the objective function for the validation set while still improving it for the training set. This is the point at which the gradient descent algorithm should be stopped, because further steps down the valley will lead to overfitting the training data and therefore poor generalization and poor predictions for the test sample.
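The qualitative pattern in the house-price example is easy to reproduce. The sketch below (Python with numpy; the quadratic ground truth and noise level are invented for illustration) fits polynomials of degree 1, 2, and 9: the training MSE never rises with the degree, while the test MSE typically bottoms out near the true quadratic and worsens for the ninth-degree fit.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Synthetic quadratic relationship plus noise (invented for illustration)
    x = rng.uniform(0, 10, n)
    y = 5 + 2 * x - 0.25 * x**2 + rng.normal(0, 2, n)
    return x, y

x_train, y_train = sample(201)
x_test, y_test = sample(201)

for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)    # least-squares polynomial fit
    mse_train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    mse_test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE = {mse_train:.2f}, test MSE = {mse_test:.2f}")
```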
3 Reprinted with permission from Scott Fortmann-Roe.
4 The choice to split the sample into two portions with an equal number of observations is for ease of exposition. As discussed above, in machine learning applications the sample is generally split into three sets, with the training sample typically being the largest.

7.3.4 Prediction Accuracy Versus Interpretability

There is another, often unrecognized, trade-off between a model's prediction accuracy and its interpretability (or lack thereof). Machine learning models are often highly complex and heavily parametrized, so they have often been accused of being "black boxes."5 More flexible models often deliver more accurate predictions (with the caveats discussed above concerning overfitting), as they can generate a wider range of shapes for the function that maps the features to the outcome. Therefore, they can fit the highly complex and nonlinear patterns found in real-world data. However, these models often lack interpretability. The linear regression model is often inadequate to capture the complex nature of real-world relationships between the predictors and the target variable. Yet its popularity among financial economists remains unchallenged, thanks to its ability to deliver an easy-to-understand relationship between the predictors and the outcome that resonates with financial theory. Generally, less flexible but more interpretable models are preferred when the goal is to investigate causal relationships. In contrast, more flexible models tend to be the obvious choice when the goal is to make accurate predictions.

5 Unlike some other machine learning models, decision tree models are highly interpretable.

7.4 Regularization

The stepwise selection methods discussed in Chapter 3 add or remove predictors to a regression with the aim of finding the combination that maximizes model performance. An alternative is to fit a model on all m features but use a regularization technique that shrinks the regression coefficients towards zero. Regularization can be used for standard linear regression models such as those discussed above and for many of the other machine-learning models discussed in subsequent chapters. It is usually good practice to normalize or standardize the data beforehand using one of the methods discussed in Chapter 1. The two most common regularization techniques are ridge regression and the least absolute shrinkage and selection operator (LASSO). Both work by adding a penalty term to the objective function that is being minimized: the penalty term is the sum of the squares of the coefficients in ridge regression and the sum of the absolute values of the coefficients in LASSO. Regularization can simplify models, making them easier to interpret, and reduce the likelihood of overfitting to the training sample.

7.4.1 Ridge Regression

Suppose that we have a dataset with N observations on each of m features in addition to a single output variable y and, for simplicity, assume that we are estimating a standard linear regression model, with hats above parameters denoting their estimated values. The relevant objective function (referred to as a loss function) in ridge regression is:

L = Σi (yi − ŷi)² + λ Σj β̂j²   (7.24)

The first sum in this expression is the usual regression objective function (i.e., the residual sum of squares), and the second is the shrinkage term, which introduces a penalty for slope parameter values that are large in magnitude (of either sign).
The parameter λ controls the relative weight given to the shrinkage versus model fit, and some experimentation is necessary to find the best value in any given situation. Parameters that are used to determine the model but are not part of the model itself are referred to as hyperparameters: in this case, λ is a hyperparameter, and the β̂s are model parameters.

7.4.2 LASSO

LASSO is a similar idea to ridge regression, but the penalty takes an absolute value form rather than a square:

L = Σi (yi − ŷi)² + λ Σj |β̂j|   (7.25)

Whereas there is an analytic approach to determining the values of the βs for ridge regression, a numerical procedure must be used to determine these parameters for LASSO, because the absolute value function is not everywhere differentiable. Ridge regression and LASSO are sometimes known, respectively, as L2 and L1 regularization, after the order of the penalty terms in these methods. There are key differences between them:

- Ridge regression (L2) tends to reduce the magnitude of the β parameters, making them closer to, but not equal to, zero. This simplifies the model and avoids situations in which, for two correlated variables, a large positive coefficient is assigned to one and a large negative coefficient to the other.
- LASSO (L1) is different in that it sets some of the less important β estimates exactly to zero.

The choice of one approach over the other depends on the situation and on whether the objective is to reduce extreme parameter estimates or to remove some terms from the model altogether. LASSO is sometimes referred to as a feature selection technique because, de facto, it removes the less important features by setting their coefficients equal to zero. As the value of λ is increased, more features are removed.

Ridge regression and LASSO can also be used with logistic regression. Maximizing the likelihood is equivalent to minimizing its negative. Therefore, to apply regularization, we add λ times the sum of the squares of the parameters, or λ times the sum of the absolute values of the parameters, to the negative of the expression for the log-likelihood. The objective is then to find the values of the parameters that jointly minimize this composite of the negative log-likelihood and the penalty term.

7.4.3 Elastic Net

A third possible regularization tool is a hybrid of the two above, where the loss function contains both squared and absolute-value functions of the parameters:

L = Σi (yi − ŷi)² + λ1 Σj β̂j² + λ2 Σj |β̂j|   (7.26)

By appropriately selecting the two hyperparameters (λ1 and λ2), it is sometimes possible to obtain the benefits of both ridge regression and LASSO: reducing the magnitudes of some parameters and removing some unimportant ones entirely.

7.4.4 Regularization Example

Suppose that we were interested in regressing a time series of stock index returns on a set of Treasury yields of different maturities, using the data from the PCA example in Chapter 1 as features. As mentioned, the features are usually rescaled before running ridge or LASSO; but in this case the magnitudes are similar, and so, to keep the example simple, we will skip the rescaling step. The results presented in Table 7.4 show that an ordinary least squares (OLS) regression produces β parameters that are quite large in magnitude, with some βs having a sign opposite to what would be expected. This indicates that the features in this model are highly correlated. The ridge regressions reduce the magnitudes of the parameters, with the higher value of λ shrinking them more.
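Estimates of the kind reported in Table 7.4 can be produced in a few lines. Here is a minimal sketch, assuming scikit-learn and hypothetical arrays X (the yield features) and y (the stock returns) in place of the Chapter 1 data; note that scikit-learn calls the penalty hyperparameter alpha rather than λ:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Hypothetical stand-in data: X holds the Treasury-yield features,
# y the stock index returns (randomly generated for illustration)
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 7))
y = X @ rng.normal(size=7) + rng.normal(scale=0.5, size=120)

for name, model in [("OLS", LinearRegression()),
                    ("Ridge, lambda = 0.1", Ridge(alpha=0.1)),
                    ("LASSO, lambda = 0.1", Lasso(alpha=0.1))]:
    fit = model.fit(X, y)
    # LASSO typically zeroes out some coefficients; ridge only shrinks them
    print(name, np.round(fit.coef_, 2), "intercept:", round(float(fit.intercept_), 2))
```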
Some of the βs change sign in going from OLS to ridge regression. LASSO, on the other hand, reduces some coefficient values all the way to zero. When λ = 0.1, only one coefficient (plus the intercept) is non-zero, and it has a negative sign, indicating an inverse relationship between stock returns and Treasury yields.

Conducting a regularized regression effectively requires selecting the hyperparameter carefully. Often, this involves choosing a value of λ that produces a model that is easy to interpret while still producing accurate forecasts. The data can be split into a training set, validation set, and test set, as explained in Chapter 1. The training set is used to determine the coefficients for a particular value of λ; the validation set is used to determine how well the model generalizes to new data; and the test set is used to provide a measure of the accuracy of the chosen model. Sometimes, the simpler models produced using regularization generalize better than the original OLS linear regression model.

Table 7.4 OLS, Ridge, and LASSO regression estimates. An illustration of ridge regression and LASSO applied to a regression containing highly correlated features, with two different hyperparameter values.

Feature     OLS       Ridge, λ = 0.1   Ridge, λ = 0.5   LASSO, λ = 0.01   LASSO, λ = 0.1
Intercept     5.17     2.67             2.46             2.61              2.39
USTB1M      −23.22    −6.55            −2.00            −1.13             0
USTB3M       50.64    10.00             2.45             1.35              0
USTB6M      −37.64    −3.82            −0.51             0                 0
USTB1Y       11.00     0.70             0.40             0                 0
USTB5Y       −5.55    −1.75            −1.41            −1.22            −0.71
USTB10Y       9.13     0.57            −0.11             0                 0
USTB20Y      −5.88    −0.08             0.36             0.14              0

7.5 Cross-validation and Grid Searches

The methods presented so far have concentrated on training a model using the data set, validating it, and then testing it. This section presents more advanced techniques for developing machine learning models, which apply the model-fitting methods discussed so far repeatedly and systematically, searching for models with better fit or identifying optimal values of hyperparameters that will result in better models. The motivation for these techniques comes from the need to make efficient use of the available data and to identify optimal values for hyperparameters, such as the regularization term in regressions or the number of layers in a neural network, while keeping computational costs manageable.

7.5.1 Cross-Validation

So far, we have discussed fitting a model after partitioning the available data set into training, validation, and test samples. Ideally, the available dataset will be large enough to allow for reasonably sized training, validation, and test samples. But sometimes that will not be the case. For example, suppose that we only have a dataset of 100 instances in total. Splitting this into three sub-samples in the conventional way might entail a training sample of 70 instances and just 15 each for validation and testing. In such circumstances, cross-validation is a technique that can be deployed to use the data more efficiently. It involves combining the training and validation data into a single sample, with only the test data held back. This means that there is effectively no separate validation sample, only a combined sample, which we now call the training sample. This training data is then split into equally sized sub-samples, with the estimation being performed repeatedly and one of the sub-samples being left out each time.
The technique known as k-fold cross-validation splits the total data available, N, into k samples, and it is common to choose k = 5 or k = 10. Suppose for illustration that k = 5. The training data would then be partitioned into five equally sized, randomly selected sub-samples, each comprising 20% of that data. If we define the sub-samples ki (i = 1, 2, 3, 4, 5), the first estimation would use samples k1 to k4, with k5 left out. Next, the estimation would be repeated with sub-samples k1 to k3 and k5, with k4 left out, and so on. This situation is illustrated in Figure 7.7, with the training folds in green and the validation folds in blue; with k = 5, there is a 5 × 5 framework. At the end, there will be k = 5 "validation samples" (one for each iteration) whose results can be averaged to determine the model's performance.

Figure 7.7 Five-fold cross-validation

A larger value of k implies an increased training sample size, which might be valuable if the overall number of observations is low. The limit as k increases would be k = N, which corresponds to having as many folds as there are data points in the training set. This situation is known as N-fold cross-validation, jack-knifing, or leave-one-out cross-validation (LOOCV). Cross-validation represents a very resourceful use of the data, but it has at least two disadvantages:

- Using LOOCV will increase the size of the matrix in Figure 7.7 to N × N, which maximizes the sizes of the training and validation samples but is computationally expensive, because the model is trained at each of the N iterations. (Remark: for linear models, LOOCV is computationally inexpensive because of an updating formula, the Sherman-Morrison-Woodbury formula.) For nonlinear models, cross-validation always increases the computational costs, which grow with k.
- For unordered data, the points allocated to each fold will usually be selected randomly. This implies that k-fold cross-validation cannot be used when the data have a natural ordering – for example, time-series observations, where there is a need to preserve the order and where the validation sample would usually comprise points that are chronologically after the training data. In this case, the appropriate framework would be to use a rolling window.

7.5.2 Stratified Cross-validation

When the overall sample is very small, there is a heightened risk that one or more of the training or validation sub-samples will, purely by chance, comprise a set of datapoints that are atypical compared with the other sub-samples: some classes or types of data will be overrepresented in the training sample and underrepresented in the validation sample, and vice versa for the other classes. The use of cross-validation will mitigate this issue to some extent, but an extreme case of it could cause even more severe problems. Specifically, if the output data are categorical but unbalanced between categories, there might be no instances of one or more categories in one or more of the sub-samples. For instance, suppose that we have N = 100 sample observations on whether a car loan borrower defaulted (= 1) or repaid their loan (= 0), with only ten customers who defaulted. If we use cross-validation with k = 10, it is likely that for some of the ten iterations there will be no defaulting customers in the validation sample, which will lead to a distorted evaluation of the models applied to the validation sample in those cases.
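A quick sketch (assuming scikit-learn; the 90/10 label vector mimics the hypothetical car-loan sample above) makes the risk concrete by counting the defaults that land in each validation fold; the stratified variant introduced next guarantees exactly one per fold:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

rng = np.random.default_rng(7)
y = np.array([1] * 10 + [0] * 90)   # 10 defaults among 100 car loans
rng.shuffle(y)
X = np.zeros((100, 1))              # placeholder features

for name, splitter in [("plain k-fold", KFold(n_splits=10, shuffle=True, random_state=0)),
                       ("stratified", StratifiedKFold(n_splits=10, shuffle=True, random_state=0))]:
    # Plain k-fold folds often contain 0 or 2+ defaults; stratified folds contain exactly 1
    defaults_per_fold = [int(y[val_idx].sum()) for _, val_idx in splitter.split(X, y)]
    print(name, "defaults in each validation fold:", defaults_per_fold)
```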
A potential solution to this problem is to use stratified k-fold cross-validation. In that case, instead of drawing the k samples (without replacement) to comprise the validation data, the positive and negative outcomes are sampled separately, in proportion to their presence in the overall sample. So if, for example, k = 10 and we have N = 100 with 90 non-defaults and 10 defaults, we would select nine non-defaults and one default at random for each fold, so that each of the 10 folds contains the same 9:1 ratio of the two classifications. This approach guarantees that each class is represented to an equal degree in all training and validation samples.

An alternative way to deal with imbalanced classes is to generate further artificial instances of the minority (underrepresented) class using what is known as SMOTE (Synthetic Minority Over-sampling Technique), or to use an asymmetric loss function that puts more weight on incorrect predictions for this class. However, these approaches are more complex than those described above and hence are not considered further.

7.5.3 Bootstrapping

A further variant on k-fold cross-validation is to use a bootstrap. Bootstrapping is a simulation technique in which new data samples are created by sampling with replacement from the original data. In this context, it involves, for each iteration, drawing a sample of size N (the combined size of the training and validation samples) with replacement. This sample is highly likely to contain some instances from the original sample more than once, while some instances will not appear at all – typically, around a third of the original data will not be sampled in each iteration.6 Those instances not appearing in the bootstrapped training sample (called out-of-bootstrap data) then comprise the validation sample for that iteration. A large number of iterations would be performed (10,000 or more if computational resources permit), and the results averaged over the iterations, as for k-fold cross-validation.

Cross-validation and bootstrapping represent more efficient uses of the data than an arbitrary separation into training and validation sets, because, effectively, every observation appears in both the training and validation samples across different folds. Cross-validation is also straightforward to implement, but a disadvantage is that it can be computationally expensive if the model is complex, the number of folds is large, or the total sample size is large. If there are k folds and h different possible values for the hyperparameter to consider, this will involve estimating kh separate models, each on a sample of size N(1 − 1/k). That could be computationally infeasible. Bootstrapping with a large number of iterations will be even more computationally demanding, although recent advances in computing have made this less onerous than previously.

6 Technically, the expected fraction of data points not appearing in the bootstrapped training sample is 1/e ≈ 37%, where e is the exponential constant (2.72…), if the sample is large.

7.5.4 Grid Searches

The purpose of cross-validation might be to determine the optimal value of a hyperparameter. To do this, the researcher might use a grid search procedure, which involves selecting a set of possible hyperparameter values. To illustrate, suppose that the model under study involves specifying one hyperparameter, λ. This might, for example, be the hyperparameter that controls the strength of the penalty term in a LASSO regularization.
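In practice, this kind of search is usually automated by library routines. Before walking through the procedure step by step, here is a minimal sketch of a cross-validated grid search, assuming scikit-learn (which calls the LASSO penalty alpha) and hypothetical arrays X and y; the grid below is illustrative rather than the 0-100 grid of the worked example that follows:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Hypothetical training data for illustration
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 8))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Candidate penalty values, evaluated by 5-fold cross-validation
param_grid = {"alpha": np.arange(0.01, 1.01, 0.01)}
search = GridSearchCV(Lasso(), param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X, y)   # by default, refits on the full training sample with the best alpha
print("best lambda:", search.best_params_["alpha"])
```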
7.5.4 Grid Searches

The purpose of cross-validation might be to determine the optimal value of a hyperparameter. To do this, the researcher might use a grid search procedure, which involves selecting a set of possible hyperparameter values. To illustrate, suppose that the model under study involves a single hyperparameter, λ. This might, for example, be the hyperparameter that controls the strength of the penalty term in a Lasso regularization.

Assume that the researcher determines that a range of 0 to 100 is plausible to investigate, with a step size of 1. Using 5-fold cross-validation to determine the most appropriate value of λ could then proceed as follows (a code sketch of the procedure appears after the list):

1. Separate the composite training sample into five randomly assigned sub-samples.
2. Set λ = 0.
3. Perform the following operations:
   a. Combine four of the sub-samples and estimate the model under study on that composite.
   b. Using the remaining sub-sample, calculate a performance measure, such as the percentage of correct classifications or the mean squared error of the predictions.
4. Repeat steps 3a and 3b for the other four combinations of the sub-samples.
5. Calculate the average of the performance measure across the five validation folds.
6. Add one to λ; if λ ≤ 100, repeat steps 3 to 5, otherwise proceed to the next step.
7. There will now be 101 performance statistics, one for each value of λ in the range (0, 100). Select the optimal value λ* corresponding to the best value of the performance statistic.
8. Perform one final estimation of the model, this time using the entire training sample with the hyperparameter set to λ*.

Constructing a grid search framework has some drawbacks. A first issue is that the researcher may have no idea of even the scale of the hyperparameter, so a power-of-ten grid might be needed for λ, such as 10⁻¹, 10⁰, 10¹, …, 10¹⁰. It would be possible to search over a coarse grid, including relatively few points over a wide range, but that could leave the best hyperparameter value from the grid search still a long way from the optimal value; and getting close to the latter with a more refined grid could impose a vast computational burden, although this is becoming less of a concern with recent advances in computing technology. A second problem is that searching over too many grid points is another manifestation of overfitting and could lead to weaker test-sample performance.

An alternative to a grid search for hyperparameter selection is to use random draws. This can reduce the computational time significantly and seems to work surprisingly well compared with more structured approaches. But if the researcher is unlucky, none of the randomly selected hyperparameter values may come close to the optimum.
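The eight steps above translate directly into code. This sketch reuses the hypothetical k_fold_cv helper from the k-fold example earlier; fit_model here is assumed to accept the candidate hyperparameter as a third argument, and higher scores are assumed to be better (negate an MSE-style loss before use). None of these names refer to a real library API.

import numpy as np

def grid_search_lambda(X, y, fit_model, score_model, grid, k=5, seed=0):
    """Pick the hyperparameter value from `grid` with the best average CV score."""
    results = {}
    for lam in grid:
        # Fixing the seed means k_fold_cv builds the SAME folds for every
        # candidate lam, matching step 1 (partition once, reuse throughout).
        fit_with_lam = lambda Xt, yt, lam=lam: fit_model(Xt, yt, lam)
        results[lam] = k_fold_cv(X, y, fit_with_lam, score_model, k, seed)
    best_lam = max(results, key=results.get)      # step 7: best average score
    final_model = fit_model(X, y, best_lam)       # step 8: refit on all training data
    return best_lam, final_model, results

# e.g., the integer grid 0, 1, ..., 100 described above:
# best_lam, model, results = grid_search_lambda(X, y, fit_lasso, score, range(101))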
Appendix 7.A Genetic Algorithms

Another family of approaches for optimizing machine learning model parameters is based on genetic algorithms (GAs). These techniques, sometimes known as evolutionary algorithms, apply thinking from evolutionary biology, capturing the fundamental ways in which populations of animals evolve over the very long term, according to Darwinian principles, to become more resilient to hazards in their environment. They are thus loosely analogous to the genetic processes of reproduction and "survival of the fittest". GAs have found widespread application in estimating parameters or fine-tuning hyperparameters in decision trees, support vector machines, and neural networks. They can also aid feature selection.

GAs treat each individual parameter as a chromosome and each combination of parameters as a person, with the set of all possible parameter combinations regarded as the population. The parameters are optimized over many generations. The process is initialized by establishing a first-generation population of randomly assigned parameter combinations. Then, at each generation (a toy implementation is sketched at the end of this appendix):

1. We define a "fitness" measure (known in GA terminology as a fitness function) that captures how good the optimization is for a given combination of parameters; this could be, for instance, the residual sum of squares (to be minimized) or the value of a likelihood function (to be maximized).
2. Each individual person (set of parameters) is entered into a "mating pool", and two people (sets of parameters) are selected at random to combine, with the selection process designed so that individuals with higher fitness (better model fit) are more likely to be chosen.
3. The genetic information (parameter values) of these two individuals is combined to create offspring (i.e., new parameter combinations). This is termed "crossover" or "recombination".
4. Random changes (known as "mutations") are also made to the parameter values at the time of combination to create more diversity in the population of estimates. A hyperparameter controls the extent of mutation.

This process continues until either a pre-specified number of generations has been reached, or the improvement in the fitness of the fittest individual (i.e., the best parameter combination so far) falls below a threshold and the solution has converged.

Genetic algorithms constitute a very unstructured approach to parameter optimization. They do not require gradient calculations or estimates of the derivatives of the loss function with respect to the parameters, so they are particularly useful for very complex models where gradient-based solution would be challenging, and they can be employed in a wide variety of contexts and with complex model structures. GAs are more robust with respect to local optima than more conventional techniques because they have the potential to search over a wider range of the parameter space, although they can still suffer from this problem when "long-term fitness" is dominated by "short-term fitness" and parameter combinations that maximize the latter are selected over those that might have maximized the former.

In practice, encoding parameter values into strings (chromosomes) efficiently can pose challenges. If no efficient encoding scheme exists, the chromosomes could become very long, making a GA an infeasible choice of optimization algorithm. GAs can also demand considerable computational resources, because the required number of iterations can be large and, at each iteration, the fitness level must be calculated for the whole population of parameter combinations.
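The following toy sketch illustrates the generation loop for a real-valued encoding, using softmax fitness-proportional selection as one simple (assumed) selection rule; production implementations vary widely in their selection, crossover, and mutation operators, and would typically also preserve the best individual across generations.

import numpy as np

def genetic_minimize(loss, n_params, pop_size=50, n_generations=200,
                     mutation_rate=0.1, mutation_scale=0.5, seed=0):
    """Toy real-valued GA: minimize `loss` over parameter vectors."""
    rng = np.random.default_rng(seed)
    pop = rng.normal(size=(pop_size, n_params))      # first-generation population

    for _ in range(n_generations):
        fitness = -np.array([loss(p) for p in pop])  # higher fitness = lower loss
        # Selection: fitter individuals are more likely to enter the mating pool
        probs = np.exp(fitness - fitness.max())
        probs /= probs.sum()
        parents = pop[rng.choice(pop_size, size=(pop_size, 2), p=probs)]
        # Crossover: each offspring gene is drawn from one of its two parents
        mask = rng.random((pop_size, n_params)) < 0.5
        offspring = np.where(mask, parents[:, 0, :], parents[:, 1, :])
        # Mutation: random perturbations maintain diversity in the population
        mutate = rng.random((pop_size, n_params)) < mutation_rate
        offspring += mutate * rng.normal(scale=mutation_scale,
                                         size=(pop_size, n_params))
        pop = offspring

    return min(pop, key=loss)                        # fittest surviving individual

# Example: fit a linear model y = Xb by minimizing the residual sum of squares
# rss = lambda b: np.sum((y - X @ b) ** 2)
# b_hat = genetic_minimize(rss, n_params=X.shape[1])

Note how no derivative of the loss appears anywhere in the loop, which is precisely why GAs remain usable when gradient-based methods are impractical.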