
Supervised Learning
MInDS @ Mines

In this lecture we discuss variations of linear regression that are useful when we have a large number of features. We'll cover lasso, ridge regression, and elastic nets, which are variations on linear regression that add regularization terms to prevent overfitting. We want our model to generalize, and adding regularization is a key way to do that.

ℓp & ℓp,q norms

Before we begin discussing the common variations of linear regression that include regularization, let's first revise some linear algebra regarding ℓp and ℓp,q norms. An ℓp norm of a vector x is the pth root of the sum of its n values raised to the pth power,

    ||x||_p = ( \sum_{i=1}^{n} |x_i|^p )^{1/p}.    (1)

The ℓ1-norm of a vector is the Manhattan distance, the total distance along each axis; this is also called the taxicab distance. The ℓ2-norm of a vector is the Euclidean distance, the straight-line distance between two points.

An ℓp,q norm of an n × m matrix X, with rows x_1, ..., x_n, is the ℓq-norm of the vector formed by taking the ℓp-norm of each row,

    ||X||_{p,q} = ( \sum_{i=1}^{n} ( \sum_{j=1}^{m} |x_{ij}|^p )^{q/p} )^{1/q} = ( \sum_{i=1}^{n} ||x_i||_p^q )^{1/q}.    (2)

The ℓp,q norm of a matrix X where p = q = 2 is called the Frobenius norm and can be represented as ||X||_F. Note that p, q ≥ 1; however, we define the ℓ0-"norm" as the number of non-zero values in a vector.

With that definition of norms, let's look at some patterns that occur at particular values of a norm, for example at ||w||_p = 1:

At ||w||_0 = 1, we only have one non-zero feature from the vector space.

At ||w||_1 = 1, we have a diamond-like structure which hits each axis at the value of the norm, 1.

At ||w||_2 = 1, we have a round circular/spherical structure which hits each axis at the value of the norm, 1.

A great post that may provide more intuition around regularization is available at https://medium.com/mlreview/l1-norm-regularization-and-sparsity-explained-for-dummies-5b0e4be3938a.

Figure 1: An overview of ℓp norm visualizations.

Linear Regression

We define the objective function for linear regression as the minimization of the squared Euclidean distance between the model and the true value,

    \min_w ||y - w^T X||_2^2.    (3)

With a large number of features in X, our model wants to use all the features, which can result in overfitting. The model is also more likely to overfit when we don't have a significantly large amount of data relative to the number of features. This is a result of the curse of dimensionality, which we will cover in a later lecture when we discuss feature learning. When the model features are highly correlated, the linear regression model is very sensitive to random noise. To handle this problem, we use regularization.

Regularization

To prevent our model from overfitting, we can add a regularization term to the objective function. Regularization terms allow us to mitigate some of the issues that cause ordinary linear regression to overfit to our data. We do that by adding a minimization portion to the objective function that applies to the trained coefficients of our model. In general terms, we usually solve the following function,

    \min f(x) + r(x),    (4)

where f(x) is a goodness-of-fit function and r(x) is a regularization function. Generally, when minimizing a function, we can simply add a regularization term to it that allows us to mitigate overfitting to the training dataset. Examples of these functions for linear regression follow.
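To make the norm definitions concrete, here is a small numpy sketch (the vector and matrix values, and the helper names lp_norm and lpq_norm, are just for illustration) that evaluates equations (1) and (2), counts non-zeros for the ℓ0-"norm", and checks that the ℓ2,2 matrix norm matches numpy's Frobenius norm.

```python
import numpy as np

def lp_norm(x, p):
    """l_p norm of a vector, eq. (1): (sum_i |x_i|^p)^(1/p)."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

def lpq_norm(X, p, q):
    """l_{p,q} norm of a matrix, eq. (2): l_q norm of the row-wise l_p norms."""
    row_norms = np.sum(np.abs(X) ** p, axis=1) ** (1.0 / p)
    return np.sum(row_norms ** q) ** (1.0 / q)

x = np.array([3.0, -4.0, 0.0])
print(lp_norm(x, 1))        # 7.0 -> Manhattan / taxicab distance
print(lp_norm(x, 2))        # 5.0 -> Euclidean distance
print(np.count_nonzero(x))  # 2   -> the l_0 "norm" (number of non-zero values)

X = np.array([[1.0, 2.0], [3.0, 4.0]])
print(lpq_norm(X, 2, 2))         # l_{2,2} norm ...
print(np.linalg.norm(X, "fro"))  # ... equals the Frobenius norm ||X||_F
```
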
Ridge Regression

The ridge regression model adds an ℓ2 regularization to reduce the values of the coefficients of our model. The objective is,

    \min_w ||y - w^T X||_2^2 + α ||w||_2^2,    (5)

where α is a constant hyperparameter of our model. We can use α to adjust how sensitive the model is to the training data. When the model is trained with a larger value for α, it reduces how much the model focuses on the data; with a smaller value for α, the model focuses more on the trends in the data. When α = 0, we end up with a model that is equivalent to the base linear regression.

Figure 2: An example of the ℓ2 norm set to a particular value. As we minimize this value, the circle gets a smaller radius.

When we have correlations between features and we use a regularization, the model will focus on gaining as much information as possible from the smallest number of features. This minimizes the effect of collinearity on the learned model. With ridge regression (using the ℓ2-norm), the model won't necessarily try to completely eliminate / zero out a particular feature's coefficient, but it will lower their values.

Lasso

The lasso model adds an ℓ1 regularization to make some coefficients of our model go to 0. The objective is,

    \min_w ||y - w^T X||_2^2 + α ||w||_1,    (6)

where α is a constant hyperparameter of our model. This α works similarly to the ridge regression method's α. Just like ridge regression, the model is trying to minimize the effect of collinearity on the learned model; however, when we use the ℓ1-norm, the model will try to more strongly reduce the features and we will see many zero values for our coefficients.

Figure 3: An example of the ℓ1 norm set to a particular value. As we minimize this value, the diamond gets smaller.

Elastic Net

The elastic net model balances between both approaches of lasso and ridge regression by utilizing both the ℓ1 and ℓ2-norms. The objective for elastic nets is,

    \min_w ||y - w^T X||_2^2 + λ_1 ||w||_1 + λ_2 ||w||_2^2,    (7)

where λ1, λ2 are the coefficients for the ℓ1 and ℓ2-norms respectively. We can control these hyperparameters for each norm separately, or we can create a relationship between λ1 and λ2. We can define two new hyperparameters, α and ρ, where α is the normalization coefficient and ρ is a balancing ratio between the ℓ1 and ℓ2-norms. This results in the objective,

    \min_w ||y - w^T X||_2^2 + α ρ ||w||_1 + (α (1 - ρ) / 2) ||w||_2^2.    (8)

With the elastic net, using either hyperparameter approach, the model will focus on lowering the coefficients and concentrating on fewer features that are key to the resulting target. Compared to lasso, the model should have more features utilized; compared to ridge, the model should have more features zeroed out. In both of these cases, if either the lasso or ridge regression model is a better model, when we conduct our hyperparameter search we will find that λ1 = 0 or λ2 = 0. In the case where both λ1 = λ2 = 0, the resulting model is the original linear regression model.

Group Lasso

When the features of the data belong to some logical grouping, we can often also incorporate the group norm. The group norm is calculated based on a defined grouping of the features. The idea here is to incorporate the data's logical grouping when regularizing so that you can identify the most important groups of features. Group lasso is defined similarly to lasso, but instead of the simple ℓ1 regularization we use the group ℓ2 regularization.
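As a quick illustration of how these penalties behave in practice, the following scikit-learn sketch fits plain linear regression, ridge (eq. 5), lasso (eq. 6), and an elastic net (eq. 8, with l1_ratio playing the role of ρ) on synthetic data and counts how many coefficients each model drives to zero. Note that scikit-learn scales the squared-loss term slightly differently across these estimators, and the dataset and α values below are arbitrary choices for demonstration, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

# Synthetic data with many features, only a few of which are informative.
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

models = {
    "linear":      LinearRegression(),
    "ridge":       Ridge(alpha=1.0),                     # eq. (5)
    "lasso":       Lasso(alpha=1.0),                     # eq. (6)
    "elastic net": ElasticNet(alpha=1.0, l1_ratio=0.5),  # eq. (8), l1_ratio ~ rho
}

for name, model in models.items():
    model.fit(X, y)
    n_zero = np.sum(np.isclose(model.coef_, 0.0))
    print(f"{name:11s} zeroed coefficients: {n_zero} / {X.shape[1]}")
```

On data like this we would expect lasso to zero out the most coefficients, the elastic net somewhat fewer, and ridge essentially none, matching the discussion above.
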
A group ℓp norm is defined as,

    ||w||_{g,p} = \sum_{g ∈ G} ( \sum_{j ∈ g} |w_j|^p )^{1/p}.    (9)

The group ℓ2 norm is therefore,

    ||w||_{g,2} = \sum_{g ∈ G} \sqrt{ \sum_{j ∈ g} w_j^2 }.    (10)

The group lasso can be presented as,

    \min_w ||y - w^T X||_2^2 + α ||w||_{g,2}.    (11)

Figure 4: An example of the group ℓ2 norm calculation for a given grouping of features.

With the group lasso, the group norm acts like an ℓ1-norm across groups: it induces sparsity at the group level, attempting to eliminate any groups that don't provide as much value as others.

Sparsity Induction

With these regularization methods, we are inducing sparsity on the learned model. A dataset, or matrix, is said to be sparse if it has many zero values. We are essentially forcing the model to pick fewer features to use to generalize about the data. This generalization is what we're after when the focus is to add regularization to prevent overfitting. Another aspect of sparsity induction, however, is the ability to identify key features that are predictive of the target. By telling the model to set some values to zero, we force it to learn the select few features that determine the large share of the actual behavior. This leads to the model ignoring noise in the data and therefore avoiding overfitting. It also leads to the model being interpretable and simple, since we can focus on a select set of features and understand how they are predictive of the target.
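As a minimal numpy sketch of the group ℓ2 penalty in equations (9)-(11), the snippet below computes the group ℓ2 norm of a coefficient vector for a made-up grouping of features (the group names, feature indices, and coefficient values are hypothetical) and then reports which groups survive, i.e. the group-level sparsity that the group lasso penalty encourages.

```python
import numpy as np

def group_l2_norm(w, groups):
    """Group l_2 norm, eq. (10): sum over groups of each group's l_2 norm."""
    return sum(np.sqrt(np.sum(w[idx] ** 2)) for idx in groups.values())

# Hypothetical coefficients learned by a group lasso model:
# the "weather" group has been zeroed out entirely (group-level sparsity).
w = np.array([0.8, -1.2, 0.0, 0.0, 0.0, 2.5])
groups = {
    "demographics": [0, 1],     # feature indices in each (made-up) group
    "weather":      [2, 3, 4],
    "price":        [5],
}

print("group l2 norm:", group_l2_norm(w, groups))
for name, idx in groups.items():
    kept = bool(np.any(~np.isclose(w[idx], 0.0)))
    print(f"group {name!r} kept: {kept}")
```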
