L3 PDF - Northwestern Lecture Notes

Document Details


Uploaded by InfallibleLawrencium3753

Northwestern University

Tags

linear regression, machine learning, gradient descent, vectorization

Summary

These are lecture notes from Northwestern University on linear regression, focusing on vectorization, regularization, and gradient descent techniques. The notes include mathematical formulas and visualizations related to these topics.

Full Transcript


Things to note
- Call for volunteers for the final presentation (2 bonus points).
- Find a dataset where logistic regression shows overfitting, and demonstrate how regularization can improve its performance (1.5 bonus points).

Review of last week
- Cost function (MSE): J(\mathbf{w}, b) = \frac{1}{2m} \sum_{i=1}^{m} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})^2
- Gradient descent for SLR and for MLR: update each w_j (j = 1, ..., n) and b simultaneously.

Review of last lecture: feature scaling
- Feature scaling is essential for machine learning models that use gradient descent (before vs. after feature scaling).
- Quiz 2: implementing GD using Python loops.

From last lecture: np.dot in NumPy
- Single prediction in MLR: using for loops vs. using np.dot from NumPy: f = np.dot(w, x) + b
- Benefits of vectorization: more compact equations and faster code (using optimized matrix libraries).

Agenda for today
- Lecture: vectorized gradient descent for MLR; regularized gradient descent for MLR; MLR in Sklearn.
- After-class assignment: vectorized GD implementation for MLR; regularized GD implementation for MLR; MLR from Sklearn; performance comparison.

Vectorized gradient descent for MLR
Benefits of vectorization: more compact equations and faster code (using optimized matrix libraries).

Previous notation vs. vector notation
- Parameters: w_1, ..., w_n, b → \mathbf{w} = [w_1 \cdots w_n], b
- Model: f_{\mathbf{w},b}(\mathbf{x}) = w_1 x_1 + \cdots + w_n x_n + b → f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b
- Cost function: J(w_1, ..., w_n, b) → J(\mathbf{w}, b)
- Gradient descent: repeat { w_j = w_j - \alpha \frac{\partial}{\partial w_j} J(w_1, ..., w_n, b), b = b - \alpha \frac{\partial}{\partial b} J(w_1, ..., w_n, b) } → repeat { w_j = w_j - \alpha \frac{\partial}{\partial w_j} J(\mathbf{w}, b), b = b - \alpha \frac{\partial}{\partial b} J(\mathbf{w}, b) }, for j = 1, ..., n.

Two ways to handle b (we will use the first one: incorporate b within the vector w)
- Parameters: w_1, ..., w_n, b → \mathbf{w} = [w_0 \cdots w_n]
- Model: f_{\mathbf{w},b}(\mathbf{x}) = w_1 x_1 + \cdots + w_n x_n + b becomes f_{\mathbf{w}}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x}, with \mathbf{x} = [1, x_1, x_2, ..., x_n], i.e. f(\mathbf{x}) = w_1 x_1 + \cdots + w_n x_n + 1 \cdot w_0.
- Cost function: J(w_1, ..., w_n, b) → J(\mathbf{w}) = J(w_0, w_1, ..., w_n)
- Gradient descent: repeat { w_j = w_j - \alpha \frac{\partial}{\partial w_j} J(\mathbf{w}) }, for j = 0, 1, ..., n.

Gradients in MLR
The partial derivatives \frac{\partial}{\partial w_j} J(\mathbf{w}, b) do not depend on each other, so all updates can be computed at once using vectorization.
Example: \mathbf{w} = [w_0, w_1, ..., w_{15}] and \mathbf{d} = [d_0, d_1, ..., d_{15}], e.g. w = np.array([0.5, 1.3, ..., 3.4]) and d = np.array([0.3, 0.2, ..., 0.4]); compute w_j = w_j - 0.1 d_j for j = 0, ..., 15.
- Without vectorization: for j in range(0, 16): w[j] = w[j] - 0.1 * d[j]
- With vectorization: w = w - 0.1 * d

Matrix representation
Consider our model for m instances and n features, with parameter vector \mathbf{w} = [w_0, w_1, ..., w_n]^T and a leading column of ones in X (example: n = 3, m = 4). The vector of predicted values is X\mathbf{w}, the vector of true values is \mathbf{y}, and the error vector is \mathbf{e} = X\mathbf{w} - \mathbf{y}.

Cost function J(\mathbf{w}) in matrix form
J(\mathbf{w}) = \frac{1}{2m} (X\mathbf{w} - \mathbf{y})^T (X\mathbf{w} - \mathbf{y}); in NumPy the sum of squared errors can be computed as np.sum(e**2).
Expanding the cost function and taking the derivative with respect to \mathbf{w} gives the gradient \nabla J(\mathbf{w}) = \frac{1}{m} X^T (X\mathbf{w} - \mathbf{y}) and the gradient descent update \mathbf{w} = \mathbf{w} - \frac{\alpha}{m} X^T (X\mathbf{w} - \mathbf{y}).

Another way: closed-form solution (set the gradient to zero)
Instead of using GD, solve for the optimal \mathbf{w} analytically: take the derivative, set it equal to 0, and solve for \mathbf{w} = (X^T X)^{-1} X^T \mathbf{y}.

Gradient descent vs. closed form: limitations of the closed form
For most nonlinear regression problems there is no closed-form solution. Even in linear regression (one of the few cases where a closed-form solution is available), it may be impractical to use the closed-form solution when the data set is large (memory constraint).
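To make the vectorized update and the closed-form comparison above concrete, here is a minimal NumPy sketch. It is an illustration written for these notes, not the course's official solution: the toy data, the learning rate, the iteration count, and the helper names compute_cost, compute_gradient, and gradient_descent are all assumptions.

import numpy as np

# Toy data (assumed for illustration): m = 4 instances, n = 3 features,
# matching the n = 3, m = 4 example mentioned in the slides.
X_raw = np.array([[1.0, 2.0, 3.0],
                  [2.0, 1.0, 0.5],
                  [3.0, 4.0, 2.0],
                  [4.0, 3.0, 1.0]])
y = np.array([10.0, 7.0, 16.0, 13.0])

# Incorporate b into w: prepend a column of ones so x0 = 1 and w0 plays the role of b.
m = X_raw.shape[0]
X = np.hstack([np.ones((m, 1)), X_raw])        # shape (m, n + 1)

def compute_cost(X, y, w):
    """MSE cost in matrix form: J(w) = (1 / (2m)) * (Xw - y)^T (Xw - y)."""
    e = X @ w - y
    return (e @ e) / (2 * len(y))

def compute_gradient(X, y, w):
    """Vectorized gradient (1 / m) * X^T (Xw - y): all partial derivatives at once."""
    return X.T @ (X @ w - y) / len(y)

def gradient_descent(X, y, alpha=0.05, num_iters=50000):
    """Vectorized GD update w = w - alpha * grad, with no loop over the n + 1 parameters."""
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        w -= alpha * compute_gradient(X, y, w)
    return w

w_gd = gradient_descent(X, y)

# Closed-form (normal equation) solution: w = (X^T X)^(-1) X^T y.
# Practical only when X^T X fits in memory and is well conditioned.
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

print("GD solution:         ", np.round(w_gd, 3))
print("Closed-form solution:", np.round(w_closed, 3))
print("Final GD cost:       ", compute_cost(X, y, w_gd))

With the bias folded into w as w_0 (the convention chosen above), the two approaches should recover essentially the same parameter vector on this small example, up to GD convergence.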
Putting it all together: vectorized gradient descent
- Cost function: J(\mathbf{w}) = \frac{1}{2m} (X\mathbf{w} - \mathbf{y})^T (X\mathbf{w} - \mathbf{y})
- Gradient of the cost function: \nabla J(\mathbf{w}) = \frac{1}{m} X^T (X\mathbf{w} - \mathbf{y})
- Gradient descent update rule: \mathbf{w} = \mathbf{w} - \alpha \nabla J(\mathbf{w})

Quiz3_1: Vectorized gradient descent implementation
You are asked to implement GD using vectorization:
- compute_cost_matrix: function to compute the total cost
- compute_gradient_matrix: function to compute the vector gradient
- gradient_descent_matrix: perform gradient descent
Do the same using linear regression models from Sklearn, and compare the coefficients and performance.

Regularization to reduce overfitting
Quality of fit: an overfitting model may fit the training set very well (J(\mathbf{w}) ≈ 0) but fails to generalize to unseen data. How to reduce overfitting:
- Adding more training data
- Eliminating insignificant features (to reduce the model's complexity)
- Regularization (L1 and L2)

Regularization idea: penalize large values of w_j by incorporating a penalty into the cost function. This works well when we have a lot of features, each of which contributes a bit to predicting the label. Please refer to the "Regularization in Python.ipynb" notebook for intuition.

Cost function with L2 regularization (with b, and the variant without b)
J(\mathbf{w}, b) = \frac{1}{2m} \sum_{i=1}^{m} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2
where \lambda is the regularization term; for intuition, consider what happens when we choose \lambda = 10^{10}. The penalty is applied to the weights w_j, and b is typically left unregularized.

How we get the derivative term (optional): the gradient with the regularization term is
\frac{\partial}{\partial w_j} J(\mathbf{w}, b) = \frac{1}{m} \sum_{i=1}^{m} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}) x_j^{(i)} + \frac{\lambda}{m} w_j

L1 regularization (Lasso), and L2 vs. L1 regularization
L1 stands for Least Absolute Shrinkage and Selection Operator (Lasso); it penalizes \sum_j |w_j| instead of \sum_j w_j^2.

Potential issues with L1 regularization using GD
- Non-differentiability: the L1 regularization term is not differentiable at zero. At zero the gradient is undefined, making it difficult to find a direction in which to update the coefficients, which can cause gradient descent to get stuck in that region.
- Coordinate descent: coordinate descent, a variant of gradient descent, is often used for Lasso regression because it handles the non-differentiability by updating one coefficient at a time, which makes it more suitable for optimizing the L1 penalty.
Despite this challenge, gradient descent can still be an effective optimization method for L1-regularized regression, especially when combined with strategies like coordinate descent or stochastic gradient descent. However, it is essential to be aware of this issue and use appropriate techniques to address it.

Linear regression in Scikit-Learn
Linear models from Sklearn:
- from sklearn.linear_model import LinearRegression: closed-form solution
- from sklearn.linear_model import SGDRegressor: stochastic gradient descent
- from sklearn.linear_model import Lasso: L1 regularization
- from sklearn.linear_model import Ridge: L2 regularization
- from sklearn.linear_model import ElasticNet: L1 + L2 regularization
alpha: regularization term (controls the strength of regularization). An illustrative comparison of a vectorized L2-regularized GD with Ridge from Sklearn is sketched at the end of these notes.

Quiz3_2: Implementing regularized MLR using vectorization
- Implement L1 regularization
- Implement L2 regularization
- Do the same using Lasso and Ridge from Sklearn
- Compare the coefficients and performance

We are done with Gradient Descent!
Reference: https://www.deeplearning.ai/
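As a companion to the regularization and Scikit-Learn material above, the sketch below compares a vectorized L2-regularized (ridge) gradient descent with sklearn.linear_model.Ridge. It is illustrative only and makes several assumptions: the random toy data, the helper names ridge_gradient and ridge_gd, and the choices of lam, alpha (learning rate), and iteration count are not from the lecture. Ridge's alpha parameter plays the role of the regularization strength λ.

import numpy as np
from sklearn.linear_model import Ridge

# Toy data (assumed): m = 6 instances, n = 2 features with a known linear signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + 0.1 * rng.normal(size=6)

def ridge_gradient(X, y, w, b, lam):
    """Gradients of J(w, b) = (1/(2m)) * sum(err**2) + (lam/(2m)) * sum(w_j**2).
    The bias b is left unregularized, as is conventional."""
    m = len(y)
    err = X @ w + b - y
    grad_w = X.T @ err / m + (lam / m) * w   # the L2 penalty adds (lam/m) * w_j
    grad_b = err.mean()
    return grad_w, grad_b

def ridge_gd(X, y, lam=1.0, alpha=0.1, num_iters=10000):
    """Vectorized gradient descent for L2-regularized linear regression."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(num_iters):
        grad_w, grad_b = ridge_gradient(X, y, w, b, lam)
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b

w_gd, b_gd = ridge_gd(X, y, lam=1.0)

# sklearn's Ridge minimizes ||X w - y||^2 + alpha * ||w||^2 (no 1/(2m) factor),
# which has the same minimizer as the scaled cost above when alpha equals lam.
ridge = Ridge(alpha=1.0).fit(X, y)

print("GD coefficients:     ", np.round(w_gd, 3), " bias:", round(float(b_gd), 3))
print("sklearn coefficients:", np.round(ridge.coef_, 3), " bias:", round(float(ridge.intercept_), 3))

Because the 1/(2m) scaling does not change the minimizer, running GD with lam equal to Ridge's alpha should produce coefficients that agree closely, up to GD convergence. The L1 (Lasso) case is harder for plain GD because of the non-differentiability at zero discussed above.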
