Linear Regression Lecture Notes PDF
Document Details
Uploaded by InfallibleLawrencium3753
Northwestern University
Tags
Summary
These lecture notes cover linear regression, including gradient descent, vectorization techniques using NumPy, and regularization. The slides detail cost functions, gradient descent implementations, and adjustments for learning rates.
Full Transcript
Last Time: Implementing Gradient Descent with SLR using for loops Lecture: Cost Function Visualizing the Cost Function In-class Assignment: Gradient Descent Gradient Descent Implementation in SLR Learning Rate in GD (Fill in Code Blanks) Gradient Descent...
Last Time: Implementing Gradient Descent with SLR using for loops Lecture: Cost Function Visualizing the Cost Function In-class Assignment: Gradient Descent Gradient Descent Implementation in SLR Learning Rate in GD (Fill in Code Blanks) Gradient Descent for SLR Implementing Gradient Descent for SLR Running Gradient Descent Review of last lecture: Gradient Descent Algorithm is used to minimize the cost function: J ( w, b) w b Review of last lecture: The goal is to teach you gradient Cost function: descent to set you up for ML algorithms. The mathematical derivation of the derivative is beyond the scope of this course. Please fill in this gap outside the classroom. Gradient Descent for SLR: Agenda for Today Lecture: MLR In-class Assignment: Vectorization in NumPy Gradient Descent Implementation in MLR (Fill in Code Blanks, 5 tasks in total) Gradient Descent for MLR Feature Scaling in Gradient Descent Checking Gradient Descent for Convergence Choosing the Learning Rate Regularization to Prevent Overfitting Linear Regression in Sklearn Linear Regression with Multiple Variables MLR Last time: Single feature (variable) Size in feet2 (𝑥𝑥) Price ($) in 1000’s (𝑦𝑦) 2104 400 1416 232 1534 315 852 178 … … 𝑓𝑓𝑤𝑤,𝑏𝑏 𝑥𝑥 = 𝑤𝑤𝑥𝑥 + 𝑏𝑏 General case: Multiple features (variables) Size in Number of Number of Age of home Price ($) in feet2 bedrooms floors in years $1000’s 2104 5 1 45 460 1416 3 2 40 232 1534 3 2 30 315 852 2 1 36 178 … … … … … x𝑗𝑗 = 𝑗𝑗 𝑡𝑡ℎ feature 𝑛𝑛 = number of features x 𝑖𝑖 = features of 𝑖𝑖 𝑡𝑡ℎ training example x𝑗𝑗 𝑖𝑖 = value of feature 𝑗𝑗 in 𝑖𝑖 𝑡𝑡ℎ training example Model: SLR: 𝑓𝑓𝑤𝑤,𝑏𝑏 𝑥𝑥 = 𝑤𝑤𝑥𝑥 + 𝑏𝑏 MLR: 𝑓𝑓𝑤𝑤,𝑏𝑏 x = 𝑤𝑤1𝑥𝑥1 + 𝑤𝑤2𝑥𝑥2 + ⋯ + 𝑤𝑤 𝑛𝑛 𝑥𝑥𝑛𝑛 + 𝑏𝑏 multiple linear regression multiple linear regression 𝑓𝑓𝑤𝑤,𝑏𝑏 (𝑥𝑥) = 𝑤𝑤1𝑥𝑥1 + 𝑤𝑤2𝑥𝑥2 + ⋯ + 𝑤𝑤 𝑛𝑛 𝑥𝑥𝑛𝑛 + 𝑏𝑏 𝑏𝑏 is a number Parameters and features w = 𝑤𝑤1 𝑤𝑤2 𝑤𝑤3 𝑏𝑏 is a number x = 𝑥𝑥1 𝑥𝑥2 𝑥𝑥3 linear algebra: count from 1 Using loops w = np.array([1.0,2.5,-3.3]) f = 0 b = 4 for j in range(0,n): x = np.array([10,20,30]) f = f + w[j] * x[j] code: count from 0 f = f + b Using Loops 𝑓𝑓w,𝑏𝑏 x = 𝑤𝑤1𝑥𝑥1 + 𝑤𝑤2𝑥𝑥2 + 𝑤𝑤3 𝑥𝑥3 + 𝑏𝑏 f = w * x + w * x + w * x + b Vectors w w … w … * * * x x … x Using NumPy 𝑓𝑓w,𝑏𝑏 x = w ∙ x + 𝑏𝑏 Speed f = np.dot(w,x) + b Code simplicity Without vectorization Vectorization for j in range(0,16): np.dot(w,x) f = f + w[j] * x[j] 𝑡𝑡0 𝑡𝑡0 w w … w f + w * x * * … * 𝑡𝑡1 f + w * x x x … x 𝑡𝑡1 … w*x + w*x +…+ w*x 𝑡𝑡15 f + w * x Please review the “Python_Numpy_Vectorization.ipynb” notebook to get familiar with NumPy vectorization Gradient Descent for Multiple Regression Linear Regression with Multiple Variables Previous notation Vector notation Parameters 𝑤𝑤1 , ⋯ , 𝑤𝑤𝑛𝑛 w = [ 𝑤𝑤1 ⋯ 𝑤𝑤𝑛𝑛 𝑏𝑏 𝑏𝑏 Model 𝑓𝑓w,𝑏𝑏 x = 𝑤𝑤1 𝑥𝑥1 + ⋯ + 𝑤𝑤𝑛𝑛𝑥𝑥𝑛𝑛 + 𝑏𝑏 𝑓𝑓w,𝑏𝑏 x = w ∙ x + 𝑏𝑏 Cost function 𝐽𝐽 ( 𝑤𝑤1, ⋯ , 𝑤𝑤𝑛𝑛, 𝑏𝑏 ) 𝐽𝐽 w, 𝑏𝑏 Gradient descent repeat { repeat { 𝜕𝜕 𝐽𝐽( 𝑤𝑤 , ⋯ , 𝑤𝑤 , 𝑏𝑏) 𝑤𝑤𝑗𝑗 = 𝑤𝑤𝑗𝑗 − 𝛼𝛼 𝜕𝜕𝑤𝑤 𝜕𝜕 𝐽𝐽 w, 𝑏𝑏 𝑗𝑗 1 𝑛𝑛 𝑤𝑤𝑗𝑗 = 𝑤𝑤𝑗𝑗 − 𝛼𝛼 𝜕𝜕𝑤𝑤 𝑗𝑗 𝜕𝜕 𝐽𝐽 𝑤𝑤 , ⋯ , 𝑤𝑤 , 𝑏𝑏 𝑏𝑏 = 𝑏𝑏 − 𝛼𝛼𝜕𝜕𝑏𝑏 𝜕𝜕 𝐽𝐽 w, 𝑏𝑏 1 𝑛𝑛 𝑏𝑏 = 𝑏𝑏 − 𝛼𝛼 𝜕𝜕𝑏𝑏 } } For j =1…n Gradient descent in MLR One feature 𝑛𝑛 features 𝑛𝑛 ≥ 2 repeat { repeat { ⋮ 𝜕𝜕 𝐽𝐽 w, 𝑏𝑏 𝜕𝜕 𝜕𝜕𝑤𝑤1 𝜕𝜕𝑤𝑤 𝐽𝐽 𝑤𝑤, 𝑏𝑏 simultaneously update 𝑤𝑤, 𝑏𝑏 simultaneously update 𝑤𝑤𝑗𝑗 (for 𝑗𝑗 = 1, ⋯ , 𝑛𝑛) and 𝑏𝑏 } } Gradient Descent in MLR 𝜕𝜕 𝐽𝐽 ( w, 𝑏𝑏) not depend on each other 𝜕𝜕𝑤𝑤j Gradient descent w = 𝑤𝑤1 𝑤𝑤2 ⋯ 𝑤𝑤16 in MLR d = 𝑑𝑑1 𝑑𝑑2 ⋯ 𝑑𝑑16 w = np.array([0.5, 1.3, … 3.4]) d = np.array([0.3, 0.2, … 0.4]) compute 𝑤𝑤𝑗𝑗 = 𝑤𝑤𝑗𝑗 − 0.1𝑑𝑑𝑗𝑗 for 𝑗𝑗 = 1 … 16 Without vectorization With vectorization for j in range(0,16): w = w − 0.1d w[j] = w[j] - 0.1 * d[j] w = w – 0.1 * d 𝑤𝑤1 = 𝑤𝑤1 − 0.1𝑑𝑑1 w w w 𝑤𝑤2 = 𝑤𝑤2 − 0.1𝑑𝑑2 - - … - ⋮ 0.1 * d d … d 𝑤𝑤16 = 𝑤𝑤16 − 0.1𝑑𝑑16 w w … w Feature Scaling Crucial to the performance of the GD algorithm Feature and parameter values 𝑥𝑥1: size (feet2) 𝑥𝑥2: # bedrooms 𝑝𝑝𝑟𝑟𝑖𝑖𝑐𝑐𝑒𝑒 = 𝑤𝑤1 𝑥𝑥1 + 𝑤𝑤2𝑥𝑥2 + 𝑏𝑏 range: 300 − 2,000 range: 0 − 5 House: 𝑥𝑥1 = 2000, 𝑥𝑥2 = 5, 𝑝𝑝𝑟𝑟𝑖𝑖𝑐𝑐𝑒𝑒 = $500k size of the parameters 𝑤𝑤1 , 𝑤𝑤2? Is the 1 feet square as significant as a 1 bedroom difference? Feature size and parameter size size of feature 𝑥𝑥𝑗𝑗 size of parameter 𝑤𝑤𝑗𝑗 size in feet2 #bedrooms Features Parameters 𝐽𝐽 w, 𝑏𝑏 𝑥𝑥2 𝑤𝑤2 # bedrooms # bedrooms 𝑥𝑥1 size in feet2 𝑤𝑤1 size in feet2 Feature size and gradient descent Features Parameters 𝑥𝑥2 𝑤𝑤2 𝐽𝐽 w, 𝑏𝑏 # bedrooms # bedrooms 𝑤𝑤1 size in feet2 Without feature scaling, the gradient descent algorithm would oscillate a lot back and forth, taking a long time before finding its way to the minimum point due to elongated contour plots, which cause the algorithm to zigzag across the gradient rather than moving directly toward the minimum! Feature scaling Feature scaling is crucial for gradient descent optimization algorithms when features have different orders of magnitude Two commonly used ways for feature scaling in sklearn Standard Scaling: MinMax Scaling: 𝑋𝑋𝑗𝑗 − 𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚(𝑋𝑋𝑗𝑗 ) 𝑋𝑋𝑗𝑗 − 𝑚𝑚𝑚𝑚𝑚𝑚(𝑋𝑋𝑗𝑗 ) 𝑋𝑋𝑗𝑗_𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 = 𝑋𝑋𝑗𝑗_𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 = 𝑠𝑠𝑠𝑠. 𝑑𝑑𝑑𝑑𝑑𝑑. (𝑋𝑋𝑗𝑗 ) 𝑚𝑚𝑚𝑚𝑚𝑚(𝑋𝑋𝑗𝑗 ) − min(𝑋𝑋𝑗𝑗 ) maps to a Standard Normal maps to [0,1]. distribution. Standard Scaling: Z-score normalization 𝑋𝑋𝑗𝑗 − 𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚(𝑋𝑋𝑗𝑗 ) Standard Scaling: 𝑋𝑋𝑗𝑗_𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 = 𝑠𝑠𝑠𝑠. 𝑑𝑑𝑑𝑑𝑑𝑑. (𝑋𝑋𝑗𝑗 ) 𝑤𝑤2 Standard Normal Distribution # bedrooms 𝐽𝐽 w, 𝑏𝑏 rescaled 𝑤𝑤1 size in feet2 rescaled Standard Scaling: Z-score normalization The Need for Scaling the Features Note: Scaling is a transformation. In order to have meaningful prediction results, training and test data should undergo the exact same transformation. For Standard Scaling: Training Data: Test Data: 𝑋𝑋𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑛𝑛𝑗𝑗 − 𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚(𝑋𝑋𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑛𝑛𝑗𝑗 ) 𝑋𝑋𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑗𝑗 − 𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚(𝑋𝑋𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑛𝑛𝑗𝑗 ) 𝑋𝑋𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑛𝑛𝑗𝑗_𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 = 𝑋𝑋𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑗𝑗_𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 = 𝑠𝑠𝑠𝑠. 𝑑𝑑𝑑𝑑𝑑𝑑. (𝑋𝑋𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑛𝑛𝑗𝑗 ) 𝑠𝑠𝑠𝑠. 𝑑𝑑𝑑𝑑𝑑𝑑. (𝑋𝑋𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑛𝑛𝑗𝑗 ) Transform each training instance using the mean and In sklearn, st.dev. of each training column before training..fit to train the scaler (with the training data) Transform each test instance using the mean and st.dev..transform to scale the datasets (training and test) of each training column before prediction. Do not refit the scaler with the test data Same idea with MinMax scaling (using min and max This is just following the general principle: any thing you instead of mean and st.dev.) learn, must be learned from the model's training data. Checking Gradient Descent for Convergence and Choosing the Learning Rate Practical Tips for Linear Regression Gradient descent 1.For Gradient Descent: 1. Learning rates typically fall in the range of 0.1 to 0.0001. 2.For Adaptive Methods (e.g., Adam, RMSprop): J w, b # iterations Values of 𝛼𝛼 to try: … 0.001 0.01 0.1 1… J w, b J w, b # iterations # iterations Identify problem with gradient descent learning rate is too or 𝛼𝛼 is too large large 𝐽𝐽 𝑤𝑤, 𝑏𝑏 𝐽𝐽 𝑤𝑤, 𝑏𝑏 # iterations # iterations Adjust learning rate 𝛼𝛼 is too big Use smaller 𝛼𝛼 With a small enough 𝛼𝛼, 𝐽𝐽 w, 𝑏𝑏 should decrease on every iteration 𝐽𝐽 𝑤𝑤, 𝑏𝑏 𝐽𝐽 𝑤𝑤, 𝑏𝑏 If 𝛼𝛼 is too small, gradient descent takes a lot more iterations to converge parameter 𝑤𝑤1 parameter 𝑤𝑤1 Make sure gradient descent is working correctly objective: min 𝐽𝐽 w, 𝑏𝑏 𝐽𝐽 w, 𝑏𝑏 should decrease w,𝑏𝑏 after every iteration Automatic convergence test Let 𝜀𝜀 “epsilon” be 10−3. If 𝐽𝐽 w, 𝑏𝑏 decreases by ≤ 𝜀𝜀 in one iteration, declare convergence. (found parameters w, 𝑏𝑏 to get close to global minimum) Tolerance (tol) in Logistic Regression (hw1) tol stands for tolerance and controls the stopping criterion for the optimization algorithm (like gradient descent or other solvers). The optimization process (such as gradient descent) iterates to minimize the cost function. However, after a certain number of iterations, the updates to the weights (or loss) become very small. tol (epsilon-like) is the threshold below which the updates are considered insignificant, and the algorithm is allowed to stop. In other words, when the change in the cost Non-convex function function or the coefficients between iterations is smaller than this value, the solver concludes that convergence has been reached and terminates. Linear Regression in Scikit-Learn Gradient Descent in Scikit-Learn Linear Regression in Sklearn from sklearn.linear_model import LinearRegression from sklearn.linear_model import SGDRegressor from sklearn.linear_model import Ridge, Lasso, and ElasticNet (regularized version) From 303-2: Estimating the Regression Coefficients (Mathematically) 36 Linear Regression in Sklearn, Statsmodels Used the close-form solution to Imagine that X is a very large matrix. e.g. calculate the parameters. X might have 100,000 columns and 1,000,000 rows Storing a dense 100,000 by 100,000 element 𝑋𝑋 𝑇𝑇 𝑋𝑋 matrix would then require 1×1010 Feature scaling is not needed in the closed-form solution floating point numbers (at 8 bytes per number, this comes to 80 gigabytes.) This Disadvantages would be impractical to store on anything For most nonlinear regression but a supercomputer! problems there is no closed form solution. In this situation, SGD is the Even in linear regression (one of recommended method for finding the few cases where a closed form parameters w,b solution is available), it may be impractical to use the closed-form solution when the data set is large. LinearRegression vs. SGDRegressor LinearRegression is a straightforward and deterministic algorithm suitable for smaller datasets that can fit into memory, while SGDRegressor is more flexible and scalable, making it suitable for large-scale datasets and online learning scenarios but requiring more parameter tuning and potentially exhibiting more variability in results. What is Stochastic Gradient Descent Regressor? "Batch" gradient descent "Batch": Each step of gradient descent uses all the training examples. x y size in feet2 price in $1000'5 (1) 2104 400 (2) 1416 232 (3) 1534 315 (4) 852 178... (47) 3210 870 Andrew Ng "Stochastic" gradient descent "Stochastic": Each Step of gradient descent uses single training example. x y size in feet2 price in $1000'5 (1) 2104 400 (2) 1416 232 (3) 1534 315 (4) 852 178... (47) 3210 870 Andrew Ng "mini-batch" gradient descent "mini-batch": Each Step of gradient descent uses one batch of training example. x y Batch size size in feet2 price in $1000'5 (1) 2104 400 (2) 1416 232 (3) 1534 315 Batch Size (4) 852 178 Epoch... Iteration (47) 3210 870 Mini-batch gradient descent will be covered in Deep Learning Andrew Ng Advantages of SGD It is easier to fit into memory due to a single training sample being processed by the whole iteration It is computationally fast as only one sample is processed at a time For larger datasets it can converge faster as it causes updates to the parameters more frequently Due to frequent updates the steps taken towards the minima of the loss function have oscillations which can help getting out of local minimums of the loss function (in case the computed position turns out to be the local minimum) Disadvantages of SGD Due to frequent updates the steps taken towards the minima are very noisy. This can often lead the gradient descent into other directions Due to noisy steps it may take longer to achieve convergence to the minima of the loss function Frequent updates are computationally expensive due to using all resources for processing one training sample at a time It loses the advantages of vectorized operations as it deals with only a single example at a time Regularization to Reduce Overfitting Cost Function with Regularization Cost Function with Regularization Where: 𝜆𝜆: Regularization term choose 𝜆𝜆 = 1010 Regularized linear regression Gradient descent repeat { 𝜕𝜕 𝑖𝑖 𝑤𝑤𝑗𝑗 = 𝑤𝑤𝑗𝑗 − 𝛼𝛼 𝐽𝐽 w, 𝑏𝑏 𝜒𝜒𝑗𝑗 𝜕𝜕𝑤𝑤𝑗𝑗 𝜕𝜕 𝑏𝑏 = 𝑏𝑏 − 𝛼𝛼 𝐽𝐽 w, 𝑏𝑏 𝜕𝜕𝑏𝑏 } simultaneous update How we get the derivative term (optional) 𝜕𝜕 𝐽𝐽 w, 𝑏𝑏 = 𝜕𝜕𝑤𝑤𝑗𝑗 Implementing gradient descent repeat { } simultaneous update Reference: https://www.deeplearning.ai/ multiple linear regression 𝑓𝑓𝑤𝑤,𝑏𝑏 (𝑥𝑥) = 𝑤𝑤1𝑥𝑥1 + 𝑤𝑤2𝑥𝑥2 + ⋯ + 𝑤𝑤 𝑛𝑛 𝑥𝑥𝑛𝑛 + 𝑏𝑏 𝑏𝑏 is a number 𝑓𝑓w,𝑏𝑏 x = w ∙ x + 𝑏𝑏 =