Supervised Learning PDF

Document Details


Centre for Health Data Science, UOB

Dr Laura Bravo

Tags

supervised learning, machine learning, linear regression, data science

Summary

This document provides an overview of supervised learning, focusing on linear regression, optimization, and gradient descent. It describes the relationship between input variables (X) and output variables (Y) and introduces the concepts of supervised and unsupervised learning.

Full Transcript


Supervised Learning: Linear Regression, Optimization and Gradient Descent
Dr Laura Bravo, Assistant Professor of Health Data Science, Centre for Health Data Science, UOB

STATISTICAL LEARNING: TOOLS FOR UNDERSTANDING DATA

Supervised learning: building a statistical model for predicting, or estimating, an output (Y) based on inputs (X).
Unsupervised learning: learning relationships and structure from input (X) data, but without a supervising output (Y).

The relationship between X and Y is defined by a function f:

$Y = f(X) + \epsilon$

Supervised learning is like a student learning new material by studying old exams that contain both questions and answers. Once the student has trained on enough old exams, the student is well prepared to take a new exam.

If the output (Y) is quantitative: regression. If the output (Y) is qualitative: classification.

Classification models predict the likelihood that something belongs to a category. Unlike regression models, whose output is a number, classification models output a value that states whether or not something belongs to a particular category. For example, classification models are used to predict whether an email is spam or whether a photo contains a cat.

SUPERVISED LEARNING: REGRESSION

Predict glucose levels from a dataset containing blood markers and clinical parameters from patients.
- X / features / input values: the dataset columns $X_1, \dots, X_{14}$ (for example age and weight).
- Y / label / outcome / output values: glucose (mmol/L).

SUPERVISED LEARNING: CLASSIFICATION

Same dataset, but now the outcome is high or low glucose level (a binary variable).

QUESTION TIME

If you want to use an ML model to predict energy usage for commercial buildings, what type of model would you use?
- Classification
- Regression

SUPERVISED LEARNING EXAMPLE

We want to understand better the relationship between age and fasting glucose (here coded as glucose). Can we predict fasting glucose from age?

[Scatter plot: Glucose Level (mmol/L) against Age (years); 80 observations from the Pima Indians Diabetes Dataset]

Studies* do point at a relationship between older age and higher fasting glucose levels. Can we define this relationship better?

*Croat Med J. 2006 Oct;47(5):709–713.
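To make the running example concrete, here is a minimal R sketch of the scatter plot above. The data frame name `pima` and its column names `age` and `glucose` are illustrative assumptions, not objects defined in the slides:

```r
# Illustrative sketch: reproduce the slide's scatter plot.
# Assumes a data frame `pima` with numeric columns `age` and `glucose`,
# e.g. 80 rows drawn from the Pima Indians Diabetes Dataset.
plot(pima$age, pima$glucose,
     xlab = "Age (years)", ylab = "Glucose Level (mmol/L)",
     main = "Fasting glucose vs age")
```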
SUPERVISED LEARNING EXAMPLE

Let's create a predictive model that defines this relationship: $Y = f(X) + \epsilon$.

What is a predictive model? A complex collection of numbers that define the mathematical relationship from specific input feature patterns to specific output label values. The model discovers these patterns through training.

Dataset + Learning Algorithm ➪ Predictive Model

Let's look at the elements that make up a predictive model individually. You have already learnt about learning algorithms in the stats module, for example linear regression.

Linear regression is a learning algorithm with the following mathematical formula:

$Y = \beta_0 + \beta_1 X_1$, here: $\text{glucose} = \beta_0 + \beta_1 \cdot \text{age}$

Now, to convert this into a predictive model, we have to apply it to (fit it to) our dataset:

$\widehat{\text{glucose}} = 73.7 + 1.33 \cdot \text{age}$

Now we have a linear regression model fitted to these 80 data points, or training points/samples: "a complex collection of numbers that define the mathematical relationship from specific input feature patterns to specific output label values."

Input: Age = 40. Output: Predicted glucose = ?
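As a hedged sketch, the same fit can be obtained with R's lm() function, which the deck mentions later; the `pima` data frame is the same assumption as above, and the coefficients match 73.7 and 1.33 only for the slides' particular 80-point sample:

```r
# Fit glucose = beta0 + beta1 * age by least squares and answer
# the question above (assumed `pima` data frame as in the earlier sketch).
fit <- lm(glucose ~ age, data = pima)
coef(fit)                                     # estimated beta0 and beta1
predict(fit, newdata = data.frame(age = 40))  # predicted glucose at age 40
```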
Age = 40, so predicted glucose = 73.7 + 1.33 × 40 = 126.9. The fitted model maps inputs to outputs.

SUPERVISED LEARNING EXAMPLE

Linear regression is not the only learning algorithm. Another option: linear regression with feature transformation.

THIS IS NOT NEW (MODULE 1)

Linear regression with a quadratic association is a learning algorithm with the following mathematical formula:

$Y = \beta_0 + \beta_1 X_1^2$, here: $\text{glucose} = \beta_0 + \beta_1 \cdot \text{age}^2$

After applying it to our training points we obtain the predictive model:

$\widehat{\text{glucose}} = 98.403 + 0.016 \cdot \text{age}^2$

Another option: decision trees. A decision tree would amount to the following predictive model:

$\hat{Y} = \begin{cases} 79 & \text{if age} < 23 \\ 126 & \text{if } 23 \le \text{age} < 27 \\ 115 & \text{if } 27 \le \text{age} < 39 \\ 133 & \text{if } 39 \le \text{age} < 51 \\ 148 & \text{if age} \ge 51 \end{cases}$

As before we can map input to output: Age = 40 gives predicted glucose = 133.

Key understanding: predictive models are based on the dataset/training set you fit the learning algorithm to. Different training points, different predictive models.

PRACTICAL (1)

Build your own predictive models and investigate differences when using different datasets.

TAKE-AWAY

It is important to note that the relationship learnt via regression, and machine learning in general, is (a) approximate and (b) only holds for the population that the data was drawn from. Therefore, we cannot use the relationship to predict the output for data that is not from the same distribution. Additionally, if the distribution of the data is shifted, the relationship no longer holds (as seen in the introduction day exercises!).

WHY THIS SET OF MODEL PARAMETERS AND NOT OTHERS?

For the fitted model $\widehat{\text{glucose}} = 73.7 + 1.33 \cdot \text{age}$: why these values of $\beta_0$ and $\beta_1$? Several candidate lines are compared on the next slide; as a preview, the sketch below plots them against the data.
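A hedged R sketch that overlays the four candidate lines from the next slide on the data (same assumed `pima` data frame; only one of the four is the least squares fit):

```r
# Plot the training points and the slide's four candidate lines.
plot(pima$age, pima$glucose,
     xlab = "Age (years)", ylab = "Glucose Level (mmol/L)")
abline(a = 120,  b = 0.5,  lty = 2)  # glucose-hat = 120 + 0.5 * age
abline(a = 73.7, b = 1.33, lwd = 2)  # glucose-hat = 73.7 + 1.33 * age
abline(a = 80,   b = 0.9,  lty = 3)  # glucose-hat = 80 + 0.9 * age
abline(a = 100,  b = -0.2, lty = 4)  # glucose-hat = 100 - 0.2 * age
```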
Keep in mind that, from now on, everything we are seeing explains how the learning algorithm works. In practice you do not have to do any of this yourself: it is already coded into the function we are using (for example lm() in R).

WHY THIS SET OF PARAMETERS AND NOT OTHERS?

Candidate models:
$\widehat{\text{glucose}} = 120 + 0.5 \cdot \text{age}$
$\widehat{\text{glucose}} = 73.7 + 1.33 \cdot \text{age}$
$\widehat{\text{glucose}} = 80 + 0.9 \cdot \text{age}$
$\widehat{\text{glucose}} = 100 - 0.2 \cdot \text{age}$

The chosen model, $\widehat{\text{glucose}} = 73.7 + 1.33 \cdot \text{age}$, is the line of best fit for this set of data points. Fitting the learning algorithm to a dataset means choosing the right parameters $\beta_0$ and $\beta_1$ (if just two dimensions).

What does line of BEST FIT mean? Two different ways of saying the same thing:
- The line that offers the best prediction of y: the line that passes as close as possible to most of the points.
- We want the difference between the real data point $y$ and the predicted $\hat{y}$ to be as small as possible.

[Figure: a data point $(x_i, y_i)$ and the line's prediction $\hat{y}_i$ at the same age]

For each point $i$, this difference is the residual: $e_i = y_i - \hat{y}_i$. The observed value $y_i$ is fixed, but the prediction $\hat{y}_i$ depends on the model, so we can make the residual error better or worse with different models.

How do we measure it? Residuals have signs (for example +20 for one point and -30 for another), so does the error cancel out? We care about distance, not direction, so we have to handle the signs by either squaring or taking the absolute value:

$-30 + 20 = -10$
$(-30)^2 + 20^2 = 900 + 400 = 1300$
$|-30| + |+20| = 30 + 20 = 50$

The different ways in which we add up the error are called loss functions.
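The slide's toy numbers in three lines of R show why the sign must be removed before adding residuals:

```r
e <- c(-30, 20)  # the two residuals from the slide
sum(e)           # -10:  raw errors partly cancel out
sum(e^2)         # 1300: squaring removes the sign (basis of the L2 loss)
sum(abs(e))      # 50:   absolute value removes the sign (basis of the L1 loss)
```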
HOW DO WE MEASURE IT?

A loss (or cost) function is a numerical metric that describes how wrong a model's predictions are, by measuring the distance between the model's predictions and the actual labels. For $m$ points:

- Squared residuals: $e_i^2 = (y_i - \hat{y}_i)^2$. Added up for each point: the sum of squared residuals (L2 loss), $\sum_{i=1}^{m} (y_i - \hat{y}_i)^2$. Averaged: the mean squared error, $\text{MSE} = \frac{1}{m}\sum_{i=1}^{m} (y_i - \hat{y}_i)^2$; its square root is the root mean squared error (RMSE).
- Absolute values: $|e_i| = |y_i - \hat{y}_i|$. Added up for each point: the L1 loss, $\sum_{i=1}^{m} |y_i - \hat{y}_i|$. Averaged: the mean absolute error, $\text{MAE} = \frac{1}{m}\sum_{i=1}^{m} |y_i - \hat{y}_i|$.

When the difference between the prediction and the label is large, squaring makes the loss even larger. When the difference is small (less than 1), squaring makes the loss even smaller.

PRACTICAL (2)

Calculate different loss/cost functions. Given two datasets: which has the higher mean squared error (MSE)? Which has the higher MAE? Code these two datasets in R.

Need more help understanding? https://mlu-explain.github.io/linear-regression/
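For the practical, a minimal sketch of these loss functions in R, written directly from the formulas above (the helper names are my own):

```r
# Loss helpers for a vector of labels y and predictions y_hat.
mse  <- function(y, y_hat) mean((y - y_hat)^2)   # mean squared error
mae  <- function(y, y_hat) mean(abs(y - y_hat))  # mean absolute error
rmse <- function(y, y_hat) sqrt(mse(y, y_hat))   # root mean squared error

# Example use with the earlier (assumed) fitted model:
# mse(pima$glucose, predict(fit))
```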
WHY THIS SET OF PARAMETERS AND NOT OTHERS?

Okay, so the chosen linear regression model is the one that best fits these points (i.e. has the lowest loss function, in this case MSE). But do we have to compute this for all possible parameters, one by one? There has to be another way.

PRACTICAL (3)

Line of best fit.

The line of best fit is the one with the parameters that minimize the chosen loss function. But how do we find the minimum of the loss function? One big clue is that, as seen in the exercise, MSE is the default loss function because of one big advantage: "MSE provides a smooth, differentiable error surface, which allows for efficient optimization techniques."

OPTIMIZATION

In optimization we aim to find the critical points of functions, i.e. the points where $f'(x) = 0$.

[Figure: three panels showing a maximum, a minimum, and a saddle/inflection point, each with $f'(x) = 0$ at the critical point]

And we do so by finding the point where the slope equals 0, or in other words, where the gradient equals 0.

Join the Vevox session: go to vevox.app and enter the session ID 155-384-583, or scan the QR code.

Question and results slides: how comfortable are you with these terms ("understand what it means" vs "first time I am hearing it")? Calculus, derivative, partial derivative, gradient, ordinary least squares. I am leaving some slides explaining the reasoning behind looking at the slope to find the minimum, and more insight into derivatives; feel free to check them out, and I am happy to go through them individually!

Let's go back to linear regression. From what we have learnt, to find the best fit line (the beta parameters), we need to optimize the loss function: see with which parameters the slope/derivative of the loss function equals 0. This can be done in two ways:

1) Analytical/closed form solution: ordinary least squares (OLS), where we explicitly calculate the loss function's derivative and find at which parameters it equals 0. This approach is elegant but complex for real-world applications (e.g. in high-dimensional models, like neural networks, analytical solutions are often impractical or impossible).
2) Numerical methods: gradient descent.

1. CLOSED FORM, ANALYTICAL SOLUTION: ORDINARY LEAST SQUARES

OLS: this is not new (closed form).

PRACTICAL TIME (4)

Ordinary least squares derivation; multiple dimensions. References:
1. https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf
2. https://statproofbook.github.io/P/sr-ola.html
3. https://introml.mit.edu/_static/fall22/LectureNotes/6_390_Lecture_notes_fall2022.pdf
4. https://dafriedman97.github.io/mlbook/content/c1/s1/loss_minimization.html
5. https://setosa.io/ev/ordinary-least-squares-regression/

2. GRADIENT DESCENT

Start at a random point, then use the function's derivative to determine the slope at that point, and move a little bit in the downward direction (opposite to the gradient, as the gradient always points towards the steepest increase). Repeat the process until convergence (you reach a local minimum). More precisely:

$x_t = x_{t-1} - \alpha \nabla f(x_{t-1})$

At each iteration, the step size is proportional to the slope, so the process naturally slows down as it approaches a local minimum. Each step is also proportional to the learning rate ($\alpha$): a parameter of the gradient descent algorithm itself (since it is not a parameter of the function we are optimizing, it is called a hyperparameter). Possible stopping criterion: iterate until $\|\nabla f(x_t)\| \le \epsilon$ for some $\epsilon > 0$.

https://thenumb.at/Autodiff/

Worked example with $f(x) = x^2$, so $f'(x) = 2x$, step size $\alpha = 0.8$ and starting point $x^{(0)} = -4$:

$x^{(1)} = -4 - 0.8 \cdot 2 \cdot (-4) = 2.4$
$x^{(2)} = 2.4 - 0.8 \cdot 2 \cdot 2.4 = -1.44$
$x^{(3)} = 0.864$
$x^{(4)} = -0.5184$
$x^{(5)} = 0.31104$
...
$x^{(30)} \approx -8.84 \times 10^{-7}$

Step size matters! With step size 0.2 the iterates move steadily towards the minimum; with step size 0.9 they overshoot and oscillate around it. The same idea extends to many dimensions.

PRACTICAL TIME (5)

Gradient descent.
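A minimal R sketch of the procedure just described, applied to the slides' worked example $f(x) = x^2$ with $f'(x) = 2x$ (the function and argument names are my own; the update rule and stopping criterion are the ones stated above):

```r
# Gradient descent: x_t = x_{t-1} - alpha * grad(x_{t-1}),
# stopping once |gradient| <= tol or after max_iter steps.
grad_descent <- function(x0, alpha, grad, tol = 1e-6, max_iter = 1000) {
  x <- x0
  for (i in seq_len(max_iter)) {
    g <- grad(x)
    if (abs(g) <= tol) break
    x <- x - alpha * g
  }
  x
}

# Slides' example: f(x) = x^2, f'(x) = 2x, alpha = 0.8, x0 = -4.
# Iterates: -4, 2.4, -1.44, 0.864, ... shrinking towards 0.
grad_descent(x0 = -4, alpha = 0.8, grad = function(x) 2 * x)
```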
FINAL REMARKS

Why are we learning this? It is not always straightforward to find a closed form solution. The minimum of a convex function can be found by optimizing with gradient descent. This applies to any function of this kind, and that is why ML sometimes just resorts to identifying a loss function that is convex, so that the minimum can then be found through optimization.

EXTRA

Now that we have learnt all of this, we can understand better why certain pre-processing techniques are needed:
- Check for collinearity.
- Scale variables so they have comparable magnitudes.
- Outliers: in the exercise!

SCALING

[Figure: contour plots of the loss surface; the lines are fixed in the z-axis and vary in x and y]

https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3

COLLINEARITY

Now the contours run along a narrow valley; there is a broad range of values for the coefficient estimates that result in equal values for the RSS. Hence a small change in the data could cause the pair of coefficient values that yield the smallest RSS (that is, the least squares estimates) to move anywhere along this valley. This results in a great deal of uncertainty in the coefficient estimates. Since collinearity reduces the accuracy of the estimates of the regression coefficients, it causes the standard error for $\hat{\beta}_j$ to grow.

https://hastie.su.domains/ISLR2/ISLRv2_corrected_June_2023.pdf.download.html

The first is a regression of balance on age and limit, and the second is a regression of balance on rating and limit. In the first regression, both age and limit are highly significant with very small p-values. In the second, the collinearity between limit and rating has caused the standard error for the limit coefficient estimate to increase by a factor of 12 and the p-value to increase to 0.701. In other words, the importance of the limit variable has been masked due to the presence of collinearity. To avoid such a situation, it is desirable to identify and address potential collinearity problems while fitting the model.

From the normal equations (OLS) perspective:
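For reference, the standard result the heading refers to (a textbook identity, stated here as a gloss rather than recovered from the slides) is that minimizing the sum of squared residuals $\|y - X\beta\|^2$ leads to the normal equations:

$X^\top X \, \hat{\beta} = X^\top y$, hence $\hat{\beta} = (X^\top X)^{-1} X^\top y$ when $X^\top X$ is invertible.

When predictor columns are nearly collinear, $X^\top X$ is nearly singular, so the inversion amplifies small changes in the data: this is the narrow valley and the inflated standard errors described above.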
