Tags

machine learning, linear models, linear regression, statistics

Summary

These lecture notes cover various topics in machine learning, focusing on linear models. The document discusses linear regression, including foundational concepts, loss functions, and gradient descent, and goes on to cover logistic regression and support vector machines. It also reviews related statistical background such as the t-distribution and ANOVA, and introduces the sigmoid function.

Full Transcript

Machine Learning 1
Week 4 Lecture: Linear Models

Linear Models

Linearly Separable Data
Source: Abu-Mostafa, Y. S., Magdon-Ismail, M., & Lin, H. (2012). Learning from data: A short course.

Linear Regression – Foundational Concepts

The t-distribution
- Always centered at zero, like the standard normal (z) distribution.
- Has a single parameter: degrees of freedom (df).

ANOVA: The F distribution – degrees of freedom
F = (variability between groups) / (variability within groups)
The distribution is written F(b, w), where:
- b – degrees of freedom for the variance between groups = number of groups − 1
- w – degrees of freedom for the variance within groups = total number of observations − number of groups

Population Mean vs. Variance
- When would we be interested in population mean differences? μx = μy?
- When would we be interested in population variances? σx² = σy²?

Residuals
Residuals are the leftovers from the model fit:
Data = Fit + Residual

Constant Variability – Homoscedasticity

Transformations
Transformations are not done only to improve predictive accuracy; we sometimes need them to turn non-normally distributed data into normal (or "more normal") data, so that we can answer questions about our data using what we know about the normal distribution.

Linear Regression / Simple Linear Regression
Source: https://developers.google.com/machine-learning/crash-course/linear-regression

Advertising Example
1. Is there a relationship between advertising budget and sales? (Association ≠ causation, but association might be predictive.)
2. How strong is the relationship between advertising and sales?
3. Which media are associated with sales? What are the individual contributions of TV, radio, and newspapers, and are they all effective?
4. How large is the association between each advertising medium and sales? For every dollar spent on advertising in each medium, by what amount will sales increase?
5. How accurately can we predict future sales?
6. Is the relationship linear? (The law of diminishing returns.)
7. Is there synergy among the advertising media? (Feature interaction.)

Simple Linear Regression – Coefficients
- β0 – intercept (bias)
- β1 – slope (weight)
Both coefficients are learned from the data, as the sketch below illustrates.
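As an aside, here is a minimal sketch in Python/NumPy of how β0 and β1 can be estimated by ordinary least squares. The budget and sales numbers are invented for illustration; they are not the textbook's Advertising dataset.

```python
import numpy as np

# Illustrative data: TV budget (in $1000s) and sales (in 1000s of units).
# These values are made up for the sketch, not taken from the lecture's dataset.
tv_budget = np.array([10.0, 50.0, 100.0, 150.0, 200.0, 250.0])
sales = np.array([7.2, 9.4, 11.8, 14.1, 16.0, 18.9])

# Ordinary least squares for y = b0 + b1 * x:
#   b1 = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
#   b0 = y_mean - b1 * x_mean
x_mean, y_mean = tv_budget.mean(), sales.mean()
b1 = np.sum((tv_budget - x_mean) * (sales - y_mean)) / np.sum((tv_budget - x_mean) ** 2)
b0 = y_mean - b1 * x_mean

print(f"intercept b0 = {b0:.3f}, slope b1 = {b1:.4f}")
print(f"predicted sales for a $120k budget: {b0 + b1 * 120.0:.2f}")
```

The slope estimated this way is exactly the quantity the hypothesis test on the next slides asks about: is β1 far enough from zero that chance alone cannot explain it?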
Simple Linear Regression – Hypothesis Testing
1. Is there a relationship between advertising budget and sales?
In words:
- H0: There is no relationship between X and Y.
- HA: There is some relationship between X and Y.
In terms of the coefficients:
- H0: β1 = 0
- HA: β1 ≠ 0
Hypothesis testing asks: how far must β1 be from zero before we are confident that it is not zero?

What does this mean? An increase in TV advertising spend of $1,000 is associated with an increase in sales of about 50 units.

Standard Error and Confidence Intervals
The standard error (SE) is the +/- value used to calculate the range of a confidence interval.
- The 95% confidence interval for the TV advertising coefficient β1 is [0.042, 0.053]: for each $1,000 of spend, we add between 42 and 53 sales.
- If we set the TV advertising budget to 0, what is the expected range of sales? The 95% confidence interval for the intercept β0 is [6.12, 7.94]: with no TV advertising, sales will be between 6,120 and 7,940 units.

p-values
A small p-value indicates that it would be unlikely to observe such a substantial association between the predictor and the response by chance, in the absence of any real association. In this case we see a small p-value, which means the association is unlikely to have occurred at random; hence, we can reject the null hypothesis.

Linear Regression – Loss
- Loss function
- Gradient descent and optimal model parameters
- Hyperparameter tuning
Source: https://developers.google.com/machine-learning/crash-course/linear-regression/loss

Linear Regression – Types of Loss
Discuss how outliers impact the different types of loss functions.
Source: https://developers.google.com/machine-learning/crash-course/linear-regression/loss

Linear Regression – Loss Function
The loss function quantifies the difference between the actual output and the predicted output for a given input. To improve the model's predictions, we aim to minimize the loss.

Linear Regression – Loss Simulation
https://developers.google.com/machine-learning/crash-course/linear-regression/parameters-exercise

Linear Regression – Gradient Descent
Model convergence and loss curves.
Source: https://developers.google.com/machine-learning/crash-course/linear-regression/gradient-descent

Minimizing the Loss Function
Gradient descent is an optimization algorithm that iterates through different combinations of weights and bias to find the combination that minimizes the loss (error). The goal is to reach the lowest (optimal) position at the bottom of the curve: a minimum.

Global vs. Local Minimum
Source: https://en.wikipedia.org/wiki/Maximum_and_minimum

Gradient Descent
Gradient descent may be used to search the weight space. The direction of change is determined by the gradient of the error function; the magnitude of change is determined by the learning (training) rate. The goal is to minimize the error represented by the model's cost function, by iteratively adjusting the model's parameters to find the values that minimize the difference between the predicted output and the label. A sketch of this loop follows.
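Here is a minimal sketch of that loop in Python/NumPy, using mean squared error as the loss for a one-feature linear model. The data, learning rate, and epoch count are arbitrary illustrative choices.

```python
import numpy as np

def gradient_descent(x, y, learning_rate=0.01, epochs=1000):
    """Fit y ~ w * x + b by minimizing mean squared error (MSE)."""
    w, b = 0.0, 0.0  # start from an arbitrary point in weight space
    n = len(x)
    for _ in range(epochs):
        error = (w * x + b) - y
        # Gradients of MSE = mean(error^2) with respect to w and b.
        grad_w = (2.0 / n) * np.sum(error * x)
        grad_b = (2.0 / n) * np.sum(error)
        # Step against the gradient; the learning rate sets the step size.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Invented data: roughly y = 2x plus a little noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
w, b = gradient_descent(x, y)
print(f"w = {w:.3f}, b = {b:.3f}")  # converges near w = 2, b = 0
```

If the learning rate here were set much larger (say 0.5), each update would overshoot the minimum and the loss would diverge, which is exactly the trade-off the next slide describes.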
Gradient Descent – The Learning Rate
The learning rate is a hyperparameter in gradient descent that controls the step size taken during each iteration of parameter updates. It determines how quickly or slowly the algorithm converges to the optimal solution.
- If the learning rate is too high, the algorithm may overshoot the minimum and fail to converge (overshooting).
- If the learning rate is too low, the algorithm may take a long time to converge or get stuck in a suboptimal solution (slow convergence, or being trapped in a local minimum).

Gradient Descent – Let's Visualize It
https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop-adam-f898b102325c

Linear Regression – Hyperparameters
- NP-hard problems
- Learning rate
- Batch size
- Stochastic gradient descent (SGD)
- Mini-batch stochastic gradient descent (mini-batch SGD)
- Epochs
Read: https://developers.google.com/machine-learning/crash-course/linear-regression/hyperparameters

Linear Regression – Programming Exercise
https://developers.google.com/machine-learning/crash-course/linear-regression/programming-exercise

Logistic Regression

Linear Models for Classification
ŷ = w[0] * x[0] + w[1] * x[1] + … + w[p] * x[p] + b > 0
- ŷ – predicted value
- w – weights
- b – bias
For linear models for classification, the decision boundary is a linear function of the input. In other words, a (binary) linear classifier separates two classes using a line, a plane, or a hyperplane.

Logistic Regression
- A supervised training (learning) algorithm for creating classification models. In many cases, it is the first algorithm you try when working with a dataset.
- Logistic regression is a probabilistic model: the probability of a binary event is always between 0 and 1. If the probability of the event exceeds a threshold (50%), we predict one class; otherwise, we predict the other class.
- It is a regression model whose dependent (output) variable is categorical. If a binary variable Y is a function of a continuous input variable X, logistic regression may be used to estimate the conditional distribution Pr(Y | X = x) under certain conditions.

Sigmoid Function
How would you describe this chart? The y-axis runs from 0 to 1, matching the two classes of the binary classification problem, and irrespective of the input value (x), the output always lies between 0 and 1.

Logistic Regression Assumptions
Let p be the probability that Y = 1 given X = x. A model with k input variables X1, X2, …, Xk may be specified as:
Pr(Y = 1 | X1 = x1, X2 = x2, …, Xk = xk) = 1 / (1 + e^−(β0 + β1x1 + β2x2 + … + βkxk))
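The following minimal sketch in Python/NumPy evaluates that model directly; the coefficients are invented for illustration rather than fitted to any data.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, beta0, betas):
    """Pr(Y = 1 | X = x) = 1 / (1 + e^-(beta0 + beta1*x1 + ... + betak*xk))."""
    return sigmoid(beta0 + np.dot(betas, x))

# Invented coefficients for a model with k = 2 input variables.
beta0 = -1.0
betas = np.array([0.8, 0.5])

x = np.array([1.2, 0.7])
p = predict_proba(x, beta0, betas)
label = 1 if p > 0.5 else 0  # the 50% decision threshold from the slides
print(f"Pr(Y=1 | x) = {p:.3f} -> predicted class {label}")
```

In practice the β coefficients are learned by minimizing log loss, usually with regularization to prevent overfitting, as covered in the reading below.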
Sigmoid Function
Source: https://developers.google.com/machine-learning/crash-course/logistic-regression/sigmoid-function

Linear vs. Logistic Regression

Logistic Regression – Loss and Regularization
- Log loss
- Regularization to prevent overfitting
Read: https://developers.google.com/machine-learning/crash-course/logistic-regression/loss-regularization

Support Vector Machines (SVM)
- Are the datasets linearly separable?
- Which is the better decision boundary, L1 or L2? Why?
Source: Abu-Mostafa, Y. S., Magdon-Ismail, M., & Lin, H. (2012). Learning from data: A short course.

Support Vector Machines – Margin
The yellow region represents the margin of error (where you might start making errors). The goal is to classify points correctly while maximizing this margin.
Sources: Abu-Mostafa, Y. S., Magdon-Ismail, M., & Lin, H. (2012). Learning from data: A short course; https://en.wikipedia.org/wiki/Support-vector_machine

What If the Data Is Non-Linearly Separable?
- Approach 1 – Soft margin (tolerate some errors)
- Approach 2 – Kernel transformation (also known as the kernel trick)
Source: Abu-Mostafa, Y. S., Magdon-Ismail, M., & Lin, H. (2012). Learning from data: A short course.

Support Vector Machines – Kernel Trick
Let's visualize the kernel trick: https://playground.tensorflow.org/
Source: https://www.hashpi.com/the-intuition-behind-kernel-methods

SVM – Maximum Margin Separator
- Constructs a maximum margin separator when the examples are linearly separable: the sum of the distances from the separator to the nearest positive example and to the nearest negative example is maximized.
- This results in lower generalization error.
- The maximum margin separator can be computed efficiently by solving a constrained optimization problem.

SVM – Kernel Trick
When examples are not linearly separable, a kernel function may be used to map the features onto a new feature space in which the examples are linearly separable. For example, the function
f(x1, x2) = 1 iff x1² + x2² < r²
may be mapped to a new feature space using z1 = x1² and z2 = x2². The examples are now linearly separable in this feature space:
g(z1, z2) = 1 iff z1 + z2 < r²
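To see that the mapping really linearizes the boundary, here is a minimal sketch in Python/NumPy of that exact feature map; the radius r and the sample points are arbitrary illustrative values.

```python
import numpy as np

R = 2.0  # illustrative radius of the circular decision boundary

def f(x1, x2):
    """Original space: inside-the-circle test, non-linear in (x1, x2)."""
    return x1**2 + x2**2 < R**2

def feature_map(x1, x2):
    """The mapping from the slide: z1 = x1^2, z2 = x2^2."""
    return x1**2, x2**2

def g(z1, z2):
    """New space: the same test is now linear in (z1, z2)."""
    return z1 + z2 < R**2

# The two classifiers agree on every point, by construction.
rng = np.random.default_rng(seed=0)
for x1, x2 in rng.uniform(-3.0, 3.0, size=(5, 2)):
    z1, z2 = feature_map(x1, x2)
    print(f"({x1:+.2f}, {x2:+.2f}) -> f = {f(x1, x2)}, g = {g(z1, z2)}")
```

A full SVM with a kernel goes one step further: it never builds the mapped features explicitly, but computes the inner products in the mapped space directly through the kernel function.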
