STINTSY Notes Term 1, AY 2024-2025
De La Salle University

Summary
These notes cover key concepts and formulas in machine learning, including K-Nearest Neighbors, Linear Regression, Logistic Regression, neural networks, Naive Bayes, decision trees, and ensemble learning, with formulas and worked examples.
Formula List

K-Nearest Neighbors
- Euclidean Distance (L-2 Distance): $dist(z, X^{(i)}) = \sqrt{\sum_{j=1}^{d} (z_j - X_j^{(i)})^2}$
- Manhattan Distance (L-1 Distance): $dist(z, X^{(i)}) = \sum_{j=1}^{d} |z_j - X_j^{(i)}|$

Linear Regression
- Model: $\hat{y} = \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_d x_d + \theta_0 x_0 = \theta^T x$ (where $x_0 = 1$)
- Loss Function: $l(\theta, X, y) = \frac{1}{2n} \Sigma (\hat{y} - y)^2$ (the 2 is for convenience)
- Derivative of Loss: $\frac{\partial}{\partial \theta} l(\theta, X, y) = \frac{1}{n} (X\theta - y)^T X$
- Analytical Solution: $\theta = (X^T X)^{-1} X^T y$
- Gradient Descent: $\theta := \theta - \alpha \frac{\partial}{\partial \theta} l(\theta)$ ($\alpha$ is a hyperparameter)

Other Loss Functions
- Mean Squared Error (MSE): $\frac{1}{n} \Sigma (\hat{y} - y)^2$
- Mean Absolute Error (MAE): $\frac{1}{n} \Sigma |\hat{y} - y|$
- Root Mean Squared Error (RMSE): $\sqrt{\frac{1}{n} \Sigma (\hat{y} - y)^2}$
- Coefficient of Determination (R²): $1 - \frac{\Sigma (\hat{y} - y)^2}{\Sigma (y - \bar{y})^2}$ ($\bar{y}$ is the mean of the labels)

Regularization
- Regularized Loss Function: $l(\theta, X, y) = \frac{1}{2n} \Sigma (\hat{y} - y)^2 + \lambda R(\theta)$ (constant $\lambda$)
- Ridge Regression (L-2): $R(\theta) = \sum_{j=1}^{d} \theta_j^2$
- Lasso Regression (L-1): $R(\theta) = \sum_{j=1}^{d} |\theta_j|$
- Gradient Descent: $\theta := \theta - \alpha \left(\frac{1}{n} (X\theta - y)^T X + \lambda \theta\right)$

Logistic Regression
- Model (Sigmoid Function): $\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-(\theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_d x_d + \theta_0)}}$
- Binary Cross-Entropy (Loss): $-\sum_{i=1}^{n} y^{(i)} \ln(p^{(i)}) + (1 - y^{(i)}) \ln(1 - p^{(i)})$
- Derivative of Loss: $\frac{\partial}{\partial \theta} l(\theta, X, y) = (X\theta - y)^T X$
- Softmax Function: $prob_i = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}$
- Loss Function (Multiclass): $l(\theta, X, y) = -\sum_{i=1}^{n} y^{(i)T} \ln(p^{(i)})$

Evaluation of Classification Models
- Accuracy: $\frac{TP + TN}{TP + FP + TN + FN}$
- Precision: $\frac{TP}{TP + FP}$
- Recall: $\frac{TP}{TP + FN}$
- F1-Score: $2 \times \frac{precision \times recall}{precision + recall}$

Neural Networks (Backpropagation)
- $\frac{\partial \hat{y}}{\partial a^{[n]}} = 1$
- $\frac{\partial a^{[i]}}{\partial z^{[i]}} = \sigma(z^{[i]})(1 - \sigma(z^{[i]}))$
- $\frac{\partial z^{[i]}}{\partial W^{[i]}} = a^{[i-1]}$
- $\frac{\partial z^{[i]}}{\partial a^{[i-1]}} = W^{[i]}$
- $\frac{\partial z^{[i]}}{\partial b^{[i]}} = 1$

Activation Functions
- Sigmoid Function: $\sigma(x) = \frac{1}{1 + e^{-x}}$, Range: (0, 1)
- Tanh Function: $\tanh(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}}$, Range: (-1, 1)
- ReLU Function: $ReLU(x) = \max(0, x)$, Range: [0, ∞)
- Leaky ReLU Function: $LeakyReLU(x) = \max(0.01x, x)$
- PReLU Function: $PReLU(x) = \max(ax, x)$ ($a$ is a hyperparameter)

Naive Bayes
- Classification: $P(T|F) = \frac{P(X_1 = \square \cap X_2 = \square \cap \ldots \cap X_d = \square \mid T)\, P(T)}{P(F)}$ where T = target, F = features
- Regression: $P(T|F) = \frac{pdf(X_1 = \square \cap X_2 = \square \cap \ldots \cap X_d = \square \mid T)\, pdf(T)}{pdf(F)}$ where pdf = probability density function

Decision Trees
- Measures of Impurity:
  - Shannon's Entropy: $L(S) = -|S| \sum_i p_i \log_2(p_i)$ where $p_i = \frac{\text{number of class } i}{|S|}$
  - Gini Index: $L(S) = |S| \times (1 - \sum_i p_i^2)$ where $p_i = \frac{\text{number of class } i}{|S|}$
  - Variance: $L(S) = |S| \times \frac{\Sigma (x - \mu)^2}{n - 1}$ (for regression only)
- Information Gain: $IG = A - (B + C)$ where A = impurity of the original data, B = impurity of partition 1 (yes), C = impurity of partition 2 (no)

Ensemble Learning
- Bagging: $h(x)$ = mode or mean of $h_i(x)$ for $i = 1$ to $n$
- Boosting: $h(x) = h_1(x) + h_2(x) + \ldots + h_n(x)$
- Gradient Boosting: $h(x) = a_1 h_1(x) + a_2 h_2(x) + \ldots + a_n h_n(x)$
- Amount of Say (for AdaBoost): $AoS = \frac{1}{2} \ln\left(\frac{1 - \text{total error}}{\text{total error}}\right)$
Introduction to Machine Learning

Artificial Intelligence (AI)
- field of research in computer science that focuses on making computers exhibit intelligent behavior (an illusion of intelligence)

Machine Learning (ML)
- also called neural AI; a sub-field of AI where rules are automatically inferred from data
- a model is the representation of the "patterns" that were learned from the data

Types of Artificial Intelligence
- Symbolic/Traditional AI
  - data and rules → computer → output
  - predicts a correct output based on given data with rules
  - Examples: knowing what makes an animal a quadruped and having the computer use the rule; a Tic-Tac-Toe program that searches for the best move
- Neural AI/Machine Learning
  - data and output → computer → rules
  - outputs a set of rules based on given data with their correct output
  - done to handle cases where rules cannot be expressed easily (no clear formula)
  - data-dependent, making it a double-edged sword
  - Examples: determining what it is that makes an animal a quadruped; a Tic-Tac-Toe model that learns patterns of moves with the best outcome based on many games

Types of Machine Learning Tasks
- Supervised Learning
  - the model tries to predict a label/target/output
  - there is a "correct answer" to every example
  - can measure if the model is correct by comparing its predictions to the "correct" answers
  - Regression: predicting a numerical value (e.g., age, temperature, stock forecast)
  - Classification: predicting a categorical value; can be binary or multiclass (e.g., COVID or not, spam or not, type of flower, recognizing a handwritten digit)
- Unsupervised Learning
  - there is no label, no "correct answer" for each example in the data
  - the goal is simply to find patterns in the data
  - requires interpretation or validation of results
- Reinforcement Learning
  - the model learns by interacting with an environment and receiving rewards/penalties for its actions

Why Machine Learning?
- While symbolic AI is good at a lot of intelligent tasks, it has a lot of limitations when it comes to other tasks
- "Hard problems are easy and the easy problems are hard" (walking, etc.)
Machine Learning Terms
- Dataset: collection of data instances the model will "learn" from, in table format
- Instance: a single object/row in the dataset
- Label: the target variable being predicted
- Classes: the list of possible values for the label (applies to classification tasks)
- Features: the variables considered when "learning" the rules/making the prediction

Machine Learning Task Formulation

Example 1: Make a system that can estimate the price of a house
- In traditional AI: use size, location, etc. as multipliers to predict the price
  - Limitations: accuracy of the formula, dependence on assumptions
- In machine learning (supervised ML approach): collect data on existing houses
  - Instance: house
  - Label: price
  - Type: regression
  - Features: all except the id and price
  - Note: feature selection matters, as models may struggle "learning" the patterns

Example 2: Make a system that can auto-detect whether an animal image is a mammal, reptile, or bird
- In ML:
  - Instance: image
  - Label: mammal, bird, or reptile
  - Features: colors of pixels
- It would be nice to have info such as the number of legs, presence of fur, and others, but it is impossible for the computer to extract those just from the image
  - the computer sees the image as a 2D array of pixels, and each pixel is represented by 3 numbers (RGB), each of which can technically be made a feature
  - theoretically, there can be 30,000 features, assuming each image is 100x100 px
- Format: Given an animal image, predict whether it is a mammal, reptile, or bird given the colors of each pixel in the image

Feature Vector
- think of it as a numerical representation of a single instance
- non-numerical values are usually assigned a numerical value
- Ex: [Ave. R value of all pixels, Ave. G value of all pixels, Ave. B value of all pixels]

Notations
- D: dataset (subset of a population)
- X: feature matrix
- y: labels

Training a Supervised ML Model
- the model is also called the hypothesis function
- the model is not expected to be good immediately as it is being trained
- the goal is to adjust the model to fit the training data, resulting in better predictions/answers

Evaluating a Supervised ML Model

Basic Supervised ML Pipeline
- Collect data
- Preprocess data (exploratory data analysis, cleaning, etc.)
- Identify features and label
- Split data into a training set and a test set
  - the model "learns" the patterns based on the training data
  - if the same data were used to "test" the model, the result would likely be good, but biased
  - therefore, testing must be done on data not yet seen before
- Build and fine-tune the model from the training set
- Run the test set on the model to measure its performance
- Iterate as needed

Basic Unsupervised ML Pipeline
- Collect data
- Preprocess data (exploratory data analysis, cleaning, etc.)
- Identify features
- Build and fine-tune the model from the dataset
- Perform expert interpretation and validation on the results
- Iterate as needed

K-Nearest Neighbors

K-Nearest Neighbors (KNN)
- the most "naive" kind of supervised machine learning model
- makes predictions based on similarity to the closest neighbors
- High-Level Idea:
  - no pattern-finding; the training data is given and all of it is saved in memory
  - each point is an array of values describing something (like a person), and these are considered by the model
  - Example: identifying whether a person will be dated by someone, based on others
    - each person/instance has features such as height, weight
    - some features may be ignored as a result (such as maybe being reminded of their ex, etc.)
    - the label is whether or not the person in the dataset has already been dated
    - similarity is measured by how much of their features match
- Training: not much "training"; the model simply memorizes the entire dataset and uses that as the model
- Prediction: for an unknown instance, find the most similar object and copy its class label

Measures of Similarity
- similarity of an instance z to the i-th training instance $X^{(i)}$
- Euclidean Distance (L-2 Distance): $dist(z, X^{(i)}) = \sqrt{\sum_{j=1}^{d} (z_j - X_j^{(i)})^2}$
- Manhattan Distance (L-1 Distance): $dist(z, X^{(i)}) = \sum_{j=1}^{d} |z_j - X_j^{(i)}|$
- If there are instances with the same distance but different class labels:
  - the model treats it as unsure, so addressing it depends on the implementation

Hyperparameter k
- Hyperparameter: an option manually decided on when training an ML model
- k is the number of nearest neighbors to consider before making a prediction
  - exists to safeguard against outliers by basing the prediction on a possible trend
  - by default, predictions are based on the majority among the k neighbors (mode)
    - if k = n, the prediction is the most common class label in the dataset
  - if all k instances have different class labels, the model is unsure and picks any of them (since there is no majority)

Sample Data 1: Classification
- Predicting instance (21, 17)'s weather based on the given data (k = 1):
  - using Euclidean distance, the model predicts that (21, 17) has rainy weather, as the closest data point is (17, 14), which is rainy
- Predicting (21, 17), (33, 13), and (7, 6)'s weather based on the given data (k = 3)

Sample Data 2: Regression
- unlike classification, predictions cannot be based on the majority since the labels are numerical values, requiring one of the following:
  - Bucketing: converting the values into categories (binning)
  - Summarizing the values if they cannot be classified
- Prediction: if k > 1, get the average value of the k nearest neighbors
- Predicting (21, 17), (33, 13), and (7, 6)'s values based on the given data (k = 3)

Assumptions in KNN
- all instances correspond to points in a d-dimensional space ℝ^d
  - if 2D, can easily be visualized through plotting
  - Ex: temperature, humidity → ℝ²
- features may be discrete or continuous
- labels can be continuous (regression) or categorical (classification) as well

Other Distance Functions
- Minkowski Distance: a generalization of Euclidean and Manhattan distance
- Cosine Distance: similarity between two vectors
  - looks at the direction of the vectors
  - $\cos(\theta) = \frac{A \cdot B}{||A||\,||B||}$
- Hamming Distance: similarity between two strings
  - how many edits must be made to one string to make the two strings equal
  - Ex: bat and cab have a Hamming distance of 2

Hyperparameter Tuning
- val: a validation set used to determine which k is the "best"
- the most accurate k becomes the hyperparameter of the model

Hyperparameter Tuning with Cross-fold Validation

Sample 3: A dataset of age and monthly salary to predict whether someone is single or married
- Potential Issue: bias towards the salary, as it is not normalized
  - salary would be prioritized due to how it scales
  - Ex: between (20, 20000), (21, 21000), and (29, 21000), the points (21, 21000) and (29, 21000) would be seen as more similar
- Potential Solutions: normalization
  - transforming a feature to the same scale as another
  - another possible solution is scaling

KNN Advantages and Disadvantages
- Advantages
  - fast training time
  - straightforward and easy to implement
- Disadvantages
  - the model is large
  - prediction is slow if the dataset is large
  - considers all features equally, regardless of whether they are relevant or not
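As a concrete illustration of the ideas above, here is a minimal NumPy sketch of a KNN classifier with Euclidean distance and majority voting. The dataset values and the function name are hypothetical, chosen to echo the (temperature, humidity) → weather example; this is not the notes' official implementation.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, z, k=3):
    """Predict the class of instance z by majority vote among its k nearest neighbors."""
    # Euclidean (L-2) distance from z to every training instance
    dists = np.sqrt(((X_train - z) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Majority vote (mode) among the k neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Hypothetical (temperature, humidity) -> weather data, in the spirit of Sample Data 1
X_train = np.array([[17, 14], [30, 12], [8, 5], [25, 18]])
y_train = np.array(["rainy", "sunny", "cloudy", "rainy"])
print(knn_predict(X_train, y_train, np.array([21, 17]), k=1))  # the single nearest point decides
```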
Linear Regression

Linear Regression
- supervised, regression learning algorithm
- unlike KNN, does not need to store the data points; the model is a line if there is 1 feature
  - as a result, it is much smaller than KNN
- Model: $\hat{y} = \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_d x_d + \theta_0 x_0$
  - where $x_0$ is always 1 (for convenience in vectorization)
  - based on the equation of a line, $\hat{y} = mx + b = \theta_1 x + \theta_0$:
    - $\theta_1$: slope of the line, determines the direction/rotation of the line
    - $\theta_0$: intercept of the line, serves as the offset
  - Goal: get a line that fits as much of the data as possible
  - can be extended to multiple features:
    - with 2 features, the line becomes a plane; with more than 2, a hyperplane
  - generally expressed in vectorized form: $\hat{y} = \theta^T x$
  - can also be expressed as $y = \theta^T x + \varepsilon$, where $\varepsilon \sim N(0, \sigma^2)$ represents measurement error or some other random noise

Sample Data: Predict the price of a house given its lot area
- the x-axis is the feature, the y-axis is the label
- if lotarea = 95, price = ?
  - in KNN, it would just get the average of the k closest lotarea values
  - in linear regression, it draws a line (trend) that best represents the data
- one way to determine whether a trend line is good is the distance of each point to the trend line
  - the best line minimizes the distance between it and the points

Loss Function
- also known as the objective function or cost function
- Input: parameters of a model + dataset
- Returns: a numerical value representing how well the model fits the dataset
- Notation: $l(\theta, X, y)$ = loss (the lower the value, the better)
- Linear Regression Loss Function: $l(\theta, X, y) = \frac{1}{2n} \Sigma (\hat{y} - y)^2$
  - n: number of instances
  - ŷ: prediction of the model
  - y: correct answer
  - can also be written as $\frac{1}{n} \Sigma (\hat{y} - y)^2$; the extra 2 is for convenience when differentiating
- the loss landscape is expected to look like a bowl

Optimization Problem
- $\arg\min_\theta l(\theta, X, y) = \arg\min_\theta \frac{1}{2n} \Sigma (\hat{y} - y)^2$
- find the value of θ such that the loss function returns the smallest possible value
  - done by taking the derivative of the loss function for each $\theta_i$, where $\theta_0$ is the intercept and the rest are feature slopes
- Given $\frac{\partial}{\partial \theta} \frac{1}{2n} \Sigma (\hat{y} - y)^2$ and $\theta = [\theta_0\ \theta_1\ \theta_2]^T$:
  - $\frac{\partial}{\partial \theta} l(\theta, X, y) = \left[\frac{1}{n} \Sigma (\hat{y} - y)(x_0),\ \frac{1}{n} \Sigma (\hat{y} - y)(x_1),\ \ldots,\ \frac{1}{n} \Sigma (\hat{y} - y)(x_d)\right]$
  - in vector form: $\frac{\partial}{\partial \theta} l(\theta, X, y) = \frac{1}{n} (X\theta - y)^T X$

Two Solutions to Linear Regression
- Analytical Solution: set the derivative to 0 and solve for θ:
  - $\frac{1}{n} (X\theta - y)^T X = 0$
  - $(X\theta - y)^T X = 0$
  - $X^T X \theta - X^T y = 0$
  - $X^T X \theta = X^T y$
  - $\theta = (X^T X)^{-1} X^T y$
- Gradient Descent
  - try out different parameters until you reach the lowest point (which always exists)
  - use the gradient (the derivative evaluated at the current point) to guide the exploration
  - Pseudocode:
      procedure gradientdescent(θ):
          while not converged do:
              θ := θ − α (∂/∂θ) l(θ)
          return θ
  - Hyperparameter α
    - the learning rate; determines how large each update will be
    - controls how "fast" the learning happens
    - common values are 0.01, 0.001, 0.0001, and so on
    - if α is too small: convergence may take too long
    - if α is too large: the algorithm may overshoot the minimum
  - (∂/∂θ) l(θ): derivative/gradient of the loss

Stochastic and Mini-Batch Gradient Descent
- Gradient Descent: updates θ based on the gradient of the whole dataset; runs slow; 1 iteration = N instances (N = size of the whole dataset)
- Stochastic GD: updates θ based on the gradient of one data instance; runs fast, but jittery; 1 iteration = 1 instance
- Mini-Batch GD: updates θ based on the gradient of a subset of the whole dataset; runs faster than GD and the path is not as jittery as stochastic; 1 iteration = M instances (M = batch size)
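Both solutions above fit in a few lines of NumPy. A minimal sketch on synthetic data (the data, learning rate, and iteration count are made up for illustration; a column of ones is prepended so that θ₀ acts as the intercept):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(50), rng.uniform(0, 10, 50)]   # x0 = 1 column for the intercept
y = 3.0 * X[:, 1] + 5.0 + rng.normal(0, 1, 50)   # y = 3x + 5 + noise

# Analytical solution: theta = (X^T X)^{-1} X^T y
theta_exact = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent: theta := theta - alpha * (1/n) (X theta - y)^T X
theta, alpha, n = np.zeros(2), 0.01, len(y)
for _ in range(5000):
    grad = (X @ theta - y) @ X / n
    theta -= alpha * grad

print(theta_exact, theta)  # both should land close to [5, 3]
```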
Standardization of Features
- $x_{norm} = \frac{x - \mu}{\sigma}$
- subtract the mean of the feature from each value, then divide by the standard deviation of that feature
- sometimes necessary to prevent underflow or overflow

Evaluating Regression Tasks (on the test set)
- Mean Squared Error (MSE): $\frac{1}{n} \Sigma (\hat{y} - y)^2$
- Mean Absolute Error (MAE): $\frac{1}{n} \Sigma |\hat{y} - y|$
- Root Mean Squared Error (RMSE): $\sqrt{\frac{1}{n} \Sigma (\hat{y} - y)^2}$
- Coefficient of Determination (R²): $1 - \frac{\Sigma (\hat{y} - y)^2}{\Sigma (y - \bar{y})^2}$ where $\bar{y}$ is the mean of the labels

Linear Regression Advantages and Disadvantages
- Advantages
  - relatively fast to train
  - relatively fast to test
- Disadvantages
  - can only be used for regression tasks
  - features may sometimes need to be standardized
  - features may sometimes need to be transformed
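The evaluation metrics above translate directly into NumPy one-liners. A small sketch (the y_true and y_pred arrays are placeholder values):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])

mse  = np.mean((y_pred - y_true) ** 2)
mae  = np.mean(np.abs(y_pred - y_true))
rmse = np.sqrt(mse)
# R^2 = 1 - SS_res / SS_tot, where SS_tot uses the mean of the labels
r2   = 1 - np.sum((y_pred - y_true) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
print(mse, mae, rmse, r2)
```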
Bias-Variance Tradeoff and Regularization

Sample Data
- Fitting a standard linear regression model: $\hat{y} = w_1 x + w_0$
- Alternatively, changing the model to $\hat{y} = w_1 x^2 + w_0$ can be interpreted as one of the following:
  - changing the model by changing x to x²
  - keeping the model but changing the features
- Ex: 1 feature
  - Order = 1: Features: $x_1$; Model: $\hat{y} = \theta_1 x_1 + \theta_0$
  - Order = 2: Features: $x_1, x_1^2$; Model: $\hat{y} = \theta_1 x_1 + \theta_2 x_1^2 + \theta_0$
  - and so on
- Ex: 2 features
  - same idea as 1 feature but with more combinations, since you get every possible combination
- Issue: Curse of Dimensionality
  - as the dimensionality of the feature space increases, the number of configurations grows exponentially, and thus the number of configurations covered by an observation decreases
  - Possible Solution: only use different exponents of each feature rather than also combining them ($x_1, x_1^2, x_2, x_2^2, \ldots$)

Issues of Higher Order Models: Bias-Variance
- High Bias
  - the model is too simple
  - no matter how much you try to fit, it won't capture the patterns in the data
- High Variance
  - the model is too complex
  - a large variety of models with the same complexity can fit the data just as nicely
- Underfitting
  - the model did not fit the training data well
- Overfitting
  - the model fits the training data well, but performs poorly on the testing data
  - i.e., it did not generalize

Summary of Bias-Variance
- Underfit Model: high training error, high test error
- Overfit Model: low training error, high test error

Bias-Variance Tradeoff
- generally, the goal is to find a good balance between bias and variance

Effect of Data Quantity on Bias/Variance
- more data means models are more likely to "listen" to the general trend

Typical Overfitting Plot
- the training error decreases as the degree of the polynomial (i.e., the complexity of the hypothesis) increases
- the testing error, measured on independent data, decreases at first, then starts increasing

Regularization
- methods to reduce overfitting in machine learning models, without having to collect more data or change the learning algorithm
- Updated Linear Regression Loss Function: $l(\theta, X, y) = \frac{1}{2n} \Sigma (\hat{y} - y)^2 + \lambda R(\theta)$
  - Training error term: $\frac{1}{2n} \Sigma (\hat{y} - y)^2$; for this term alone, θ farther from 0 can fit better
  - Regularization term: $\lambda R(\theta)$; the nearer θ is to 0, the better
  - λ: regularization constant
- Types:
  - Ridge Regression (L2 Regularization): $R(\theta) = \sum_{j=1}^{d} \theta_j^2$
  - Lasso Regression (L1 Regularization): $R(\theta) = \sum_{j=1}^{d} |\theta_j|$
- Effect of λ
  - if λ is large: the training error has little impact on the loss
  - if λ is 0: regularization has no effect
  - if λ is negative: the higher the weights, the better (undesirable)
  - if λ is small: regularization is considered in the loss
- In Gradient Descent: $\theta := \theta - \alpha \left(\frac{1}{n} (X\theta - y)^T X + \lambda \theta\right)$
  - Ridge Regression (L2): $l(\theta, X, y) = \frac{1}{2n} \Sigma (\hat{y} - y)^2 + \frac{1}{2} \lambda \sum_{j=1}^{d} \theta_j^2$
  - Lasso Regression (L1): $l(\theta, X, y) = \frac{1}{2n} \Sigma (\hat{y} - y)^2 + \frac{1}{2} \lambda \sum_{j=1}^{d} |\theta_j|$
    - although the absolute value is not differentiable, there are ways to generalize gradient descent to non-differentiable functions
- Example (first five weights of polynomial models of increasing degree):
  - degree 1: -0.87, -0.17
  - degree 3: -2.15, 0.89, 0.34, -0.50
  - degree 5: 13.40, -1.31, -17.28, 1.36, 3.79
  - degree 7: 31.30, -5.53, -37.29, 4.48, 6.87
  - degree 10: 179.74, 94.89, -378.87, -185.43, 259.32
  - degree 12: -540.45, -1015.67, 1487.13, 2623.60, -1490.51
  - degree 30: -14156072.98, 16329720.76, 21300299.24, -14967132.99, 13066492.56
  - the wild swings are caused by large-magnitude weights

Effect of Ridge Regression / Effect of Lasso Regression
- Lasso is more prone to having weights set to 0, which typically means the feature should probably be removed, as it does not make a significant impact
- Ridge vs. Lasso Regression (with 1 feature)
  - the non-green circle represents the loss landscape
  - Ridge Regression:
    - the green circle represents errors where λR(θ) = 1 (the value can vary)
    - slope² + y-int² = 1
  - Lasso Regression:
    - the green diamond represents errors where λR(θ) = 1
    - |slope| + |y-int| = 1
  - Goal: find a point in the green region that is nearest to the minimum error
    - typically a side for Ridge (in 3D), a corner for Lasso (in 2D)
    - the regularization error is decided by the model during training, and the goal is to have the green region as close to the minimum as possible
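The regularized gradient-descent update above differs from the plain one only by the added λθ term. A minimal ridge sketch (reusing the data layout from the earlier linear-regression example; for brevity this sketch also penalizes the bias θ₀, which in practice is often excluded from the penalty):

```python
import numpy as np

def ridge_gd(X, y, alpha=0.01, lam=0.1, iters=5000):
    """Gradient descent on the L2-regularized loss:
    theta := theta - alpha * ((1/n)(X theta - y)^T X + lam * theta)."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        grad = (X @ theta - y) @ X / n + lam * theta  # data term + regularization term
        theta -= alpha * grad
    return theta
```

With lam=0 this reduces to plain gradient descent; larger lam shrinks the learned weights toward 0, which is exactly the effect the weight table above motivates.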
Logistic Regression

Logistic Regression
- supervised, classification learning algorithm
- a generalized linear model (a generalization of the concepts and abilities of linear models)
- features can be discrete or continuous, or a mix of both
- Ex: predict whether a tumor is malignant or benign based on its size
- Why not just use linear regression?
  - the predicted value can exceed the 0–1 range (which does not make sense for binary classification)
  - it lacks the flexibility to handle certain configurations

Sigmoid Function
- used to "map" the output of a linear regression model to the 0–1 range
- S-shaped, and always returns a value from 0 to 1 (exclusive)
- Formula: $\sigma(x) = \frac{1}{1 + e^{-x}}$
- Logistic Regression Model: $p = \frac{1}{1 + e^{-(\theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_d x_d + \theta_0)}}$, where $z = \theta_1 x_1 + \ldots + \theta_d x_d + \theta_0$ is the pre-sigmoid value
  - it first solves the value from a linear regression model (with parameters/features)
  - that value is then applied (as x) to the sigmoid function to be mapped (the prediction)

Interpretation of Model Output
- the probability that the input is classified as the positive class
  - Ex: if a point maps close to 1 on the sigmoid, it has a high probability, and vice versa
- mapping to a class:
  - set a threshold (e.g., 0.5)
  - if output ≥ threshold, classify as "yes"
  - if output < threshold, classify as "no"

Logistic Regression Loss Function
- Likelihood: $\prod_{i=1}^{n} (p^{(i)})^{y^{(i)}} (1 - p^{(i)})^{(1 - y^{(i)})}$
  - $(p^{(i)})^{y^{(i)}}$: used when the label is positive (1)
  - $(1 - p^{(i)})^{(1 - y^{(i)})}$: used when the label is negative (0)
- Problem: this value tends to be very small
  - Solution: use the natural log of the probability instead of the probability, as it preserves order
- Binary Cross-Entropy: $-\sum_{i=1}^{n} y^{(i)} \log(p^{(i)}) + (1 - y^{(i)}) \log(1 - p^{(i)})$
  - negated because loss functions typically treat lower scores as better (the sign flips the trend of the formula)
  - the log base is e
- Optimization Problem: $\arg\min_\theta l(\theta, X, y)$
- Derivative: $\frac{\partial}{\partial \theta}\left(-\sum_{i=1}^{n} y^{(i)} \log(p^{(i)}) + (1 - y^{(i)}) \log(1 - p^{(i)})\right) = (X\theta - y)^T X$
  - essentially the same form as the derivative of the mean squared error
  - gradient descent works the same way as in linear regression

Decision Boundary
- shows the prediction results across the feature space
- determined by the feature values $x_i$ where z = 0 (+z = "yes", −z = "no")
  - in a way, the feature line (the linear regression line) "decides" the boundary
- Examples (1 feature, 2 features):
  - even though the sigmoid is curvy, the decision boundaries are always straight lines (with respect to the feature space)

Multinomial Logistic Regression
- for more than 2 classes
- it is like creating separate classifiers (e.g., 3 for 3 classes) and choosing the one with the largest score (no more sigmoid)
  - regardless, they are trained together
- Softmax Function: $prob_i = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}$
  - Goal: convert the scores into a probabilistic representation that totals 1
  - can handle negative values
  - although the score alone can already determine the class, the softmax may still be needed:
    - to interpret results as probabilities
    - for the derivative, especially in neural networks
- the "label" for a single instance is converted to a one-hot encoded vector
- Loss Function: $l(\theta, X, y) = -\sum_{i=1}^{n} y^{(i)T} \log(p^{(i)})$
  - totals the log (base e) of the probabilities and negates it
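A minimal sketch of training binary logistic regression by gradient descent, assuming a synthetic 1-feature tumor-size example (the data, learning rate, and function names are hypothetical). Note the gradient is written here with the sigmoid output p in the residual, i.e. (p − y)ᵀX, which is the same structural form as the linear-regression gradient discussed above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_fit(X, y, alpha=0.1, iters=5000):
    """Minimize binary cross-entropy; gradient is (p - y)^T X with p = sigmoid(X theta)."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(X @ theta)             # predicted probability of the positive class
        theta -= alpha * (p - y) @ X / n   # same form as the linear-regression gradient
    return theta

# Hypothetical tumor-size data: label 1 = malignant; first column is x0 = 1
X = np.c_[np.ones(6), [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]]
y = np.array([0, 0, 0, 1, 1, 1])
theta = logreg_fit(X, y)
print(sigmoid(X @ theta) >= 0.5)  # threshold at 0.5 to map probabilities to classes
```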
Evaluation of Classification Models

Confusion Matrix
- shows statistics of the predictions (the table can be flipped)
- True Positive: predicted positive, actually positive
- True Negative: predicted negative, actually negative
- False Positive: predicted as positive but actually negative
- False Negative: predicted as negative but actually positive

Accuracy
- the number of correctly classified instances over all instances
- Formula: $\frac{TP + TN}{TP + FP + TN + FN}$

Precision
- out of all instances predicted as positive, how many are really positive?
- Formula: $\frac{TP}{TP + FP}$

Recall
- out of all positive instances, how many are predicted as positive?
- Formula: $\frac{TP}{TP + FN}$

F1-Score
- the harmonic mean of precision and recall
- Formula: $2 \times \frac{precision \times recall}{precision + recall}$
- a better alternative to accuracy:
  - imbalanced datasets amplify the issue where the model naively classifies everything as one class (e.g., predicts 0 all the time)
  - caused by the labels not being evenly distributed (such as almost all being 1 or 0)

For Multiclass Classification:
- needs a separate precision/recall/F1-score for each class
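The four metrics above computed directly from raw counts (the TP/FP/TN/FN values below are placeholders):

```python
def classification_metrics(tp, fp, tn, fn):
    accuracy  = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return accuracy, precision, recall, f1

print(classification_metrics(tp=40, fp=10, tn=45, fn=5))
```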
Training a Logistic Regression Model

Neural Networks

Recap: Binary Logistic Regression

Recap: Multinomial Logistic Regression
- in neural networks, the initial predictions are treated as intermediaries to one final prediction, instead of using the softmax to find the "best" prediction

Neural Networks
- intuitively, think of it as a bunch of logistic regression units stacked together
  - from the raw features, intermediate features are created first as a hidden layer, which are then used to predict the results
- come in all shapes and sizes:
  - a 2-layer network with 3 inputs, 3 neurons in the hidden layer, and 1 output
  - a 2-layer network with 4 inputs, 3 neurons in the hidden layer, and 1 output
  - a 4-layer network with 2 inputs, 5 neurons in each hidden layer, and 3 outputs

"Intermediate" Features
- automatically learned by the network
- not the same as features humans would create:
  - neural networks have no concept of "walkable", "school quality", or "family size" (the diagram above just visualizes what they could be); instead, they simply create whatever features transform the raw data into something more linearly separable
- consequently, the trained model cannot be explained at a human-concept level

Anatomy of a Neural Network
- Input Layer
  - the data X
  - has the shape n × d (n = no. of instances, d = no. of features)
- Hidden Layer
  - gives the network room to learn other "representations" of the data to lower the loss
  - an intermediate representation
  - pertains to θ and a in the diagram
- Output Layer
  - outputs the "answer" to the task
  - may consist of one or more neurons
  - can be designed for regression or classification
- Weights
  - same as the weights in logistic/linear regression, except there is a separate set for each layer (represented by the arrows in the diagram)
  - in the diagram, Layer 1 (θ) has 12 parameters; Layer 2 (θ) has 3 parameters
- Biases
  - provide the "intercept" for each layer; the input to these lines is always 1
  - represented as blue arrows in the diagram
- Activations
  - add non-linearity to the network
  - pertain to the nodes under a
  - Ex: sigmoid, tanh, ReLU

Why is it called a Neural Network?
- it is structured like the neurons in the brain, although neurons in the brain are more complex

Training a Neural Network
- Pseudocode:
    while i < iterations:
        sample training data
        forward propagation: get the sample's predictions and compute the loss by comparing predictions to labels (just like logistic regression)
        backpropagation: compute the gradients; adjust each weight and bias based on the gradient
        i++

1. Forward Propagation
- feed the input data to the network and compute the predictions; just let the numbers flow (can be vectorized)
- Example:
  - Representations:
    - θ holds the weights of the features/inputs
    - b holds the weights of the biases, whose input is always 1
    - θ is a 2D array containing $\theta_{ij}^{[n]}$, where i refers to a neuron in layer [n] and j refers to a neuron in the previous layer
  - Each row/instance of features is fed to the network for prediction (example given below for the first instance):
    - matrix-multiply the features of the instance by $\theta^T$, then add the bias term to get $Z^{[i]}$ (the score of each neuron)
    - Z (the pre-activation scores of the hidden-layer neurons):
      - 1st neuron: [0.1, 0.3, 1] · [0.2, 0.5, −0.1]ᵀ = −0.23; adding the bias term (0.3) gives Z₁ = 0.07
      - 2nd neuron: [0.1, 0.3, 1] · [0.1, −0.2, 0.3]ᵀ + 0.6 = 0.97
    - after getting Z (the left half of the neuron), apply the activation function to get the activations a (the right half)
      - in this example, the sigmoid serves as the activation function:
        - 1st neuron: $\frac{1}{1 + e^{-0.07}} = 0.51749$
        - 2nd neuron: $\frac{1}{1 + e^{-0.97}} = 0.72512$
      - Note: the sigmoid is just one example of an activation function; it may differ
    - repeat the process until the output layer
      - in this case, Z = 0.76602 and a = 0.68266, so the predicted y = 1 (the actual label is 0)
    - from these values, the loss function can be computed
    - Note: the values are often vectorized to solve everything together
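The vectorized forward pass described above, as a short NumPy sketch. The layer sizes and sigmoid activation mirror the walkthrough, but the specific weight and bias values from the diagram are not fully recoverable from the transcript, so the numbers here are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Forward propagation: for each layer, z = W a + b, then a = sigmoid(z)."""
    a = x
    cache = []                  # keep (z, a) per layer for backpropagation later
    for W, b in zip(weights, biases):
        z = W @ a + b           # pre-activation score (left half of the neuron)
        a = sigmoid(z)          # activation (right half of the neuron)
        cache.append((z, a))
    return a, cache

# Hypothetical 3-input -> 2-hidden -> 1-output network
weights = [np.array([[0.2, 0.5, -0.1],
                     [0.1, -0.2, 0.3]]),   # layer 1: 2 neurons x 3 inputs
           np.array([[0.4, 0.7]])]         # layer 2: 1 neuron x 2 hidden
biases = [np.array([0.3, 0.6]), np.array([0.1])]
y_hat, _ = forward(np.array([0.1, 0.3, 1.0]), weights, biases)
print(y_hat)  # predicted probability; compare to the label to compute the loss
```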
2. Compute the loss, based on how good the predictions are

3. Backward Propagation
- Goal: just like in linear/logistic regression, reduce the loss function
  - how do we adjust every weight of the network in the direction that lowers the loss?
- Approach: use derivatives to compute the effect of each parameter on the loss
  - "How does this line (weight) affect the loss?"
- Derivatives Recap: f(x, y, z) = (x + y)z
  - let q = x + y, so $\frac{\partial q}{\partial x} = 1$ and $\frac{\partial q}{\partial y} = 1$
  - f = qz, so $\frac{\partial f}{\partial z} = q$ and $\frac{\partial f}{\partial q} = z$
  - therefore $\frac{\partial f}{\partial x} = z$ and $\frac{\partial f}{\partial y} = z$ (by the chain rule)
- Applying Derivatives in a Neural Network:
  - W represents the weights of the inputs (Note: W and θ mean the same thing)
  - b represents the bias terms
  - z represents the score of the neuron before the activation function (in red/left)
  - a represents the score of the neuron after the activation function (in blue/right)
    - Note: the colors are not standard, just for visualization purposes
  - Goal: find how each weight affects the loss (to know how to reduce it)
  - therefore, compute the derivative of ŷ (the prediction) with respect to every W and b (every line): with respect to W₁₁, W₁₂, b₁, and so on until the first layer
- Computing the Derivatives:
  - Strategy: compute the derivatives of the neighboring operations first, then apply the chain rule later
  - Example/Explanation:
    - How does a₁ affect ŷ: $\frac{\partial \hat{y}}{\partial a_1} = 1$, because ŷ is based directly on the activation a₁ (the blue part)
    - How does z₁ affect a₁: $\frac{\partial a_1}{\partial z_1} = \frac{\partial \sigma(z_1)}{\partial z_1}$
      - in short, how the red part affects the blue part of the neuron
      - here the activation function is the sigmoid, so we take its derivative: $\frac{\partial \sigma(z_1)}{\partial z_1} = \sigma(z_1)(1 - \sigma(z_1))$
    - How does W₁₁ affect z₁: $\frac{\partial z_1}{\partial W_{11}} = \frac{\partial (W_{11} a_1 + W_{12} a_2 + b_1)}{\partial W_{11}} = a_1$
      - the other terms drop out, as they have nothing to do with W₁₁
      - follow this logic for the others (each red part is affected by the blue parts of the previous layer)
      - can be simplified/vectorized to $\frac{\partial z}{\partial W} = a$
    - How does b₁ affect z₁: $\frac{\partial z_1}{\partial b_1} = \frac{\partial (W_{11} a_1 + W_{12} a_2 + b_1)}{\partial b_1} = 1$
      - same idea as with the weight, except the only term kept is b₁
      - can also be solved all at once
    - How does the previous activation a₁ affect z₁: $\frac{\partial z_1}{\partial a_1} = W_{11}$
      - how the blue part (of the previous layer) affects the red part
      - same idea as with the weight
      - can be simplified/vectorized to $\frac{\partial z}{\partial a} = W$
  - repeat this for every layer
    - Note: everything can be vectorized to solve each component of each layer at once
  - Summary of the Derivatives:
    - $\frac{\partial \hat{y}}{\partial a^{[n]}} = 1$
    - $\frac{\partial a^{[i]}}{\partial z^{[i]}} = \sigma(z^{[i]})(1 - \sigma(z^{[i]}))$
    - $\frac{\partial z^{[i]}}{\partial W^{[i]}} = a^{[i-1]}$
    - $\frac{\partial z^{[i]}}{\partial a^{[i-1]}} = W^{[i]}$
    - $\frac{\partial z^{[i]}}{\partial b^{[i]}} = 1$
- Completing the Goal: How do the weights and biases affect ŷ?
  - Strategy: take the previously obtained derivatives and chain them together
  - Examples:
    - How does W₁₁ affect ŷ: $\frac{\partial \hat{y}}{\partial W_{11}} = \frac{\partial \hat{y}}{\partial a_1} \times \frac{\partial a_1}{\partial z_1} \times \frac{\partial z_1}{\partial W_{11}}$
      - substituting the formulas above gives $\frac{\partial \hat{y}}{\partial W_{11}} = \sigma(z_1)(1 - \sigma(z_1))\, a_1$
    - How does a₂ affect ŷ: $\frac{\partial \hat{y}}{\partial a_2} = \frac{\partial \hat{y}}{\partial a_1} \times \frac{\partial a_1}{\partial z_1} \times \frac{\partial z_1}{\partial a_2}$
    - How does a₁ affect ŷ when a₁ feeds two neurons: $\frac{\partial \hat{y}}{\partial a_1} = \left(\frac{\partial \hat{y}}{\partial z_1} \times \frac{\partial z_1}{\partial a_1}\right) + \left(\frac{\partial \hat{y}}{\partial z_2} \times \frac{\partial z_2}{\partial a_1}\right)$
      - more complex as it has more paths; just add the derivatives of the two paths
      - Note: the answer can be expanded further (an example of dynamic programming)
  - Note: this only computes how each weight affects the prediction, which is just one step in knowing how each weight affects the loss (we still need the derivative of the loss with respect to the prediction)
    - in short, more chain rule

4. Repeat Step 1: feed the input and get the predictions; repeat until the desired performance is achieved
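Putting the summary derivatives and the chain rule together, here is a minimal sketch of one training step for a 2-layer sigmoid network, assuming a squared-error loss (the notes do not fix a loss here, so that choice and the function names are assumptions; weight shapes follow the earlier forward-pass sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W1, b1, W2, b2, alpha=0.1):
    """One forward + backward pass for a 2-layer sigmoid network, loss = 0.5*(y_hat - y)^2."""
    # Forward propagation
    z1 = W1 @ x + b1;   a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2;  y_hat = sigmoid(z2)

    # Backward propagation (chain rule, outermost layer first)
    dz2 = (y_hat - y) * y_hat * (1 - y_hat)  # dL/dz2 = (y_hat - y) * sigma'(z2)
    dW2 = np.outer(dz2, a1)                  # dz/dW = previous activation
    db2 = dz2                                # dz/db = 1
    da1 = W2.T @ dz2                         # dz/da_prev = W
    dz1 = da1 * a1 * (1 - a1)                # da/dz = sigma(z)(1 - sigma(z))
    dW1 = np.outer(dz1, x)
    db1 = dz1

    # Gradient descent update on every weight and bias
    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2
    return W1, b1, W2, b2
```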
Gradient Descent
- generally pertains to Batch Gradient Descent
  - for each iteration, the entire dataset is used to compute the loss
  - compute loss → adjust weights → feed to network → repeat until minimized
- Other Types:
  - Stochastic Gradient Descent
    - for each iteration, the data is fed one instance at a time, and the loss is computed based on that
    - adjustment occurs one instance at a time and is based only on the fed data, without taking the previous instances into account
    - when all instances have been fed, that is 1 epoch
      - an epoch refers to one cycle through everything in the training data
      - an epoch can be 1 iteration, 8 iterations, etc., and multiple epochs must be run in stochastic gradient descent
    - despite using a different example per iteration to adjust, it can still approximate the minimum of the loss quite well, though the path can be jittery
    - a good alternative for large datasets (since batch is slower)
  - Mini-Batch Gradient Descent
    - for each iteration, the data is fed one batch/subset at a time, and the loss is computed based on that
    - a middle ground between batch and stochastic gradient descent
    - has a hyperparameter (a self-defined parameter), the batch size, which refers to how many examples to feed at a time
    - in cases where the last batch has leftover instances, handling depends on the implementer (take the remainder as-is, or fold it into the previous iteration)
    - runs faster than gradient descent and is less jittery than stochastic
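The three variants differ only in how many instances feed each update. A minimal mini-batch loop sketch (grad_fn is a hypothetical placeholder for whichever gradient the model defines; the shuffling and batching logic is the point here):

```python
import numpy as np

def minibatch_gd(X, y, theta, grad_fn, alpha=0.01, batch_size=32, epochs=10):
    """One epoch = one full pass over the shuffled training data, taken in batches."""
    n = len(y)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(n)                     # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]      # last batch may be the smaller "remainder"
            theta -= alpha * grad_fn(X[idx], y[idx], theta)
    return theta

# batch_size=1 gives stochastic GD; batch_size=n gives (batch) gradient descent
```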
Designing the Neural Network Architecture

Number of Input Neurons
- based on the number of dimensions of the feature vector (the number of features)
- depends on the formulation of the task, but must be fixed
- Examples:
  - for images, maybe 100x100 pixels where each pixel is represented in RGB, resulting in 30,000 input neurons
  - to make a system that takes review text and classifies it as positive or negative, one of the following approaches could work:
    - 2 features, one for positive words and another for negative words, determined by a dictionary (this can instead be rule-based)
    - Bag of Words: each feature is a word in a predefined dictionary, and the value represents how many times the word appears in the text; n input neurons, where n is the number of words in the dictionary
    - Note: the state of the art uses sentence embeddings instead

Number of Output Neurons and Activation Function of the Output Layer
- Regression
  - only 1 output neuron with no activation function (sometimes called a linear activation function)
  - the neuron can output any real number, which represents the prediction
- Binary Classification
  - only 1 output neuron with a sigmoid activation function
  - the output is between 0 and 1, and can be interpreted as the predicted probability that the instance belongs to the positive class
  - Ex: for the review, the probability that it is a positive (or negative) review
- Multiclass Classification
  - use k output neurons and apply a softmax layer after them (k is the number of classes)
  - this setup outputs k values representing the predicted probability distribution across the k classes
  - the softmax is there to normalize the values

Number of Hidden Layers and Number of Neurons in Each Layer
- no universal way to determine the best number of hidden layers and the number of neurons in each layer
- need to try out different combinations and see what works best (through hyperparameter tuning)
- things to keep in mind:
  - start with something simple
  - look at related literature on the domain to get a good starting point

Activation Function in Each Hidden Layer
- determines how a is computed
- examples include the sigmoid, tanh, and ReLU functions
- Why is activation needed?
  - without it, the network devolves into simple linear regression
  - activation functions provide the non-linearity, which is what allows the hidden layers to help in the first place
- Example Functions:
  - Sigmoid Function
    - $\sigma(x) = \frac{1}{1 + e^{-x}}$, Range: (0, 1)
    - Problems:
      - saturates near 0 or 1, and saturation "kills" the neurons, often referred to as the Vanishing Gradients Problem
        - for neurons with outputs close to 0 or 1, the gradient is a small value, which is reduced further when the chain rule is applied, causing the network to make only very minimal adjustments and do very little "learning"
        - the resulting derivative may cause underflow (where the value is so small it is treated as effectively 0)
      - $e^{-x}$ is expensive to compute
      - outputs are not zero-centered (they are generally centered around 0.5)
        - Zero-Centered Outputs: if the inputs to a neuron are all positive, the gradients of the weights will be either all positive or all negative; therefore, in gradient descent one of two things can happen: all weights increase, or all weights decrease
        - Issue: updates are forced to zigzag, making training take even longer
  - Tanh Function
    - $\tanh(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}}$, Range: (-1, 1)
    - Problems: saturates near -1 or 1; $e^{-x}$ is expensive to compute
    - Advantages: outputs are zero-centered
  - ReLU Function
    - $ReLU(x) = \max(0, x)$: if the score is positive, output it; if the score is negative, output 0 instead
    - Range: [0, ∞)
    - the preferred activation function for deep learning
    - Problems: saturates on one side; outputs are not zero-centered
    - Advantages: saturates only on one side; faster to compute
  - Leaky ReLU Function
    - $f(x) = \max(0.01x, x)$
    - has a generalized version called the Parametric Rectifier (PReLU): $f(x) = \max(ax, x)$ where a is a hyperparameter
    - Problems: saturates on one side; outputs are not zero-centered, but a bit better
    - Advantages: saturates only on one side; faster to compute; converges faster
- In practice:
  - ReLU is the default activation function; it works well in most problems
  - be careful with learning rates
  - the sigmoid is almost never used in practice
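The activation functions listed above, written directly as NumPy helpers (a small sketch; function names are hypothetical):

```python
import numpy as np

def sigmoid(x):     return 1.0 / (1.0 + np.exp(-x))                   # range (0, 1)
def tanh_fn(x):     return (1 - np.exp(-2 * x)) / (1 + np.exp(-2 * x))  # range (-1, 1)
def relu(x):        return np.maximum(0.0, x)                          # range [0, inf)
def leaky_relu(x):  return np.maximum(0.01 * x, x)                     # small slope for x < 0
def prelu(x, a):    return np.maximum(a * x, x)                        # a is a hyperparameter

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x), leaky_relu(x))  # compare how the negative side is handled
```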
Other Considerations in Setting Up the Network

Weights Initialization
- if all weights are set to the same value (e.g., 0), the derivative of the loss with respect to each line would just be the same (adjustments would all go the same way)
- weights are normally initialized with small random numbers
  - normally through a Gaussian with μ = 0 and σ = 0.01
  - okay for small networks, but can lead to non-homogeneous activations for deeper networks
- Preferred Ways:
  - Xavier Initialization
    - draw from a random distribution, then divide by the square root of the number of inputs to the previous layer
    - used with tanh activation
  - He Initialization
    - draw from a random distribution, then divide by the square root of (the number of inputs to the previous layer divided by 2)
    - used with ReLU activation

Regularization: methods to prevent a machine learning model from overfitting
- Batch Normalization
  - if features are on the same scale, it helps the network converge faster
  - can reduce the zigzags in the loss landscape
  - since hidden layers get their input from the previous layer, it makes sense to normalize those as well
  - For each batch:
    - compute the mean and variance and normalize the values: $\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}$
    - allow the network to "squash" the range (i.e., change the mean and variance) to lower the loss: $y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$
    - Note: γ and β for each batch-norm layer are also parameters of the network (they are also adjusted during gradient descent)
- Dropout
  - the network can't rely on a single feature, so it spreads out the weights
  - for each batch, randomly "kill" (set to 0) certain neurons in the hidden layers (although not fully mathematically proven)
  - Implementation:
    - set a hyperparameter keep-prob, the fraction of nodes to keep for each layer
      - keep-prob = 1.0: no nodes will be dropped out
      - keep-prob = 0.75: ¼ of the nodes will be dropped out
    - for each iteration, a random set of nodes is dropped (i.e., set to 0)

More Sophisticated Optimization/Learning Algorithms: allow ML to minimize the cost
- Goals:
  - converge to an ideal minimum
  - converge as quickly as possible
- Problems with Vanilla Gradient Descent
  - zigzagging on a certain parameter
  - local minima: in neural networks, the loss landscape is not expected to be always convex, which can result in early convergence
  - saddle points: similar to local minima, where the gradient in certain directions is close to 0
- Gradient Descent with Momentum
  - overcomes saddle points and local minima
  - instead of the raw derivative, it uses the velocity: $V_{\partial\theta} = \beta V_{\partial\theta} + (1 - \beta)\partial\theta$, with the update $\theta = \theta - \alpha V_{\partial\theta}$
  - a correction is sometimes added to account for the fact that the moving average is not very accurate in the beginning iterations: $V_{\partial\theta} = \frac{\beta V_{\partial\theta} + (1 - \beta)\partial\theta}{1 - \beta^t}$
- Gradient Descent with RMSProp
  - puts on the brakes when gradients are always large, to avoid zigzagging
  - $S_{\partial\theta} = \beta S_{\partial\theta} + (1 - \beta)\partial\theta^2$, with the update $\theta = \theta - \alpha \frac{\partial\theta}{\sqrt{S_{\partial\theta}}}$
  - if the squared gradients over the past iterations (the denominator) get too large, brake
- ADAM (Adaptive Moment Estimation)
  - a mix of both momentum and RMSProp:
    - 1st-order optimization (momentum): $V_{\partial\theta} = \frac{\beta_1 V_{\partial\theta} + (1 - \beta_1)\partial\theta}{1 - \beta_1^t}$
    - 2nd-order optimization (RMSProp): $S_{\partial\theta} = \frac{\beta_2 S_{\partial\theta} + (1 - \beta_2)\partial\theta^2}{1 - \beta_2^t}$
    - update: $\theta = \theta - \alpha \frac{V_{\partial\theta}}{\sqrt{S_{\partial\theta}} + \epsilon}$
  - In practice, ADAM is a good default choice in many cases
    - β₁ = 0.9 (good default choice)
    - β₂ = 0.999 (good default choice)
    - α still has to be tuned
    - ε = 10⁻⁷ or 10⁻⁸ (does not really affect performance much)
- Slowly decay the learning rate over time
  - usually, decay every epoch
  - some people manually decay it by observing the training process
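A minimal sketch of one ADAM step following the formulas above, with the bias-correction terms and the default β₁ = 0.9 and β₂ = 0.999 noted in the notes (function name and calling convention are assumptions):

```python
import numpy as np

def adam_update(theta, grad, V, S, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM step: momentum (1st moment) + RMSProp (2nd moment), both bias-corrected."""
    V = beta1 * V + (1 - beta1) * grad          # 1st-order (momentum) term
    S = beta2 * S + (1 - beta2) * grad ** 2     # 2nd-order (RMSProp) term
    V_hat = V / (1 - beta1 ** t)                # bias correction for early iterations (t starts at 1)
    S_hat = S / (1 - beta2 ** t)
    theta = theta - alpha * V_hat / (np.sqrt(S_hat) + eps)
    return theta, V, S
```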
Tips and Tricks in Training ML Models

Preprocessing
Things to consider before training the model:
- Data cleaning
  - Handling missing values:
    - Dropping: remove rows and/or columns with missing values
    - Imputation: use the average value of the most similar instance/s
  - Data coming from multiple sources:
    - different representations of text/numbers (e.g., Male/M/male can be simplified to just M)
  - Duplicate data
    - in some cases, it makes sense for the data to have duplicates
    - in other cases, duplicates are caused by errors in encoding
- Exploratory data analysis
  - different information available in the data
  - range of values of each variable
  - distributions of each variable
  - handling outliers:
    - remove
    - transform (log, square root, etc.)
    - impute (replace)
    - Note: the option depends on the context of the data and the problem
  - correlations between the different variables
    - correlation matrix
    - Note: some models don't do well with highly correlated features
- Scaling and normalization
  - Normalization: $x_{normalized} = \frac{x - \min(x)}{\max(x) - \min(x)}$
  - Standardization: $x_{standardized} = \frac{x - mean(x)}{stddev(x)}$
  - Normalization vs. Standardization:
    - Normalization
      - values fall between 0 and 1
      - more sensitive to outliers
      - useful when we don't know the distribution of the underlying data
    - Standardization
      - values are not constrained to a range
      - less sensitive to outliers
      - useful when the data follows a normal or Gaussian distribution
  - Normalization/Standardization Pipeline
    - when normalizing, the min and max should only be based on the training data (not the test data)
    - when standardizing, the mean and stddev should only be based on the training data (not the test data)
    - the test data should then be normalized/standardized according to the statistics of the training data, to avoid data leakage
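The pipeline rule above (fit the scaler on the training data only, then apply its statistics to the test data) looks like this with sklearn, which the notes mention for encoding; the arrays are placeholders:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 3)          # placeholder feature matrix
y = np.random.randint(0, 2, 100)    # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # mean/stddev computed from training data only
X_test_std  = scaler.transform(X_test)       # test data reuses the training statistics (no leakage)
```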
If not, we can’t expect that a computer can.” r.e may be caused by faulty generation of training data - Bias-Variance Analysis kt lin s te - Train-Val-Test Split Example: Climate data for different countries no - Idea 1: Splitting the data like this: cs Not a good way since model is likely trained and tested on different data distributions - Idea 2: Shuffle first before splitting: e/ Allow it to come from similar distributions r.e Example: Recognizing images of dogs uploaded by users in an app - Since there are not enough training data, augment it by searching for additional images of dogs online Ex: ~200,000 data from the web, ~10,000 from users/app kt - Idea 1: Shuffle all the data and split lin Not a good way since model will be evaluated mainly on the overwhelming web data - Idea 2: Web data should only be on the training set Better since it targets data the model will be encountering in the real world - Errors only happening in deployment Do a manual error analysis between collected data vs the data it encountered in the real-world during deployment to check discrepancies Ex (Audio): Match data to real world by synthesizing Ex (Animal Classifier): - Task: image classification - Performance: 20% error (validation set) - Cursory look at data shows some dogs look like other animals and have incorrect labels: Manually examine the mistakes Get ~100 mislabeled examples Count how many are incorrect because dogs look like other s animals te - If 5/100, ~1% error will be reduced - If 50/100, ~10% error will be reduced - Conduct error analysis: no cs Decide whether resolving these problems is needed for model purposes Hyperparameter Tuning e/ - Selecting the best set of hyperparameters for a given training procedure - General idea: Try out different combinations of hyperparameters and choose the best performing model on the validation set r.e - Hyperparameter Search: search from coarse (large ranges) to fine (narrow) Size of hidden layers: linear scale (ex. uniform random from 50-200) Number of layers: linear scale (ex. uniform random from 2-6) - can be exhaustive since only few possible values kt Learning rate: log scale (ex. r = uniform random from -1 to -8, learning rate a = 10 r ) Regularization: log scale (ex. r = uniform random from -1 to -6, lin regularization strength λ = 10 r ) General ML Pipeline Build a simple viable initial system ASAP - Lots of noise, little structure → shallow nn - Little noise, complex structure → deep nn - No structure → fully connected - Spatial structure → convolutional - Sequential structure → recurrent - Little data → Bayesian models or transfer learning - Simple structure: use what you know (linreg, logreg, decision trees, ada/xgboost, random levels, random forests, svm) Analyze errors and perform error analysis, decide what to prioritize next - Determine the bottlenecks in performance - Diagnose which components are performing worse than expected - Determine whether poor performance is due to overfitting, underfitting, or defect in the data or software - Repeat as needed s Ex: Suppose you are building a speech recognition system, there are lots of directions to te go into noise, accent, far/near from microphone, child’s speech, stutter no cs e/ r.e kt lin Naive Bayes Recall Probability: For instance, if there is a fair coin (50% head and 50% tails), and it is thrown 20 times, how many heads and how many tails do we expect to observe? 
General ML Pipeline
- Build a simple viable initial system ASAP
  - lots of noise, little structure → shallow NN
  - little noise, complex structure → deep NN
  - no structure → fully connected
  - spatial structure → convolutional
  - sequential structure → recurrent
  - little data → Bayesian models or transfer learning
  - simple structure: use what you know (linear regression, logistic regression, decision trees, AdaBoost/XGBoost, random forests, SVM)
- Analyze errors and perform error analysis; decide what to prioritize next
  - determine the bottlenecks in performance
  - diagnose which components are performing worse than expected
  - determine whether poor performance is due to overfitting, underfitting, or a defect in the data or software
- Repeat as needed
- Ex: suppose you are building a speech recognition system; there are lots of directions to go into: noise, accents, far/near from the microphone, children's speech, stutter

Naive Bayes

Recall: Probability
- For instance, if there is a fair coin (50% heads and 50% tails) and it is thrown 20 times, how many heads and how many tails do we expect to observe?
  - "if there is a fair coin (50% heads and 50% tails)" is the random process that generated the observation
  - Answer: 10 heads and 10 tails (the observation generated by the random process)
- Throwing 20 coins resulted in 4 heads and 16 tails. Is it a fair coin? If not, what is it?
  - A coin with 20% heads and 80% tails probabilities
- Given the following data (a table of weather observations for a planet):
  - What is the probability that it will rain on this planet? 3/7 = 0.4286 = 42.86% (an estimate of the probability)
  - What is the probability that the humidity is low? 2/7 = 0.2857 = 28.57%
  - As can be seen, the probability is $P(A) = \frac{\text{number of times event A happened}}{\text{total number of observations}}$
    - Note: the probability estimate becomes more reliable with more samples
  - What is the probability that the humidity is low given that it is not raining? 1/4 = 0.25 = 25%
    - only observations fitting the first condition (not raining) count toward the total
    - represented as $P(A|B) = \frac{P(A \cap B)}{P(B)}$, or "the probability of A given B"

Sample Classification Task (Non-Naive Bayes)
- Task: predict whether a given student will pass the ML class based on their math grade and number of hours studying per week
- What can we "learn" from the historical data?
  - assume the dataset consists of observations generated by a random process
  - from a statistical point of view, we can infer conditional probabilities from counts, e.g., P(ML=passed | Math=bad ∩ Study≥4 hours) = 0.5, and similarly for the other combinations of feature values and labels

Support Vector Machines
- if y = 1, we want $h_\theta(x) \approx 1$, $\theta^T x > 0$; if y = 0, we want $h_\theta(x) \approx 0$, $\theta^T x < 0$
- if the correct answer is y = 1, we want $\theta^T x \geq 1$ (not just > 0); if y = 0, we want $\theta^T x \leq -1$ (not just < 0)
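The probabilities above can be estimated directly from counts. A small sketch in the style of the rain/humidity example (the observation values are placeholders, since the original table did not survive extraction, but they are chosen to reproduce the quoted estimates):

```python
import numpy as np

# Placeholder observations: (rain?, humidity) pairs in the spirit of the planet example
rain     = np.array([1, 1, 1, 0, 0, 0, 0])  # 3 of 7 days rained
humidity = np.array(["low", "high", "high", "low", "high", "high", "high"])

p_rain = rain.mean()                 # P(rain) = 3/7
p_low  = (humidity == "low").mean()  # P(humidity = low) = 2/7
# P(low | not raining) = P(low AND not raining) / P(not raining)
p_low_given_dry = ((humidity == "low") & (rain == 0)).mean() / (rain == 0).mean()
print(p_rain, p_low, p_low_given_dry)  # 0.4286, 0.2857, 0.25
```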