Logistic Regression PDF
Jasaswi Prasad Mohanty
Summary
These notes cover logistic regression for classification problems, from the intuition behind the method to the cost function, gradient-based training, regularization, and multiclass (one-vs-rest) classification. They then introduce the k-nearest neighbours (KNN) algorithm for classification and regression, together with the distance metrics commonly used to measure similarity between data points.
Full Transcript
MODULE II: Logistic Regression

Logistic regression solves classification problems.

Classification Problem
- Email: spam / not spam?
- Online transactions: fraudulent (yes/no)?
- Tumour: malignant / benign?

For a binary classification problem $y \in \{0, 1\}$, where 0 is the negative class (e.g. benign tumour) and 1 is the positive class (e.g. malignant tumour). For a multiclass classification problem $y \in \{0, 1, 2, 3, \dots\}$.

Intuition
Apply linear regression to the classification problem with hypothesis $h_\theta(x) = \theta^T x$, where $x = (1, x_1)^T$ and $x_1 \in \mathbb{R}$ (e.g. tumour size), and threshold the output at 0.5: if $h_\theta(x) \ge 0.5$, predict $y = 1$; if $h_\theta(x) < 0.5$, predict $y = 0$. On the original data, linear regression appears to perform reasonably well.

Now add one more training example of the malignant type. The linear regression hypothesis changes, and 2 out of 5 positive data points are now misclassified. It is clear that linear regression does not work for classification problems. Moreover, the value of the hypothesis $h_\theta(x)$ can be greater than 1 or less than 0, whereas in a classification problem $y = 1$ or $y = 0$. So we use logistic regression, for which $0 \le h_\theta(x) \le 1$.

Logistic Regression Model
To fix the problem discussed above, we use the following function, known as the logistic function or sigmoid function:
$$g(z) = \frac{1}{1 + e^{-z}}$$
So our hypothesis becomes
$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$
Note that $g(z) \to 1$ as $z \to \infty$ and $g(z) \to 0$ as $z \to -\infty$. Hence $g(z)$, and therefore also $h_\theta(x)$, is always bounded between 0 and 1.

Interpretation of the Hypothesis Output
$h_\theta(x)$ is the estimated probability that $y = 1$ on input $x$.
Example: if $x = (1, x_1)^T$ with $x_1$ = tumour size and $h_\theta(x) = 0.8$, we tell the patient there is an 80% chance of the tumour being malignant.
$h_\theta(x) = P(Y = 1 \mid X = x; \theta)$: the probability that $Y = 1$ given $x$, parameterized by $\theta$.
We know $g(z) \ge 0.5$ when $z \ge 0$. Hence $h_\theta(x) = g(\theta^T x) \ge 0.5$ whenever $\theta^T x \ge 0$, and $h_\theta(x) = g(\theta^T x) < 0.5$ whenever $\theta^T x < 0$.

Logistic Regression
Logistic regression is used to solve the two-class classification problem. Instead of directly predicting the class of a given pattern $x$, the method estimates the probability of the class to which $x$ belongs. Let
$$P(Y = 1 \mid X = x; \theta) = p(x; \theta) = \frac{1}{1 + e^{-\theta^T x}}, \qquad x = (1, x_1, \dots, x_p)^T,\ x_i \in \mathbb{R} \qquad (1)$$
where $\theta = (\theta_0, \theta_1, \dots, \theta_p)^T$. As it is a probability, its value lies between 0 and 1, i.e. $0 \le p(x; \theta) \le 1$. The function used is the sigmoid (logistic) function $g(z) = 1/(1 + e^{-z})$, $z \in \mathbb{R}$.
$$P(Y = 0 \mid X = x; \theta) = 1 - p(x; \theta) \qquad (2)$$
These two class-conditional probabilities can be written as a single expression:
$$P(Y = y \mid X = x; \theta) = p(x; \theta)^{y} \, (1 - p(x; \theta))^{1 - y}, \qquad y \in \{0, 1\} \qquad (3)$$

Given the training data set $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$, $x^{(i)} \in \mathbb{R}^p$, $y^{(i)} \in \{0, 1\}$, assumed independently and identically distributed (each data point comes from the same distribution, and the occurrence of one data point does not affect the others), the likelihood function $L(\theta)$ of $D$ is
$$L(\theta) = \prod_{i=1}^{m} p(x^{(i)}; \theta)^{y^{(i)}} \, (1 - p(x^{(i)}; \theta))^{1 - y^{(i)}} \qquad (4)$$
By the maximum likelihood principle, the best estimates of the parameters $\theta$ are those that maximize the likelihood function $L(\theta)$.
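As a quick illustration of the model above, here is a minimal NumPy sketch (not from the original slides) of the sigmoid function and the hypothesis $h_\theta(x)$; the feature vector is assumed to carry a leading 1 for the intercept, and the parameter values are purely illustrative.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x): estimated probability that y = 1 given x."""
    return sigmoid(np.dot(theta, x))

# Hypothetical example: x = (1, tumour size); theta is chosen for illustration only.
theta = np.array([-4.0, 1.5])
x = np.array([1.0, 3.6])          # leading 1 is the intercept feature x_0
print(hypothesis(theta, x))       # estimated probability that the tumour is malignant
```

With these illustrative numbers $h_\theta(x) = g(1.4) \approx 0.80$, matching the "80% chance of malignancy" reading described above.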
This is equivalent to estimating the $\theta$ that maximizes the log-likelihood function
$$l(\theta) = \log L(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log p(x^{(i)}; \theta) + (1 - y^{(i)}) \log(1 - p(x^{(i)}; \theta)) \right] \qquad (5)$$
So the $\theta_j$'s, $j = 0, 1, 2, \dots, p$, are estimated by solving the problem
$$\max_{\theta} \; l(\theta) \qquad (6)$$
The $\theta_j$'s can be determined by setting
$$\frac{\partial l(\theta)}{\partial \theta_j} = 0, \qquad j = 0, 1, 2, \dots, p \qquad (7)$$
and solving the resulting $(p+1)$ equations.

We estimate the $\theta_j$'s by solving the optimization problem (6) using the gradient ascent method or Newton's method. We compute the gradient of $l(\theta)$ as follows. Writing $p_i = p(x^{(i)}; \theta)$,
$$\frac{\partial l(\theta)}{\partial \theta_j} = \frac{\partial}{\partial \theta_j} \sum_{i=1}^{m} \left[ y^{(i)} \log p_i + (1 - y^{(i)}) \log(1 - p_i) \right] = \sum_{i=1}^{m} \left( \frac{y^{(i)}}{p_i} - \frac{1 - y^{(i)}}{1 - p_i} \right) \frac{\partial p_i}{\partial \theta_j}$$
Here $p_i = p(x^{(i)}; \theta) = g(z_i) = \frac{1}{1 + e^{-z_i}}$ with $z_i = \theta^T x^{(i)}$. For $g(z) = \frac{1}{1 + e^{-z}}$,
$$g'(z) = \frac{d}{dz}\left( \frac{1}{1 + e^{-z}} \right) = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \left( 1 - \frac{1}{1 + e^{-z}} \right) = g(z)\,(1 - g(z))$$
Also,
$$\frac{\partial z_i}{\partial \theta_j} = \frac{\partial}{\partial \theta_j} \left( \theta_0 + \theta_1 x_1^{(i)} + \theta_2 x_2^{(i)} + \dots + \theta_j x_j^{(i)} + \dots + \theta_p x_p^{(i)} \right) = x_j^{(i)}$$
Hence, by the chain rule,
$$\frac{\partial p_i}{\partial \theta_j} = g(z_i)\,(1 - g(z_i))\, x_j^{(i)} = p_i (1 - p_i)\, x_j^{(i)}$$
Substituting,
$$\frac{\partial l(\theta)}{\partial \theta_j} = \sum_{i=1}^{m} \left( \frac{y^{(i)}}{p_i} - \frac{1 - y^{(i)}}{1 - p_i} \right) p_i (1 - p_i)\, x_j^{(i)} = \sum_{i=1}^{m} \left[ y^{(i)}(1 - p_i) - (1 - y^{(i)})\, p_i \right] x_j^{(i)} = \sum_{i=1}^{m} (y^{(i)} - p_i)\, x_j^{(i)}$$
Collecting all components,
$$\nabla_\theta\, l(\theta) = \begin{bmatrix} \sum_{i=1}^{m} (y^{(i)} - p_i) \\ \sum_{i=1}^{m} (y^{(i)} - p_i)\, x_1^{(i)} \\ \vdots \\ \sum_{i=1}^{m} (y^{(i)} - p_i)\, x_p^{(i)} \end{bmatrix} = \sum_{i=1}^{m} (y^{(i)} - p_i)\, x^{(i)} \qquad (8)$$
Let
$$X = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} & \cdots & x_p^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} & \cdots & x_p^{(2)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(m)} & x_2^{(m)} & \cdots & x_p^{(m)} \end{bmatrix}_{m \times (p+1)} = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}, \quad y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}_{m \times 1}, \quad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_p \end{bmatrix}_{(p+1) \times 1} \qquad (9)$$
$$p = \begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_m \end{bmatrix}, \qquad p_i = p(x^{(i)}; \theta) = \frac{1}{1 + e^{-\theta^T x^{(i)}}} \qquad (10)$$
Then we can write (8) as
$$\nabla_\theta\, l(\theta) = X^T (y - p) \qquad (11)$$

Algorithm for Computing θ: Gradient Ascent Method
Input: $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$, where $x^{(i)} \in \mathbb{R}^p$ and $y^{(i)} \in \{0, 1\}$
Output: $\theta$
1. Initialize $\theta_{old}$, choose $\epsilon$ (tolerance or accuracy) and $\alpha \in (0, 1)$.
2. Compute $y$ and $X$ using equation (9).
3. For $i = 1, 2, 3, \dots, m$ compute $p_i = \frac{1}{1 + e^{-\theta_{old}^T x^{(i)}}}$ using equations (1) and (10).
4. Form the vector $p$ using equation (10).
5. $\theta_{new} \leftarrow \theta_{old} + \alpha\, X^T (y - p)$, using equation (11).
6. If $\|\Delta\theta\| < \epsilon$ for 5 iterations in a row, where $\Delta\theta = \theta_{new} - \theta_{old}$, return $\theta_{new}$.
7. Else set $\theta_{old} \leftarrow \theta_{new}$ and go to step 3.

Making Predictions
After obtaining the estimate of $\theta$, the class probabilities for a new input $x$ are computed by
$$P(Y = 1 \mid X = x) = \frac{1}{1 + e^{-\theta^T x}}, \qquad x = (1, x_1, \dots, x_p)^T \qquad (12)$$
$$P(Y = 0 \mid X = x) = 1 - P(Y = 1 \mid X = x)$$
If $P(Y = 1 \mid X = x) > 0.5$ we conclude that $x$ is in class 1; otherwise it is in class 0. The threshold value in equation (12) may be chosen to be different from 0.5 depending on the problem.
Writing $\delta = -\theta^T x$, we have
$$\frac{1}{1 + e^{\delta}} > \frac{1}{2} \iff 1 + e^{\delta} < 2 \iff e^{\delta} < 1 \iff \ln e^{\delta} < \ln 1 \iff \delta < 0$$
so equation (12) may be restated as: if $\delta = -\theta^T x < 0$, assign $x$ to class 1, else to class 0. (13)

Logistic Regression Cost Function
$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)})$$
$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$
For $y = 1$: if $h_\theta(x) = 1$, the cost is 0; if $h_\theta(x) \to 0$, the cost $\to \infty$.
For $y = 0$: if $h_\theta(x) = 0$, the cost is 0; if $h_\theta(x) \to 1$, the cost $\to \infty$.
Combining the two cases:
$$\mathrm{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))$$

Logistic Regression: Gradient Descent Method
The logistic regression cost function is
$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$$
In statistics, maximum likelihood estimation (MLE) is widely used to obtain the parameters of a distribution. In this paradigm, maximizing the log-likelihood is equivalent to minimizing the above cost function $J$.
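The gradient ascent procedure above translates directly into code. The sketch below is illustrative and not from the slides: the stopping rule is simplified to a single gradient-step check rather than the "5 iterations in a row" test, the $1/m$ scaling of the step is an added choice for numerical stability, and all parameter values (alpha, tol, max_iter) are arbitrary.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gradient_ascent(X_raw, y, alpha=0.1, tol=1e-6, max_iter=10000):
    """Maximize the log-likelihood l(theta) by gradient ascent, equations (9)-(11).

    X_raw : (m, p) feature matrix without the intercept column
    y     : (m,) array of 0/1 labels
    Returns an estimated theta of length p + 1.
    """
    m = X_raw.shape[0]
    X = np.hstack([np.ones((m, 1)), X_raw])          # design matrix of eq. (9)
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = sigmoid(X @ theta)                       # vector p of eq. (10)
        grad = X.T @ (y - p)                         # gradient of l(theta), eq. (11)
        theta_new = theta + (alpha / m) * grad       # ascent step; 1/m scaling added here
        if np.linalg.norm(theta_new - theta) < tol:  # simplified convergence test
            return theta_new
        theta = theta_new
    return theta

def predict_class(theta, X_raw, threshold=0.5):
    """Predict class 1 when P(Y = 1 | x) exceeds the threshold, as in eq. (12)."""
    m = X_raw.shape[0]
    X = np.hstack([np.ones((m, 1)), X_raw])
    return (sigmoid(X @ theta) > threshold).astype(int)
```

Because $\nabla J(\theta) = -\frac{1}{m} \nabla l(\theta)$, a gradient descent step on $J$ with the same learning rate performs exactly this update, which is why the $1/m$ factor appears in the step above.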
Maximizing $l(\theta)$ and minimizing $J(\theta)$ are therefore two views of the same convex optimization problem; here we want to minimize $J(\theta)$.
Algorithm: repeat until convergence
{
  $\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$, where $\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})\, x_j^{(i)}$
}
Here $x_j^{(i)}$ is the value of the $j$th feature of the $i$th training example and $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$.

Regularization in Logistic Regression
Logistic regression will overfit if we fit it with high-order polynomial features, e.g.
$$z = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1 x_2 + \theta_4 x_1^2 x_2 + \cdots, \qquad P(Y = 1 \mid X = x; \theta) = \frac{1}{1 + e^{-z}}$$
In general, when we fit logistic regression with a lot of features or a lot of polynomial features, there is a high risk of overfitting.

Regularized Logistic Regression Cost Function
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^{p} \theta_j^2$$
When we minimize this cost function, we penalize the $\theta_j$'s and prevent them from becoming too large. Doing so produces a simpler, smoother decision boundary that generalizes better.

Gradient Descent for Regularized Logistic Regression
Algorithm: repeat until convergence
{
  $\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$
}
Here $x_j^{(i)}$ is the value of the $j$th feature of the $i$th training example and $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$. Note that in the above algorithm the intercept term is not regularized:
$$\frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \quad \text{and} \quad \frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})\, x_j^{(i)} + \frac{\lambda}{m} \theta_j \quad \text{for } j = 1, 2, \dots, p$$
This looks the same as regularized linear regression; the only difference is that the logistic function is applied to $\theta^T x$ here.

Multiclass Classification
Examples:
- Email classification / tagging: work (y=1), friends (y=2), family (y=3), hobby (y=4)
- Medical diagnosis: not ill (y=1), cold (y=2), flu (y=3)
- Weather: sunny (y=1), cloudy (y=2), rainy (y=3), snow (y=4)
Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$ (one-vs-rest). On a new input $x$, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$, as in the sketch below.
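A minimal sketch (not from the slides) of the one-vs-rest strategy just described. It assumes the hypothetical `fit_logistic_gradient_ascent` helper from the earlier sketch is available; any routine that fits a binary probabilistic classifier and returns a theta vector would do.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_one_vs_rest(X_raw, y, classes):
    """Train one binary logistic regression classifier per class (one-vs-rest)."""
    thetas = {}
    for c in classes:
        y_binary = (y == c).astype(int)                  # class c vs. all other classes
        thetas[c] = fit_logistic_gradient_ascent(X_raw, y_binary)  # hypothetical helper above
    return thetas

def predict_one_vs_rest(thetas, X_raw):
    """For each input, pick the class whose classifier reports the highest probability."""
    m = X_raw.shape[0]
    X = np.hstack([np.ones((m, 1)), X_raw])              # prepend the intercept column
    classes = list(thetas.keys())
    probs = np.column_stack([sigmoid(X @ thetas[c]) for c in classes])
    return np.array(classes)[np.argmax(probs, axis=1)]
```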
K-Nearest Neighbour (KNN)
The k-nearest neighbours (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems. The KNN algorithm assumes that similar things exist in close proximity; in other words, similar things are near each other ("birds of a feather flock together"). The algorithm hinges on this assumption (similar data points are close to each other) being true enough for the algorithm to be useful. KNN captures the idea of similarity by computing the Euclidean distance between data points. It is often the first algorithm introduced in machine learning.

KNN Algorithm for the Classification Problem
Given: $D = \{(x_i, y_i)\}_{i=1}^{N}$, $x_i \in \mathbb{R}^p$, $y_i \in \{1, 2, 3, \dots, M\}$, a value $K$, and an $x \in \mathbb{R}^p$, $x \ne x_i$.
Goal: determine the response $y \in \{1, 2, 3, \dots, M\}$ associated with $x \in \mathbb{R}^p$.
Algorithm
Input: data set $D$, unknown data point $x \in \mathbb{R}^p$, $K$. Output: class label of $x$.
1. For $i = 1$ to $N$ do
2.   Compute the distance of $x_i$ from $x$: $d_i = \sqrt{\sum_{j=1}^{p} (x_j - x_{ij})^2}$
3. Sort $\{d_1, d_2, \dots, d_N\}$ in non-decreasing order.
4. Choose the $K$ smallest distances in the sorted list and denote the set of points $x_i$ corresponding to the $K$ smallest distances as $N_x$, i.e. $|N_x| = K$.
5. For $j = 1$ to $M$ do
6.   Count the number $n_j$ of points of class $j$ in $N_x$.
7. Compute the maximum of the $n_j$'s; let this maximum be $m$. Determine the class label $l$ of the class which has $m$ points in $N_x$, i.e. compute $l = \arg\max_j n_j$.
8. Return $l$.

KNN Algorithm for the Regression Problem
Given: $D = \{(x_i, y_i)\}_{i=1}^{N}$, $x_i \in \mathbb{R}^p$, $y_i \in \mathbb{R}$, a value $K$, and an $x \in \mathbb{R}^p$, $x \ne x_i$.
Goal: determine the response $y \in \mathbb{R}$ associated with $x \in \mathbb{R}^p$.
Algorithm
Input: data set $D$, unknown data point $x \in \mathbb{R}^p$, $K$. Output: response of $x$.
1. For $i = 1$ to $N$ do
2.   Compute the distance of $x_i$ from $x$: $d_i = \sqrt{\sum_{j=1}^{p} (x_j - x_{ij})^2}$
3. Sort $\{d_1, d_2, \dots, d_N\}$ in non-decreasing order.
4. Choose the $K$ smallest distances in the sorted list and denote the set of points $x_i$ corresponding to the $K$ smallest distances as $N_x$, i.e. $|N_x| = K$.
5. $\hat{y} = \frac{1}{K} \sum_{x_i \in N_x} y_i$, where $y_i$ is the response of $x_i$.
6. Return $\hat{y}$.

Example (figure on slide): with $K = 3$ the predicted class is B, while with $K = 7$ the predicted class is A.

KNN: Regression and Classification Combined Algorithm (an illustrative code sketch follows the worked ISLR example below)
1. Load the data.
2. Initialize $K$ to your chosen number of neighbours.
3. For each example in the data:
   a) Calculate the distance between the query example and the current example from the data.
   b) Add the distance and the index of the example to an ordered collection.
4. Sort the ordered collection of distances and indices from smallest to largest (in ascending order) by the distances.
5. Pick the first $K$ entries from the sorted collection.
6. Get the labels of the selected $K$ entries.
7. If regression, return the mean of the $K$ labels.
8. If classification, return the mode of the $K$ labels.

Advantages and Disadvantages
- The algorithm is simple and easy to implement. There is no need to build a model, tune several parameters, or make additional assumptions.
- The algorithm is versatile: it can be used for classification, regression, and search.
- It makes no assumption about the distribution of the data, so the method is applicable to a wide class of data.
- It provides limited insight into the relationship between the predictors and the classes.
- One needs the entire data set whenever one wants to classify a new point $x \in \mathbb{R}^p$, so if $N$ or $p$ is large the computation is prohibitive.
- Prediction is slow because it requires comparing distances to every point; the algorithm gets significantly slower as the number of examples and/or predictors (independent variables) increases.

Note: the accuracy of the method depends on the choice of $K$, so one has to tune the value of $K$ suitably.
Theorem: the expected error of the KNN rule converges to $(1 + \dots)$ times the error of the Bayes classifier.
When the number of features $p$ is large, there tends to be a deterioration in the performance of KNN. [No. 4 of Exercise 4.7, pp. 168-169, ISLR]

Problem 7, Page 53, Book: ISLR
Euclidean distance of each observation from the test point $(0, 0, 0)$:
$d_1 = \sqrt{0^2 + 3^2 + 0^2} = 3$
$d_2 = \sqrt{2^2 + 0^2 + 0^2} = 2$
$d_3 = \sqrt{0^2 + 1^2 + 3^2} \approx 3.162$
$d_4 = \sqrt{0^2 + 1^2 + 2^2} \approx 2.236$
$d_5 = \sqrt{1^2 + 0^2 + 1^2} \approx 1.414$
$d_6 = \sqrt{1^2 + 1^2 + 1^2} \approx 1.732$
$K = 1$: prediction Green. $K = 3$: prediction Red.
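Below is an illustrative implementation (not from the slides) of the combined KNN procedure above, using Euclidean distance. The demo data hard-codes the observations of the ISLR Problem 7 computation; the coordinates and Red/Green labels are filled in from that exercise's table (an assumption beyond the transcript), so the output should reproduce the $K = 1$ and $K = 3$ predictions.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, K, mode="classification"):
    """Combined KNN: mode of the K nearest labels (classification) or their mean (regression)."""
    # Step 3: Euclidean distance from the query to every training example
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Steps 4-5: indices of the K smallest distances
    nearest = np.argsort(dists)[:K]
    labels = [y_train[i] for i in nearest]        # step 6
    if mode == "regression":
        return float(np.mean(labels))             # step 7: mean of the K responses
    return Counter(labels).most_common(1)[0][0]   # step 8: mode of the K labels

# Observations assumed from the ISLR exercise referenced above (test point at the origin).
X_train = np.array([[0, 3, 0], [2, 0, 0], [0, 1, 3], [0, 1, 2], [-1, 0, 1], [1, 1, 1]])
y_train = ["Red", "Red", "Red", "Green", "Green", "Red"]
x_query = np.array([0, 0, 0])

print(knn_predict(X_train, y_train, x_query, K=1))   # Green
print(knn_predict(X_train, y_train, x_query, K=3))   # Red
```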
Distance Metrics
Distance metrics are used in supervised and unsupervised learning to calculate the similarity between data points. An effective distance metric improves the performance of a machine learning model.
Types of distance metrics:
1. Euclidean distance (L2 norm)
2. Manhattan distance (L1 norm)
3. Minkowski distance
4. Hamming distance
5. Cosine similarity
6. Chebyshev distance (L∞ norm)
7. Mahalanobis distance
8. Jaccard distance

Euclidean Distance (L2 Norm)
Measures the straight-line distance between two points in Euclidean space:
$$d(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}$$
Commonly used when the features are continuous; it is the default choice for continuous, real-valued data.

Manhattan Distance (L1 Norm)
Also known as "taxicab" or "city block" distance. Suitable for grid-based problems and when the data has many outliers:
$$d(x, y) = \sum_{i=1}^{p} |x_i - y_i|$$
It is ideal for situations where movement occurs along a grid or perpendicular axes. If the features in the data set are not correlated and represent independent properties, Manhattan distance is a better choice, as it treats each dimension independently.

Minkowski Distance
Generalization of the Euclidean ($k = 2$) and Manhattan ($k = 1$) distances:
$$d(x, y) = \left( \sum_{i=1}^{p} |x_i - y_i|^k \right)^{1/k}$$
The parameter $k$ determines the nature of the metric.

Hamming Distance
Counts the number of positions at which the corresponding elements are different:
$$d(x, y) = \sum_{i=1}^{p} \mathbf{1}(x_i \ne y_i)$$
Used for categorical variables or binary data. Applications: detecting typos or small differences between words; error detection in digital communication, where the Hamming distance gives the number of bit flips (errors) between a transmitted and a received message; comparing DNA sequences, where each base (A, T, G, C) is treated as a discrete category.

Cosine Similarity
Measures the cosine of the angle between two vectors:
$$\text{cosine similarity} = \frac{x \cdot y}{\|x\|\,\|y\|} = \frac{\sum_{i=1}^{p} x_i y_i}{\sqrt{\sum_{i=1}^{p} x_i^2}\,\sqrt{\sum_{i=1}^{p} y_i^2}}$$
The range of cosine similarity is $[-1, 1]$: 1 indicates the vectors point in exactly the same direction (maximum similarity), 0 indicates the vectors are orthogonal (no similarity), and -1 indicates the vectors point in exactly opposite directions (maximum dissimilarity). Commonly used for text or sparse, high-dimensional data.

Chebyshev Distance (L∞ Norm)
The Chebyshev distance (also called the $L_\infty$ norm or maximum metric) measures the greatest absolute difference between corresponding components of two points:
$$d(x, y) = \max_{i} |x_i - y_i|$$
It focuses on the largest difference across all dimensions, making it useful when the dimension with the maximum deviation is the critical factor. Example: in a manufacturing process where a product fails when a single attribute exceeds a certain threshold, Chebyshev distance helps identify such deviations. It is also suitable for chessboard-style problems: in chess, the Chebyshev distance gives the minimum number of moves a king needs to travel between two squares, since the king can move in all 8 directions (horizontally, vertically, and diagonally). Example: the distance between (1, 1) and (4, 5) is $\max(|4-1|, |5-1|) = 4$ moves.

Mahalanobis Distance
The Mahalanobis distance is a powerful metric used in statistical and machine learning contexts where correlations between variables and the scale of the data are important. It measures the distance of a point from a distribution, taking the covariance of the data into account, so it accounts for correlations between variables and scales the data accordingly:
$$d(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}$$
where $S$ is the covariance matrix of the data,
$$S = \frac{1}{n - 1} (X - \mu)^T (X - \mu)$$
Use the Mahalanobis distance when the features are correlated, or when you need a metric that accounts for the variance and structure of the data. It is particularly useful for outlier detection, classification, and clustering in multivariate data sets.
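For concreteness, here is a small NumPy sketch (not from the slides) of the first six metrics listed above, written for a pair of equal-length vectors; Mahalanobis and Jaccard are sketched after their worked examples below. The sample vectors at the bottom reuse the king-move pair from the Chebyshev example.

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))             # L2 norm

def manhattan(x, y):
    return np.sum(np.abs(x - y))                     # L1 norm

def minkowski(x, y, k):
    return np.sum(np.abs(x - y) ** k) ** (1.0 / k)   # k=2 -> Euclidean, k=1 -> Manhattan

def hamming(x, y):
    return int(np.sum(x != y))                       # number of differing positions

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def chebyshev(x, y):
    return np.max(np.abs(x - y))                     # L-infinity norm

x, y = np.array([1.0, 1.0]), np.array([4.0, 5.0])
print(euclidean(x, y))    # 5.0
print(manhattan(x, y))    # 7.0
print(chebyshev(x, y))    # 4.0  (the king-move example above)
```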
Mahalanobis Distance: Example
Data:
$$X = \begin{bmatrix} 2 & 3 \\ 3 & 5 \\ 5 & 8 \\ 6 & 10 \end{bmatrix}$$
We want to calculate the Mahalanobis distance between $x = (2, 3)$ and $y = (6, 10)$.
Mean vector: $\mu = (4, 6.5)$.
$$X - \mu = \begin{bmatrix} -2 & -3.5 \\ -1 & -1.5 \\ 1 & 1.5 \\ 2 & 3.5 \end{bmatrix}, \qquad (X - \mu)^T = \begin{bmatrix} -2 & -1 & 1 & 2 \\ -3.5 & -1.5 & 1.5 & 3.5 \end{bmatrix}$$
Covariance matrix:
$$S = \frac{1}{n - 1} (X - \mu)^T (X - \mu) = \begin{bmatrix} 3.33 & 5.67 \\ 5.67 & 9.67 \end{bmatrix}, \qquad S^{-1} = \begin{bmatrix} 87 & -51 \\ -51 & 30 \end{bmatrix}$$
$$x - y = \begin{bmatrix} 2 - 6 \\ 3 - 10 \end{bmatrix} = \begin{bmatrix} -4 \\ -7 \end{bmatrix}$$
$$d(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)} = \sqrt{\begin{bmatrix} -4 & -7 \end{bmatrix} \begin{bmatrix} 87 & -51 \\ -51 & 30 \end{bmatrix} \begin{bmatrix} -4 \\ -7 \end{bmatrix}} \approx 2.45$$

Jaccard Distance
Measures the dissimilarity between two sets:
$$d(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}$$
Used for binary or categorical data.
Example: $A = \{1, 2, 3, 4\}$, $B = \{3, 4, 5, 6\}$.
$$\text{Jaccard distance} = 1 - \frac{|A \cap B|}{|A \cup B|} = 1 - \frac{2}{6} \approx 0.667$$
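To close, a short sketch (not from the slides) of the remaining two metrics. It recomputes the two worked examples above, so it should print roughly 2.449 for the Mahalanobis distance and 0.667 for the Jaccard distance.

```python
import numpy as np

def mahalanobis(x, y, S):
    """d(x, y) = sqrt((x - y)^T S^{-1} (x - y)) for a given covariance matrix S."""
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.inv(S) @ diff))

def jaccard_distance(A, B):
    """1 - |A intersect B| / |A union B| for two sets."""
    A, B = set(A), set(B)
    return 1.0 - len(A & B) / len(A | B)

# Worked Mahalanobis example: covariance estimated from the 4-point data set above
X = np.array([[2, 3], [3, 5], [5, 8], [6, 10]], dtype=float)
S = np.cov(X, rowvar=False)                 # (X - mu)^T (X - mu) / (n - 1)
print(mahalanobis(X[0], X[3], S))           # distance between (2, 3) and (6, 10), ~2.449

# Worked Jaccard example
print(jaccard_distance({1, 2, 3, 4}, {3, 4, 5, 6}))   # ~0.667
```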