UNIT-3 Supervised Learning-1.pptx PDF

Document Details

Noida Institute of Engineering and Technology, Greater Noida

Dr. Raju

Tags

supervised learning, machine learning, artificial intelligence, education

Summary

These are lecture notes on supervised learning for a B-Tech course at Noida Institute of Engineering and Technology, Greater Noida. The notes cover various aspects of supervised learning, including regression, classification, and different types of supervised machine learning algorithms.

Full Transcript

Noida Institute of Engineering and Technology, Greater Noida
Artificial Intelligence & Machine Learning
Unit 3: Supervised Learning
Dr. Raju, Assistant Professor & HoD, Department of CSE (AIML)
B-Tech 3rd Sem., Online & Offline (Sec A)

Faculty Introduction
Name: Dr. Raju
Qualification: Ph.D.
Experience: More than 9 years
Subjects Taught: Neural Networks, DBMS, Object Oriented Programming, Computer Graphics, COA, Digital Image Processing, Computer Applications

Course Outcomes (CO)
After completion of this course students will be able to:
- CO1: Choose and apply the most suitable search algorithm for a given problem to find the goal state. (K3)
- CO2: Comprehend and apply feature engineering and data visualization concepts. (K3)
- CO3: Critically analyze the strengths and weaknesses of various regression and classification algorithms. (K5)
- CO4: Develop approaches that incorporate appropriate clustering algorithms to solve a specific data clustering problem. (K3)
- CO5: Analyze efficiency using ensemble learning techniques, probabilistic learning and reinforcement learning algorithms. (K4)

Syllabus
- Unit I: Introduction to AI and problem-solving methods. Introduction to AI and intelligent agents, different approaches of AI, problem solving by searching techniques: uninformed search (BFS, DFS, iterative deepening, bidirectional search), informed search (heuristic search, Greedy Best First Search, A* search), local search algorithms (hill climbing and simulated annealing), adversarial search (game playing: minimax, alpha-beta pruning), constraint satisfaction problems.
- Unit II: Machine Learning & Feature Engineering. Introduction to machine learning, types of machine learning, feature engineering: features and their types, handling missing data, dealing with categorical features, working with features: feature scaling, feature selection, feature extraction: Principal Component Analysis (PCA) algorithm.
- Unit III: Supervised Learning. Regression & classification: types of regression (univariate, multivariate, polynomial), Mean Square Error, R-square error, logistic regression; regularization: bias and variance, overfitting and underfitting, L1 and L2 regularization, regularized linear regression; decision trees (ID3, C4.5, CART), confusion matrix, k-fold cross-validation, K Nearest Neighbour, Support Vector Machine.
- Unit IV: Unsupervised Machine Learning. Introduction to clustering, types of clustering: K-means, K-mode, K-medoid, hierarchical clustering, single linkage, multiple linkage, AGNES and DIANA algorithms, Gaussian mixture models, density-based clustering, DBSCAN.
- Unit V: Ensemble & Reinforcement Learning. Probabilistic learning: Bayesian learning, Naive Bayes classifier, Bayesian belief networks; ensemble learning: Random Forest, Gradient Boosting, XGBoost; reinforcement learning: introduction to reinforcement learning, models of reinforcement learning: Markov decision process, Q-learning.
Course Contents / Syllabus: UNIT-III Supervised Learning (8 Hours)
Regression & Classification: types of regression (Univariate, Multivariate, Polynomial), Mean Square Error, R-square error, Logistic Regression. Regularization: Bias and Variance, Overfitting and Underfitting, L1 and L2 Regularization, Regularized Linear Regression. Decision Trees (ID3, C4.5, CART), Confusion matrix, k-fold cross-validation, K Nearest Neighbour, Support Vector Machine.

Regression
Regression is a supervised learning technique that helps in finding the correlation between variables and enables us to predict a continuous output variable based on one or more predictor variables.
- Regression refers to a type of predictive modeling technique used to estimate the relationships among variables. It involves predicting a continuous outcome variable based on one or more predictor variables (features).
- It is a statistical method to model the relationship between a dependent (target) variable and one or more independent (predictor) variables.
- It predicts continuous/real values such as temperature, age, salary, price, etc.
- It is mainly used for prediction, forecasting and time-series analysis.
"Regression shows a line or curve that passes through the datapoints on the target-predictor graph in such a way that the vertical distance between the datapoints and the regression line is minimum."

Terminologies Related to Regression Analysis
- Dependent Variable: the main factor in regression analysis that we want to predict or understand; also called the target variable.
- Independent Variable: the factors that affect the dependent variable, or that are used to predict its values; also called predictors.
- Outliers: an outlier is an observation with a very low or very high value in comparison to the other observed values. An outlier may distort the result, so it should be handled carefully.
- Multicollinearity: if the independent variables are highly correlated with each other, the condition is called multicollinearity. It should not be present in the dataset, because it creates problems when ranking the most influential variables.
- Underfitting and Overfitting: if our algorithm works well with the training dataset but not with the test dataset, the problem is called overfitting. If our algorithm does not perform well even on the training dataset, the problem is called underfitting.

Types of Regression

Type of Regression: Univariate
Univariate regression refers to a statistical technique that analyzes the relationship between a single independent variable (predictor) and a single dependent variable (outcome). The goal is to model how changes in the independent variable affect the dependent variable.
- Simple Linear Regression: Y = a + bX + ε. Use case: predicting outcomes such as sales based on advertising spend.
- Polynomial Regression: Y = a + b1*X + b2*X^2 + b3*X^3 + ... + bn*X^n + ε. Use case: modeling relationships where the effect of the independent variable changes at different levels.
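To make the univariate cases above concrete, here is a minimal sketch (not from the slides) that fits a simple linear and a polynomial regression with scikit-learn; the synthetic data and parameter choices are illustrative assumptions.

```python
# Minimal sketch: simple linear vs. polynomial (univariate) regression.
# Synthetic data and the chosen degree are illustrative, not from the slides.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))          # single predictor (univariate)
y = 2.0 + 1.5 * X[:, 0] - 0.1 * X[:, 0] ** 2 + rng.normal(0, 1, 50)

# Simple linear regression: Y = a + bX
linear = LinearRegression().fit(X, y)
print("linear: a =", linear.intercept_, "b =", linear.coef_[0])

# Polynomial regression: Y = a + b1*X + b2*X^2
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
print("polynomial R^2 on training data:", poly.score(X, y))
```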
Type of Regression: Univariate (continued)
- Logarithmic Regression: Y = a + b*log(X) + ε. Use case: analyzing phenomena such as the relationship between income and consumption, where increases in income lead to progressively smaller increases in consumption.
- Exponential Regression: Y = a * e^(bX). Use case: modeling population growth or radioactive decay.

Type of Regression: Multivariate
Multivariate regression involves the analysis of multiple independent variables to predict a single dependent variable.
- Multiple Linear Regression: Y = a + b1*X1 + b2*X2 + ... + bn*Xn + ε. Use case: predicting a person's weight based on height, age, and exercise frequency.
- Ridge Regression: adds an L2 penalty term, where λ is the regularization parameter. Use case: useful when there are many predictors and you want to reduce model complexity.
- Lasso Regression: adds an L1 penalty term. Use case: effective for variable selection in models with a large number of predictors.
- Elastic Net Regression: combines L1 and L2 penalties. Use case: useful when there are many correlated variables and you want a more robust model.

Mean Square Error (MSE)
Mean Squared Error (MSE) is a common metric used to evaluate the performance of regression models. It measures the average squared difference between the actual (observed) values and the predicted values generated by a model. The formula for MSE is:
MSE = (1/n) * Σ (y_i - Y_i)^2, where y_i is the actual value and Y_i the predicted value.

Worked example:

Observation | Actual (y) | Predicted (Y) | y - Y | (y - Y)^2
1           | 3          | 2.5           | 0.5   | 0.25
2           | -0.5       | 0             | -0.5  | 0.25
3           | 2          | 2             | 0     | 0
4           | 7          | 8.5           | -1.5  | 2.25

MSE = (0.25 + 0.25 + 0 + 2.25) / 4 = 0.6875

The Mean Squared Error (MSE) for this dataset is 0.6875. This value gives an indication of how well the predicted values match the actual values, with a lower MSE representing better model performance.

Importance of MSE
- Performance Evaluation: MSE provides a quantitative measure of how well a regression model predicts outcomes. A lower MSE indicates a better fit to the data, meaning the model's predictions are closer to the actual values.
- Sensitivity to Outliers: since MSE squares the errors, it gives greater weight to larger errors. This sensitivity makes it effective for detecting models that may not perform well on extreme values, although it can also make the metric overly influenced by outliers.
- Optimization Objective: many machine learning algorithms, particularly those based on gradient descent (e.g., linear regression, neural networks), use MSE as the loss function to minimize during training. By minimizing MSE, models learn to make better predictions.
- Comparative Analysis: MSE allows for easy comparison between different models or algorithms. By evaluating multiple models using MSE, practitioners can select the one with the best performance on this metric.
- Interpretable Metric: although MSE itself is in squared units of the target variable, it is straightforward to interpret. When paired with the square root (resulting in Root Mean Squared Error, RMSE), it can be expressed in the same units as the target variable, enhancing interpretability.
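The worked example above can be reproduced in a few lines; this sketch simply recomputes the table values (NumPy is used for convenience).

```python
# Recompute the MSE for the worked example above (four observations).
import numpy as np

y_actual = np.array([3.0, -0.5, 2.0, 7.0])      # actual values (y)
y_pred   = np.array([2.5,  0.0, 2.0, 8.5])      # predicted values (Y)

errors = y_actual - y_pred                       # y - Y
mse = np.mean(errors ** 2)                       # average of squared errors
print("squared errors:", errors ** 2)            # [0.25, 0.25, 0., 2.25]
print("MSE:", mse)                               # 0.6875
```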
R-squared
R-squared, also known as the coefficient of determination, is a statistical measure that indicates how well the independent variables in a regression model explain the variability of the dependent variable. It provides an indication of the goodness of fit of the model.

Interpretation of R-squared
- Range: R-squared values range from 0 to 1.
  - 0: the model does not explain any variability in the dependent variable (the mean of the dependent variable is the best predictor).
  - 1: the model explains all the variability in the dependent variable (perfect prediction).
- Value interpretation: an R² value of 0.70, for example, suggests that 70% of the variability in the dependent variable can be explained by the independent variables in the model, while 30% is attributed to other factors or random noise.

Linear Regression (Beyond the Syllabus)
- Linear regression is a statistical regression method used for predictive analysis. It is one of the simplest algorithms for regression and shows the relationship between continuous variables.
- It is used for solving regression problems in machine learning.
- Linear regression shows the linear relationship between the independent variable (X-axis) and the dependent variable (Y-axis), hence the name linear regression.
- If there is only one input variable (x), it is called simple linear regression; if there is more than one input variable, it is called multiple linear regression.
- The relationship between variables in a linear regression model can be illustrated with an example: predicting the salary of an employee on the basis of years of experience.

Some popular applications of linear regression:
- Analyzing trends and sales estimates
- Salary forecasting
- Real estate prediction
- Arriving at ETAs in traffic

Colab link for Linear Regression:
https://colab.research.google.com/drive/1Ol5Rvfj-DDNR4vngO2EkZcfFaAM4jItw#scrollTo=0X7hGyLc11EZ

Logistic Regression
- Logistic regression is another supervised learning algorithm, used to solve classification problems. In classification problems, the dependent variable is in a binary or discrete format such as 0 or 1.
- The logistic regression algorithm works with categorical variables such as 0 or 1, Yes or No, True or False, Spam or Not Spam, etc.
- It is a predictive analysis algorithm that works on the concept of probability.
- Logistic regression is a type of regression, but it differs from linear regression in how it is used.
- Logistic regression uses the sigmoid (logistic) function, which is used to model the data and enters its cost function. The sigmoid function can be represented as:
  f(x) = 1 / (1 + e^(-x))
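As a quick illustration of the sigmoid mapping and a logistic classifier, here is a small sketch using scikit-learn on synthetic data; the feature, threshold, and labels are assumptions for demonstration only.

```python
# Sketch: the sigmoid function and a logistic regression classifier.
# Synthetic data; the feature and decision boundary are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-4.0, 0.0, 4.0])))   # ~[0.018, 0.5, 0.982]

# Binary classification: label is 1 when the single feature exceeds ~5 (plus noise).
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))
y = (X[:, 0] + rng.normal(0, 1, 200) > 5).astype(int)

clf = LogisticRegression().fit(X, y)
print("P(class=1 | x=2, 5, 8):", clf.predict_proba([[2], [5], [8]])[:, 1])
```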
Types of Logistic Regression
- Binary (0/1, pass/fail)
- Multinomial (cats, dogs, lions)
- Ordinal (low, medium, high)

Binary Logistic Regression
Binary logistic regression predicts the relationship between the independent variables and a binary dependent variable. Examples of the output of this regression type are success/failure, 0/1, or true/false.
Examples:
- Deciding whether or not to offer a loan to a bank customer: outcome = yes or no.
- Evaluating the risk of cancer: outcome = high or low.
- Predicting a team's win in a football match: outcome = yes or no.

Multinomial Logistic Regression
In multinomial regression, the categorical dependent variable has two or more discrete outcomes, i.e., there are more than two possible outcomes.
Examples:
- Predicting the most popular transportation type for 2040. Here, transport type is the dependent variable, and the possible outcomes can be electric cars, electric trains, electric buses, and electric bikes.
- Predicting whether a student will join a college, a vocational/trade school, or the corporate industry.
- Estimating the type of food consumed by pets.

Ordinal Logistic Regression
Ordinal logistic regression applies when the dependent variable is in an ordered state (i.e., ordinal). The dependent variable (y) specifies an order with two or more categories or levels.
Examples:
- Formal shirt size: outcomes = XS/S/M/L/XL
- Survey answers: outcomes = Agree/Disagree/Unsure
- Scores on a math test, graded in ordered bands

Key advantages of logistic regression (presented as a figure in the original slides).

Colab link (Case Study: Predicting Diabetes Using Logistic Regression):
https://colab.research.google.com/drive/1QhRP1lhm05y6-I_T6uSz2CphiMcLKDYo#scrollTo=IHo9w80zCKoR

Regularization
Regularization is a technique used in machine learning and statistics to prevent overfitting, which occurs when a model learns the noise in the training data instead of the underlying patterns.
Purpose: to improve model generalization and performance on unseen data by reducing overfitting.
Common types:
- L1 Regularization (Lasso): adds a penalty equal to the absolute value of the coefficients. It can produce sparse models by driving some coefficients to zero, effectively selecting features.
- L2 Regularization (Ridge): adds a penalty equal to the square of the coefficients. It tends to shrink coefficients evenly, preventing any one feature from having too much influence.
- Elastic Net: combines both L1 and L2 penalties, allowing for feature selection and coefficient shrinkage.
Other Regularization Techniques
- Dropout: randomly sets a fraction of neurons to zero during training in neural networks, which helps prevent co-adaptation.
- Early Stopping: involves monitoring the model's performance on a validation set and stopping training when performance begins to degrade.
Benefits:
- Helps to avoid overfitting.
- Encourages simpler models, which are often more interpretable.
- Can improve prediction accuracy on new data.

Regularization: Bias and Variance
An error is a measure of how accurately an algorithm can make predictions for a previously unknown dataset.
- Reducible errors: these errors can be reduced to improve model accuracy. They can be further classified into bias and variance.
- Irreducible errors: these errors will always be present in the model.

Bias
Bias is the difference between the prediction values made by the model and the actual/expected values.
- Low Bias: a low-bias model makes fewer assumptions about the form of the target function.
- High Bias: a high-bias model makes more assumptions and becomes unable to capture the important features of the dataset. A high-bias model also cannot perform well on new data.
Examples of machine learning algorithms with low bias are Decision Trees, k-Nearest Neighbours and Support Vector Machines; algorithms with high bias include Linear Regression, Linear Discriminant Analysis and Logistic Regression.
Ways to reduce high bias:
- Increase the input features, as the model is underfitted.
- Decrease the regularization term.
- Use more complex models, such as including some polynomial features.

Variance
Variance specifies the amount of variation in the prediction if different training data were used. It tells how much a random variable differs from its expected value. Ideally, a model should not vary too much from one training dataset to another, which means the algorithm should be good at capturing the hidden mapping between input and output variables.
- Low variance means there is a small variation in the prediction of the target function with changes in the training dataset.
- High variance shows a large variation in the prediction of the target function with changes in the training dataset.
A model with high variance has the following problems:
- A high-variance model leads to overfitting.
- It increases model complexity.
Ways to reduce high variance:
- Reduce the input features or number of parameters, as the model is overfitted.
- Do not use an overly complex model.
- Increase the training data.
- Increase the regularization term.

Different Combinations of Bias-Variance
- Low-Bias, Low-Variance: the combination of low bias and low variance is the ideal machine learning model; however, it is not practically possible.
- Low-Bias, High-Variance: model predictions are inconsistent but accurate on average. This case occurs when the model learns a large number of parameters and hence leads to overfitting.
- High-Bias, Low-Variance: predictions are consistent but inaccurate on average. This case occurs when a model does not learn well from the training dataset or uses few parameters. It leads to underfitting problems in the model.
- High-Bias, High-Variance: predictions are inconsistent and also inaccurate on average.

Regularization: Overfitting
Overfitting is an undesirable machine learning behavior that occurs when the model gives accurate predictions for training data but not for new data. It corresponds to high variance and low bias.
Reasons for overfitting:
- The training data size is too small and does not contain enough samples to accurately represent all possible input data values.
- The training data contains large amounts of irrelevant information, called noisy data.
- The model trains for too long on a single sample set of data.
- The model complexity is high, so it learns the noise within the training data.
Symptom of overfitting: low training error but high validation error. The model fits the training data very well but fails to generalize to new data.

Regularization: Underfitting
Underfitting occurs when a model is too simple to capture the complexities of the data. It represents the inability of the model to learn the training data effectively, resulting in poor performance on both the training and testing data. In simple terms, an underfit model's predictions are inaccurate, especially when applied to new, unseen examples. It mainly happens when we use a very simple model with overly simplified assumptions. To address underfitting, we need to use more complex models, with enhanced feature representation and less regularization. An underfitting model has high bias and low variance.
Reasons for underfitting:
- The model is too simple, so it may not be capable of representing the complexities in the data.
- The input features used to train the model are not adequate representations of the underlying factors influencing the target variable.
- The size of the training dataset is not enough.
- Excessive regularization is used to prevent overfitting, which constrains the model from capturing the data well.
- Features are not scaled.
Techniques to reduce underfitting:
- Increase model complexity.
- Increase the number of features, performing feature engineering.
- Remove noise from the data.
- Increase the number of epochs or the duration of training to get better results.
Symptoms: high training error and high validation error. The model does not fit the training data well, leading to poor predictions even on known data.

Regularization: Underfitting vs. Overfitting (compared in a figure in the original slides).
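To see underfitting and overfitting side by side, the following sketch fits polynomials of increasing degree to noisy data and compares training and validation MSE; the data, degrees, and split are illustrative assumptions, not taken from the slides.

```python
# Sketch: underfitting vs. overfitting as polynomial degree grows.
# Synthetic data; the degrees and split ratio are illustrative choices.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(60, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.2, 60)   # noisy sine curve

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):          # too simple, reasonable, very flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    tr_mse = mean_squared_error(y_tr, model.predict(X_tr))
    va_mse = mean_squared_error(y_va, model.predict(X_va))
    print(f"degree={degree:2d}  train MSE={tr_mse:.3f}  validation MSE={va_mse:.3f}")

# Expected pattern: degree 1 underfits (both errors high), degree 15 tends to
# overfit (low train error, higher validation error), degree 4 balances the two.
```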
Regularization: L1 (Lasso)
LASSO (Least Absolute Shrinkage and Selection Operator) is a regression analysis method that incorporates L1 regularization. It is particularly useful for models that can benefit from feature selection and for addressing multicollinearity.

Key components of L1 (Lasso):
- L1 Regularization: LASSO adds a penalty equal to the absolute value of the coefficients multiplied by a tuning parameter (λ) to the loss function. The objective function becomes:
  minimize Σ (y_i - Y_i)^2 + λ Σ |b_j|
- Feature Selection: LASSO tends to shrink some coefficients to exactly zero when the tuning parameter λ is sufficiently large. This means it effectively selects a simpler model by excluding some features, which can improve interpretability.
- Bias-Variance Tradeoff: by introducing the penalty, LASSO can help reduce overfitting, making the model more generalizable to unseen data. However, this can introduce some bias.
- Tuning Parameter (λ): the choice of λ is crucial. A smaller λ leads to a model closer to ordinary least squares (OLS), while a larger λ increases the amount of shrinkage and feature selection. Techniques like cross-validation are commonly used to find an optimal λ.
- Applications: LASSO is widely used in scenarios with many predictors, especially when some of them might be irrelevant or redundant. It is popular in fields like genetics, finance, and machine learning.

Example of L1 (Lasso)
Example scenario: imagine you have a dataset with information about houses, and you want to predict the price based on various features:
- Size (square feet)
- Number of bedrooms
- Age of the house
- Number of bathrooms
- Proximity to downtown
- Garage size
Dataset and implementation:
https://colab.research.google.com/drive/1jeqwzMN5hpfasIFu6J9vub0BSBfonQEm#scrollTo=Dw5LvNSEy0nY
https://colab.research.google.com/drive/1pD_uCPkmYla71GerEoN56izmX_ubaol0#scrollTo=3LMS-djesEHk

Regularization: L2 (Ridge)
- L2 regularization is a machine learning technique that avoids overfitting by introducing a penalty term into the model's loss function based on the squares of the model's parameters.
- The goal of L2 regularization is to keep the model's parameters small and prevent them from growing out of control.
- To achieve L2 regularization, a term proportional to the squares of the model's parameters is added to the loss function. This term acts as a limiter on the parameters' size. A hyperparameter called lambda (λ) controls the regularization's intensity and hence the size of the penalty term.
- A regression model that uses the L2 regularization technique is called Ridge regression. Ridge regression adds the "squared magnitude" of the coefficients as a penalty term to the loss function (L).

Regularization: L1 (Lasso) and L2 (Ridge) implementation
Link for code:
https://colab.research.google.com/drive/1YkOjpnMJdYWvn0oqjJUCKebbqredDlFH#scrollTo=ityALR23ZF-R
Link for dataset:
https://www.kaggle.com/anthonypino/melbourne-housing-market
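Alongside the Colab links above, here is a compact sketch (assumptions: synthetic data, arbitrary alpha values) showing how L1 shrinks some coefficients exactly to zero while L2 only shrinks them, using scikit-learn's Lasso and Ridge (scikit-learn calls the λ parameter `alpha`).

```python
# Sketch: coefficient shrinkage under L1 (Lasso) vs. L2 (Ridge).
# Synthetic data with only 3 of 10 truly useful features; alpha values are arbitrary.
import numpy as np
from sklearn.linear_model import Lasso, Ridge, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_coef = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0])   # 7 irrelevant features
y = X @ true_coef + rng.normal(0, 0.5, 200)

ols   = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)    # L1 penalty (alpha plays the role of lambda)
ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty

print("OLS coefficients:  ", np.round(ols.coef_, 2))
print("Lasso coefficients:", np.round(lasso.coef_, 2))   # several exactly 0 -> feature selection
print("Ridge coefficients:", np.round(ridge.coef_, 2))   # all shrunk, none exactly 0
```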
Benefits of Regularization
- Reduces Overfitting: regularization helps prevent models from learning noise and irrelevant details in the training data.
- Improves Generalization: by discouraging complex models, regularization ensures better performance on unseen data.
- Enhances Stability: regularization stabilizes model training by penalizing large weights.
- Enables Feature Selection: L1 regularization can zero out some coefficients, effectively selecting the more relevant features.
- Manages Multicollinearity: reduces the problem of high correlations among features, which is particularly useful when predictors overlap.
- Encourages Simplicity: promotes simpler models that are easier to interpret and less likely to overfit.
- Controls Model Complexity: provides a mechanism to balance the complexity of the model with its performance on the training and test data.
- Facilitates Robustness: makes models less sensitive to individual peculiarities in the training set.
- Improves Convergence: helps optimization algorithms converge more quickly and reliably by smoothing the error landscape.
- Adjustable Complexity: the strength of regularization can be tuned to fit the data's specific needs and the desired model complexity.

Confusion Matrix
A confusion matrix is a tool used in machine learning to evaluate the performance of a classification model. It provides a visual representation of the actual versus predicted classifications, helping to understand the types of errors made by the model.
- True Positive (TP): the number of correct predictions that an instance is positive.
- True Negative (TN): the number of correct predictions that an instance is negative.
- False Positive (FP): the number of incorrect predictions where an instance is predicted as positive but is actually negative (Type I error).
- False Negative (FN): the number of incorrect predictions where an instance is predicted as negative but is actually positive (Type II error).

Key Metrics Derived from the Confusion Matrix
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Precision = TP / (TP + FP)
- Recall (Sensitivity) = TP / (TP + FN)
- F1-Score = 2 × (Precision × Recall) / (Precision + Recall)

Case Study: Medical Diagnosis for Diabetes
Context: a healthcare provider develops a machine learning model to predict whether patients have diabetes based on various health metrics. After training and validating the model, they evaluate its performance using a confusion matrix.
Data overview: the model was tested on a dataset of 1,000 patients, with the following results:
- Actual diabetic patients: 150
- Actual non-diabetic patients: 850
Interpretation of the confusion matrix:
- True Positives (TP): 120 patients were correctly identified as diabetic.
- False Negatives (FN): 30 patients were diabetic but were incorrectly classified as non-diabetic.
- False Positives (FP): 50 patients were classified as diabetic but were actually non-diabetic.
- True Negatives (TN): 800 patients were correctly identified as non-diabetic.
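The case-study metrics listed below can be reproduced directly from these four counts; a small sketch:

```python
# Recompute the diabetes case-study metrics from the confusion-matrix counts.
TP, FN, FP, TN = 120, 30, 50, 800

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1_score  = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.1%}")    # 92.0%
print(f"Precision: {precision:.1%}")   # 70.6%
print(f"Recall:    {recall:.1%}")      # 80.0%
print(f"F1-score:  {f1_score:.1%}")    # 75.0%
```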
Case Study: Medical Diagnosis for Diabetes (Key Metrics)
- Accuracy = 92%
- Precision = 70.6%
- Recall = 80%
- F1-Score = 75%
Insights and analysis:
- High accuracy: the model has a high accuracy of 92%, which might seem promising, but accuracy alone does not give a complete picture, especially in medical diagnosis where the cost of false negatives can be significant.
- Precision vs. recall: with a precision of 70.6%, the model does reasonably well when it predicts diabetes. However, a recall of 80% indicates that some diabetic patients are being missed, which can be critical in healthcare settings.
- Focus on recall: given that missing a diabetic patient can lead to serious health complications, healthcare providers might prioritize improving recall, even if it results in lower precision (more false positives).

Decision Tree
A Decision Tree is a machine learning model used for both classification and regression tasks. It represents decisions and their possible consequences in a tree-like structure, which makes it easy to visualize and interpret.

Decision Tree Definitions
- Root node: the first node, from which all decisions initially start. It has no parent node and two children nodes.
- Decision nodes: nodes that have one parent node and split into children nodes (decision or leaf nodes).
- Leaf nodes: nodes that have one parent but do not split further (also known as terminal nodes). They are the nodes that produce the prediction.
- Branches: a subsection of the entire tree (also known as a sub-tree).
- Parent / child nodes: a node that is divided into sub-nodes is called a parent node; the sub-nodes are the child nodes of the parent from which they divided.
- Maximum depth: the maximum number of branches between the top and the lower end.

Types of Decision Tree Algorithms
There are multiple decision tree algorithms:
- ID3 (Iterative Dichotomiser 3)
- C4.5 (extension of ID3)
- CART (Classification And Regression Tree)
- CHAID (Chi-square Automatic Interaction Detection)
- MARS (Multivariate Adaptive Regression Splines)
Two kinds of trees are grouped under CART (Classification And Regression Tree):
- Classification decision tree (used for categorical targets)
- Regression decision tree (used for continuous targets)

Decision Tree Metrics: Information Gain
- To produce the "best" split, decision trees aim at maximizing the Information Gain (IG) after each split.
- The information gain of a single node is calculated by subtracting its entropy from 1; the information gain of a split indicates whether the child nodes are purer than the parent node.
- To measure the information gain of a parent against its children, we subtract the weighted entropy of the children from the entropy of the parent.

Decision Tree Metrics: Entropy
Entropy is used to measure the quality of a split for categorical targets. The formula for entropy in decision trees is:
Entropy = - Σ p_i * log2(p_i)
where p_i represents the proportion of class i in the node.
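As a sketch of how these two quantities are computed, the following functions implement entropy and the information gain of a split; the toy labels are hypothetical, not the slide dataset.

```python
# Sketch: entropy and information gain of a split (ID3-style), on hypothetical labels.
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i)) over the class proportions in a node."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(parent_labels, child_groups):
    """Parent entropy minus the weighted average entropy of the child nodes."""
    n = len(parent_labels)
    weighted = sum(len(g) / n * entropy(g) for g in child_groups)
    return entropy(parent_labels) - weighted

# Hypothetical parent node: 5 "yes" and 5 "no" -> entropy 1.0 (maximum for two classes)
parent = ["yes"] * 5 + ["no"] * 5
# Hypothetical split into two children
left, right = ["yes"] * 4 + ["no"] * 1, ["yes"] * 1 + ["no"] * 4
print("parent entropy:", entropy(parent))                            # 1.0
print("information gain:", information_gain(parent, [left, right]))  # ~0.278
```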
Decision Tree Metrics: Information Gain of a Split
To calculate the entropy of a split, we:
- calculate the entropy of the parent node,
- calculate the entropy of the children nodes,
- calculate the weighted average entropy of the split.
If the weighted entropy is smaller than the entropy of the parent node, then the information gain is greater. In the slides' illustration, the entropy of the parent equals 1 and the weighted entropy equals 1.09.

Decision Tree Metrics: Gini Impurity
- Gini impurity is used as an alternative to information gain (IG) to compute the homogeneity of a leaf in a less computationally intensive way. The purer, or more homogeneous, a node is, the smaller its Gini impurity.
- The Gini Index is a metric that measures how often a randomly chosen element would be incorrectly identified; an attribute with a lower Gini index should be preferred.
- Sklearn supports the "gini" criterion for the Gini Index and uses it by default.
- The formula for the Gini Index is:
  Gini = 1 - Σ p_i^2

Features and characteristics of the Gini Index:
- It is calculated by summing the squared probabilities of each outcome in a distribution and subtracting the result from 1.
- A lower Gini Index indicates a more homogeneous or pure distribution, while a higher Gini Index indicates a more heterogeneous or impure distribution.
- In decision trees, the Gini Index is used to evaluate the quality of a split by measuring the difference between the impurity of the parent node and the weighted impurity of the child nodes.
- Compared to other impurity measures like entropy, the Gini Index is faster to compute and more sensitive to changes in class probabilities.
- One disadvantage of the Gini Index is that it tends to favour splits that create equally sized child nodes, even if they are not optimal for classification accuracy.

Advantages of Decision Trees
- Easy to understand and interpret, making them accessible to non-experts.
- Handle both numerical and categorical data without requiring extensive preprocessing.
- Provide insights into feature importance for decision-making.
- Handle missing values and outliers without significant impact.
- Applicable to both classification and regression tasks.

Disadvantages of Decision Trees
- Potential for overfitting.
- Sensitivity to small changes in the data; limited generalization if the training data is not representative.
- Potential bias in the presence of imbalanced data.

Colab implementation of Decision Tree:
https://colab.research.google.com/drive/1PvJ7IfGna4AJm9pOG_wguT_BJjP7koTJ#scrollTo=YJGJ6KUvQxIg
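A short sketch comparing the Gini impurity defined above with entropy for the same hypothetical class proportions (values chosen only for illustration):

```python
# Sketch: Gini impurity vs. entropy for the same node, on hypothetical proportions.
from math import log2

def gini(proportions):
    """Gini = 1 - sum(p_i^2); 0 for a pure node."""
    return 1.0 - sum(p ** 2 for p in proportions)

def entropy(proportions):
    """Entropy = -sum(p_i * log2(p_i)); 0 for a pure node."""
    return -sum(p * log2(p) for p in proportions if p > 0)

for p in ([1.0, 0.0], [0.8, 0.2], [0.5, 0.5]):       # pure -> increasingly mixed
    print(f"proportions={p}  gini={gini(p):.3f}  entropy={entropy(p):.3f}")

# Both measures are 0 for a pure node and largest for a 50/50 split
# (Gini 0.5, entropy 1.0); Gini is cheaper to compute (no logarithm).
```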
Classification using the ID3 Algorithm
Consider a dataset on the basis of which we will determine whether to play football or not. The independent variables are Outlook, Temperature, Humidity, and Wind; the dependent variable is whether to play football or not.
- Step 1: Find the root node of the decision tree. First find the entropy of the class variable; from the data, the class counts for each value of Outlook can be tabulated easily.
- Step 2: Calculate the average weighted entropy for the attribute.
- Step 3: Find the information gain: it is the difference between the parent entropy and the average weighted entropy found above. Similarly, find the information gain for Temperature, Humidity, and Wind.
- Step 4: Select the feature with the largest information gain. Here it is Outlook, so it forms the first node (root node) of the decision tree.
To continue the worked example: https://medium.datadriveninvestor.com/decision-tree-algorithm-with-hands-on-example-e6c2afb40d38

K-Fold Cross-Validation
- Data Partitioning: k-fold cross-validation divides the dataset into k equal-sized subsets (or "folds"), allowing the model to be trained and validated multiple times. Each fold serves as the validation set once while the remaining folds are used for training.
- Performance Estimation: the model is trained and evaluated k times, and the performance metrics (like accuracy, precision, recall, etc.) are averaged across all folds. This provides a more reliable estimate of the model's performance on unseen data.
- Mitigation of Overfitting: by using multiple train-test splits, k-fold cross-validation helps reduce overfitting, ensuring that the model generalizes well to new data rather than just memorizing the training set.
- Flexibility in the Choice of k: the value of k can be adjusted based on the size of the dataset and the computational resources available. Common choices are k = 5 or k = 10, but smaller or larger values can be used depending on specific needs.
- Stratification Option: for imbalanced datasets, stratified k-fold cross-validation can be employed to ensure that each fold has a similar distribution of classes, which helps maintain the representativeness of each fold.

How K-Fold Cross-Validation Works
- Data Splitting: the entire dataset is randomly divided into k subsets or "folds" of approximately equal size. Common values for k are 5 or 10, but k can be adjusted based on the size of the dataset.
- Model Training and Validation: the model is trained and validated k times. In each iteration, one fold is used as the validation set while the remaining k-1 folds are used for training. This process ensures that every instance in the dataset is used for both training and validation at least once.
- Performance Measurement: after completing all k iterations, the performance metrics (such as accuracy, precision, recall, F1 score, etc.) are averaged across all iterations. This average provides a more comprehensive view of the model's performance.

Advantages of K-Fold Cross-Validation
- Better Generalization: by using multiple train-test splits, k-fold cross-validation reduces the risk of overfitting and provides a better estimate of how the model will perform on unseen data.
- Efficient Use of Data: since each instance in the dataset is used for both training and validation, it maximizes the utilization of available data, which is particularly important for smaller datasets.
- Variance Reduction: averaging the results over multiple folds helps to smooth out the variability that can occur with a single train-test split.
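A minimal sketch of the procedure just described, using scikit-learn's KFold and cross_val_score on a bundled toy dataset (the model and k = 5 are illustrative choices):

```python
# Sketch: 5-fold cross-validation with scikit-learn (dataset and model chosen for illustration).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

kf = KFold(n_splits=5, shuffle=True, random_state=0)   # 5 folds; each is the validation set once
scores = cross_val_score(model, X, y, cv=kf, scoring="accuracy")

print("per-fold accuracy:", scores.round(3))
print("mean accuracy:    ", scores.mean().round(3))    # averaged estimate of generalization
# For imbalanced classes, StratifiedKFold keeps the class distribution similar in every fold.
```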
Disadvantages of K-Fold Cross-Validation
- Computationally Intensive: k-fold cross-validation can be computationally expensive, especially for large datasets or complex models, as the model must be trained k times.
- Choice of K: selecting an appropriate value for k is crucial. A very small k may lead to high variance in the performance estimate, while a very large k can be computationally expensive.

K-Nearest Neighbour (K-NN)
- K-Nearest Neighbour is one of the simplest machine learning algorithms, based on the supervised learning technique.
- The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.
- The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category using K-NN.
- The K-NN algorithm can be used for regression as well as classification, but it is mostly used for classification problems.
- K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
- It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and performs an action on it at classification time.
- At the training phase the KNN algorithm just stores the dataset; when it gets new data, it classifies that data into the category most similar to the new data.

How does K-NN work?
- Step 1: Select the number K of neighbours.
- Step 2: Calculate the Euclidean distance to the K nearest neighbours.
- Step 3: Take the K nearest neighbours as per the calculated Euclidean distance.
- Step 4: Among these K neighbours, count the number of data points in each category.
- Step 5: Assign the new data point to the category for which the number of neighbours is maximum.
- Step 6: The model is ready.

How to select the value of K in the K-NN algorithm?
- There is no particular way to determine the best value for K, so we need to try several values to find the best among them. The most preferred value for K is 5.
- A very low value for K, such as K = 1 or K = 2, can be noisy and lead to the effects of outliers in the model.
- Large values for K are good, but they may cause some difficulties.

Distance Metrics Used in the KNN Algorithm
The Minkowski distance is d(x, y) = (Σ |x_i - y_i|^p)^(1/p). When p = 2 it is the same as the formula for the Euclidean distance, and when p = 1 we obtain the formula for the Manhattan distance.

KNN Example
Dataset: let's consider a simple dataset with two features (height and weight) and a binary target variable indicating whether a person is "Athlete" (1) or "Non-Athlete" (0).
Steps of the KNN process: illustrated step by step in figures in the original slides.
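A small sketch of that example (the height/weight values and K = 3 are hypothetical, invented only for illustration):

```python
# Sketch: KNN on a hypothetical height/weight "Athlete vs Non-Athlete" dataset.
# All values and K = 3 are invented for illustration.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X = np.array([[170, 60], [180, 75], [165, 55], [175, 80],   # [height cm, weight kg]
              [160, 70], [185, 78], [158, 65], [172, 62]])
y = np.array([1, 1, 0, 1, 0, 1, 0, 0])                      # 1 = Athlete, 0 = Non-Athlete

# Feature scaling matters for distance-based methods like KNN.
scaler = StandardScaler().fit(X)
knn = KNeighborsClassifier(n_neighbors=3)                   # K = 3, Euclidean distance by default
knn.fit(scaler.transform(X), y)

new_person = np.array([[174, 68]])
print("predicted class:", knn.predict(scaler.transform(new_person)))
print("neighbour votes (probability per class):", knn.predict_proba(scaler.transform(new_person)))
```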
Advantages of the KNN Algorithm
- Easy to implement, as the complexity of the algorithm is not that high.
- Adapts easily: since the KNN algorithm stores all the data in memory, whenever a new example or data point is added, the algorithm adjusts itself and the new example contributes to future predictions.
- Few hyperparameters: the only parameters required in training a KNN algorithm are the value of K and the choice of distance metric.

Disadvantages of the KNN Algorithm
- Does not scale: KNN is considered a lazy algorithm, which means it requires a lot of computing power as well as data storage. This makes the algorithm both time-consuming and resource-exhausting.
- Curse of dimensionality: because of the peaking phenomenon, the KNN algorithm is affected by the curse of dimensionality, which means it has a hard time classifying data points properly when the dimensionality is too high.
- Prone to overfitting: because the algorithm is affected by the curse of dimensionality, it is also prone to overfitting. Feature selection and dimensionality reduction techniques are generally applied to deal with this problem.

Colab link: KNN algorithm
https://colab.research.google.com/drive/1IgRx863lKCxLXnINK5wiNjBJoGiWBCO-#scrollTo=jGLVgSF8rlx1

Support Vector Machine (SVM)
- Support Vector Machine (SVM), also known as Support Vector Classification, is a supervised and linear machine learning technique typically used to solve classification problems. SVR (Support Vector Regression) is a subset of SVM that uses the same ideas to tackle regression problems.
- SVM also supports the kernel method (kernel SVM), which allows us to tackle non-linearity.
- The primary use case for SVM is classification, but it can solve both classification and regression problems.
- SVM constructs a hyperplane in multidimensional space to separate different classes. It iteratively generates the best hyperplane to minimize classification error. The goal of SVM is to find a maximum marginal hyperplane (MMH) that splits a dataset into classes as evenly as possible.

Key terms:
- Support vectors: the data points closest to the hyperplane. These points define the separating line by determining the margins and are the most relevant to the construction of the classifier.
- Hyperplane: a decision plane that separates objects with different class memberships.
- Margin: the distance between the two lines through the class points closest to each other. It is calculated as the perpendicular distance from the hyperplane to the support vectors (nearest points). A wide margin between the classes is good, whereas a thin margin is not good.
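As a sketch of these ideas, the following fits a linear SVM on synthetic data and inspects its support vectors; the data and the C value are illustrative assumptions (the RBF kernel discussed in the next slides can be tried by changing `kernel`).

```python
# Sketch: linear SVM with support-vector inspection (synthetic data, illustrative C).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two roughly separable blobs in 2-D.
X = np.vstack([rng.normal(loc=[-2, -2], scale=1.0, size=(50, 2)),
               rng.normal(loc=[ 2,  2], scale=1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)     # maximum-margin linear separator
print("support vectors per class:", clf.n_support_)
print("first support vector:", clf.support_vectors_[0])
print("predictions for two new points:", clf.predict([[-1.5, -1.0], [2.5, 1.5]]))

# For data that is not linearly separable, a kernel such as RBF can be used instead:
# clf = SVC(kernel="rbf", gamma=0.1).fit(X, y)
```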
Types of Support Vector Machine
- Linear SVM (Simple SVM): used for data that is linearly separable. A dataset is termed linearly separable if it can be classified into two classes using a single straight line, and the classifier is known as the linear SVM classifier. It is most commonly used for tasks involving linear regression and classification.
- Nonlinear SVM (Kernel SVM): used to classify nonlinearly separable data, i.e., data that cannot be classified using a straight line. It offers more flexibility for nonlinear data because more features can be added to fit a hyperplane instead of a two-dimensional space.

SVM Kernels
Some problems cannot be solved using a linear hyperplane because they are non-linearly separable. In such a situation, SVM uses a kernel trick to transform the input space into a higher-dimensional space. There are different types of SVM kernels depending on the kind of problem:
- Linear Kernel: a regular dot product of two observations; the sum of the products of each pair of input values.
- Polynomial Kernel: a more generalized form of the linear kernel that can capture curved or nonlinear input spaces. Here d is the degree of the polynomial; if d = 1, it is similar to the linear transformation. The degree needs to be specified manually in the learning algorithm.
- Radial Basis Function (RBF) Kernel: can map an input space into an infinite-dimensional space. Here gamma is a parameter which ranges from 0 to 1. A higher gamma value will fit the training dataset perfectly, which causes overfitting; gamma = 0.1 is considered a good default value. The gamma value also needs to be specified manually in the learning algorithm.

Unit-III Recap
Regularization: Bias and Variance, Overfitting and Underfitting, L1 and L2 Regularization, Regularized Linear Regression, Decision Trees (ID3, C4.5, CART), Confusion matrix, k-fold cross-validation, K Nearest Neighbour, Support Vector Machine.
