Unit-3 Supervised Learning-1 PDF
Document Details
Noida Institute of Engineering and Technology
Dr. Raju
Summary
This document is for a course on Artificial Intelligence & Machine Learning, focusing on supervised learning. It includes course details, faculty information, learning outcomes, topics covered in the syllabus, and an introduction to the concepts.
Full Transcript
Noida Institute of Engineering and Technology, Greater Noida Artificial Intelligence & Machine Learning Unit: 3 Supervised Learning Dr. Raju Course Details Assistant Professor & HoD B-Tech 3rd Sem. ONLINE & Offline (Sec A) Department of CSE(AIML) Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 11/3/2024 2 Faculty Introduction Name : Dr. Raju Qualification: Ph.D Experience: More than 9 years Subject Taught: Neural Network, DBMS, Object Oriented Programming, Computer Graphics, COA, Digital Image Processing, Computer Application Dr. Raju, Assistant Prof. (CSE (AIML)) 11/3/2024 3 UNIT 03 Course Outcomes (CO) Course Outcomes (CO) Bloom’s Knowledge Level (KL) Course outcome: After completion of this course students will be able to: CO 1 Choose and apply the most suitable search algorithm for a given problem to find the goal state. K3 CO2 Comprehend and apply feature engineering and data visualization concepts. K3 CO3 Critically analyze the strengths and weaknesses of various regression and classification algorithms. K5 CO4 Develop approaches that incorporates appropriate clustering algorithms to solve a specific data clustering K3 problem. CO5 Analyze the efficiency using the ensemble learning techniques, probabilistic learning and reinforcement learning K4 algorithms. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 4 Syllabus Lecture Unit Module s Introduction to AI and problem-solving methods Introduction to AI and Intelligent agent, Different Approaches of AI, Problem Solving by searching Techniques: Uninformed search- BFS, DFS, Iterative deepening, Bi directional search, Unit-I Informed search- Iterative deepening, Bi directional search, Heuristic search, Greedy Best First Search, A* search, Local Search Algorithms- Hill Climbing and Simulated Annealing Adversial Search- Game Playing- minimax, alpha-beta pruning, constraint satisfaction problems Machine Learning & Feature Engineering Introduction to Machine Learning, Types of Machine Learning, Feature Engineering: Features and their types, handing missing data, Dealing with Unit-II categorical features, Working with features: Feature Scaling, Feature selection, Feature Extraction: Principal Component Analysis (PCA) algorithm Supervised Learning Regression & Classification: Types of regression (Univariate, Multivariate, Polynomial), Mean Square Error, R square error, Logistic Regression, Unit Regularization: Bias and Variance, Overfitting and Underfitting, L1 and L2 Regularization, Regularized Linear Regression, Decision Trees (ID3, C4.5, III CART), Confusion matrix, k-folds cross-validation, K Nearest Neighbour, Support vector machine. Unsupervised Machine Learning Unit Introduction to clustering, Types of clustering: K-means clustering, K-mode, K-medoid, hierarchical clustering, single-linkage, multiple linkage, AGNES IV and DIANA algorithms, Gaussian mixture models density based clustering, DBSCAN Ensemble & Reinforcement Learning Probabilistic learning: Bayesian Learning, Naive Bayes Classifier, Bayesian belief networks, Ensembles Learning: Random Forest, Gradient Boosting, Unit V XGBoost., Reinforcement Learning: Introduction to reinforcement learning, models of reinforcement learning: Markov decision process, Q-learning. 11/3/2024 Dr. Raju, Assistant Prof. 
(CSE (AIML)) UNIT 03 5 Course Contents / Syllabus UNIT-III Supervised Learning 8 Hours Regression & Classification: Types of regression (Univariate, Multivariate, Polynomial), Mean Square Error, R square error, Logistic Regression, Regularization: Bias and Variance, Overfitting and Underfitting, L1 and L2 Regularization, Regularized Linear Regression, Decision Trees (ID3, C4.5, CART), Confusion matrix, k-folds cross-validation, K Nearest Neighbour, Support vector machine. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 6 Regression Regression is a supervised learning technique which helps in finding the correlation between variables and enables us to predict the continuous output variable based on the one or more predictor variables. Regression refers to a type of predictive modeling technique used to estimate the relationships among variables. It involves predicting a continuous outcome variable based on one or more predictor variables (features). It is a statistical method to model the relationship between a dependent (target) and independent (predictor) variables with one or more independent variables. It predicts continuous/real values such as temperature, age, salary, price, etc. It is mainly used for prediction, forecasting, time series modeling, and determining the causal-effect relationship between variables. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 7 Regression "Regression shows a line or curve that passes through all the datapoints on target-predictor graph in such a way that the vertical distance between the datapoints and the regression line is minimum." 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 8 Terminologies Related to the Regression Analysis Dependent Variable: The main factor in Regression analysis which we want to predict or understand is called the dependent variable. It is also called target variable. Independent Variable: The factors which affect the dependent variables or which are used to predict the values of the dependent variables are called independent variable, also called as a predictor. Outliers: Outlier is an observation which contains either very low value or very high value in comparison to other observed values. An outlier may hamper the result, so it should be avoided. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 9 Terminologies Related to the Regression Analysis Multicollinearity: If the independent variables are highly correlated with each other than other variables, then such condition is called Multicollinearity. It should not be present in the dataset, because it creates problem while ranking the most affecting variable. Underfitting and Overfitting: If our algorithm works well with the training dataset but not well with test dataset, then such problem is called Overfitting. And if our algorithm does not perform well even with training dataset, then such problem is called underfitting. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 10 Types of Regression 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 11 Type of Regression: Univariate Univariate regression refers to a statistical technique that analyzes the relationship between a single independent variable (predictor) and a single dependent variable (outcome). The goal is to model how changes in the independent variable affect the dependent variable. 
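As a quick illustration of the univariate case (a hypothetical sketch on synthetic data, assuming numpy and scikit-learn are available; it is not part of the original slides), a single-feature linear model can be fitted and then scored with the Mean Square Error and R-squared metrics defined later in this unit:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data: salary (in thousands) roughly linear in years of experience, plus noise
rng = np.random.default_rng(42)
years = rng.uniform(0, 10, size=50).reshape(-1, 1)           # single predictor X
salary = 30 + 5 * years.ravel() + rng.normal(0, 2, size=50)  # target y = a + b*X + noise

model = LinearRegression().fit(years, salary)   # estimates intercept a and slope b
predictions = model.predict(years)

print("Intercept a:", model.intercept_)
print("Slope b:", model.coef_[0])
print("MSE:", mean_squared_error(salary, predictions))  # average squared difference
print("R^2:", r2_score(salary, predictions))            # share of variance explained

The polynomial, logarithmic, and exponential forms listed on the next slides follow the same pattern once the feature is transformed (for example with sklearn.preprocessing.PolynomialFeatures or np.log).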
Simple Linear Regression Y=a+bX+ϵ Use Case: Predicting outcomes like sales based on advertising spend Polynomial Regression Y=a+b1 X+b2 X2+b3 X3+...+bn Xn +ϵ Use Case: Modeling relationships where the effect of the independent variable changes at different levels, such as growth patterns that are quadratic. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 12 Type of Regression: Univariate Logarithmic Regression Y=a+blog(X)+ϵ Use Case: Analyzing phenomena like the relationship between income and consumption, where increases in income lead to smaller increases in consumption. Exponential Regression Y=a⋅e bX Use Case: Modeling population growth or radioactive decay. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 13 Type of Regression:Multivariate regression Multivariate regression involves the analysis of multiple independent variables to predict a single dependent variable. Multiple Linear Regression Y=a+b1X1+b2X2+...+bnXn+ϵ Use Case: Predicting a person’s weight based on height, age, and exercise frequency. Ridge Regression where λ is the regularization parameter. Use Case: Useful when there are many predictors, and you want to reduce model complexity. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 14 Type of Regression:Multivariate regression Lasso Regression Use Case: Effective for variable selection in models with a large number of predictors. Elastic Net Regression Use Case: Useful when there are many correlated variables and you want a more robust model. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 15 Mean Square Error (MSE) Mean Squared Error (MSE) is a common metric used to evaluate the performance of regression models. It measures the average squared difference between the actual (observed) values and the predicted values generated by a model. The formula for MSE is: 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 16 Mean Square Error (MSE) Observation Actual (y) Predict (Y) y-Y (y-Y)^2 1 3 2.5 0.5 0.25 2 -0.5 0 -0.5 0.25 3 2 2 0 0 4 7 8.5 -1.5 2.25 MSE 0.6875 The Mean Squared Error (MSE) for this dataset is 0.6875. This value gives us an indication of how well the predicted values match the actual values, with a lower MSE representing better model performance. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 17 Importance of MSE Performance Evaluation: MSE provides a quantitative measure of how well a regression model predicts outcomes. A lower MSE indicates a better fit to the data, meaning the model's predictions are closer to the actual values. Sensitivity to Outliers: Since MSE squares the errors, it gives greater weight to larger errors. This sensitivity makes it effective for detecting models that may not perform well on extreme values, although it can also make the metric overly influenced by outliers. Optimization Objective: Many machine learning algorithms, particularly those based on gradient descent (e.g., linear regression, neural networks), use MSE as the loss function to minimize during training. By minimizing MSE, models learn to make better predictions. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 18 Importance of MSE Comparative Analysis: MSE allows for easy comparison between different models or algorithms. By evaluating multiple models using MSE, practitioners can select the one with the best performance based on this metric. Interpretable Metric: Although MSE itself is in squared units of the target variable, it is straightforward to interpret. 
When paired with the square root (resulting in Root Mean Squared Error, RMSE), it can be expressed in the same units as the target variable, enhancing interpretability. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 19 R-squared R-squared, also known as the coefficient of determination, is a statistical measure that indicates how well the independent variables in a regression model explain the variability of the dependent variable. It provides an indication of the goodness of fit of the model. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 20 Interpretation of R-squared Range: R-squared values range from 0 to 1. 0: Indicates that the model does not explain any variability in the dependent variable (the mean of the dependent variable is the best predictor). 1: Indicates that the model explains all the variability in the dependent variable (perfect prediction). Value Interpretation: An R² value of 0.70, for example, suggests that 70% of the variability in the dependent variable can be explained by the independent variables in the model, while 30% is attributed to other factors or random noise.. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 21 Linear Regression (Beyond The Syllabus) Linear regression is a statistical regression method which is used for predictive analysis. It is one of the very simple and easy algorithms which works on regression and shows the relationship between the continuous variables. It is used for solving the regression problem in machine learning. Linear regression shows the linear relationship between the independent variable (X- axis) and the dependent variable (Y-axis), hence called linear regression. If there is only one input variable (x), then such linear regression is called simple linear regression. And if there is more than one input variable, then such linear regression is called multiple linear regression. The relationship between variables in the linear regression model can be explained using the below image. Here we are predicting the salary of an employee on the basis of the year of experience. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 22 Linear Regression: Some popular applications of linear regression are: Analyzing trends and sales estimates Salary forecasting Real estate prediction Arriving at ETAs in traffic. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 23 Colab Link for Linear Regression: https://colab.research.google.com/drive/1Ol5Rvfj- DDNR4vngO2EkZcfFaAM4jItw#scrollTo=0X7hGyLc11EZ 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 24 Logistic Regression: Logistic regression is another supervised learning algorithm which is used to solve the classification problems. In classification problems, we have dependent variables in a binary or discrete format such as 0 or 1. Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or No, True or False, Spam or not spam, etc. It is a predictive analysis algorithm which works on the concept of probability. Logistic regression is a type of regression, but it is different from the linear regression algorithm in the term how they are used. Logistic regression uses sigmoid function or logistic function which is a complex cost function. This sigmoid function is used to model the data in logistic regression. The function can be represented as: 11/3/2024 Dr. Raju, Assistant Prof. 
(CSE (AIML)) UNIT 03 25 Logistic Regression: Logistic regression is another supervised learning algorithm which is used to solve the classification problems. In classification problems, we have dependent variables in a binary or discrete format such as 0 or 1. Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or No, True or False, Spam or not spam, etc. It is a predictive analysis algorithm which works on the concept of probability. Logistic regression is a type of regression, but it is different from the linear regression algorithm in the term how they are used. Logistic regression uses sigmoid function or logistic function which is a complex cost function. This sigmoid function is used to model the data in logistic regression. The function can be represented as: 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 26 Logistic Regression: 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 27 Types of Logistic Regression: Binary(0/1, pass/fail) Multi(cats, dogs, lions) Ordinal(low, medium, high) 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 28 Binary logistic regression Binary logistic regression predicts the relationship between the independent and binary dependent variables. Some examples of the output of this regression type may be, success/failure, 0/1, or true/false. Examples: Deciding on whether or not to offer a loan to a bank customer: Outcome = yes or no. Evaluating the risk of cancer: Outcome = high or low. Predicting a team’s win in a football match: Outcome = yes or no. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 29 Multinomial logistic regression A categorical dependent variable has two or more discrete outcomes in a multinomial regression type. This implies that this regression type has more than two possible outcomes. Examples: Let’s say you want to predict the most popular transportation type for 2040. Here, transport type equates to the dependent variable, and the possible outcomes can be electric cars, electric trains, electric buses, and electric bikes. Predicting whether a student will join a college, vocational/trade school, or corporate industry. Estimating the type of food consumed by pets, the outcome may be wet food, dry food, or junk food. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 30 Ordinal logistic regression Ordinal logistic regression applies when the dependent variable is in an ordered state (i.e., ordinal). The dependent variable (y) specifies an order with two or more categories or levels. Examples: Dependent variables represent, Formal shirt size: Outcomes = XS/S/M/L/XL Survey answers: Outcomes = Agree/Disagree/Unsure Scores on a math test: Outcomes = Poor/Average/Good 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 31 Key advantages of logistic regression 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 32 Colab Link for Linear Regression: Case Study: Predicting Diabetes Using Logistic Regression https://colab.research.google.com/drive/1QhRP1lhm05y6- I_T6uSz2CphiMcLKDYo#scrollTo=IHo9w80zCKoR 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 33 Regularization Regularization is a technique used in machine learning and statistics to prevent overfitting, which occurs when a model learns the noise in the training data instead of the underlying patterns. Purpose: To improve model generalization and performance on unseen data by reducing overfitting. Common Types: L1 Regularization (Lasso): Adds a penalty equal to the absolute value of the coefficients. 
It can produce sparse models by driving some coefficients to zero, effectively selecting features. L2 Regularization (Ridge): Adds a penalty equal to the square of the coefficients. It tends to shrink coefficients evenly, preventing any one feature from having too much influence. Elastic Net: Combines both L1 and L2 penalties, allowing for feature selection and coefficient shrinkage. Dr. Raju, Assistant Prof. (CSE (AIML)) 11/3/2024 UNIT 03 34 Regularization Techniques: Dropout: Randomly sets a fraction of neurons to zero during training in neural networks, which helps prevent co-adaptation. Early Stopping: Involves monitoring the model's performance on a validation set and stopping training when performance begins to degrade. Benefits: Helps to avoid overfitting. Encourages simpler models, which are often more interpretable. Can improve prediction accuracy on new data. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 35 Regularization: Bias and Variance An error is a measure of how accurately an algorithm can make predictions for the previously unknown dataset. Reducible errors: These errors can be reduced to improve the model accuracy. Such errors can further be classified into bias and Variance. Irreducible errors: These errors will always be present in the model 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 36 Regularization: Bias and Variance Bias: a difference between prediction values made by the model and actual values/expected values is known as bias errors or Errors due to bias. Low Bias: A low bias model will make fewer assumptions about the form of the target function. High Bias: A model with a high bias makes more assumptions, and the model becomes unable to capture the important features of our dataset. A high bias model also cannot perform well on new data. Some examples of machine learning algorithms with low bias are Decision Trees, k-Nearest Neighbours and Support Vector Machines. At the same time, an algorithm with high bias is Linear Regression, Linear Discriminant Analysis and Logistic Regression. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 37 Regularization: Bias and Variance Ways to reduce High Bias: Increase the input features as the model is underfitted. Decrease the regularization term. Use more complex models, such as including some polynomial features. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 38 Regularization: Bias and Variance Variance The variance would specify the amount of variation in the prediction if the different training data was used. It tells that how much a random variable is different from its expected value. Ideally, a model should not vary too much from one training dataset to another, which means the algorithm should be good in understanding the hidden mapping between inputs and output variables. Variance errors are either of low variance or high variance. Low variance means there is a small variation in the prediction of the target function with changes in the training data set. At the same time, High variance shows a large variation in the prediction of the target function with changes in the training dataset. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 39 Regularization: Bias and Variance A model with high variance has the below problems: A high variance model leads to overfitting. Increase model complexities. Ways to Reduce High Variance: Reduce the input features or number of parameters as a model is overfitted. Do not use a much complex model. Increase the training data. 
Increase the Regularization term. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 40 Regularization: Bias and Variance Different Combinations of Bias-Variance 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 41 Regularization: Bias and Variance Different Combinations of Bias-Variance Low-Bias, Low-Variance: The combination of low bias and low variance describes an ideal machine learning model, but it is rarely achievable in practice. Low-Bias, High-Variance: With low bias and high variance, predictions are accurate on average but inconsistent. This case occurs when the model learns a large number of parameters and therefore overfits. High-Bias, Low-Variance: With high bias and low variance, predictions are consistent but inaccurate on average. This case occurs when the model does not learn the training dataset well or uses too few parameters; it leads to underfitting. High-Bias, High-Variance: With high bias and high variance, predictions are both inconsistent and inaccurate on average. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 42 Regularization 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 43 Regularization: Overfitting Overfitting is an undesirable machine learning behavior in which the model gives accurate predictions on the training data but not on new data; it corresponds to high variance and low bias. Reasons for Overfitting: The training data size is too small and does not contain enough samples to represent all possible input values. The training data contains large amounts of irrelevant information, called noisy data. The model trains for too long on a single sample set of data. The model complexity is high, so it learns the noise within the training data. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 44 Regularization: Overfitting Symptom of Overfitting: Low training error but high validation error. The model fits the training data very well but fails to generalize to new data. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 45 Regularization: Underfitting Underfitting occurs when a model is too simple to capture the complexities in the data. It reflects the model's inability to learn the training data effectively, resulting in poor performance on both the training and testing data. In simple terms, an underfit model is inaccurate, especially when applied to new, unseen examples. It mainly happens when a very simple model with overly simplified assumptions is used. To address underfitting, we need more complex models, richer feature representation, and less regularization. An underfitting model has high bias and low variance. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 46 Regularization: Underfitting Reasons for Underfitting: The model is too simple, so it may not be capable of representing the complexities in the data. The input features used to train the model are not an adequate representation of the underlying factors influencing the target variable. The size of the training dataset is not large enough. Excessive regularization is used to prevent overfitting, which constrains the model too much to fit the data well. Features are not scaled. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 47 Regularization: Underfitting Techniques to Reduce Underfitting: Increase model complexity. Increase the number of features, performing feature engineering. Remove noise from the data.
Increase the number of epochs or increase the duration of training to get better results. Symptoms: High training error and high validation error. The model does not fit the training data well, leading to poor predictions even on known data. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 48 Regularization: Underfitting vs. Overfitting 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 49 Regularization: L1 (Lasso) LASSO (Least Absolute Shrinkage and Selection Operator) is a regression analysis method that incorporates L1 regularization. It's particularly useful for models that can benefit from feature selection and addressing multicollinearity. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 50 Key Components of L1 (Lasso) L1 Regularization: LASSO adds a penalty equal to the absolute value of the coefficients multiplied by a tuning parameter (λ) to the loss function. The objective function becomes: Feature Selection: LASSO tends to shrink some coefficients to exactly zero when the tuning parameter λ is sufficiently large. This means it effectively selects a simpler model by excluding some features, which can improve interpretability. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 51 Key Components of L1 (Lasso) Bias-Variance Tradeoff: By introducing the penalty, LASSO can help reduce overfitting, making the model more generalizable to unseen data. However, this can introduce some bias. Tuning Parameter (λ): The choice of λ is crucial. A smaller λ leads to a model closer to ordinary least squares (OLS), while a larger λ increases the amount of shrinkage and feature selection. Techniques like cross-validation are commonly used to find an optimal λ. Applications: LASSO is widely used in scenarios with many predictors, especially when some of them might be irrelevant or redundant. It's popular in fields like genetics, finance, and machine learning. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 52 Example of L1 (Lasso) Example Scenario Imagine you have a dataset with information about houses, and you want to predict the price based on various features: Features Size (square feet) Number of bedrooms Age of the house Number of bathrooms Proximity to downtown Garage size 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 53 Example of L1 (Lasso) Dataset https://colab.research.google.com/drive/1jeqwzMN5hpfasIFu6J9vu b0BSBfonQEm#scrollTo=Dw5LvNSEy0nY 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 54 Regularization: L1 (Lasso) https://colab.research.google.com/drive/1pD_uCPkmYla71GerEoN56i zmX_ubaol0#scrollTo=3LMS-djesEHk 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 55 Regularization: L2 (Ridge) It is a machine learning technique that avoids overfitting by introducing a penalty term into the model's loss function based on the squares of the model's parameters. The goal of L2 regularization is to keep the model's parameter sizes short and prevent oversizing. In order to achieve L2 regularization, a term that is proportionate to the squares of the model's parameters is added to the loss function. This word works as a limiter on the parameters' size, preventing them from growing out of control. A hyperparameter called lambda that controls the regularization's intensity also controls the size of the penalty term. The parameters will be smaller and the regularization will be stronger the greater the lambda. 11/3/2024 Dr. Raju, Assistant Prof. 
(CSE (AIML)) UNIT 03 56 Regularization: L2 (Ridge) A regression model that uses the L2 regularization technique is called Ridge regression. Ridge regression adds the "squared magnitude" of the coefficients as a penalty term to the loss function (L). 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 57 Regularization: L1 (Lasso) and L2 (Ridge) Link for Code: https://colab.research.google.com/drive/1YkOjpnMJdYWvn0oqjJUCKebbqredDlFH#scrollTo=ityALR23ZF-R Link for Dataset: https://www.kaggle.com/anthonypino/melbourne-housing-market 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 58 Benefits of Regularization Reduces Overfitting: Regularization helps prevent models from learning noise and irrelevant details in the training data. Improves Generalization: By discouraging complex models, regularization ensures better performance on unseen data. Enhances Stability: Regularization stabilizes model training by penalizing large weights. Enables Feature Selection: L1 regularization can zero out some coefficients, effectively selecting more relevant features. Manages Multicollinearity: Reduces the problem of high correlations among features, particularly useful in linear models. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 59 Benefits of Regularization Encourages Simplicity: Promotes simpler models that are easier to interpret and less likely to overfit. Controls Model Complexity: Provides a mechanism to balance the complexity of the model with its performance on the training and test data. Facilitates Robustness: Makes models less sensitive to individual peculiarities in the training set. Improves Convergence: Helps optimization algorithms converge more quickly and reliably by smoothing the error landscape. Adjustable Complexity: The strength of regularization can be tuned to fit the data's specific needs and desired model complexity. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 60 Confusion Matrix A confusion matrix is a tool used in machine learning to evaluate the performance of a classification model. It provides a visual representation of the actual versus predicted classifications, helping to understand the types of errors made by the model. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 61 Confusion Matrix True Positive (TP): The number of correct predictions that an instance is positive. True Negative (TN): The number of correct predictions that an instance is negative. False Positive (FP): The number of incorrect predictions where an instance is predicted as positive, but it is actually negative (Type I error). False Negative (FN): The number of incorrect predictions where an instance is predicted as negative, but it is actually positive (Type II error). 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 62 Confusion Matrix 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 63 Key Metrics Derived from the Confusion Matrix 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 64 Key Metrics Derived from the Confusion Matrix 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 65 Case Study: Medical Diagnosis for Diabetes Context: A healthcare provider develops a machine learning model to predict whether patients have diabetes based on various health metrics. After training and validating the model, they evaluate its performance using a confusion matrix.
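Before walking through the numbers of this case study, here is a small hypothetical sketch (scikit-learn assumed; the label arrays are made up for illustration) of how a confusion matrix and the metrics derived from it are obtained in code, with 1 standing for "diabetic" and 0 for "non-diabetic":

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # actual classes
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]   # model predictions

# For binary 0/1 labels, ravel() returns TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)

print("Accuracy :", accuracy_score(y_true, y_pred))    # (TP + TN) / total
print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1-score :", f1_score(y_true, y_pred))          # harmonic mean of precision and recall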
Data Overview: The model was tested on a dataset of 1,000 patients, with the following results: Actual Diabetic Patients: 150 Actual Non-Diabetic Patients: 850 Confusion Matrix 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 66 Case Study: Medical Diagnosis for Diabetes Interpretation of the Confusion Matrix True Positives (TP): 120 patients were correctly identified as diabetic. False Negatives (FN): 30 patients were diabetic but were incorrectly classified as non-diabetic. False Positives (FP): 50 patients were classified as diabetic, but they were actually non-diabetic. True Negatives (TN): 800 patients were correctly identified as non-diabetic. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 67 Case Study: Medical Diagnosis for Diabetes Key Metrics Accuracy = 92% Precision = 70.6% Recall = 80% F1-Score = 75% Insights and Analysis High Accuracy: The model has a high accuracy of 92%, which might seem promising, but accuracy alone does not give a complete picture, especially in medical diagnosis where the costs of false negatives can be significant. Precision vs. Recall: With a precision of 70.6%, the model does reasonably well when it predicts diabetes. However, a recall of 80% indicates that some diabetic patients are still being missed, which can be critical in healthcare settings. Focus on Recall: Given that missing a diabetic patient can lead to serious health complications, healthcare providers might prioritize improving recall, even if it results in a lower precision (increased false positives). 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 69 Decision Tree A Decision Tree is a machine learning model used for both classification and regression tasks. It represents decisions and their possible consequences in a tree-like structure, which makes it easy to visualize and interpret. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 70 Decision Trees Definitions Root node: The first node, from which all decisions initially start. It has no parent node and splits into two child nodes. Decision nodes: Nodes that have one parent node and split into child nodes (decision or leaf nodes). Leaf nodes: Nodes that have one parent but do not split further (also known as terminal nodes). They are the nodes that produce the prediction. Branches: A subsection of the entire tree (also known as a sub-tree). Parent / child nodes: A node that is divided into sub-nodes is called a parent node; the sub-nodes are the child nodes of the parent from which they were divided. Maximum depth: The maximum number of branches between the root and the lowest leaf. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 71 Types of Decision Tree Algorithms There are multiple decision tree algorithms: ID3 (Iterative Dichotomiser 3) C4.5 (extension of ID3) CART (Classification And Regression Tree) Chi-square (Chi-square automatic interaction detection) MARS (Multivariate Adaptive Regression Splines) Under CART (Classification And Regression Tree), decision trees are grouped into two types:
Classification decision tree (used for categorical data) Regression decision tree (used for continuous data) 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 72 Decision Tree Metrics: Information Gain Information Gain (IG) To produce the "best" result, decision trees aim at maximizing the Information Gain (IG) after each split. The information gain of a single node is calculated by subtracting its entropy from 1. The information gain helps define whether the split produces purer nodes compared to the parent node. To measure the information gain of a parent against its children, we subtract the weighted entropy of the children from the entropy of the parent. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 73 Decision Tree Metrics: Information Gain Entropy Entropy is used to measure the quality of a split for categorical targets. The formula of entropy in decision trees is: Entropy = -Σ pi log2(pi), where pi represents the proportion of class i in the node. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 74 Decision Tree Metrics: Information Gain 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 75 Decision Tree Metrics: Information Gain To calculate the entropy of a split, we: Calculate the entropy of the parent node. Calculate the entropy of the children nodes. Calculate the weighted average entropy of the split. If the weighted entropy is smaller than the entropy of the parent node, then the information gain is greater. In our case, the entropy of the parent equals 1 and the weighted entropy equals 1.09. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 76 Decision Tree Metrics: Gini Impurity Gini impurity is used as an alternative to information gain (IG) to compute the homogeneity of a leaf in a less computationally intensive way. The purer, or more homogeneous, a node is, the smaller the Gini impurity is. The Gini Index is a metric that measures how often a randomly chosen element would be incorrectly identified, so an attribute with a lower Gini index should be preferred. Scikit-learn supports the "gini" criterion for the Gini Index, and it is the default value. The formula for the calculation of the Gini Index is: Gini = 1 - Σ (pi)^2. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 77 Decision Tree Metrics 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 78 Features and characteristics of the Gini Index It is calculated by summing the squared probabilities of each outcome in a distribution and subtracting the result from 1. A lower Gini Index indicates a more homogeneous or pure distribution, while a higher Gini Index indicates a more heterogeneous or impure distribution. In decision trees, the Gini Index is used to evaluate the quality of a split by measuring the difference between the impurity of the parent node and the weighted impurity of the child nodes. Compared to other impurity measures like entropy, the Gini Index is faster to compute and more sensitive to changes in class probabilities. One disadvantage of the Gini Index is that it tends to favour splits that create equally sized child nodes, even if they are not optimal for classification accuracy. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 79 Advantages of Decision Tree Easy to understand and interpret, making them accessible to non-experts. Handle both numerical and categorical data without requiring extensive preprocessing. Provide insights into feature importance for decision-making. Handle missing values and outliers without significant impact.
Applicable to both classification and regression tasks. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 80 Disadvantages of Decision Tree Disadvantages include the potential for overfitting Sensitivity to small changes in data, limited generalization if training data is not representative Potential bias in the presence of imbalanced data. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 81 Colab implementation of Decision Tree https://colab.research.google.com/drive/1PvJ7IfGna4AJm9pOG_wguT_BJjP7koTJ #scrollTo=YJGJ6KUvQxIg 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 82 Classification using the ID3 algorithm Consider whether a dataset based on which we will determine whether to play football or not. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 83 Classification using the ID3 algorithm The independent variables are Outlook, Temperature, Humidity, and Wind. The dependent variable is whether to play football or not. Step 1: Find the parent node for decision tree. Find the entropy of the class variable. From the above data for outlook we can arrive at the following table easily 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 84 Classification using the ID3 algorithm Now we have to calculate average weighted entropy. The next step is to find the information gain. It is the difference between parent entropy and average weighted entropy we found above. Similarly find Information gain for Temperature, Humidity, and Windy. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 85 Classification using the ID3 algorithm Now select the feature having the largest entropy gain. Here it is Outlook. So it forms the first node(root node) of our decision tree. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 86 Classification using the ID3 algorithm To Continue: https://medium.datadriveninvestor.com/decision-tree-algorithm- with-hands-on-example-e6c2afb40d38 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 87 K-fold cross-validation 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 88 K-fold cross-validation Data Partitioning: K-fold cross-validation divides the dataset into k equal-sized subsets (or "folds"), allowing the model to be trained and validated multiple times. Each fold serves as the validation set once while the remaining folds are used for training. Performance Estimation: The model is trained and evaluated k times, and the performance metrics (like accuracy, precision, recall, etc.) are averaged across all folds. This provides a more reliable estimate of the model's performance on unseen data. Mitigation of Overfitting: By using multiple train-test splits, k- fold cross-validation helps reduce overfitting, ensuring that the model generalizes well to new data rather than just memorizing the training set. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 89 K-fold cross-validation Flexibility in Choice of k: The value of k can be adjusted based on the size of the dataset and the computational resources available. Common choices are k = 5 or k = 10, but smaller or larger values can be used depending on specific needs. Stratification Option: In cases of imbalanced datasets, stratified k-fold cross- validation can be employed to ensure that each fold has a similar distribution of classes, which helps maintain the representativeness of each fold. 11/3/2024 Dr. Raju, Assistant Prof. 
(CSE (AIML)) UNIT 03 90 How K-Fold Cross-Validation Works Data Splitting: The entire dataset is randomly divided into k subsets or "folds" of approximately equal size. Common values for k are 5 or 10, but it can be adjusted based on the size of the dataset. Model Training and Validation: The model is trained and validated k times. In each iteration: One fold is used as the validation set, while the remaining k-1 folds are used for training. This process ensures that every instance in the dataset is used for both training and validation at least once. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 91 How K-Fold Cross-Validation Works Performance Measurement: After completing all k iterations, the performance metrics (such as accuracy, precision, recall, F1 score, etc.) are averaged across all iterations. This average provides a more comprehensive view of the model's performance. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 92 Advantages of K-Fold Cross-Validation Better Generalization: By using multiple train-test splits, k-fold cross-validation reduces the risk of overfitting and provides a better estimate of how the model will perform on unseen data. Efficient Use of Data: Since each instance in the dataset is used for both training and validation, it maximizes the utilization of available data, which is particularly important for smaller datasets. Variance Reduction: Averaging the results over multiple folds helps to smooth out the variability that can occur with a single train-test split. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 93 Disadvantages of K-Fold Cross-Validation Computationally Intensive: K-fold cross-validation can be computationally expensive, especially for large datasets or complex models, as the model must be trained k times. Choice of K: Selecting an appropriate value for k is crucial. A very small k may lead to high variance in the performance estimate, while a very large k can be computationally expensive. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 94 K-Nearest Neighbour 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 95 K-Nearest Neighbour 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 96 K-Nearest Neighbour K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on Supervised Learning technique. K-NN algorithm assumes the similarity between the new case/data and available cases and put the new case into the category that is most similar to the available categories. K-NN algorithm stores all the available data and classifies a new data point based on the similarity. This means when new data appears then it can be easily classified into a well suite category by using K- NN algorithm. K-NN algorithm can be used for Regression as well as for Classification but mostly it is used for the Classification problems. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 97 K-Nearest Neighbour K-NN is a non-parametric algorithm, which means it does not make any assumption on underlying data. It is also called a lazy learner algorithm because it does not learn from the training set immediately instead it stores the dataset and at the time of classification, it performs an action on the dataset. KNN algorithm at the training phase just stores the dataset and when it gets new data, then it classifies that data into a category that is much similar to the new data. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 98 How does K-NN work? 
Step-1: Select the number K of the neighbors Step-2: Calculate the Euclidean distance of K number of neighbors Step-3: Take the K nearest neighbors as per the calculated Euclidean distance. Step-4: Among these k neighbors, count the number of the data points in each category. Step-5: Assign the new data points to that category for which the number of the neighbor is maximum. Step-6: Our model is ready. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 99 How to select the value of K in the K-NN Algorithm? There is no particular way to determine the best value for "K", so we need to try some values to find the best out of them. The most preferred value for K is 5. A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of outliers in the model. Large values for K are good, but it may find some difficulties. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 100 Distance Metrics Used in KNN Algorithm From the Minkowski, when p = 2 then it is the same as the formula for the Euclidean distance and when p = 1 then we obtain the formula for the Manhattan distance. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 101 Advantages of the KNN Algorithm Dataset: Let's consider a simple dataset with two features (height and weight) and a binary target variable indicating whether a person is "Athlete" (1) or "Non-Athlete" (0). 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 102 Steps of KNN process 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 103 Steps of KNN process 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 104 Steps of KNN process 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 105 Steps of KNN process 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 106 Advantages of the KNN Algorithm Easy to implement as the complexity of the algorithm is not that high. Adapts Easily – As per the working of the KNN algorithm it stores all the data in memory storage and hence whenever a new example or data point is added then the algorithm adjusts itself as per that new example and has its contribution to the future predictions as well. Few Hyperparameters – The only parameters which are required in the training of a KNN algorithm are the value of k and the choice of the distance metric which we would like to choose from our evaluation metric. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 107 Disadvantages of the KNN Algorithm Does not scale – As we have heard about this that the KNN algorithm is also considered a Lazy Algorithm. The main significance of this term is that this takes lots of computing power as well as data storage. This makes this algorithm both time-consuming and resource exhausting. Curse of Dimensionality – There is a term known as the peaking phenomenon according to this the KNN algorithm is affected by the curse of dimensionality which implies the algorithm faces a hard time classifying the data points properly when the dimensionality is too high. Prone to Overfitting – As the algorithm is affected due to the curse of dimensionality it is prone to the problem of overfitting as well. Hence generally feature selection as well as dimensionality reduction techniques are applied to deal with this problem. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 108 Colab Link :KNN Algorithm https://colab.research.google.com/drive/1IgRx863lKCxLXnINK5wiNjBJoGi WBCO-#scrollTo=jGLVgSF8rlx1 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 109 Support Vector Machine 11/3/2024 Dr. 
Raju, Assistant Prof. (CSE (AIML)) UNIT 03 110 Support Vector Machine Support Vector Machine (SVM), also known as Support Vector Classification. It is a supervised and linear Machine Learning technique typically used to solve classification problems. SVR stands for Support Vector Regression and is a subset of SVM that uses the same ideas to tackle regression problems. SVM also supports the kernel method called the kernel SVM, which allows us to tackle non- linearity. The primary use case for SVM is classification, but it can solve classification and regression problems. SVM constructs a hyperplane (see the picture below) in multidimensional space to separate different classes. It iteratively generates the best hyperplane to minimize classification error. The goal of SVM is to find a maximum marginal hyperplane (MMH) that splits a dataset into classes as evenly as possible. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 111 Support Vector Machine Support Vectors are data points closest to the hyperplane called support vectors. These points will define the separating line better by calculating margins and are more relevant to the construction of the classifier. A hyperplane is a decision plane that separates objects with different class memberships. Margin is the distance between the two lines on the class points closest to each other. It is calculated as the perpendicular distance from the line to support vectors or nearest points. The bold margin between the classes is good, whereas a thin margin is not good. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 112 Types of Support Vector Machine 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 113 Types of Support Vector Machine Linear SVM or Simple SVM It is used for data that is linearly separable. A dataset is termed linearly separable data if it can be classified into two classes using a single straight line, and the classifier is known as the linear SVM classifier. It’s most commonly used for tasks involving linear regression and classification. Nonlinear SVM or Kernel SVM It is also known as Kernel SVM, is a type of SVM that is used to classify nonlinearly separated data, or data that cannot be classified using a straight line. It has more flexibility for nonlinear data because more features can be added to fit a hyperplane instead of a two-dimensional space. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 114 SVM Kernels Some problems can’t be solved using a linear hyperplane because they are non- linearly separable. In such a situation, SVM uses a kernel trick to transform the input space into a higher-dimensional space. There are different types of SVM kernels depending on the kind of problem. Linear Kernel is a regular dot product for two observations. The sum of the multiplication of each pair of input values is the product of two vectors. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 115 SVM Kernels Polynomial Kernel is a more generalized form of Linear Kernel. The polynomial Kernel can tell if the input space is curved or nonlinear. The d is the degree of the polynomial. If the d = 1, then it is similar to the linear transformation. The degree needs to be manually specified in the learning algorithm. Radial Basis Function Kernel can map an input space into an infinite-dimensional space. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 116 SVM Kernels Radial Basis Function Kernel can map an input space into an infinite-dimensional space. 
Here gamma is a parameter that typically takes small positive values, often between 0 and 1. A higher gamma value fits the training dataset very closely, which causes overfitting. A gamma of 0.1 is considered a good default value. Like the polynomial degree, the gamma value needs to be specified manually in the learning algorithm. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 117 Regularization Regularization: Bias and Variance, Overfitting and Underfitting, L1 and L2 Regularization, Regularized Linear Regression, Decision Trees (ID3, C4.5, CART), Confusion matrix, k-folds cross-validation, K Nearest Neighbour, Support vector machine. 11/3/2024 Dr. Raju, Assistant Prof. (CSE (AIML)) UNIT 03 118
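To tie the closing topics together, here is a brief hypothetical sketch (scikit-learn and its built-in Iris dataset assumed; the C, gamma, and k values are illustrative, not recommendations) of an RBF-kernel SVM evaluated with stratified k-fold cross-validation:

from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Feature scaling matters for SVMs because the RBF kernel depends on distances between points
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma=0.1))

# Stratified 5-fold keeps the class distribution similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())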