Machine Learning Lecture Notes (PDF)

Summary

These lecture notes cover a range of machine learning topics and concepts, including an introduction to machine learning, hyperparameter tuning (the gamma and C parameters), an assignment on a diabetes dataset, and evaluation metrics derived from the confusion matrix. They also feature examples of machine learning models such as linear SVM, Random Forest (RF), K-Nearest Neighbors (KNN), and Decision Tree (DT), along with a discussion of loss functions in machine learning.

Full Transcript


Lecture 12

Email: [email protected] Phone no: 9400543249
An Introduction to the World of Machine Learning
Instructor: Keerthana Vinod Kumar, PMRF Scholar, Koita Centre for Digital Health, Indian Institute of Technology Bombay

Gamma hyperparameter

Another effect of gamma is that the higher its value, the more closely the decision boundary hugs the points around it, making it more jagged and prone to overfitting; the lower its value, the smoother and more regular the decision boundary surface becomes, and the less prone it is to overfitting. This is true for any hyperplane, but it is easier to observe when separating data in higher dimensions. In some documentation, gamma is also referred to as sigma.

colab 1, colab 2

Assignment 4

Use the diabetes dataset from the OneDrive link provided to you and do the following steps:
1. Read the file and perform EDA (e.g., getting overall info, a description of the dataset, checking for null values, data visualization, data scaling).
2. Separate the features and the target variable.
3. Do a train-test split (80:20).
4. Perform classification using SVM, RF, KNN, and DT (train the classifiers on the training data and predict on the test data; see the sketch after this list).
5. Calculate the evaluation metrics (confusion matrix, accuracy, precision, F1-score, recall, TPR, FPR, ROC-AUC).
6. Store the classification results in a dataframe, evaluate which algorithm works best for the given data, and state the reason.
Upload the assignment in this link.
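Below is a minimal sketch of this workflow, assuming the file is a CSV named diabetes.csv with a binary Outcome column (hypothetical names; adjust to the actual dataset). The remaining metrics from step 5 follow the same pattern.

```python
# Sketch of the Assignment 4 workflow (hypothetical file and column names).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

df = pd.read_csv("diabetes.csv")           # step 1: read the file
X = df.drop(columns=["Outcome"])           # step 2: features
y = df["Outcome"]                          # step 2: target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # step 3: 80:20 split

scaler = StandardScaler().fit(X_train)     # scale using the training data only
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

models = {"SVM": SVC(probability=True),
          "RF": RandomForestClassifier(),
          "KNN": KNeighborsClassifier(),
          "DT": DecisionTreeClassifier()}
rows = []
for name, model in models.items():         # step 4: train and predict
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rows.append({"model": name,             # step 5: evaluation metrics
                 "accuracy": accuracy_score(y_test, pred),
                 "precision": precision_score(y_test, pred),
                 "recall": recall_score(y_test, pred),
                 "f1": f1_score(y_test, pred),
                 "roc_auc": roc_auc_score(
                     y_test, model.predict_proba(X_test)[:, 1])})
print(pd.DataFrame(rows))                   # step 6: results dataframe
```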
Lecture 11

Confusion Matrix

A confusion matrix is a matrix that summarizes the performance of a machine learning model on a set of test data. It is a means of displaying the number of accurate and inaccurate instances based on the model's predictions. It is often used to measure the performance of classification models, which aim to predict a categorical label for each input instance. The matrix displays the number of instances produced by the model on the test data.

True positives (TP): the model accurately predicts a positive data point. Example: identifying a spam email as spam.
True negatives (TN): the model accurately predicts a negative data point. Example: identifying a regular email as not spam.
False positives (FP): the model incorrectly predicts a positive label for an actual negative data point. Example: identifying a regular email as spam.
False negatives (FN): the model incorrectly predicts a negative label for an actual positive data point. Example: identifying a spam email as a regular email.

In the Dog/Not Dog example:
True Positive (TP): the total count where both the predicted and actual values are Dog.
True Negative (TN): the total count where both the predicted and actual values are Not Dog.
False Positive (FP): the total count where the prediction is Dog while the actual value is Not Dog.
False Negative (FN): the total count where the prediction is Not Dog while the actual value is Dog.

Spam example: among 200 emails, 80 are actually spam, of which the model correctly identifies 60 as spam (TP). The model misses 20 spam emails and identifies them as non-spam (FN). It incorrectly identifies 20 non-spam emails as spam (FP). Of the 120 emails that are not spam, it correctly identifies 100 as not spam (TN).

Confusion matrix terminology (a worked example follows this section)

1. Accuracy: how often is the model right? Accuracy measures the proportion of correct predictions among all predictions made. High accuracy indicates a high proportion of correct predictions, but it may not be the best metric for imbalanced datasets.
2. Precision: how often are the positive predictions correct? Precision measures the proportion of true positive predictions among all positive predictions made by the model. High precision indicates a low false positive rate, making it suitable for scenarios where false positives are costly.
3. Recall or sensitivity: can the model find all instances of the positive class? Recall measures the proportion of true positive predictions among all actual positive instances in the dataset. High recall indicates a low false negative rate, making it suitable for scenarios where false negatives are costly.
4. Specificity: the ability of the model to correctly identify negative cases.
5. F-score: the F-score (F1 score) is the harmonic mean of precision and recall, providing a single metric that balances the two and gives a comprehensive evaluation.
6. AUC-ROC: the ROC (Receiver Operating Characteristic) curve is a graph showing the performance of a classification model at different threshold levels. The curve is plotted between two parameters:
- True Positive Rate (TPR), a synonym for recall: TPR = TP / (TP + FN)
- False Positive Rate (FPR): FPR = FP / (FP + TN)

AUC measures how well a model is able to distinguish between classes. The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1. If the AUC is 0.7, there is a 70% chance that the model will be able to distinguish between the positive and negative classes. The larger the area, the better the classifier. Perfect model: a model that perfectly classifies positives and negatives will have a curve that goes straight up to the top-left corner (where TPR = 1 and FPR = 0).
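Plugging the counts from the spam example above (TP = 60, FN = 20, FP = 20, TN = 100) into these definitions gives a quick worked check; a minimal sketch:

```python
# Metrics computed from the 200-email example: TP=60, FN=20, FP=20, TN=100.
TP, FN, FP, TN = 60, 20, 20, 100

accuracy = (TP + TN) / (TP + TN + FP + FN)           # 160/200 = 0.80
precision = TP / (TP + FP)                           # 60/80  = 0.75
recall = TP / (TP + FN)                              # TPR: 60/80 = 0.75
specificity = TN / (TN + FP)                         # 100/120 ~ 0.833
fpr = FP / (FP + TN)                                 # 20/120 ~ 0.167
f1 = 2 * precision * recall / (precision + recall)   # 0.75

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} specificity={specificity:.3f} "
      f"FPR={fpr:.3f} F1={f1:.3f}")
```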
Lecture-11a, Lecture-11b

Grid search: hyperparameter tuning

GridSearchCV is a technique used for hyperparameter tuning, the process of finding the best combination of hyperparameters for a machine learning model. Hyperparameters are parameters that are set prior to the training process and are not learned from the data. Examples of hyperparameters include the number of trees in a random forest, or the kernel type and regularization parameter in a support vector machine.

A kernel function defines how the algorithm computes the similarity between pairs of data points in the input space. This similarity measure is then used to separate different classes of data points in a higher-dimensional space, enabling the model to classify them.

Kernel hyperparameter

1. Linear kernel. What it does: think of this as drawing a straight line to separate two groups of data points. When to use: this works well when the data is linearly separable, meaning it can be separated by a simple straight line.
2. Polynomial kernel. What it does: instead of drawing a straight line, this kernel draws curves or more complex shapes to separate data points. How it works: it takes the dot product (a way to measure similarity) between two points and raises it to a power, such as squaring or cubing it; this allows the model to create more flexible decision boundaries. When to use: if the relationship between the data points is non-linear but not too complicated, a polynomial kernel works well.
3. Radial Basis Function (RBF) kernel (also called the Gaussian kernel). What it does: this kernel looks at how far apart two data points are and transforms the distance into a similarity score. How it works: it measures the squared distance between two points and applies an exponential function to this distance, creating very flexible decision boundaries. When to use: if your data isn't separated by a straight line or a simple curve, and the relationship between points is quite complex, the RBF kernel works best; it is commonly used in many applications.
4. Sigmoid kernel. What it does: this kernel is based on the tanh (hyperbolic tangent) function, which is used widely in neural networks. When to use: it is not as popular as the other kernels, but it is sometimes used when you want to mimic how a neural network processes data.

In short:
Linear kernel: straight-line decision boundary for simple data.
Polynomial kernel: curved or more complex shapes for moderate complexity.
RBF kernel: highly flexible boundary for very complex, non-linear data.
Sigmoid kernel: works like a neural network; not commonly used.

The C hyperparameter

The C parameter is inversely proportional to the margin size: the larger the value of C, the smaller the margin, and conversely, the smaller the value of C, the larger the margin. The C parameter can be used with any kernel; it tells the algorithm how much to avoid misclassifying each training sample, which is why it is also known as the regularization parameter.

Gamma hyperparameter

Like C, gamma is somewhat inversely proportional to distance: the higher its value, the closer the points that are considered for the decision boundary, and the lower the gamma, the farther away points are also considered when choosing the decision boundary. As noted above, a higher gamma also makes the decision boundary hug the surrounding points more closely, making it more jagged and prone to overfitting, while a lower gamma makes the boundary surface smoother and more regular, and less prone to overfitting. This is true for any hyperplane but is easier to observe when separating data in higher dimensions. In some documentation, gamma is also referred to as sigma.

Colab file, colab file 2
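A minimal GridSearchCV sketch tying the kernel, C, and gamma hyperparameters together; the dataset and the grid values here are illustrative assumptions, not prescribed by the notes:

```python
# Hypothetical grid search over the SVM hyperparameters discussed above.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],  # the four kernels above
    "C": [0.1, 1, 10],        # larger C -> smaller margin
    "gamma": [0.01, 0.1, 1],  # larger gamma -> more jagged boundary
}
search = GridSearchCV(SVC(), param_grid, cv=5)  # 5-fold cross-validation
search.fit(X_train, y_train)
print("best params:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))
```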
Lecture 10

Identify whether each of the following is a classification problem or a regression problem:
- Predict whether a patient has "diabetes," "hypertension," or "no disease."
- Analyze customer reviews and classify them as "positive," "negative," or "neutral."
- Predict the temperature for the next day.
- Predict whether a tissue sample is "benign" or "malignant" based on genetic markers or imaging data.
- Predict whether a cell culture treated with a drug will result in "viable" or "non-viable" cells.
- Predict the level of gene expression (e.g., RNA abundance) in different conditions or tissue samples.
- Predict the IC50 (half maximal inhibitory concentration) value for a drug based on molecular structure and biological assays.
- Predict whether a protein belongs to one of several functional classes based on its amino acid sequence or structure.

Evaluation metrics in machine learning:
Accuracy score: the fraction of correct predictions, (number of correct predictions) / (total number of predictions).
Mean Squared Error (MSE): the average squared difference between predicted and actual values, MSE = (1/n) Σ (y_i − ŷ_i)².

Classification algorithms in machine learning

Supervised or unsupervised? The output variable of classification is a category, not a value. The algorithm which implements classification on a dataset is known as a classifier. Classification algorithms can be further divided into two main categories.

Logistic Regression

A supervised learning technique for predicting a categorical dependent variable from a given set of independent variables. Instead of giving an exact value of 0 or 1, logistic regression gives probabilistic values that lie between 0 and 1. Linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems. In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 or 1). The S-form curve is called the sigmoid function or the logistic function. In logistic regression we use the concept of a threshold value, which defines the probability of either 0 or 1: values above the threshold tend toward 1, and values below it tend toward 0.

Support Vector Machine (SVM)

Primarily used for classification problems. The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes, so that we can easily put a new data point in the correct category in the future. This best decision boundary is called a hyperplane. Linear SVM: there are multiple lines that can separate the classes, and SVM chooses the best one. Non-linear SVM handles data that is not linearly separable.

K-Nearest Neighbors (KNN)

One of the simplest machine learning algorithms, based on the supervised learning technique. KNN assumes similarity between the new case and the available cases, and puts the new case into the category most similar to the available categories. KNN is less sensitive to outliers compared to other algorithms. It is a non-parametric algorithm, which means it makes no assumptions about the underlying data. K represents the number of nearest neighbors that need to be considered when making a prediction:
Step 1: Select the number K of neighbors.
Step 2: Calculate the Euclidean distance to the neighbors.
Step 3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step 4: Among these K neighbors, count the number of data points in each category.
Step 5: Assign the new data point to the category for which the number of neighbors is maximum.
Step 6: Our model is ready. (Manhattan distance? Minkowski distance? These are alternative distance metrics.)
There is no particular way to determine the best value for K; the most commonly preferred value is 5. If the input data has more outliers or noise, a higher value of K is usually better. A sketch of the procedure follows below.
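A minimal from-scratch sketch of steps 1 to 5 above, using Euclidean distance on toy data (the function name and data are hypothetical):

```python
# Minimal KNN sketch following steps 1-5 above (toy data).
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # Step 2: Euclidean distance from x_new to every training point.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Step 3: indices of the K nearest neighbors.
    nearest = np.argsort(dists)[:k]
    # Steps 4-5: majority vote among the K neighbors' labels.
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([2, 2]), k=3))  # -> 0
```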
Decision Tree

It can be used for both classification and regression problems, but it is mostly preferred for solving classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome. Decision nodes are used to make decisions and have multiple branches; leaf nodes are the outputs of those decisions and do not contain any further branches. A decision tree simply asks a question and, based on the answer (yes/no), further splits the tree into subtrees. Decision trees usually mimic human thinking while making a decision, so they are easy to understand, and the logic behind a decision tree is easy to follow because of its tree-like structure.

Random Forest

It can be used for both classification and regression problems in ML. It works by creating a number of decision trees during the training phase. Each tree is constructed using a random subset of the dataset. This randomness introduces variability among individual trees, reducing the risk of overfitting and improving overall prediction performance.

Regression algorithm: linear regression

Linear regression is based on supervised learning. It is a statistical method used for predictive analysis, making predictions for continuous/real or numeric variables such as sales, salary, age, product price, etc. The linear regression algorithm models a linear relationship between a dependent variable (y) and one or more independent variables (x), hence the name linear regression. The relationship can be a positive linear relationship or a negative linear relationship. What if there are multiple features? Then it is multiple linear regression.

Clustering in machine learning

Clustering is a way of grouping data points into different clusters consisting of similar data points. Objects with possible similarities remain in a group that has few or no similarities with other groups. It is an unsupervised learning method, hence no supervision is provided to the algorithm, and it deals with unlabeled data.

K-means clustering

K-Means clustering is an unsupervised learning algorithm which groups an unlabeled dataset into different clusters. K defines the number of pre-defined clusters that need to be created in the process: if K = 2, there will be two clusters; for K = 3, there will be three clusters; and so on. A sketch follows below.
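A minimal K-means sketch with scikit-learn, assuming two obvious blobs of toy 2-D points:

```python
# Minimal K-means sketch (toy 2-D data, K=2).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],      # one blob near x=1
              [10, 2], [10, 4], [10, 0]])  # another blob near x=10
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # the two cluster centroids
```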
Exercises

What is the primary goal of classification in machine learning?
A) Predicting continuous values B) Dividing data into distinct categories C) Estimating missing values D) Maximizing the number of features in a model

Which machine learning model is commonly used for regression tasks?
A) Decision Trees B) Linear Regression C) Logistic Regression D) K-Means Clustering

In which of the following scenarios would you use a regression model?
A) Predicting whether a fruit is an apple or an orange B) Estimating the time it takes to complete a marathon C) Classifying emails as spam or not spam D) Grouping similar customers together

What type of target variable is predicted by classification algorithms?
A) Continuous values B) Categorical labels C) Probability distributions D) Time-series data

Which of the following is an example of a regression problem?
A) Predicting the likelihood of rain tomorrow (Yes/No) B) Predicting the salary of an employee based on their experience C) Identifying the species of a flower based on its features D) Predicting whether a customer will purchase a product

Logistic Regression is typically used for which type of machine learning task?
A) Regression B) Classification C) Clustering D) Dimensionality Reduction

Which of the following is an example of a binary classification problem?
A) Predicting the number of customers visiting a store B) Predicting whether a student passes or fails an exam C) Grouping customers into different segments based on spending D) Estimating the fuel efficiency of a car

Which metric is commonly used to evaluate regression models?
A) Accuracy B) Mean Squared Error (MSE) C) Precision D) Confusion Matrix

If the target variable is a continuous number, what type of problem is this?
A) Classification B) Clustering C) Regression D) Dimensionality Reduction

Naïve Bayes classifier algorithm

Bayes' theorem, also known as Bayes' rule or Bayes' law, is used to determine the probability of a hypothesis given prior knowledge. It depends on conditional probability:

P(A|B) = P(B|A) · P(A) / P(B)

where P(A|B) is the posterior probability (the probability of hypothesis A given the observed event B), P(B|A) is the likelihood (the probability of the evidence given that the hypothesis is true), P(A) is the prior probability (the probability of the hypothesis before observing the evidence), and P(B) is the marginal probability (the probability of the evidence).

Naïve: it is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features; each feature contributes individually to the identification. Bayes: it is called Bayes because it depends on the principle of Bayes' theorem. The Naïve Bayes algorithm is a supervised learning algorithm based on Bayes' theorem and used for solving classification problems. It is mainly used in text classification with high-dimensional training datasets. It is a probabilistic classifier, which means it predicts on the basis of the probability of an object. Popular applications of the Naïve Bayes algorithm include spam filtering, sentiment analysis, and classifying articles. A small sketch follows below.
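A minimal text-classification sketch in the spirit of spam filtering; the toy emails and labels are invented for illustration:

```python
# Minimal Naive Bayes text-classification sketch (hypothetical toy emails).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win a free prize now", "cheap meds free offer",
          "meeting agenda attached", "lunch tomorrow at noon"]
labels = ["spam", "spam", "not spam", "not spam"]

vec = CountVectorizer()
X = vec.fit_transform(emails)         # word counts as features
clf = MultinomialNB().fit(X, labels)  # applies Bayes' theorem per class

test = vec.transform(["free prize meeting"])
print(clf.predict(test), clf.predict_proba(test))
```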
Lecture 9

Machine learning uses algorithms and data to allow computers to learn from data and make predictions. Example: Google Assistant is trained with a lot of speech data; when you speak, it understands you and performs a certain task.

Machine learning model: slope. The slope describes what change in the value of x tends to produce in the value of y, for example Y = 2X + 3. We cannot have a linear relationship between variables all the time.

Machine learning models can be supervised or unsupervised. Clustering is an unsupervised task which involves grouping similar data points. Association is an unsupervised task that is used to find important relationships between data points.

Important concepts in machine learning: model selection, cross-fold validation, overfitting, underfitting, and more.

Model selection
a) Based on the type of data available:
1. Images and videos: CNN
2. Text data or speech data: RNN
3. Numerical data: SVM, LR, DT, etc.
b) Based on the task that needs to be carried out:
1. Classification tasks: SVM, LR, DT
2. Regression tasks: linear regression, random forest, polynomial regression
3. Clustering tasks: K-means clustering, hierarchical clustering

Cross-fold validation

The data is split into folds and the model is evaluated on each fold in turn; for example, with fold accuracies of 85%, 90%, 85%, 80%, and 85%, the average accuracy is 85%.

Overfitting

Overfitting is a common challenge in machine learning where a model learns to fit the training data too closely, capturing noise or random fluctuations rather than the underlying pattern of the data. It occurs when a model becomes overly complex, having too many parameters relative to the amount of training data available. The model has very high train accuracy but low test accuracy.

Causes of overfitting:
- Insufficient training data: when the amount of training data is limited, the model may learn to memorize the data rather than generalize from it.
- Model complexity: models with too many parameters or features can capture noise in the data instead of the underlying pattern.
- Incorrect model assumptions: if the model assumptions do not match the underlying data distribution, overfitting may result.

Detection and prevention:
- Cross-validation: use techniques like k-fold cross-validation to estimate the model's performance on unseen data and detect overfitting.
- Regularization: techniques like L1 or L2 regularization penalize large coefficients, helping to prevent overfitting by reducing model complexity.
- Feature selection: selecting relevant features and removing irrelevant ones can reduce the model's complexity and prevent overfitting.

Underfitting

Underfitting occurs when a model is too simple, which can be a result of the model needing more training time, more input features, or less regularization. The model does not learn enough. It often happens when the model has high bias, meaning it is too simplistic to capture the complexity of the data. The model has very low test accuracy.

Causes of underfitting:
- Model complexity: models that are too simplistic or have too few parameters may not have enough capacity to capture the underlying patterns in the data.
- Insufficient training: when the model is not trained for long enough or on enough data, it may fail to learn the underlying structure effectively.
- Incorrect features: if the features used to train the model do not adequately represent the underlying patterns in the data, underfitting can result.

Detection and prevention:
- Model complexity: increasing the complexity of the model by adding more parameters or layers can help mitigate underfitting.
- Choose the correct model: selecting the right model can improve performance.
- Hyperparameter tuning: adjusting hyperparameters such as the learning rate, regularization strength, or model architecture can help strike a balance between bias and variance, reducing underfitting.

Bias-variance tradeoff

Bias is the error introduced by overly simple assumptions; high-bias models underfit. Variance is the sensitivity of the model to fluctuations in the training data; high-variance models overfit. Underfitting and overfitting can be diagnosed by plotting the fit on the training data and then testing on different data; adjusting model complexity is one technique for achieving a better bias-variance tradeoff.

Loss functions in machine learning

Loss functions provide a measurable representation of how well a model's predictions align with actual outcomes. A loss function measures how well (or how badly) our model is doing. If the errors are high, the loss will be high, which means that the model is not doing a good job; conversely, the lower the loss, the better our model works. A minimal sketch follows below.
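A minimal sketch of one common loss function, mean squared error, showing that worse predictions give a higher loss:

```python
# Minimal loss-function sketch: mean squared error for a regression model.
import numpy as np

def mse(y_true, y_pred):
    # Average of squared errors: high errors -> high loss.
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, 5.0, 7.0])
good = np.array([2.9, 5.1, 7.2])  # close predictions -> low loss
bad = np.array([1.0, 9.0, 3.0])   # far predictions  -> high loss
print(mse(y_true, good))  # 0.02
print(mse(y_true, bad))   # 12.0
```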
Lecture 8

Recap exercise: scalar, vector, or matrix? For example, given the 2×4 matrix with rows 5, 8, 10, 5 and 4, −8, 3, 9, find the location of 3; and classify items such as the single value 5 (a scalar), [3 −4 8] (a vector), and the grid above (a matrix).

Statistics

What is statistics? Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data. Statistics helps us understand data in a better way. Positive correlation? Negative correlation? Example: the more it rains, the less you can water the garden (a negative correlation).

Role of ML in statistics: train an ML model with lots of data, and the model will learn patterns/relations in the data and predict accordingly. Data is an integral part of machine learning. Example: train a model on patients with and without diabetes; the model will find relations between the factors, make future predictions, and enable better analysis.

Types of data

Categorical data:
- Nominal: does not provide any quantitative value. Example: gender (male, female), types of fruit (apple, banana, orange).
- Ordinal: values follow a natural order. Example: educational levels (elementary, high school, college, graduate school).
Numerical data:
- Discrete: takes only certain values. Example: the number of students in a class, the number of cars in a parking lot.
- Continuous: often arises from measuring rather than counting and can include decimals or fractions. Example: height, weight, temperature, or time measurements.

Types of statistics

Descriptive statistics: tabular, graphical, and numerical summaries of data. The purpose of descriptive statistics is to facilitate the presentation and interpretation of data: describe the data, analyze the data. It includes measures such as mean, median, mode, range, variance, standard deviation, and percentiles.
Inferential statistics: involves making inferences or predictions about a population based on a sample of data taken from that population. Inferential statistics allows researchers to draw conclusions and make generalizations. It includes techniques such as hypothesis testing, confidence intervals, and regression analysis.

(Figure: correlation of generated AKI data.)

Types of statistical study designs

Sample study: a subset of individuals/objects from the population; data is gathered from the sample; inferences are drawn about the entire population; the sample should ideally be representative.
Observational study: observing and gathering data with no manipulation of variables; prospective or retrospective; identifies associations, not causation; watch for outliers.
Experimental study: manipulating variables while controlling other variables; random assignment to groups; establishes cause and effect.

When the population is too large, we can use sampling techniques to sample the data; in a sample study, elements are drawn from the population and parameters are calculated from the sample.
Central tendency

Central tendency refers to the idea of finding a single representative value for a dataset. It helps summarize large amounts of data into a single value. Common measures of central tendency are the mean, the median, and the mode.

Mean: the mean is also known as the average. It is calculated by adding up all the values in a dataset and dividing by the total number of values. The mean can be affected by outliers or extreme values in the dataset. Example: for the weights 60, 72, 65, 68, 74, the mean is (60 + 72 + 65 + 68 + 74) / 5 = 67.8.

Median: the median is the middle value in a dataset when the values are arranged in ascending order. It is less sensitive to outliers than the mean. If there is an even number of values, the median is the average of the two middle values. Example: the weights 60, 72, 65, 68, 74 sort to 60, 65, 68, 72, 74, so the median is 68; adding 76 gives the even-length list 60, 65, 68, 72, 74, 76, whose median is (68 + 72) / 2 = 70.

Mode: the mode is the value that appears most frequently in a dataset. Unlike the mean and median, the mode can be used with categorical data. If no value in the list is repeated, the list has no mode. Example: for the weights 60, 72, 65, 60, 74, the mode is 60.

Handling missing values: a missing value in a dataset can be replaced with the mean if the data is symmetrical, or with the median if the data is skewed; missing values of categorical data can also be replaced by the mode. A sketch of these measures follows below.

Measures of variability

Range: the range is a measure of dispersion that represents the difference between the highest and lowest values in a dataset. It provides a simple way to understand the spread of the data.
Variance: variance is a measure of dispersion that quantifies the average squared deviation of each data point from the mean. It provides a measure of how much the data values deviate from the mean.
Standard deviation: the standard deviation is a measure of dispersion that represents the average deviation of each data point from the mean. It provides a measure of the spread of the data around the mean.

Correlation: correlation is a statistical measure that describes the degree to which two variables change together. It indicates the strength and direction of the relationship between variables. It does not mean that one event is the cause of the other.

Causation: causation refers to the relationship between cause and effect, where changes in one variable directly lead to changes in another variable: if one event occurs, the other will also occur.

Probability

What is probability? Probability is a measure of the likelihood that an event will occur; it quantifies uncertainty. Key points: probability ranges from 0 to 1, where 0 means the event will not occur and 1 means the event will definitely occur; events with higher probabilities are more likely to happen; probabilities can be expressed as fractions, decimals, or percentages. Examples: a coin toss, rolling a die. Machine learning tries to find patterns in data that help predict future values; probability tries to determine whether an event will occur or not.

Random variables: a random variable represents a numerical outcome of a random phenomenon, with each outcome having a certain probability of occurring. A small simulation follows below.
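A minimal sketch computing the central-tendency and variability measures above for the weight example, using Python's statistics module:

```python
# Central tendency and variability for the weight example (60, 72, 65, 68, 74).
import statistics

weights = [60, 72, 65, 68, 74]
print(statistics.mean(weights))               # 67.8
print(statistics.median(weights))             # 68
print(statistics.mode([60, 72, 65, 60, 74]))  # 60 (the mode example list)
print(max(weights) - min(weights))            # range: 14
print(statistics.pvariance(weights))          # population variance: 24.96
print(statistics.pstdev(weights))             # population std. dev.: ~5.0
```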
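A small simulation sketch of a random variable, the number of heads in 10 fair coin tosses, with its probability estimated by repeated trials:

```python
# Simulating a random variable: number of heads in 10 fair coin tosses.
import random

random.seed(0)
trials = 10_000
heads_counts = [sum(random.random() < 0.5 for _ in range(10))
                for _ in range(trials)]
# Estimated probability of exactly 5 heads in 10 tosses
# (exact value: C(10,5) / 2**10, roughly 0.246).
print(heads_counts.count(5) / trials)
```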
