21CSC307P Machine Learning for Data Analytics
UNIT I

Course Objectives
◦ Understand aspects of human learning
◦ Become acquainted with the primitives of the learning process by computer
◦ Develop linear learning models and classification in machine learning
◦ Implement clustering techniques and their utilization in machine learning
◦ Implement tree-based machine learning techniques and appreciate their capability

Course Outcomes
◦ Demonstrate knowledge of learning algorithms and concept learning through implementation for sustainable solutions of applications
◦ Evaluate different algorithms on well-formulated problems and state valid conclusions that the evaluation supports
◦ Formulate a given problem within the Bayesian learning framework with a focus on building lifelong learning ability
◦ Analyze research-based problems using machine learning techniques and apply different clustering algorithms to generic datasets and specific multidisciplinary domains
◦ Evaluate decision tree learning algorithms

Reference Books
◦ Ethem Alpaydin, "Introduction to Machine Learning", MIT Press, Fourth Edition, 2020.
◦ Stephen Marsland, "Machine Learning: An Algorithmic Perspective", Second Edition, CRC Press, 2014.
◦ Kevin P. Murphy, "Machine Learning: A Probabilistic Perspective", MIT Press, 2012.
◦ Tom Mitchell, "Machine Learning", McGraw-Hill, 1997.
◦ Sebastian Raschka, Vahid Mirjalili, "Python Machine Learning and Deep Learning", 2nd Edition, Kindle edition, 2018.
◦ Carol Quadros, "Machine Learning with Python, scikit-learn and TensorFlow", Packt Publishing, 2018.
◦ Gavin Hackeling, "Machine Learning with scikit-learn", Packt Publishing / O'Reilly, 2018.

Career Opportunities

Unit I Topics
◦ Introduction
◦ Machine Learning: What & Why?
◦ Examples of Machine Learning
◦ Training versus Testing
◦ Positive and Negative Class
◦ Cross-validation
◦ Types of Learning: Supervised, Unsupervised and Semi-Supervised Learning
◦ The Curse of Dimensionality
◦ Overfitting and Underfitting
◦ Linear Regression Applications
◦ Bias and Variance Tradeoff
◦ Regularization - Learning Curve
◦ Classification
◦ Error and Noise
◦ Parametric vs. Non-parametric Models
◦ Linear Algebra for Machine Learning

Unit I Lab Tasks
◦ T1: Building programs to work with data pre-processing in Python
◦ T2: Building programs to work with linear regression in Python
◦ T3: Building programs to work with cross-validation in Python

Introduction
What is Learning?
◦ A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Introduction
The Task, T
◦ If we want a robot to be able to walk, then walking is the task.
◦ "Learning is our means of attaining the ability to perform the task."
◦ We could program the robot to learn to walk, or we could directly write a program that manually specifies how to walk.
◦ Some of the most common machine learning tasks include the following:
  ◦ Classification
  ◦ Regression
  ◦ Machine translation
  ◦ Transcription

Introduction
The Performance Measure, P
◦ In order to evaluate a machine learning algorithm, we must measure its performance.
◦ For tasks such as classification, we often measure the accuracy of the model.
◦ Accuracy is the proportion of examples for which the model produces the correct output.
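As a minimal illustration of accuracy as "the proportion of correct outputs", here is a short sketch (not part of the original slides; the label lists are invented values):

```python
# Minimal sketch: accuracy as the proportion of correct predictions.
# y_true / y_pred below are made-up illustrative values.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model outputs

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(f"Accuracy = {accuracy:.2f}")   # 6 correct out of 8 -> 0.75
```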
Introduction
The Experience, E
Machine learning algorithms can be broadly categorized as unsupervised or supervised according to the kind of experience they are allowed to have during the learning process.
◦ Unsupervised learning algorithms experience a dataset containing many features, then learn useful properties of the structure of this dataset.
◦ Supervised learning algorithms experience a dataset containing features, but each example is also associated with a label or target.

Introduction
Machine learning is a branch of artificial intelligence that enables algorithms to uncover hidden patterns within datasets, allowing them to make predictions on new, similar data without explicit programming for each task. Traditional machine learning combines data with statistical tools to predict outputs, yielding actionable insights. The technology finds applications in diverse fields such as image and speech recognition, natural language processing, recommendation systems, fraud detection, portfolio optimization, and task automation.

Training vs Testing
Testing Data
After a model has been created using training data, it must be evaluated on data it has never seen. This unseen data is known as testing data; it is used to assess how well the algorithm's training worked and to guide tuning or optimization for better results. Testing data should:
❖ be representative of the original dataset, and
❖ be large enough to produce reliable estimates.
The test dataset needs to be "unseen" and recent, because the training data has already been "learned" by the model. By observing how the model performs on fresh test data, you can decide whether it is working well or whether it needs more training data to meet your standards. Test data provides a final, real-world check of whether the machine learning algorithm was trained correctly.

Training Data
Testing data is used to measure the performance of the trained model, whereas training data is used to train the machine learning model. Training data is the fuel that powers the model, and it is usually larger than the testing data, because more data generally yields more effective predictive models. When a machine learning algorithm receives data from our records, it recognizes patterns and builds a decision-making model. In this way, a company's past experience can be used to make decisions: the algorithm analyzes previous cases and their outcomes and uses them to score and predict the outcome of current cases. The more data ML models have access to, the more reliable their predictions become over time.

Positive and Negative Class
◦ True Positive (TP): the result we get when we correctly predict the positive class.
◦ False Positive (FP): the outcome we get when we predict a negative example as the positive class.
◦ True Negative (TN): the result we get when we correctly predict the negative class.
◦ False Negative (FN): the outcome we get when we predict a positive example as the negative class.

Accuracy is the metric that measures how often our model predicts the right result. Accuracy can be formulated as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Suppose we have a tumor-classification model that made predictions on 100 cases with the following results:
◦ TP: 3 (malignant correctly predicted)
◦ TN: 88 (benign correctly predicted)
◦ FP: 1 (benign falsely predicted as malignant)
◦ FN: 8 (malignant falsely predicted as benign)
That means Accuracy = (3 + 88) / (3 + 88 + 1 + 8) = 91/100 = 0.91. Our model achieved 91% accuracy, which looks like a decent result.

But there is a serious problem here. The model got 88 of 89 benign tumors right (88 TN, 1 FP) but only 3 of 11 malignant tumors right (3 TP, 8 FN), an error rate of roughly 8/11, or about 72%, on the malignant class. This can happen because of class-imbalanced datasets.
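A short sketch (not from the slides) that reproduces the arithmetic above and adds per-class recall, which exposes the imbalance problem that plain accuracy hides:

```python
# Sketch: accuracy vs. per-class recall for the tumor example above.
# Counts are the slide's hypothetical confusion-matrix values.
TP, TN, FP, FN = 3, 88, 1, 8

accuracy = (TP + TN) / (TP + TN + FP + FN)
recall_malignant = TP / (TP + FN)   # how many malignant tumors were caught
recall_benign = TN / (TN + FP)

print(f"Accuracy           = {accuracy:.2f}")          # 0.91
print(f"Recall (malignant) = {recall_malignant:.2f}")  # 0.27 -> the hidden problem
print(f"Recall (benign)    = {recall_benign:.2f}")     # 0.99
```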
Cross Validation
Cross-validation is a technique for validating model efficiency by training the model on a subset of the input data and testing it on a previously unseen subset of the input data. We can also say that it is a technique for checking how well a statistical model generalizes to an independent dataset.

In machine learning there is always a need to test the stability of the model: we cannot judge a model using only the dataset it was fitted on. For this purpose, we reserve a particular sample of the dataset that is not part of the training data; we then test the model on that sample before deployment. This complete process is called cross-validation, and it goes beyond the simple train-test split.

The three steps involved in cross-validation are as follows:
◦ Reserve some portion of the sample dataset.
◦ Train the model using the rest of the dataset.
◦ Test the model using the reserved portion of the dataset.

Types of Cross Validation
◦ Validation set approach
◦ LOOCV (leave-one-out cross-validation)
◦ K-fold cross-validation

Validation set approach
In the validation set approach we divide the input dataset into a training set and a test (validation) set, with each subset typically given about 50% of the data. A big disadvantage is that only half the data is used to train the model, so the model may miss important information in the dataset; it also tends to produce an underfitted model. The procedure is:
◦ Randomly split the dataset in a certain ratio (generally 70-30 or 80-20 is preferred).
◦ Train the model on the training set.
◦ Apply the resulting model to the validation set.
◦ Calculate the model's accuracy from the prediction error using model performance metrics.

LOOCV
In this approach, for each learning set only one data point is reserved for testing, and the remaining dataset is used to train the model. The process is repeated for each data point, so for n samples we get n different training sets and n test sets. It has the following features:
◦ Bias is minimal, since nearly all the data points are used for training.
◦ The process is executed n times, so execution time is high.
◦ It leads to high variance in the estimate of model effectiveness, because we iteratively test against a single data point.

K-fold cross-validation
The K-fold cross-validation approach divides the input dataset into K groups of samples of equal size, called folds. For each learning set, the prediction function uses K-1 folds and the remaining fold is used as the test set. This is a very popular CV approach because it is easy to understand and its output is less biased than that of other methods.

Steps for K-fold cross-validation:
◦ Split the input dataset into K groups.
◦ For each group:
  ◦ take that group as the reserved test dataset,
  ◦ use the remaining groups as the training dataset,
  ◦ fit the model on the training set and evaluate its performance on the test set.

Take 5-fold cross-validation as an example: the dataset is grouped into 5 folds. In the 1st iteration, the first fold is reserved for testing the model and the rest are used for training. In the 2nd iteration, the second fold is used to test the model and the rest are used for training. The process continues until every fold has been used once as the test fold. In each iteration we get an accuracy score; at the end we average these scores to report the model's performance.
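A minimal sketch of the K-fold procedure just described, assuming scikit-learn is available; the dataset and estimator are illustrative choices, not mandated by the slides (this corresponds to lab task T3):

```python
# Sketch: 5-fold cross-validation with scikit-learn.
# Each fold is used once as the test set, the other 4 folds for training.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression())

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("Per-fold accuracy:", scores.round(3))
print("Mean accuracy    :", scores.mean().round(3))
```

Swapping `cv` for `sklearn.model_selection.LeaveOneOut()` would give the LOOCV behaviour described above.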
Supervised Learning
Supervised learning is the category of machine learning algorithms that are based on a labelled dataset. These algorithms perform predictive analytics: the outcome of the algorithm, known as the dependent variable, depends on the values of the independent data variables. The model is built from the training dataset and improves through iterations.

Consider yourself as a student sitting in a classroom where your teacher supervises you, telling you how to solve the problem and whether you are doing it correctly. Likewise, in supervised learning the input is provided as a labelled dataset, and the model learns from it to provide the result of the problem easily.

Regression vs. Classification
Classification
A classification algorithm predicts a discrete value: the input data can be thought of as a member of a particular class or group. For instance, in a photo dataset of fruit, each photo is labelled as a mango, an apple, and so on, and the algorithm has to classify new images into one of these categories. Example algorithms:
◦ Naive Bayes classifier
◦ Support Vector Machines
◦ Logistic Regression
Typical classification targets:
❖ A person's gender (male or female)
❖ Brand of product purchased (brand A, B, or C)
❖ Whether a person defaults on a debt (yes or no)
❖ Cancer diagnosis (Acute Myelogenous Leukemia, Acute Lymphoblastic Leukemia, or No Leukemia)

Regression
Regression problems deal with continuous outputs. For example, predicting the price of a piece of land in a city given its area, location, number of rooms, and so on: the inputs are sent to the model, which estimates the price of the land based on previous examples. Example algorithms:
◦ Linear Regression
◦ Nonlinear Regression
◦ Bayesian Linear Regression

Unsupervised Learning
Unsupervised learning is a type of machine learning in which models are trained on an unlabelled dataset and are allowed to act on that data without any supervision. The models are not supervised with a labelled training dataset; instead, they find the hidden patterns and insights in the given data themselves. It can be compared to the learning that takes place in the human brain when learning something new. Unsupervised learning cannot be applied directly to a regression or classification problem because, unlike supervised learning, we have input data but no corresponding output data. The goal of unsupervised learning is to find the underlying structure of the dataset, group the data according to similarities, and represent the dataset in a compressed format.
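A small sketch of unsupervised learning in the sense just described: grouping unlabelled data by similarity. The choice of k-means and of synthetic data is illustrative only (assumes scikit-learn):

```python
# Sketch: unsupervised learning on unlabelled data with k-means clustering.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabelled data: we deliberately ignore the generated labels.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)   # groups points purely by similarity

print("Cluster sizes:", [int((cluster_ids == k).sum()) for k in range(3)])
print("Cluster centers:\n", kmeans.cluster_centers_.round(2))
```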
Supervised vs Unsupervised Learning

Semi-Supervised Learning
Semi-supervised learning is a type of machine learning that falls between supervised and unsupervised learning. It is a method that uses a small amount of labelled data together with a large amount of unlabelled data to train a model. As in supervised learning, the goal is to learn a function that can accurately predict the output variable from the input variables; unlike supervised learning, however, the algorithm is trained on a dataset that contains both labelled and unlabelled data. Semi-supervised learning is particularly useful when a large amount of unlabelled data is available but labelling all of it would be too expensive or difficult.

The Curse of Dimensionality
The curse of dimensionality in machine learning arises when working with high-dimensional data, leading to increased computational complexity, overfitting, and spurious correlations. Techniques like dimensionality reduction, feature selection, and careful model design are essential for mitigating its effects and improving algorithm performance. Navigating this challenge is crucial for unlocking the potential of high-dimensional datasets and ensuring robust machine learning solutions.

The curse of dimensionality refers to the phenomenon in which the efficiency and effectiveness of algorithms deteriorate as the dimensionality of the data increases. In high-dimensional spaces data points become sparse, making it difficult to discern meaningful patterns or relationships, because the amount of data required to sample the space adequately grows exponentially with the number of dimensions. The curse of dimensionality affects machine learning algorithms in several ways: it increases computational complexity, training time, and resource requirements, and it raises the risk of overfitting and spurious correlations, which hinders the algorithms' ability to generalize to unseen data.

To overcome the curse of dimensionality, consider the following strategies:

Dimensionality reduction techniques:
◦ Feature selection: identify and select the most relevant features from the original dataset while discarding irrelevant or redundant ones. This reduces the dimensionality of the data, simplifying the model and improving its efficiency.
◦ Feature extraction: transform the original high-dimensional data into a lower-dimensional space by creating new features that capture the essential information. Techniques such as Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) are commonly used for feature extraction (see the sketch after this list).

Data preprocessing:
◦ Normalization: scale the features to a similar range to prevent certain features from dominating others, especially in distance-based algorithms.
◦ Handling missing values: address missing data appropriately through imputation or deletion to ensure robustness in the model training process.
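A brief sketch of the feature-extraction idea above using PCA; the digits dataset and the choice of two components are illustrative (assumes scikit-learn):

```python
# Sketch: reducing a 64-dimensional dataset to 2 principal components with PCA.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)           # 1797 samples, 64 features
X_scaled = StandardScaler().fit_transform(X)  # normalize features first

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print("Original shape:", X.shape)    # (1797, 64)
print("Reduced shape :", X_2d.shape) # (1797, 2)
print("Variance kept :", pca.explained_variance_ratio_.sum().round(3))
```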
Overfitting
A statistical model is said to be overfitted when it does not make accurate predictions on testing data. When a model is trained on so much detail that it starts learning from the noise and inaccurate entries in the dataset, testing on new data results in high variance: the model fails to categorize the data correctly because it has absorbed too many details and too much noise.

The causes of overfitting are often non-parametric and non-linear methods, because these types of machine learning algorithms have more freedom in building the model around the dataset and can therefore build unrealistic models. Solutions to avoid overfitting include using a linear algorithm when the data is linear, or using parameters such as the maximal depth when using decision trees. In a nutshell, overfitting is a problem where a machine learning algorithm's performance on the training data differs markedly from its performance on unseen data.

Reasons for overfitting:
◦ High variance and low bias.
◦ The model is too complex.
◦ The size of the training data is too small.

Techniques to reduce overfitting:
◦ Improve the quality of the training data, so the model focuses on meaningful patterns and the risk of fitting noise or irrelevant features is mitigated.
◦ Increase the amount of training data, which improves the model's ability to generalize to unseen data and reduces the likelihood of overfitting.
◦ Reduce model complexity.
◦ Stop early during the training phase (watch the loss over the training period; as soon as the loss begins to increase, stop training).
◦ Apply Ridge regularization or Lasso regularization.
◦ Use dropout for neural networks to tackle overfitting.

Underfitting
A statistical model or machine learning algorithm is said to underfit when it is too simple to capture the complexities in the data. Underfitting represents the inability of the model to learn the training data effectively, resulting in poor performance on both the training and the testing data. In simple terms, an underfit model is inaccurate, especially when applied to new, unseen examples. It mainly happens when we use a very simple model with overly simplified assumptions. To address underfitting we need more complex models, richer feature representations, and less regularization.

Reasons for underfitting:
◦ The model is too simple, so it may not be capable of representing the complexities in the data.
◦ The input features used to train the model are not an adequate representation of the underlying factors influencing the target variable.
◦ The size of the training dataset is not large enough.
◦ Excessive regularization is used to prevent overfitting, which constrains the model too much for it to capture the data well.
◦ Features are not scaled.

Techniques to reduce underfitting:
◦ Increase model complexity.
◦ Increase the number of features, e.g. by performing feature engineering.
◦ Remove noise from the data.
◦ Increase the number of epochs or the duration of training to get better results.

Overfitting and Underfitting
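To make the overfitting/underfitting contrast concrete, here is a small sketch (an illustrative construction, not from the slides) comparing an overly simple, a reasonable, and an overly flexible polynomial fit on noisy data; the degrees and data are arbitrary choices:

```python
# Sketch: underfitting vs. overfitting on noisy data generated from a sine curve.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)  # signal + noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):   # too simple, reasonable, too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    tr_err = mean_squared_error(y_tr, model.predict(X_tr))
    te_err = mean_squared_error(y_te, model.predict(X_te))
    print(f"degree={degree:2d}  train MSE={tr_err:.3f}  test MSE={te_err:.3f}")
```

A low-degree fit has high error on both sets (underfitting), while a very high-degree fit drives the training error down but typically leaves the test error higher (overfitting).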
Bias and Variance
Bias: bias is the error due to overly simplistic assumptions in the learning algorithm. These assumptions make the model easier to comprehend and learn, but they may fail to capture the underlying complexities of the data. Bias is the error caused by the model's inability to represent the true relationship between input and output accurately. When a model performs poorly on both the training and the testing data, it has high bias because the model is too simple, indicating underfitting.

Variance: variance, on the other hand, is the error due to the model's sensitivity to fluctuations in the training data. It is the variability of the model's predictions across different training sets. High variance occurs when a model learns the training data's noise and random fluctuations rather than the underlying pattern. As a result, the model performs well on the training data but poorly on the testing data, indicating overfitting.

Bias and Variance Tradeoff

Linear Regression
Linear regression is one of the ways to perform predictive analysis; it is used to examine regression estimates:
◦ to predict the outcome from a set of predictor variables, and
◦ to determine which predictor variables have the greatest influence on the outcome variable.
The regression estimates explain the relationship between one dependent variable and one or more independent variables.

Linear regression is a statistical regression method used for predictive analysis. It is one of the simplest algorithms: it works on regression, shows the relationship between continuous variables, and is used for solving regression problems in machine learning. Linear regression shows the linear relationship between the independent variable (X-axis) and the dependent variable (Y-axis), hence the name "linear regression".

Simple linear regression: there is only one input variable (x).
Multiple linear regression: there is more than one input variable, i.e. the relationship is between one continuous dependent variable and two or more independent variables.

Example: predicting the salary of an employee on the basis of years of experience.

The mathematical equation for linear regression is

Y = aX + b

where Y is the dependent (target) variable, X is the independent (predictor) variable, and a and b are the linear coefficients. The dependent variable in regression is also called the outcome variable, criterion variable, or endogenous variable; the independent variable can also be called an exogenous variable.

Properties of the regression line:
1. The line minimizes the sum of squared differences between the observed values (actual y values) and the predicted values (ŷ values).
2. The line passes through the means of the independent and dependent features.

Positive linear regression: the line slopes upward, so Y increases as X increases. Negative linear regression: the line slopes downward, so Y decreases as X increases.

Linear Regression: Mathematical Implementation
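A minimal sketch of fitting Y = aX + b in Python, in the spirit of lab task T2; the salary/experience numbers below are invented purely for illustration (assumes NumPy and scikit-learn):

```python
# Sketch: simple linear regression, salary (Y) vs. years of experience (X).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])        # years of experience
y = np.array([30, 35, 42, 48, 53, 60, 66, 71], dtype=float)   # salary (thousands), invented

model = LinearRegression().fit(X, y)
a, b = model.coef_[0], model.intercept_   # Y = aX + b

print(f"Fitted line: Y = {a:.2f} * X + {b:.2f}")
print("Predicted salary for 10 years of experience:",
      model.predict([[10]])[0].round(1))
```

The fitted line satisfies the two properties listed above: it minimizes the sum of squared residuals and passes through the point (mean of X, mean of Y).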
Regularization
Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function, discouraging the model from assigning too much importance to individual features or coefficients. The role of regularization in Python modelling can be summarized as follows:
◦ Complexity control: regularization helps control model complexity by preventing overfitting to the training data, resulting in better generalization to new data.
◦ Preventing overfitting: regularization penalizes large coefficients and constrains their magnitudes, preventing the model from becoming overly complex and memorizing the training data instead of learning its underlying patterns.
◦ Balancing bias and variance: regularization helps balance the trade-off between model bias (underfitting) and model variance (overfitting), which leads to improved performance.
◦ Feature selection: some regularization methods, such as L1 regularization (Lasso), promote sparse solutions that drive some feature coefficients to zero. This automatically selects important features while excluding less important ones.
◦ Handling multicollinearity: when features are highly correlated (multicollinearity), regularization can stabilize the model by reducing the sensitivity of the coefficients to small changes in the data.
◦ Generalization: regularized models learn the underlying patterns of the data instead of memorizing specific examples, which gives better generalization to new data.

Regularization - Learning Curve

Classification
The classification algorithm is a supervised learning technique used to identify the category of new observations on the basis of training data. In classification, a program learns from a given dataset or set of observations and then classifies new observations into one of a number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog. Classes can also be called targets, labels, or categories. Unlike regression, the output variable of classification is a category rather than a value, for example "green or blue" or "fruit or animal". Since classification is a supervised learning technique, it takes labelled input data, meaning each input comes with the corresponding output. In a classification algorithm, a discrete output function y is mapped to the input variable x.

Error and Noise
Noisy data are data that carry a large amount of additional meaningless information called noise. This includes corrupted data, and the term is often used as a synonym for corrupt data; it also includes any data that a user's system cannot understand and interpret correctly (many systems, for example, cannot use unstructured text). Noisy data can adversely affect the results of any data analysis and skew conclusions if not handled properly; statistical analysis is sometimes used to weed the noise out of noisy data.

Noisy data are data that are corrupted, distorted, or have a low signal-to-noise ratio. Improper (or improperly documented) procedures for subtracting the noise from data can lead to a false sense of accuracy or to false conclusions. Conceptually,

Data = true signal + noise

Noisy data unnecessarily increase the amount of storage space required and can adversely affect the results of any data mining analysis. Statistical analysis can use information from historical data to weed out noisy data and facilitate data mining.

Parametric vs Non-parametric Models
Parametric methods are statistical techniques that rely on specific assumptions about the underlying distribution of the population being studied. These methods typically assume that the data follow a known probability distribution, such as the normal distribution, and estimate the parameters of this distribution from the available data. The basic idea behind a parametric method is that there is a fixed set of parameters that determines a probability model, and this idea is used in machine learning as well. Parametric methods are those for which we know a priori that the population is normal, or for which we can approximate it well by a normal distribution, which is possible by invoking the Central Limit Theorem. The parameters of the normal distribution are:
◦ Mean
◦ Standard deviation
Ultimately, whether a method is classified as parametric depends entirely on the assumptions that are made about the population.
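As a tiny sketch of the parametric idea just described, a "model" can be nothing more than an assumed distribution plus its estimated parameters; the sample data below are invented (assumes NumPy and SciPy):

```python
# Sketch: a parametric model = a fixed-form distribution plus estimated parameters.
# Here we assume a normal distribution and estimate its two parameters from data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=170, scale=8, size=500)   # invented height data (cm)

mu, sigma = sample.mean(), sample.std(ddof=1)     # the model's only parameters
model = stats.norm(loc=mu, scale=sigma)

print(f"Estimated mean = {mu:.1f}, std = {sigma:.1f}")
print("P(height > 185 cm) under the fitted model:", round(1 - model.cdf(185), 3))
```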
Parametric vs Non-parametric Models
Parametric statistical tests:
◦ t-test: tests for the difference between the means of two independent groups.
◦ ANOVA: tests for the difference between the means of three or more groups.
◦ F-test: compares the variances of two groups.
◦ Chi-square test: tests for relationships between categorical variables.
◦ Correlation analysis: measures the strength and direction of the linear relationship between two continuous variables.
Parametric machine learning models:
◦ Linear regression: predicts a continuous outcome based on a linear relationship with one or more independent variables.
◦ Logistic regression: predicts a binary outcome (e.g., yes/no) based on a set of independent variables.
◦ Naive Bayes: classifies data points based on Bayes' theorem, assuming independence between features.
◦ Hidden Markov Models: model sequential data with hidden states and observable outputs.

Non-parametric methods are statistical techniques that do not rely on specific assumptions about the underlying distribution of the population being studied. They are often referred to as "distribution-free" methods because they make no assumptions about the shape of the distribution. The basic idea behind a non-parametric method is that there is no need to assume any parameters for the population being studied; in fact, the methods do not depend on the population. There is no fixed set of parameters, and no distribution (normal or otherwise) of any kind is assumed, which is also why non-parametric methods are referred to as distribution-free methods. Non-parametric methods have been gaining popularity and influence for several reasons:
◦ There is no need to satisfy the distributional assumptions required by parametric methods.
◦ We do not need to make many assumptions about the population we are working with.
◦ Most non-parametric methods are very easy to apply and to understand, i.e. their complexity is low.

Non-parametric statistical tests:
◦ Mann-Whitney U test: tests for the difference between the medians of two independent groups.
◦ Kruskal-Wallis test: tests for the difference between the medians of three or more groups.
◦ Spearman's rank correlation: measures the strength and direction of the monotonic relationship between two variables.
◦ Wilcoxon signed-rank test: tests for the difference between the medians of two paired samples.
Non-parametric machine learning models:
◦ K-Nearest Neighbors (KNN): classifies data points based on the k nearest neighbors.
◦ Decision trees: make classifications based on a series of yes/no questions about the features.
◦ Support Vector Machines (SVM): create a decision boundary that maximizes the margin between different classes.
◦ Neural networks: can be designed with specific architectures to handle non-parametric data, such as convolutional neural networks for image data and recurrent neural networks for sequential data.
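To contrast the two families in code, here is a small sketch (the models, data, and settings are illustrative choices, assuming scikit-learn) fitting a parametric model (linear regression, a fixed form with two parameters) and a non-parametric model (KNN, which keeps the training data itself) to the same one-dimensional problem:

```python
# Sketch: parametric (LinearRegression) vs. non-parametric (KNN) regression
# on the same noisy nonlinear data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(1)
X = np.sort(rng.uniform(0, 3, 120)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=120)

linear = LinearRegression().fit(X, y)               # fixed form: y = aX + b
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)  # no fixed form, stores the data

for name, model in [("Parametric (linear)", linear), ("Non-parametric (KNN)", knn)]:
    mse = mean_squared_error(y, model.predict(X))
    print(f"{name:22s} training MSE = {mse:.4f}")
```

The linear model summarizes the data in just two numbers, while KNN must retain every training point to make predictions, which mirrors the parametric versus non-parametric distinction described above.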