week05-introMachineLearning.pdf
Machine Learning: An Introduction
Dr Yeo Wee Kiang
IS5126 Hands-on with Applied Analytics
Unauthorized distribution or sharing of these materials is strictly prohibited.

Overview
- What is machine learning?
- Applications of ML
- The ML Process: the different stages
- Types of ML: supervised, unsupervised, semi-supervised
- Supervised Learning: classification, regression
- Unsupervised Learning: applications, clustering techniques
- Supervised Learning workflow: steps involved
- Model Validation or Evaluation: train/validation/test split, train/test split, k-fold cross validation, confusion matrix, types of errors, metrics for evaluating classification models
- Common issues in ML: imbalanced datasets, 'dirty' data, irrelevant data, lack of relevant data

Machine Learning: What is it?
Machine Learning is the science of getting computers to learn patterns and trends, and to improve that learning over time in an autonomous and iterative fashion, by providing them with data and information in the form of observations and real-world interactions. It is a method of analysing data that automates model building: it allows computers to find hidden insights without being explicitly programmed step-by-step.

Applications of Machine Learning
- Fraud detection
- Product recommendations
- Natural Language Processing
- Predicting customer or employee churn
- Customer segmentation
- Image recognition and object detection
- New pricing models
- Financial modelling

Machine Learning Process
[Figure: the different stages of the machine learning process]

Types of Machine Learning
- Supervised Learning (all data are labelled): the task of learning a function that maps an input to an output based on examples of input-output pairs.
- Semi-supervised Learning (a small portion of the data are labelled; a large portion are unlabelled): uses both labelled and unlabelled data to train a model. It can improve the performance of a model when only limited labelled data are available.
- Unsupervised Learning (all data are unlabelled): draws inferences from datasets consisting of input data without labelled responses.

Supervised Learning: Two Types
- Classification
- Regression

Supervised Learning: What is the supervisor?
Datasets suitable for supervised learning contain known class labels. Datasets without known class labels can only be used for unsupervised learning.

Classification versus Regression
Classification is used when the prediction outcome is categorical, e.g. cats or dogs. Regression is used when the prediction outcome is numerical, e.g. stock prices or temperature.
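To make the distinction concrete, here is a minimal sketch (not taken from the slides) that fits a classifier and a regressor in scikit-learn on synthetic data; the datasets and model choices are illustrative assumptions, not prescribed ones.

```python
# Rough sketch: the same kind of tabular features can feed either task;
# what changes is the target type and the model family.
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: the target is categorical (e.g. 0 = "cat", 1 = "dog").
X_cls, y_cls = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X_cls, y_cls)
print(clf.predict(X_cls[:3]))   # discrete class labels (0s and 1s)

# Regression: the target is numerical (e.g. a stock price or a temperature).
X_reg, y_reg = make_regression(n_samples=200, n_features=5, random_state=0)
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict(X_reg[:3]))   # continuous values
```

The only point of the sketch is that the classifier outputs discrete labels while the regressor outputs continuous values.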
Unsupervised Learning: For data without class labels
Unsupervised learning does not require labelled data. It discovers patterns and structures in the data independently. Key differences from supervised learning: there are no predefined target labels, and the model learns data patterns without explicit supervision.
- Use of predefined target labels: supervised learning relies on labelled data with predefined target outcomes; unsupervised learning lacks predefined target labels and operates solely on the data.
- Learning guidance: supervised algorithms are guided by labelled examples during training; unsupervised algorithms autonomously identify patterns and structures in the data.
- Focus and objective: supervised learning primarily focuses on specific prediction tasks with known labels; unsupervised learning is more exploratory and open-ended, revealing insights within the data.

Applications of unsupervised learning (clustering)
- Customer segmentation: grouping customers based on their purchasing behaviour for targeted marketing.
- Anomaly detection in network security: identifying unusual patterns in network traffic to detect cyber threats.
- Healthcare data analysis: identifying patient groups with similar medical histories for personalized treatment plans.

Clustering is a fundamental unsupervised learning technique: it groups similar data points together.

Clustering techniques
- K-Means Clustering: partitions the data into K clusters based on similarity, minimizing intra-cluster distance and maximizing inter-cluster distance. Good for initial clustering exploration.
- Hierarchical Clustering: builds a dendrogram to represent the hierarchy of clusters, starting with individual data points and merging them. Reveals relationships between clusters at different levels.
https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html
https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html

Supervised Learning Workflow
[Figure: the different stages of the supervised learning workflow]

Model Validation or Evaluation (Supervised Learning)
Model validation or evaluation is the process of assessing the performance of a trained machine learning model on a held-out dataset. This is important because it helps us understand how well the model will generalize to new data. Common approaches: train/validation/test split, train/test split, and k-fold cross validation.

Train, Validation, Test Split
One common way to validate a model is to split the dataset into three parts: train, validation, and test. The train set is used to train the model, the validation set is used to tune the model's hyperparameters, and the test set is used to evaluate the final model. The rows of the original dataset are split randomly into these subsets: typically the train set comprises approximately 70% of the rows, and the remaining rows are split randomly between the validation set and the test set.
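A minimal sketch of the split just described, assuming a feature matrix X and labels y are already available; the 70/15/15 proportions and the use of scikit-learn's train_test_split are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)  # stand-in dataset

# First split off roughly 70% of the rows as the train set ...
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=42)

# ... then divide the remaining 30% evenly into validation and test sets.
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 700 / 150 / 150
```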
[Figure: illustration of splitting a dataset into train, validation, and test sets]

Train, Test Split
Another common way to validate a model is to split the dataset into two parts: train and test. The train set is used to train the model, and the test set is used to evaluate the final model. This approach is simpler than the train/validation/test split, but it can be less reliable, especially for small datasets. The rows of the original dataset are split randomly into a train set and a test set: typically the train set comprises approximately 70% of the rows, and the remaining rows become the test set.

k-fold Cross Validation
k-fold cross-validation is a more robust method for model validation. The dataset is split into k folds (e.g. k = 5). Each fold in turn is held out for evaluation while the model is trained on the remaining k-1 folds; this process is repeated for all k folds. The average performance of the model across all folds is used as the final evaluation metric.
https://scikit-learn.org/stable/modules/cross_validation.html

Confusion Matrix
A confusion matrix is a table that summarizes the performance of a classification model on a test set. Each column of the confusion matrix represents a predicted class, and each row represents an actual class. The diagonal elements count correctly classified samples; the off-diagonal elements count incorrectly classified samples.
- Actual Class 1, predicted Class 1: True Positive (TP), a correct classification
- Actual Class 1, predicted Class 2: False Negative (FN), a Type 2 error
- Actual Class 2, predicted Class 1: False Positive (FP), a Type 1 error
- Actual Class 2, predicted Class 2: True Negative (TN), a correct classification

There are two main types of error in classification models:
- False positives (Type 1 error) are samples that are predicted to belong to a class but do not actually belong to that class.
- False negatives (Type 2 error) are samples that are not predicted to belong to a class but actually do belong to that class.

Types of Errors
[Figure: illustration of Type 1 and Type 2 errors]
https://www.statisticssolutions.com/wp-content/uploads/2017/12/rachnovblog.jpg

Metrics
Metrics for evaluating classification models:
- Accuracy
- Precision
- Recall
- F1 score
Metrics for evaluating regression models:
- Mean squared error (MSE)
- Root mean squared error (RMSE)
- Mean absolute error (MAE)
- R-squared
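As a bridge between the validation strategies above and the metrics just listed, here is a minimal sketch of 5-fold cross-validation in scikit-learn scored with F1; the toy dataset, the model, and the scoring choice are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)   # stand-in dataset
model = LogisticRegression(max_iter=1000)

# cv=5 gives the k=5 setup described above: each fold takes a turn as the
# held-out evaluation set while the other four folds are used for training.
scores = cross_val_score(model, X, y, cv=5, scoring="f1")
print(scores)          # one F1 score per fold
print(scores.mean())   # average performance across the 5 folds
```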
Metrics
There are many different metrics that can be used to evaluate classification models. Some common ones:
- Accuracy is the percentage of samples that are correctly classified.
- Precision is the percentage of predicted positive samples that are actually positive. Precision is a good metric when the cost of false positives is high.
- Recall is the percentage of actual positive samples that are predicted to be positive. Recall is a good metric when the cost of false negatives is high.
- The F1 score is the harmonic mean of precision and recall. It is a good metric when both precision and recall are important.

More about precision and recall
Precision is the fraction of predicted positives that are actually positive. High precision means the model is very accurate when it predicts the positive class, but it may not be very complete at identifying all positive samples. For example, a spam filter with high precision rarely flags a legitimate email as spam, but it may still miss some spam emails.
Recall is the fraction of actual positives that are predicted positive. High recall means the model is very complete at identifying positive samples, but it may not be very accurate when it predicts the positive class. For example, a spam filter with high recall catches nearly all spam emails, but it may also flag some legitimate emails as spam.

Accuracy
Accuracy is the most intuitive performance measure: it is simply the ratio of correctly predicted observations to the total number of observations. One may think that high accuracy means the model is good; an accuracy of 0.932, for example, means the model is approximately 93% accurate. However, accuracy is a reliable measure only on balanced datasets with an even class distribution. Since class distributions are usually uneven or imbalanced, you must often look at other metrics to evaluate the performance of your model. Accuracy also works best when false positives and false negatives have similar costs; if their costs are very different, it is better to look at both precision and recall.

F1 score
F1 score = 2 * Precision * Recall / (Precision + Recall)
The F1 score is the harmonic mean of the precision and recall of a model, so it takes both false positives and false negatives into account. It is not as intuitive as accuracy, but it is usually more useful than accuracy, especially when the class distribution is uneven. A high F1 score indicates that the model performs well on both precision and recall; a low F1 score indicates that it performs poorly on precision, recall, or both.
https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall
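A minimal sketch of these four classification metrics on a small set of made-up predictions; the scikit-learn metric functions used here follow the formulae summarised in the next section.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # actual labels (1 = positive class)
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]   # model predictions

# Note: scikit-learn orders the classes [0, 1], so the layout differs from the
# table above, where Class 1 (the positive class) is listed first.
print(confusion_matrix(y_true, y_pred))    # rows = actual, columns = predicted
print("accuracy :", accuracy_score(y_true, y_pred))    # 0.80
print("precision:", precision_score(y_true, y_pred))   # 0.75
print("recall   :", recall_score(y_true, y_pred))      # 0.75
print("f1       :", f1_score(y_true, y_pred))          # 0.75
```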
Metrics: formulae
With the confusion matrix cells defined as above (TP, FN, FP, TN), we have the following formulae:
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Precision = TP / (TP + FP)
- Recall (or Sensitivity) = TP / (TP + FN)
- F1 score = 2 * Precision * Recall / (Precision + Recall)

Confusion Matrix Example: 2-class classification

                     Prediction: Class 1    Prediction: Class 2
  Actual: Class 1    320                    43
  Actual: Class 2    20                     538

Accuracy  = (TP + TN) / (TP + TN + FP + FN) = (320 + 538) / (320 + 538 + 20 + 43) = 0.932
Precision = TP / (TP + FP) = 320 / (320 + 20) = 0.941
Recall    = TP / (TP + FN) = 320 / (320 + 43) = 0.882
F1 score  = 2 * Precision * Recall / (Precision + Recall) = 2 * 0.941 * 0.882 / (0.941 + 0.882) = 0.911

Common issues in Machine Learning
- Imbalanced datasets: the number of data points for one class label is much larger than the number for the other class labels, i.e. the class distribution is uneven.
- 'Dirty' data: almost any dataset needs some level of cleaning and pre-processing; using raw data without any prior cleaning is not advisable.
- Irrelevant data, or lack of relevant data: the data points (rows) and features (columns) in your dataset do not support the problem statement that you are trying to answer or solve.
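To illustrate the imbalanced-dataset issue, here is a minimal sketch, not part of the slides, showing how accuracy can look healthy while recall on the rare class suffers, together with one common scikit-learn mitigation (class_weight='balanced'); the synthetic dataset and model choice are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Stand-in dataset where only about 5% of the rows belong to the positive class.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for cw in (None, "balanced"):
    model = LogisticRegression(class_weight=cw, max_iter=1000).fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Accuracy alone can look good even when recall on the rare positive class
    # is poor; comparing the two metrics shows the effect of re-weighting.
    print(cw, accuracy_score(y_test, y_pred), recall_score(y_test, y_pred))
```

Re-weighting is only one option; resampling the training data is another common approach, and in every case the evaluation should rely on precision, recall, or F1 rather than accuracy alone.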