Lec2 Introduction 2024
LARGE SCALE MACHINE LEARNING
PROF. DR. BAN N. DHANNOON

Reading list:
1. Stanford University Machine Learning CS229 lecture notes by Andrew Ng.
2. Hands-On Machine Learning with Scikit-Learn and TensorFlow, 2023.
3. Machine Learning Bookcamp, 2021.
4. Machine Learning for Business Analytics, 2023.

Types of Machine Learning Techniques
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
4. Semi-Supervised Learning
5. Self-Supervised Learning
https://www.simplilearn.com/tutorials/machine-learning-tutorial/types-of-machine-learning

1. Supervised Learning (task-driven)
An ML method in which a model learns from a labeled dataset containing input-output pairs. It analyzes the training data and produces a function that can be used to map new examples.
Examples:
o Classifying Iris flowers
o Predicting house prices
Tasks/types:
- Classification (binary/multi-class): makes discrete predictions.
- Prediction/regression: makes continuous predictions.

Advantages of Supervised Learning
1. Effectiveness: can predict outcomes based on past data.
2. Simplicity: easy to understand and implement.
3. Performance evaluation: performance is easy to measure, since the ground truth (the labels) is known.
4. Applications: can be used in various fields such as finance, healthcare, and marketing.
5. Feature importance: allows an understanding of which features are most important in making predictions.

Disadvantages of Supervised Learning
1. Dependency on labeled data: requires a large amount of labeled data, which can be expensive and time-consuming to obtain.
2. Overfitting: models can become too complex and fit the noise in the training data.
3. Generalization: these models sometimes generalize poorly to unseen data.

2. Unsupervised Learning (data-driven)
An ML algorithm draws inferences from datasets consisting of input data without labeled responses. It tries to learn patterns and relationships directly from the input data (without a teacher).
Categories of unsupervised learning:
- Clustering
- Dimensionality reduction
- Anomaly detection
- Association rule learning: discovering rules that capture interesting relationships between variables in large databases (e.g., market basket analysis).

Clustering
Grouping similar instances into clusters. Example: a magazine that covers a wide range of topics such as technology, health, finance, and travel has an archive of thousands of articles. Organizing these articles into distinct groups of similar articles would help with content management and enhance the reader's experience by making it easy to find related articles. (A code sketch follows the distance measures below.)

How do we measure the distance from one point to another?
There are several ways to measure distance:
1. Euclidean distance: the shortest (straight-line) distance between two points.
2. Manhattan distance: the sum of absolute differences between the points across all dimensions.
3. Minkowski distance: the generalized form of the Euclidean and Manhattan distances, D(x, y) = (sum_i |x_i - y_i|^q)^(1/q), where q = 1 gives the Manhattan distance and q = 2 gives the Euclidean distance.
4. Hamming distance: measures the similarity between two strings of the same length as the number of positions at which the corresponding characters differ.
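To make these measures concrete, here is a minimal sketch using NumPy and SciPy; the example vectors and strings are made up for illustration, and note that SciPy calls the Minkowski exponent p rather than q:

```python
# Minimal sketch of the four distance measures (illustrative values).
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

# Euclidean: straight-line distance, sqrt(sum((x_i - y_i)^2)).
print(distance.euclidean(x, y))            # 3.605...

# Manhattan: sum of absolute differences across all dimensions.
print(distance.cityblock(x, y))            # 5.0

# Minkowski: SciPy's p is the lecture's q; p=1 -> Manhattan, p=2 -> Euclidean.
print(distance.minkowski(x, y, p=1))       # equals the Manhattan distance
print(distance.minkowski(x, y, p=2))       # equals the Euclidean distance

# Hamming: count of positions where two equal-length strings differ.
s, t = "karolin", "kathrin"
print(sum(a != b for a, b in zip(s, t)))   # 3
```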
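Returning to the magazine-archive example above, a minimal clustering sketch, assuming articles are represented as TF-IDF vectors; the four article snippets and the choice of k = 2 clusters are made up for illustration:

```python
# Minimal clustering sketch for the magazine-archive example.
# Assumes articles are vectorized with TF-IDF; snippets and k=2 are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

articles = [
    "New smartphone chips promise faster on-device AI",
    "How to budget and save for early retirement",
    "Stock markets rally as interest rates hold steady",
    "A walking tour of hidden cafes in Lisbon",
]

X = TfidfVectorizer(stop_words="english").fit_transform(articles)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # cluster id assigned to each article, e.g. [0 1 1 0]
```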
Dimensionality Reduction
A method for representing a given dataset using a lower number of features (i.e., dimensions) while still capturing the original data's meaningful properties; it simplifies the data without losing too much information. When dealing with high-dimensional data (i.e., data with many features or variables), it can be challenging to analyze and visualize the data, and it may also lead to issues such as overfitting in ML models.
Reducing the number of variables under consideration can be divided into two main types:
- Feature selection: selecting a subset of the most important features (variables) from the original dataset.
- Feature extraction: transforming the original data into a new set of features that should capture most of the important information in a smaller number of features. Principal Component Analysis (PCA) is a common feature-extraction technique.
Advantages:
- Training runs much faster.
- The data takes up less disk and memory space.
- The model may also perform better.
(A PCA sketch appears after the semi-supervised learning section below.)

Anomaly Detection
Also known as outlier detection, it is a process in ML and statistics used to identify unusual patterns or observations in data that do not conform to a well-defined notion of normal behavior (such as a problem with the system, fraud, or errors). The importance of anomaly detection varies across domains:
- Finance: identifying fraudulent transactions.
- Cybersecurity: detecting intrusions and security breaches.
- Healthcare: monitoring patient vitals and identifying unusual readings that could indicate a medical issue.
- Industrial: monitoring machinery and detecting faults or failures.
- Internet of Things (IoT): detecting abnormal behavior in connected devices.

Advantages of Unsupervised Learning
1. Discovering hidden patterns: it can identify patterns and relationships in data that are not initially evident.
2. No need for labeled data: works with unlabeled data, making it useful where obtaining labels is expensive or impractical.
3. Reduction of complexity in data: helps reduce the dimensionality of data, making complex data more comprehensible.
4. Feature discovery: can be used to find useful features that improve the performance of supervised learning algorithms.
5. Flexibility: can handle changes in the input data or the environment, since it does not rely on predefined labels.

Disadvantages of Unsupervised Learning
1. Interpretation of results: the results can be ambiguous and harder to interpret.
2. Dependency on input data: the output quality depends heavily on the quality of the input data.
3. Lack of precise objectives: with specific tasks like prediction or classification, the direction of learning is focused and leads to actionable insights; unsupervised learning lacks such a clear objective.

3. Reinforcement Learning (learning from mistakes)
The learning system (called an agent) can observe the environment, select and perform actions, and receive rewards in return (or penalties in the form of negative rewards). It must then learn by itself the best strategy, called a policy, to get the most reward over time. The reinforcement learning process has three steps: interaction, learning, and decision-making.
For example, consider teaching a dog a new trick: we cannot tell it what to do, but we can reward or punish it when it does the right or wrong thing. It has to figure out what it did that earned the reward or the punishment. There are two types of reinforcement learning: model-based and model-free.

4. Semi-Supervised Learning
Since labeling data is usually time-consuming and costly, you will often have plenty of unlabeled instances and few labeled ones. Semi-supervised learning is an ML approach that trains models using a combination of a small amount of labeled data and a large amount of unlabeled data. The main goal is to leverage the large pool of unlabeled data to better understand the underlying structure of the data and to improve learning accuracy beyond what the limited labeled data alone would allow.
Example: classifying web pages. The model can use the labeled pages to learn which features are indicative of each category, and then apply this knowledge to categorize the unlabeled pages.
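As a minimal sketch of this idea, scikit-learn's SelfTrainingClassifier wraps a base classifier and iteratively pseudo-labels the unlabeled instances, which are marked with -1; the synthetic dataset and the 90% unlabeled fraction are made up for illustration:

```python
# Minimal semi-supervised sketch: self-training on mostly unlabeled data.
# The synthetic dataset and 90% unlabeling are illustrative choices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Pretend most labels are unknown: scikit-learn marks these with -1.
rng = np.random.RandomState(0)
y_partial = y.copy()
y_partial[rng.rand(len(y)) < 0.9] = -1

model = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
model.fit(X, y_partial)
print(model.score(X, y))  # accuracy measured against the true labels
```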
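And for the dimensionality-reduction section above, a minimal PCA (feature-extraction) sketch; projecting the 4-feature Iris data down to 2 components is an arbitrary illustrative choice:

```python
# Minimal PCA sketch for the dimensionality-reduction section above.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                   # 150 samples, 4 features
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (150, 2)
print(pca.explained_variance_ratio_)   # share of variance kept per component
```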
5. Self-Supervised Learning (SSL)
If you have a large dataset of unlabeled images, you can randomly mask a small part of each image and then train a model to recover the original image. During training, the masked images are used as inputs for the model, and the original images are used as labels. In other words, the model learns to predict one part of an image given another part or, for text, to predict the next word in a sentence.

Testing and Validating
The only way to know how well a model will generalize to new cases is to try it out on new cases. A better option is to split your data into two sets: the training set and the test set. The error rate on new cases is called the generalization error (or out-of-sample error). It is common to use 80% of the data for training and hold out 20% for testing.
Types of cross-validation:
1. K-fold cross-validation
2. Hold-out cross-validation
3. Stratified k-fold cross-validation
4. Leave-p-out cross-validation
5. Leave-one-out cross-validation
6. Monte Carlo (shuffle-split)
7. Time series (rolling cross-validation)

Holdout Cross-Validation
1. Hold out part of the training set.
2. The held-out set is called the validation set (or the development set, or dev set).
3. Train multiple models with various hyperparameters on the reduced training set (i.e., the full training set minus the validation set), and select the model that performs best on the validation set.
4. Train the best model on the full training set (including the validation set) to obtain the final model.
5. Lastly, evaluate this final model on the test set to get an estimate of the generalization error.
(A code sketch of this procedure follows the k-fold part below.)

K-Fold Cross-Validation
The whole dataset is partitioned into k parts of equal size, and each partition is called a fold. It is known as k-fold since there are k parts, where k can be any integer: 3, 4, 5, etc. This validation technique is not considered suitable for imbalanced datasets, because the random folds may not preserve the class ratio, so the model will not be trained on a representative proportion of each class.
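A minimal k-fold sketch with scikit-learn; the Iris data, the logistic-regression model, and k = 5 are illustrative choices, and the stratified variant discussed further below is included for comparison:

```python
# Minimal k-fold sketch; data, model, and k=5 are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Plain k-fold: k equal parts, each used once as the held-out fold.
scores = cross_val_score(model, X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores.mean())

# Stratified k-fold (discussed below) keeps the class ratio in every fold.
scores = cross_val_score(model, X, y,
                         cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
print(scores.mean())
```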
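And here is the five-step holdout procedure from above as a minimal sketch; the 60/20/20 split and the candidate values of the regularization parameter C are made up for illustration:

```python
# Minimal sketch of the 5-step holdout procedure (illustrative split and C values).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Steps 1-2: carve out a test set, then a validation (dev) set.
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=0)  # 60/20/20 overall

# Step 3: train with several hyperparameters; pick the best on the validation set.
best_C, best_score = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_C, best_score = C, score

# Step 4: retrain the winner on the full training set (train + validation).
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train_full, y_train_full)

# Step 5: estimate the generalization error on the untouched test set.
print(final_model.score(X_test, y_test))
```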
Holdout vs. Cross-Validation

Holdout method
Advantages:
- Simple and fast to implement.
- Provides an independent test set to evaluate the model's performance.
Disadvantages:
- The performance can vary depending on how the data is split, as the model might perform better or worse depending on the random partitioning.
- May not utilize the full dataset efficiently, since a portion is held out only for testing.

Cross-validation (typically k-fold cross-validation)
Advantages:
- More robust, since all data points are used for both training and testing, reducing the variability that can arise from a single random train-test split.
- Provides a better estimate of model performance, especially with small datasets.
Disadvantages:
- More computationally expensive than the holdout method, since the model is trained multiple times.

Conclusion:
- Use holdout when you have a large dataset and want a quick estimate of the model's performance.
- Use cross-validation when the dataset is small or you need a more reliable estimate of model performance, ensuring that the result is less dependent on one specific train-test split.

Stratified K-Fold Cross-Validation
Plain k-fold validation cannot be used for imbalanced datasets, because the data is split into k folds with a uniform probability distribution. Not so with stratified k-fold, an enhanced version of the k-fold cross-validation technique: although it, too, splits the dataset into k equal folds, each fold has the same ratio of instances of the target variable as the complete dataset. This makes it work well for imbalanced datasets, but not for time-series data.

THANK YOU