Recap - Fundamentals of Machine Learning
Chourouk Guettas, University of El Oued
01/09/2024

Table of contents

Objectives
I - Overview of Machine Learning Concepts
  1. Overview of Machine Learning
  2. Core Concepts of Machine Learning: Data
  3. Core Concepts of Machine Learning: Dataset
  4. Core Concepts of Machine Learning: Learning
  5. Core Concepts of Machine Learning: Generalization
  6. The Role of Data in Machine Learning
II - Types of Machine Learning Models
  1. Supervised Learning
  2. Unsupervised Learning
  3. Reinforcement Learning
III - Evaluation Metrics and Model Performance
  1. Regression Metrics
  2. Classification Metrics
  3. Clustering Metrics
IV - The Machine Learning Pipeline
  1. Data Collection and Preprocessing
  2. Feature Selection and Engineering
  3. Model Selection and Training
  4. Model Evaluation and Tuning
  5. Deployment and Monitoring
  6. Ethical Considerations and Privacy
V - Challenges in Machine Learning
  1. Bias and Variance Trade-off
  2. Handling Imbalanced Datasets
  3. Dealing with Missing Data
  4. Feature Selection and Dimensionality Reduction
  5. Deployment Challenges in IoT Environments

Objectives

Second Chapter: Recap - Fundamentals of Machine Learning

- Basic Concepts of Machine Learning: supervised learning, unsupervised learning, reinforcement learning.
- Types of Machine Learning Models: regression, classification, clustering.
- Evaluation of Machine Learning Models: performance metrics, bias-variance tradeoff, overfitting, and underfitting.

I - Overview of Machine Learning Concepts

1. Overview of Machine Learning

1. What is Machine Learning?
Definition: Machine Learning (ML) is a subset of artificial intelligence (AI) that enables systems to learn from data, identify patterns, and make decisions with minimal human intervention.

Purpose: The goal is to develop algorithms that can receive input data and use statistical analysis to predict an output, updating those outputs as new data becomes available.

2. Historical Context and Evolution of ML

Brief timeline of ML development:
- 1950s: Early AI research and the Turing Test
- 1960s: Pattern recognition and the "nearest neighbor" algorithm
- 1980s: Machine learning becomes a separate field from AI
- 1990s: Rise of data mining and adaptive algorithms
- 2000s: Support Vector Machines and boosting methods gain popularity
- 2010s onwards: Deep learning revolution, big data, and increased computational power

Key milestones:
- 1957: Frank Rosenblatt designs the Perceptron, the first artificial neural network
- 1967: The nearest neighbor algorithm is introduced
- 1986: Backpropagation is applied to neural networks
- 1997: IBM's Deep Blue defeats world chess champion Garry Kasparov
- 2011: IBM Watson wins Jeopardy!
- 2016: Google DeepMind's AlphaGo defeats world Go champion Lee Sedol

3. Importance of ML in Modern Technology

Everyday applications:
- Personalized recommendations (e.g., Netflix, Amazon)
- Voice assistants (e.g., Siri, Alexa)
- Image and facial recognition
- Fraud detection in financial services
- Autonomous vehicles
- Medical diagnosis and drug discovery

Transformative impact of ML across industries:
- Healthcare: predictive diagnostics, personalized treatment plans
- Finance: algorithmic trading, credit scoring
- Manufacturing: predictive maintenance, quality control
- Retail: inventory management, customer behavior analysis
- Transportation: route optimization, traffic prediction

4. The Future of ML and IoT

Emerging trends and potential future developments:
- Edge computing: bringing ML capabilities directly to IoT devices
- Federated learning: training ML models while keeping data on local devices
- Explainable AI: making ML decisions more transparent and interpretable
- Integration with 5G (and, eventually, 6G) networks for faster, more reliable IoT communications
- Quantum machine learning: leveraging quantum computing for ML tasks

2. Core Concepts of Machine Learning: Data

Features (input variables):
- Definition: Features are the individual measurable properties or characteristics of the phenomenon being observed. They are the inputs to your machine learning model.
- In an IoT context: Features could be various sensor readings such as temperature, humidity, pressure, vibration, or sound levels.
- Importance: The quality and relevance of features significantly impact model performance.
- Types of features:
  - Numerical: continuous (e.g., temperature) or discrete (e.g., count of events)
  - Categorical: nominal (e.g., color) or ordinal (e.g., low/medium/high)
  - Time-series: data points indexed in time order
- Feature engineering: The process of using domain knowledge to create new features that make machine learning algorithms work better.

Labels (output variables):
- Definition: Labels are the target variables that the model is trying to predict or classify.
- Example: In a predictive maintenance scenario, the label could be "failure" or "no failure".
- Types of labels:
  - For regression problems: continuous values (e.g., predicting temperature)
  - For classification problems: discrete categories (e.g., classifying device status)

Labeled vs. unlabeled data:
- Labeled data: used in supervised learning, where both features and labels are provided.
- Unlabeled data: used in unsupervised learning, where only features are available.
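To make features and labels concrete, here is a minimal sketch of a labeled IoT dataset in Python with pandas. The column names and values are hypothetical illustrations, not data from the lecture:

```python
import pandas as pd

# Hypothetical IoT sensor readings: each row is one observation (illustrative values).
data = pd.DataFrame({
    "temperature_c": [71.2, 68.5, 90.1, 73.4],          # numerical, continuous feature
    "vibration_hz":  [12.0, 11.5, 25.3, 12.8],          # numerical, continuous feature
    "error_count":   [0, 1, 7, 0],                      # numerical, discrete feature
    "device_status": ["ok", "ok", "degraded", "ok"],    # categorical (nominal) feature
    "failure":       [0, 0, 1, 0],                      # label: 1 = failure, 0 = no failure
})

X = data.drop(columns=["failure"])  # features (input variables)
y = data["failure"]                 # labels (output variable)

# One simple feature-engineering step: one-hot encode the categorical feature.
X = pd.get_dummies(X, columns=["device_status"])
print(X.head())
```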
3. Core Concepts of Machine Learning: Dataset

Training Set
- Purpose: Used to train the model, i.e., to learn the relationships between features and labels.
- Size: Typically the largest portion of the data, usually 60-80% of the entire dataset.
- Usage: The model iteratively learns patterns from this data by adjusting its parameters.

Validation Set
- Purpose: Used to tune model hyperparameters and prevent overfitting.
- Size: Usually smaller than the training set, typically 10-20% of the dataset.
- Usage:
  - Helps in model selection and optimization
  - Used to evaluate model performance during training
  - Assists in early stopping to prevent overfitting

Test Set
- Purpose: Used to evaluate the final model performance.
- Size: Typically the smallest portion, around 10-20% of the dataset.
- Usage:
  - Simulates how the model will perform on unseen data
  - Provides an unbiased evaluation of the final model
  - Should only be used once the model is fully trained

Importance of Proper Data Splitting
- Ensures that the model's performance assessment is reliable and generalizable.
- Helps detect and prevent overfitting.
- Cross-validation: a technique that rotates the training and validation sets to obtain a more robust estimate of model performance.

4. Core Concepts of Machine Learning: Learning

Steps in the Learning Process
1. Model Initialization: Start with random or predetermined parameters.
2. Forward Pass: Use the current model to make predictions on the training data.
3. Loss Calculation: Compare predictions with actual values using a loss function.
4. Backward Pass: Compute gradients of the loss with respect to the model parameters.
5. Parameter Update: Adjust model parameters to minimize the loss using an optimization algorithm.
6. Iteration: Repeat steps 2-5 until convergence or a stopping criterion is met.

Key Concepts in the Learning Process
- Loss Functions:
  - Purpose: Measure how well the model is performing
  - Examples: Mean Squared Error (for regression), Cross-Entropy (for classification)
- Optimization Algorithms:
  - Purpose: Methods to adjust model parameters to minimize loss
  - Examples: Gradient Descent, Stochastic Gradient Descent, Adam
- Learning Rate:
  - Definition: Controls how much the model changes in response to the estimated error each time the model weights are updated.
  - Importance: Too high a learning rate can cause unstable training; too low a rate can result in slow convergence.

5. Core Concepts of Machine Learning: Generalization

Generalization
- Definition: The model's ability to perform well on unseen data.
- Goal: Create models that generalize well to new, unseen examples.
- Indicator of good generalization: similar performance on the training and test sets.

Overfitting
- Definition: When a model learns the training data too well, including noise and outliers.
- Signs: High accuracy on training data, poor performance on test data.
- Causes:
  - Model too complex for the amount of training data
  - Insufficient data
  - Training for too long
- Prevention techniques:
  - Regularization (e.g., L1, L2 regularization)
  - Early stopping
  - Data augmentation
  - Ensemble methods

Underfitting
- Definition: When a model is too simple to capture the underlying patterns in the data.
- Signs: Poor performance on both training and test data.
- Causes:
  - Model too simple
  - Insufficient feature engineering
  - Not training for long enough
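The following is a minimal sketch of overfitting and underfitting in action, assuming NumPy and scikit-learn are available. It fits polynomials of increasing degree to synthetic noisy data (the sample size, noise level, and degrees are illustrative choices) and compares training and test error:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)  # noisy signal

# Hold out 30% of the points as a test set (see the Dataset section above).
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)  # fit a polynomial of this degree
    train_mse = mean_squared_error(y_train, np.polyval(coeffs, x_train))
    test_mse = mean_squared_error(y_test, np.polyval(coeffs, x_test))
    # Expected pattern: degree 1 underfits (both errors high);
    # degree 9 overfits (very low train error, higher test error).
    print(f"degree={degree}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```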
The Bias-Variance Tradeoff
- Bias: error due to overly simplistic assumptions in the learning algorithm.
- Variance: error due to excessive complexity in the learning algorithm.
- Tradeoff: Decreasing bias will often increase variance, and vice versa.
- Goal: Find the sweet spot that minimizes both bias and variance, leading to a model that generalizes well.

Impact of Model Complexity
- Low-complexity models: prone to underfitting (high bias, low variance)
- High-complexity models: prone to overfitting (low bias, high variance)
- Optimal complexity: balances bias and variance to achieve the best generalization

6. The Role of Data in Machine Learning

Data Quality and Quantity
- Quality: Clean, relevant, and representative data is crucial for model performance.
- Quantity: More data generally leads to better model performance, but with diminishing returns.

Challenges with Real-World Data
- Noise: random variation or error in the data
- Missing values: incomplete data points
- Outliers: data points that significantly differ from other observations

Data Preprocessing Techniques
- Cleaning: handling missing values, removing duplicates
- Transformation: normalization, standardization
- Feature selection: choosing the most relevant features
- Feature engineering: creating new features from existing ones

II - Types of Machine Learning Models

1. Supervised Learning

Definition: Supervised learning is a type of machine learning where the model is trained on a labeled dataset. The algorithm learns to map input data to known output labels.

Key characteristics:
- Requires labeled data for training
- Goal is to learn a function that maps inputs to outputs
- Used for prediction and classification tasks

Examples of supervised learning models:

Regression
- Purpose: Predict continuous numerical values
- Examples:
  - Linear Regression
    - Simplest form of regression
    - Models a linear relationship between input features and output
    - Equation: y = mx + b (for simple linear regression)
  - Polynomial Regression
    - Extension of linear regression for non-linear relationships
    - Can model curved relationships in data
  - Support Vector Regression (SVR)
    - Uses support vector machines for regression tasks
    - Effective in high-dimensional spaces

Classification
- Purpose: Categorize input data into predefined classes
- Examples:
  - Logistic Regression
    - Despite its name, used for binary classification
    - Outputs the probability of an instance belonging to a particular class
  - Decision Trees
    - Tree-like model of decisions
    - Can handle both numerical and categorical data
    - Easy to interpret
  - Support Vector Machines (SVM)
    - Find the hyperplane that best separates classes
    - Effective in high-dimensional spaces
    - Can use different kernel functions for non-linear classification
  - k-Nearest Neighbors (k-NN)
    - Classifies based on the majority class of the k nearest neighbors
    - Simple, but can be computationally expensive for large datasets
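As a minimal illustration of supervised learning, the sketch below trains two of the classifiers listed above on scikit-learn's built-in iris dataset. The dataset choice and hyperparameter values are illustrative, not prescribed by the lecture:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # features and known labels (labeled data)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for model in (LogisticRegression(max_iter=1000), KNeighborsClassifier(n_neighbors=5)):
    model.fit(X_train, y_train)        # learn the mapping from inputs to labels
    acc = model.score(X_test, y_test)  # accuracy on unseen data
    print(type(model).__name__, f"test accuracy = {acc:.2f}")
```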
2. Unsupervised Learning

Definition: Unsupervised learning is a type of machine learning where the model is given unlabeled data and must find patterns or structure within it.

Key characteristics:
- Works with unlabeled data
- Goal is to discover hidden patterns or groupings in data
- Used for clustering, dimensionality reduction, and anomaly detection

Unsupervised learning models:

Clustering
- Purpose: Group similar data points together
- Examples:
  - K-means Clustering
    - Partitions data into K clusters
    - Each data point belongs to the cluster with the nearest mean
  - Hierarchical Clustering
    - Creates a tree of clusters (dendrogram)
    - Can be agglomerative (bottom-up) or divisive (top-down)
  - DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
    - Clusters areas of high density separated by areas of low density
    - Can find clusters of arbitrary shape

3. Reinforcement Learning

Definition: Reinforcement learning is a type of machine learning where an agent learns to make decisions by interacting with an environment.

Key characteristics:
- Involves an agent, environment, states, actions, and rewards
- Goal is to learn a policy that maximizes cumulative reward
- Used for sequential decision-making problems

Key components:
- Agent: the learner or decision-maker
- Environment: the world that the agent interacts with
- State: the current situation of the agent
- Action: a move the agent can make
- Reward: feedback from the environment

Examples of reinforcement learning algorithms:
- Q-Learning: learns the value of an action in a particular state
- Policy Gradient Methods: directly learn the optimal policy
- Deep Q-Network (DQN): combines Q-learning with deep neural networks

III - Evaluation Metrics and Model Performance

1. Regression Metrics

Mean Absolute Error (MAE): The average of the absolute differences between predicted and actual values.

    MAE = (1/n) * Σ_{i=1}^{n} |y_i - ŷ_i|

Mean Squared Error (MSE): The average of the squared differences between predicted and actual values.

    MSE = (1/n) * Σ_{i=1}^{n} (y_i - ŷ_i)²

R-squared (R²): Indicates the proportion of the variance in the dependent variable that is predictable from the independent variables.

    R² = 1 - [Σ_{i=1}^{n} (y_i - ŷ_i)²] / [Σ_{i=1}^{n} (y_i - ȳ)²]

2. Classification Metrics

Accuracy: The proportion of correctly classified instances among the total instances.

    Accuracy = (True Positives + True Negatives) / Total Instances

Precision: The proportion of true positive predictions among all positive predictions.

    Precision = True Positives / (True Positives + False Positives)

Recall (Sensitivity): The proportion of true positive predictions among all actual positive instances.

    Recall = True Positives / (True Positives + False Negatives)

F1 Score: The harmonic mean of precision and recall, providing a single metric to evaluate the model.

    F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

3. Clustering Metrics

Silhouette Score: Measures how similar a data point is to its own cluster compared to other clusters.

    Silhouette = (b - a) / max(a, b)

where a is the average distance to the other points in the same cluster, and b is the average distance to the points in the nearest other cluster.

Davies-Bouldin Index: The average similarity ratio of each cluster with its most similar cluster; lower values indicate better clustering.

    DBI = (1/K) * Σ_{i=1}^{K} max_{j≠i} [(S_i + S_j) / M_ij]

where S_i is the average distance of the points in cluster i to its centroid (the cluster diameter), and M_ij is the distance between the centroids of clusters i and j.
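A minimal sketch of computing the clustering metrics above with scikit-learn, on synthetic unlabeled data (the sample size, number of clusters, and seeds are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Synthetic, unlabeled data with three natural groupings.
X, _ = make_blobs(n_samples=300, centers=3, random_state=7)

# K-means assigns each point to the cluster with the nearest mean.
labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)

print("Silhouette score:", silhouette_score(X, labels))          # closer to 1 is better
print("Davies-Bouldin index:", davies_bouldin_score(X, labels))  # lower is better
```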
IV - The Machine Learning Pipeline

Introduction

The machine learning pipeline is a series of steps that encompass the entire lifecycle of a machine learning project, from data collection to model deployment and monitoring. Understanding this pipeline is crucial for effectively applying machine learning to IoT applications.

1. Data Collection and Preprocessing

a) Data Collection
- Sources of data in IoT:
  - Sensors (e.g., temperature, humidity, pressure)
  - Devices (e.g., smartphones, wearables)
  - Logs and system outputs
- Considerations:
  - Data quality and reliability
  - Data volume and velocity
  - Privacy and security concerns

b) Data Preprocessing
- Data Cleaning:
  - Handling missing values
    - Deletion: remove rows or columns with missing data
    - Imputation: fill missing values (mean, median, mode, or predicted values)
  - Removing duplicates
  - Correcting inconsistencies
- Data Transformation:
  - Normalization: scaling features to a fixed range (usually 0-1)
    Formula: X_norm = (X - X_min) / (X_max - X_min)
  - Standardization: scaling features to have zero mean and unit variance
    Formula: X_stand = (X - μ) / σ
  - Encoding categorical variables: one-hot encoding, label encoding
- Data Integration:
  - Combining data from multiple sources
  - Ensuring consistency across different data streams

2. Feature Selection and Engineering

a) Feature Selection
- Purpose: Choose the most relevant features for the model
- Techniques:
  - Filter methods: select features based on statistical measures
  - Wrapper methods: use a model to evaluate feature subsets
  - Embedded methods: perform feature selection as part of the model training process
- Importance in IoT: reduces computational complexity and improves model efficiency

b) Feature Engineering
- Purpose: Create new features from existing ones to improve model performance
- Techniques:
  - Polynomial features
  - Domain-specific transformations
  - Time-based features for time series data
- Importance in IoT: can capture complex patterns and relationships in sensor data

3. Model Selection and Training

a) Model Selection
- Considerations:
  - Nature of the problem (classification, regression, clustering)
  - Size and quality of the dataset
  - Computational resources available (especially important for IoT devices)
  - Interpretability requirements
- Techniques:
  - Cross-validation
  - Grid search for hyperparameter tuning

b) Model Training
- Process:
  1. Split data into training and validation sets
  2. Initialize model parameters
  3. Feed training data into the model
  4. Optimize model parameters using an appropriate algorithm
  5. Validate model performance on the validation set
- Challenges in IoT:
  - Limited computational resources on edge devices
  - Handling streaming data
  - Adapting to concept drift (changes in data distribution over time)

4. Model Evaluation and Tuning

a) Evaluation Metrics
- For regression: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), R-squared (R²)
- For classification: accuracy, precision and recall, F1 score, Area Under the ROC Curve (AUC-ROC)
- For clustering: Silhouette Score, Calinski-Harabasz Index

b) Model Tuning
- Hyperparameter optimization: Grid Search, Random Search, Bayesian Optimization
- Techniques to improve model performance (a pipeline-plus-grid-search sketch follows below):
  - Ensemble methods (e.g., Random Forests, Gradient Boosting)
  - Regularization to prevent overfitting
  - Learning rate scheduling
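To tie several of these pipeline steps together, here is a minimal sketch that chains standardization and a classifier, then tunes hyperparameters with grid search and cross-validation. It uses scikit-learn on synthetic data; the parameter grid and values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Chaining preprocessing and model ensures scaling is fit only on training folds.
pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])

# Grid search over hyperparameters, each candidate scored by 5-fold cross-validation.
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10], "clf__kernel": ["linear", "rbf"]}, cv=5)
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("test accuracy:", grid.score(X_test, y_test))  # final check on held-out data
```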
5. Deployment and Monitoring

a) Model Deployment
- Deployment options in IoT:
  - Cloud deployment: the model runs on cloud servers
  - Edge deployment: the model runs on edge devices
  - Hybrid deployment: a combination of cloud and edge
- Considerations:
  - Model size and computational requirements
  - Latency requirements
  - Power consumption (especially for battery-powered devices)

b) Model Monitoring and Maintenance
- Monitoring model performance:
  - Tracking prediction accuracy over time
  - Detecting concept drift
- Model updates:
  - Retraining on new data
  - Fine-tuning existing models
- Challenges in IoT:
  - Limited connectivity for remote updates
  - Ensuring consistency across distributed devices

6. Ethical Considerations and Privacy

- Data privacy: ensuring personal data is protected
- Bias and fairness: preventing and mitigating biases in models
- Transparency: providing explanations for model decisions when necessary
- Security: protecting models and data from malicious attacks

V - Challenges in Machine Learning

1. Bias and Variance Trade-off

a) Understanding Bias and Variance
- Bias: the error due to overly simplistic assumptions in the learning algorithm
- Variance: the error due to excessive complexity in the learning algorithm

b) The Trade-off
- High bias (underfitting): the model is too simple and misses relevant relations between features and target outputs
- High variance (overfitting): the model is too complex and captures noise in the training data
- Goal: find the right balance to create a model that generalizes well

c) Addressing the Trade-off
- Feature selection and engineering
- Regularization techniques (L1, L2 regularization)
- Ensemble methods (e.g., Random Forests, Gradient Boosting)
- Cross-validation for model selection

2. Handling Imbalanced Datasets

a) The Problem
- One class significantly outnumbers the other(s) in classification tasks
- Common in many IoT applications (e.g., anomaly detection, predictive maintenance)

b) Challenges
- Standard algorithms tend to favor the majority class
- Traditional evaluation metrics (e.g., accuracy) can be misleading

c) Solutions
- Resampling techniques:
  - Oversampling the minority class (e.g., SMOTE - Synthetic Minority Over-sampling Technique)
  - Undersampling the majority class
- Algorithmic approaches:
  - Cost-sensitive learning
  - Ensemble methods (e.g., BalancedRandomForestClassifier)
- Appropriate evaluation metrics:
  - Precision, recall, F1-score
  - ROC AUC

3. Dealing with Missing Data

a) Types of Missing Data
- Missing Completely at Random (MCAR)
- Missing at Random (MAR)
- Missing Not at Random (MNAR)

b) Challenges
- Biased results if not handled properly
- Reduced statistical power
- Computational issues in model training

c) Techniques for Handling Missing Data (a short imputation sketch follows below)
- Deletion methods:
  - Listwise deletion
  - Pairwise deletion
- Imputation methods:
  - Mean/median/mode imputation
  - Regression imputation
  - Multiple imputation
- Machine learning approaches:
  - Using algorithms that can handle missing values (e.g., decision trees)
  - Treating missingness as a feature
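A minimal sketch of mean imputation with scikit-learn; the sensor values and the choice of strategy are illustrative:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical sensor readings with gaps (np.nan marks missing values).
X = np.array([[21.0, 40.0],
              [np.nan, 42.0],
              [23.0, np.nan],
              [22.0, 41.0]])

# Mean imputation: replace each missing entry with its column's mean.
imputer = SimpleImputer(strategy="mean")  # "median" or "most_frequent" also work
X_filled = imputer.fit_transform(X)
print(X_filled)  # the column means 22.0 and 41.0 fill the gaps
```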
4. Feature Selection and Dimensionality Reduction

a) The Curse of Dimensionality
- As the number of features increases, the amount of data needed to generalize accurately grows exponentially.

b) Challenges
- Increased computational complexity
- Risk of overfitting
- Difficulty in data visualization and interpretation

c) Techniques
- Feature Selection:
  - Filter methods (e.g., correlation-based selection)
  - Wrapper methods (e.g., recursive feature elimination)
  - Embedded methods (e.g., LASSO regularization)
- Dimensionality Reduction (a PCA sketch follows at the end of this chapter):
  - Principal Component Analysis (PCA)
  - t-SNE (t-Distributed Stochastic Neighbor Embedding)
  - Autoencoders

5. Deployment Challenges in IoT Environments

(These will be covered in more detail in the upcoming lectures.)

a) Resource Constraints
b) Connectivity Issues
c) Model Updates and Maintenance
d) Security and Privacy Concerns
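To make dimensionality reduction (section 4 above) concrete, here is a minimal PCA sketch with scikit-learn. The dataset and the number of components are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

# Project the 4-dimensional data onto its 2 leading principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print("original shape:", X.shape, "-> reduced shape:", X_reduced.shape)
print("variance explained per component:", pca.explained_variance_ratio_)
```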