Chapter 3: Supervised Machine Learning Algorithms
Mostafa Z. Ali
Summary
This document covers supervised machine learning algorithms, including classification and regression. Using the Iris dataset as a running example, it explains the modeling workflow, the structure of scikit-learn datasets, the train/test split, problems such as overfitting and underfitting, and how data variety should inform model complexity.
Full Transcript
Chapter 3: Supervised Machine Learning Algorithms (Mostafa Z. Ali)

A Second Application: The Iris Species

Problem Setup - Identifying Iris Species
A researcher wants to identify the species of iris flowers based on their physical characteristics.
Measurements available:
- Petal length and width
- Sepal length and width
Objective: build a model that predicts the species of an iris from these measurements.
(Figure: parts of the iris flower)

Understanding the Data
The researcher has measurements for irises previously identified by an expert.
Known species: setosa, versicolor, virginica. Only these three species are considered in this study.

Defining the Goal
Train a machine learning model using the known measurements and species.
Goal: predict the species of a new iris based on its petal and sepal measurements.

Problem Type - Supervised Learning
Since we have measurements and known species labels, this is a supervised learning problem.
Supervised learning: we train the model on labeled data (data with known outputs).

A Classification Problem
This is a classification problem because we are predicting one of several categories (species).
Classes: setosa, versicolor, virginica.
A three-class classification problem: the model must choose between three possible classes.
This "classification" problem is different from regression, where the output is continuous rather than categorical.

Key Terminology
- Classes: the possible output categories (setosa, versicolor, virginica).
- Label: the true species of each iris in the dataset.
- Data point: a single iris flower with its measurements and a label (one record/row in the dataset).

Introduction to the Iris Dataset
The Iris dataset is a classical dataset in machine learning and statistics. It includes 150 samples of iris flowers with measurements for sepal length, sepal width, petal length, and petal width.

Understanding the Data Structure
The iris_dataset object returned by load_iris() is a Bunch object, similar to a dictionary. It contains several keys with useful information about the dataset.
A Bunch is a container object exposing keys as attributes. Bunch objects are sometimes used as the output of functions and methods. They extend dictionaries by enabling values to be accessed by key, bunch["value_key"], or by an attribute, bunch.value_key.

Dataset Description
The DESCR key contains a description of the dataset.

Target Names and Feature Names
- target_names: an array containing the species we want to predict ('setosa', 'versicolor', 'virginica').
- feature_names: a description of each measurement feature.

Data and Target Fields
- data contains the numeric measurements in a NumPy array.
- target contains the species encoding (0 = setosa, 1 = versicolor, 2 = virginica).

Exploring Sample Data
Rows in data correspond to samples (flowers), and columns represent measurements (features). A shape of (150, 4) means 150 samples and 4 features.
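A minimal sketch of this exploration, assuming a recent scikit-learn (the exact set of Bunch keys varies slightly across versions):

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Bunch objects behave like dictionaries whose values are also attributes.
print(iris_dataset.keys())

print(iris_dataset["target_names"])    # ['setosa' 'versicolor' 'virginica']
print(iris_dataset.feature_names)      # attribute access works too

print(iris_dataset["DESCR"][:200])     # start of the dataset description

print(iris_dataset["data"].shape)      # (150, 4): 150 samples, 4 features
print(iris_dataset["data"][:5])        # first five samples
print(iris_dataset["target"].shape)    # (150,): one entry per flower
print(iris_dataset["target"][:5])      # species encoded as 0, 1, 2
```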
Target Data Structure
target is a 1D array containing the species for each flower. A shape of (150,) indicates 150 entries (one per flower).

Understanding the Target Array
The target array encodes species as integers: 0 = setosa, 1 = versicolor, 2 = virginica.

Importance of Model Evaluation
Objective: predict the species of an iris from a new set of measurements.
Challenge: how can we trust the model's predictions on new data?
Solution: evaluate the model on data it hasn't seen before to test its generalization ability.

Why Not Use Training Data for Evaluation?
The model could simply memorize the training data, achieving perfect accuracy on it.
Generalization: a good model should perform well on unseen data, not just the training set.

Training and Testing Data Split
Split the labeled data into two parts:
- Training set: used to train the model.
- Test set: used to evaluate the model's performance on unseen data.

Train-Test Split with scikit-learn
The train_test_split function shuffles and splits the dataset. A typical rule of thumb: 75% for training, 25% for testing.

Understanding Data and Labels
Data is denoted as X (capitalized, for 2D arrays). Labels are denoted as y (lowercase, for 1D vectors). This notation follows the mathematical convention f(X) = y.

Shuffling the Data for Fair Representation
Before splitting, train_test_split shuffles the data to ensure fair class representation in both sets. Example: if the last 25% of the data were selected without shuffling, all of those samples might belong to a single class.

Using Random State for Reproducibility
Setting random_state=0 ensures that the output is consistent each time the code runs. A fixed seed makes the outcome deterministic, which is helpful for reproducible results.

Shape of the Training and Test Sets
- X_train and y_train: 75% of the dataset, used for training.
- X_test and y_test: the remaining 25%, used for testing.
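A minimal sketch of the split described above, using scikit-learn's defaults (75% training, 25% test):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()

# By convention, X is the 2D data matrix and y the 1D label vector: f(X) = y.
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"],
    random_state=0)  # fixed seed: the shuffle is the same on every run

print(X_train.shape, y_train.shape)  # (112, 4) (112,)  -> 75% of 150 samples
print(X_test.shape, y_test.shape)    # (38, 4) (38,)    -> the remaining 25%
```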
Introduction to Supervised Learning Tasks

Classification and Regression in Supervised Learning
Two main types:
- Classification: predicts a class label from a predefined set.
- Regression: predicts a continuous value.

What is Classification?
Goal: predict a class label from a set of possible classes. Example: classifying irises into one of three species.
Types of classification:
- Binary classification: two classes (e.g., spam vs. not spam).
- Multiclass classification: more than two classes (e.g., species of iris).

Examples of Classification Tasks
Binary classification: predicting whether an email is spam. The question being asked: "Is this email spam?"
Multiclass classification: predicting the language of a website (e.g., English, French, Spanish). There is no "between" class in languages.

What is Regression?
Goal: predict a continuous, numeric value. Example: predicting a person's annual income based on education, age, and location.
Continuous nature: the output is a range of values (e.g., income levels).

Examples of Regression Tasks
Predicting annual income:
- Input: education level, age, location.
- Output: any numeric income value (e.g., $40,000 or $40,001).
Predicting crop yield:
- Input: attributes like weather, past yields, employee count.
- Output: yield amount (can be any number).

Note: In binary classification we often speak of one class being the positive class and the other the negative class. Here, "positive" doesn't mean beneficial or valuable; it denotes the object of the study. So, when looking for spam, "positive" could mean the spam class. Which of the two classes is called positive is often a subjective matter, specific to the domain.

Distinguishing Classification from Regression
Question: is there continuity in the output?
- Classification: discrete, non-continuous labels (e.g., language detection).
- Regression: continuous values (e.g., income prediction).
Example: for income prediction, small variations (like $40,000 vs. $40,001) don't change the result significantly. For language prediction, each language is distinct, with no overlap.

Goal of Supervised Learning
In supervised learning, we aim to train a model on known data so it can accurately predict outcomes on new, unseen data. Key concept: if a model makes accurate predictions on unseen data, it is said to generalize well.

Generalization in Practice
A well-generalized model performs accurately on both the training set and the test set. Problem: sometimes a model performs well on training data but poorly on test data.

Complexity and Accuracy on Training Data
Allowing very complex models can result in perfect accuracy on the training set. Risk: high accuracy on training data does not guarantee good performance on test data.
(Figure: trade-off of model complexity against training and test accuracy)

Example: Predicting Boat Purchases
Scenario: a novice data scientist wants to predict whether a customer will buy a boat. Goal: send promotional emails to likely buyers while avoiding uninterested customers.
(Figure: example data about customers)

Example Rule Creation
The data scientist proposes the rule: "If the customer is older than 45 and has fewer than 3 children or is not divorced, they want to buy a boat." This rule achieves 100% accuracy on the current dataset.

Potential Pitfall: Overfitting
Overfitting occurs when a model is too complex and fits the training data very closely. This makes the model unreliable for new data. Example: creating many complex rules specific to the training set.

Characteristics of Overfitting
Overfitting focuses on the unique details of the training data, such as individual data points. Example: using details like age, children, and marital status can lead to an overly complex model that won't generalize well.
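To make this concrete, here is a hypothetical sketch (not from the slides): an unconstrained decision tree can memorize its training set, reaching perfect training accuracy, while a depth-limited tree gives up some training accuracy and often generalizes better. The dataset choice is arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unconstrained tree: grows rules until every training point is fit exactly.
complex_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(complex_tree.score(X_train, y_train))  # 1.0 -- perfect on training data
print(complex_tree.score(X_test, y_test))    # noticeably lower on unseen data

# Depth-limited tree: fewer, simpler rules; trades some training accuracy
# for better behavior on unseen data.
simple_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
simple_tree.fit(X_train, y_train)
print(simple_tree.score(X_train, y_train))   # below 1.0 on training data
print(simple_tree.score(X_test, y_test))     # typically as good or better
```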
Simplifying for Better Generalization
Simpler models often generalize better. For example, "People over 50 want to buy a boat" is a simpler rule that might generalize better than complex rules.

Underfitting and Its Risks
Underfitting occurs when a model is too simple and fails to capture the patterns in the data. Example of underfitting: "Everybody who owns a house buys a boat" is too broad and fails to capture nuances in the data.

The Overfitting-Underfitting Trade-off
Model complexity: as model complexity increases, accuracy on the training data improves.
Sweet spot: we aim to find the optimal level of complexity at which the model generalizes best.
(Figure: trade-off of model complexity against training and test accuracy)

Model Complexity and Dataset Size
The complexity of a model should align with the variety of data in the dataset.
Key insight: the more varied the dataset, the more complex a model can be without overfitting.
Overfitting risk: using a complex model with low-variety data can lead to overfitting.

Model Complexity and Dataset Size - Data Variety
Data variety refers to the diversity of features, patterns, or structures present in a dataset. It can include:
- Different types of features (e.g., numerical, categorical, or textual).
- Variations in the relationships between input features and the target variable.
- Diverse subgroups or clusters within the data.

High data variety can demand higher model complexity:
- Models need to capture diverse patterns in the data. If the dataset contains complex relationships or multiple data distributions (e.g., distinct clusters), simpler models may struggle to generalize effectively. For example, a linear model might fail to capture non-linear relationships, whereas a more complex model like a neural network or an ensemble method can adapt to such variety.
- Overfitting risk: high model complexity (e.g., too many parameters or layers) can lead to overfitting if the data variety isn't adequately represented or if the dataset is too small. Proper data preprocessing, regularization, and cross-validation are necessary to balance model complexity with the variety in the data.
- Need for feature engineering and preprocessing: diverse data may require preprocessing steps like scaling, one-hot encoding, or feature extraction. Complex models may handle these implicitly (e.g., neural networks), but simpler models often rely on explicit preprocessing.

Model Complexity and Dataset Size - Data Variety (example)
Dataset: the Titanic dataset (available on Kaggle or via scikit-learn).
Variety in features:
- Numerical: Age, Fare.
- Categorical: Gender, Passenger Class (Pclass), Embarked.
- Textual: Name, Cabin (if used for feature extraction).
Mixed data distributions: the passenger classes (1st, 2nd, 3rd) represent different socio-economic groups.
Model complexity implications: a simple logistic regression model might capture relationships between features like Pclass and Survived, but it may struggle with interactions such as Fare varying by Embarked port. More complex models (e.g., Random Forests or Gradient Boosting) can capture such interactions and nonlinearities better. If textual features like Name are incorporated, a natural language processing (NLP) component may increase model complexity further.
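A hedged sketch of that comparison, assuming the OpenML "titanic" dataset (fetched over the network) and a handful of its mixed-type columns; the scores are illustrative, not from the slides:

```python
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

titanic = fetch_openml("titanic", version=1, as_frame=True)
X = titanic.data[["pclass", "sex", "age", "fare", "embarked"]]
y = titanic.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Numeric columns: impute missing values, then scale.
# Categorical columns: impute with the most frequent value, then one-hot encode.
preprocess = ColumnTransformer([
    ("num", make_pipeline(SimpleImputer(), StandardScaler()), ["age", "fare"]),
    ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"),
                          OneHotEncoder(handle_unknown="ignore")),
     ["pclass", "sex", "embarked"]),
])

# The simpler linear model and the more complex ensemble share one pipeline.
for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=0)):
    clf = make_pipeline(preprocess, model).fit(X_train, y_train)
    print(type(model).__name__, clf.score(X_test, y_test))
```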
Impact of Dataset Variety on Model Complexity
Larger datasets: more variety usually means less chance of overfitting, as the model can generalize better.
Similar data points: simply duplicating or collecting very similar data points doesn't increase variety and doesn't benefit model training.

Example - Boat-Selling Model
Scenario: in the boat-selling example, the rule "If the customer is older than 45 and has fewer than 3 children or is not divorced, they want to buy a boat" seems valid.
Data expansion: if we gather 10,000 more rows that follow this rule, we gain more confidence in its validity.
Conclusion: the larger dataset supports a more complex model that generalizes well.

Benefits of Larger Datasets in Supervised Learning
In supervised learning, adding more data often has a greater impact than model tweaking. More data = better generalization: a larger dataset can better capture different scenarios and reduce overfitting.

Practical Note for Real-World Data Collection
In practice, the amount of data collected is often within your control, and collecting varied data can benefit model performance significantly. Never underestimate the power of more data: before fine-tuning the model, consider whether increasing the dataset size could yield better results.

Introduction to Supervised Machine Learning Algorithms
Supervised learning involves algorithms that learn from labeled data to make predictions. Key focus areas:
- How models learn from data
- Model complexity
- Strengths and weaknesses of each algorithm
- Common parameters and options

Understanding Model Complexity
Definition: model complexity refers to how flexible or sophisticated a model is in fitting data.
Complex vs. simple models: complex models can capture intricate patterns but risk overfitting; simpler models may generalize better but miss subtle patterns.

Types of Supervised Learning Algorithms
Most algorithms have classification and regression variants.
- Classification: used to predict discrete labels (e.g., spam or not spam).
- Regression: used to predict continuous values (e.g., housing prices).

Key Supervised Learning Algorithms Overview
We'll explore several widely used algorithms, each with distinct methods for learning and predicting. Algorithms covered include:
- k-Nearest Neighbors (k-NN)
- Linear Regression and Logistic Regression
- Decision Trees and Random Forests
- Support Vector Machines (SVM)
- Neural Networks
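As noted above, most of these algorithm families come in classifier and regressor variants. A minimal sketch of that point, using k-NN and scikit-learn's shared fit/predict API (the regression target here, petal width, is purely illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X, y = load_iris(return_X_y=True)

# Classification variant: predicts a discrete species label (0, 1, or 2).
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict(X[:2]))

# Regression variant of the same family: predicts a continuous value;
# here we predict petal width from the other three measurements.
reg = KNeighborsRegressor(n_neighbors=3).fit(X[:, :3], X[:, 3])
print(reg.predict(X[:2, :3]))
```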
Introduction to Sample Datasets
In this section, we'll explore different datasets for testing and illustrating machine learning algorithms. We'll use both synthetic datasets (small, made up) and real-world datasets.

Synthetic Dataset - Forge Dataset
The forge dataset is a simple synthetic dataset used for two-class classification. It contains 2 features, which makes the dataset's structure easy to visualize.
(Figure: scatter plot of the forge dataset)

Synthetic Dataset - Wave Dataset
The wave dataset is a synthetic dataset used for regression. It consists of 1 input feature and a continuous target variable.
(Figure: plot of the wave dataset, with the x-axis showing the feature and the y-axis showing the regression target)

Importance of Low-Dimensional Datasets
Low-dimensional datasets (few features) help visualize and understand algorithm behavior. Caution: results from low-dimensional datasets may not always generalize to high-dimensional data.

Real-World Dataset - Wisconsin Breast Cancer Dataset
The Breast Cancer dataset records clinical measurements of breast cancer tumors. Labels: benign (harmless) or malignant (cancerous).
Note: datasets included in scikit-learn are usually stored as Bunch objects, which contain some information about the dataset as well as the actual data. All you need to know about Bunch objects is that they behave like dictionaries, with the added benefit that you can access values using a dot (as in bunch.key instead of bunch['key']).

Breast Cancer Dataset Details
569 data points with 30 features each. Class distribution: 212 malignant and 357 benign tumors.

Breast Cancer Feature Names
The feature names in the Breast Cancer dataset provide context for each measurement.

Real-World Dataset - Boston Housing Dataset
The Boston Housing dataset contains 506 data points with 13 features. Task: predict the median value of homes in Boston neighborhoods.

Feature Engineering with Boston Housing Dataset
Feature engineering: generating new features by combining existing ones (more on that in chapter 1). In this example, we consider all pairwise combinations of the 13 features.
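A hedged sketch of loading these datasets, assuming the mglearn helper package that accompanies "Introduction to Machine Learning with Python" is installed (pip install mglearn); the forge, wave, and extended Boston loaders come from mglearn, not from scikit-learn itself:

```python
import mglearn
import numpy as np
from sklearn.datasets import load_breast_cancer

# forge: synthetic two-class classification data with 2 features.
X_forge, y_forge = mglearn.datasets.make_forge()
print(X_forge.shape)                      # (26, 2)

# wave: synthetic regression data with 1 feature and a continuous target.
X_wave, y_wave = mglearn.datasets.make_wave(n_samples=40)

# Wisconsin Breast Cancer: 569 tumors, 30 clinical features, 2 classes.
cancer = load_breast_cancer()
print(cancer.data.shape)                  # (569, 30)
print(dict(zip(cancer.target_names, np.bincount(cancer.target))))
# {'malignant': 212, 'benign': 357}

# Extended Boston Housing: the 13 original features plus their pairwise
# combinations. (scikit-learn's own load_boston was removed in version 1.2.)
X_boston, y_boston = mglearn.datasets.load_extended_boston()
print(X_boston.shape)                     # (506, 104)
```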