Gradient Boosting PDF
Summary
This document provides a presentation on gradient boosting, a powerful machine learning ensemble technique.
Full Transcript
What is Gradient Boosting in General?

Boosting is a powerful ensemble technique in machine learning. Unlike traditional models that learn from the data independently, boosting combines the predictions of multiple weak learners to create a single, more accurate, strong learner.

Weak Learner

A weak learner is a machine learning model that performs only slightly better than random guessing. For instance, in our mushroom classification scenario, if a random guessing model is 40% accurate, a weak learner would be just above that: 50-60%. The most popular weak learner is the decision tree, chosen for its ability to work with almost any dataset. If you are not familiar with decision trees, it is worth reviewing them before continuing.

Real-World Applications of Gradient Boosting

Gradient boosting has become such a dominant force in machine learning that its applications now span various industries, from predicting customer churn to detecting asteroids. Here is a glimpse into its success stories on Kaggle and in real-world use cases.

Dominating Kaggle competitions:
- Netflix Movie Recommendation Challenge: gradient boosting was crucial in building recommendation systems for multi-billion-dollar companies like Netflix.

Transforming business and industry:
- Retail and e-commerce: personalized recommendations, inventory management, fraud detection
- Finance and insurance: credit risk assessment, churn prediction, algorithmic trading
- Healthcare and medicine: disease diagnosis, drug discovery, personalized medicine
- Search and online advertising: search ranking, ad targeting, click-through rate prediction

The Gradient Boosting Algorithm: A Step-by-Step Guide

Input

The gradient boosting algorithm works on tabular data with features (X) and a target (y). Like other machine learning algorithms, the aim is to learn enough from the training data to generalize well to unseen data points. We will use a simple sales dataset with four rows to understand the underlying process of gradient boosting. Using three features (customer age, purchase category, and purchase weight), we want to predict the purchase amount.

The Loss Function in Gradient Boosting

In machine learning, a loss function is a critical component that quantifies the difference between a model's predictions and the actual values. In short, it measures how well a model is performing. Here is a breakdown of its role:
- Calculates the error: it takes the model's predicted output and compares it to the ground truth (the actual observed values).
- Serves as an evaluation metric: by comparing the loss on training, validation, and test datasets, you can assess your model's generalization ability and avoid overfitting.

The two most common loss functions are:

Mean Squared Error (MSE): this popular loss function for regression measures the average of the squared differences between predicted and actual values. Gradient boosting often uses a variation of it, shown in the sketch below.

Cross-entropy: this function measures the difference between two probability distributions. It is commonly used for classification tasks where the targets have discrete categories.
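The exact formulas from the slides are not reproduced in this transcript. As a hedged reconstruction based on the standard formulation, the variation of MSE commonly used in gradient boosting halves the squared error, so that its negative gradient is simply the residual computed in Step 2 below:

```latex
% Squared-error loss as commonly used in gradient boosting; the 1/2 factor
% is a standard convention (the slide's exact formula is not preserved here):
L(y_i, \hat{y}_i) = \tfrac{1}{2}\,(y_i - \hat{y}_i)^2

% Its negative gradient with respect to the prediction is the plain residual,
% which is exactly the pseudo-residual computed in Step 2 below:
-\frac{\partial L}{\partial \hat{y}_i} = y_i - \hat{y}_i

% Binary cross-entropy for classification with discrete targets:
L(y_i, p_i) = -\bigl[\, y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \,\bigr]
```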
Step 1: Make an Initial Prediction

Gradient boosting is an algorithm that gradually improves its accuracy. To start the process, we need an initial guess or prediction. For regression with squared error, the initial guess is the average of the target:

(123.45 + 56.78 + 345.67 + 98.01) / 4 ≈ 156

In the first round, our model predicts that all purchases were the same: 156 dollars.

Step 2: Calculate the Pseudo-Residuals

The next step is to find the difference between each observed value and our initial prediction: Observed − 156.

Step 3: Build a Weak Learner

Next, we build a decision tree (weak learner) that predicts the residuals using our three features (age, category, purchase weight). For this problem, we will limit the decision tree to just four leaves (terminal nodes), but in practice, people usually choose between 8 and 32 leaves.

Step 4: Iterate

In the next steps, we iterate on step 3 and build more weak learners. Each new tree is fit to the residuals of the current ensemble, and its predictions (scaled by the learning rate) are added to the running prediction before the residuals are recomputed. A minimal code sketch tying these steps together appears after the hyperparameter overview below.

Hyperparameter Tuning

Objective: this parameter sets the direction and the loss function of the algorithm. If the objective is regression, MSE is chosen as the loss function, whereas for classification, cross-entropy is used.

Learning rate: perhaps the most important hyperparameter of gradient boosting. It controls the contribution of each weak learner by adjusting the shrinkage factor. Smaller values (towards 0) decrease how much say each weak learner has in the ensemble. This requires building more trees and, thus, more time to finish training, but the final strong learner tends to be more robust to overfitting.

Number of trees: this parameter, also called the number of boosting rounds, controls the number of trees to build. The more trees you build, the stronger and more performant the ensemble becomes. It also becomes more complex, as more trees allow the model to capture more patterns in the data. However, more trees significantly increase the risk of overfitting. To mitigate this, employ a combination of early stopping and a low learning rate.

Max depth: this parameter controls the number of levels in each weak learner (decision tree). A max depth of 3 means the tree has three levels, counting the leaf level. The deeper the tree, the more complex and computationally expensive the model becomes. Choose a value close to 3 to prevent overfitting, and keep the maximum depth at about 10.

Minimum number of samples per leaf: this parameter controls how branches split in decision trees. Setting a low value for the number of samples in terminal nodes (leaves) makes the algorithm sensitive to noise. A larger minimum number of samples helps to prevent overfitting by making it more difficult for the trees to create splits based on too few data points.

Subsampling rate: this parameter controls the proportion of the rows used to train each tree. In the example above, we used 100% of the rows because our dataset had only four rows. However, real-world datasets often have many more rows and benefit from sampling. So, if you set the subsampling rate below 1, such as 0.7, each weak learner trains on a randomly sampled 70% of the rows. A smaller subsampling rate leads to faster training and adds randomness that can help reduce overfitting, though setting it too low may cause underfitting.

Feature sampling rate: this parameter is exactly like subsampling, but it samples columns (features) instead of rows. For datasets with hundreds of features, it is recommended to choose a feature sampling rate between 0.5 and 1 to lower the chance of overfitting.
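To make steps 1-4 concrete, here is a minimal from-scratch sketch in Python. Only the four purchase amounts (123.45, 56.78, 345.67, 98.01) come from the example above; the feature values, the chosen settings, and the use of scikit-learn's DecisionTreeRegressor as the weak learner are illustrative assumptions, not part of the original slides.

```python
# Minimal gradient boosting sketch for the four-row purchase-amount example.
# Feature values below are made up for illustration; only the targets
# (123.45, 56.78, 345.67, 98.01) come from the text.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([
    [25, 0, 1.2],   # [customer age, purchase category (encoded), purchase weight]
    [34, 1, 0.8],
    [45, 2, 3.5],
    [29, 1, 1.0],
])
y = np.array([123.45, 56.78, 345.67, 98.01])

learning_rate = 0.1
n_rounds = 50

# Step 1: the initial prediction is the average of the target (~156).
prediction = np.full_like(y, y.mean())
trees = []

for _ in range(n_rounds):
    # Step 2: pseudo-residuals = observed - current prediction.
    residuals = y - prediction
    # Step 3: fit a small decision tree (weak learner) to the residuals.
    tree = DecisionTreeRegressor(max_leaf_nodes=4)
    tree.fit(X, residuals)
    # Step 4: add the scaled tree predictions to the ensemble and repeat.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("Final training predictions:", prediction.round(2))
```

With only four rows the trees memorize the data almost immediately; the sketch is only meant to make the update rule behind the four steps concrete.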
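For reference, here is a sketch of how the hyperparameters discussed above map onto scikit-learn's GradientBoostingRegressor. The specific values are illustrative assumptions, not recommendations from the slides.

```python
# How the hyperparameters above map onto scikit-learn's gradient boosting
# implementation (values are illustrative, not tuned).
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    loss="squared_error",   # objective: squared-error loss for regression
    learning_rate=0.1,      # shrinkage applied to each weak learner
    n_estimators=300,       # number of trees / boosting rounds
    max_depth=3,            # levels per tree; ~3 is a common starting point
    min_samples_leaf=5,     # minimum number of samples per terminal node (leaf)
    subsample=0.7,          # row subsampling rate per tree
    max_features=0.8,       # feature (column) sampling rate per split
    n_iter_no_change=10,    # early stopping on an internal validation split
)
# model.fit(X_train, y_train) would then train the ensemble
# (X_train and y_train are hypothetical placeholders).
```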
Introduction to Cluster Analysis

Why Segmentation?

For example, you might cluster customers on demographic variables to create segments.

Unsupervised Classification

Unsupervised classification is the grouping of cases based on similarities in their input values. (Slide diagram: inputs are grouped into cluster 1, cluster 2, and cluster 3.)

What is Clustering?
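As a brief, hypothetical illustration of the idea (not from the original slides), the following sketch groups customers into segments based on two demographic variables using k-means:

```python
# Hypothetical customer segmentation: group customers by demographic
# similarity with k-means (values and cluster count are illustrative).
import numpy as np
from sklearn.cluster import KMeans

customers = np.array([
    [22, 28_000],   # [age, annual income]
    [25, 32_000],
    [47, 85_000],
    [52, 90_000],
    [33, 55_000],
    [36, 58_000],
])

# In practice, features on different scales should be standardized first.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
segments = kmeans.fit_predict(customers)
print("Segment per customer:", segments)
```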