EEC3501 Machine Learning Lecture 8 (Part 2)

Summary

This lecture covers ensemble methods in machine learning, including bagging, boosting, and random forests. It explains how these techniques combine multiple models to improve performance and robustness. The lecture also delves into AdaBoost, a specific boosting algorithm.

Full Transcript


# EEC3501 Machine Learning

## Ensemble methods: combine different estimators by voting

- **General idea:** combine the predictions of several different estimators in order to improve performance/robustness over a single estimator.
- **Voting/averaging** of the predictions of multiple trained models.

![](https://i.imgur.com/X7B4FzU.png)

## Ensembles: train the same model multiple times on different datasets

### Bagging & Boosting

- **From one single dataset *D***:
  - Create multiple datasets $D_1, D_2, \ldots, D_n$.
- **Train the same model** *h* (e.g. a DT) on the different datasets.
- Get *n* differently trained models, $h_1, h_2, \ldots, h_n$ (i.e. *n* different models).
- **Output prediction** $y(x)$ is the weighted combination of the outputs of the *n* models.

![](https://i.imgur.com/v7c3o9Z.png)

## Bagging vs. Boosting

**Differences:**

- How the *n* datasets are generated
- How the initial model is chosen
- Objectives

![](https://i.imgur.com/s0W2b8r.png)

## Bagging and Boosting: two ways of combining models by voting

### Bagging

- Train the same model *h* on *n* independent training sets.
- *n* independently trained models, $h_1, h_2, \ldots, h_n$.

![](https://i.imgur.com/p7n333n.png)

### Boosting

- Train the same model *h* over a sequence of *n* training datasets, each generated **conditionally** on the previous model/dataset.
- *n* trained models, $h_1, h_2, \ldots, h_n$, each doing well on a different portion of the feature space.

![](https://i.imgur.com/D3t3h0R.png)

## Bagging: Bootstrap Aggregation

- Take the original dataset *D*, with *m* examples.
- Create *n* **bootstrap copies** of size $m' \leq m$ by re-sampling from *D* **with replacement** (sample one element from *D* and put it back into *D* before the next draw).
- Train the *n* models independently.
- **Average** the answers from all *n* trained models (see the code sketch after the Random Forest slide below).

![](https://i.imgur.com/15n3nPl.png)

## Bagging: Prediction (example: classification with a DT)

- **Test data**: (1, 1, 1)
- Predict *c1* with probability 2/3.

![](https://i.imgur.com/C0nX4j9.png)

## Random Forest: Decision Tree ensemble + bagging on features

- **Test data**: (1, 1, 1)
- **Predict c1** with probability 2/3.
- The idea can be generalized to random forests of other types of estimators.

![](https://i.imgur.com/bH7R76N.png)
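The bootstrap-aggregation procedure above maps almost directly to code. Below is a minimal Python sketch (not from the lecture) that uses scikit-learn decision trees as the base model *h*; the function names `bagging_fit`/`bagging_predict`, the majority-vote combination, and the assumption that the data are NumPy arrays with integer class labels are all illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=10, m_prime=None, seed=0):
    """Train n_models trees, each on a bootstrap copy of size m' <= m."""
    rng = np.random.default_rng(seed)
    m = len(X)
    m_prime = m if m_prime is None else m_prime
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, m, size=m_prime)          # resample with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Combine the n trained models by majority vote (use a mean for regression)."""
    votes = np.stack([h.predict(X) for h in models])    # shape: (n_models, n_samples)
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
```

Restricting each tree to a random subset of features at every split (e.g. `max_features="sqrt"` in `DecisionTreeClassifier`) turns this bagged ensemble into a random forest, matching the "bagging on features" idea above.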
## Boosting: turning a weak algorithm into an awesome one!

### Sequential, meta-learning algorithm

1. Take a weak learning algorithm (e.g. a decision stump). One requirement: it should be at least slightly better than random.
2. Use the algorithm to train a weak model $h_i$, *i* = 1, on some weighted training data.
3. Store the model $h_i$.
4. Compute the error of the model on each training example.
5. Give higher importance to examples on which the model made mistakes.
6. *i* += 1; (re)train a new model $h_i$ using the importance-weighted training examples.
7. Go back to step 2, or stop based on some criterion (e.g. number of iterations).
8. To make a prediction on a new input, combine all the trained models $h_i$, *i* = 1, ..., *M*, each model voting on / weighting its output using an estimated parameter $a_i$.

![](https://i.imgur.com/dSbH7uQ.png)

## AdaBoost (Adaptive Boosting) [Freund & Schapire '95] (Gödel Prize '03)

- **Given**: a dataset $\{(x_i, y_i)\}_{i=1}^m$, where $x_i \in X$, $y_i \in Y = \{-1, +1\}$, and a weak learner.
- **Initialize**: $D_1(i) = \frac{1}{m}$ (initially equal weights).
- **For** *t* = 1, ..., *T*:
  - Train the weak learner using distribution $D_t$.
  - Get a weak classifier $h_t: X \rightarrow \mathbb{R}$.
  - Choose $a_t \in \mathbb{R}$. How?
  - **Update**: $D_{t+1}(i) = \frac{D_t(i)\exp(-a_t y_i h_t(x_i))}{Z_t}$
    - Increase the weight if wrong on example *i*: $y_i h_t(x_i) = -1 < 0$.
    - Decrease the weight if correct on example *i*: $y_i h_t(x_i) = +1 > 0$.
- **Final classifier** (binary case): $H(x) = \operatorname{sign}\left(\sum_{t=1}^T a_t h_t(x)\right)$

![](https://i.imgur.com/Q8u9k7S.png)

## How to choose $a_t$?

- **Weight update rule**: $D_{t+1}(i) = \frac{D_t(i)\exp(-a_t y_i h_t(x_i))}{Z_t}$
- **Voting weight**: $a_t = \frac{1}{2} \log\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$ [Freund & Schapire '95]
- **$\epsilon_t$: probability of error under the weighted training distribution at round *t* → weighted training error**
  - $\epsilon_t = P_{i \sim D_t}\left[h_t(x_i) \neq y_i\right] = \sum_{i=1}^m D_t(i)\,\delta\!\left(h_t(x_i) \neq y_i\right)$
  - $\delta(\cdot)$: does $h_t$ get the *i*-th point wrong? (1/0)
- **$\epsilon_t = 0$ if $h_t$ perfectly classifies all weighted data points → $a_t = \infty$**
- **$\epsilon_t = 1$ if $h_t$ is perfectly wrong → $-h_t$ is perfectly right → $a_t = -\infty$**
- **$\epsilon_t = 0.5$ → $a_t = 0$**

(A code sketch of the full AdaBoost loop appears after the closing slide below.)

![](https://i.imgur.com/VjH35z5.png)

## Thanks! Do you have any questions?

![](https://i.imgur.com/D2sH22B.png)
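As a recap of the AdaBoost procedure above, here is a minimal Python sketch (not part of the lecture) that follows the slides' weight-update and voting-weight formulas, using a depth-1 scikit-learn decision tree (a decision stump) as the weak learner. The labels are assumed to be a NumPy array with values in {-1, +1}, and the clipping of $\epsilon_t$ is an added guard against the $a_t = \pm\infty$ edge cases noted above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    m = len(X)
    D = np.full(m, 1.0 / m)                        # D_1(i) = 1/m: equal weights
    models, alphas = [], []
    for t in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=D)           # train weak learner on D_t
        pred = stump.predict(X)
        eps = np.sum(D * (pred != y))              # weighted training error eps_t
        eps = np.clip(eps, 1e-10, 1 - 1e-10)       # guard against a_t = +/- inf
        a = 0.5 * np.log((1 - eps) / eps)          # voting weight a_t
        D = D * np.exp(-a * y * pred)              # up-weight mistakes, down-weight hits
        D = D / D.sum()                            # normalize by Z_t
        models.append(stump)
        alphas.append(a)
    return models, alphas

def adaboost_predict(models, alphas, X):
    """Final classifier: H(x) = sign(sum_t a_t h_t(x))."""
    scores = sum(a * h.predict(X) for a, h in zip(alphas, models))
    return np.sign(scores)
```

Note that whenever $\epsilon_t < 0.5$ the voting weight $a_t$ is positive, so any weak learner that is even slightly better than random contributes a positive vote, which is exactly the requirement stated in the boosting meta-algorithm.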
