EEC3501 Machine Learning Lecture 8 (Part 2)

Summary

This lecture covers ensemble methods in machine learning, including bagging, boosting, and random forests. It explains how these techniques combine multiple models to improve performance and robustness. The lecture also delves into AdaBoost, a specific boosting algorithm.

Full Transcript


# EEC3501 Machine Learning

## Ensemble methods: combine different estimators by voting

- **General idea:** combine the predictions of several different estimators in order to improve performance/robustness over a single estimator.
- **Voting/averaging** of the predictions of multiple trained models.

![](https://i.imgur.com/X7B4FzU.png)

## Ensembles: train the same model multiple times on different datasets

### Bagging & Boosting

- **From one single dataset *D***:
  - Create multiple datasets $D_1, D_2, \ldots, D_n$.
- **Train the same model** *h* (e.g. a DT) on the different datasets.
- Get *n* differently trained models, $h_1, h_2, \ldots, h_n$ (i.e. *n* different models).
- **Output prediction** $y(x)$ is the weighted combination of the outputs of the *n* models.

![](https://i.imgur.com/v7c3o9Z.png)

## Bagging vs. Boosting

**Differences:**

- How the *n* datasets are generated
- How the initial model is chosen
- Objectives

![](https://i.imgur.com/s0W2b8r.png)

## Bagging and Boosting: two ways of combining models by voting

### Bagging

- Train the same model *h* on *n* independent training sets.
- *n* independently trained models, $h_1, h_2, \ldots, h_n$.

![](https://i.imgur.com/p7n333n.png)

### Boosting

- Train the same model *h* over a sequence of *n* training datasets, each generated **conditionally** on the previous model/dataset.
- *n* trained models, $h_1, h_2, \ldots, h_n$, each doing well on a different portion of the feature space.

![](https://i.imgur.com/D3t3h0R.png)

## Bagging: Bootstrap Aggregation

- Take the original dataset *D*, with *m* examples.
- Create *n* **bootstrap copies** of size $m' \leq m$ by re-sampling from *D* **with replacement** (sample one element from *D* and put it back into *D* before the next draw).
- Train the *n* models independently.
- **Average** the answers from all *n* trained models (see the code sketch after the Random Forest slide below).

![](https://i.imgur.com/15n3nPl.png)

## Bagging: Prediction (example: classification with a DT)

- **Test data**: (1, 1, 1)
- Predict *c1* with probability 2/3.

![](https://i.imgur.com/C0nX4j9.png)

## Random Forest: Decision Tree ensemble + bagging on features

- **Test data**: (1, 1, 1)
- **Predict c1** with probability 2/3.
- The idea can be generalized to random forests of other types of estimators.

![](https://i.imgur.com/bH7R76N.png)
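The bootstrap-aggregation procedure above maps almost directly to code. Below is a minimal Python sketch (not from the lecture) that uses scikit-learn decision trees as the base model *h*; the function names `bagging_fit`/`bagging_predict`, the majority-vote combination, and the assumption that the data are NumPy arrays with integer class labels are all illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=10, m_prime=None, seed=0):
    """Train n_models trees, each on a bootstrap copy of size m' <= m."""
    rng = np.random.default_rng(seed)
    m = len(X)
    m_prime = m if m_prime is None else m_prime
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, m, size=m_prime)          # resample with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Combine the n trained models by majority vote (use a mean for regression)."""
    votes = np.stack([h.predict(X) for h in models])    # shape: (n_models, n_samples)
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
```

Restricting each tree to a random subset of features at every split (e.g. `max_features="sqrt"` in `DecisionTreeClassifier`) turns this bagged ensemble into a random forest, matching the "bagging on features" idea above.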
## Boosting: turning a weak algorithm into an awesome one!

### Sequential, meta-learning algorithm

1. Take a weak learning algorithm (e.g. a decision stump). One requirement: it should be at least slightly better than random.
2. Use the algorithm to train a weak model $h_i$, *i* = 1, on some weighted training data.
3. Store the model $h_i$.
4. Compute the error of the model on each training example.
5. Give higher importance to examples on which the model made mistakes.
6. *i* += 1; (re)train a new model $h_i$ using the importance-weighted training examples.
7. Go back to step 2, or stop based on some criterion (e.g. number of iterations).
8. To make a prediction on a new input, combine all the trained models $h_i$, *i* = 1, ..., *M*, each model voting on / weighting its output using an estimated parameter $a_i$.

![](https://i.imgur.com/dSbH7uQ.png)

## AdaBoost (Adaptive Boosting) [Freund & Schapire '95] (Gödel Prize '03)

- **Given**: a dataset $\{(x_i, y_i)\}_{i=1}^m$, where $x_i \in X$, $y_i \in Y = \{-1, +1\}$, and a weak learner.
- **Initialize**: $D_1(i) = \frac{1}{m}$ (initially equal weights).
- **For** *t* = 1, ..., *T*:
  - Train the weak learner using distribution $D_t$.
  - Get a weak classifier $h_t: X \rightarrow \mathbb{R}$.
  - Choose $a_t \in \mathbb{R}$. How?
  - **Update**: $D_{t+1}(i) = \frac{D_t(i)\exp(-a_t y_i h_t(x_i))}{Z_t}$
    - Increase the weight if wrong on example *i*: $y_i h_t(x_i) = -1 < 0$.
    - Decrease the weight if correct on example *i*: $y_i h_t(x_i) = +1 > 0$.
- **Final classifier** (binary case): $H(x) = \operatorname{sign}\left(\sum_{t=1}^T a_t h_t(x)\right)$

![](https://i.imgur.com/Q8u9k7S.png)

## How to choose $a_t$?

- **Weight update rule**: $D_{t+1}(i) = \frac{D_t(i)\exp(-a_t y_i h_t(x_i))}{Z_t}$
- **Voting weight**: $a_t = \frac{1}{2} \log\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$ [Freund & Schapire '95]
- **$\epsilon_t$: probability of error under the weighted training distribution at round *t* → weighted training error**
  - $\epsilon_t = P_{i \sim D_t}\left[h_t(x_i) \neq y_i\right] = \sum_{i=1}^m D_t(i)\,\delta\!\left(h_t(x_i) \neq y_i\right)$
  - $\delta(\cdot)$: does $h_t$ get the *i*-th point wrong? (1/0)
- **$\epsilon_t = 0$ if $h_t$ perfectly classifies all weighted data points → $a_t = \infty$**
- **$\epsilon_t = 1$ if $h_t$ is perfectly wrong → $-h_t$ is perfectly right → $a_t = -\infty$**
- **$\epsilon_t = 0.5$ → $a_t = 0$**

(A code sketch of the full AdaBoost loop appears after the closing slide below.)

![](https://i.imgur.com/VjH35z5.png)

## Thanks! Do you have any questions?

![](https://i.imgur.com/D2sH22B.png)
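As a recap of the AdaBoost procedure above, here is a minimal Python sketch (not part of the lecture) that follows the slides' weight-update and voting-weight formulas, using a depth-1 scikit-learn decision tree (a decision stump) as the weak learner. The labels are assumed to be a NumPy array with values in {-1, +1}, and the clipping of $\epsilon_t$ is an added guard against the $a_t = \pm\infty$ edge cases noted above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    m = len(X)
    D = np.full(m, 1.0 / m)                        # D_1(i) = 1/m: equal weights
    models, alphas = [], []
    for t in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=D)           # train weak learner on D_t
        pred = stump.predict(X)
        eps = np.sum(D * (pred != y))              # weighted training error eps_t
        eps = np.clip(eps, 1e-10, 1 - 1e-10)       # guard against a_t = +/- inf
        a = 0.5 * np.log((1 - eps) / eps)          # voting weight a_t
        D = D * np.exp(-a * y * pred)              # up-weight mistakes, down-weight hits
        D = D / D.sum()                            # normalize by Z_t
        models.append(stump)
        alphas.append(a)
    return models, alphas

def adaboost_predict(models, alphas, X):
    """Final classifier: H(x) = sign(sum_t a_t h_t(x))."""
    scores = sum(a * h.predict(X) for a, h in zip(alphas, models))
    return np.sign(scores)
```

Note that whenever $\epsilon_t < 0.5$ the voting weight $a_t$ is positive, so any weak learner that is even slightly better than random contributes a positive vote, which is exactly the requirement stated in the boosting meta-algorithm.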
