Week 6 COE305 Machine Learning PDF

Summary

This document contains lecture notes on machine learning, covering random forests, cross-validation, and ensemble learning methods. The notes include definitions, concepts, and examples of these topics.

Full Transcript


WEEK 6 COE305 MACHINE LEARNING
BY FEMILDA JOSEPHIN

AGENDA
- Random Forest
- Cross-validation techniques
- Quiz

PROBLEMS FACED WITH DECISION TREES
- Decision trees are prone to overfitting, especially when they become deep and complex, capturing noise in the training data.
- They can be sensitive to variations in the training data, leading to high variance and instability in the model.
- They may become biased, especially if the training data is imbalanced or contains outliers.
- They may create overly specific models that do not generalize well to new, unseen data.
- Small changes in the training data can lead to significant changes in the structure of the tree.
- They can be sensitive to outliers, which may disproportionately influence the structure of the tree.

ENSEMBLE LEARNING
Ensemble methods create multiple models and then combine them to produce improved results, typically more accurate than any single model would achieve.

Types of ensemble methods:
- Bagging: creates different training subsets by sampling the training data with replacement; the final output is based on majority voting. E.g., Random Forest.
- Boosting: combines weak learners into strong learners by building models sequentially so that the final model has the highest accuracy. E.g., AdaBoost, XGBoost.

BAGGING
Also known as Bootstrap Aggregation.
1. Bagging chooses random samples from the data set: each model is built from samples drawn from the original data with replacement, known as row sampling. This step of row sampling with replacement is called the bootstrap.
2. Each model is trained independently and generates its own result.
3. The final output is produced by combining the results of all models via majority voting. This combining step is known as aggregation.

RANDOM FORESTS
- Supervised learning algorithm, used for both classification and regression.
- An ensemble learning method.
- Combines the output of multiple decision trees to reach a single result.
- One of the most popular algorithms; it has given excellent performance in many applications.
- Builds decision trees on different samples and takes their majority vote for classification, or their average for regression.
- Uses the bagging method.

IMPORTANT PROPERTIES OF RANDOM FOREST
- Diversity: not all attributes/variables/features are considered while making an individual tree; each tree is different.
- Immune to the curse of dimensionality: since each tree does not consider all the features, the feature space is reduced.
- Parallelization: each tree is created independently from different data and attributes, so we can make full use of the CPU to build random forests.
- Stability: the result is based on majority voting/averaging, which mitigates overfitting.

ADVANTAGES
1. High accuracy: the ensemble of trees provides accurate predictions.
2. Robust to overfitting: aggregation reduces overfitting.
3. Handles missing values: manages missing data effectively.
4. Variable importance: identifies key contributing features.
5. Works with mixed data types: handles categorical and continuous data.
6. Reduces variance: averaging improves model generalization.
7. Efficient on large datasets: scales well to big datasets.
8. Implicit feature selection: highlights important features.

DISADVANTAGES
- Lack of interpretability: the complex ensemble is challenging to interpret.
- Computational complexity: can be resource-intensive.
- Memory usage: larger model sizes may require substantial memory.
- Not suitable for small datasets: requires a diverse set of trees.
- Black-box model: limited understanding of internal workings.
- Sensitivity to noisy data: vulnerable to noisy datasets.
- May overfit noise: can capture irrelevant patterns in the data.
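The bagging pipeline described above (bootstrap row sampling, independent training, majority-vote aggregation) can be sketched from scratch in a few lines. This is an illustrative toy, not a real Random Forest: a one-feature decision stump stands in for a full decision tree, and all names here (Stump, bagging_fit, bagging_predict) are our own.

```python
# A minimal sketch of bagging with majority voting, assuming a decision stump
# on one randomly chosen feature as the weak learner (illustrative only).
import random
from collections import Counter

def bootstrap_sample(X, y, rng):
    """Row sampling with replacement: the 'bootstrap' step."""
    n = len(X)
    idx = [rng.randrange(n) for _ in range(n)]
    return [X[i] for i in idx], [y[i] for i in idx]

class Stump:
    """A weak learner that thresholds a single randomly chosen feature."""
    def fit(self, X, y, rng):
        self.f = rng.randrange(len(X[0]))            # feature diversity, as in Random Forest
        self.t = sum(x[self.f] for x in X) / len(X)  # split at the feature mean
        left  = [yi for x, yi in zip(X, y) if x[self.f] <= self.t]
        right = [yi for x, yi in zip(X, y) if x[self.f] >  self.t]
        self.left_label  = Counter(left).most_common(1)[0][0] if left else y[0]
        self.right_label = Counter(right).most_common(1)[0][0] if right else y[0]
        return self
    def predict(self, x):
        return self.left_label if x[self.f] <= self.t else self.right_label

def bagging_fit(X, y, n_models=25, seed=0):
    """Train each model independently on its own bootstrap sample."""
    rng = random.Random(seed)
    return [Stump().fit(*bootstrap_sample(X, y, rng), rng) for _ in range(n_models)]

def bagging_predict(models, x):
    """Aggregation: majority vote over all independently trained models."""
    votes = Counter(m.predict(x) for m in models)
    return votes.most_common(1)[0][0]

# Two well-separated classes in 2-D.
X = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.9, 0.8), (0.8, 0.9), (1.0, 1.0)]
y = [0, 0, 0, 1, 1, 1]
models = bagging_fit(X, y)
print(bagging_predict(models, (0.15, 0.15)))  # 0
print(bagging_predict(models, (0.95, 0.95)))  # 1
```

For regression, the aggregation step would average the models' outputs instead of taking a majority vote, matching the classification/regression distinction noted above.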
CROSS-VALIDATION
Cross-validation is a technique for validating model efficiency by training on a subset of the input data and testing on a previously unseen subset of it. It helps us make better use of our data and gives much more information about our algorithm's performance.

WHY CROSS-VALIDATION?
- Performance estimation: cross-validation provides a more reliable estimate of a model's performance than a single train-test split.
- Variability in data: datasets can vary in the distribution of classes or patterns.
- Model robustness assessment: cross-validation tests a model's performance across different subsets of the data.
- Maximizing data utilization: when the dataset is limited, cross-validation makes more efficient use of the available data.

METHODS USED FOR CROSS-VALIDATION
- Hold-out method
- Leave-one-out cross-validation
- K-fold cross-validation
- Stratified k-fold cross-validation
- Time series cross-validation

HOLD-OUT METHOD
This is the simplest evaluation method and is widely used in machine learning projects. The entire dataset (population) is divided into two sets: a train set and a test set. The data can be split 70-30, 60-40, 75-25, 80-20, or even 50-50 depending on the use case; as a rule, the proportion of training data should be larger than the test data. The split happens randomly, and generally a random_state is specified. This can lead to very high variance: every time the split changes, the accuracy also changes.

DRAWBACKS
- The test error rates are highly variable (high variance) and depend entirely on which observations end up in the training and test sets (overfitting).
- Only part of the data is used to train the model (high bias), which is a poor choice when data is scarce and leads to overestimation of the test error (underfitting).
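The hold-out split above can be sketched with the standard library alone. The function name holdout_split, the 70-30 ratio, and the seed values are our own illustrative choices; in practice one would use a library routine such as scikit-learn's train_test_split.

```python
# A minimal sketch of the hold-out method on a plain Python list.
import random

def holdout_split(data, train_frac=0.7, seed=42):
    """Shuffle indices with a fixed seed (the role random_state plays in
    scikit-learn) and cut the data into train and test portions."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)      # the split happens randomly
    cut = int(train_frac * len(data))
    return [data[i] for i in idx[:cut]], [data[i] for i in idx[cut:]]

data = list(range(10))
train, test = holdout_split(data)
print(len(train), len(test))              # 7 3

# Changing the seed changes which observations land in the test set, which is
# the source of the high variance in hold-out error estimates:
_, test_a = holdout_split(data, seed=1)
_, test_b = holdout_split(data, seed=2)
print(sorted(test_a), sorted(test_b))
```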
LEAVE-ONE-OUT CROSS-VALIDATION
In this method we again divide the data into train and test sets, but with a twist: instead of two subsets, we select a single observation as the test data, label everything else as training data, and train the model. Then the second observation is selected as test data and the model is trained on the remaining data. This process continues n times, and the average over all iterations is taken as the estimate of the test set error.

DRAWBACKS
- It can be time-consuming if the model is complex and takes a long time to fit.
- It is computationally expensive, as the model is trained n times, once for every observation in the data.

K-FOLD CROSS-VALIDATION
1. The k-fold cross-validation approach divides the input dataset into k groups of samples of equal size. These samples are called folds.
2. For each learning set, the prediction function uses k-1 folds, and the remaining fold is used as the test set.
3. This is a very popular CV approach because it is easy to understand, and its output is less biased than that of other methods.

The steps for k-fold cross-validation are:
1. Split the input dataset into k groups.
2. For each group: take that group as the reserve/test data set, use the remaining groups as the training dataset, fit the model on the training set, and evaluate its performance on the test set.

PROS AND CONS
- Pros: prone to less variance, because it uses the entire training set.
- Cons: higher computational costs; the model needs to be trained k times at the validation step (plus one more at the test step).

STRATIFIED K-FOLD CROSS-VALIDATION
Stratified k-fold is a variation of the standard k-fold CV technique designed to be effective in cases of target imbalance.
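The k-fold splitting steps above, and the stratified variant, can be sketched as pure index bookkeeping. This is an illustrative from-scratch sketch with our own function names (kfold_indices, stratified_kfold_indices); real projects would use scikit-learn's KFold and StratifiedKFold.

```python
# Minimal sketches of k-fold and stratified k-fold index splitting.
def kfold_indices(n, k):
    """Partition indices 0..n-1 into k folds; each fold serves once as the
    test set. Leave-one-out CV is the special case k == n."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def stratified_kfold_indices(labels, k):
    """Deal each class's indices round-robin into folds, so every fold keeps
    roughly the overall class proportions."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for i, c in enumerate(labels):
        by_class.setdefault(c, []).append(i)
    for idxs in by_class.values():
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# 1/3 class "a", 2/3 class "b", a scaled-down version of the 100-record example:
labels = ["a"] * 2 + ["b"] * 4
for train, test in stratified_kfold_indices(labels, 2):
    print([labels[i] for i in test])   # each test fold holds 1 "a" and 2 "b"
```

Note how every index appears in exactly one test fold across the k iterations, so the whole dataset is eventually used for both training and validation.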
For example, with a dataset of 100 records where 1/3 of the data is in one class and 2/3 is in another class, the stratified technique can be helpful. It ensures that the distribution of target classes is balanced across the different folds, addressing potential issues related to class imbalance.

DISADVANTAGES OF STRATIFIED CROSS-VALIDATION
- Computational cost: can be more computationally expensive than regular cross-validation due to the need to ensure a balanced class distribution.
- Not always necessary: may offer limited benefits in well-balanced datasets with sufficient samples for each class.

TIME SERIES CROSS-VALIDATION
Time series cross-validation is a technique used to evaluate the performance of time series forecasting models. Unlike standard cross-validation, it respects the temporal order of the data, which is crucial for assessing a model's ability to make accurate predictions on unseen future data points. E.g., stock market prediction.

QUIZ
1) How does Random Forest introduce randomness in the construction of individual decision trees?
A. By always using the same set of features for all trees.
B. By using the same training data for each tree.
C. By considering a random subset of features for each split in each tree.
D. By using a fixed set of hyperparameters for all trees.
Answer: C. By considering a random subset of features for each split in each tree.

3) In Random Forest, how are predictions made for a new data point?
A. By averaging the predictions of all trees in the ensemble.
B. By considering only the prediction of the first tree in the ensemble.
C. By selecting the tree with the highest accuracy.
D. By summing the predictions of all trees in the ensemble.
Answer: A. By averaging the predictions of all trees in the ensemble.

4) What is the concept of "feature importance" in Random Forest?
A.
It represents the order in which features are added to the model during training.
B. It indicates the significance of each feature in contributing to the model's predictions.
C. It refers to the number of features randomly selected for each split in a tree.
D. It measures the size of the dataset used for training each decision tree.
Answer: B. It indicates the significance of each feature in contributing to the model's predictions.

5) What is the role of the term "forest" in the name Random Forest?
A. It emphasizes the complexity of each individual decision tree.
B. It signifies that the model is only suitable for forestry-related datasets.
C. It highlights the combination of multiple decision trees in an ensemble.
D. It refers to the need for a dense and diverse set of features.
Answer: C. It highlights the combination of multiple decision trees in an ensemble.

6) In k-fold cross-validation, how is the dataset divided?
A. Only into training and test sets.
B. Into k subsets, and the model is trained and tested k times, using a different subset as the test set in each iteration.
C. Into two subsets, and the model is trained on one and tested on the other.
D. Randomly into training and test sets for each iteration.
Answer: B. Into k subsets, and the model is trained and tested k times, using a different subset as the test set in each iteration.

7) What is the advantage of using stratified k-fold cross-validation over regular k-fold cross-validation?
A. It is computationally faster.
B. It ensures that each fold has a representative distribution of the target variable.
C. It guarantees a larger training set for each fold.
D. It simplifies the implementation of cross-validation.
Answer: B. It ensures that each fold has a representative distribution of the target variable.

8) Which statement is true regarding the bias-variance tradeoff in the context of cross-validation?
A. Cross-validation has no impact on the bias-variance tradeoff.
B.
Cross-validation helps in reducing both bias and variance.
C. Cross-validation increases bias but reduces variance.
D. Cross-validation increases variance but reduces bias.
Answer: B. Cross-validation helps in reducing both bias and variance.

9) In leave-one-out cross-validation (LOOCV), how many folds are used?
A. Equal to the number of observations in the dataset.
B. Twice the number of observations in the dataset.
C. One less than the number of observations in the dataset.
D. A fixed number, typically 10.
Answer: A. Equal to the number of observations in the dataset. (Each fold contains one data point used as the test set, and the remaining n-1 observations are used for training.)

10) What is the main advantage of using cross-validation over a single train-test split?
A. Cross-validation provides a more optimistic estimate of model performance.
B. Cross-validation ensures that the model is trained on the entire dataset.
C. Cross-validation gives a more robust estimate of the model's generalization performance.
D. Cross-validation is faster and requires less computational resources.
Answer: C. Cross-validation gives a more robust estimate of the model's generalization performance.
