Full Transcript

SVM - formal definition of the maximal margin ("widest street") approach
[source: MIT 6.034 Artificial Intelligence Lecture, Jeremy Jordan blog] 96

SVM - decision margins definition (constraint definition)
[source: MIT 6.034 Artificial Intelligence Lecture, Jeremy Jordan blog] 97

SVM - expression of the distance between margins (objective function definition)
The shortest distance between the margins is given by the x+ and x- points (the same trick with the dot product).
[source: MIT 6.034 Artificial Intelligence Lecture, Jeremy Jordan blog] 98

SVM - constrained optimization using the Lagrangian
In practice we take the final formula for L and maximize it (a quadratic programming problem) subject to the α's; as a result, we obtain a corresponding α value for each observation in the training set.
[source: MIT 6.034 Artificial Intelligence Lecture, Jeremy Jordan blog] 99

SVM - kernel trick for the model
Unfortunately, the kernel should be treated as a hyperparameter, and kernels themselves also have hyperparameters that need tweaking (e.g. gamma, the kernel coefficient for the radial and polynomial kernels, which defines how much influence a single training example has: the larger gamma is, the closer other examples must be to be affected).
[source: MIT 6.034 Artificial Intelligence Lecture, Jeremy Jordan blog, Intel Course: Introduction to Machine Learning] 100

SVM - regularization of the model
Our previous solution is called the "hard margin". To obtain a "soft margin" we introduce a slack variable S that expresses the misclassification of the model. A low C makes the decision surface smooth (wide margins), while a high C aims at classifying all training examples correctly (large values of C give you the hard margin: the regularization is cancelled). C is a crucial SVM hyperparameter.
[source: MIT 6.034 Artificial Intelligence Lecture, Jeremy Jordan blog, Baeldung] 101

SVR - Support Vector Regression model tweaks
The larger ϵ is, the larger the errors we admit in our model. ϵ (in addition to C and the kernel) is a key hyperparameter of the Support Vector Regression model.
[source: Saed Sayad, MathWorks] 102

SVM - the landscape of support vector algorithms
Attention! In practice SVM/SVR requires standardization of the features, just like KNN (on both training and testing data); we often use the Z-score. For multi-class problems we use One vs All (k models) or One vs One (k choose 2 = k(k-1)/2 models). The loss used is the hinge loss function.
[source: Aman.AI] 103

SVM - pros and cons
PROS:
● Effective in high-dimensional spaces (especially when the groups are nearly fully separable)
● Still effective in cases where the number of dimensions is greater than the number of samples
● Uses a subset of the training points in the decision function (called support vectors), so it is also memory efficient
● Versatile: different kernel functions can be specified for the decision function
● Partially immune to outliers (when we use the regularization term)
● Low number of hyperparameters to tune
● It is possible to apply class weighting and thus deal with an imbalanced dataset "directly"
CONS:
● The algorithm (in the basic implementation) is very slow for a large number of examples
● SVM has relatively low explainability
● When class separability is low (regardless of the kernel) it is not effective
● If the number of features is much greater than the number of samples, avoiding over-fitting in the choice of kernel function and regularization term is crucial
● SVMs do not directly provide probability estimates; these are calculated using an expensive cross-validation
[source: Scikit-learn] 104
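Below is a minimal sketch (not from the lecture slides) of the pieces discussed above, using scikit-learn: an SVC wrapped in a Pipeline with Z-score standardization, with C, kernel and gamma exposed as hyperparameters, plus an SVR with its epsilon tube. The synthetic datasets and all parameter values are illustrative assumptions.

```python
# Minimal sketch: soft-margin SVM and SVR with standardization in a Pipeline.
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, SVR

# --- classification: soft-margin SVM with an RBF kernel ---
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(
    StandardScaler(),                        # Z-score scaling, fitted on training data only
    SVC(C=1.0, kernel="rbf", gamma="scale"), # low C -> wider margins, high C -> hard margin
)
clf.fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))

# --- regression: SVR with the epsilon-insensitive tube ---
Xr, yr = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
reg = make_pipeline(StandardScaler(), SVR(C=1.0, epsilon=0.5, kernel="rbf"))
reg.fit(Xr, yr)
print("R^2 on training data:", reg.score(Xr, yr))
```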
Comparison of ML algorithms (classifiers)
In this semester we will not learn any more machine learning algorithms (we have of course only managed to cover the most basic, so-called shallow algorithms; in the next course, ML 2, we will focus among other things on the boosting mechanism and neural networks). At the end, it is worth illustrating and analysing the nature of the decision boundaries of the different classifiers (which we already know!). The plots show training points in solid colors and testing points semi-transparent; the lower right of each panel shows the classification accuracy on the test set. As you can see, depending on the dataset and the problem we undertake, different models show better or worse predictive properties. Sometimes it is worth relying on your intuition about the dataset, but it is usually better to test each model on the given problem and compare the results.
[source: Scikit-learn] 105

Labs no. 3 - machine learning modeling with KNN and SVM models
Link to the materials: 106

Chapter 5 - Crucial machine learning techniques (part 2) 107

Regularization
[Recap] P-norm definition: ||x||_p = (Σ_i |x_i|^p)^(1/p).
We came across the concept of regularization when we discussed the bias-variance trade-off and the concepts of underfitting and overfitting. We mentioned that regularization is a tool for dealing with the problem of overfitting, i.e. too much model variance (it allows for better generalization of our model). Being more precise: regularization (a regularization term) is a modification of the loss function that incorporates information about the values of all parameters. The regularized cost function looks as follows, where λ is a parameter which controls the importance of the regularization term: J_reg(θ) = J(θ) + λ·Ω(θ). There are three main types of regularization:
● L1 (Lasso), Ω(θ) = ||θ||_1 - thanks to its properties it is a good feature selection method
● L2 (Ridge), Ω(θ) = ||θ||_2^2 - thanks to its properties it is a good choice when you deal with multicollinearity of your variables (sometimes you have to include a couple of highly correlated variables due to business requirements)
● combination of L1 and L2 (Elastic Net) - in most cases this is the most reasonable approach
Regularization is most popular for linear models; however, more and more machine learning models are introducing this technique.
[source: Stanford CS 229, Wikipedia] 108
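A minimal sketch (illustrative, not from the slides) of the three regularization types on a linear model in scikit-learn; alpha plays the role of λ, and l1_ratio controls the L1/L2 mix in Elastic Net. The dataset and alpha values are arbitrary assumptions.

```python
# Minimal sketch: L2 (Ridge), L1 (Lasso) and Elastic Net regularization.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)                    # L2 penalty: shrinks coefficients
lasso = Lasso(alpha=1.0).fit(X, y)                    # L1 penalty: sets some coefficients to exactly 0
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2

# Lasso's built-in feature selection: count coefficients driven to zero
print("Ridge zero coefs:     ", np.sum(ridge.coef_ == 0))
print("Lasso zero coefs:     ", np.sum(lasso.coef_ == 0))
print("ElasticNet zero coefs:", np.sum(enet.coef_ == 0))
```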
Hyperparameters tuning approaches
We know by now what a hyperparameter is. Let's recall the definition: a hyperparameter is a parameter that is not directly learnt within the estimator. In addition, we know that the cross-validation procedure is helpful (and obligatory) in selecting the best hyperparameters (an appropriate cross-validation scheme should be selected for the specific problem). Besides cross-validation, the hyperparameter tuning process needs the following components: a hyperparameter space (from which space we want to search parameter values); a method for searching or sampling candidates (how we want to search the hyperparameter space); and a score function (by which evaluation function we want to evaluate the models). Combining all the elements together, the process of tuning hyperparameters can be reduced to the following generalized procedure:
1. from the defined space of hyperparameters, select a set of hyperparameters (one value per hyperparameter), according to the selected search method;
2. for the selected set of hyperparameters, apply cross-validation (perform evaluation on the selected score functions);
3. go back to point 1 and keep repeating steps 1-2 until the stop criterion is reached (e.g. a number of iterations);
4. based on all iterations, choose the best set of hyperparameters.
In practice, some algorithms create n sets of hyperparameters at the beginning of the procedure and then test them in cross-validation with parallelization (to speed up computations). We already know how to properly choose the cross-validation scheme and the score function for various problems (in hyperparameter tuning it is recommended to evaluate the results of the procedure with several metrics). It remains for us to define the hyperparameter space and how to search it. The space is defined according to the requirements and properties of a given hyperparameter for a given algorithm (we will search a different space for k in KNN than for C in SVM). In practice, we define it either by specifying a set (e.g. for k in KNN: {1, 5, 10}) or by specifying the distribution from which the parameters will be sampled (e.g. for k in KNN: uniform on [1, 15]). 109

Hyperparameters tuning approaches - cont'd
Now we need to discuss the most popular approaches to searching the hyperparameter space:
1. Grid search - an approach that examines all combinations (an exhaustive approach) of user-defined hyperparameters in the hyperparameter space.
2. Random search - an approach where we randomly select sets of hyperparameters (from a set or a distribution) n times, and therefore only try out some of the possible combinations.
3. Bayesian search - an approach where we build a probabilistic model of the function mapping hyperparameter values to the objective evaluated on a validation set. By iteratively evaluating a promising hyperparameter configuration based on the current model and then updating the model, Bayesian optimization aims to gather observations revealing as much information as possible about this function and, in particular, the location of the optimum. It tries to balance exploration (hyperparameters for which the outcome is most uncertain) and exploitation (hyperparameters expected to be close to the optimum).
There are many discussions about which approach is best. However, as usual, it all depends on the specifics of the problem being solved. The superiority of random search over grid search is very often pointed out (primarily for high-dimensional data). I personally recommend using Bayesian methods (although they are much more complicated than the first two solutions).
[source: Random Search for Hyper-Parameter Optimization, Wikipedia] 110
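A minimal sketch (assumed example, not from the slides) of grid search and random search over an SVM hyperparameter space with cross-validation; the parameter ranges and scoring choice are illustrative. For Bayesian search, one commonly used drop-in is BayesSearchCV from the scikit-optimize package (assuming it is installed; it follows the same fit/best_params_ interface).

```python
# Minimal sketch: grid search vs. random search over C and gamma of an SVC.
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
pipe = make_pipeline(StandardScaler(), SVC())

# Grid search: exhaustively tries every combination of the listed values.
grid = GridSearchCV(
    pipe,
    param_grid={"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]},
    scoring="roc_auc", cv=5,
)
grid.fit(X, y)
print("grid search best:  ", grid.best_params_, round(grid.best_score_, 3))

# Random search: samples n_iter candidates from distributions over the space.
rand = RandomizedSearchCV(
    pipe,
    param_distributions={"svc__C": loguniform(1e-2, 1e2),
                         "svc__gamma": loguniform(1e-3, 1e1)},
    n_iter=20, scoring="roc_auc", cv=5, random_state=0,
)
rand.fit(X, y)
print("random search best:", rand.best_params_, round(rand.best_score_, 3))
```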
Feature selection
After a well-executed feature engineering process we should have a large number of variables in our dataset. We already know that not all models have the ability to select variables themselves (e.g. OLS, KNN, SVM), so we need to choose which of the variables will be important from the point of view of our models. I personally recommend a two-step approach:
1) Univariate/bivariate feature selection:
a) Variance Threshold - removes all features whose variance does not meet some threshold
b) Mutual Information (for classification and regression) - measures the mutual dependence between two variables (closely related to entropy). More specifically, it quantifies the "amount of information" obtained about one random variable by observing the other random variable.
c) F-test / Chi2 test - statistical tests for the dependence between two variables
d) Correlation (Pearson/Spearman/Kendall; in most cases we will prefer Spearman, for instance because it is applicable to ordinal variables!)
2) Multivariate feature selection:
a) Embedded feature selection:
i) Lasso / Elastic Net (we discussed it)
ii) Boruta - a powerful ML approach (with a statistically elegant algorithm) based on feature importance from Random Forest (Polish author!)
iii) SHAP-value feature importance - an explainable ML approach utilized for the feature selection problem
b) Wrapper feature selection (we often use cross-validation here):
i) Backward selection - we start with a full model comprising all available features. In subsequent iterations we remove one feature at a time, always the one whose removal yields the largest gain in a model performance metric, until we reach the desired number of features.
ii) Forward selection - the opposite of backward selection (we start with a model without variables and then add more variables).
iii) Recursive Feature Elimination - very similar to backward selection. It starts with a full model and iteratively eliminates the features one by one. RFE makes its decision based on feature importance extracted from the model. This could be feature weights in linear models, impurity decrease in tree-based models, or permutation importance (which is applicable to any model type).
Most modern algorithms, e.g. Random Forest, XGBoost, CatBoost, LightGBM, have a feature importance mechanism. In practice we will use it heavily!
[source: Neptune.AI] 111
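A minimal sketch (illustrative assumptions throughout) of the two-step approach above in scikit-learn: univariate filters (Variance Threshold, mutual information) followed by a wrapper method (RFE) driven by a linear model's weights.

```python
# Minimal sketch: univariate filters followed by Recursive Feature Elimination.
from sklearn.datasets import make_classification
from sklearn.feature_selection import (RFE, SelectKBest, VarianceThreshold,
                                        mutual_info_classif)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=25, n_informative=5,
                           random_state=0)

# 1a) Variance Threshold: drop features whose variance is below the threshold.
X_vt = VarianceThreshold(threshold=0.0).fit_transform(X)

# 1b) Mutual Information: keep the k features sharing the most information with y.
X_mi = SelectKBest(mutual_info_classif, k=10).fit_transform(X_vt, y)

# 2) Wrapper selection: RFE eliminates features one by one using model weights.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X_mi, y)
print("features kept by RFE:", rfe.support_)
```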
Classes rebalancing
In our course we have discussed the problem of imbalanced classes for classification tasks many times. This problem manifests itself primarily: 1) when we train our machine learning model (the imbalance has a negative impact on the cost function: during learning the algorithm will focus on the majority class rather than on the minority class); 2) when we analyse evaluation metrics, e.g. Accuracy, ROC AUC, etc., which will mislead us through their optimism. Of course, we can deal with this problem by rebalancing the classes in the dataset. We can do this with the following techniques: 1) modification of the cost function (we have covered it already along with the algorithms); 2) undersampling; 3) oversampling; 4) a combination of undersampling and oversampling. Let's discuss these techniques.
Undersampling - in general, it is about reducing the number of observations from the dominant (majority) class. Two categories of algorithms can be distinguished here:
1) Prototype generation - a technique which reduces the number of samples in the targeted classes, but the remaining samples are generated - not selected - from the original set. Example algorithm:
a) Cluster Centroids - makes use of K-means to reduce the number of samples; each class is synthesized using the centroids of the K-means method instead of the original samples.
2) Prototype selection - a technique which selects samples from the original set. Example algorithms:
a) Random undersampler - a fast and easy way to balance the data by randomly selecting a subset of data for the targeted classes (we can, for instance, select a desired ratio of the number of samples in the minority class over the number of samples in the majority class); it also allows bootstrapping the data.
b) Tomek's links - removes unwanted overlap between classes: majority-class samples in Tomek's links (a Tomek's link exists if two samples of different classes are each other's nearest neighbors) are removed until all minimally distanced nearest-neighbor pairs are of the same class.
c) Edited Nearest Neighbours - removes samples of the majority class whose class differs from the class of their nearest neighbors. This sieve can be repeated, which is the principle of the Repeated Edited Nearest Neighbours.
d) And many, many more: reference
[source: Imbalance learn] 112

Classes rebalancing - cont'd
Oversampling - in general, it is about increasing the number of observations from the dominated (minority) class. We can distinguish the following algorithms:
1) Random oversampler - it can be used to repeat some samples and balance the number of samples between the classes (in the simplest case we duplicate observations from the minority class). By default, random over-sampling generates a bootstrap sample.
2) SMOTE (Synthetic Minority Over-sampling Technique) - it creates new synthetic observations of the minority class as convex (~linear) combinations of existing points and their nearest neighbors of the same class. SMOTE has several variants that identify specific samples to consider during the resampling. The borderline version (Borderline-SMOTE) detects which points to select on the border between the two classes. The SVM version (SVM-SMOTE) uses the support vectors found by an SVM algorithm to create new samples, while the KMeans version (KMeans-SMOTE) performs a clustering first and then generates samples in each cluster independently, depending on each cluster's density.
3) ADASYN (Adaptive Synthetic) - this method is similar to SMOTE, but it generates a different number of samples depending on an estimate of the local distribution of the class to be oversampled. ADASYN uses a weighted distribution for different minority-class examples according to their level of difficulty in learning: more synthetic data is generated for minority-class examples that are harder to learn.
Combination of undersampling and oversampling - SMOTE can generate noisy samples by interpolating new points between marginal outliers and inliers. This issue can be solved by cleaning the space resulting from oversampling. In this regard, Tomek's links and Edited Nearest Neighbours are the two cleaning methods that can be added to the pipeline after applying SMOTE oversampling, to obtain a cleaner space. Thanks to this, we can distinguish the following algorithms: SMOTETomek and SMOTEENN.
In business practice, the simplest approach - the random undersampler - is often used. However, it is worth checking how different approaches work in our business problem (check it with cross-validation). Attention: if you use these techniques, be very careful about data leakage - if you are doing cross-validation, you must rebalance each fold of the algorithm separately! Additionally, both SMOTE and Tomek's links only work well on low-dimensional data!
[source: Imbalance learn] 113
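A minimal sketch of per-fold resampling, assuming the imbalanced-learn (imblearn) package cited above. Placing the sampler inside an imblearn Pipeline ensures resampling is applied only to the training part of each cross-validation fold, which avoids the data leakage warned about on the slide. The dataset, class weights and classifier are illustrative.

```python
# Minimal sketch: SMOTE oversampling applied per fold via an imblearn Pipeline.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Imbalanced binary problem: roughly 90% majority class, 10% minority class.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),          # oversample the minority class inside each fold
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print("ROC AUC per fold:", scores.round(3))
```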
Ensemble methods
Let's recall the definition of ensembling: ensemble methods are techniques that aim to improve the performance of our model by combining multiple weak models instead of using a single model. The combined models should increase the accuracy of the results significantly. The standard ensemble strategies are:
1. Bagging (we discussed it alongside Random Forest, which is the most popular bagging algorithm);
2. Stacking - a procedure where a learner (the so-called meta-learner) is trained to combine the individual learners (in other words, we create multiple models whose results become features for the meta-learning model that produces the final result; note: any model can be a meta-learner);
3. Boosting - the idea of combining several weak learners in an iterative way to form a stronger one (we will discuss these techniques and models like AdaBoost, XGBoost, LightGBM and CatBoost in detail in the Machine Learning in Finance II course);
4. Voting - the idea of combining conceptually different machine learning models and returning the average of the predicted values or the majority of the votes.
Adaptive boosting: high weights are put on the errors to improve on them in the next boosting step.
[source: Machine Learning Mastery] 114

Probability calibration
The probability calibration procedure allows us to: 1) better calibrate the probabilities of a given classification model; 2) add support for probability prediction (some models do not return probabilities by default, e.g. SVM). We are interested in this procedure in particular when we really care about correctly estimated probabilities; it is, for example, a popular technique in credit risk. Well calibrated classifiers are probabilistic classifiers for which the output (probability) of the model can be directly interpreted as a confidence level (e.g. if we set our cut-off point to 90%, a good classifier is one where about 90% of the classifications indicated by this cut-off point belong to the correct class). To check whether our classifier has these properties, we can use calibration curves, which present the true frequency of the positive label against its predicted probability. Logistic regression is well calibrated. Naive Bayes is not well calibrated and tends to push probabilities to 0 or 1. Random Forest is not well calibrated and shows peaks at approximately 0.2 and 0.9 probability, while probabilities close to 0 or 1 are very rare. SVM is also not well calibrated and shows a "strong" sigmoid curve.
[source: Scikit-learn] 115

Probability calibration - cont'd
Calibrating a classifier consists of fitting a regressor (called a calibrator) that maps the output of the classifier to a calibrated probability in [0, 1]. Denoting the classifier output for sample i by f_i, we can define the calibrator as an estimate of p(y_i = 1 | f_i). In practice, this procedure is performed within cross-validation, due to the high risk of bias (the same training set should not be used to create both the initial classifier and the calibrated classifier). We can use the Brier score metric to assess the quality of the calibration, but it is not free from defects and caution is recommended. There are two popular calibration regressors (directly applicable only to binary classification, but extensible to the multi-class problem using the One vs Rest approach):
1) sigmoid regressor (Platt scaling): p(y_i = 1 | f_i) = 1 / (1 + exp(A·f_i + B)), where A and B are fitted on the data;
2) isotonic regressor: a non-parametric, non-decreasing function fitted by minimizing Σ_i (y_i − f̂(f_i))².
(end of the 5th lecture)
[source: Scikit-learn] 116
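A minimal sketch (not from the slides) of probability calibration in scikit-learn: an SVC has no predict_proba by default, so CalibratedClassifierCV fits a sigmoid (Platt) or isotonic calibrator on top of it within cross-validation; calibration_curve then gives the points of the calibration plot. The dataset and settings are illustrative.

```python
# Minimal sketch: calibrating an SVM and inspecting its calibration curve.
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the SVC and a sigmoid calibrator inside 5-fold cross-validation.
calibrated = CalibratedClassifierCV(SVC(), method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)
proba = calibrated.predict_proba(X_test)[:, 1]

# Calibration curve: true frequency of the positive class vs. predicted probability.
frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=10)
print(list(zip(mean_pred.round(2), frac_pos.round(2))))
```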
