Full Transcript


Machine learning project workflow - general idea

Most applied machine learning projects are characterized by a specific structure and hierarchy of tasks. This allows for the efficient conduct of the project (of a research and programming nature). The choice of methodology depends, among other things, on: the type of business problem, the organization and the ML/DS team. We will discuss the most popular workflows here.

CRISP-DM: Cross-Industry Standard Process for Data Mining

Model Development Process by DrWhy.AI [source: DrWhy]

Machine learning project workflow - general idea (cont'd)

Data Science cycle [source: Contino]

Machine learning project workflow - MLOps

MLOps is a set of practices that aims to deploy and maintain machine learning models in production reliably and efficiently. The word is a compound of "machine learning" and the continuous development practice of DevOps in the software field. Machine learning models are tested and developed in isolated experimental systems. When an algorithm is ready to be launched, MLOps is practiced between Data Scientists, DevOps, and Machine Learning engineers to transition the algorithm to production systems. Similar to the DevOps or DataOps approaches, MLOps seeks to increase automation and improve the quality of production models, while also focusing on business and regulatory requirements. [source: Wikipedia, Medium, Arrikt]

Labs no. 4 - Case study #1

Link to the materials:

Labs no. 5 - Case study #2

Link to the materials: https://github.com/michaelwozniak/ML-in-Finance-I-case-study-forecasting-tax-avoidance-rates

Chapter 6: Extra materials

Decision Trees - general information

Decision Trees are the basic non-parametric supervised learning model used for both classification and regression problems. The main idea of a Decision Tree is to learn and infer simple decision rules (conditional statements) from the training set. You can equate these decision rules with the logic of human thinking when we make predictions, e.g. if the swallows fly low, it will rain (if-else rules). You can see a tree as a piecewise constant approximation. It is worth knowing that "decision trees" is a collective name for many tree algorithms: ID3, C4.5, C5.0 and CART (in chronological order of origin). As a rule, the general idea of how these algorithms work stays the same, while over the course of their evolution they gained more and more features, e.g. C4.5 allows the use of continuous variables as explanatory variables (as opposed to the basic ID3), CART (Classification and Regression Trees) allows you to forecast both continuous and categorical variables, etc. We will discuss the idea of tree models using the simplest ID3 (Iterative Dichotomiser 3) models (Amazon materials) and derive the mathematical formulation of the more recent CART models (Scikit-learn materials). In practice, we rarely use single decision trees. However, they are an important starting point for more complex and efficient algorithms (bagging- and boosting-based models), so you should know them very well!

Decision Trees - external materials

We use a Machine Learning University (MLU)-Explain course created by Amazon to present the concept of decision trees. The course is made available under the Attribution-ShareAlike 4.0 International licence (CC BY-SA 4.0). Thanks to numerous visualisations, the course allows many theoretical concepts to be discussed very quickly.

Decision Trees by MLU-EXPLAIN
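To make the if-else idea concrete, here is a minimal sketch (not part of the course materials) that fits a small CART-style tree with scikit-learn on the iris toy dataset and prints the learned decision rules; the dataset and parameter values are illustrative assumptions.

```python
# Illustrative sketch: a shallow decision tree and its learned if-else rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X, y = data.data, data.target

# A shallow tree (max_depth=3) stays readable as a handful of nested rules
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X, y)

# export_text prints the piecewise-constant structure as nested if-else conditions
print(export_text(tree, feature_names=data.feature_names))
```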
Decision Trees - additional materials (by Scikit-learn)

CART (Classification and Regression Trees) - mathematical formulation of training [source: Scikit-learn]

Decision Trees - additional materials (by Scikit-learn)

CART - classification criteria [source: Scikit-learn]

Decision Trees - additional materials (by Scikit-learn)

CART - regression criteria

Classification and regression trees - building process

Additionally, we highly recommend two YouTube tutorials that illustrate the tree-building methodology well: Regression Trees, Clearly Explained!!! and Decision and Classification Trees, Clearly Explained!!!. They are perfect for building your intuition if the Amazon and Scikit-learn materials are not enough for you. [source: Scikit-learn]

Decision Trees - key hyperparameters

The five key hyperparameters for the Decision Tree model are:
● maximum depth of the tree - determines how deep the tree can be (the most important hyperparameter, which can easily lead to overfitting); if it is not set, nodes are expanded until all leaves are pure (it is recommended to start with a max depth equal to 3 and check the performance of the tree with additional levels)
● minimum number of samples required to split an internal node - prevents further splits if there are not enough observations to perform a split at the current tree level (a small number will lead to overfitting, whereas a large number will prevent the tree from learning the data - high bias)
● minimum number of samples required to be at a leaf node - prevents further splits if there would not be enough observations at the last level of the tree (for classification with few classes, setting this hyperparameter to 1 is often the best choice)
● splitting criterion - the function used to measure the quality of a split (for classification e.g. Entropy, and for regression e.g. Mean Squared Error)
● number of features to consider when looking for the best split - the lower it is, the greater the reduction of variance, but also the greater the increase in bias
Of course, we should tune these and other hyperparameters in the cross-validation procedure, as in the sketch below. [source: Scikit-learn]
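A minimal sketch (not from the slides) of how these hyperparameters might be tuned with scikit-learn's GridSearchCV; the dataset and the grid values are illustrative assumptions, not recommendations from the course.

```python
# Illustrative sketch: tuning the key decision-tree hyperparameters via cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "max_depth": [3, 4, 5, None],          # maximum depth of the tree
    "min_samples_split": [2, 10, 50],      # min samples to split an internal node
    "min_samples_leaf": [1, 5, 20],        # min samples required at a leaf node
    "criterion": ["gini", "entropy"],      # splitting criterion (classification)
    "max_features": [None, "sqrt"],        # features considered per split
}

search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
print(round(search.best_score_, 3))
```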
(end of the 4th lecture)

Decision Trees - pros and cons

PROS
● Trees are very intuitive in construction and therefore in interpretation (we can easily visualize them) - very often, decision trees make inferences similar to humans (they are humanly more logical)
● Decision trees can handle both continuous and categorical variables without the need to create dummy variables (they require little preprocessing: often there is no need to normalize, scale, impute missing values, etc.)
● Trees are non-parametric; there is no assumption regarding distributions
● Feature selection for decision trees happens automatically (unimportant features do not influence the model result) and we can calculate the importance of the features
● They are relatively easy (quick) to train
● They have a pruning mechanism that reduces their complexity

CONS
● Decision trees like to overfit - they tend to get stuck in local minima (the tree creates "sharp" divisions)
● A small perturbation of the training dataset causes a large change in the structure of the decision tree, therefore this method is unstable
● Trees do not work well in the case of an imbalanced dataset (however, some implementations of the algorithm provide for appropriate weightings to avoid this problem)
● Decision trees can be affected by outliers
● Tree complexity might be very high in the case of regression problems
[source: Scikit-learn]

(start of the 5th lecture)

Random Forest - general information

First, let's define the concept of ensembling. Ensemble methods are techniques that aim to improve the performance of our model by combining multiple weak models instead of using a single model. The combined models should increase the accuracy of the results significantly. There are many ensembling techniques; for now let's focus just on bagging. The bagging (Bootstrap Aggregating) ensemble method builds several instances of an estimator on random subsets of the original training set, drawn with replacement (bootstrapping), and then aggregates their individual predictions to form a final prediction. This reduces the variance of the whole model. Random Forest is a non-parametric supervised machine learning model based on the idea of bagging. Its author is Leo Breiman. As the name suggests, this algorithm combines many individual trees into one forest. The randomness of the forest comes from two techniques: bagging (training a collection of decision trees on random subsets of the training data) and feature bagging (at the level of a single tree, at each sampling from the population we also sample a subset of features from the overall feature space). Both techniques reduce the variance of the final model (by adding randomness to the weak models) and improve its performance (feature bagging allows for a significant reduction in the correlation between trees).

Random Forest (bagging idea) - external materials

We use a Machine Learning University (MLU)-Explain course created by Amazon to present the concept of Random Forest. The course is made available under the Attribution-ShareAlike 4.0 International licence (CC BY-SA 4.0). Thanks to numerous visualisations, the course allows many theoretical concepts to be discussed very quickly.

Random Forest by MLU-EXPLAIN

Random Forest - extra materials

Out-of-bag error/score

Out-of-bag (OOB) error, also called the out-of-bag estimate, is a method of measuring the prediction error of Random Forests. Bagging uses subsampling with replacement to create the training samples for the model to learn from. The OOB error is the mean prediction error on each training sample x_i, using only the trees that did not have x_i in their bootstrap sample. OOB can be used to evaluate the RF model instead of cross-validation (rather rare, but worth knowing).

Extremely Randomized Trees

Extremely Randomized Trees is a model that adds an extra layer of randomness to Random Forest. As in RF, a random subset of candidate features is used, but instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly-generated thresholds is picked as the splitting rule (there is still optimization!). This usually allows the variance of the model to be reduced a bit more, at the expense of a slightly greater increase in bias. In terms of computational cost, and therefore execution time, the Extra Trees algorithm is of course faster.
[source: Wikipedia, Scikit-learn]
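As a quick illustration (not from the slides), the sketch below compares the OOB score of a Random Forest and an Extremely Randomized Trees model in scikit-learn; the dataset and the number of trees are illustrative assumptions. Note that ExtraTreesClassifier needs bootstrap=True for an OOB score, since by default it grows each tree on the whole dataset.

```python
# Illustrative sketch: out-of-bag evaluation for Random Forest and Extra Trees.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True scores each sample using only the trees that did not see it
rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=42)
rf.fit(X, y)
print("Random Forest OOB score:", round(rf.oob_score_, 3))

# Extra Trees draws split thresholds at random; bootstrap=True enables OOB scoring
et = ExtraTreesClassifier(n_estimators=300, bootstrap=True, oob_score=True,
                          random_state=42)
et.fit(X, y)
print("Extra Trees OOB score:", round(et.oob_score_, 3))
```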
Random Forest - key hyperparameters

Random Forest has many important hyperparameters. We've already covered some of them (see the key hyperparameters for Decision Trees). These are the additional key hyperparameters:
● number of trees in the forest - generally the larger the better (but the computation cost will also increase). Please note that the results stop getting significantly better beyond a critical number of trees (the additional trees do not add any value to the model, they only increase the computation time)!
● number of features to consider when looking for the best split - the lower it is, the greater the reduction of variance, but also the greater the increase in bias (rule of thumb: 100% of the features is a good starting point for a regression problem and sqrt(number of features) is a nice starting point for classification)
● whether bootstrap samples are used when building trees - if not, the whole dataset is used to build each tree (we should nearly always use the bootstrap approach in Random Forest)
● whether parallelization is used - parallelizing the calculations in the case of Random Forest estimation is a trivial task (if we have many processor cores available, we can significantly speed up the learning process)
Of course, we should tune these and other hyperparameters in the cross-validation procedure, as in the sketch below. [source: Scikit-learn]
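A minimal sketch (not from the slides) putting these hyperparameters together in scikit-learn; the concrete values are illustrative rules of thumb, not results from the course.

```python
# Illustrative sketch: the key Random Forest hyperparameters discussed above.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=500,     # number of trees: more is better, up to a plateau
    max_features="sqrt",  # rule of thumb for classification (use 1.0 for regression)
    bootstrap=True,       # build each tree on a bootstrap sample
    n_jobs=-1,            # trivial parallelization across all available cores
    random_state=42,
)
print("CV accuracy:", round(cross_val_score(rf, X, y, cv=5).mean(), 3))
```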
