Machine Learning 1 Classification Methods Lectures PDF

Document Details


University of Warsaw

2024

Szymon Lis, Michał Woźniak

Tags

Machine Learning, classification methods, supervised learning, machine learning algorithms

Summary

This document is an introduction to Machine Learning 1, focusing on classification methods. It covers supervised and unsupervised learning techniques, theoretical foundations, and practical programming skills. The course is aimed at students with prior knowledge of linear algebra, calculus, statistics, and probability theory, and basic Python programming.

Full Transcript


Machine Learning 1: classification methods [2400-DS1ML1] Spring 2024

Chapter 0 Introduction to the course

2 Meet your lecturers Szymon Lis Michał Woźniak E-mail: [email protected] E-mail: [email protected] LinkedIn: www.linkedin.com/in/mjwozniak Scholar profiles: https://linktr.ee/michalwozniak

3 High-level course goals Students have reliable and structured knowledge on a wide range of supervised machine learning algorithms for regression and classification problems: theoretical foundations of machine learning algorithms, and practical programming skills to apply machine learning algorithms. Students are able to select the predictive modeling algorithms that are best suited to a specific research problem and perform an independent research project using the methods learned.

4 Course prerequisites We require you to know: 1. linear algebra, calculus, statistics, and probability theory well (recommended: read and understand Deisenroth, P., Faisal, A., Ong, S. (2020). Mathematics for Machine Learning. Cambridge University Press); 2. at least basic Python programming skills (recommended: 2.1. read and understand Matthes, E. (2019). Python Crash Course: A Hands-On, Project-Based Introduction to Programming. No Starch Press, or 2.2. do the Programiz Python Programming Course: https://www.programiz.com/python-programming).

5 Course bibliography James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning. Springer, New York, NY (official online access: https://www.statlearning.com/). Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer-Verlag. Harrington, P. (2012). Machine Learning in Action (Vol. 5). Greenwich, CT: Manning. Intel (2018). Introduction to Machine Learning. Retrieved from https://www.intel.com/content/www/us/en/developer/learn/course-machine-learning.html. VanderPlas, J. (2016). Python Data Science Handbook: Essential Tools for Working with Data. O'Reilly Media, Inc.

6 Course agenda Lectures: 1. Introduction to Machine Learning 2. Crucial Machine Learning techniques (part 1) 3. Assessing model accuracy, machine learning diagnostics 4. Basic Supervised Learning models 5. Crucial Machine Learning techniques (part 2). Labs: 1. Introduction to exploratory data analysis, data wrangling, data engineering and modeling using econometric models 2. Machine learning diagnostics with different evaluation metrics and dataset splits 3. Machine learning modeling with KNN and SVM models 4. Case study 1 5. Case study 2

7 Course credit regulations The final grade consists of two elements. The first one is the theoretical exam. The second is to prepare two machine learning projects (in pairs) and create a presentation about each of them. The following weights are used to determine the final grade (max 100 pts): 40 pts - mid-term theoretical exam; 30 pts for each of the 2 projects, including: 10 pts for the in-class presentation, 10 pts for the presentation contents, 10 pts for model performance in the competition (out-of-sample test). Detailed information about the projects will be provided within one month.
8 What we expect / What to expect This is a challenging course (a lot of knowledge in a short time). The course requires at least several hours of study per week. Systematic studying is required to learn the material. Active participation in classes is recommended. The most important thing is to have the willingness to learn (fast).

9 Lecture materials http://tinyurl.com/ml2024spring

10 start of the 1st lecture

Chapter 1 Introduction to Machine Learning

11 What is Machine Learning? Machine learning (ML) is the process of using mathematical models of data to help a computer learn without direct instruction. It is considered a subset of artificial intelligence (AI). Machine learning uses algorithms to identify patterns within data, and those patterns are then used to create a data model that can make predictions. With increased data and experience, the results of machine learning are more accurate - much like how humans improve with more practice. The adaptability of machine learning makes it a great choice in scenarios where the data is always changing, the nature of the request or task is always shifting, or coding a solution would be effectively impossible. [source: Microsoft Azure]

12 Types of Machine Learning Supervised learning, Unsupervised learning, Semi-supervised learning, Reinforcement learning [source: IBM Developer]

13 Supervised learning Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value. Worth noting, econometrics is a subset of the supervised learning family. [source: Wikipedia] In other words: given a set of data points {x1, …, xn} associated with a set of outcomes {y1, …, yn}, we build a model that learns to predict y from x. Types of supervised learning: Regression - the outcome is continuous; Classification - the outcome is a category. [source: Intel]

14 Unsupervised learning Unsupervised learning is a type of algorithm that learns patterns from untagged data. Since the examples given to the learner are unlabeled, there is no evaluation of the accuracy of the structure that is output by the relevant algorithm - which is one way of distinguishing unsupervised learning from supervised learning and reinforcement learning. [source: Wikipedia] In other words: given a set of data points {x1, …, xn}, we look for hidden patterns in the data. Types of unsupervised learning: Clustering - grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other clusters. Dimension reduction - transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data. Association rules - identification of strong rules discovered in databases using some measures of interestingness.

15 Semi-supervised learning Semi-supervised learning is an approach to machine learning that combines a small amount of labeled data with a large amount of unlabeled data during training. In such an approach, algorithms make use of this additional unlabeled data to better capture the shape of the underlying data distribution and generalize better to new samples.
Semi-supervised learning falls between unsupervised learning (with no labeled training data) and supervised learning (with only labeled training data). It is a special instance of weak supervision. [source: Wikipedia & scikit-learn]

Reinforcement learning Reinforcement learning is the training of machine learning models to make a sequence of decisions. The agent learns to achieve a goal in an uncertain, potentially complex environment. In reinforcement learning, an artificial intelligence faces a game-like situation. The computer employs trial and error to come up with a solution to the problem. To get the machine to do what the programmer wants, the artificial intelligence gets either rewards or penalties for the actions it performs. Its goal is to maximize the total reward. [source: deepsense.ai]

16 Data & AI world map [figure]

17 Machine Learning glossary y, target, dependent variable, endogenous variable, output variable, regressand, response variable - the variable predicted by the algorithm. x, feature, independent variable, exogenous variable, explanatory variable, predictor, regressor - a variable used to predict the target variable. example, entity, row - a single data point within the data (one row in the dataset). label - the target value for a single data point. Example dataset (client_id | age | education | income | default): 1 | 25 | bachelor | 25k USD | 0; 2 | 45 | doctorate | 40k USD | 0; 3 | 52 | master | 70k USD | 1 - here age, education and income are feature columns, default is the target column, each row is an example, and the value of default for a single row is its label.

18 Machine Learning as a function The fundamental assumption of machine learning is as follows: there is a function that represents a causal relationship between features X and target Y, and it can be written in very general form as: Y = f(X) + ϵ, where f is some fixed but unknown function of X, and ϵ is a random error term, which is independent of X and has mean zero. The goal of Machine Learning is to "find" the function f, where by "find" is meant the set of methods that estimate this function (we need to approximate this function, as its actual form is unobservable).

19 Machine learning estimation idea In general, we can define the estimation process as: Ŷ = f̂(X), where Ŷ is the prediction of our estimator for the target variable and f̂(X) is our estimate of the function f(X). The estimator approximates reality imperfectly, so it generates a prediction error, which is equal to Y - Ŷ. The size of this error reflects the quality of the model (in general, the smaller the error the better). Importantly, for a number of reasons, part of the error is reducible (the bias part) and part is irreducible (e.g. due to omitted variables): E[(Y - Ŷ)²] = [f(X) - f̂(X)]² + Var(ϵ), where the first term is the reducible error and the second the irreducible error.

20 Machine learning estimation approaches A great many approaches can be used to estimate our function f(X). Thus, the primary division of supervised machine learning methods is as follows: parametric algorithms (for instance econometrics) - known functional form, known distribution of random variables, finite number of parameters; nonparametric algorithms - unknown functional form (lack of a priori assumptions), infinite number of parameters; semi-parametric algorithms - theoretically an infinite number of parameters, but in practice we estimate part of them. Both parametric and non-parametric methods have their advantages and disadvantages (trade-offs for parametric approaches: simplicity vs constraints, speed vs limited complexity, less data required vs potentially poor fit).
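To make the parametric vs non-parametric distinction concrete, here is a minimal sketch (not from the slides; the data and model choices are purely illustrative) fitting one estimator of each kind to the same synthetic data generated as Y = f(X) + ϵ:

```python
# Illustrative sketch: a parametric estimator (linear regression, fixed functional
# form) vs a non-parametric one (k-nearest neighbours) on the same synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)  # Y = f(X) + eps, f unknown to us

parametric = LinearRegression().fit(X, y)            # assumes y ~ b0 + b1*x (2 parameters)
nonparametric = KNeighborsRegressor(n_neighbors=10).fit(X, y)  # no assumed functional form

print("linear model R^2:", parametric.score(X, y))
print("KNN model    R^2:", nonparametric.score(X, y))
```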
21 Training a Machine Learning model - error minimization Regardless of the estimation approach chosen for f(X), we are always keen for the forecast error to be as small as possible with the currently estimated parameters. Therefore, it is necessary to define and then optimise a function that expresses how "wrong" the model is. First of all we define the loss function (L) - usually a function which measures the error between a single prediction and the corresponding actual value, for instance the squared error loss L(y_i, ŷ_i) = (y_i - ŷ_i)². Based on that we can define a more general object, which is the cost function (J) - usually a function which measures the error between predictions and their actual values across the whole dataset. It might be a sum of loss functions over your training set plus some model complexity penalty (regularization), for instance J = Σ_i L(y_i, ŷ_i) + (regularization term). Model training is about minimizing the cost function!

22 Training a Machine Learning model - cost function properties The cost function directly influences our estimator f̂(X). Thus, when we choose this function, we should (possibly) ensure that our estimator is unbiased (its expected value equals the true value being estimated) and efficient: the estimator with the smallest variance. In the best situation, we obtain a minimum-variance unbiased estimator (MVUE). In addition, due to optimisation algorithms (based on differentiation), the cost function should be convex, and it is good if it is smooth (continuous and differentiable). Last but not least, it is always important to consider whether our cost function reflects the real cost of prediction errors in the context of the research/modelling objective. It is worth considering whether it is more costly to overestimate or underestimate our problem (asymmetry) (e.g. whether it is better to employ more people in the shop for the Christmas peak, or whether it is better not to overestimate this number).

23 Training a Machine Learning model - the idea of gradient descent Once we have defined the cost function, we can generally take its derivative with respect to the parameters (weights), set it to zero, and solve for the parameters to get the perfect global solution (FOC) - for instance this works for OLS. However, for most functions this is impossible! Therefore, we have to use an alternative (local) optimisation method, which is the gradient descent algorithm. The general idea of gradient descent is as follows: we consider the surface created by the objective function (we don't know what it looks like in general); we follow the direction of the slope of this function downhill until we reach a valley. [source: PaperspaceBlog]

24 Training a Machine Learning model - gradient descent formally First of all, let's recall the simplified definition of a gradient. The gradient is a vector whose coordinates consist of the partial derivatives of the objective function with respect to the parameters: ∇J(θ) = (∂J/∂θ_1, …, ∂J/∂θ_k). The gradient vector can be interpreted as the "direction and rate of fastest increase". Now we define the gradient descent optimization algorithm. Gradient descent is a way to minimize an objective function J(θ) parameterized by a model's parameters θ by updating the parameters in the opposite direction of the gradient of the objective function w.r.t. the parameters. Additionally, we have to define the learning rate η, which determines the size of the steps we take to reach a (local) minimum. Vanilla gradient descent algorithm: 1. Start with an initial random guess of the parameters θ. 2. Generate a new guess by moving in the negative gradient direction (gradient computed on the entire training dataset): θ := θ - η·∇J(θ). 3. Repeat step 2 to successively refine the guess and stop when a convergence criterion is reached. [sources: Sebastian Ruder blog, Stanford CS 229]
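As a complement, here is a minimal NumPy sketch of the vanilla (batch) gradient descent loop described above; the data, the learning rate value and the fixed number of steps are illustrative assumptions, not taken from the slides:

```python
# Vanilla (batch) gradient descent for the MSE cost of a linear model.
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]   # design matrix with intercept column
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta + rng.normal(scale=0.1, size=100)

def grad_J(theta, X, y):
    # Gradient of J(theta) = (1/n) * sum((X theta - y)^2) with respect to theta
    n = len(y)
    return 2.0 / n * X.T @ (X @ theta - y)

theta = rng.normal(size=3)      # 1. start with an initial random guess
eta = 0.1                       # learning rate (step size)
for _ in range(500):            # 2.-3. repeat the update (here: a fixed number of steps)
    theta -= eta * grad_J(theta, X, y)   # move against the gradient

print(theta)                    # should be close to true_theta
```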
25 Training a Machine Learning model - gradient descent versions There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function. Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update. [source: Sebastian Ruder blog]

26 General purpose of the estimation When we do research we have to answer the question: 1) are we interested in the best possible prediction, 2) in the best possible understanding of the relationship between our features and target (inference), or 3) are we interested in both issues at the same time? Depending on the answer (business environment, research problem, etc.), we will decide on the choice of estimator, e.g. parametric or non-parametric, a fully explainable model or a black-box model (a system which can be viewed in terms of its inputs and outputs without any knowledge of its internal workings), etc. Note that a more complex model will not always be better than a simple model (e.g. some problems are purely linear and non-parametric methods may search for artificially complex and spurious relationships). Before starting experiments, it is important to have a good understanding of the problem being undertaken!

27 Types of variables There are different types of variables in statistics and machine learning. The most important ones are highlighted in the illustration below. [source: K2 Analytics]

28 Linear regression - general information Linear regression is a basic supervised learning algorithm for predicting continuous variables from a set of independent variables. From an econometric point of view, linear regression is primarily used for inference (much less frequently for prediction). In this course we look at linear regression from the machine learning perspective, i.e. we are mostly interested in prediction. To get a good understanding of linear regression in economic applications, a separate course is generally devoted to it. We don't have time for that, so we will discuss its key elements from an ML perspective. At the same time, we recommend a very good course teaching the principles of linear regression (chapters 3 and 4). Importantly, linear regression can be estimated in a number of ways: ordinary least squares (OLS), weighted least squares (WLS), generalised least squares (GLS). We will focus on the most popular of these - OLS.

29 Linear regression - external materials We use a Machine Learning University (MLU)-Explain course created by Amazon to present the concepts, assumptions, mathematical foundations, and interpretation of linear regression. The course is made available under the Attribution-ShareAlike 4.0 International licence (CC BY-SA 4.0). Thanks to numerous visualisations, the course allows many theoretical concepts to be discussed very quickly. Linear regression by MLU-EXPLAIN

30 Linear regression - additional materials Matrix notation of the linear regression equation: y = Xβ + ε. [source: Practical Econometrics and Data Science] Adjusted R squared Adjusted R² is a corrected goodness-of-fit (model accuracy) measure for linear models. It identifies the percentage of variance in the target field that is explained by the inputs. R² tends to optimistically estimate the fit of the linear regression. It always increases as more effects are included in the model.
Adjusted R² attempts to correct for this overestimation. Adjusted R² might decrease if a specific effect does not improve the model. Adjusted R² is always less than or equal to R². A value of 1 indicates a model that perfectly predicts values in the target field. A value that is less than or equal to 0 indicates a model that has no predictive value. If we assume that p is the total number of explanatory variables in the model, and n is the sample size, then the adjusted R² is equal to: Adjusted R² = 1 - (1 - R²)·(n - 1)/(n - p - 1). [source: IBM]

31 Linear regression - additional materials OLS - Closed-Form Solution extension: β̂ = (XᵀX)⁻¹Xᵀy. [source: Practical Econometrics and Data Science] OLS - regression output analysis: R² and Adjusted R²; P-value of the F-statistic (interpretation: a value below the significance level, e.g. 5%, means that our model is well specified - it is better than the model without features); values of the model parameters, thus the regression is equal to: y = 5.2 + 0.47*x1 + 0.48*x2 - 0.02*x3; P-value of the t-statistic (interpretation: a value below the significance level, e.g. 5%, means that our variable is significant in the model); some model specification tests. [source: Statsmodels]

32 end of the 1st lecture Linear regression - additional materials Key assumptions in OLS and the BLUE concept [source: Practical Econometrics and Data Science]

33 start of the 2nd lecture Logistic regression - general information Logistic regression is a basic supervised learning algorithm for predicting nominal binary variables (dichotomous variables) from a set of independent variables. As with linear regression, from an econometric point of view, logistic regression is primarily used for inference. However, the interpretation of logistic regression results is much more difficult (we can't interpret logistic regression results directly, thus we use marginal effects and odds). During this course we look at logistic regression from the machine learning perspective, i.e. we are mostly interested in prediction. At the same time, we recommend a very good course teaching the principles of logistic regression from an econometric perspective (chapter 5.2). A natural generalisation of logistic regression to the ability to classify more than two classes is multinomial logistic regression. It is worth knowing that logistic regression is just one selected model representing the entire class of Generalized Linear Models (GLMs). Here you can find more details about GLM and its families.

34 Logistic regression - additional materials Linear regression for the binary classification problem [source: Intel Course: Introduction to Machine Learning]

35 Logistic regression - additional materials Sigmoid function A sigmoid function is a mathematical function having a characteristic "S"-shaped curve or sigmoid curve. Some sigmoid functions compared: [figure]. A common example of a sigmoid function is the logistic function: σ(x) = 1 / (1 + e^(-x)). The logistic function has many useful properties: 1. it maps the solution space to probabilities - the output range is from 0 to 1, 2. it is differentiable - important from the perspective of the optimization problem, 3.
it uses the exponential function - most outputs are "attached" to 0 or 1 (not in the ambiguous middle zone). [source: Wikipedia]

36 Logistic regression - additional materials Logistic regression for the binary classification problem [source: Intel Course: Introduction to Machine Learning]

37 Logistic regression - additional materials The relationship between logistic and linear regression: logistic function p = 1 / (1 + e^(-(β0 + β1·x))); odds ratio p / (1 - p) = e^(β0 + β1·x); log odds (or logit function) ln(p / (1 - p)) = β0 + β1·x, which is linear in the parameters. [source: Intel Course: Introduction to Machine Learning]

38 Logistic regression - additional materials Logistic regression - decision boundary. Logistic regression - cost function: we utilize cross-entropy (log-loss) as the cost function for logistic regression: J = -(1/n)·Σ_i [y_i·log(ŷ_i) + (1 - y_i)·log(1 - ŷ_i)]. [source: Intel Course: Introduction to Machine Learning]

39 Logistic regression - additional materials Multinomial logistic regression - the one vs all approach [source: Intel Course: Introduction to Machine Learning]

40 Logistic regression - additional materials Multinomial logistic regression & the softmax function Let's assume that we have k classes. We can define multinomial logistic regression using the following (softmax) formula: P(y = i | x) = e^(f_i(x)) / Σ_{j=1..k} e^(f_j(x)), where f_i(x) is a linear predictor function (linear regression) used to predict that a given observation has outcome i. The cost function for multinomial logistic regression is a generalization of log-loss to cross-entropy for k > 2. We calculate a separate loss for each class label per observation and sum the result: -Σ_{j=1..k} y_{o,j}·log(p_{o,j}), where y_{o,j} is a binary indicator (0 or 1) of whether class label j is the correct classification for observation o, and p_{o,j} is the predicted probability that observation o is of class j.

41 Logistic regression - external materials We use a Machine Learning University (MLU)-Explain course created by Amazon to present the concepts, mathematical foundations, and interpretation of logistic regression. The course is made available under the Attribution-ShareAlike 4.0 International licence (CC BY-SA 4.0). Thanks to numerous visualisations, the course allows many theoretical concepts to be discussed very quickly. Logistic regression by MLU-EXPLAIN

42 Logistic regression - additional materials Generalized Linear Models GLM is a flexible generalization of ordinary linear regression. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value. [source: Stanford CS 229, Wikipedia]

43 Logistic regression - additional materials Generalized Linear Models - examples (here we utilize a different notation - be careful!) [source: Time series reasoning]

44 Chapter 2 Crucial machine learning techniques (part 1)

45 Problem definition - the problem statement worksheet approach At the beginning of any ML research/consulting project, it is good practice to formulate a problem statement worksheet - a document that formalizes at a basic level the definition of the business task we will be tackling (what, why, how). This document is an excellent initiator (it constitutes a binding document between the parties) and allows you to plan a complete project using nearly any project management technique (for instance Scrum etc.). The masters in preparing such worksheets are consulting firms (e.g., the top 3: McKinsey, BCG, Bain). Let's analyze the worksheet template used by McKinsey: [source: Betty Wu Talk]

46 Dataset preparation steps After collecting the business requirements and designing the project/experiment, the next step is to prepare the data for the project implementation.
We usually distinguish the following elements of such a data preparation process: 1. Defining / selecting the necessary data and their sources 2. Data ingestion - data extraction from source systems 3. Transforming data to a convenient analytical form (preferably homogeneous for all sets) 4. Initial data exploration (e.g. with visualizations) and validation 5. Combining data sets into one (if effective at this stage - depending on the project) 6. Division of the set into {train, validation, test} or {train+validation, test}. Note: all transformations performed on the training set should be applicable later on the test set, e.g. parameters learned on the training set for normalisation should also be used for the same transformation on the test set. 7. Data cleaning, like missing value imputation 8. Feature engineering - generation of new variables 9. Extensive exploratory data analysis (preferably using visualizations and statistics) 10. Initial feature selection process 11. Optional balancing of the target variable classes 12. Efficient saving of the data to a universal format (preferred by models)

47 Exploratory data analysis The exploratory data analysis (EDA) stage is designed to help us build up both a general and a detailed picture of the data we have (EDA is used to summarize its main characteristics). In the first instance, we can perform a relatively simple visual analysis. In the case of tabular data, this involves analysing the data sets stored in a data frame (in the case of a project using images, for example, we would look at images from the set). Furthermore, we should use visualisation techniques (univariate - a single variable, bivariate - two variables, and multivariate - several variables) to better analyse the data. The image on the right shows a 'guide' to visualisations by application. [source: TapClicks]

48 Exploratory data analysis with statistics In addition to visual analysis, it is also useful to use statistical tools to explore properties of and relationships between the data. In the first instance, the use of univariate analysis is recommended. We want to examine what properties single variables have, e.g. using 1) descriptive statistics (frequency table, mean, standard deviation, skewness, kurtosis, quantiles), 2) one-sample tests, 3) tests for autocorrelation and white noise (for time series), etc. Next, we want to check the multivariate statistics. The table on the right shows the most common measures and tests of association between variables of different types (source: Statistics & Exploratory Data Analysis course, Dr Marcin Chlebus, Dr Ewa Cukrowska-Torzewska). In addition, we can use various Unsupervised Learning techniques here, e.g. dimension reduction (principal component analysis), etc.
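For the tabular case, a first statistical EDA pass like the one described above might look as follows (a rough sketch; the file name and the "target" column are placeholders, not taken from the labs):

```python
# Quick univariate and bivariate EDA on a tabular dataset with pandas.
import pandas as pd

df = pd.read_csv("data.csv")                           # placeholder file name

print(df.describe(include="all"))                      # descriptive statistics per column
print(df.isna().mean().sort_values(ascending=False))   # share of missing values per column
print(df["target"].value_counts(normalize=True))       # class balance of the target variable

# Simple bivariate view: correlation of numeric features with a numeric target
print(df.corr(numeric_only=True)["target"].sort_values())
```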
49 Data imputation When working with real data you will always encounter the problem of missing values. There can be many reasons for this, but first and foremost the cause of the missing values can be attributed to: the data does not exist, the data was not captured due to a hardware, software or human error, the data was deleted, etc. The figure on the right shows the most general classification of reasons for missing data. It is always worth checking what the lack of data results from, because perhaps it can be supplemented through an additional data extraction process (without the need to use data mining techniques). We distinguish the following techniques for dealing with the problem of missing values: do nothing (some machine learning algorithms can deal with this problem automatically); remove missing variables/columns (consider it when the variable has more than ~10% missings and is not crucial for the analysis) or examples/rows (avoid if possible, especially if your dataset is small and the missings are not random); fill in (impute) the missing values using: univariate techniques (use only one feature) - for continuous variables: use a constant (like 0), use statistics like mean/median/mode (globally or in the subgroup), use a random value from an assumed distribution; for categorical variables: encode missing as an additional "missing" category, replace with the mode (globally or in the subgroup), replace randomly from the non-missings; for time series variables: use the last or next observed value, use linear/polynomial/spline interpolation; multivariate techniques (use multiple features) - use KNN or another supervised ML algorithm, use Multivariate Imputation by Chained Equations. [source: Kaggle]

50 Feature engineering Feature engineering, i.e. the generation of new variables, is a key stage of modeling (the correctness of this process will determine the quality of the model). It is performed at several moments, while the two most popular are: during the ETL process and after the ETL process. 1) During the ETL process, we focus on so-called analytical engineering (domain knowledge is key here), i.e. we try to transform our sets into a form consumable by our models - most often we use various types of aggregations here using descriptive statistics, e.g. mean/median/quantiles etc. (e.g. being a bank, we want to estimate the customer's credit risk - so we will aggregate his/her history for the selected period into one observation). 2) After the ETL process, the matter is more complicated because we focus on creating additional variables or processing existing ones in order to improve the predictive power of our algorithm (this requires particular creativity - it's a kind of art) - for instance, some algorithms are not able to capture non-linear relationships (OLS/SVM/KNN), so we have to feed them with variables that make it possible. Let's discuss the most popular feature engineering techniques. (Attention! Each of these techniques must be fitted on the train set, and then only applied, via transform/inference, to the test set!) Numeric variable transformations (only the most important): scaling to a range (min-max scaler): z = (x - min(x)) / (max(x) - min(x)) (recommendation: when the feature is more-or-less uniformly distributed across a fixed range); clipping (winsorization): if x > max, then z = max; if x < min, then z = min (recommendation: when the feature contains some extreme outliers); log scaling: z = log(x) (recommendation: when the feature conforms to a power law); z-score (standard scaler): z = (x - u) / s (recommendation: when the feature distribution does not contain extreme outliers); quantile transformer: maps the data to a uniform distribution with values between 0 and 1; power transformer (the Yeo-Johnson transform and the Box-Cox transform): maps data from any distribution to as close to a Gaussian distribution as possible in order to stabilize variance and minimize skewness.
bucketing (discretization of continuous variables) with: 1) equally spaced boundaries, 2) quantile boundaries, 3) bivariate decision tree boundaries, 4) expertly given boundaries - all in all it can help with both overfitting and non-linear modelling; polynomial transformer, spline transformer, rounding, replacing with PCA and any other arithmetic operation. [source: Google]

51 end of the 2nd lecture Feature engineering - cont'd Categorical variable transformations (only the most important): one hot encoding - it transforms each categorical feature with n possible values (n categories or n levels) into n binary features, with one of them 1 and all others 0. Sometimes we have to drop/remove one category (one binary feature) to avoid perfect collinearity in the input matrix in some estimators; in most cases we will drop the most frequent category (for instance, OLS without this dropping will be impossible to compute; however, OLS with regularization like L2 or Lasso works well with such collinearity, and then we should not remove any level). This approach supports aggregating infrequent categories into a single output for each feature. ordinal encoder - it transforms each categorical feature into one new feature of integers. Be careful with this encoding, because passing such a variable directly to the model will impose an order on the model. There are a lot of other super powerful encoders like: BaseN, CatBoost Encoder, Count Encoder, Hashing, Helmert Coding, James-Stein Encoder, Leave One Out, Polynomial Coding, Quantile Encoder, Sum Coding, Summary Encoder, Target Encoder, Weight of Evidence etc. (check out the source for more information). (I personally use one hot most often when doing econometrics, but when doing ML I love the CatBoost Encoder, credit risk bankers love WoE, etc.) Interactions between variables - of course we can look for some interactions between our variables (numeric & numeric, categorical & categorical or numeric & categorical). We can try multiplication, division, subtraction and basically anything math and our imagination allow us to do. Keep in mind that the possibilities for feature engineering are endless. The advantage of machine learning over classical econometrics is that in most cases we are interested in the result itself, not how we arrived at it, so the variables we produce may have low interpretability. In addition, some algorithms (especially those based on decision trees) are able to select the relevant variables themselves and marginalize the non-relevant ones. However, my years of experience show that we should not overdo our creativity - in financial problems, usually the best variables created in feature engineering are those that have a strong business/economic/theoretical basis. However, it is always good to look for happiness in numbers :) [source: Category Encoders]
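To illustrate the rule repeated throughout this chapter - transformations are fitted on the training set only and then reused on the test set - here is a minimal scikit-learn sketch; the file and column names are made-up placeholders:

```python
# Fit preprocessing (scaling + one hot encoding) on train data, reuse it on test data.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("data.csv")                                   # placeholder dataset
X, y = df.drop(columns="target"), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

preprocess = ColumnTransformer([
    ("numeric", StandardScaler(), ["age", "income"]),                       # z-score scaling
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["education"]), # one hot encoding
])

X_train_prep = preprocess.fit_transform(X_train)  # parameters (means, categories) learned here
X_test_prep = preprocess.transform(X_test)        # the same parameters applied, no refitting
```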
52 start of the 3rd lecture Chapter 3 Assessing model accuracy, machine learning diagnostics

53 Evaluation metrics - concept At this point, we have a broad understanding of the cost function and its crucial role in machine learning. We know that the cost function should meet certain properties (e.g. differentiability with respect to the parameters). In practice, this means that we can use only a limited number of functions to train and monitor the quality of our model. However, for the extensive process of evaluating the performance of a model (during training and testing), evaluation metrics have been developed that do not have to comply with restrictive mathematical properties. Evaluation metrics are calculated after the estimator has already been created with the use of a different cost function; thus an evaluation metric does not affect the estimator per se. We distinguish evaluation metrics for the following problems: regression, classification, probabilities.

54 Evaluation metrics - regression In the case of regression we deal with a continuous target. Intuitively, we are looking for metrics which describe the distance between the prediction and the actual value (it's straightforward). The most popular regression metrics are: Mean Square Error (MSE): (1/n)·Σ_i (y_i - ŷ_i)²; Root Mean Square Error (RMSE): √MSE; Mean Absolute Error (MAE): (1/n)·Σ_i |y_i - ŷ_i|; Mean Absolute Percentage Error (MAPE): (1/n)·Σ_i |y_i - ŷ_i| / max(ϵ, |y_i|), where epsilon is a small strictly positive number; Mean Squared Logarithmic Error (MSLE): (1/n)·Σ_i (log(1 + y_i) - log(1 + ŷ_i))²; R² score: 1 - Σ_i (y_i - ŷ_i)² / Σ_i (y_i - ȳ)²; Median Absolute Error (MedAE): median(|y_1 - ŷ_1|, …, |y_n - ŷ_n|); Mean Absolute Scaled Error, Mean Directional Accuracy and many many more… To visualize the error distribution we can use a histogram/KDE plot, and then we are able to get a complete picture of the performance of the regression estimator. [source: Scikit-learn]

55 Evaluation metrics - regression (extra materials) Forecasting Time Series - Evaluation Metrics. (2017). AutoGluon. Access: https://auto.gluon.ai/stable/tutorials/timeseries/forecasting-metrics.html#point-forecast-metrics Hyndman, R. J. (2006). Another look at forecast-accuracy metrics for intermittent demand. Foresight: The International Journal of Applied Forecasting, 4(4), 43-46. Access: https://robjhyndman.com/papers/foresight.pdf

56 Evaluation metrics - regression When choosing an evaluation metric, be very careful and deeply understand the business outcome of your decisions, e.g. for actuals A_t and forecasts F_t: MAPE = (1/n)·Σ_t |A_t - F_t| / |A_t|, while symmetric MAPE (sMAPE) = (1/n)·Σ_t |F_t - A_t| / ((|A_t| + |F_t|)/2). [source: Towards Data Science]

57 Evaluation metrics - classification In the case of a classification problem, it is much more difficult to make a correct assessment of the model. It requires a bit more knowledge and abstract thinking. First of all, let's introduce the confusion matrix (the counts of true positives TP, true negatives TN, false positives FP and false negatives FN). Example: [source: Wikipedia]

58 Evaluation metrics - classification (* not applicable for imbalanced problems) Based on the confusion matrix we can derive the following classification metrics: Accuracy* (how many observations, both positive and negative, were correctly classified): (TP + TN) / (TP + TN + FP + FN); True Positive Rate or Recall or Sensitivity (how many observations out of all positive observations are classified as positive): TP / (TP + FN); True Negative Rate or Specificity (how many observations out of all negative are classified as negative): TN / (TN + FP); Positive Predictive Value or Precision (how many observations predicted as positive are in fact positive): TP / (TP + FP); Negative Predictive Value (how many predictions out of all negative predictions were correct): TN / (TN + FN); False Positive Rate or Type I error: FP / (FP + TN); False Negative Rate or Type II error: FN / (FN + TP); F beta score (a combination of precision and recall in one metric; the more you care about recall over precision, the higher beta you should choose; well suited to the problem of an imbalanced dataset): (1 + β²)·precision·recall / (β²·precision + recall). [source: Neptune.ai blog]

59 Evaluation metrics - classification Based on the confusion matrix we can derive the following classification metrics: Matthews Correlation Coefficient (the correlation between predicted classes and the ground truth; well suited to the problem of an imbalanced dataset): (TP·TN - FP·FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)); and many many more… In the case of binary classification metrics we strongly recommend the following Neptune.ai blogpost: link. They accurately define each evaluation metric with an intuitive interpretation (super useful in a regular business environment). Additionally, they provide very pertinent advice on when to apply a given metric.
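As a quick illustration (a self-contained sketch on synthetic data, not the lab code), these confusion-matrix metrics can be computed with scikit-learn as follows:

```python
# Compute the confusion matrix and the main classification metrics with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, matthews_corrcoef)

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

y_pred = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict(X_test)

print(confusion_matrix(y_test, y_pred))        # rows: actual classes, columns: predicted classes
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
print("MCC      :", matthews_corrcoef(y_test, y_pred))
```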
Of course we can generalize binary classification metrics to multiclass classification metrics. First of all we can plot the confusion matrix, which is self-explanatory. Additionally, for each class (one vs all approach) we can calculate separately: precision, recall, F-beta score etc. and finally average each of them by some aggregation rule (micro, macro and weighted aggregation approaches) (here you can find more details: link). [source: Neptune.ai blog]

60 Accuracy, precision, recall, F1-score - external materials We use a Machine Learning University (MLU)-Explain course created by Amazon to present the concept of the accuracy, precision, recall and F1-score metrics. The course is made available under the Attribution-ShareAlike 4.0 International licence (CC BY-SA 4.0). Thanks to numerous visualisations, the course allows many theoretical concepts to be discussed very quickly. Precision & Recall by MLU-EXPLAIN

61 Evaluation metrics - probabilities (for the classification task) When we use classification algorithms we nearly always want to deal with probabilities. In most cases we can set up models to return probabilities (not the predicted class)! We need to decide where to place the probability cut-off point (after which we assign an observation to a specific class). It's not an easy task. In most cases we start with a 0.5 (50%) cut-off point, but in many cases it might be the wrong value! Thanks to evaluation metrics and plots dedicated to probabilities (for classification) we can make the above decision in a responsible and informed way. We can distinguish here the following metrics: Receiver Operating Characteristic curve (ROC), Precision/recall curve, Lift curve, Gini curve, Area Under the Curve ROC (AUC ROC), Area Under the Curve Precision/recall (AUC PR), Log-loss or Cross entropy or Entropy.

62 Evaluation metrics - probabilities Receiver Operating Characteristic Curve (ROC) The ROC curve allows us to address the tradeoff between the true positive rate (TPR) and the false positive rate (FPR). For every probability cut-off point, we calculate the TPR and FPR and plot them on one chart. At the beginning, when the cut-off point is 1, we classify every observation as "0". Obviously in this situation the FPR is equal to 0. With the decrease of the cut-off point we increase the number of "1"s - the TPR starts to increase. However, our estimator will probably not be perfect, so some of the predicted "1"s are incorrect, hence the increase of the FPR (and decrease of the TNR). Generally, the higher the TPR and the lower the FPR for each threshold, the better, and so classifiers whose curves are closer to the top-left corner are better. As you may notice, the ROC curve is not well suited for imbalanced classification tasks (for more details please read this article). Area Under the Curve ROC (AUC ROC) Additionally, we can calculate the AUC ROC, which gives one summary metric to assess the quality of the model. It takes values from 0.5 (a random classifier) to 1 (a perfect classifier). We should not use it with an imbalanced dataset. It is recommended if you care about true negatives as much as true positives, and you care about ranking predictions. Additionally, this metric can be interpreted as: the probability that a uniformly drawn random positive has a higher score than a uniformly drawn random negative. Notice: the AUC metric treats all classification errors equally, ignoring the potential consequences of one error type relative to another. For example, in cancer detection, we'll probably want to minimize false negatives. [source: Wikipedia]
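A minimal sketch (again on synthetic data, not from the labs) of how the ROC curve points and the AUC ROC are obtained from predicted probabilities:

```python
# Build the ROC curve (one FPR/TPR pair per probability cut-off) and compute AUC ROC.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, proba)   # one (FPR, TPR) pair per cut-off point
print("ROC AUC:", roc_auc_score(y_test, proba))
```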
63 ROC and ROC AUC - external materials We use a Machine Learning University (MLU)-Explain course created by Amazon to present the concept of ROC and ROC AUC. The course is made available under the Attribution-ShareAlike 4.0 International licence (CC BY-SA 4.0). Thanks to numerous visualisations, the course allows many theoretical concepts to be discussed very quickly. ROC & AUC by MLU-EXPLAIN

64 Evaluation metrics - probabilities Precision recall curve (PR curve) The PR curve combines precision and recall in a single visualization. For every probability cut-off, we calculate the PPV and TPR and plot them on the graph. The higher the curve, the better the model. However, we deal here with the classic precision/recall dilemma (the higher the precision, the lower the recall). Area Under the Curve Precision recall (AUC PR) Similarly to the ROC AUC score, we can calculate the Area Under the Precision-Recall Curve to get one representative number for the whole model. We can treat the PR AUC as the average of precision calculated for each recall threshold from 0.0 to 1.0. AUC PR is recommended for highly imbalanced problems and when we communicate the precision/recall decision to our stakeholders (and we additionally suggest where the best possible cut-off point is). [source: Stackoverflow]

65 Evaluation metrics - probabilities (for the regression task) As in the case of classification, in regression we can also use the notion of probability, but in a slightly different sense: we can build regression models that, in addition to the expected value, estimate the confidence intervals of the forecast. We will not discuss this issue in our classes due to its high level of advancement, but it is worth knowing about the existence of the metric: the Continuous Ranked Probability Score (CRPS), which generalizes the MAE to the case of probabilistic forecasts. Link for more details: https://www.lokad.com/continuous-ranked-probability-score.

66 Bias/variance trade-off - concept The bias of a model is the difference between the expected (mean) prediction and the correct value that we try to predict for given data points. The variance of a model is the variability of the model prediction for given data points. MSE decomposition: Error = Bias² + Variance + Noise. Bias/variance tradeoff: the simpler the model, the higher the bias, and the more complex the model, the higher the variance. For the MSE decomposition check slide 19. [source: Stanford CS 229, Scott Fortmann-Roe Essay]

67 Bias/variance trade-off - overfitting and underfitting [source: Stanford CS 229]

68 Bias/variance trade-off - overfitting and underfitting (cont'd) [figure annotations: typical remedies such as reducing the complexity of the model, using bagging, using boosting] These illustrations present learning curves. A learning curve is a plot of model learning performance over experience or time. We distinguish the following learning curves: Train Learning Curve: a learning curve calculated from the training dataset that gives an idea of how well the model is learning. Validation Learning Curve: a learning curve calculated from a hold-out validation dataset that gives an idea of how well the model is generalizing. Optimization Learning Curves: learning curves calculated on the metric by which the parameters of the model are being optimized, e.g. log-loss. Performance Learning Curves: learning curves calculated on the metric by which the model will be evaluated and selected, e.g. AUC ROC.
[source: Stanford CS 229, Machine Learning Mastery]

69 Training, validation and testing sets - concept Generally, learning the parameters of a prediction function and testing it on the same data is a methodological mistake (we can easily overfit our model). In statistics and machine learning, there is a good practice of dividing a data set into three parts that each have a dedicated purpose (for more details: see slide 61). In some classification problems we can encounter imbalanced datasets (e.g. few "1"s and a lot of "0"s). It's super important to use a stratified approach, which allows you to ensure that relative class frequencies are approximately preserved in each train-validation pair. Sometimes a strategy of creating several models independently on the train and validation data is used, and then the one best model is selected on the testing sample. [source: Stanford CS 229]

70 Bias/variance trade-off - external materials We use a Machine Learning University (MLU)-Explain course created by Amazon to present the concept of the bias/variance trade-off. The course is made available under the Attribution-ShareAlike 4.0 International licence (CC BY-SA 4.0). Thanks to numerous visualisations, the course allows many theoretical concepts to be discussed very quickly. Bias/variance trade-off by MLU-EXPLAIN

71 Training, validation and testing sets - external materials We use a Machine Learning University (MLU)-Explain course created by Amazon to present the concept of the train-test-validation dataset split. The course is made available under the Attribution-ShareAlike 4.0 International licence (CC BY-SA 4.0). Thanks to numerous visualisations, the course allows many theoretical concepts to be discussed very quickly. Train-test-validation by MLU-EXPLAIN

72 Cross-validation - concept We can generalize the idea of the training, validation and testing set split to a much more complex and powerful solution, which is Cross-validation (CV). CV is a technique for evaluating a machine learning model and testing its performance. More precisely, CV is a resampling method that uses different portions of the data to validate and train a model on different iterations. This approach is much more robust than a single train-validation split, because we shouldn't treat the value from a single validation as an ideal approximation of the ground truth. There are two crucial reasons for validation/cross-validation usage: assessment of the quality of our model in a quasi-objective way (less probability of overfitting), and "safe" (again, less probability of overfitting) execution of the hyperparameter tuning procedure (a hyperparameter is a parameter whose value is used to control the learning process; thus it is not estimable and the researcher has to specify it by hand, based on intuition or via a hyperparameter searching procedure, where CV is crucial). [source: Wikipedia, Scikit-learn]

73 Cross-validation - different types We can distinguish dozens of types of cross-validation, for instance (we will discuss some of them): Hold-out, K-folds, Leave-one-out (LOO), Leave-p-out, Stratified K-folds, Repeated K-folds, Nested K-folds, Time series CV. The reasons for having multiple types are many: the specifics of the data (e.g. cross-sectional vs. time series data), the specifics of the business problem, the size of the dataset, the imbalance of the dataset, computing resources, the probability of data leakage, etc. In such a view, it is impossible to say which approach is best and should be the only one followed. However, in everyday use k-fold seems the most popular (for cross-sectional problems).
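A minimal scikit-learn sketch of k-fold cross-validation (here the stratified variant, since the synthetic target is imbalanced; the model and parameter choices are illustrative):

```python
# Stratified k-fold cross-validation: k estimates of model quality instead of a single split.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # preserves class frequencies
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")
print(scores.mean(), scores.std())   # average quality and its stability across the folds
```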
74-81 Cross-validation - different types (figures illustrating the individual CV schemes, including the target imbalance problem and time series problems) [source: Neptune.ai blog]

82 Cross-validation - different types (Nested CV) Nested Cross-validation is an extension of the above CVs, but it fixes one of the problems that we have with normal cross-validation. In normal cross-validation you only have a training and testing set, for which you find the best hyperparameters. This may cause information leakage and significant bias. You would not want to estimate the error of your model on the same set of training and testing data that you found the best hyperparameters for. As the image below suggests, we have two loops. The inner loop is basically normal cross-validation with a search function, e.g. random search or grid search. The outer loop, though, only supplies the inner loop with the training dataset, and the test dataset in the outer loop is held back. [source: ML from scratch]

83 end of the 3rd lecture Cross-validation - external materials We use a Machine Learning University (MLU)-Explain course created by Amazon to present the concept of cross-validation. The course is made available under the Attribution-ShareAlike 4.0 International licence (CC BY-SA 4.0). Thanks to numerous visualisations, the course allows many theoretical concepts to be discussed very quickly. Cross Validation by MLU-EXPLAIN

84 Labs no. 1 - introduction to exploratory data analysis, data wrangling, data engineering and modeling using econometric models The data we will be working with can be accessed through the following link: https://www.kaggle.com/competitions/home-credit-default-risk/data. Link to the materials: Link

85 Labs no. 2 - machine learning diagnostics with different evaluation metrics and dataset splits Link to the materials: https://colab.research.google.com/drive/195_9tF4bbkyBqnix4-UqRXMff00tq3ZJ?usp=sharing Attendance list: https://forms.gle/A98RmxiSsDc5QJbK8

86 start of the 4th lecture Chapter 4 Basic Supervised Learning models

87 K-nearest neighbours - general information The K-nearest neighbours (KNN) algorithm is a basic and probably the simplest supervised machine learning algorithm for both classification and regression problems. Behind this algorithm is the following idea of locality: the best prediction for a certain observation is the known target value (label) of the observation from the training set that is most similar to the observation for which we are predicting. The KNN algorithm belongs to the following groups of methods: it is non-parametric (it does not require an assumption about the sample distribution) and instance-based (it does not carry out the learning process directly - it remembers the training set and creates predictions on the basis of it on an ongoing basis). The model does not generate computational costs at the time of learning, while the entire computational cost lies on the side of making the prediction (lazy learning). The regression version differs little from the classification approach.
In the classification approach we use an algorithm to vote for the most popular class among the neighbours, while in the regression problem we use a technique to average the values of the target variable across the neighbours.

88 K-nearest neighbours - general idea and formal algorithm (classification case) KNN classification algorithm (presented on the slide). [source: Intel Course: Introduction to Machine Learning; Application of K-Nearest Neighbor (KNN) Approach for Predicting Economic Events: Theoretical Background]

89 K-nearest neighbours - key hyperparameters The three key hyperparameters for the KNN model are: the distance metric, the number of k neighbours, and the weights of the individual neighbours. Distance metrics allow us to formally define a measure of similarity between observations. Thanks to them we can determine whether two points lying in a multidimensional space are close to each other. In general, there are many ways to measure the distance between two points (X and Y) in space. The most popular of these are: Minkowski p distance: d(X, Y) = (Σ_i |x_i - y_i|^p)^(1/p); Euclidean distance: the Minkowski distance with p = 2, d(X, Y) = √(Σ_i (x_i - y_i)²); Manhattan distance: the Minkowski distance with p = 1, d(X, Y) = Σ_i |x_i - y_i|; Chebyshev distance: the Minkowski distance with p reaching infinity, d(X, Y) = max_i |x_i - y_i|. [source: Wikipedia, Lyfat blog]

90 K-nearest neighbours - key hyperparameters Additionally, we have to determine how many of the k nearest observations we would like to take into account in our computations (this will also significantly affect our decision boundary). There is a rule of thumb that the square root of the number of samples in our training set might be a good choice for k. However, in practice we should look for values smaller than the square root of n, and we use cross-validation for this task. Generally, the higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance (for instance for k = 1, our algorithm will be characterised by overfitting and a large variance). We can observe that easily via the Bias/variance trade-off by MLU-EXPLAIN course page (paragraph dedicated to K-Nearest Neighbors). KNN allows for weighting the neighbours during the final stage of the prediction execution (voting - classification, averaging - regression). In the default algorithm, all points in the neighbourhood are weighted the same. However, weighting by distance can also be introduced. There are many methods for this approach, e.g. weighting by inverse distance - this means that observations that are closer will have a higher impact on the fitted value. [source: Intel Course: Introduction to Machine Learning]
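A short scikit-learn sketch tying the three hyperparameters together (the candidate values, and the scaling step in the pipeline, are illustrative assumptions, not the course's recommendation), tuned with cross-validation via grid search:

```python
# Tune the KNN hyperparameters: number of neighbours k, neighbour weights, distance metric.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = {
    "kneighborsclassifier__n_neighbors": [3, 5, 11, 21],
    "kneighborsclassifier__weights": ["uniform", "distance"],  # equal vs inverse-distance weighting
    "kneighborsclassifier__p": [1, 2],                         # Minkowski p: 1 = Manhattan, 2 = Euclidean
}
search = GridSearchCV(pipe, grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```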
91 K-nearest neighbours - feature scaling (lack of homogeneity of features) Distance metrics, in addition to their many advantages, also introduce a number of problems into KNN. First and foremost, they are absolute in nature, which can very strongly affect the correctness of KNN. A very common situation is that one or more explanatory variables (features) in our dataset are set on a large domain (significantly larger than the rest of the variables) and these features have low predictive power. These variable(s) will strongly influence the distances and dominate the other variables in KNN. Because these variables are weak predictors, they will make our model very ineffective. In order to get rid of the above problem, it is necessary to use the technique of feature scaling (normalization, standardization, etc.). This is a necessary step for the KNN algorithm (it is worth trying numerous techniques on the same variable, to check which one is the best)! The most popular scaling approaches for continuous variables are: standardization (z-score normalization): z = (x - u) / s; rescaling (min-max normalization): z = (x - min(x)) / (max(x) - min(x)); quantile normalization. The most popular standardization approaches for nominal variables are: one hot encoder (with a potential rescaling of 0-1 to another range, for instance 0-2, 0-0.5 etc.); ordinal encoder with further rescaling to 0-1 or another convenient range. [source: Wikipedia]

92 K-nearest neighbours - other important information KNN requires choosing a method to search our stored data for the k nearest neighbors. Brute force searching, which is simply calculating the distance of our query from each point in our dataset, will work fairly well with small datasets, but becomes undesirably slow at larger scales. Tree-based approaches can make the search process more efficient by inferring distances. The two most popular algorithms are the K-D Tree and Ball Tree search algorithms. Additionally, there is a curse of dimensionality problem in KNN. The KNN model makes the assumption that similar points share similar labels. It needs all points to be close along every dimension in the data space. However, each new dimension added makes it harder and harder for two specific points to be close to each other in every dimension. Unfortunately, in high dimensional spaces, points that are drawn from a probability distribution tend to never be close together - "a high-dimensional space is a lonely place". The problem does not occur, for example, in the case of sparse matrices, or in image analysis (strong intragroup correlations translate into significant closeness in all dimensions). A good approach to solving the multidimensionality problem is to create multiple models on subsets of the data (subsets of variables) and then average their results (the ensemble technique - bagging). In addition, bagging will also solve the problem of insignificant features. The KNN model is sensitive to variables with low predictive power. In such a case, variables should be selected in a very reasonable way, i.e. based on expert knowledge, but also using variable selection techniques, e.g. from general to specific or from specific to general, or other feature selection techniques. [source: Jeremy Jordan blog, Towards Data Science]

93 K-nearest neighbours - pros and cons PROS: intuitive and simple; lack of assumptions (non-parametric); no training step; applicability to the problem of classification (binary and multiclass) and regression; small number of hyperparameters; handles specific problems very well (for instance: problems with sparse matrices). CONS: slow algorithm; memory-exhausting algorithm; curse of dimensionality; low accuracy in many cases; need of homogeneous features; not suited for imbalanced problems (directly); lack of missing value treatment; sensitive to the selection of variables and the use of unnecessary variables.

94 Support Vector Machines - general information The Support Vector Machine (SVM) is one of the fundamental non-parametric machine learning algorithms (and one of the most influential of its time). The main author of this model is Professor Vladimir Vapnik (one of the most recognizable researchers in the field of machine learning - interestingly, if we only consider Vapnik's 'key' publications for SVM development, it took more than 40 years from his first paper to his last). The general idea of SVM is as follows: in a multi-dimensional space there exists a hyperplane which separates the classes in an optimal way.
Support Vector Machines - general information The Support Vector Machine (SVM) is one of the fundamental non-parametric machine learning algorithms (and one of the most influential of its time). The main author of this model is Professor Vladimir Vapnik, one of the most recognizable researchers in the field of machine learning (interestingly, if we consider only Vapnik's publications that were 'key' for SVM development, more than 40 years passed from his first paper to his last). The general idea of SVM is as follows: in a multi-dimensional space there exists a hyperplane which separates the classes in an optimal way. The goal of SVM is to find the hyperplane which maximizes the minimum distance (margin) between this hyperplane and the observations from both classes. The idea of the support vector machine was implemented originally for the classification problem, while after some adjustments it is applicable to the regression problem and even to unsupervised learning (e.g. searching for outliers). [source: An Introduction to Statistical Learning] 95 Support Vector Machines - general idea GOAL: create a hyperplane which runs perfectly in the middle between the classes and maximizes the region between them. [figure: several candidate separating lines - some produce misclassifications, others produce no misclassifications, which raises the question of which position is best] [source: Intel Course: Introduction to Machine Learning] 96 SVM - formal definition of the maximal margin ("widest street") approach [figure: the margin, the support vectors and the maximum margin; the distance from the origin to the decision boundary along the vector w is c] What we are interested in is how far a vector x is from the decision boundary. The boundary has an infinite number of points from which this distance could be measured. That is where the standard approach comes in: we use the direction perpendicular to the boundary as a reference. Then we project the data points onto this perpendicular direction and compare their distances. [source: MIT 6.034 Artificial Intelligence Lecture, Jeremy Jordan blog] 97 SVM - decision margins definition (constraint definition) For every point to be correctly classified, this condition must always hold true: y_i (w · x_i + b) ≥ 1. We are making an assumption about our margins: the margin hyperplanes are given by w · x + b = +1 and w · x + b = -1. [source: MIT 6.034 Artificial Intelligence Lecture, Jeremy Jordan blog] 98 SVM - expression of the distance between margins (objective function definition) The shortest distance between the margins is given by the x_plus and x_minus points (the same trick with the dot product): we take the vector (x_plus - x_minus) and then find its projection onto w, the vector perpendicular to the hyperplane. This gives the width of the street, (x_plus - x_minus) · w / ||w|| = 2 / ||w||, so maximizing the margin is equivalent to minimizing ||w||. [source: MIT 6.034 Artificial Intelligence Lecture, Jeremy Jordan blog] 99 SVM - constrained optimization using the Lagrangian In practice we take the final formula of the Lagrangian L and maximize it (a quadratic programming problem) with respect to the multipliers α; as a result, we obtain a corresponding α value for each observation of the training set. [source: MIT 6.034 Artificial Intelligence Lecture, Jeremy Jordan blog] 100 SVM - kernel trick for the model Unfortunately, the kernel should be treated as a hyperparameter, and kernels themselves also have hyperparameters that need tweaking (e.g. gamma, the kernel coefficient for the radial and polynomial kernels, which defines how much influence a single training example has: the larger gamma is, the closer other examples must be in order to be affected). [source: MIT 6.034 Artificial Intelligence Lecture, Jeremy Jordan blog, Intel Course: Introduction to Machine Learning] 101 SVM - regularization of the model Our previous solution is called the "hard margin". To obtain a "soft margin", slack variables are introduced: S is a slack variable and expresses the misclassification (margin violation) of the model. A low C makes the decision surface smooth (wide margins), while a high C aims at classifying all training examples correctly (very large values of C give you the hard margin back, i.e. the regularization is cancelled). C is a crucial SVM hyperparameter. [source: MIT 6.034 Artificial Intelligence Lecture, Jeremy Jordan blog, Baeldung] 102 SVR - Support Vector Regression model tweaks The larger ϵ is, the larger the errors we admit in our model. ϵ (in addition to C and the kernel) is a key hyperparameter of the Support Vector Regression model. [source: Saed Sayad, MathWorks] 103
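To ground the hyperparameters discussed above (kernel, gamma and C for classification, plus ϵ for regression), here is a minimal sketch assuming scikit-learn; the values are illustrative, not recommendations.

```python
# Sketch: SVM classifier and SVR regressor with the key hyperparameters (scikit-learn).
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, SVR

# Soft-margin SVM classifier with an RBF kernel.
svm_clf = make_pipeline(
    StandardScaler(),                        # SVM needs scaled features (see below)
    SVC(kernel="rbf", C=1.0, gamma="scale")  # C: margin hardness, gamma: kernel width
)

# Support Vector Regression: epsilon controls the width of the error-insensitive tube.
svm_reg = make_pipeline(
    StandardScaler(),
    SVR(kernel="rbf", C=1.0, epsilon=0.1)
)

# Usage (hypothetical arrays X, y):
# svm_clf.fit(X, y_class);  svm_reg.fit(X, y_continuous)
```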
SVM - the landscape of support vector algorithms Attention! In practice SVM/SVR requires standardization of the features - just like with KNN (the scaler is fitted on the training set and applied to the test set)! We often use the Z-score. For multiclass problems we use One vs All [k models] or One vs One [k choose 2 = k(k-1)/2 models]. The underlying loss is the hinge loss function, max(0, 1 - y · f(x)). [source: Aman.AI] 104 SVM - pros and cons PROS: effective in high-dimensional spaces (especially when the groups are nearly fully separable); still effective in cases where the number of dimensions is greater than the number of samples; uses a subset of the training points in the decision function (the so-called support vectors), so it is also memory efficient; versatile: different kernel functions can be specified for the decision function; the algorithm is partially immune to outliers (when we use the regularization term); low number of hyperparameters to tune; there is a possibility to apply class weighting and thus deal with an imbalanced dataset "directly". CONS: the algorithm (in the basic implementation) is very slow for a large number of examples; SVM has relatively low explainability; when class separability is low (regardless of the kernel) it is not effective; if the number of features is much greater than the number of samples, avoiding over-fitting in the choice of the kernel function and the regularization term is crucial; SVMs do not directly provide probability estimates - these are calculated using an expensive cross-validation. [source: Scikit-learn] 105 Comparison of ML algorithms (classifiers) This semester we will not cover any more machine learning algorithms (of course, we managed to cover only the most basic, so-called shallow algorithms; in the next course, ML 2, we will focus among others on the boosting mechanism and neural networks). At the end, it is worth illustrating and analysing the nature of the decision boundaries of the different classifiers we already know. The plots show training points in solid colours and testing points semi-transparent; the lower right corner shows the classification accuracy on the test set. As you can see, depending on the dataset and the problem we undertake, different models show better or worse predictive properties. Sometimes it is worth relying on your intuition about the dataset, but it is usually better to test each model on the given problem and compare the results. [source: Scikit-learn] 106 Labs no. 3 - machine learning modeling with KNN and SVM models Links to the materials: KNN: https://colab.research.google.com/drive/11v20E3Y4P5FPw3Dx0A432ftGYJvvBmTF?usp=sharing SVM: https://colab.research.google.com/drive/124St11vGvsMW_C7gJGepGhzU63u3KKww?usp=sharing Data: https://drive.google.com/drive/folders/1XJ2GQe-qi8ka0kO9qlI1p6GJ4_zXjFMu?usp=sharing 107 Chapter 5 Crucial machine learning techniques (part 2) 108 [Recap] P-norm definition: ||x||_p = (Σ_i |x_i|^p)^(1/p) Regularization We came across the concept of regularization when we discussed the bias-variance trade-off and the concepts of underfitting and overfitting. We mentioned that regularization is a tool to deal with the problem of overfitting, i.e. too much model variance (it allows for better generalization of our model). Being more precise: regularization (via a regularization term) is a modification of the loss function that incorporates information about the values of all parameters.
The regularized cost function looks as follows (λ is a parameter which controls the importance of the regularization term): J(θ) = L(θ) + λ R(θ), where R(θ) is the regularization term. There are three main types of regularization: L1 (Lasso), R(θ) = ||θ||_1 - thanks to its properties it is a good feature selection method; L2 (Ridge), R(θ) = ||θ||_2^2 - thanks to its properties it is a good choice when you deal with multicollinearity of your variables (sometimes you have to introduce a couple of highly correlated variables due to business requirements); the combination of L1 and L2 (Elastic Net) - in most cases it is the most reasonable approach. Regularization is the most popular technique for linear models; however, more and more machine learning models are introducing this technique. [source: Stanford CS 229, Wikipedia] 109 Hyperparameters tuning approaches We know so far what a hyperparameter is. Let's recall the definition: a hyperparameter is a parameter that is not directly learnt within the estimator. In addition, we know that the cross-validation procedure is helpful (and obligatory) in selecting the best hyperparameters (an appropriate cross-validation scheme should be selected for the specific problem). In addition to cross-validation, the hyperparameter tuning process needs components such as: a hyperparameter space (from which space we want to search parameter values); a method for searching or sampling candidates (how we want to search the hyperparameter space); a score function (by which evaluation function we want to evaluate the models). By combining all the elements together, the process of tuning hyperparameters can be reduced to the following generalized procedure: 1. from the defined space of hyperparameters, select a set (one value per hyperparameter) according to the selected search method; 2. for the selected set of hyperparameters apply cross-validation (perform evaluation with the selected score functions); 3. go back to point 1 and keep repeating steps 1-2 until the stop criterion is reached (e.g. a number of iterations); 4. based on all iterations, choose the best set of hyperparameters. In practice, some algorithms create n sets of hyperparameters at the beginning of the procedure and then test them in cross-validation in parallel (to speed up computations). We already know how to properly choose the cross-validation scheme and the score function for various problems (in hyperparameter tuning it is recommended to evaluate the results of the procedure using several metrics). It remains for us to define the hyperparameter space and how to search it. The space is defined specifically for the requirements and properties of a given hyperparameter of a given algorithm (we will search a different space for k in KNN than for C in SVM). In practice, we define it either by specifying a set (e.g. for k in KNN: {1, 5, 10}) or by specifying the distribution from which the parameters will be sampled (e.g. for k in KNN: uniform [1, 15]). 110 Hyperparameters tuning approaches - cont'd Now we need to discuss the most popular approaches to searching the hyperparameter space: 1. Grid search - an approach that examines all combinations (exhaustive approach) of user-defined hyperparameters in the hyperparameter space. 2. Random search - an approach where we randomly select sets of hyperparameters (from a set or a distribution) n times and therefore only try out some of the possible combinations. 3. Bayesian search - an approach where we build a probabilistic model of the function mapping hyperparameter values to the objective evaluated on a validation set. By iteratively evaluating a promising hyperparameter configuration based on the current model, and then updating it, Bayesian optimization aims to gather observations revealing as much information as possible about this function and, in particular, the location of the optimum. It tries to balance exploration (hyperparameters for which the outcome is most uncertain) and exploitation (hyperparameters expected to be close to the optimum).
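A minimal sketch of the first two approaches, assuming scikit-learn (GridSearchCV and RandomizedSearchCV); the KNN search space and the scoring choice below are illustrative only.

```python
# Sketch: grid search vs. random search over a KNN hyperparameter space (scikit-learn).
from scipy.stats import randint
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

# Grid search: every combination from an explicitly listed space is evaluated.
grid_search = GridSearchCV(
    knn,
    param_grid={"n_neighbors": [1, 5, 10], "weights": ["uniform", "distance"]},
    scoring="f1", cv=5,
)

# Random search: n_iter candidates are sampled from sets or distributions.
random_search = RandomizedSearchCV(
    knn,
    param_distributions={"n_neighbors": randint(1, 16),
                         "weights": ["uniform", "distance"]},
    n_iter=10, scoring="f1", cv=5, random_state=42,
)

# Usage (hypothetical X, y): grid_search.fit(X, y); print(grid_search.best_params_)
```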
There are many discussions about which approach is best. However, as usually happens, it all depends on the specifics of the problem being solved. The superiority of random search over grid search is indicated very often (primarily for high-dimensional spaces). I personally recommend using Bayesian methods (although they are more complicated than the first two solutions). [source: Random Search for Hyper-Parameter Optimization, Wikipedia] 111 Feature selection After a well-executed feature engineering process, we should have a large number of variables in our dataset. We already know that not all models have the ability to select variables themselves (e.g. OLS, KNN, SVM), so now we need to choose which of the variables will be important from the point of view of our models. I personally recommend a two-step approach: 1) Univariate/bivariate feature selection: a) Variance Threshold - it removes all features whose variance does not meet some threshold; b) Mutual Information (for classification and regression) - it measures the mutual dependence between two variables (with a close relationship to entropy); more specifically, it quantifies the "amount of information" obtained about one random variable by observing the other random variable; c) F-test / Chi2 test - a statistical test for dependence between two variables; d) Correlation (Pearson/Spearman/Kendall; in most cases we will prefer Spearman, for instance because it is applicable to ordinal variables!). 2) Multivariate feature selection: a) Embedded feature selection: i) Lasso / Elastic Net (we discussed it); ii) Boruta - a powerful ML approach (with a statistically elegant algorithm) based on feature importance from Random Forest (Polish author!); iii) SHAP values feature importance - an explainable-ML approach utilized in the feature selection problem. b) Wrapper feature selection (we often use cross-validation here): i) Backward selection - we start with a full model comprising all available features; in subsequent iterations, we remove one feature at a time, always the one that yields the largest gain in a model performance metric, until we reach the desired number of features; ii) Forward selection - the opposite of backward selection (we start with a model without variables and then add more variables); iii) Recursive Feature Elimination - very similar to backward selection: it starts with a full model and iteratively eliminates the features one by one; RFE makes its decision based on feature importance extracted from the model, which could be feature weights in linear models, impurity decrease in tree-based models, or permutation importance (which is applicable to any model type). Most modern algorithms, e.g. Random Forest, XGBoost, CatBoost, LightGBM, have a feature importance mechanism; in practice, we will use it heavily! [source: Neptune.AI] 112
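A small sketch of some of the tools listed above (variance threshold, mutual information and RFE), assuming scikit-learn; the thresholds and the number of selected features are illustrative only.

```python
# Sketch: univariate filters followed by a multivariate wrapper (scikit-learn).
from sklearn.feature_selection import (VarianceThreshold, SelectKBest,
                                        mutual_info_classif, RFE)
from sklearn.linear_model import LogisticRegression

def select_features(X, y):
    # 1) Univariate step: drop constant features, then keep the features
    #    with the highest mutual information with the target.
    X_var = VarianceThreshold(threshold=0.0).fit_transform(X)
    X_mi = SelectKBest(mutual_info_classif,
                       k=min(20, X_var.shape[1])).fit_transform(X_var, y)

    # 2) Multivariate step: recursive feature elimination driven by model weights.
    rfe = RFE(LogisticRegression(max_iter=1000),
              n_features_to_select=min(10, X_mi.shape[1]))
    return rfe.fit_transform(X_mi, y)
```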
Classes rebalancing In our course, we have discussed the problem of imbalanced classes in classification tasks many times. This problem manifests itself primarily: 1) when we train our machine learning model (the imbalance has a negative impact on the cost function - during learning the algorithm will focus on the majority class rather than on the minority class); 2) when we analyse evaluation metrics, e.g. Accuracy, ROC AUC, etc., which can mislead us through their optimism. Of course, we can deal with this problem by rebalancing the classes in the dataset. We can do this with the following techniques: 1) modification of the cost function (we have covered it already along with the algorithms); 2) undersampling; 3) oversampling; 4) a combination of undersampling and oversampling. Let's discuss these techniques. Undersampling - in general, it is about reducing the number of observations from the dominant (majority) class. Two categories of algorithms can be distinguished here: 1) Prototype generation - a technique which reduces the number of samples in the targeted classes, but the remaining samples are generated (not selected) from the original set. Example algorithm: a) Cluster Centroids - makes use of K-means to reduce the number of samples; each class is therefore synthesized with the centroids of the K-means method instead of the original samples. 2) Prototype selection - a technique which selects samples from the original set. Example algorithms: a) Random undersampler - a fast and easy way to balance the data by randomly selecting a subset of data for the targeted classes (we can, for instance, set the desired ratio of the number of samples in the minority class over the number of samples in the majority class; it also allows bootstrapping the data); b) Tomek's links - it removes unwanted overlap between classes: majority-class links (a Tomek's link exists if two samples are each other's nearest neighbours) are removed until all minimally distanced nearest-neighbour pairs are of the same class; c) Edited Nearest Neighbours - it removes samples of the majority class whose class differs from the class of their nearest neighbours; this sieve can be repeated, which is the principle of Repeated Edited Nearest Neighbours; d) and many, many more: reference. [source: Imbalance learn] 113 Classes rebalancing - cont'd Oversampling - in general, it is about increasing the number of observations from the dominated (minority) class. We can distinguish the following algorithms: 1) Random oversampler - it can be used to repeat some samples and balance the number of samples between the classes (in the simplest case we duplicate observations from the minority class); by default, random over-sampling generates a bootstrap sample. 2) SMOTE (Synthetic Minority Over-sampling Technique) - it creates new synthetic observations of the minority class as convex (~linear) combinations of existing points and their nearest neighbours of the same class. SMOTE has several variants that identify specific samples to consider during the resampling: the borderline version (Borderline SMOTE) detects which points to select - those on the border between the two classes; the SVM version (SVM SMOTE) uses the support vectors found by an SVM algorithm to create new samples; the KMeans version (KMeans SMOTE) performs clustering first and then generates samples in each cluster independently, depending on the cluster density. 3) ADASYN (Adaptive Synthetic) - this method is similar to SMOTE, but it generates a different number of samples depending on an estimate of the local distribution of the class to be oversampled. ADASYN uses a weighted distribution for the minority class examples according to their level of difficulty in learning: more synthetic data is generated for minority class examples that are harder to learn.
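A minimal sketch of the undersampling and oversampling utilities mentioned above, assuming the imbalanced-learn package and a binary problem; the sampling-strategy values are illustrative only. The combined SMOTE-plus-cleaning pipelines discussed next follow the same API.

```python
# Sketch: random undersampling followed by SMOTE oversampling (imbalanced-learn).
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

def rebalance(X, y):
    # Undersample the majority class down to a 2:1 (majority:minority) ratio...
    X_u, y_u = RandomUnderSampler(sampling_strategy=0.5,
                                  random_state=42).fit_resample(X, y)
    # ...then synthesise minority samples up to a 1:1 ratio with SMOTE.
    X_b, y_b = SMOTE(sampling_strategy=1.0, random_state=42).fit_resample(X_u, y_u)
    print("class counts after rebalancing:", Counter(y_b))
    return X_b, y_b

# Inside cross-validation, resample each training fold separately
# (e.g. via imblearn.pipeline.Pipeline) to avoid data leakage.
```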
Combination of undersampling and oversampling - SMOTE can generate noisy samples by interpolating new points between marginal outliers and inliers. This issue can be solved by cleaning the space resulting from oversampling. In this regard, Tomek's links and Edited Nearest Neighbours are the two cleaning methods that have been added to the pipeline after applying SMOTE oversampling, in order to obtain a cleaner space. Thanks to this, we can distinguish the following algorithms: SMOTETomek and SMOTEENN. In business practice, the simplest approach, the random undersampler, is often used. However, it is worth checking how different approaches work in our business problem (check it with cross-validation). Attention: if you use these techniques, be very careful about data leakage - if you are doing cross-validation, you must rebalance each training fold separately! Additionally, both SMOTE and Tomek's links only work on low-dimensional data! [source: Imbalance learn] 114 Ensemble methods Let's recall the definition of ensembling: ensemble methods are techniques that aim to improve the performance of our model by combining multiple weak models instead of using a single model. The combined models should increase the accuracy of the results significantly. We have three standard ensemble strategies: 1. Bagging (we discussed it alongside Random Forest, which is the most popular bagging algorithm); 2. Stacking - a procedure where a learner (a so-called meta-learner) is trained to combine the individual learners.
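A minimal sketch of the stacking idea just described, assuming scikit-learn's StackingClassifier; the choice of base learners and meta-learner below is illustrative only.

```python
# Sketch: stacking - a meta-learner combines the predictions of individual learners.
from sklearn.ensemble import StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

stack = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier(n_neighbors=10)),
                ("svm", SVC(kernel="rbf", probability=True))],
    final_estimator=LogisticRegression(),   # the meta-learner
    cv=5,                                    # base predictions produced out-of-fold
)
# Usage (hypothetical X, y): stack.fit(X, y)
```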
