Machine Learning 1_ classification methods - lectures-1.pdf

Full Transcript

Machine Learning 1: classification methods [2400-DS1ML1] Spring 2023

Chapter 0 Introduction to the course

2 Meet your lecturers
Szymon Lis E-mail: [email protected]
Michał Woźniak E-mail: [email protected] LinkedIn: www.linkedin.com/in/mjwozniak Scholar profiles: https://linktr.ee/michalwozniak

3 High-level course goals
● Students have reliable and structured knowledge on a wide range of supervised machine learning algorithms for regression and classification problems, including:
○ theoretical foundations of machine learning algorithms,
○ practical programming skills to apply machine learning algorithms.
● Students are able to select predictive modeling algorithms that are best suited to the specific research problem and perform an independent research project using the methods learned.

4 Course prerequisites
We require you to know:
1. linear algebra, calculus, statistics, and probability theory well (recommended: read and understand Deisenroth P., Faisal A., Ong S. (2020). Mathematics for machine learning. Cambridge University Press);
2. at least basic Python programming skills (recommended: 2.1. read and understand Matthes, E. (2019). Python crash course: A hands-on, project-based introduction to programming. No Starch Press, or 2.2. do the Programiz Python Programming Course: https://www.programiz.com/python-programming).

5 Course bibliography
● James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning. Springer, New York, NY (official online access: https://www.statlearning.com/)
● Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer-Verlag.
● Harrington, P. (2012). Machine learning in action (Vol. 5). Greenwich, CT: Manning.
● Intel (2018). Introduction to Machine Learning. Retrieved from https://www.intel.com/content/www/us/en/developer/learn/course-machine-learning.html
● VanderPlas, J. (2016). Python data science handbook: Essential tools for working with data. O'Reilly Media, Inc.

6 Course agenda
Lectures: 1. Introduction to Machine Learning 2. Crucial Machine Learning techniques (part 1) 3. Assessing model accuracy, machine learning diagnostics 4. Basic Supervised Learning models 5. Crucial Machine Learning techniques (part 2)
Labs: 1. Introduction to exploratory data analysis, data wrangling, data engineering and modeling using econometric models 2. Machine learning diagnostics with different evaluation metrics and dataset splits 3. Machine learning modeling with KNN and SVM models 4. Case study 1 5. Case study 2

7 Course credits regulations
The final grade consists of two elements. The first one is the theoretical exam. The second is to prepare two machine learning projects (in pairs) and create a presentation about each of them. The following weights are used to determine the final grade (max 100 pts):
- 40 pts - mid-term theoretical exam
- 30 pts for each of the 2 projects, including:
- 10 pts for the in-class presentation
- 10 pts for the presentation contents
- 10 pts for model performance in the competition (out-of-sample test)
We will provide detailed information about the projects within one month.
8 What we expect / What to expect
● This is a challenging course (a lot of knowledge in a short time)
● The course requires at least several hours of study per week
● Systematic studying is required to learn the material
● Active participation in classes is recommended
● The most important thing is to have the willingness to learn (fast)

9 Lecture materials https://tinyurl.com/ml2023spring

10 start of the 1st lecture

Chapter 1 Introduction to Machine Learning

11 What is Machine Learning?
Machine learning (ML) is the process of using mathematical models of data to help a computer learn without direct instruction. It’s considered a subset of artificial intelligence (AI). Machine learning uses algorithms to identify patterns within data, and those patterns are then used to create a data model that can make predictions. With increased data and experience, the results of machine learning are more accurate—much like how humans improve with more practice. The adaptability of machine learning makes it a great choice in scenarios where the data is always changing, the nature of the request or task is always shifting, or coding a solution would be effectively impossible. [source: Microsoft Azure]

12 Types of Machine Learning
● Supervised learning ● Unsupervised learning ● Semi-supervised learning ● Reinforcement learning [source: IBM Developer]

13 Supervised learning
Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value. Worth noting, econometrics is a subset of the supervised learning family. [source: Wikipedia]
In other words: given a set of data points {x1, …, xn} associated with a set of outcomes {y1, …, yn}, we build a model that learns to predict y from x.
Types of supervised learning: ● Regression - the outcome is continuous ● Classification - the outcome is a category [source: Intel]

14 Unsupervised learning
Unsupervised learning is a type of algorithm that learns patterns from untagged data. Since the examples given to the learner are unlabeled, there is no evaluation of the accuracy of the structure that is output by the relevant algorithm - which is one way of distinguishing unsupervised learning from supervised learning and reinforcement learning. [source: Wikipedia]
In other words: given a set of data points {x1, …, xn}, we look for hidden patterns in the data.
Types of unsupervised learning:
● Clustering - grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other clusters.
● Dimension reduction - transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data.
● Association rules - identification of strong rules discovered in databases using some measures of interestingness.

15 Semi-supervised learning
Semi-supervised learning is an approach to machine learning that combines a small amount of labeled data with a large amount of unlabeled data during training. In such an approach, algorithms make use of this additional unlabeled data to better capture the shape of the underlying data distribution and generalize better to new samples.
Semi-supervised learning falls between unsupervised learning (with no labeled training data) and supervised learning (with only labeled training data). It is a special instance of weak supervision. [source: Wikipedia & scikit-learn]

Reinforcement learning
Reinforcement learning is the training of machine learning models to make a sequence of decisions. The agent learns to achieve a goal in an uncertain, potentially complex environment. In reinforcement learning, an artificial intelligence faces a game-like situation. The computer employs trial and error to come up with a solution to the problem. To get the machine to do what the programmer wants, the artificial intelligence gets either rewards or penalties for the actions it performs. Its goal is to maximize the total reward. [source: deepsense.ai]

16 Data & AI world map

17 Machine Learning glossary
● y, target, dependent variable, endogenous variable, output variable, regressand, response variable - the variable predicted by the algorithm
● x, feature, independent variable, exogenous variable, explanatory variable, predictor, regressor - a variable used to predict the target variable
● example, entity, row - a single data point within the data (one row in the dataset)
● label - the target value for a single data point
Example table: client_id | age | education | income | default (target column)
1 | 25 | bachelor | 25k USD | 0
2 | 45 | doctorate | 40k USD | 0
3 | 52 | master | 70k USD | 1
(age, education and income are the feature columns; the value of default for a single row is its label)

18 Machine Learning as a function
The fundamental assumption of machine learning is as follows: there is a function that represents a causal relationship between features X and target Y, and it can be written in a very general form as: Y = f(X) + ϵ, where f is some fixed but unknown function of X, and ϵ is a random error term, which is independent of X and has mean zero. The goal of Machine Learning is to “find” the function f, where by “find” we mean the set of methods that estimate this function (we need to approximate this function, as its actual form is unobservable).

19 Machine learning estimation idea
In general, we can define the estimation process as: Ŷ = f̂(X), where Ŷ is the prediction of our estimator for the target variable and f̂(X) is our estimate of the function f(X). The estimator approximates reality imperfectly, so it generates a prediction error, which is equal to Y - Ŷ. The size of this error reflects the quality of the model (in general, the smaller the error the better). Importantly, for a number of reasons, part of the error is reducible (bias part) and part is irreducible (e.g. due to omitted variables): Expected prediction error = Irreducible error + Reducible error.

20 Machine learning estimation approaches
A great many approaches can be used to estimate our function f(X). Thus, the primary division of supervised machine learning methods is as follows:
● parametric algorithms (for instance econometrics): ○ known functional form ○ known distribution of random variables ○ finite number of parameters
● nonparametric algorithms: ○ unknown functional form (lack of a priori assumptions) ○ infinite number of parameters
● semi-parametric algorithms: ○ theoretically infinite number of parameters, but in practice we estimate part of them.
Both parametric and non-parametric methods have their advantages and disadvantages (trade-offs for parametric approaches: simplicity vs constraints, speed vs limited complexity, less data required vs potentially poor fit).
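To make the parametric/nonparametric distinction concrete, here is a minimal sketch (not from the original slides; it assumes NumPy and scikit-learn and uses synthetic data) that estimates the same unknown f(X) with one estimator of each kind:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)   # Y = f(X) + eps, with f unknown to the models

parametric = LinearRegression().fit(X, y)                       # fixed functional form, two parameters
nonparametric = KNeighborsRegressor(n_neighbors=10).fit(X, y)   # no assumed functional form

X_new = np.linspace(0, 5, 6).reshape(-1, 1)
print(parametric.predict(X_new))       # the linear (parametric) estimate of f
print(nonparametric.predict(X_new))    # the KNN (nonparametric) estimate of f

The linear model stays simple but cannot bend to the sine shape, while KNN follows it locally at the cost of needing more data - exactly the trade-off described above.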
21 Training Machine learning model - error minimization
Regardless of the estimation approach chosen, we always want the forecast error obtained with the currently estimated parameters to be as small as possible. Therefore, it is necessary to define and then optimise a function that expresses how “wrong” the model is.
First of all we define the loss function (L) - usually a function which measures the error between a single prediction and the corresponding actual value, for instance the squared error: L(y, ŷ) = (y - ŷ)².
Based on that we can define a more general object, which is the cost function (J) - usually a function which measures the error between predictions and their actual values across the whole dataset. It might be a sum of loss functions over your training set plus some model complexity penalty (regularization), for instance: J(θ) = Σᵢ L(yᵢ, ŷᵢ) + λ · penalty(θ).
Model training is about minimizing the cost function!

22 Training Machine learning model - cost function properties
The cost function directly influences our estimator f̂(X). Thus, when we choose this function, we should (if possible) ensure that our estimator is unbiased, E(f̂(X)) = f(X), and efficient: the estimator with the smallest variance. In the best situation, we obtain a minimum-variance unbiased estimator (MVUE).
In addition, due to optimisation algorithms (based on differentiation), the cost function should be convex and it is good if it is smooth (continuous and differentiable).
Last but not least, it is always important to consider whether our cost function reflects the real cost of prediction errors in the context of the research/modelling objective. It is worth considering whether it is more costly to overestimate or to underestimate in our problem (asymmetry) (e.g. whether it is better to employ more people in the shop for the Christmas peak, or whether it is better not to overestimate this number).

23 Training Machine learning model - idea of gradient descent
Once we have defined the cost function, we can generally take its derivative with respect to the parameters (weights), set it to zero, and solve for the parameters to get the perfect global solution (FOC). However, for most functions this is impossible! Therefore, we have to use an alternative (local) optimisation method, which is the gradient descent algorithm.
The general idea of gradient descent is as follows: [1] we define the surface created by the objective function (we don't know what it looks like in general); [2] we follow the direction of the slope of this function downhill until we reach a valley. [source: PaperspaceBlog]

24 Training Machine learning model - gradient descent formally
First of all, let's recall the simplified definition of a gradient. The gradient is a vector whose coordinates consist of the partial derivatives of the function with respect to the parameters: ∇J(θ) = [∂J/∂θ₁, …, ∂J/∂θₖ]. The gradient vector can be interpreted as the "direction and rate of fastest increase".
Now we define the gradient descent optimization algorithm. Gradient descent is a way to minimize an objective function J(θ), parameterized by a model's parameters θ, by updating the parameters in the opposite direction of the gradient of the objective function w.r.t. the parameters. Additionally, we have to define a learning rate η which determines the size of the steps we take to reach a (local) minimum.
Vanilla gradient descent algorithm:
1. Start with an initial random guess of θ.
2. Generate a new guess by moving in the negative gradient direction (gradient computed on the entire training dataset): θ := θ - η · ∇J(θ).
3. Repeat step 2 to successively refine the guess and stop when the convergence criterion is reached.
[sources: Sebastian Ruder blog, Stanford CS 229]
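Below is a minimal NumPy sketch of this vanilla (batch) gradient descent loop, applied to an MSE cost for a linear model; the synthetic data, the learning rate of 0.1 and the stopping rule are illustrative assumptions rather than part of the slides:

import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]   # design matrix with an intercept column
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta + rng.normal(scale=0.1, size=100)

def cost(theta):          # J(theta): mean squared error over the whole training set
    return np.mean((X @ theta - y) ** 2)

def gradient(theta):      # gradient of J w.r.t. theta, computed on the entire dataset
    return 2.0 / len(y) * X.T @ (X @ theta - y)

theta = rng.normal(size=3)                # step 1: initial random guess of theta
eta = 0.1                                 # learning rate
for _ in range(10_000):                   # steps 2-3: keep moving against the gradient
    step = eta * gradient(theta)
    theta -= step
    if np.linalg.norm(step) < 1e-10:      # simple convergence criterion
        break

print(theta, cost(theta))                 # theta should end up close to true_theta

Because every update uses the full training set, this is the batch variant; the stochastic and mini-batch versions discussed on the next slide differ only in how much data feeds the gradient computation.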
25 Training Machine learning model - gradient descent versions
There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function. Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update. [source: Sebastian Ruder blog]

26 General purpose of the estimation
When we do research we have to answer the question: 1) are we interested in the best possible prediction, 2) in the best possible understanding of the relationship between our features and the target (inference), or 3) are we interested in both issues at the same time? Depending on the answer (business environment, research problem, etc.), we will decide on the choice of estimator, e.g. parametric or non-parametric, a fully explainable model or a black-box model (a system which can be viewed in terms of its inputs and outputs without any knowledge of its internal workings), etc.
Note that a more complex model will not always be better than a simple model (e.g. some problems are purely linear and non-parametric methods may search for complex, artificial and spurious relationships). Before starting experiments, it is important to have a good understanding of the problem being undertaken!

27 Types of variables
There are different types of variables in statistics and machine learning. The most important ones are highlighted in the illustration below. [source: K2 Analytics]

28 Linear regression - general information
Linear regression is a basic supervised learning algorithm for predicting continuous variables from a set of independent variables. From an econometric point of view, linear regression is primarily used for inference (much less frequently for prediction). In this course we look at linear regression from the machine learning perspective, i.e. we are mostly interested in prediction.
To get a good understanding of linear regression in economic applications, a separate course is generally devoted to it. We don't have time for that, so we will discuss its key elements from an ML perspective. At the same time, we recommend a very good course teaching the principles of linear regression (chapters 3 and 4).
Importantly, linear regression can be estimated in a number of ways: ordinary least squares (OLS), weighted least squares (WLS), generalised least squares (GLS). We will focus on the most popular of these, OLS.

29 Linear regression - external materials
We use a Machine Learning University (MLU)-Explain course created by Amazon to present the concepts, assumptions, mathematical foundations, and interpretation of linear regression. The course is made available under the Attribution-ShareAlike 4.0 International licence (CC BY-SA 4.0). Thanks to numerous visualisations, the course allows many theoretical concepts to be discussed very quickly. Linear regression by MLU-EXPLAIN

30 Linear regression - additional materials
Matrix notation of the linear regression equation: y = Xβ + ε [source: Practical Econometrics and Data Science]
Adjusted R squared
Adjusted R² is a corrected goodness-of-fit (model accuracy) measure for linear models. It identifies the percentage of variance in the target field that is explained by the inputs. R² tends to optimistically estimate the fit of the linear regression. It always increases as more effects are included in the model. Adjusted R² attempts to correct for this overestimation. Adjusted R² might decrease if a specific effect does not improve the model. Adjusted R² is always less than or equal to R². A value of 1 indicates a model that perfectly predicts values in the target field. A value that is less than or equal to 0 indicates a model that has no predictive value. If we assume that p is the total number of explanatory variables in the model, and n is the sample size, then adjusted R² is equal to: Adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1). [source: IBM]
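As an illustration of these goodness-of-fit measures, here is a short sketch (synthetic data; the variable names and coefficients are only loosely inspired by the example on the next slide) that fits OLS with statsmodels and prints the kind of regression output analysed there:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))                                      # x1, x2, x3
y = 5.2 + 0.47 * X[:, 0] + 0.48 * X[:, 1] + rng.normal(size=500)   # x3 is deliberately irrelevant

X_const = sm.add_constant(X)                 # statsmodels does not add the intercept automatically
model = sm.OLS(y, X_const).fit()
print(model.summary())                       # R², adjusted R², F-statistic, t-statistics, p-values
print(model.rsquared, model.rsquared_adj)    # adjusted R² applies the (n - 1)/(n - p - 1) correction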
31 Linear regression - additional materials
OLS - Closed-Form Solution extension
OLS - regression output analysis [source: Practical Econometrics and Data Science]
Annotations on the example output:
● R² and Adjusted R²
● p-value of the F-statistic (interpretation: a value below the significance level, e.g. 5%, means that our model is well specified - it is better than the model without features)
● values of the model parameters, so the fitted regression is: y = 5.2 + 0.47*x1 + 0.48*x2 - 0.02*x3
● p-values of the t-statistics (interpretation: a value below the significance level, e.g. 5%, means that the given variable is significant in the model)
● some model specification tests
[source: Statsmodels]

32 Linear regression - additional materials end of the 1st lecture
Key assumptions in OLS and the BLUE concept [source: Practical Econometrics and Data Science]

33 Logistic regression - general information start of the 2nd lecture
Logistic regression is a basic supervised learning algorithm for predicting nominal binary variables (dichotomous variables) from a set of independent variables. As with linear regression, from an econometric point of view, logistic regression is primarily used for inference. However, the interpretation of logistic regression results is much more difficult (we can’t interpret logistic regression results directly, thus we use marginal effects and odds). During this course we look at logistic regression from the machine learning perspective, i.e. we are mostly interested in prediction. At the same time, we recommend a very good course teaching the principles of logistic regression from an econometric perspective (chapter 5.2).
A natural generalisation of logistic regression that can classify more than two classes is multinomial logistic regression. It is worth knowing that logistic regression is just one selected model representing the entire class of Generalized Linear Models (GLMs). Here you can find more details about GLM and its families.

34 Logistic regression - external materials
We use a Machine Learning University (MLU)-Explain course created by Amazon to present the concepts, mathematical foundations, and interpretation of logistic regression. The course is made available under the Attribution-ShareAlike 4.0 International licence (CC BY-SA 4.0). Thanks to numerous visualisations, the course allows many theoretical concepts to be discussed very quickly. Logistic regression by MLU-EXPLAIN

35 Logistic regression - additional materials
Linear regression for a binary classification problem [source: Intel Course: Introduction to Machine Learning]

36 Logistic regression - additional materials
Sigmoid function
A sigmoid function is a mathematical function having a characteristic "S"-shaped curve or sigmoid curve. Some sigmoid functions are compared in the figure. A common example of a sigmoid function is the logistic function: σ(x) = 1 / (1 + e^(-x)).
The logistic function has many useful properties:
1. it maps the solution space to probabilities - the output range is from 0 to 1
2. it is differentiable - important from the perspective of the optimization problem
3. it uses the exponential function - most outputs are “attached” to 0 or 1 (not in the ambiguous middle zone)
[source: Wikipedia]
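A minimal sketch of the logistic (sigmoid) function and of a logistic regression fitted for prediction (synthetic data; scikit-learn is assumed, with its default settings such as L2 regularization):

import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))          # maps any real number into the (0, 1) range

print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # ~0.007, 0.5, ~0.993

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
p = sigmoid(1.5 * X[:, 0] - 1.0 * X[:, 1])   # true probabilities of class "1"
y = (rng.uniform(size=300) < p).astype(int)

clf = LogisticRegression().fit(X, y)
print(clf.intercept_, clf.coef_)             # estimated coefficients on the log-odds scale
print(clf.predict_proba(X[:3]))              # predicted probabilities rather than hard classes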
37 Logistic regression - additional materials
Logistic regression for a binary classification problem [source: Intel Course: Introduction to Machine Learning]

38 Logistic regression - additional materials
The relationship between logistic and linear regression:
● logistic function: p = 1 / (1 + e^(-(β0 + β1·x)))
● odds ratio: p / (1 - p) = e^(β0 + β1·x)
● log odds (or logit function): ln(p / (1 - p)) = β0 + β1·x
[source: Intel Course: Introduction to Machine Learning]

39 Logistic regression - additional materials
Logistic regression - decision boundary
Logistic regression - cost function
We utilize cross-entropy (log-loss) as the cost function for logistic regression: J(β) = -(1/n) Σᵢ [yᵢ·log(ŷᵢ) + (1 - yᵢ)·log(1 - ŷᵢ)]. [source: Intel Course: Introduction to Machine Learning]

40 Logistic regression - additional materials
Multinomial logistic regression - one vs all approach [source: Intel Course: Introduction to Machine Learning]

41 Logistic regression - additional materials
Multinomial logistic regression & softmax function
Let’s assume that we have k classes. We can define multinomial logistic regression using the following (softmax) formula: P(y = i | x) = e^(fᵢ(x)) / Σⱼ e^(fⱼ(x)), where fᵢ(x) is the linear predictor function (linear regression) used to predict that a given observation has outcome i.
The cost function for multinomial logistic regression is a generalization of log-loss to cross-entropy for k > 2. We calculate a separate loss for each class label per observation and sum the result: -Σₒ Σⱼ y(o,j) · log(p(o,j)), where y(o,j) is a binary indicator (0 or 1) of whether class label j is the correct classification for observation o, and p(o,j) is the predicted probability that observation o is of class j.

42 Logistic regression - additional materials
Generalized Linear Models
A GLM is a flexible generalization of ordinary linear regression. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value. [source: Stanford CS 229, Wikipedia]

43 Logistic regression - additional materials
Generalized Linear Models - examples (here we utilize other notation - be careful!) [source: Time series reasoning]

44 Chapter 2 Crucial machine learning techniques (part 1)

45 Problem definition - problem statement worksheet approach
At the beginning of any ML research/consulting project, it is good practice to formulate a problem statement worksheet - a document that formalizes at a basic level the definition of the business task we will be tackling (what, why, how). This document is an excellent initiator (it constitutes a binding document between the parties) and allows you to plan a complete project using nearly any project management technique (for instance scrum, etc.). The masters in preparing such worksheets are consulting firms (e.g., the top-3: McKinsey, BCG, Bain). Let's analyze the worksheet template used by McKinsey: [source: Betty Wu Talk]

46 Dataset preparation steps
After collecting the business requirements and designing the project/experiment, the next step is to prepare the data for the project implementation. We usually distinguish the following elements of such a data preparation process:
1. Defining / selecting the necessary data and their source
2. Data ingestion - data extraction from source systems
3. Transforming data to a convenient analytical form (preferably homogeneous for all sets)
4. Initial data exploration (e.g. with visualizations) and validation
5. Combining data sets into one (if effective at this stage - depending on the project)
6. Division of the set into {train, validation, test} or {train+validation and test}. Note: All transformations performed on the training set should be applicable later on the test set, e.g. parameters learned on the training set for normalisation should also be used for the same transformation on the test set (see the sketch after this list).
7. Data cleaning, e.g. missing value imputation
8. Feature engineering - generation of new variables
9. Extensive exploratory data analysis (preferably using visualizations and statistics)
10. Initial feature selection process
11. Balancing of the target variable classes (if needed)
12. Efficient data saving to a universal format (preferred by models)
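A minimal sketch of the note in point 6 (synthetic data; a standard scaler stands in for any transformation whose parameters must be learned on the training part only):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(loc=10, scale=5, size=(1000, 4))
y = rng.integers(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)   # stratify keeps class proportions in both parts

scaler = StandardScaler().fit(X_train)      # means and standard deviations estimated on the train set only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)    # the same (training) parameters reused on the test set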
47 Exploratory data analysis
The exploratory data analysis (EDA) stage is designed to help us build up both a general and a detailed picture of the data we have (EDA is used to summarize its main characteristics). In the first instance, we can perform a relatively simple visual analysis. In the case of tabular data, this involves analysing sets of data stored in a data frame (e.g. in the case of a project using images, we will look at images from the set). Furthermore, we should use visualisation techniques (univariate - single variable, bivariate - two variables, and multivariate - several variables) to better analyse the data. The image on the right shows a 'guide' to visualisations by application. [source: TapClicks]

48 Exploratory data analysis with statistics
In addition to visual analysis, it is also useful to use statistical tools to explore properties of and relationships between data. In the first instance, the use of univariate analysis is recommended. We want to examine what properties single variables have, e.g. using 1) descriptive statistics (frequency table, mean, standard deviation, skewness, kurtosis, quantiles), 2) one-sample tests, 3) tests for autocorrelation, white noise (for time series), etc. Next, we want to check the multivariate statistics. The table on the right shows the most common measures and tests of association between variables of different types (source: Statistics & Exploratory Data Analysis course, Dr Marcin Chlebus, Dr Ewa Cukrowska-Torzewska). In addition, we can use various Unsupervised Learning techniques here, e.g. dimension reduction (principal component analysis), etc.

49 Data imputation
When working with real data you will always encounter the problem of missing values. There can be many reasons for this, but first and foremost the cause of the missing values can be attributed to: the data does not exist, the data was not captured due to a hardware, software or human error, the data was deleted, etc. The figure on the right shows the most general classification of reasons for missing data. It is always worth checking what the lack of data results from, because perhaps it can be supplemented through an additional data extraction process (without the need to use data mining techniques).
We distinguish the following techniques for dealing with the problem of missing values:
● do nothing (some machine learning algorithms can deal with this problem automatically)
● remove missing variables/columns (consider it when the variable has more than ~10% missing values and is not crucial for the analysis) or examples/rows (avoid if possible, especially if your dataset is small and the missing values are not random)
● fill in (impute) the missing values using:
○ univariate techniques (use only one feature):
■ for continuous variables: use a constant (like 0); use statistics like mean/median/mode (globally or in a subgroup); use a random value from an assumed distribution;
■ for categorical variables: encode missing as an additional “missing” category; replace with the mode (globally or in a subgroup); replace randomly from the non-missing values;
■ for time series variables: use the last or next observed value; use linear/polynomial/spline interpolation
○ multivariate techniques (use multiple features): use KNN or another supervised ML algorithm; use Multivariate Imputation by Chained Equations
[source: Kaggle]
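A minimal sketch of two of the imputation techniques listed above (scikit-learn; the tiny array with injected NaNs is purely illustrative):

import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[25.0, 50000.0],
              [45.0, np.nan],
              [np.nan, 70000.0],
              [52.0, 40000.0]])

median_imputer = SimpleImputer(strategy="median")   # univariate: replace NaN with the column median
print(median_imputer.fit_transform(X))

knn_imputer = KNNImputer(n_neighbors=2)             # multivariate: fill NaN using the most similar rows
print(knn_imputer.fit_transform(X))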
50 Feature engineering
Feature engineering, i.e. the generation of new variables, is a key stage of modeling (the correctness of this process will determine the quality of the model). It is performed at several moments; the two most popular are during the ETL process and after the ETL process:
1) During the ETL process, we focus on so-called analytical engineering (domain knowledge is key here), i.e. we try to transform our sets into a form consumable by our models - most often we use various types of aggregations here using descriptive statistics, e.g. mean/median/quantiles etc. (e.g. being a bank, we want to estimate the customer's credit risk - so we will aggregate his/her history for the selected period into one observation).
2) After the ETL process, the matter is more complicated because we focus on creating additional variables or processing existing ones in order to improve the predictive power of our algorithm (this requires particular creativity - it's a kind of art) - for instance some algorithms are not able to capture non-linear relationships (OLS/SVM/KNN), so we have to feed them with variables that make it possible.
Let's discuss the most popular feature engineering techniques. (Attention! Each of these techniques must be fitted on the train set and then applied, unchanged, to the test set!)
Numeric variable transformations (only the most important):
● scaling to a range (min-max scaler): z = (x - min(x))/(max(x) - min(x)) (recommendation: when the feature is more-or-less uniformly distributed across a fixed range)
● clipping (winsorization): if x > max, then z = max; if x < min, then z = min (recommendation: when the feature contains some extreme outliers)
● log scaling: z = log(x) (recommendation: when the feature conforms to the power law)
● z-score (standard scaler): z = (x - u) / s (recommendation: when the feature distribution does not contain extreme outliers)
● quantile transformer: map the data to a uniform distribution with values between 0 and 1
● power transformer (the Yeo-Johnson transform and the Box-Cox transform): map data from any distribution to as close to a Gaussian distribution as possible in order to stabilize variance and minimize skewness
● bucketing (discretization of continuous variables) with: 1) equally spaced boundaries, 2) quantile boundaries, 3) bivariate decision tree boundaries, 4) expert-given boundaries - all in all it can help with both overfitting and non-linear modelling
● polynomial transformer, spline transformer, rounding, replacing with PCA and any other arithmetic operation
[source: Google]

51 Feature engineering - cont’d end of the 2nd lecture
Categorical variable transformations (only the most important):
● one hot encoding - it transforms each categorical feature with n possible values (n categories or n levels) into n binary features, with one of them 1 and all others 0. Sometimes we have to drop/remove one category (one binary feature) to avoid perfect collinearity in the input matrix in some estimators; in most cases we will drop the most frequent category (for instance OLS without this dropping will be impossible to compute; however, OLS with regularization like Ridge (L2) or Lasso (L1) works well with such collinearity, and then we should not remove any level). This approach supports aggregating infrequent categories into a single output for each feature.
● ordinal encoder - it transforms each categorical feature into one new feature of integers. Be careful with this encoding, because passing such a variable directly to the model will impose an order on the categories.
● there are a lot of other super powerful encoders like: BaseN, CatBoost Encoder, Count Encoder, Hashing, Helmert Coding, James-Stein Encoder, Leave One Out, Polynomial Coding, Quantile Encoder, Sum Coding, Summary Encoder, Target Encoder, Weight of Evidence etc. (check out the source for more information) (I personally most often use one hot encoding when doing econometrics, but when doing ML I love the CatBoost Encoder, credit risk bankers love WoE, etc.)
Interactions between variables - of course we can look for some interactions between our variables (numeric & numeric, categorical & categorical or numeric & categorical). We can try multiplication, division, subtraction and basically anything math and our imagination allow us to do.
Keep in mind that the possibilities for feature engineering are endless. The advantage of machine learning over classical econometrics is that in most cases we are interested in the result itself, not how we arrived at it, so the variables we produce may have low interpretability. In addition, some algorithms (especially those based on decision trees) are able to select the relevant variables themselves and marginalize the non-relevant ones. However, my years of experience show that we should not overdo our creativity - in financial problems, usually the best variables created in feature engineering are those that have a strong business/economic/theoretical basis. However, it is always good to look for happiness in numbers :)
[source: Category Encoders]
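A minimal sketch of the two basic encoders discussed above, using scikit-learn (version ≥ 1.2 is assumed for the sparse_output argument); the more advanced encoders listed come from the separate category_encoders package:

import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

education = np.array([["bachelor"], ["doctorate"], ["master"], ["bachelor"]])

onehot = OneHotEncoder(drop="first", sparse_output=False)   # drop one level to avoid perfect collinearity
print(onehot.fit_transform(education))
print(onehot.get_feature_names_out())

ordinal = OrdinalEncoder()                 # one integer column; imposes an (arbitrary) order on the levels
print(ordinal.fit_transform(education))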
52 Labs no. 1 - introduction to exploratory data analysis, data wrangling, data engineering and modeling using econometric models
The data we will be working with can be accessed through the following link: https://www.kaggle.com/competitions/home-credit-default-risk/data. Link to the materials: Link

53 start of the 3rd lecture

Chapter 3 Assessing model accuracy, machine learning diagnostics

54 Evaluation metrics - concept
At this point, we have a broad understanding of the cost function and its crucial role in machine learning. We know that the cost function should meet certain properties (e.g. differentiability with respect to the parameters). In practice, this means that we can use only a limited number of functions to train and monitor the quality of our model. However, for the extensive process of evaluating the performance of a model (during training and testing), evaluation metrics have been developed that do not have to comply with restrictive mathematical properties. Evaluation metrics are calculated after the estimator has already been created (with the use of a different cost function), so an evaluation metric does not affect the estimator per se. We distinguish evaluation metrics for the following problems: ● regression ● classification ● probabilities

55 Evaluation metrics - regression
In the case of regression we deal with a continuous target. Intuitively, we are looking for metrics which describe the distance between the prediction and the actual value (it's straightforward). The most popular regression metrics are:
● Mean Square Error (MSE): (1/n) Σᵢ (yᵢ - ŷᵢ)²
● Root Mean Square Error (RMSE): √MSE
● Mean Absolute Error (MAE): (1/n) Σᵢ |yᵢ - ŷᵢ|
● Mean Absolute Percentage Error (MAPE): (1/n) Σᵢ |yᵢ - ŷᵢ| / max(ϵ, |yᵢ|), where ϵ is a small, strictly positive number
● Mean Squared Logarithmic Error (MSLE): (1/n) Σᵢ (log(1 + yᵢ) - log(1 + ŷᵢ))²
● R² score: 1 - Σᵢ (yᵢ - ŷᵢ)² / Σᵢ (yᵢ - ȳ)²
● Median Absolute Error (MedAE): median(|y₁ - ŷ₁|, …, |yₙ - ŷₙ|)
● Mean Absolute Scaled Error, Mean Directional Accuracy and many many more…
To visualize the error distribution we can use a histogram/KDE plot of the model errors, and with that we are able to get a complete picture of the performance of the regression estimator. [source: Scikit-learn]

56 Evaluation metrics - regression
When choosing an evaluation metric, be very careful and deeply understand the business outcome of your decisions: MAPE = (1/n) Σᵢ |yᵢ - ŷᵢ| / |yᵢ|, while the symmetric MAPE (sMAPE) = (1/n) Σᵢ |yᵢ - ŷᵢ| / ((|yᵢ| + |ŷᵢ|)/2), and the two penalize over- and under-prediction differently. [source: Towards Data Science]

57 Evaluation metrics - classification
In the case of a classification problem, it is much more difficult to make a correct assessment of the model. It requires a bit more knowledge and abstract thinking. First of all, let's introduce the confusion matrix (an example is shown in the figure). [source: Wikipedia]

58 Evaluation metrics - classification (* not applicable for imbalanced problems)
Based on the confusion matrix we can derive the following classification metrics:
● Accuracy* (how many observations, both positive and negative, were correctly classified): (TP + TN) / (TP + TN + FP + FN)
● True Positive Rate or Recall or Sensitivity (how many observations out of all positive observations are classified as positive): TP / (TP + FN)
● True Negative Rate or Specificity (how many observations out of all negative observations are classified as negative): TN / (TN + FP)
● Positive Predictive Value or Precision (how many observations predicted as positive are in fact positive): TP / (TP + FP)
● Negative Predictive Value (how many predictions out of all negative predictions were correct): TN / (TN + FN)
● False Positive Rate or Type I error: FP / (FP + TN)
● False Negative Rate or Type II error: FN / (FN + TP)
● F-beta score (a combination of precision and recall in one metric; the more you care about recall over precision, the higher the beta you should choose; well suited to the problem of an imbalanced dataset): (1 + β²) · precision · recall / (β² · precision + recall)
[source: Neptune.ai blog]

59 Evaluation metrics - classification
Based on the confusion matrix we can also derive:
● Matthews Correlation Coefficient (correlation between predicted classes and ground truth; well suited to the problem of an imbalanced dataset): (TP·TN - FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))
and many many more…
In the case of binary classification metrics we strongly recommend the following Neptune.ai blog post: link. They accurately define each evaluation metric with an intuitive interpretation (super useful in a regular business environment). Additionally, they provide very pertinent advice on when to apply a given metric.
Of course, we can generalize binary classification metrics to multiclass classification metrics. First of all, we can plot the confusion matrix, which is self-explanatory. Additionally, for each class (one vs all approach) we can calculate separately: precision, recall, F-beta score etc., and finally average each of them by some aggregation rule (micro, macro and weighted aggregation approaches) (here you can find more details: link). [source: Neptune.ai blog]
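A minimal sketch computing several of the metrics above from a toy set of predictions with scikit-learn (the labels are made up for illustration):

from sklearn.metrics import (accuracy_score, confusion_matrix, fbeta_score,
                             matthews_corrcoef, precision_score, recall_score)

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))          # rows: actual class, columns: predicted class
print(accuracy_score(y_true, y_pred))            # (TP + TN) / all observations
print(precision_score(y_true, y_pred))           # PPV = TP / (TP + FP)
print(recall_score(y_true, y_pred))              # TPR = TP / (TP + FN)
print(fbeta_score(y_true, y_pred, beta=2.0))     # beta > 1 weights recall more than precision
print(matthews_corrcoef(y_true, y_pred))
# for multiclass problems the same functions accept average="micro", "macro" or "weighted"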
60 Accuracy, precision, recall, F1-score - external materials
We use a Machine Learning University (MLU)-Explain course created by Amazon to present the concepts of the accuracy, precision, recall and F1-score metrics. The course is made available under the Attribution-ShareAlike 4.0 International licence (CC BY-SA 4.0). Thanks to numerous visualisations, the course allows many theoretical concepts to be discussed very quickly. Precision & Recall by MLU-EXPLAIN

61 Evaluation metrics - probabilities (for the classification task)
When we use classification algorithms we nearly always want to deal with probabilities. In most cases we can set up models to return probabilities (not a predicted class)! We then need to decide where to place the probability cut-off point (above which we assign an observation to a specific class). It's not an easy task. In most cases we start with a 0.5 (50%) cut-off point, but in many cases it might be the wrong value! Thanks to evaluation metrics and plots dedicated to probabilities (for classification) we can make the above decision in a responsible and informed way. We can distinguish the following metrics here:
● Receiver Operating Characteristic Curve (ROC)
● Precision/recall curve
● Lift curve
● Gini curve
● Area Under the Curve ROC (AUC ROC)
● Area Under the Curve Precision/recall (AUC PR)
● Log-loss or Cross-entropy or Entropy

62 Evaluation metrics - probabilities
Receiver Operating Characteristic Curve (ROC)
The ROC curve allows us to address the trade-off between the true positive rate (TPR) and the false positive rate (FPR). For every probability cut-off point, we calculate the TPR and FPR and plot them on one chart. At the beginning, when the cut-off point is 1, we classify every observation as "0". Obviously, in this situation the FPR is equal to 0. With the decrease of the cut-off point we increase the number of "1"s - the TPR starts to increase. However, our estimator will probably not be perfect, so some of the predicted "1"s are incorrect, hence the increase of the FPR (and decrease of the TNR). Generally, the higher the TPR and the lower the FPR at each threshold, the better; classifiers whose curves lie closer to the top-left corner are therefore better. As you may notice, the ROC curve is not well suited for imbalanced classification tasks (for more details please read this article).
Area Under the Curve ROC (AUC ROC)
Additionally, we can calculate AUC ROC, which will be a single summary metric to assess the quality of the model. It takes values from 0.5 to 1. We should not use it with an imbalanced dataset. It is recommended if you care about true negatives as much as true positives, and you care about ranking predictions. Additionally, this metric can be interpreted as: the probability that a uniformly drawn random positive has a higher score than a uniformly drawn random negative. Notice: the AUC metric treats all classification errors equally, ignoring the fact that one type of error may be more costly than another. For example, in cancer detection, we’ll probably want to minimize false negatives. [source: Wikipedia]

63 ROC and ROC AUC - external materials
We use a Machine Learning University (MLU)-Explain course created by Amazon to present the concept of ROC and ROC AUC. The course is made available under the Attribution-ShareAlike 4.0 International licence (CC BY-SA 4.0). Thanks to numerous visualisations, the course allows many theoretical concepts to be discussed very quickly. ROC & AUC by MLU-EXPLAIN
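A minimal sketch of how the ROC points and AUC ROC are computed from predicted probabilities (toy numbers; scikit-learn assumed):

from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.55, 0.90]   # predicted probabilities of class "1"

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one (FPR, TPR) point per candidate cut-off
for t, f, r in zip(thresholds, fpr, tpr):
    print(f"cut-off {t:.2f}: FPR = {f:.2f}, TPR = {r:.2f}")

print(roc_auc_score(y_true, y_score))               # area under the ROC curve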
64 Evaluation metrics - probabilities
Precision-recall curve (PR curve)
The PR curve combines precision and recall in a single visualization. For every probability cut-off, we calculate the PPV (precision) and TPR (recall) and plot them on the graph. The higher the curve, the better the model. However, we deal here with the classic precision/recall dilemma (the higher the precision, the lower the recall).
Area Under the Curve Precision-recall (AUC PR)
Similarly to the ROC AUC score, we can calculate the Area Under the Precision-Recall Curve to get one representative number for the whole model. We can treat PR AUC as the average of precision calculated for each recall threshold from 0.0 to 1.0. AUC PR is recommended for highly imbalanced problems and when we communicate the precision/recall decision to our stakeholders (and additionally suggest where the best possible cut-off point is). [source: Stackoverflow]

65 Evaluation metrics - probabilities (for the regression task)
As in the case of classification, in regression we can also use the notion of probability, but in a slightly different sense: we can build regression models that, in addition to the expected value, estimate the confidence intervals of the forecast. We will not discuss this issue in our classes due to its advanced level, but it is worth knowing about the existence of the metric: the Continuous Ranked Probability Score (CRPS), which generalizes the MAE to the case of probabilistic forecasts. Link for more details: https://www.lokad.com/continuous-ranked-probability-score.

66 Bias/variance trade-off - concept
The bias of a model is the difference between the expected prediction and the correct value that we try to predict for given data points. The variance of a model is the variability of the model prediction for given data points.
MSE decomposition: Error = Bias² + Variance + Noise
Bias/variance trade-off - the simpler the model, the higher the bias, and the more complex the model, the higher the variance. For the MSE decomposition check slide 19. [source: Stanford CS 229, Scott Fortmann-Roe Essay]

67 Bias/variance trade-off - overfitting and underfitting [source: Stanford CS 229]

68 Bias/variance trade-off - overfitting and underfitting (cont’d)
(figure annotations: for underfitting - use boosting; for overfitting - use bagging, reduce the complexity of the model)
These illustrations present learning curves. A learning curve is a plot of model learning performance over experience or time. We distinguish the following learning curves:
Train Learning Curve: a learning curve calculated from the training dataset that gives an idea of how well the model is learning.
Validation Learning Curve: a learning curve calculated from a hold-out validation dataset that gives an idea of how well the model is generalizing.
Optimization Learning Curves: learning curves calculated on the metric by which the parameters of the model are being optimized, e.g. log-loss.
Performance Learning Curves: learning curves calculated on the metric by which the model will be evaluated and selected, e.g. AUC ROC.
[source: Stanford CS 229, Machine Learning Mastery]

69 Bias/variance trade-off - external materials
We use a Machine Learning University (MLU)-Explain course created by Amazon to present the concept of the bias/variance trade-off.
The course is made available under the Attribution-ShareAlike 4.0 International licence (CC BY-SA 4.0). Thanks to numerous visualisations, the course allows many theoretical concepts to be discussed very quickly. Bias/variance trade-off by MLU-EXPLAIN

70 Training, validation and testing sets - concept
Generally, learning the parameters of a prediction function and testing it on the same data is a methodological mistake (we can easily overfit our model). In statistics and machine learning, there is a good practice of dividing a data set into three parts, each with a dedicated purpose. (for more details: see slide 61)
In some classification problems we can encounter imbalanced datasets (e.g. few "1"s and a lot of "0"s). It's super important to use a stratified approach, which ensures that relative class frequencies are approximately preserved in each train-validation pair.
Sometimes a strategy of creating several models independently on the train and validation data is used, and then the single best model is selected on the testing sample. [source: Stanford CS 229]

71 Training, validation and testing sets - external materials
We use a Machine Learning University (MLU)-Explain course created by Amazon to present the concept of the train-test-validation dataset split. The course is made available under the Attribution-ShareAlike 4.0 International licence (CC BY-SA 4.0). Thanks to numerous visualisations, the course allows many theoretical concepts to be discussed very quickly. Train-test-validation by MLU-EXPLAIN

72 Cross-validation - concept
We can generalize the idea of the training, validation and testing set split to a much more complex and powerful solution, which is cross-validation (CV). CV is a technique for evaluating a machine learning model and testing its performance. More precisely, CV is a resampling method that uses different portions of the data to validate and train a model in different iterations. This approach is much more robust than a single train-validation split, because we shouldn't treat the value from a single validation as an ideal approximation of the ground truth.
There are two crucial reasons for validation/cross-validation usage:
● assessment of the quality of our model in a quasi-objective way (lower probability of overfitting)
● "safe" (again, lower probability of overfitting) execution of the hyperparameter tuning procedure (a hyperparameter is a parameter whose value is used to control the learning process; it is therefore not estimable and the researcher has to specify it by hand, based on intuition or via a hyperparameter search procedure, where CV is crucial)
[source: Wikipedia, Scikit-learn]

73 Cross-validation - different types
We can distinguish dozens of types of cross-validation, for instance (we will discuss some of them):
● Hold-out ● K-folds ● Leave-one-out (LOO) ● Leave-p-out ● Stratified K-folds ● Repeated K-folds ● Nested K-folds ● Time series CV
The need for multiple types stems from many factors: the specifics of the data (e.g. cross-sectional vs. time series data), the specifics of the business problem, the size of the dataset, the imbalance of the dataset, computing resources, the probability of data leakage, etc. In such a view, it is impossible to say which approach is best and should be the only one followed. However, in everyday use k-fold seems the most popular (for cross-sectional problems).
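A minimal sketch of stratified k-fold cross-validation of a simple classifier (scikit-learn; the imbalanced synthetic dataset and the choice of ROC AUC as the metric are illustrative):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # class frequencies preserved in each fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc")
print(scores)                        # one validation score per fold
print(scores.mean(), scores.std())   # a far more robust estimate than a single split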
74 Cross-validation - different types [source: Neptune.ai blog]
75 Cross-validation - different types [source: Neptune.ai blog]
76 Cross-validation - different types [source: Neptune.ai blog]
77 Cross-validation - different types [source: Neptune.ai blog]
78 Cross-validation - different types (target imbalance problem) [source: Neptune.ai blog]
79 Cross-validation - different types [source: Neptune.ai blog]
80 Cross-validation - different types (time series problems) [source: Neptune.ai blog]
81 Cross-validation - different types (time series problems) [source: Neptune.ai blog]

82 Cross-validation - different types (Nested CV)
Nested cross-validation is an extension of the above CVs, but it fixes one of the problems that we have with normal cross-validation. In normal cross-validation you only have a training and a testing set, on which you find the best hyperparameters. This may cause information leakage and significant bias: you would not want to estimate the error of your model on the same set of training and testing data on which you found the best hyperparameters. As the image below suggests, we have two loops. The inner loop is basically normal cross-validation with a search function, e.g. random search or grid search. The outer loop only supplies the inner loop with the training dataset, while the test dataset in the outer loop is held back. [source: ML from scratch]

83 Cross-validation - external materials end of the 3rd lecture
We use a Machine Learning University (MLU)-Explain course created by Amazon to present the concept of cross-validation. The course is made available under the Attribution-ShareAlike 4.0 International licence (CC BY-SA 4.0). Thanks to numerous visualisations, the course allows many theoretical concepts to be discussed very quickly. Cross Validation by MLU-EXPLAIN

84 Labs no. 2 - machine learning diagnostics with different evaluation metrics and dataset splits
Link to the materials: https://colab.research.google.com/drive/195_9tF4bbkyBqnix4-UqRXMff00tq3ZJ?usp=sharing

85 start of the 4th lecture

Chapter 4 Basic Supervised Learning models

86 K-nearest neighbours - general information
The K-nearest neighbours (KNN) algorithm is a basic and probably the simplest supervised machine learning algorithm for both classification and regression problems. Behind this algorithm is the following idea of locality: the best prediction for a certain observation is the known target value (label) of the observation from the training set that is most similar to the observation for which we are predicting.
The KNN algorithm belongs to the following groups of methods: it is non-parametric (it does not require an assumption about the sample distribution) and instance-based (it does not carry out an explicit learning process - it memorises the training set and creates predictions from it on the fly). The model does not generate computational costs at the time of learning; the entire computational cost lies on the side of making the prediction (lazy learning).
The regression version differs little from the classification approach. In the classification approach we vote for the most popular class among the neighbours, while in the regression problem we average the values of the target variable across the neighbours.
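A minimal sketch of KNN used both ways with scikit-learn (the iris data is just a convenient built-in example, and treating its label as a numeric target below is only meant to show the regressor's API):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

knn_clf = KNeighborsClassifier(n_neighbors=5)   # prediction = vote of the 5 nearest training points
knn_clf.fit(X_train, y_train)                   # "fitting" only stores the training set (lazy learning)
print(knn_clf.score(X_test, y_test))            # accuracy on the held-out data

knn_reg = KNeighborsRegressor(n_neighbors=5)    # same idea, but the neighbours' targets are averaged
knn_reg.fit(X_train, y_train.astype(float))
print(knn_reg.predict(X_test[:3]))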
87 K-nearest neighbours - general idea and formal algorithm (classification case)
KNN classification algorithm: [source: Intel Course: Introduction to Machine Learning, Application of K-Nearest Neighbor (KNN) Approach for Predicting Economic Events: Theoretical Background]

88 K-nearest neighbours - key hyperparameters
The three key hyperparameters of the KNN model are: ● the distance metric ● the number of neighbours k ● the weights of the individual neighbours.
Distance metrics allow us to formally define a measure of similarity between observations. Thanks to them we can determine whether two points lying in a multidimensional space are close to each other. In general, there are many ways to measure the distance between two points (X and Y) in space. The most popular of these are:
● Minkowski p-distance: d(X, Y) = (Σᵢ |xᵢ - yᵢ|^p)^(1/p)
● Euclidean distance: Minkowski distance with p = 2
● Manhattan distance: Minkowski distance with p = 1
● Chebyshev distance: Minkowski distance with p reaching infinity: d(X, Y) = maxᵢ |xᵢ - yᵢ|
[source: Wikipedia, Lyfat blog]

89 K-nearest neighbours - key hyperparameters
Additionally, we have to determine how many of the k nearest observations we would like to take into account in our computations (this will also significantly affect our decision boundary). There is a rule of thumb that the square root of the number of samples in our training set might be a good choice for k. However, in practice we should look for values smaller than the square root of n, and we use cross-validation for this task. Generally, the higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance (for instance, for k = 1 our algorithm will be characterised by overfitting and a large variance). We can observe this easily on the Bias/variance trade-off by MLU-EXPLAIN course page (paragraph dedicated to K-Nearest Neighbors).
KNN allows for weighting the neighbours during the final stage of the prediction (voting - classification, averaging - regression). In the default algorithm, all points in the neighbourhood are weighted the same. However, weighting by distance can also be introduced. There are many methods for this approach, e.g. weighting by inverse distance - this means that observations that are closer will have a higher impact on the fitted value. [source: Intel Course: Introduction to Machine Learning]

90 K-nearest neighbours - feature scaling (lack of homogeneity of features)
Distance metrics, in addition to their many advantages, also introduce a number of problems into KNN. First and foremost, they are absolute in nature, which can very strongly affect the correctness of KNN. A very common situation is that one or more explanatory variables (features) in our dataset are defined on a large domain (significantly larger than the rest of the variables) and these features have low predictive power. Such variables will strongly influence the distances and dominate the other variables in KNN. Because these variables are weak predictors, they will make our model very ineffective. In order to get rid of the above problem, it is necessary to use the technique of feature scaling (normalization, standardization, etc.). This is a necessary step for the KNN algorithm (it is worth trying numerous techniques on the same variable to check which one is best)!
The most popular scaling approaches for continuous variables are:
● standardization (z-score normalization): z = (x - u) / s
● rescaling (min-max normalization): z = (x - min(x))/(max(x) - min(x))
● quantile normalization
The most popular approaches for nominal variables are:
● one hot encoder (with potential rescaling of the 0-1 output to another range, for instance 0-2, 0-0.5, etc.)
● ordinal encoder with further rescaling to 0-1 or another convenient range
[source: Wikipedia]

91 K-nearest neighbours - other important information
KNN requires choosing a method to search our stored data for the k nearest neighbours. Brute-force searching, which is simply calculating the distance of our query from each point in our dataset, will work fairly well with small datasets, but becomes undesirably slow at larger scales. Tree-based approaches can make the search process more efficient by inferring distances. The two most popular algorithms are the K-D Tree and Ball Tree search algorithms.
Additionally, there is a curse of dimensionality problem in KNN. The KNN model makes the assumption that similar points share similar labels. It needs all points to be close along every dimension in the data space. However, each new dimension added makes it harder and harder for two specific points to be close to each other in every dimension. Unfortunately, in high-dimensional spaces, points that are drawn from a probability distribution tend to never be close together - "a high-dimensional space is a lonely place". The problem does not occur, for example, in the case of sparse matrices, or in image analysis (strong intragroup correlations lead to significant closeness in all dimensions). A good approach to solving the multidimensionality problem is to create multiple models on subsets of data (subsets of variables) and then average their results (ensemble technique - bagging).
In addition, bagging will also solve the problem of insignificant features. The KNN model is sensitive to variables with low predictive power. In such a case, variables should be selected in a very reasonable way, i.e. based on expert knowledge, but also using variable selection techniques, e.g. from general to specific or from specific to general, or other feature selection techniques. [source: Jeremy Jordan blog, Towards Data Science]

92 K-nearest neighbours - pros and cons
PROS: ● intuitive and simple ● lack of assumptions (non-parametric) ● no training step ● applicability to classification (binary and multiclass) and regression problems ● small number of hyperparameters ● handles specific problems very well (for instance problems with sparse matrices)
CONS: ● slow algorithm ● memory-exhausting algorithm ● curse of dimensionality ● low accuracy in many cases ● need for homogeneous features ● not suited for imbalanced problems (directly) ● lack of missing value treatment ● sensitive to the selection of variables and the use of unnecessary variables

93 Support Vector Machines - general information
The Support Vector Machine (SVM) is one of the fundamental non-parametric machine learning algorithms (and one of the most influential of its time). The main author of this model is Professor Vladimir Vapnik (one of the most recognizable researchers in the field of machine learning - interestingly, if we only consider Vapnik's 'key' publications for SVM development, it took more than 40 years from his first paper to his last).
The general idea of SVM is as follows: in a multi-dimensional space there exists a hyperplane which separates the classes in an optimal way. The goal of SVM is to find the hyperplane which maximizes the minimum distance (margin) between this hyperplane and the observations from both classes. The idea of the support vector machine was implemented originally for the classification problem, while after some adjustments it is applicable to the regression problem and even unsupervised learning (e.g. searching for outliers). [source: An Introduction to Statistical Learning]
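Before the formal treatment, a minimal sketch of an SVM classifier fitted with scikit-learn (synthetic data; features are scaled first, and the linear kernel and C value are arbitrary illustrative choices):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))   # C controls how soft the margin is
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))                                     # accuracy on the held-out data
print(svm.named_steps["svc"].n_support_)                             # number of support vectors per class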
94 Support Vector Machines - general idea
(figure panels: a hyperplane with misclassifications vs. hyperplanes with no misclassifications - but which position is best?)
GOAL: Create a hyperplane which runs perfectly in the middle between the classes and maximizes the region between them. [source: Intel Course: Introduction to Machine Learning]

95 SVM - formal definition of the maximal margin ("widest street") approach [source: MIT 6.034 Artificial Intelligence Lecture, Jeremy Jordan blog]

96 SVM - decision margins definition (constraint definition) [source: MIT 6.034 Artificial Intelligence Lecture, Jeremy Jordan blog]
