
Logistic Regression - Additional Materials PDF


Summary

This document provides supplementary lecture materials on logistic regression, covering its relationship to linear regression, the sigmoid function, cost functions, multinomial extensions, and Generalized Linear Models. It then opens a second chapter on crucial machine learning techniques: problem definition, dataset preparation, exploratory data analysis, missing-data imputation, and feature engineering. The material is part of a lecture series on machine learning and econometrics.

Full Transcript


Linear regression - additional materials (end of the 1st lecture)

Key assumptions in OLS and the BLUE concept. [source: Practical Econometrics and Data Science]

Logistic regression - general information (start of the 2nd lecture)

Logistic regression is a basic supervised learning algorithm for predicting nominal binary variables (dichotomous variables) from a set of independent variables. As with linear regression, from an econometric point of view logistic regression is primarily used for inference. However, the interpretation of logistic regression results is much more difficult: we cannot interpret the coefficients directly, so we use marginal effects and odds. During this course we look at logistic regression from the machine learning perspective, i.e. we are mostly interested in prediction. At the same time, we recommend a very good course teaching the principles of logistic regression from the econometric perspective (chapter 5.2). A natural generalisation of logistic regression to classification with more than two classes is multinomial logistic regression. It is worth knowing that logistic regression is just one model from the entire class of Generalized Linear Models (GLMs); more details about GLMs and their families can be found in the linked materials.

Logistic regression - external materials

We use the Machine Learning University (MLU)-Explain course created by Amazon to present the concepts, mathematical foundations, and interpretation of logistic regression. The course is made available under the Attribution-ShareAlike 4.0 International licence (CC BY-SA 4.0). Thanks to its numerous visualisations, the course allows many theoretical concepts to be discussed very quickly. Logistic regression by MLU-EXPLAIN.

Logistic regression - additional materials

Linear regression applied to a binary classification problem. [figure; source: Intel Course: Introduction to Machine Learning]

Sigmoid function

A sigmoid function is a mathematical function having a characteristic "S"-shaped curve or sigmoid curve. [figure: several sigmoid functions compared] A common example of a sigmoid function is the logistic function:

sigma(x) = 1 / (1 + e^(-x))

The logistic function has many useful properties:
1. it maps the solution space to probabilities - the output range is from 0 to 1
2. it is differentiable - important from the perspective of the optimization problem
3. it uses an exponential - most outputs are "attached" to 0 or 1 (not in the ambiguous middle zone)
[source: Wikipedia]

Logistic regression applied to a binary classification problem. [figure; source: Intel Course: Introduction to Machine Learning]

The relationship between logistic and linear regression

Logistic function: p = 1 / (1 + e^(-(b0 + b1*x1 + ... + bk*xk)))
Odds: p / (1 - p) = e^(b0 + b1*x1 + ... + bk*xk)
Log odds (or logit function): ln(p / (1 - p)) = b0 + b1*x1 + ... + bk*xk

so the logit of the predicted probability is exactly the linear predictor known from linear regression. [source: Intel Course: Introduction to Machine Learning]

Logistic regression - decision boundary

By default an observation is classified as positive when the predicted probability exceeds 0.5, which corresponds to a linear decision boundary in the feature space. [figure]

Logistic regression - cost function

We use cross-entropy (log-loss) as the cost function for logistic regression:

J = -(1/n) * sum_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]

[source: Intel Course: Introduction to Machine Learning]
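To tie the above together, here is a minimal NumPy sketch of the sigmoid function, the odds / log-odds relationship and the binary cross-entropy (log-loss) cost. The toy data X, y and the coefficients w, b are made-up illustrations, not values from the course materials.

```python
# A minimal NumPy sketch of the pieces above: the sigmoid (logistic) function,
# the odds / log-odds relationship, and the cross-entropy (log-loss) cost.
# The toy data X, y and the coefficients w, b are illustrative assumptions.
import numpy as np

def sigmoid(z):
    """Map any real number into the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0.5, 1.2], [1.5, -0.3], [-0.7, 0.8], [2.0, 1.0]])  # 4 obs, 2 features
y = np.array([1, 1, 0, 1])                                        # binary target

w = np.array([0.8, -0.5])  # assumed (not fitted) coefficients
b = 0.1                    # assumed intercept

z = X @ w + b              # linear predictor, same form as in linear regression
p = sigmoid(z)             # predicted probabilities P(y = 1 | x)

odds = p / (1.0 - p)       # odds
log_odds = np.log(odds)    # logit; numerically equal to the linear predictor z

# Binary cross-entropy (log-loss), averaged over observations.
log_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(p, log_odds, log_loss)
```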
Multinomial logistic regression - one vs all approach. [figure; source: Intel Course: Introduction to Machine Learning]

Multinomial logistic regression & softmax function

Let's assume that we have k classes. We can define multinomial logistic regression using the softmax formula:

P(y = i | x) = e^(f_i(x)) / (e^(f_1(x)) + ... + e^(f_k(x)))

where f_i(x) is the linear predictor function (a linear regression) used to predict that a given observation has outcome i. The cost function for multinomial logistic regression is the generalization of log-loss to cross-entropy for k > 2. We calculate a separate loss for each class label per observation and sum the results:

CE = - sum over observations o and classes j = 1..k of y_(o,j) * log(p_(o,j))

where y_(o,j) is a binary indicator (0 or 1) of whether class label j is the correct classification for observation o, and p_(o,j) is the predicted probability that observation o belongs to class j. (A short NumPy sketch of the softmax and this loss follows the Generalized Linear Models overview below.)

Generalized Linear Models

The GLM is a flexible generalization of ordinary linear regression. It generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value. [source: Stanford CS 229, Wikipedia]

Generalized Linear Models - examples (note that a different notation is used here - be careful!). [figure; source: Time series reasoning]
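As referenced above, a small NumPy sketch of the softmax formulation of multinomial logistic regression and the k-class cross-entropy loss. The score matrix and the one-hot labels are invented purely for illustration.

```python
# A small NumPy sketch of the softmax formulation of multinomial logistic
# regression and the k-class cross-entropy loss. The score matrix and the
# one-hot labels below are invented purely for illustration.
import numpy as np

def softmax(scores):
    """Row-wise softmax: turn k linear predictors into k class probabilities."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Linear predictor values f_i(x) for 3 observations and k = 3 classes.
scores = np.array([[ 2.0, 0.5, -1.0],
                   [ 0.1, 0.2,  0.3],
                   [-0.5, 1.5,  0.0]])
P = softmax(scores)                  # predicted probabilities, each row sums to 1

# One-hot indicators y_(o, j): the correct class for each observation.
Y = np.array([[1, 0, 0],
              [0, 0, 1],
              [0, 1, 0]])

# Cross-entropy: a separate loss per class label per observation, summed,
# here also averaged over the number of observations.
cross_entropy = -np.sum(Y * np.log(P)) / len(Y)
print(P, cross_entropy)
```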
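To make the GLM framing concrete, here is a hedged statsmodels sketch that fits logistic regression as a GLM with the binomial family (and its default logit link) on randomly generated toy data; swapping the family object gives other members of the GLM class.

```python
# A hedged statsmodels sketch: logistic regression expressed as a GLM with the
# binomial family (default logit link), fitted on randomly generated toy data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# A true relationship assumed only for this toy example.
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 1.2 * X[:, 0] - 0.8 * X[:, 1])))
y = rng.binomial(1, p_true)

X_const = sm.add_constant(X)                                # add an intercept column
model = sm.GLM(y, X_const, family=sm.families.Binomial())   # logit link by default
result = model.fit()
print(result.summary())

# Swapping the family, e.g. sm.families.Poisson() or sm.families.Gaussian(),
# gives other members of the same GLM class.
```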
Chapter 2: Crucial machine learning techniques (part 1)

Problem definition - problem statement worksheet approach

At the beginning of any ML research/consulting project, it is good practice to formulate a problem statement worksheet - a document that formalizes, at a basic level, the definition of the business task we will be tackling (what, why, how). This document is an excellent initiator (it constitutes a binding document between the parties) and allows you to plan a complete project using nearly any project management technique (for instance Scrum). The masters of preparing such worksheets are consulting firms (e.g. the top 3: McKinsey, BCG, Bain). Let's analyze the worksheet template used by McKinsey. [figure; source: Betty Wu Talk]

Dataset preparation steps

After collecting the business requirements and designing the project/experiment, the next step is to prepare the data for the project implementation. We usually distinguish the following elements of such a data preparation process:
1. Defining / selecting the necessary data and their sources
2. Data ingestion - data extraction from source systems
3. Transforming data to a convenient analytical form (preferably homogeneous for all sets)
4. Initial data exploration (e.g. with visualizations) and validation
5. Combining data sets into one (if effective at this stage - depending on the project)
6. Optional division of the set into {train, validation, test} or {train+validation, test}. Note: all transformations performed on the training set should be applicable later on the test set, e.g. parameters learned on the training set for normalisation should also be used for the same transformation on the test set.
7. Data cleaning, e.g. missing-value imputation
8. Feature engineering - generation of new variables
9. Extensive exploratory data analysis (preferably using visualizations and statistics)
10. Initial feature selection
11. Optional balancing of the target variable classes
12. Efficient saving of the data to a universal format (preferred by the models)

Exploratory data analysis

The exploratory data analysis (EDA) stage is designed to help us build up both a general and a detailed picture of the data we have (EDA is used to summarize their main characteristics). In the first instance, we can perform a relatively simple visual analysis. In the case of tabular data, this involves analysing the data stored in a data frame (in a project using images, we would instead look at images from the set). Furthermore, we should use visualisation techniques (univariate - a single variable, bivariate - two variables, and multivariate - several variables) to better analyse the data. The image on the right shows a 'guide' to visualisations by application. [figure; source: TapClicks]

Exploratory data analysis with statistics

In addition to visual analysis, it is also useful to use statistical tools to explore properties of and relationships between the data. In the first instance, univariate analysis is recommended: we want to examine the properties of single variables, e.g. using 1) descriptive statistics (frequency table, mean, standard deviation, skewness, kurtosis, quantiles), 2) one-sample tests, 3) tests for autocorrelation and white noise (for time series), etc. Next, we want to check multivariate statistics. The table on the right shows the most common measures and tests of association between variables of different types (source: Statistics & Explanatory Data Analysis course, Dr Marcin Chlebus, Dr Ewa Cukrowska-Torzewska). In addition, we can use various unsupervised learning techniques here, e.g. dimension reduction (principal component analysis), etc.

Data imputation

When working with real data you will always encounter the problem of missing values. There can be many reasons for this, but first and foremost missing values can be attributed to: the data does not exist, the data was not captured due to a hardware, software or human error, the data was deleted, etc. The figure on the right shows the most general classification of reasons for missing data. It is always worth checking where the lack of data comes from, because perhaps it can be supplemented through an additional data extraction process (without the need to use data mining techniques). We distinguish the following techniques for dealing with the problem of missing values:
● do nothing (some machine learning algorithms can deal with this problem automatically)
● remove missing variables/columns (consider it when the variable has more than ~10% missings and is not crucial for the analysis) or examples/rows (avoid if possible, especially if your dataset is small and the missings are not random)
● fill in (impute) the missing values using:
  ○ univariate techniques (use only one feature):
    ■ for continuous variables: use a constant (like 0); use statistics like mean/median/mode (globally or in a subgroup); use a random value from an assumed distribution
    ■ for categorical variables: encode missing as an additional "missing" category; replace with the mode (globally or in a subgroup); replace randomly from the non-missing values
    ■ for time series variables: use the last or next observed value; use linear/polynomial/spline interpolation
  ○ multivariate techniques (use multiple features): use KNN or another supervised ML algorithm; use Multivariate Imputation by Chained Equations
[source: Kaggle]
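A minimal scikit-learn sketch of the imputation options above; it also illustrates the earlier note from the dataset-preparation list that parameters are learned on the training set and only applied to the test set. The tiny data frames and column names are made up.

```python
# A minimal scikit-learn sketch of the imputation options above. The tiny data
# frames and column names (age, income) are made up; note how the imputer is
# fitted on the train set only and then reused on the test set.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

train = pd.DataFrame({"age": [25, np.nan, 40, 31], "income": [50, 62, np.nan, 48]})
test = pd.DataFrame({"age": [np.nan, 29], "income": [55, np.nan]})

# Quick check: share of missing values per column.
print(train.isna().mean())

# Univariate imputation: the median is learned on train and reused on test.
imp = SimpleImputer(strategy="median")
train_imp = imp.fit_transform(train)   # fit + transform on the training set
test_imp = imp.transform(test)         # transform only on the test set

# A multivariate alternative mentioned on the slide: KNN-based imputation.
knn = KNNImputer(n_neighbors=2)
train_knn = knn.fit_transform(train)
test_knn = knn.transform(test)
print(test_imp, test_knn)
```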
Feature engineering

Feature engineering, i.e. the generation of new variables, is a key stage of modelling (the correctness of this process will determine the quality of the model). It is performed at several moments, of which the two most popular are during the ETL process and after the ETL process:
1) During the ETL process, we focus on so-called analytical engineering (domain knowledge is key here), i.e. we try to transform our sets into a form consumable by our models - most often using various types of aggregations based on descriptive statistics, e.g. mean/median/quantiles (for example, as a bank estimating a customer's credit risk, we aggregate his/her history for the selected period into one observation).
2) After the ETL process, the matter is more complicated, because we focus on creating additional variables or processing existing ones in order to improve the predictive power of our algorithm (this requires particular creativity - it is a kind of art). For instance, some algorithms are not able to capture non-linear relationships (OLS/SVM/KNN), so we have to feed them with variables that make this possible.

Let's discuss the most popular feature engineering techniques. (Attention! Each of the techniques requires fitting on the train set, and the fitted transformation should then be applied to the test set!)

Numeric variable transformations (only the most important):
● scaling to a range (min-max scaler): z = (x - min(x)) / (max(x) - min(x)) (recommendation: when the feature is more-or-less uniformly distributed across a fixed range)
● clipping (winsorization): if x > max, then z = max; if x < min, then z = min (recommendation: when the feature contains some extreme outliers)
● log scaling: z = log(x) (recommendation: when the feature conforms to a power law)
● z-score (standard scaler): z = (x - u) / s (recommendation: when the feature distribution does not contain extreme outliers)
● quantile transformer: map the data to a uniform distribution with values between 0 and 1
● power transformer (the Yeo-Johnson and Box-Cox transforms): map data from any distribution to as close to a Gaussian distribution as possible, in order to stabilize variance and minimize skewness
● bucketing (discretization of continuous variables) with: 1) equally spaced boundaries, 2) quantile boundaries, 3) bivariate decision tree boundaries, 4) expert-given boundaries - all in all, it can help with both overfitting and modelling non-linearity
● polynomial transformer, spline transformer, rounding, replacing with PCA and any other arithmetic operation
(A short scikit-learn sketch of these transformations follows after the categorical encodings below.)
[source: Google]

Feature engineering - cont'd (end of the 2nd lecture)

Categorical variable transformations (only the most important):
● one-hot encoding - transforms each categorical feature with n possible values (n categories or levels) into n binary features, one of which is 1 and all others 0. Sometimes we have to drop one category (one binary feature) to avoid perfect collinearity in the input matrix for some estimators; in most cases we will drop the most frequent category (for instance, OLS without this dropping is impossible to compute, whereas OLS with regularization like L2 or Lasso works well with such collinearity, and then we should not remove any level). This approach supports aggregating infrequent categories into a single output per feature.
● ordinal encoder - transforms each categorical feature into one new feature of integers. Be careful with this encoding, because passing such a variable directly to the model will impose an ordering on the categories.
● there are a lot of other super powerful encoders, such as: BaseN, CatBoost Encoder, Count Encoder, Hashing, Helmert Coding, James-Stein Encoder, Leave One Out, Polynomial Coding, Quantile Encoder, Sum Coding, Summary Encoder, Target Encoder, Weight of Evidence, etc. (check out the source for more information). (I personally use one-hot most often when doing econometrics, but when doing ML I love the CatBoost Encoder, credit risk bankers love WoE, etc.)
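As promised above, a short scikit-learn sketch of the numeric transformations from the previous slide (min-max scaling, clipping, log scaling, z-score, quantile and power transformers, bucketing). The toy column and all parameter choices are illustrative assumptions, not recommendations.

```python
# A short scikit-learn sketch of the numeric transformations listed above.
# The toy column and all parameter choices are illustrative assumptions.
import numpy as np
from sklearn.preprocessing import (KBinsDiscretizer, MinMaxScaler,
                                   PowerTransformer, QuantileTransformer,
                                   StandardScaler)

x = np.array([[1.0], [2.0], [2.5], [3.0], [100.0]])      # contains one extreme outlier

minmax = MinMaxScaler().fit_transform(x)                 # scaling to a range
zscore = StandardScaler().fit_transform(x)               # z-score standardisation
logged = np.log(x)                                       # log scaling (requires x > 0)
clipped = np.clip(x, a_min=1.0, a_max=3.0)               # clipping / winsorization
uniform = QuantileTransformer(n_quantiles=5).fit_transform(x)          # map to [0, 1]
gaussianish = PowerTransformer(method="yeo-johnson").fit_transform(x)  # near-Gaussian
buckets = KBinsDiscretizer(n_bins=3, encode="ordinal",
                           strategy="quantile").fit_transform(x)       # bucketing
print(minmax.ravel(), buckets.ravel())
```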
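And a companion sketch of the two basic categorical encoders discussed above, using scikit-learn; the more advanced encoders listed on the slide (CatBoost Encoder, WoE, Target Encoder, ...) are available in the category_encoders package referenced as the source. The toy column is made up.

```python
# A companion sketch of the two basic categorical encoders discussed above,
# using scikit-learn; the fancier encoders listed on the slide (CatBoost, WoE,
# Target Encoder, ...) live in the category_encoders package. The toy column
# is made up.
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

colors = np.array([["red"], ["green"], ["blue"], ["green"], ["red"]])

# One binary column per category. drop="first" removes one column to avoid
# perfect collinearity (to drop a specific category, e.g. the most frequent
# one as suggested on the slide, pass it explicitly via drop=[...]); the
# min_frequency argument can aggregate infrequent categories.
onehot = OneHotEncoder(drop="first", sparse_output=False)
print(onehot.fit_transform(colors))

# A single integer column; beware that this imposes an artificial ordering.
ordinal = OrdinalEncoder()
print(ordinal.fit_transform(colors))
```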
Interactions between variables - of course, we can also look for interactions between our variables (numeric & numeric, categorical & categorical, or numeric & categorical). We can try multiplication, division, subtraction and basically anything that maths and our imagination allow us to do.

Keep in mind that the possibilities for feature engineering are endless. The advantage of machine learning over classical econometrics is that in most cases we are interested in the result itself, not in how we arrived at it, so the variables we produce may have low interpretability. In addition, some algorithms (especially those based on decision trees) are able to select the relevant variables themselves and marginalize the irrelevant ones. However, my years of experience show that we should not overdo our creativity - in financial problems, the best variables created in feature engineering are usually those that have a strong business/economic/theoretical basis. Still, it is always good to look for happiness in numbers :) [source: Category Encoders]

Labs no. 1 - introduction to exploratory data analysis, data wrangling, data engineering and modeling using econometric models

The data we will be working with can be accessed through the following link: https://www.kaggle.com/competitions/home-credit-default-risk/data. Link to the materials: Link
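For the lab itself, a tiny starter sketch for loading the competition data, assuming the files have been downloaded locally from the Kaggle link above; the application_train.csv file name and the TARGET column are assumptions based on that competition's data page.

```python
# A tiny starter sketch for the lab data, assuming the files have been
# downloaded locally from the Kaggle competition linked above. The
# application_train.csv file name and the TARGET column are assumptions
# based on that competition's data page.
import pandas as pd

df = pd.read_csv("application_train.csv")
print(df.shape)
print(df["TARGET"].value_counts(normalize=True))                # class balance
print(df.isna().mean().sort_values(ascending=False).head(10))   # worst missingness
```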
