

Transcript

Cross-validation - different types
[Figures only: a series of diagrams illustrating the different cross-validation schemes, including variants for the target imbalance problem and for time series problems. Source: Neptune.ai blog]

Cross-validation - different types (Nested CV)
Nested cross-validation is an extension of the schemes above that fixes one of the problems of plain cross-validation. In plain cross-validation you only have a training and a testing set, and you use them to find the best hyperparameters. This may cause information leakage and significant bias: you would not want to estimate the error of your model on the same training and testing data that you used to find the best hyperparameters. Nested CV therefore uses two loops. The inner loop is essentially normal cross-validation combined with a search function, e.g. random search or grid search. The outer loop supplies the inner loop only with its training folds, while the test fold of the outer loop is held back for an unbiased error estimate. [source: ML from scratch]
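A minimal sketch of nested cross-validation in scikit-learn, assuming an illustrative estimator, parameter grid and dataset (none of these choices come from the lecture):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Inner loop: ordinary CV + grid search, used only to select hyperparameters.
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7, 9, 11]},
    cv=inner_cv,
)

# Outer loop: each outer test fold is held back from the inner search,
# so the resulting score estimates generalisation error without the
# selection bias of plain cross-validation.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
nested_scores = cross_val_score(search, X, y, cv=outer_cv)
print(nested_scores.mean(), nested_scores.std())
```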
Cross-validation - external materials (end of the 3rd lecture)
We use the Machine Learning University (MLU)-Explain course created by Amazon to present the concept of cross-validation. The course is made available under the Attribution-ShareAlike 4.0 International licence (CC BY-SA 4.0). Thanks to its numerous visualisations, the course allows many theoretical concepts to be discussed very quickly.
Cross Validation by MLU-EXPLAIN

Labs no. 2 - machine learning diagnostics with different evaluation metrics and dataset splits
Link to the materials: https://colab.research.google.com/drive/195_9tF4bbkyBqnix4-UqRXMff00tq3ZJ?usp=sharing

Start of the 4th lecture
Chapter 4: Basic Supervised Learning models

K-nearest neighbours - general information
The K-nearest neighbours (KNN) algorithm is a basic and probably the simplest supervised machine learning algorithm for both classification and regression problems. Behind this algorithm is the following idea of locality: the best prediction for a given observation is the known target value (label) of the training observation that is most similar to the observation we are predicting for. KNN is non-parametric (it does not require any assumption about the sample distribution) and instance-based (it does not carry out an explicit learning process - it memorises the training set and creates predictions from it on the fly). The model generates no computational cost at training time; the entire computational cost lies on the side of making predictions (lazy learning). The regression version differs little from the classification approach: in classification the neighbours vote for the most popular class, while in regression the values of the target variable are averaged across the neighbours.

K-nearest neighbours - general idea and formal algorithm (classification case)
[Figure only: the formal KNN classification algorithm. Source: Intel Course: Introduction to Machine Learning; Application of K-Nearest Neighbor (KNN) Approach for Predicting Economic Events: Theoretical Background]

K-nearest neighbours - key hyperparameters
The three key hyperparameters of the KNN model are:
● the distance metric
● the number of neighbours k
● the weights of the individual neighbours.
Distance metrics allow us to formally define a measure of similarity between observations. Thanks to them we can determine whether two points lying in a multidimensional space are close to each other. In general, there are many ways to measure the distance between two points (X and Y) in space. The most popular are:
● Minkowski p-distance
● Euclidean distance: Minkowski distance with p = 2
● Manhattan distance: Minkowski distance with p = 1
● Chebyshev distance: the limit of the Minkowski distance as p goes to infinity.
[source: Wikipedia, Lyfat blog]
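The formulas on this slide appeared only as images; for reference, the standard definitions for points X = (x1, ..., xn) and Y = (y1, ..., yn) are:

```latex
\begin{align*}
d_{\text{Minkowski}}(X, Y) &= \Bigl(\sum_{i=1}^{n} |x_i - y_i|^p\Bigr)^{1/p} \\
d_{\text{Euclidean}}(X, Y)  &= \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} && (p = 2) \\
d_{\text{Manhattan}}(X, Y)  &= \sum_{i=1}^{n} |x_i - y_i| && (p = 1) \\
d_{\text{Chebyshev}}(X, Y)  &= \max_{i} |x_i - y_i| && (p \to \infty)
\end{align*}
```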
K-nearest neighbours - key hyperparameters
Additionally, we have to determine how many of the k nearest observations we would like to take into account in our computations (this also significantly affects the decision boundary). A rule of thumb says that the square root of the number of samples in the training set may be a good choice for k. In practice, however, we should look for values smaller than the square root of n, and we use cross-validation for this task (see the pipeline sketch after the pros-and-cons slide below). Generally, the higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance (for instance, for k = 1 the algorithm will be characterised by overfitting and a large variance). We can observe this easily via the Bias/variance trade-off page of the MLU-EXPLAIN course (paragraph dedicated to K-Nearest Neighbors). KNN also allows the neighbours to be weighted during the final stage of prediction (voting - classification, averaging - regression). In the default algorithm, all points in the neighbourhood are weighted equally, but weighting by distance can also be introduced. There are many methods for this, e.g. weighting by inverse distance, which means that observations that are closer have a higher impact on the fitted value. [source: Intel Course: Introduction to Machine Learning]

K-nearest neighbours - feature scaling (lack of homogeneity of features)
Distance metrics, in addition to their many advantages, also introduce a number of problems into KNN. First and foremost, they are absolute in nature, which can very strongly affect the correctness of KNN. A very common situation is that one or more explanatory variables (features) in our dataset live on a much larger scale than the rest of the variables while having low predictive power. Such variables will strongly influence the distances and dominate the other variables; because they are weak predictors, they will make our model very ineffective. To get rid of this problem it is necessary to use feature scaling (normalization, standardization, etc.). This is a necessary step for the KNN algorithm (it is worth trying several techniques on the same variable to check which one works best)!
The most popular scaling approaches for continuous variables are:
● standardization (z-score normalization): x' = (x - mean) / standard deviation
● rescaling (min-max normalization): x' = (x - min) / (max - min)
● quantile normalization
The most popular encoding approaches for nominal variables are:
● one-hot encoder (with potential rescaling of 0-1 to another range, for instance 0-2, 0-0.5, etc.)
● ordinal encoder with further rescaling to 0-1 or another convenient range
[source: Wikipedia]

K-nearest neighbours - other important information
KNN requires choosing a method to search the stored data for the k nearest neighbours. Brute-force search, which simply calculates the distance of our query from each point in the dataset, works fairly well for small datasets but becomes undesirably slow at larger scales. Tree-based approaches can make the search process more efficient by inferring distances; the two most popular are the K-D Tree and Ball Tree search algorithms.
Additionally, KNN suffers from the curse of dimensionality. The KNN model assumes that similar points share similar labels, and it needs points to be close along every dimension of the data space. However, each new dimension makes it harder and harder for two specific points to be close to each other in every dimension. Unfortunately, in high-dimensional spaces points drawn from a probability distribution tend never to be close together - "a high-dimensional space is a lonely place". The problem does not occur, for example, with sparse matrices or in image analysis (strong intragroup correlations produce significant closeness in all dimensions). A good approach to the multidimensionality problem is to create multiple models on subsets of the data (subsets of variables) and then average their results (ensemble technique - bagging; see the sketch after the pros-and-cons slide below). In addition, bagging also mitigates the problem of insignificant features. The KNN model is sensitive to variables with low predictive power; in such a case, variables should be selected in a very deliberate way, i.e. based on expert knowledge, but also using variable selection techniques, e.g. general-to-specific or specific-to-general, or other feature selection techniques. [source: Jeremy Jordan blog, Towards Data Science]

K-nearest neighbours - pros and cons
PROS:
● intuitive and simple
● lack of assumptions (non-parametric)
● no training step
● applicable to classification (binary and multiclass) and regression problems
● small number of hyperparameters
● handles specific problems very well (for instance: problems with sparse matrices)
CONS:
● slow algorithm
● memory-exhausting algorithm
● curse of dimensionality
● low accuracy in many cases
● need for homogeneous features
● not suited for imbalanced problems (directly)
● lack of missing value treatment
● sensitive to the selection of variables and the use of unnecessary variables
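A minimal sketch, assuming a scikit-learn workflow (the dataset and grid values below are illustrative), of how feature scaling and the key KNN hyperparameters (k, weights, Minkowski p) referenced above are usually combined and tuned with cross-validation:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

# Scaling is fitted inside the pipeline, so each CV fold is scaled
# using statistics from its own training part only (no leakage).
pipe = Pipeline([
    ("scaler", StandardScaler()),                      # z-score normalization
    ("knn", KNeighborsClassifier(algorithm="auto")),   # auto picks brute / kd_tree / ball_tree
])

# Search for k below sqrt(n_train), together with the weighting scheme
# and the Minkowski exponent p.
param_grid = {
    "knn__n_neighbors": [3, 5, 7, 9, 11, 15, 21],
    "knn__weights": ["uniform", "distance"],   # equal vs inverse-distance weighting
    "knn__p": [1, 2],                          # Manhattan vs Euclidean
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```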
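And a sketch of the bagging-over-feature-subsets idea mentioned on the "other important information" slide, again with illustrative settings (parameter names follow recent scikit-learn versions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Each base KNN sees only a random subset of features (and of rows),
# which mitigates the curse of dimensionality and the impact of
# irrelevant variables; the ensemble then averages the predictions.
bagged_knn = BaggingClassifier(
    estimator=make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7)),
    n_estimators=50,
    max_features=0.5,   # a random 50% of the variables per model
    max_samples=0.8,
    random_state=0,
)
print(cross_val_score(bagged_knn, X, y, cv=5).mean())
```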
Support Vector Machines - general information
The Support Vector Machine (SVM) is one of the fundamental non-parametric machine learning algorithms (and one of the most influential of its time). The main author of this model is Professor Vladimir Vapnik, one of the most recognisable researchers in the field of machine learning (interestingly, if we consider only Vapnik's publications that were 'key' to SVM development, more than 40 years passed between his first paper and his last). The general idea of SVM is as follows: in a multi-dimensional space there exists a hyperplane which separates the classes in an optimal way. The goal of SVM is to find the hyperplane which maximizes the minimum distance (margin) between this hyperplane and the observations from both classes. The idea of the support vector machine was originally implemented for the classification problem, but after some adjustments it is also applicable to the regression problem and even to unsupervised learning (e.g. searching for outliers). [source: An Introduction to Statistical Learning]

Support Vector Machines - general idea
[Figures only: hyperplanes with misclassifications versus hyperplanes with no misclassifications ("but is this the best position?"). GOAL: create a hyperplane which runs perfectly in the middle between the classes and maximises the region between them. Source: Intel Course: Introduction to Machine Learning]
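A minimal sketch of a maximum-margin classifier with scikit-learn, assuming a nearly linearly separable toy dataset (the dataset and settings below are illustrative, not from the lecture):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs -> (almost) linearly separable classes.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)

# A linear SVM with a large C approximates the hard-margin case:
# it looks for the hyperplane w.x + b = 0 that maximises the margin.
svm = SVC(kernel="linear", C=1e3)
svm.fit(X, y)

w, b = svm.coef_[0], svm.intercept_[0]
margin = 2 / np.linalg.norm(w)          # width of the maximised region
print("hyperplane:", w, b)
print("margin width:", margin)
print("number of support vectors:", len(svm.support_vectors_))
```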
