Document Details
Uploaded by CozyOctopus
Related
- Machine Learning 1_ classification methods - lectures-1.pdf
- Lecture 6: Machine Learning for Remote Sensing Image Processing - Part I PDF
- Machine learning.pdf
- Model Evaluation Presentation PDF
- Model Assessment (Evaluation/Validation) and Model Selection PDF
- CPEN 355 Logistic Regression Lecture Notes PDF
Full Transcript
Cross-validation - different types [source: Neptune.ai blog]
(A series of illustrative slides from the Neptune.ai blog showing the different cross-validation schemes, including variants for the target imbalance problem and for time series problems.)

Cross-validation - different types (Nested CV)
Nested cross-validation is an extension of the above CV schemes which fixes one of the problems we have with plain cross-validation. In plain cross-validation you only have a training and a testing set, which you use to find the best hyperparameters. This may cause information leakage and significant bias: you would not want to estimate the error of your model on the same training and testing data that you used to find the best hyperparameters. As the figure suggests, we have two loops. The inner loop is basically plain cross-validation with a search function, e.g. random search or grid search, while the outer loop supplies the inner loop only with its training folds and holds back its test fold for error estimation. [source: ML from scratch]

Cross-validation - external materials (end of the 3rd lecture)
We use a Machine Learning University (MLU)-Explain course created by Amazon to present the concept of cross-validation. The course is made available under the Attribution-ShareAlike 4.0 International licence (CC BY-SA 4.0). Thanks to numerous visualisations, the course allows many theoretical concepts to be discussed very quickly. Cross Validation by MLU-EXPLAIN

Labs no. 2 - machine learning diagnostics with different evaluation metrics and dataset splits
Link to the materials: https://colab.research.google.com/drive/195_9tF4bbkyBqnix4-UqRXMff00tq3ZJ?usp=sharing

start of the 4th lecture
Chapter 4: Basic Supervised Learning models

K-nearest neighbours - general information
The K-nearest neighbours (KNN) algorithm is a basic and probably the simplest supervised machine learning algorithm for both classification and regression problems. Behind it is the following idea of locality: the best prediction for a given observation is the known target value (label) of the observation in the training set that is most similar to the observation we are predicting for. KNN is non-parametric (it does not require assumptions about the sample distribution) and instance-based (it does not carry out an explicit learning process; it memorises the training set and creates predictions from it on the fly). The model generates no computational cost at training time, while the entire computational cost lies on the side of making the prediction (lazy learning). The regression version differs little from the classification approach: in classification the algorithm votes for the most popular class among the neighbours, while in regression it averages the values of the target variable across the neighbours.
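To make the locality idea above concrete, the following is a minimal from-scratch sketch of KNN classification with majority voting; the function, the toy data and the choice of Euclidean distance are illustrative assumptions rather than part of the lecture materials.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by a majority vote among its k nearest training points."""
    # "Training" is just memorising X_train and y_train (lazy learning);
    # all of the work happens here, at prediction time.
    distances = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distance to every training point
    nearest = np.argsort(distances)[:k]                    # indices of the k closest points
    votes = Counter(y_train[nearest])                      # count the labels of the neighbours
    return votes.most_common(1)[0][0]                      # most popular class wins

# Hypothetical toy data: two classes in 2-D
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9]), k=3))  # -> 1
```

For regression, the majority vote would simply be replaced by the mean (or a weighted mean) of y_train[nearest].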
K-nearest neighbours - general idea and formal algorithm (classification case)
(Figure: the KNN classification algorithm, step by step.) [source: Intel Course: Introduction to Machine Learning; Application of K-Nearest Neighbor (KNN) Approach for Predicting Economic Events: Theoretical Background]

K-nearest neighbours - key hyperparameters
The three key hyperparameters of the KNN model are:
● the distance metric
● the number of neighbours k
● the weights of the individual neighbours.
Distance metrics allow us to formally define a measure of similarity between observations: thanks to them we can determine whether two points lying in a multidimensional space are close to each other. In general, there are many ways to measure the distance between two points X and Y in space. The most popular are:
● Minkowski distance of order p: $d(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
● Euclidean distance: the Minkowski distance with p = 2, $d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
● Manhattan distance: the Minkowski distance with p = 1, $d(X, Y) = \sum_{i=1}^{n} |x_i - y_i|$
● Chebyshev distance: the Minkowski distance with p approaching infinity, $d(X, Y) = \max_{i} |x_i - y_i|$
[source: Wikipedia, Lyfat blog]

K-nearest neighbours - key hyperparameters
Additionally, we have to determine how many of the nearest observations (k) to take into account in our computations; this also significantly affects the decision boundary. A rule of thumb says that the square root of the number of samples in the training set may be a good choice for k, but in practice we should also consider values smaller than the square root of n, and we use cross-validation for this task. Generally, the higher k is, the higher the bias, and the lower k is, the higher the variance (for instance, for k = 1 the algorithm will overfit and show large variance). We can observe this easily on the Bias/variance trade-off page of the MLU-EXPLAIN course (paragraph dedicated to K-Nearest Neighbors). KNN also allows the neighbours to be weighted during the final stage of prediction (voting for classification, averaging for regression). In the default algorithm, all points in the neighbourhood are weighted equally, but weighting by distance can also be introduced. There are many methods for this, e.g. weighting by inverse distance, which means that closer observations have a higher impact on the fitted value. [source: Intel Course: Introduction to Machine Learning]
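As a practical complement to the hyperparameters discussed above, here is a minimal scikit-learn sketch that tunes k, the weighting scheme and the Minkowski order p with cross-validated grid search; the dataset and the candidate values in the grid are illustrative assumptions, not part of the lecture materials.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate values for the three key hyperparameters
param_grid = {
    "n_neighbors": [1, 3, 5, 7, 11, 15, 21],  # k
    "weights": ["uniform", "distance"],       # equal weights vs inverse-distance weighting
    "p": [1, 2],                              # Manhattan (p=1) vs Euclidean (p=2) distance
}

# NOTE: in practice the features should be scaled before fitting KNN
# (see the next slides); this is omitted here to keep the sketch focused.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```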
K-nearest neighbours - feature scaling (lack of homogeneity of features)
Distance metrics, in addition to their many advantages, also introduce a number of problems into KNN. First and foremost, they are absolute in nature, which can strongly affect the correctness of KNN. A very common situation is that one or more explanatory variables (features) in our dataset live on a much larger scale than the rest of the variables while having low predictive power. Such variables will strongly influence the distances and dominate the other variables and, because they are weak predictors, they will make our model very ineffective. To get rid of this problem it is necessary to apply feature scaling (normalization, standardization, etc.). This is a necessary step for the KNN algorithm (it is worth trying several techniques on the same variable to check which one works best).
The most popular scaling approaches for continuous variables are:
● standardization (z-score normalization): $x' = \frac{x - \mu}{\sigma}$, where $\mu$ and $\sigma$ are the mean and standard deviation of the variable
● rescaling (min-max normalization): $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$
● quantile normalization
The most popular encoding approaches for nominal variables are:
● one-hot encoder (with potential rescaling of the 0-1 range to another range, for instance 0-2 or 0-0.5)
● ordinal encoder with further rescaling to 0-1 or another convenient range
[source: Wikipedia]

K-nearest neighbours - other important information
KNN requires choosing a method to search the stored data for the k nearest neighbours. Brute-force search, which simply calculates the distance from the query to every point in the dataset, works fairly well for small datasets but becomes undesirably slow at larger scales. Tree-based approaches can make the search more efficient by inferring distances; the two most popular algorithms are the K-D Tree and Ball Tree search algorithms.
Additionally, KNN suffers from the curse of dimensionality. The model assumes that similar points share similar labels, so it needs points to be close along every dimension of the data space. However, each new dimension makes it harder and harder for two specific points to be close to each other in every dimension. Unfortunately, in high-dimensional spaces points drawn from a probability distribution tend never to be close together - "a high-dimensional space is a lonely place". The problem does not occur, for example, with sparse matrices or in image analysis (strong within-group correlations imply significant closeness in all dimensions). A good approach to the dimensionality problem is to build multiple models on subsets of the data (subsets of variables) and then average their results (an ensemble technique - bagging); a sketch of this idea follows the pros-and-cons summary below. Bagging also mitigates the problem of insignificant features: the KNN model is sensitive to variables with low predictive power, so variables should be selected very carefully, i.e. based on expert knowledge but also with variable selection techniques, e.g. general-to-specific or specific-to-general selection, or other feature selection methods. [source: Jeremy Jordan blog, Towards Data Science]

K-nearest neighbours - pros and cons
PROS:
● intuitive and simple
● lack of assumptions (non-parametric)
● no training step
● applicable to classification (binary and multiclass) and regression problems
● small number of hyperparameters
● handles specific problems very well (for instance, problems with sparse matrices)
CONS:
● slow algorithm
● memory-exhausting algorithm
● curse of dimensionality
● low accuracy in many cases
● needs homogeneous features
● not suited for imbalanced problems (directly)
● lack of missing-value treatment
● sensitive to the selection of variables and the use of unnecessary variables
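Building on the scaling and bagging remarks above, here is a minimal scikit-learn sketch that standardises the features before KNN and then bags several such models over random subsets of variables; the dataset and parameter values are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Feature scaling (z-score standardisation) is a necessary preprocessing step for KNN
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))
print("scaled KNN:", round(cross_val_score(knn, X, y, cv=5).mean(), 3))

# Bagging KNN models built on random subsets of the variables (max_features=0.5),
# which mitigates the curse of dimensionality and the impact of weak features
bagged = BaggingClassifier(knn, n_estimators=25, max_features=0.5, random_state=0)
print("bagged KNN:", round(cross_val_score(bagged, X, y, cv=5).mean(), 3))
```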
Support Vector Machines - general information
The Support Vector Machine (SVM) is one of the fundamental non-parametric machine learning algorithms (and one of the most influential of its time). The main author of this model is Professor Vladimir Vapnik, one of the most recognisable researchers in the field of machine learning (interestingly, if we consider only Vapnik's key publications on SVM development, more than 40 years passed between his first paper and his last). The general idea of SVM is as follows: in a multi-dimensional space there exists a hyperplane which separates the classes in an optimal way. The goal of SVM is to find the hyperplane which maximises the minimum distance (margin) between this hyperplane and the observations from both classes. The support vector machine was originally designed for the classification problem, but after some adjustments it is also applicable to regression and even to unsupervised learning (e.g. searching for outliers). [source: An Introduction to Statistical Learning]

Support Vector Machines - general idea
(Figure: several candidate separating lines - some produce misclassifications, others separate the classes without error but are not optimally positioned.) GOAL: create a hyperplane which runs perfectly in the middle between the classes and maximises the region (margin) between them. [source: Intel Course: Introduction to Machine Learning]
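To make the margin-maximisation goal above explicit, here is a sketch of the standard hard-margin formulation for linearly separable classes with labels $y_i \in \{-1, +1\}$; the notation follows common textbook conventions (e.g. An Introduction to Statistical Learning) rather than the slides themselves.

```latex
% Maximal margin classifier: choose the hyperplane w^T x + b = 0 with the largest margin M
\begin{aligned}
\max_{w,\, b,\, \|w\|=1} \quad & M \\
\text{subject to} \quad & y_i \left( w^\top x_i + b \right) \ge M, \qquad i = 1, \dots, n.
\end{aligned}

% Equivalent convex reformulation (rescale w so that the margin equals 1/\|w\|):
\begin{aligned}
\min_{w,\, b} \quad & \tfrac{1}{2} \|w\|^2 \\
\text{subject to} \quad & y_i \left( w^\top x_i + b \right) \ge 1, \qquad i = 1, \dots, n.
\end{aligned}
```

The observations that satisfy the constraint with equality lie exactly on the margin; these are the support vectors, and they alone determine the position of the separating hyperplane.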