Machine Learning - Overview PDF
Document Details
Uploaded by CozyOctopus
Summary
This document provides a general overview of machine learning, outlining different types of machine learning and their key concepts. It presents various definitions and examples. The text explores supervised, unsupervised, semi-supervised and reinforcement learning.
Full Transcript
What is Machine Learning?
Machine learning (ML) is the process of using mathematical models of data to help a computer learn without direct instruction. It is considered a subset of artificial intelligence (AI). Machine learning uses algorithms to identify patterns within data, and those patterns are then used to create a data model that can make predictions. With increased data and experience, the results of machine learning become more accurate, much like how humans improve with more practice. The adaptability of machine learning makes it a great choice in scenarios where the data is always changing, the nature of the request or task is always shifting, or coding a solution would be effectively impossible. [source: Microsoft Azure]

Types of Machine Learning
● Supervised learning
● Unsupervised learning
● Semi-supervised learning
● Reinforcement learning
[source: IBM Developer]

Supervised learning
Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value. It is worth noting that econometrics belongs to the supervised learning family. [source: Wikipedia]
In other words: given a set of data points {x1, …, xi} associated with a set of outcomes {y1, …, yi}, we build a model that learns to predict y from x.
Types of supervised learning:
● Regression - the outcome is continuous
● Classification - the outcome is categorical
[source: Intel]

Unsupervised learning
Unsupervised learning is a type of algorithm that learns patterns from untagged data.
Since the examples given to the learner are unlabeled, there is no evaluation of the accuracy of the structure that is output by the relevant algorithm, which is one way of distinguishing unsupervised learning from supervised learning and reinforcement learning. [source: Wikipedia]
In other words: given a set of data points {x1, …, xi}, we look for hidden patterns in the data.
Types of unsupervised learning:
● Clustering - grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other clusters.
● Dimension reduction - transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data.
● Association rules - identification of strong rules discovered in databases using some measures of interestingness.

Semi-supervised learning
Semi-supervised learning is an approach to machine learning that combines a small amount of labeled data with a large amount of unlabeled data during training. In this approach, algorithms use the additional unlabeled data to better capture the shape of the underlying data distribution and to generalize better to new samples. Semi-supervised learning falls between unsupervised learning (with no labeled training data) and supervised learning (with only labeled training data). It is a special instance of weak supervision. [source: Wikipedia & scikit-learn]

Reinforcement learning
Reinforcement learning is the training of machine learning models to make a sequence of decisions. The agent learns to achieve a goal in an uncertain, potentially complex environment. In reinforcement learning, an artificial intelligence faces a game-like situation. The computer employs trial and error to come up with a solution to the problem.
To get the machine to do what the programmer wants, the artificial intelligence gets either rewards or penalties for the actions it performs. Its goal is to maximize the total reward. [source: deepsense.ai]

Data & AI world map [figure]

Machine Learning glossary
● y, target, dependent variable, endogenous variable, output variable, regressand, response variable - the variable predicted by the algorithm
● x, feature, independent variable, exogenous variable, explanatory variable, predictor, regressor - a variable used to predict the target variable
● example, entity, row - a single data point within the data (one row in the dataset)
● label - the target value for a single data point

client_id  age  education  income   default
1          25   bachelor   25k USD  0
2          45   doctorate  40k USD  0
3          52   master     70k USD  1

Annotations on the table: default is the target column, the remaining attributes (age, education, income) are feature columns, a single row is an example, and a single value in the default column is a label.

Machine Learning as a function
The fundamental assumption of machine learning is as follows: there is a function that represents a causal relationship between features X and target Y, and it can be written in a very general form as:
Y = f(X) + ϵ,
where f is some fixed but unknown function of X, and ϵ is a random error term, which is independent of X and has mean zero. The goal of machine learning is to “find” the function f, where “find” refers to the set of methods that estimate this function (we need to approximate it, as its actual form is unobservable).

Machine learning estimation idea
In general, we can define the estimation process as:
Ŷ = f̂(X),
where Ŷ is the prediction of our estimator for the target variable and f̂(X) is our estimate of the function f(X). The estimator approximates reality imperfectly, so it generates a prediction error, which is equal to Y - Ŷ. The size of this error reflects the quality of the model (in general, the smaller the error the better). Importantly, for a number of reasons, part of the error is reducible (the bias part) and part is irreducible (e.g. due to omitted variables).
Error = Irreducible + Reducible

Machine learning estimation approaches
A great many approaches can be used to estimate our function f(X). The primary division of supervised machine learning methods is as follows:
● parametric algorithms (for instance econometrics)
  ○ known functional form
  ○ known distribution of random variables
  ○ finite number of parameters
● nonparametric algorithms
  ○ unknown functional form (no a priori assumptions)
  ○ infinite number of parameters
● semi-parametric algorithms
  ○ theoretically infinite number of parameters, but in practice we estimate only part of them
Both parametric and non-parametric methods have their advantages and disadvantages (trade-offs for parametric approaches: simplicity vs. constraints, speed vs. limited complexity, less data required vs. possibly poor fit).

Training Machine learning model - error minimization
Regardless of the estimation approach chosen, we are always keen for the forecast error to be as small as possible with the currently estimated parameters. Therefore, it is necessary to define and then optimise a function that expresses how “wrong” the model is.
First of all, we define a loss function (L): usually a function which measures the error between a single prediction and the corresponding actual value. For instance, the squared error is a common choice: L(y, ŷ) = (y − ŷ)².
Based on that, we can define a more general object, the cost function (J): usually a function which measures the error between predictions and their actual values across the whole dataset. It might be a sum of loss functions over the training set plus some model complexity penalty (regularization). For instance: J(θ) = Σᵢ L(yᵢ, ŷᵢ) + λ‖θ‖², where θ denotes the model parameters and λ ≥ 0 controls the strength of the penalty.
Model training is about minimizing the cost function!

Training Machine learning model - cost function properties
The cost function directly influences our estimator f̂(X). Thus, when we choose this function, we should (where possible) ensure that our estimator is unbiased (its expected value equals the true value being estimated) and efficient (it is the estimator with the smallest variance).
In the best situation, we obtain a minimum-variance unbiased estimator (MVUE). In addition, because the optimisation algorithms are based on differentiation, the cost function should be convex, and ideally also smooth (continuous and differentiable). Last but not least, it is always important to consider whether our cost function reflects the real cost of prediction errors in the context of the research/modelling objective. It is worth considering whether it is more costly to overestimate or to underestimate the target (asymmetry), e.g. whether it is better to employ too many people in the shop for the Christmas peak or too few.

Training Machine learning model - idea of gradient descent
Once we have defined the cost function, we can in principle take its derivative with respect to the parameters (weights), set it to zero, and solve for the parameters to obtain the exact solution (first-order condition, FOC). However, for most cost functions this cannot be solved analytically! Therefore, we have to use an alternative (local) optimisation method: the gradient descent algorithm. The general idea of gradient descent is as follows:
[1] we define the surface created by the objective function (in general, we do not know what it looks like);
[2] we follow the direction of the slope of this surface downhill until we reach a valley.
[source: PaperspaceBlog]

Training Machine learning model - gradient descent formally
First of all, let's recall the simplified definition of a gradient. The gradient is a vector whose coordinates are the partial derivatives of the function with respect to each parameter: ∇J(θ) = (∂J/∂θ1, …, ∂J/∂θd). The gradient vector can be interpreted as the "direction and rate of fastest increase".
Now we define the gradient descent optimization algorithm. Gradient descent is a way to minimize an objective function J(θ), parameterized by a model's parameters θ, by updating the parameters in the opposite direction of the gradient of the objective function ∇J(θ) w.r.t. the parameters. Additionally, we have to define the learning rate η, which determines the size of the steps we take to reach a (local) minimum.
Vanilla gradient descent algorithm:
1. Start with an initial random guess of θ.
2. Generate a new guess by moving in the negative gradient direction (gradient computed on the entire training dataset): θ := θ − η ∇J(θ).
3. Repeat step 2 to successively refine the guess, and stop when a convergence criterion is reached.
[sources: Sebastian Ruder blog, Stanford CS 229]

Training Machine learning model - gradient descent versions
There are three variants of gradient descent (batch, stochastic and mini-batch), which differ in how much data we use to compute the gradient of the objective function. Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update. [source: Sebastian Ruder blog]

General purpose of the estimation
When we conduct research, we have to answer the following question: are we interested in 1) the best possible prediction, 2) the best possible understanding of the relationship between our features and the target (inference), or 3) both at the same time? Depending on the answer (business environment, research problem, etc.), we choose the estimator, e.g. parametric or non-parametric, a fully explainable model or a black-box model (a system which can be viewed in terms of its inputs and outputs without any knowledge of its internal workings). Note that a more complex model will not always be better than a simple one (e.g. some problems are purely linear, and non-parametric methods may then pick up complex, artificial and spurious relationships). Before starting experiments, it is important to have a good understanding of the problem being undertaken!

Types of variables
There are different types of variables in statistics and machine learning. The most important ones are highlighted in the illustration below.
[source: K2 Analytics]

Linear regression - general information
Linear regression is a basic supervised learning algorithm for predicting continuous variables from a set of independent variables. From an econometric point of view, linear regression is primarily used for inference (much less frequently for prediction). In this course we look at linear regression from the machine learning perspective, i.e. we are mostly interested in prediction. To get a good understanding of linear regression in economic applications, a separate course is generally devoted to it. We don't have time for that, so we will discuss its key elements from an ML perspective. At the same time, we recommend a very good course teaching the principles of linear regression (chapters 3 and 4). Importantly, linear regression can be estimated in a number of ways: ordinary least squares (OLS), weighted least squares (WLS), generalised least squares (GLS). We will focus on the most popular of these: OLS.

Linear regression - external materials
We use a Machine Learning University (MLU)-Explain course created by Amazon to present the concepts, assumptions, mathematical foundations, and interpretation of linear regression. The course is made available under the Attribution-ShareAlike 4.0 International licence (CC BY-SA 4.0). Thanks to numerous visualisations, the course allows many theoretical concepts to be discussed very quickly.
Linear regression by MLU-EXPLAIN

Linear regression - additional materials
Matrix notation of linear regression equation [source: Practical Econometrics and Data Science]
Adjusted R squared
Adjusted R² is a corrected goodness-of-fit (model accuracy) measure for linear models. It identifies the percentage of variance in the target field that is explained by the inputs. R² tends to optimistically estimate the fit of the linear regression. It never decreases as more effects are included in the model. Adjusted R² attempts to correct for this overestimation.
Adjusted R² might decrease if a specific effect does not improve the model. Adjusted R² is always less than or equal to R². A value of 1 indicates a model that perfectly predicts values in the target field. A value that is less than or equal to 0 indicates a model that has no predictive value. If we assume that p is the total number of explanatory variables in the model, and n is the sample size, then adjusted R² is equal to:
Adjusted R² = 1 − (1 − R²)(n − 1) / (n − p − 1)
[source: IBM]

Linear regression - additional materials
OLS - Closed-Form Solution extension
OLS - regression output analysis [source: Practical Econometrics and Data Science]
● R² and adjusted R²
● P-value of the F-statistic (interpretation: a value below the significance level, e.g. 5%, means that the features are jointly significant, i.e. the model fits better than a model without features)
● Values of model parameters; the estimated regression is: y = 5.2 + 0.47*x1 + 0.48*x2 - 0.02*x3
● P-value of the t-statistic (interpretation: a value below the significance level, e.g. 5%, means that the given variable is significant in the model)
● Some model specification tests
[source: Statsmodels]
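
The closed-form OLS estimate, the vanilla gradient descent update and the adjusted R² measure discussed in this overview can be tried out together on a small example. The block below is only a minimal sketch under stated assumptions: it uses a single feature and synthetic data, the mean squared error as the cost function, and the function names (fit_ols, fit_gd, r2_scores) are hypothetical helpers invented for this illustration. The coefficients 5.2 and 0.47 are borrowed from the example regression output solely to generate the data.

```python
import random


def fit_ols(x, y):
    """Closed-form OLS for y = b0 + b1 * x (single feature)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1


def fit_gd(x, y, lr=0.1, n_iter=2000):
    """Vanilla (batch) gradient descent on the mean squared error.

    lr is the learning rate; the gradient is computed on the whole dataset.
    """
    n = len(x)
    b0, b1 = 0.0, 0.0  # initial guess (zeros instead of random, for repeatability)
    for _ in range(n_iter):
        errors = [b0 + b1 * xi - yi for xi, yi in zip(x, y)]
        grad_b0 = 2.0 / n * sum(errors)
        grad_b1 = 2.0 / n * sum(e * xi for e, xi in zip(errors, x))
        # Move in the direction opposite to the gradient.
        b0 -= lr * grad_b0
        b1 -= lr * grad_b1
    return b0, b1


def r2_scores(x, y, b0, b1, p=1):
    """Return (R^2, adjusted R^2) for a fitted line; p = number of features."""
    n = len(y)
    y_mean = sum(y) / n
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, adj_r2


# Synthetic data (an assumption made purely for illustration).
random.seed(0)
x = [random.gauss(0, 1) for _ in range(200)]
y = [5.2 + 0.47 * xi + random.gauss(0, 0.5) for xi in x]

b_ols = fit_ols(x, y)
b_gd = fit_gd(x, y)
r2, adj_r2 = r2_scores(x, y, *b_ols)

print("OLS (closed form):", b_ols)
print("Gradient descent: ", b_gd)
print("R^2:", r2, "adjusted R^2:", adj_r2)
```

On well-scaled data like this, both estimation routes should arrive at practically the same parameters. In real applications the closed form is usually preferred for linear regression, while gradient descent is the fallback when no analytical solution exists.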