REI505M Machine Learning
Steinn Guðmundsson ([email protected])
29 November 2024. Translated from Icelandic to English with ChatGPT-4.
Credits: CS229 at Stanford (logistic regression, PCA), 6.036 at MIT (reinforcement learning), Elements of Statistical Learning by Hastie et al., Reinforcement Learning by Sutton and Barto.
ChangeLog 29.11.24: Corrected result of ReLU operation (matrix at the top of p. 65).

1 Linear algebra review

This chapter covers the main concepts from linear algebra that appear in the course. Starting a course in machine learning with dry definitions isn't very exciting. Therefore, it is advisable to go directly to chapter 2 and refer to this chapter as needed.

A vector with p elements (x_1, x_2, ..., x_p) is called a p-vector. It is assumed that all vectors are column vectors (p × 1 matrices) unless otherwise specified. The vector e_k is a p-vector where all elements are zero except the k-th element, which is one. The vector 1 is a p-vector where all elements are one.

The inner product (or dot product) of p-vectors a and b is

    a^T b = a_1 b_1 + a_2 b_2 + \cdots + a_p b_p = \sum_{j=1}^{p} a_j b_j.

The notation a^T indicates that vector a has been transposed, so that a^T is a row vector (1 × p matrix). This notation is consistent with matrix multiplication (see below). Vectors are usually denoted with lowercase letters.

The norm of a vector x is a measure of its length. In the course, we use both the 2-norm (Euclidean norm)

    ||x||_2 = \sqrt{x_1^2 + \cdots + x_p^2} = \sqrt{\sum_{j=1}^{p} x_j^2} = \sqrt{x^T x}

and the 1-norm

    ||x||_1 = |x_1| + \cdots + |x_p| = \sum_{j=1}^{p} |x_j|

where |x_j| denotes the absolute value of x_j. The Euclidean norm is often used to estimate the error of models (e.g., mean squared error in linear regression); the 1-norm is used less frequently for this purpose. Both norms are widely used in regularization of models. The Euclidean norm has the advantage of being differentiable, which simplifies optimization of model parameters, while the 1-norm is not differentiable everywhere. Using the 1-norm in regularization provides information about which input variables are most important in the model. When ||x|| is written, it usually refers to the 2-norm.

The angle φ between vectors a and b is defined by

    cos φ = a^T b / (||a|| ||b||).

When working with text in vector form, such as the bag-of-words representation (more on this later), the angle between vectors is a useful measure of the similarity of two text snippets. Vectors a and b are said to be perpendicular if a^T b = 0.

The projection of vector a onto vector b is the vector

    b̃ = (a^T b / ||b||^2) b.

Vectors x_1, ..., x_k are linearly independent if \sum_{i=1}^{k} α_i x_i = 0 only if α_1 = ... = α_k = 0.

A table of numbers with n rows and m columns is called a matrix. It is often convenient to interpret a matrix as a collection of row or column vectors. Matrices are usually denoted with uppercase letters. Let A be an n × m matrix. The element in row i, column j is denoted by A_ij. The matrix A^T is called the transpose of matrix A and is defined as the m × n matrix with (A^T)_ji = A_ij. Example:

    A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{pmatrix},   A^T = \begin{pmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \end{pmatrix}.

Matrix A is symmetric if A = A^T. In a diagonal matrix, all elements outside the diagonal are zero. A diagonal matrix with 1s on the diagonal is denoted by I (identity matrix). A matrix is said to be sparse if most of its elements are zero. The rank of a matrix is the number of linearly independent columns in the matrix.

The product y = Ax, where A is an n × p matrix and x is a p-vector, is defined as the n-element vector y with y_i = \sum_{j=1}^{p} A_ij x_j.
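These definitions map directly onto NumPy. A minimal sketch (NumPy assumed available; the numbers are illustrative) of the inner product, the two norms and the matrix-vector product interpreted as a weighted sum of columns:

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([4.0, 5.0, 6.0])

    print(a @ b)                    # inner product a^T b = 32
    print(np.sqrt(a @ a))           # 2-norm, computed as sqrt(x^T x)
    print(np.linalg.norm(a, 1))     # 1-norm, |x_1| + ... + |x_p|

    A = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])
    x = np.array([10.0, -1.0])

    # y = Ax computed directly, and as a weighted sum of the columns of A
    print(A @ x)
    print(x[0] * A[:, 0] + x[1] * A[:, 1])   # same result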
The product y = Ax can also be interpreted as a weighted sum of the columns of A, where the elements of x are the weights,

    y = x_1 a_1 + \cdots + x_p a_p

where a_j denotes column j of A.

The matrix product C = AB, where A is an n × p matrix and B is a p × m matrix, is defined as the n × m matrix whose element (i, j) equals the inner product of row i of A and column j of B,

    C_ij = \sum_{k=1}^{p} A_ik B_kj.

Matrix multiplication can also be interpreted as repeated multiplication of a matrix and a vector. The inner product can be interpreted as the matrix product of a p-element row vector a^T and a p-element column vector b. The outer product of a p-element column vector a and a p-element row vector b^T is a p × p matrix with a_i b_j in element (i, j). Example: If a = (1, 2, 3) and b = (4, 5, 6) are column vectors, then

    ab^T = \begin{pmatrix} 4 & 5 & 6 \\ 8 & 10 & 12 \\ 12 & 15 & 18 \end{pmatrix}.

Let A be an n × n symmetric[1] matrix. A vector v ∈ R^n, v ≠ 0, is called an eigenvector of A and λ ∈ R the corresponding eigenvalue if Av = λv. From the definition it is seen that the effect of multiplying matrix A with its eigenvector is the same as multiplying the vector by a scalar.

[1] If A is not a symmetric matrix, its eigenvalues and eigenvectors may be complex.

2 Supervised Learning

In supervised learning, we have a dataset X = {(x^(1), y^(1)), ..., (x^(n), y^(n))}, known as training data, and we seek a function ("model") f such that y ≈ f(x). The elements x^(i) are called input variables or features, and the y^(i) are output variables. The input variables are often vectors, x^(i) ∈ R^p, and the y^(i) are often scalar values, y^(i) ∈ R.

Example: Predicting apartment prices in Reykjavik. The following data is available:

    Size (m^2) | Price (million ISK)
    54         | 33.2
    107        | 48.5
    70         | 42.0
    ...        | ...

Here, x ∈ R corresponds to the apartment size and y ∈ R to the apartment price. The task is to find a function f that describes the apartment price as a function of its size.

Example: Predicting whether a person arriving at the emergency room has had a heart attack. The following data has been collected:

    Age | Maximum Pulse | Blood Pressure | Cholesterol | Heart Attack
    52  | 65            | 95             | 5.3         | 0
    67  | 77            | 101            | 6.3         | 1
    81  | 63            | 90             | 5.5         | 0
    ... | ...           | ...            | ...         | ...

Here, x^(i) ∈ R^4 corresponds to information about individual patients (age, maximum pulse, systolic blood pressure and cholesterol) and y^(i) ∈ {0, 1} to whether patient i has had a heart attack (1) or not (0). The task is to find a function f that returns 1 if the person is likely to have had a heart attack and 0 otherwise.

Example: Classification of photographs. Here, the vector x ∈ R^{mn} corresponds to m × n bitmap images and y ∈ {1, ..., K} to K different categories of images (dogs, cats, cars, ...). The task is to find a function f that takes a bitmap image and determines which category it belongs to. The AlexNet neural network marked a turning point in solving such tasks in 2012 and sparked the ongoing revolution in artificial intelligence based on so-called deep neural networks.

Example: Translating text from English to Icelandic. Here, x ∈ Σ*_E corresponds to strings from the English alphabet Σ_E and y ∈ Σ*_I to strings from the Icelandic alphabet Σ_I. The training data consists of English-Icelandic sentence pairs. The task is to find a function that maps English strings to their corresponding Icelandic strings.

2.1 Linear Regression

If there are, for example, two input variables, x = (x_1, x_2), and a linear relationship is assumed between input and output variables, then the function f is of the form

    f_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2.
We distinguish between the variables of the function f, x = (x_1, x_2), which reflect inputs to the function, and the parameters of the function, θ = (θ_0, θ_1, θ_2). The parameters are determined from the training data (usually just once) and subsequently treated as constants when calculating function values. When the number of input variables is p, the model is

    f_θ(x) = θ_0 x_0 + θ_1 x_1 + \cdots + θ_p x_p = \sum_{j=0}^{p} θ_j x_j = θ^T x

where, for simplicity, we have fixed x_0 = 1 to avoid treating the parameter θ_0 ("intercept") separately.

To choose the parameters θ, we define a cost function that measures the error of the model f_θ(x) with respect to the training data, and we then find the values of the parameters that minimize the function. In the method of least squares, the following function is minimized:

    J(θ) = (1/2) \sum_{i=1}^{n} (f_θ(x^(i)) − y^(i))^2.

The cost function J is a function of the parameters θ only, while x and y are given and thus treated as constants in the minimization. The factor 1/2 in front of the sum is included to slightly simplify formulas in subsequent calculations. Now, how do we find the values θ = (θ_0, ..., θ_p) that minimize J(θ)?

3 Function minimization

A typical learning algorithm determines the parameters of a model by finding those values of the parameters θ that maximize or minimize a given function, known as the objective function. In supervised learning, the objective function is some kind of measure of how well the model fits (given) data, cf. the squared error above. Maximization or minimization of a function is called optimization.

A conventional algorithm to minimize[2] a function starts at a given initial point and iterates towards a local minimum by calculating function values and derivatives along the way, until some stopping criterion is met. The time complexity thus depends on the number of iterations and the "cost" (number of computational operations) of calculating function and derivative values. Minimization algorithms have one or more hyper-parameters that the user chooses in advance.

[2] Maximizing a function f corresponds to minimizing −f.

3.1 Functions of a single variable

First, consider a function f of a single variable θ at the point θ = t. The derivative of f at t indicates how quickly f changes in the vicinity of t.

Example: The function f(θ) = (θ − 1)^2 has the derivative df(θ)/dθ = 2(θ − 1). The function in Figure 1 is decreasing to the left of θ = 1 where the derivative is negative, increasing to the right of θ = 1 where the derivative is positive, and takes its minimum value at the point θ = 1 where the derivative is zero.

Figure 1: The function f(θ) = (θ − 1)^2 has its minimum at θ = 1.

A necessary condition for f to have a minimum at θ is that df(θ)/dθ = 0. If df(θ)/dθ < 0, a lower function value can be found by moving to the right from θ, and if the derivative is positive, a lower function value can be found by moving to the left from θ. In light of this discussion, a basis for an algorithm to determine a (local) minimum point from a given starting point θ^(0) can be established: Iterate

    θ^(1) ← θ^(0) − α df(θ^(0))/dθ
    θ^(2) ← θ^(1) − α df(θ^(1))/dθ
    ...

until a stopping criterion is met. Here θ^(k) denotes the value of the parameter in iteration k, and α > 0 is a constant that needs to be specified. It can be shown that under certain conditions on f and α, the sequence θ^(1), θ^(2), ... converges to a local minimum of f, which we denote by θ*.
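A minimal sketch of this iteration for the example above, f(θ) = (θ − 1)^2; the step size α = 0.1, the tolerance and the iteration cap are illustrative choices, not values from the notes:

    def grad_descent_1d(df, theta0, alpha=0.1, tol=1e-8, max_iter=1000):
        """Minimize a univariate function, given its derivative df."""
        theta = theta0
        for _ in range(max_iter):
            step = alpha * df(theta)
            theta -= step
            if abs(step) < tol:      # stop when the steps become very small
                break
        return theta

    # f(theta) = (theta - 1)^2 has derivative 2*(theta - 1) and minimum at theta = 1
    print(grad_descent_1d(lambda t: 2.0 * (t - 1.0), theta0=5.0))   # ~1.0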
Note that df(θ*)/dθ = 0 does not guarantee that θ* is a minimum point. The point could be a maximum point (e.g., if f(θ) = −θ^2) or a saddle point (e.g., if f(θ) = θ^3). To obtain a sufficient condition for a minimum point, higher derivatives of the function at the point need to be examined. We won't delve further into this here[3].

[3] In the minimization problems that arise in machine learning, there seems to be little or no benefit in using higher-order derivatives.

3.2 Functions of multiple variables

Next, consider a function of p variables, f(θ_1, θ_2, ..., θ_p). Derivatives can also be defined for such functions, but it's a bit more involved than for univariate functions. The directional derivative of a function f at the point θ = (θ_1, ..., θ_p) in the direction of a unit vector u indicates how f changes when moving from θ along a line in the direction u. Specifically, the directional derivative of f at θ, denoted D_u f(θ), is defined by

    D_u f(θ) = lim_{h→0} (f(θ + hu) − f(θ)) / h.

We see that the numerator indicates how much f changes when moving from the point θ to the point θ + hu (h is a scalar). Usually, it is possible to find a direction u such that f decreases when moving in that direction (an exception is when θ is a minimum point).

The partial derivative of f at the point θ = (θ_1, ..., θ_p) with respect to the variable θ_k, denoted ∂f(θ)/∂θ_k, is the directional derivative of f in the direction of the unit vector e_k (the unit vector parallel to the k-th coordinate axis),

    ∂f(θ)/∂θ_k = D_{e_k} f(θ).

The partial derivative of f with respect to θ_k corresponds to the regular (univariate) derivative of f with respect to θ_k while treating all other variables as constants.

Example: The partial derivatives of the function f(θ_1, θ_2) = θ_1^2 + 5θ_1 θ_2 + θ_2^3 are

    ∂f/∂θ_1 = 2θ_1 + 5θ_2,
    ∂f/∂θ_2 = 5θ_1 + 3θ_2^2.

The gradient of f at the point θ is the vector

    ∇f(θ) = (∂f(θ)/∂θ_1, ..., ∂f(θ)/∂θ_p)^T.

Example: The gradient of the function f(θ_1, θ_2) = θ_1^2 − 5θ_1 + θ_2^3 + 10 is ∇f = (2θ_1 − 5, 3θ_2^2)^T.

Note that both the directional derivative and the partial derivative are scalar quantities (numbers), but the gradient is a vector whose number of elements corresponds to the number of variables.

A contour line of f for a constant c is defined as the set of all points where the function value is c, i.e. {(θ_1, ..., θ_p) | f(θ_1, ..., θ_p) = c}.

Example: Figure 2 shows the graph of the function

    f(θ_1, θ_2) = 3(1 − θ_1)^2 exp(−θ_1^2 − (θ_2 + 1)^2) − 10(θ_1/5 − θ_1^3 − θ_2^5) exp(−θ_1^2 − θ_2^2) − (1/3) exp(−(θ_1 + 1)^2 − θ_2^2)

along with some of its contour lines.

Figure 2: Graph of a function along with its contours.

According to a theorem from Mathematical Analysis II, it holds that

    D_u f(θ) = (∇f(θ))^T u.

We use this relation to find the direction u that causes f to decrease fastest in the vicinity of θ. First, consider the contour line of f through θ. If we move a very short distance in a direction parallel to the contour line, the value of the function does not change, i.e., D_u f(θ) = 0. The theorem then gives that (∇f(θ))^T u = 0, i.e., the gradient of f at θ is perpendicular to the contour line at the point. Rewriting the directional derivative,

    (∇f(θ))^T u = ||∇f(θ)|| ||u|| cos φ

where φ is the angle between the gradient and u. Since u is a unit vector, we find that the derivative is largest when u is parallel to the gradient and smallest when it points in the opposite direction of the gradient. Therefore, the step that gives the fastest decrease in the function value is u = −∇f(θ).
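The relation D_u f(θ) = (∇f(θ))^T u is easy to sanity-check numerically. A small sketch (the function, point and step size h are illustrative) comparing a finite-difference directional derivative with the inner-product formula, in the steepest-descent direction:

    import numpy as np

    def f(t):
        return t[0]**2 + 2.0 * t[1]**2

    def grad_f(t):
        return np.array([2.0 * t[0], 4.0 * t[1]])

    theta = np.array([-1.5, -0.9354])
    u = -grad_f(theta) / np.linalg.norm(grad_f(theta))   # unit steepest-descent direction

    h = 1e-6
    fd = (f(theta + h * u) - f(theta)) / h   # finite-difference directional derivative
    print(fd, grad_f(theta) @ u)             # both ~ -||grad f(theta)||, the fastest decrease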
Example: Figure 3 shows the contour lines of the function f(θ_1, θ_2) = θ_1^2 + 2θ_2^2 along with the vector u = −∇f(θ)/||∇f(θ)|| at the point θ = (−1.5, −0.9354).

Figure 3: Contour lines of the function f(θ_1, θ_2) = θ_1^2 + 2θ_2^2 along with the gradient at θ = (−1.5, −0.9354).

We see that by taking a small step in the direction of u (the negative gradient), it is possible to find a point that gives a lower function value. The discussion above underlies the steepest descent algorithm:

    k ← 1
    repeat
        θ_j^(k) ← θ_j^(k−1) − α_k ∂f(θ^(k−1))/∂θ_j,   j = 0, ..., p
        k ← k + 1
    until stopping condition is met

The parameters α_k > 0 determine the step size. The simplest choice is α_k = α, i.e., a constant, but then the value of α needs to be chosen carefully. If α is too small, convergence will be very slow, and if α is too large, the method diverges, i.e., moves away from the minimum. The iterations continue until one or more of the following conditions are met:

    Maximum number of iterations reached: k > k_max
    Steps become very small: ||θ^(k) − θ^(k−1)||_2 < ε_1
    Function values change little: |f(θ^(k)) − f(θ^(k−1))| < ε_2

where k_max, ε_1 > 0 and ε_2 > 0 are parameters that need to be chosen.

Example: Figure 4 shows the first steps when the steepest descent method is applied to the function f(θ_1, θ_2) = θ_1^2 + 2θ_2^2.

Figure 4: Typical progress of the steepest descent method. The minimum point is at (0, 0). The step size is α = 0.21.

If the contour lines are highly elongated (consider a function like f(θ_1, θ_2) = θ_1^2 + 10^6 θ_2^2), the method tends to "zigzag" towards the minimum and convergence is exceedingly slow. The steepest descent method is found in almost all books on optimization and was long taken as an example of a method that performs poorly in practice due to slow convergence. It has since been found that methods based on it are particularly well suited to many optimization tasks that arise in artificial intelligence.

3.3 Method of steepest descent and least squares minimization

Let's consider the squared error J(θ) above for the case n = 1 (only one point in the dataset):

    ∂J(θ)/∂θ_j = ∂/∂θ_j [ (1/2)(f_θ(x) − y)^2 ]
               = (1/2) · 2 (f_θ(x) − y) ∂/∂θ_j (f_θ(x) − y)
               = (f_θ(x) − y) ∂/∂θ_j (θ_0 x_0 + \cdots + θ_j x_j + \cdots + θ_p x_p − y)
               = (f_θ(x) − y) x_j.

The update step then becomes

    θ_j^(k) ← θ_j^(k−1) − α (f_{θ^(k−1)}(x) − y) x_j,   j = 0, ..., p.

We see that if f_θ(x) is far from y, a large step is taken, but if f_θ(x) is close to y, a small step is taken (the values of the parameters change little). In the case n ≥ 1, we use the fact that the derivative of a sum equals the sum of the derivatives, and the steepest descent method becomes

    k ← 1
    repeat
        θ_j^(k) ← θ_j^(k−1) − α \sum_{i=1}^{n} (f_{θ^(k−1)}(x^(i)) − y^(i)) x_j^(i),   j = 0, ..., p
        k ← k + 1
    until stopping condition is met

This method is sometimes called "batch gradient descent" in machine learning. If f and α_k meet certain conditions, it can be shown that θ^(k) converges to a minimum point of f and that J(θ) decreases in each step. The disadvantage of this method is that all n points in the dataset must be processed before the parameters are updated. This proves difficult, for example, when dealing with large datasets that cannot all fit in memory at once. A related method that often works very well is

    k ← 1
    repeat
        Choose i randomly from the range 1, ..., n
        θ_j^(k) ← θ_j^(k−1) − α (f_{θ^(k−1)}(x^(i)) − y^(i)) x_j^(i),   j = 0, ..., p
        k ← k + 1
    until stopping condition is met

This method is called stochastic gradient descent (SGD).
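A minimal SGD sketch for linear regression on synthetic data; the data, the step size and the iteration count are illustrative choices:

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic data: y = 2 + 3*x + noise, with x_0 = 1 prepended (intercept trick)
    n = 200
    x = rng.uniform(0.0, 1.0, size=n)
    X = np.column_stack([np.ones(n), x])           # row i is (1, x^(i))
    y = 2.0 + 3.0 * x + 0.1 * rng.normal(size=n)

    theta = np.zeros(2)
    alpha = 0.05                                   # constant step size
    for k in range(20000):
        i = rng.integers(n)                        # pick one example at random
        residual = X[i] @ theta - y[i]             # f_theta(x^(i)) - y^(i)
        theta -= alpha * residual * X[i]           # update all components of theta

    print(theta)   # approximately (2, 3)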
Instead of updating the parameters after processing the entire dataset, they are updated for each data point. This usually results in a much faster reduction in J(θ). Variants of SGD are widely used in training deep neural networks and other learning methods (more on this later).

Note: In linear regression, f_θ(x) = θ^T x. The parameter θ can be determined using the SGD method, but in practice it is often determined by solving the so-called normal equations directly,

    X^T X θ = X^T y

where X is an n × p matrix with x^(i) in row i, and y denotes the vector y = (y^(1), ..., y^(n)). This linear system of equations is derived from the condition ∇J(θ*) = 0, which is a necessary condition for a local minimum of J(θ) (and also sufficient in this case). It is efficient to use QR decomposition to solve these equations. The time complexity then becomes O(np^2).

3.4 Scaling of Input and Output Variables

Scaling of variables can affect the accuracy of predictive models, e.g., nearest neighbour classifiers, support vector machines and neural networks. Scaling of variables also affects the convergence rate of many minimization algorithms, e.g., SGD, even if the model itself is not sensitive to the scaling of variables. Let's consider the case when the input variables take continuous values.

Example: In linear regression, input variables x_1 and x_2 represent the weight of comparable objects but in different units; x_1 is of magnitude 10^{−3} and x_2 is of magnitude 10^3. The contour lines of the objective function become elongated ellipses, and the convergence of gradient descent methods becomes very slow (the iterates zigzag towards the minimum). By multiplying all x_1 values by 1000 and dividing all x_2 values by 1000, x_1 and x_2 become of the same magnitude, the contour lines become circular, and the convergence rate increases. Note that the method based on the normal equations is not as sensitive.

The main methods used for scaling continuous input variables are:

1. Scaling to the range [0, 1]. Input variable j in example i becomes

    x̃_j^(i) = (x_j^(i) − m_j) / (M_j − m_j)

where M_j is the largest value that x_j takes in the dataset, M_j = max_{i=1,...,n} x_j^(i), and m_j is the smallest value that x_j takes, m_j = min_{i=1,...,n} x_j^(i).

2. Standardization (normalization). Input variable j in example i becomes

    x̃_j^(i) = (x_j^(i) − x̄_j) / s_j

where x̄_j is the average of the values that x_j takes in the dataset and s_j is the standard deviation.

To predict unseen values, e.g., data from a test set, information about the scaling, i.e., (m_j, M_j) or (x̄_j, s_j), must be stored. To calculate the predicted value at a point x̂, it is first scaled and then fed into f_θ. Note that test data should not be included when determining (m_j, M_j) or (x̄_j, s_j): including it in the training can skew accuracy estimates of the model (this is an example of test data leaking into the training data).

Outliers in the data significantly impact the scaling, as shown in Figure 5. The figure shows one point that is an outlier, as its value is significantly different from all other examples in the dataset. After scaling, its value is approximately 1 while all other points are close to 0, which can compromise accuracy later on.

If an input variable is discrete, its values can be mapped to integers, but how this is done can affect performance[4].

[4] Decision tree methods are an exception; they are insensitive to the scaling of input variables.

Figure 5: Example of how outliers affect data scaling.
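A minimal sketch of standardization where the scaling statistics are computed from the training set only and then reused on the test set (the arrays and helper names are illustrative):

    import numpy as np

    def fit_scaler(X_train):
        """Per-feature mean and standard deviation, from training data only."""
        return X_train.mean(axis=0), X_train.std(axis=0)

    def transform(X, mean, std):
        return (X - mean) / std

    X_train = np.array([[1e-3, 2e3], [2e-3, 1e3], [3e-3, 3e3]])
    X_test  = np.array([[1.5e-3, 2.5e3]])

    mean, std = fit_scaler(X_train)          # test data must not influence these
    print(transform(X_train, mean, std))
    print(transform(X_test, mean, std))      # scaled with the stored training statistics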
Example: Input variable x_1 takes the values red, yellow or blue and is mapped to the values 0, 1 or 2. If we consider the Euclidean distance between points i and j in the dataset,

    d(i, j) = ||x^(i) − x^(j)|| = \sqrt{(x_1^(i) − x_1^(j))^2},

we see, for example, that the distance between the points (red) and (yellow) is 1, but the distance between (red) and (blue) is 2, which is hardly what we want.

So-called one-hot encoding alleviates this problem. Instead of a single variable x taking k discrete values, k new variables x̃_1, ..., x̃_k are created, where variable x̃_j takes the value 1 if and only if x takes the value j, and 0 otherwise.

Example (continued): Instead of the variable x_1 taking the values red, yellow or blue, three variables are created so that red = (1,0,0), yellow = (0,1,0) and blue = (0,0,1). We see that the distance between any two distinct colors is now the same, namely √2.

If output variables in regression analysis span a large range of values, it can help to transform them using the log-transform.

4 Classifiers

Let X = {(x^(1), y^(1)), ..., (x^(n), y^(n))} be a given dataset. As in regression analysis, we seek a rule f such that y ≈ f(x), but now the y-values are discrete, y^(i) ∈ {1, ..., K}, where K is the number of classes.

Numerous classification algorithms have been proposed since Ronald Fisher introduced linear discriminant analysis in 1936. While some classifiers make strict assumptions about the distribution of the x^(i) values in the input space (e.g., linear discriminants), others make little to no demands (e.g., k-NN classifiers, tree classifiers, neural networks). Some classifiers are limited to continuous input variables (e.g., tree classifiers), but others support more diverse structures such as graphs, images and strings. Some classifiers are fundamentally limited to two classes (e.g., logistic regression and support vector machines), while others are not (e.g., k-NN classifiers, tree classifiers and neural networks). Some classifiers struggle with "junk" input variables in X, i.e., variables that have little or no connection to y (e.g., k-NN classifiers), while others handle such cases well (e.g., tree classifiers). From this enumeration it is clear that there is no single type of classifier that is best in all cases, but generally tree classifiers and support vector machines work well for small datasets (e.g., n ≈ 10^4). Deep neural networks perform well for certain types of classification tasks, especially tasks related to image and text processing, but they typically require a large amount of training data.

Figure 6 shows two classifiers for a simple two-class dataset. The two classes are represented by different colors, and the training data are marked with circles. The images are created by classifying points in the area [−2.1, 1.9] × [−1.0, 2.0] and coloring them with the corresponding predicted value. The boundary between classes is called the decision boundary.

Figure 6: 1-NN and a nonlinear support vector machine trained on a two-dimensional, two-class task.

4.1 Nearest Neighbor Classifiers

In a k-nearest neighbor classifier, a point x̂ is classified by finding the k points in X closest to x̂ and assigning x̂ to the class that occurs most frequently among these neighbors (a minimal sketch follows below). If more than one class is most frequent, one of them is selected at random. The number of neighbors, k, is a hyper-parameter that needs to be selected.
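A minimal k-NN sketch using the Euclidean distance; the data and the value of k are illustrative, and ties here fall to the first most-common label rather than a random choice:

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_hat, k=3):
        """Classify x_hat by majority vote among its k nearest training points."""
        dists = np.linalg.norm(X_train - x_hat, axis=1)   # Euclidean distances
        neighbors = y_train[np.argsort(dists)[:k]]        # labels of the k closest points
        return Counter(neighbors).most_common(1)[0][0]    # most frequent class

    X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
    y_train = np.array([0, 0, 1, 1])
    print(knn_predict(X_train, y_train, np.array([0.8, 0.9]), k=3))   # -> 1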
To estimate the distance between x̂ and a point x^(i), the Euclidean distance is often used,

    d(x̂, x^(i)) = ||x̂ − x^(i)||,

but many other distance measures are possible, e.g., measures tailored to the data, such as dynamic time warping, which is sometimes used to measure distances between time series and was widely used in speech analysis before neural networks became dominant in the field. Note that when using the Euclidean distance, input variables must be scaled in advance, e.g., by subtracting the mean and dividing by the standard deviation (standardization).

Advantages: Easy to understand and simple to implement. Handles K ≥ 2 classes. Can use general distance measures. Does not make strict demands on the distribution of the input data. Often works well when the number of input variables is small.

Disadvantages: When classifying a point x̂, it is necessary to iterate through the entire dataset, and the number of operations is O(np); with appropriate data structure choices, efficiency can be somewhat improved. Accuracy decreases as the number of input variables increases (a common problem that is not restricted to k-nearest neighbors).

4.2 Classifier Based on Linear Regression

Here it is assumed that y ∈ {0, 1}, and linear regression is used to fit a model of the form f_θ(x) = θ^T x to the data. Since f_θ(x) is a real-valued function, the function values can be negative or very large, so the model cannot be used directly to classify a point x̂. Instead, a rule such as

    ŷ = 1 if θ^T x̂ > 0.5,  0 otherwise

is used. The classifier has the advantage of being easy to understand and simple to implement, but it performs poorly in practice. The main reason is that the quadratic loss does not take into account that y is a discrete variable taking values 0 or 1. The quadratic loss furthermore penalizes heavily for correctly classified points that lie far from the decision boundary.

4.3 Logistic Regression Classifier

Here it is assumed that y ∈ {0, 1}, but the model is now of the form f_θ(x) = g(θ^T x) where

    g(z) = 1 / (1 + exp(−z)).

The function g(z) is called the sigmoid function and is shown in Figure 7. It maps z ∈ R to the interval (0, 1).

Figure 7: The sigmoid function, g(z) = 1/(1 + exp(−z)).

As before, the parameters θ of the model are found by optimization. In linear regression, the quadratic error of the model was minimized with respect to the training data, but here the sigmoid function causes the quadratic error \sum_{i=1}^{n} (g(θ^T x^(i)) − y^(i))^2 to be a nonconvex function[5], and thus another objective function is used.

[5] Minimizing convex functions is relatively easy, but nonconvex functions are in general much more difficult to optimize.

In logistic regression, it is assumed that the input and output variables are related by the following statistical model,

    p(y = 1 | x; θ) = g(θ^T x)
    p(y = 0 | x; θ) = 1 − g(θ^T x)

where "y | x; θ" indicates that y is conditioned on the input variable x but not on the parameters θ (θ parameterizes the distribution). This can be rewritten as

    p(y | x; θ) = g_θ(x)^y (1 − g_θ(x))^{1−y}.

The parameters are obtained by maximizing the likelihood that the data fits the model, i.e., by maximizing the likelihood function L(θ) = p(y | X; θ) with respect to θ. Assuming that the x^(i) values are statistically independent, we get

    L(θ) = \prod_{i=1}^{n} p(y^(i) | x^(i); θ),

i.e.

    L(θ) = \prod_{i=1}^{n} g(θ^T x^(i))^{y^(i)} (1 − g(θ^T x^(i)))^{1−y^(i)}.

It is more convenient to maximize log L(θ) than to maximize L(θ) directly.
We then get

    ℓ(θ) = log L(θ)
         = log \prod_{i=1}^{n} g(θ^T x^(i))^{y^(i)} (1 − g(θ^T x^(i)))^{1−y^(i)}
         = \sum_{i=1}^{n} y^(i) log g(θ^T x^(i)) + (1 − y^(i)) log(1 − g(θ^T x^(i))).

It is common to use Newton's method to maximize ℓ(θ). Since that method is not on the course agenda, we consider maximization using a steepest ascent method, in particular SGD. Note that the derivative of g(z) can be written as

    dg(z)/dz = g(z)(1 − g(z)).

For the case n = 1, using the rules of differentiation we have (work this out yourself!)

    ∂ℓ(θ)/∂θ_j = (y − g(θ^T x)) x_j,   j = 0, ..., p

(compare with the corresponding formula for linear regression), which gives the following SGD algorithm:

    Iterate k = 1, ..., k_max
        Randomly select i in the range 1, ..., n
        θ_j^(k) = θ_j^(k−1) + α (y^(i) − g((θ^(k−1))^T x^(i))) x_j^(i),   j = 0, ..., p

Note that the sign of the update term is + as we are maximizing ℓ(θ).

Classification of a point x̂ is performed in the following way:

    ŷ = 1 if g(θ^T x̂) ≥ 0.5,  0 otherwise

or equivalently,

    ŷ = 1 if θ^T x̂ ≥ 0,  0 otherwise.

From this it can be seen that the cost of classifying a single point x̂ is O(p) floating-point operations.

Figure 8: Applying the logistic regression classifier to a two-dimensional dataset.

Note that the hyperplane (a line when p = 2) θ^T x = 0 divides the space of input variables into two half-spaces, as shown in Figure 8. Such classifiers are called linear classifiers. A dataset that can be correctly classified by a linear classifier is said to be linearly separable. A simple task that a linear classifier cannot handle is the XOR problem (the left image below). In some cases, hyperplanes simply do not apply, as illustrated in Figure 9, but linear classifiers are nevertheless highly useful, in particular when the number of input variables is large.

Figure 9: Examples of datasets where a linear classifier does not apply.
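Before moving on, a minimal sketch of the logistic-regression SGD updates derived above; the synthetic data, step size and iteration count are illustrative choices:

    import numpy as np

    rng = np.random.default_rng(1)

    def g(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Synthetic two-class data with an intercept column x_0 = 1
    n = 500
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    true_theta = np.array([-0.5, 2.0, -1.0])
    y = (rng.uniform(size=n) < g(X @ true_theta)).astype(float)

    theta = np.zeros(3)
    alpha = 0.1
    for k in range(50000):
        i = rng.integers(n)
        theta += alpha * (y[i] - g(theta @ X[i])) * X[i]   # ascent step on l(theta)

    y_hat = (X @ theta >= 0.0).astype(float)               # classify: theta^T x >= 0
    print(theta, (y_hat == y).mean())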
5 Practicalities

In this chapter we see how we can estimate model performance, discuss evaluation metrics for regression and classification models, and look at some methods for dealing with datasets where the number of training examples varies widely between classes (class imbalance).

5.1 Evaluating Model Quality and Choosing a Model

Let f be a given model derived from training data X = {(x^(1), y^(1)), ..., (x^(n), y^(n))}. It's common to use model error as a measure of model quality. Error assessment is used to answer questions like: 1) Is f accurate enough (for us or our clients)? 2) Is model f_a more accurate than model f_b? This question arises when choosing between different models, e.g., which degree of polynomial to choose in polynomial regression. We now consider different ways to assess the accuracy of regression models[6].

[6] A similar approach applies to classifiers, but MSE is not used to measure error; instead, the proportion of incorrectly classified examples is often used. More on this in the following section.

5.1.1 Error Assessment from Training Data

After obtaining a model f_θ(x) from X, it's possible to calculate the mean error on the training data (training set error),

    MSE_train = (1/n) \sum_{i=1}^{n} (f_θ(x^(i)) − y^(i))^2.

If the value of MSE_train is high, the model is likely not good (it underfits). Possible reasons for this include i) little correlation between input and output variables in the data, ii) the model is unsuitable for the data (e.g., a linear model is used when the data contains seasonal fluctuations), or iii) the model's flexibility is too limited.

However, a low MSE_train value is no guarantee that the model is good. Why not? For instance, we can achieve MSE_train = 0 by "memorizing" the data, e.g., by using a Python dictionary to store the values in X. Such a model is an example of extreme overfitting and is completely useless for prediction. The error MSE_train decreases with increasing flexibility of f, e.g., with increasing degree of the polynomial in polynomial regression, and therefore MSE_train cannot be used to choose between two models.

5.1.2 Error Estimation from an Independent Dataset

We consider three methods.

i) Randomly split X into two non-overlapping sets, X_train and X_test, prior to fitting the model. The split could, for example, be such that 2/3 of the data goes into training and 1/3 into testing[7]. Use

    MSE_test = (1/n_test) \sum_{i=1}^{n_test} (f_θ(x_test^(i)) − y_test^(i))^2

as a measure of model quality. The premise for MSE_test to provide a good estimate of model error is that the test set is sufficiently large (otherwise randomness has a large influence). A disadvantage is that this method overestimates the error, since not all of the available data is used for training. A variant of this method called random subsampling is sometimes used: the dataset is randomly split several times into X_train and X_test, a model is found for each split, and MSE_test is calculated on X_test. The average of the MSE values is then used as the final error estimate.

[7] These numbers are not set in stone.

ii) Leave-one-out validation. In each step, an example x^(i) is set aside and a model is obtained using all the other examples. The prediction error MSE_i for example i is calculated, and this is repeated for all examples i = 1, ..., n. Finally, the average of the MSE_i values is used as the final error estimate. This method has the advantage that we do not sacrifice valuable training data for error estimation. However, it has the disadvantage of being computationally inefficient (except in exceptional cases).

iii) k-fold cross-validation. The elements in X are first randomly shuffled. Next, the first ⌊n/k⌋ examples from X are set aside in the set X_test, and a model is obtained using all the other examples (X \ X_test). The squared error is calculated for the elements in X_test. This is then repeated for the next ⌊n/k⌋ elements and so on, and the average error is reported (see Figure 10). It's common to use k = 5 or k = 10.

Figure 10: Example of k-fold cross-validation with k = 3. Here T_i corresponds to test set number i, S_i to training set i, P_i to the error on T_i, and P to the average error over all splits.

5.1.3 Model Selection

To select a model, e.g., the degree of a polynomial in a polynomial model, it is possible to iterate over several different values of the hyper-parameter, assess the model's error using methods i), ii) or iii) in section 5.1.2, and choose the value that yields the lowest average error on test data. Figure 11 shows how training and testing error can depend on the value of a hyperparameter. The training error steadily decreases, but the testing error has a minimum when the value of the hyperparameter is approximately 30. This value would then be selected, and the model subsequently retrained on all the data.

Figure 11: Example of how the training error and testing error of a model depend on the value of a hyperparameter. The best value of the parameter is approximately 30.

Note that the average error corresponding to the best hyperparameter value gives an overly optimistic estimate of the model's error on unseen data.
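A minimal k-fold cross-validation sketch; the model callbacks and data are illustrative, and in practice a library routine (e.g., scikit-learn's cross-validation utilities) would normally be used instead:

    import numpy as np

    def kfold_mse(X, y, fit, predict, k=5, seed=0):
        """Average test MSE over k folds; fit/predict are model callbacks."""
        n = len(y)
        idx = np.random.default_rng(seed).permutation(n)   # shuffle first
        fold_size = n // k
        errors = []
        for f in range(k):
            test = idx[f * fold_size:(f + 1) * fold_size]
            train = np.setdiff1d(idx, test)
            model = fit(X[train], y[train])
            errors.append(np.mean((predict(model, X[test]) - y[test]) ** 2))
        return np.mean(errors)

    # Example with least squares (a degree-1 polynomial) as the model
    fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
    predict = lambda theta, X: X @ theta

    rng = np.random.default_rng(1)
    X = np.column_stack([np.ones(100), rng.uniform(size=100)])
    y = 2.0 + 3.0 * X[:, 1] + 0.1 * rng.normal(size=100)
    print(kfold_mse(X, y, fit, predict, k=5))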
To choose values for one or more hyperparameters while also getting a good estimate of the classifier's error on unseen data, nested cross-validation can be performed, as shown in Figure 12. This method can be very time-consuming, so a common approach is to set aside part of the data at the beginning, as in method i) above (these data are under no circumstances used for training the model). Cross-validation is then used to find good values for the hyperparameters based on the remaining data. These values are then used to train the final version of the model, which is then tested on the data set aside at the beginning.

Figure 12: Nested cross-validation. Here σ_i and C_i represent hyperparameters being assessed (the image is from an article about support vector machines).

5.2 Performance metrics

Now that we have seen how to evaluate model quality using a test set or with cross-validation, we briefly look at the most commonly used performance measures for regression and classification.

5.2.1 Regression

The (root) mean squared error is frequently used to measure the performance of regression models,

    RMSE = \sqrt{(1/n) \sum_{i=1}^{n} (f_θ(x^(i)) − y^(i))^2}.

The advantage of using RMSE instead of MSE is that RMSE has the same units as the y-values. RMSE is a good choice if we think that large errors are disproportionately worse than small errors, e.g., if being off by 10 is more than twice as bad as being off by 5.

Another frequently used measure is the mean absolute deviation,

    MAD = (1/n) \sum_{i=1}^{n} |f_θ(x^(i)) − y^(i)|.

In the context of the above example, MAD would be the metric of choice if being off by 10 is exactly twice as bad as being off by 5.

5.2.2 Classification

Until now we've used accuracy, the fraction of correct predictions, as a performance measure (or equivalently the error rate, which is 1 − accuracy). Accuracy is an appropriate metric if the "cost" of misclassification is roughly the same for all classes and the number of examples in each class is roughly the same. These assumptions do not always hold. Consider, e.g., classification of healthy vs. sick individuals, where incorrectly classifying a sick individual as healthy is arguably worse than incorrectly classifying a healthy person as sick. An example where there may be a large difference in the number of examples in each class is a dataset on credit card transactions: the number of fraudulent transactions is likely much smaller than the number of non-fraudulent transactions.

In the following we assume that there are two classes[8]. Denote one class as "positive" and the other as "negative". A true positive (TP) is an example that is predicted as positive and is truly positive. A false positive (FP) is an example that is predicted as positive but is truly negative. A true negative (TN) is an example that is predicted as negative and is truly negative. A false negative (FN) is an example that is predicted as negative but is truly positive. We can define accuracy in these terms,

    accuracy = (#TP + #TN) / (#TP + #FP + #TN + #FN).

When the assumptions of "equal costs" and balanced labels do not hold, we may need to consider more than a single measure. The true positive rate[9] (TPR) is defined as

    TPR = #TP / (#TP + #FN).

The true negative rate[10] (TNR) is defined as

    TNR = #TN / (#TN + #FP).

We want TPR and TNR to be high, but we may have to settle for some tradeoff between the two.

[8] These metrics can be generalized to multi-class problems as well.
[9] Sometimes called recall or sensitivity.
[10] Sometimes called specificity.
Precision is another metric that is sometimes used,

    precision = #TP / (#TP + #FP).

The F1-score is the harmonic mean of precision and recall (TPR),

    F1 = 2 / (recall^{−1} + precision^{−1}),

and is often used to measure performance on imbalanced datasets.

5.3 Class Imbalance

When there is a significant difference in the number of examples in individual classes in X, the classes are said to be imbalanced. For instance, consider the diagnosis of a rare disease based on measurements: most likely, many more measurements of healthy individuals than of sick ones are available. Another example is diagnosing network attacks based on network traffic data, where much of the data likely corresponds to normal traffic and attacks are a minority. Yet another example would be predicting an imminent stock market crash based on historical data. If the imbalance between classes is very large, a classifier that simply always chooses the most common class (e.g., all healthy, no network attack, no stock market crash) can have a low error rate and still be completely useless.

Several methods have been proposed to address this. We can simply discard examples at random from the larger class until we end up with equally sized classes ("undersampling"). This is straightforward to implement and is often used, but the method is wasteful as part of the data is discarded. An alternative is to duplicate examples from the minority class at random, thus ending up with multiple copies of such examples in the training set ("oversampling").

Classifiers can also be modified so that different classes receive different weights in the objective function. How this is implemented depends on the type of classifier, but often the changes are relatively simple. Consider, e.g., the logistic regression classifier. The objective function is

    ℓ(θ) = \sum_{i=1}^{n} y^(i) log g(θ^T x^(i)) + (1 − y^(i)) log(1 − g(θ^T x^(i))).

We now rearrange this expression so that we first sum over all the examples where y^(i) = 1 and then over all examples where y^(i) = 0,

    ℓ(θ) = \sum_{i: y^(i)=1} log g(θ^T x^(i)) + \sum_{i: y^(i)=0} log(1 − g(θ^T x^(i))).

By introducing hyper-parameters C_0 > 0 and C_1 > 0 we can emphasize one class over the other,

    ℓ_C(θ) = C_1 \sum_{i: y^(i)=1} log g(θ^T x^(i)) + C_0 \sum_{i: y^(i)=0} log(1 − g(θ^T x^(i))).

The hyperparameters are chosen using cross-validation or a similar method. This variant of logistic regression can also be used if, for example, mistakes with one class are more costly than mistakes with the other. For instance, one might imagine that y = 1 corresponds to an individual having cancer and y = 0 to being healthy. We may want to tune the classifier such that it preferably does not predict an individual with cancer as being healthy. Although it is unpleasant for a healthy individual to receive a cancer diagnosis, further investigations can usually be conducted to correct such errors.
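A small helper computing these metrics from predictions (the helper name and data are illustrative); on imbalanced data it shows how an always-majority classifier can look accurate while having TPR = 0:

    import numpy as np

    def binary_metrics(y_true, y_pred):
        """Confusion counts and derived metrics for labels in {0, 1}."""
        tp = int(np.sum((y_pred == 1) & (y_true == 1)))
        fp = int(np.sum((y_pred == 1) & (y_true == 0)))
        tn = int(np.sum((y_pred == 0) & (y_true == 0)))
        fn = int(np.sum((y_pred == 0) & (y_true == 1)))
        tpr = tp / (tp + fn) if tp + fn else 0.0           # recall / sensitivity
        tnr = tn / (tn + fp) if tn + fp else 0.0           # specificity
        prec = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * prec * tpr / (prec + tpr) if prec + tpr else 0.0
        acc = (tp + tn) / len(y_true)
        return {"accuracy": acc, "TPR": tpr, "TNR": tnr,
                "precision": prec, "F1": f1}

    # Imbalanced example: always predicting the majority class 0
    y_true = np.array([0] * 95 + [1] * 5)
    y_pred = np.zeros(100, dtype=int)
    print(binary_metrics(y_true, y_pred))   # accuracy 0.95 but TPR = 0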
6 Bias-variance tradeoff and regularization

6.1 Bias-variance tradeoff

Bias refers to the ability of a model to capture the true relationship between input-output pairs. Consider a set of points that lie on a parabola, y = x^2. If we fit a line to this set of points, the fit will always be bad, no matter how big the training set is. We can quantify the bias of a given model as the test set error when we have an infinite amount of test data. In our example, the test set error will be large, reflecting the mismatch between the model (a line) and the actual input-output relationship (a parabola).

Variance is a measure of how sensitive the model is to small changes in the training data. A model with high variance will be strongly affected if the training dataset is slightly perturbed, whereas a model with low variance would not be affected much by a small perturbation. Fitting a polynomial of high degree to a small training set corresponds to the high-variance scenario, while fitting the same polynomial to a large dataset would result in lower variance. A more detailed account is given in the bias-variance notebook on Canvas.

6.2 Loss Functions and Regularization Functions

We can control the bias-variance tradeoff to some extent by varying the number of model parameters, e.g., the polynomial degree in polynomial regression, but this provides relatively coarse control over model complexity (flexibility). Instead, we can control the complexity via a function of the model parameters. We now minimize

    J_λ(θ) = J(θ) + λ R(θ)

where J(θ) is a standard loss/cost function, R(θ) is a regularization function which measures model complexity, and λ > 0 is a hyper-parameter that controls the model complexity.

A commonly used regularizer is the L2 regularizer,

    R(θ) = \sum_{j=1}^{p} θ_j^2 = θ^T θ = ||θ||_2^2.

An alternative is the L1 regularizer,

    R(θ) = \sum_{j=1}^{p} |θ_j| = ||θ||_1,

which has the nice property that it induces sparsity in the solutions, i.e., some of the θ_j become zero for a high enough value of λ.
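For the L2 regularizer combined with the least-squares loss ("ridge regression"), the minimizer has a closed form via the regularized normal equations, (X^T X + λI)θ = X^T y. A minimal sketch on synthetic data (in practice the intercept is usually left unpenalized; that detail is omitted here):

    import numpy as np

    def ridge_fit(X, y, lam):
        """Minimize ||X theta - y||^2 + lam * ||theta||^2 in closed form."""
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    rng = np.random.default_rng(2)
    X = rng.normal(size=(50, 5))
    y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + 0.1 * rng.normal(size=50)

    for lam in (0.0, 1.0, 100.0):
        print(lam, np.round(ridge_fit(X, y, lam), 2))   # coefficients shrink as lam grows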
6.3 Decision Trees and Ensemble Methods

6.3.1 Decision Trees

Decision trees are hierarchical models of the relationship between input variables x and output variables y in the form of a tree. Although their accuracy as predictive models is generally lower than that of many other methods, decision trees have several advantages that make them worth a closer look: They are not sensitive to monotone scaling of data. They can handle both discrete and continuous input variables. They are not limited to two classes. They are not sensitive to outliers in the training data. They provide information about which input variables are most important. If the trees are not very large, it is possible to gain insight into how the predictions are obtained.

There are many types of decision trees in the literature, but the one we look at is a type of tree called CART (Classification And Regression Trees), where binary trees are used to divide the input space into regions and calculate predictions. Internal nodes correspond to tests of the form IF (x_j < t_r), where t_r is the threshold value of node r, and branches from the node correspond to yes/no outcomes of the test. The tests divide the input space into box-shaped (axis-parallel) regions, and leaves correspond to the prediction of the corresponding region. In classification, more than one leaf can correspond to the same class. Classification of an example x̂ with a tree T is obtained by starting at the root, performing the test there, and following the branches until reaching a leaf, which determines the prediction ŷ. We will focus on trees for classification, but trees for regression are obtained in a similar manner.

Remarks:

1. Instead of binary trees, m-ary trees (m > 2) are sometimes used, but such trees can always be converted into equivalent binary trees.

2. More complex tests than IF (x_k < t_r) are sometimes used in nodes; e.g., a linear combination of x-values might be used.

3. The following problem is NP-complete: Is there a decision tree of height h that correctly classifies all elements in X? As a result, greedy algorithms like CART used in practice do not necessarily deliver the smallest possible tree.

Example: A decision tree for predicting the creditworthiness of individuals. Let x_1 represent annual income and x_2 debt status. The output variable is y = 0 if the individual does not get a loan and y = 1 if they do. On the left side of Figure 13 is historical data for 17 individuals, in the middle is a decision tree that describes who gets a loan and who does not, and on the far right is shown how the tree divides the input space into box-shaped areas. The root divides the input space into two parts; as you go down the tree, you get an increasingly finer division of the input space.

Figure 13: Decision tree to determine whether a person should get a loan or not.

6.3.2 The CART Algorithm

The following algorithm repeatedly divides the dataset X into smaller parts in a greedy manner[11] by selecting decision variables and threshold values that minimize a cost function G called impurity.

1. Partition X into subsets X_L and X_R so that G is minimized. This gives the root of the tree.

2. Left and right subtrees connected to the root are obtained by recursively dividing X_L and X_R in the same way until a stopping condition is met.

[11] A greedy algorithm makes decisions without considering the effects they may have on subsequent decisions. Most greedy algorithms give suboptimal solutions; in our case we are, e.g., extremely unlikely to end up with the shortest possible decision tree.

The algorithm stops when all leaves contain only examples from the same class or when the number of examples in the leaves falls below a certain limit (a hyperparameter). In the latter case, a simple majority vote can determine the prediction for an example x̂, or the average in the case of regression.

How can we determine a split criterion of the form IF (x_j > t_r) at node r? First, assume that we have only two classes and that the set of examples arriving at r is X = {(x^(1), y^(1)), ..., (x^(n), y^(n))}. The Gini impurity of node r is defined as

    G(X) = q(1 − q)

where q denotes the fraction of examples from class 1 in X at r (the fraction of the remaining labels is 1 − q). Applying the split rule IF (x_j > t_r) will send a subset of the examples down the left branch and the remaining part down the right branch. The impurity of the split is obtained by weighting the impurity of the left and right child nodes with the fraction of examples that end up at the corresponding nodes,

    G_T(X) = (|X_L| / |X|) G(X_L) + (|X_R| / |X|) G(X_R)

where X_L and X_R denote the examples that go down the left and right subtrees, respectively, and |X| denotes the number of examples in X.

For K > 2 classes, the Gini impurity at node r is defined as

    G(X) = \sum_{l=1}^{K} q_l (1 − q_l)

where q_l is the fraction of examples with label l.

To find the best split at r, we iterate over the input variables j = 1, ..., p and threshold values t and select the (j, t) pair that minimizes the Gini impurity (a sketch of this search is given below). In the case when x_j can take only two different values t_1, t_2 (a binary feature), we compute the Gini impurity corresponding to each value (the tests are of the form IF (x_j = t_k)) and select the one that gives the lower impurity. If x_j is a categorical feature that takes more than two different values, e.g., a, b and c, we consider the impurity values corresponding to sending all examples with x_j = a down the left branch and all examples with x_j ≠ a down the right branch. This is then repeated for b and c, and the split that gives the lowest impurity is chosen.
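A minimal sketch of the Gini computation and the exhaustive split search for continuous features; the data are illustrative, and real implementations sort each feature once instead of re-scanning:

    import numpy as np

    def gini(labels):
        """Gini impurity: sum over classes of q_l * (1 - q_l)."""
        _, counts = np.unique(labels, return_counts=True)
        q = counts / counts.sum()
        return float(np.sum(q * (1.0 - q)))

    def best_split(X, y):
        """Search over (feature, midpoint threshold) pairs, as in CART."""
        best = (None, None, np.inf)
        n, p = X.shape
        for j in range(p):
            values = np.unique(X[:, j])
            for t in (values[:-1] + values[1:]) / 2.0:   # midpoints of sorted values
                left, right = y[X[:, j] < t], y[X[:, j] >= t]
                g = len(left) / n * gini(left) + len(right) / n * gini(right)
                if g < best[2]:
                    best = (j, t, g)
        return best

    X = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 1.0], [4.0, 2.0]])
    y = np.array([0, 0, 1, 1])
    print(best_split(X, y))   # -> (0, 2.5, 0.0): a pure split on feature 0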
Enumerating every category in this way becomes prohibitively expensive if a variable can take many different values. Then, it may pay off to use one-hot encoding and continuous variables instead.

If the input variable x_j is continuous, the corresponding values in X are sorted in ascending order, e.g., 1.01, 2.5, 3, 10, 10.5, 20, 21, 30, and the impurity is calculated for all threshold values t_m that correspond to the midpoints between adjacent values, e.g., t_1 = (1.01 + 2.5)/2 = 1.755, t_2 = 2.75, etc., and the split that gives the lowest impurity is chosen.

This algorithm tends to overfit X if the tree is allowed to grow until only one or a few examples fall into the same leaf, especially when the number of features is large compared to the number of examples. To counter this, one can prune the tree afterwards: leaves are then reduced by merging internal nodes so that the error of the classifier decreases, e.g., as estimated by cross-validation. Another option is to apply dimensionality reduction to the data before creating the tree (more on this later).

Figure 14 shows a tree classifier trained on a linearly separable dataset. The boundary consists of line segments parallel to the axes of the coordinate system.

Figure 14: Decision boundaries consist of line segments parallel to the axes of the coordinate system.

Decision trees are inherently discrete; they are a collection of IF-THEN-ELSE statements, and most of the algorithms in use are greedy. This means that small changes in input variables can cause large changes in the trees; in other words, decision trees are unstable (they have high variance). In contrast, methods such as logistic regression and k-NN are stable (if k is large enough).

Generalized Optimal Sparse Decision Trees (GOSDT) belongs to a group of recent tree-construction methods that aim at finding optimal or close-to-optimal decision trees by leveraging dynamic programming and various tricks to speed up the tree-building process. Experiments suggest that these methods provide classifiers with prediction accuracy close to state-of-the-art classifiers while providing interpretable models.

7 Ensemble Methods

Ensemble methods refer to techniques that combine predictions from multiple predictive models into one model. Surprisingly, even if the underlying models are "weak learners" in the sense that their accuracy is only slightly better than random guessing, the accuracy of the ensemble model can be very good. In regression, the prediction is obtained by taking a weighted average of the predictions of the individual models.

Example (one-dimensional regression): Measurements are prepared by adding noise to a sine wave (leftmost image). Individual predictive models are obtained by randomly selecting 10 points from the set and connecting them with line segments (middle image). It is clear that this model captures the underlying function poorly. However, by creating 1000 such models and letting the prediction correspond to the average of the predictions of the individual models, a fairly good model is obtained (red dots in the rightmost image).

Figure 15: One-dimensional dataset (left), weak learners (middle), average of 1000 weak learners (right).

In classification, the prediction can, e.g., be obtained by letting the individual models vote for a class, where the majority of votes decides class membership.
Example: Consider an ensemble classifier consisting of 25 binary classifiers, each with an error rate of ε = 0.35, slightly better than random guessing, and assume that the ensemble's prediction is obtained by majority voting. If the classifiers are all similar, the ensemble's error rate will be ε_ensemble = 0.35. If the classifiers are all independent, the error rate of the ensemble classifier is derived from the binomial distribution. Let the random variable Z denote the number of classifiers that are wrong. Then

    ε_ensemble = P(Z ≥ 13) = 1 − P(Z ≤ 12) = 1 − \sum_{k=0}^{12} \binom{25}{k} ε^k (1 − ε)^{25−k} ≈ 0.06.

The above example shows that the benefit of ensemble models depends on the individual predictive models being diverse. In practice, models are seldom independent, but efforts are made to limit the correlation between them. Ensemble methods are often used with decision trees, where the accuracy of the individual models is intentionally kept low, e.g., by making the trees very large (overfitting) or very small (tree stumps). Although the focus here is on decision trees, most of the methods are independent of the underlying predictive models. Random Forests, Extra Trees and AdaBoost are examples of models consisting of hundreds or thousands of trees. The accuracy of these methods is often state-of-the-art, at least in areas where deep neural networks have not (yet) taken over. On the downside, it is much harder to gain insight into the predictions, i.e., how the prediction ŷ is obtained from the input x̂, compared to, say, regular decision trees. To some extent, explainability is sacrificed for increased accuracy.

Ensemble methods can broadly be divided into two groups: methods where the individual models are built independently of each other, e.g., Bagging, Random Forests and Extremely Randomized Trees; and methods where model T_j depends on the models T_i, i < j, in some way, such as AdaBoost and Gradient Boosted Trees.

7.1 Extremely Randomized Trees

This method uses a random version of the CART algorithm to create N trees such that instead of choosing among all p input variables at node r, the choice is only between variables in a subset D ⊆ {x_1, x_2, ..., x_p}, where D is randomly selected in each node. The number of elements in D is a hyperparameter. Instead of determining the threshold value for variable j from the entire dataset as in CART, thresholds are randomly chosen between the largest and smallest values, speeding up the tree-building process. The split that minimizes the Gini impurity over the variables in D is then selected. A reasonable choice for the size of D is |D| = √p in classification and |D| = p in regression.

7.2 Bootstrap Aggregation

Bootstrap Aggregation ("bagging") relies on using N bootstrap samples from X, obtained by randomly selecting n elements from the set each time. In a bootstrap sample, therefore, an example (x^(i), y^(i)) may never appear, appear once, or appear more than once. Each bootstrap sample is used to train a classifier, e.g., a CART tree classifier, and predictions are obtained by majority voting among the trees (classification) or by averaging the predictions (regression). A bagging classifier based on CART typically offers significantly better accuracy than a single CART classifier.

Note that the proportion of examples from X that do not end up in a particular bootstrap sample is (1 − 1/n)^n. This proportion approaches 1/e ≈ 0.368 as n becomes large.
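A quick simulation of this proportion (n is an illustrative value):

    import numpy as np

    rng = np.random.default_rng(3)
    n = 10000

    # Draw one bootstrap sample (n draws with replacement) and measure
    # the fraction of the original examples that were never picked.
    sample = rng.integers(n, size=n)
    out_of_bag = 1.0 - len(np.unique(sample)) / n
    print(out_of_bag, 1.0 / np.e)    # both ~ 0.368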
Since these examples are not used in training that particular model, they can be used as a test set to estimate the model's error. The average error across all models then gives an estimate of the ensemble model's error, the out-of-bag error.

7.3 Random Forests

Random Forests is a tree classifier that uses bagging, but instead of searching for the best split among all p input variables, a random subset D is used as in Extra Trees, except that here the threshold values at the nodes are computed in the same way as in CART. Random Forests classifiers frequently outperform Bagging+CART and are often as accurate as nonlinear SVM classifiers, but Random Forests are somewhat easier to use. Like CART, Random Forests can provide an estimate of the importance of individual input variables (feature importance) and the out-of-bag error. Furthermore, the forest can be used for various data analyses, such as outlier detection and clustering.

It is often possible to get good results by tuning only a single hyperparameter in Random Forests and Extremely Randomized Trees, namely the size of the subset D. For small datasets, |D| = √p usually works well, but if the number of variables is very large, the value should be chosen by looking at the out-of-bag error, cross-validation error, or similar. In some cases it pays to limit the size of the individual trees, e.g., by increasing the number of examples that must end up in each leaf, which amounts to a kind of regularization of the individual trees.

For Extremely Randomized Trees, Bagging and Random Forests, accuracy increases with N up to a certain limit and then stabilizes. Therefore, it is crucial to use enough trees. Start, for example, with N = 100 and double the number of trees until the performance stops improving. There is no gain in carefully tuning the number of trees. It is clear that all three methods are particularly well suited for parallel processing, both in training and classification; the computation time scales linearly with the number of processors/cores.

7.4 Boosting (optional reading)

Boosting is a methodology that can be used in both classification and regression. It aims to boost weak models so that they become a single strong model. A collection of models g_1, ..., g_M is obtained by introducing weights w^(i) for the examples in the training set X. The weights change between iterations, usually so that in iteration m, examples that were incorrectly classified (were far from the correct value) in the previous iteration receive a higher weight than those that were correctly classified (had little prediction error). Schematically:

    X_1 = X → g_1(x)
    X_2     → g_2(x)
    ...
    X_M     → g_M(x)

where X_j is like X but with the individual examples weighted by the weights w^(i). The individual models also receive different weights: the more accurate models get larger weight and the less accurate get smaller weight. In the case of regression, the ensemble model becomes

    G(x) = \sum_{m=1}^{M} α_m g_m(x)

and in classification,

    G(x) = sign( \sum_{m=1}^{M} α_m g_m(x) )

where the weights α_1, ..., α_M are obtained from the algorithm.

7.5 AdaBoost

AdaBoost (1997) is the first classifier based on boosting that proved successful in practice. It is a binary classifier that works best with small decision trees. The algorithm is as follows:

    1. w^(i) = 1/n,  i = 1, ..., n
    2. Iterate m = 1, ..., M:
    3.   Determine the prediction model g_m(x) from X and w^(1), ..., w^(n)
    4.   ε_m = \sum_{i=1}^{n} w^(i) I(y^(i) ≠ g_m(x^(i))) / \sum_{i=1}^{n} w^(i)
    5.   α_m = log((1 − ε_m) / ε_m)
    6.   w^(i) = w^(i) exp(α_m I(y^(i) ≠ g_m(x^(i))))
    Return ((α_1, ..., α_M), (g_1(x), ..., g_M(x))).

The prediction for a point x̂ is given by ŷ = sign( \sum_{m=1}^{M} α_m g_m(x̂) ).
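A compact sketch of steps 1-6 with decision stumps as the weak models; scikit-learn's DecisionTreeClassifier is assumed available, the data are illustrative, and the labels are taken in {−1, +1} so that the sign-prediction works:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, M=50):
        """AdaBoost with decision stumps; labels y must be in {-1, +1}."""
        n = len(y)
        w = np.full(n, 1.0 / n)                          # step 1
        alphas, stumps = [], []
        for m in range(M):
            stump = DecisionTreeClassifier(max_depth=1)  # weak model (step 3)
            stump.fit(X, y, sample_weight=w)
            miss = stump.predict(X) != y
            eps = w[miss].sum() / w.sum()                # weighted error (step 4)
            if eps == 0.0:                               # perfect weak model: stop
                alphas.append(1.0); stumps.append(stump); break
            alpha = np.log((1.0 - eps) / eps)            # step 5
            w = w * np.exp(alpha * miss)                 # step 6: upweight mistakes
            alphas.append(alpha); stumps.append(stump)
        return alphas, stumps

    def adaboost_predict(alphas, stumps, X):
        scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
        return np.sign(scores)

    rng = np.random.default_rng(4)
    X = rng.normal(size=(300, 2))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)           # not axis-aligned: stumps are weak
    alphas, stumps = adaboost_fit(X, y, M=100)
    print((adaboost_predict(alphas, stumps, X) == y).mean())   # high training accuracy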
7.4 Boosting (optional reading)

Boosting is a methodology that can be used in both classification and regression. It aims to boost weak models so that together they form a single strong model. A collection of models $g_1, \ldots, g_M$ is obtained by introducing weights $w^{(i)}$ for the examples in the training set $X$. The weights change between iterations, usually so that in iteration $m$, examples that were incorrectly classified (or were far from the correct value) in the previous iteration receive a higher weight than those that were correctly classified (had little prediction error). Schematically:

$X_1 = X \to g_1(x)$
$\downarrow$
$X_2 \to g_2(x)$
$\downarrow$
$\cdots$
$\downarrow$
$X_M \to g_M(x)$

where $X_j$ is like $X$ but with the individual examples weighted by the weights $w^{(i)}$. The individual models also receive different weights: the more accurate models get a larger weight and the less accurate ones a smaller weight. In the case of regression, the ensemble model becomes

$$G(x) = \sum_{m=1}^{M} \alpha_m g_m(x)$$

and in classification,

$$G(x) = \operatorname{sign}\left( \sum_{m=1}^{M} \alpha_m g_m(x) \right)$$

where the weights $\alpha_1, \ldots, \alpha_M$ are obtained from the algorithm.

7.5 AdaBoost

AdaBoost (1997) was the first boosting-based classifier to prove successful in practice. It is a binary classifier (with labels $y \in \{-1, +1\}$) that works best with small decision trees. The algorithm is as follows:

1. $w^{(i)} = \frac{1}{n}$, $i = 1, \ldots, n$
2. Iterate $m = 1, \ldots, M$:
3. Determine the prediction model $g_m(x)$ from $X$ and $w^{(1)}, \ldots, w^{(n)}$
4. $\varepsilon_m = \frac{\sum_{i=1}^{n} w^{(i)} I(y^{(i)} \neq g_m(x^{(i)}))}{\sum_{i=1}^{n} w^{(i)}}$
5. $\alpha_m = \log\frac{1 - \varepsilon_m}{\varepsilon_m}$
6. $w^{(i)} = w^{(i)} \exp\left(\alpha_m I(y^{(i)} \neq g_m(x^{(i)}))\right)$

Return $((\alpha_1, \ldots, \alpha_M), (g_1(x), \ldots, g_M(x)))$. The prediction for a point $\hat{x}$ is given by $\hat{y} = \operatorname{sign}\left(\sum_{m=1}^{M} \alpha_m g_m(\hat{x})\right)$.

Remarks:

1. $\varepsilon_m$ is the weighted error of the classifier $g_m$, and the function $I(s)$ is 1 if $s$ is true and 0 otherwise. If $\varepsilon_m$ becomes 0, the iteration stops.
2. The number of weak models $M$ is a parameter that needs to be chosen, e.g., through cross-validation. If $M$ is too large, the ensemble classifier will overfit the data (unlike Random Forests and Extra Trees).
3. The model $g_m$ depends on $g_{m-1}$, which makes it difficult to parallelize the training procedure, unlike methods based on Bagging. Predictions, however, can easily be parallelized.
4. Although AdaBoost is independent of the type of classifier, it is most often used with tree classifiers. Small decision trees, particularly with $4 \leq J \leq 10$ leaves, perform especially well. Smaller trees reduce the likelihood of overfitting, but trees that are too small can also reduce the accuracy of the ensemble.
5. The method is inherently deterministic, unlike Random Forests and Extra Trees.
6. If the weak classifiers are CART decision trees, the weights are incorporated by scaling the Gini impurity calculations. If the weak classifiers cannot handle weighting individual examples differently, a random sample from $X$ can be drawn such that example $i$ is chosen with probability $w^{(i)} / \sum_{i=1}^{n} w^{(i)}$.

It can be shown that AdaBoost is equivalent to the following algorithm (see Elements of Statistical Learning for more details):

1. $g_0(x) = 0$
2. Iterate $m = 1, \ldots, M$:
3. Solve the minimization problem
$$(\beta_m, \gamma_m) = \operatorname{argmin}_{\beta, \gamma} \sum_{i=1}^{n} \ell\left(y^{(i)}, \; g_{m-1}(x^{(i)}) + \beta b(x^{(i)}; \gamma)\right)$$
4. $g_m(x) = g_{m-1}(x) + \beta_m b(x; \gamma_m)$

where $b(x; \gamma)$ denotes a base function (e.g., a tree), $\gamma$ are the corresponding parameters (e.g., the definition of the nodes and edges of a tree), and the loss function $\ell$ is the exponential loss $\ell(y, f(x)) = \exp(-y f(x))$. This particular loss function has the disadvantage that examples that are misclassified and lie far from the separation margin receive a relatively large weight in the optimization, making the method sensitive to outliers and incorrect labels ($y$ values). In the LogitBoost classifier, the exponential loss is replaced with the loss function of logistic regression (cross-entropy), which is not as sensitive to outliers.

In Gradient Boosting, the method of steepest descent is used to find an approximate solution to the above minimization problem, and regularization is introduced to reduce the risk of overfitting. Stochastic Gradient Boosting (SGB) is a variant of Gradient Boosting where a random sample of relative size $0 < \eta < 1$ from $X$ is used in each iteration (sampling without replacement, i.e., $x^{(i)}$ appears at most once in the sample). For this to work well, it appears necessary to use regularization at the same time. The parameters that need to be chosen in a typical boosting algorithm are: the number of weak models ($M$), the number of leaves ($J$), the strength of the regularization ($\lambda$), and the size of the random sample ($\eta$). Stochastic Gradient Boosting is implemented in the XGBoost and LightGBM packages.
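For concreteness, here is a minimal NumPy sketch of the AdaBoost algorithm from section 7.5, using decision stumps from scikit-learn as the weak models (the dataset and M = 50 are placeholders, not part of the notes):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    y = 2 * y - 1  # relabel to {-1, +1} so that sign() applies
    n, M = len(y), 50

    w = np.full(n, 1 / n)  # step 1: uniform weights
    alphas, models = [], []
    for m in range(M):  # step 2
        # Step 3: fit the weak model on the weighted examples.
        g = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = g.predict(X) != y
        eps_m = w[miss].sum() / w.sum()  # step 4: weighted error
        if eps_m == 0:  # remark 1: stop if a perfect model is found
            break
        alpha = np.log((1 - eps_m) / eps_m)  # step 5
        w = w * np.exp(alpha * miss)  # step 6: upweight the mistakes
        alphas.append(alpha)
        models.append(g)

    # Prediction: sign of the weighted vote.
    F = sum(a * g.predict(X) for a, g in zip(alphas, models))
    print("training accuracy:", np.mean(np.sign(F) == y))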
7.6 Stacking Models (optional reading)

Stacking models are layered ensemble models where the predictions from one layer are used as inputs to the model in the next layer. Stacking can be used in both classification and regression and can sometimes provide a slight improvement in accuracy compared to the individual models; it is therefore often used in competitions like those found on Kaggle. Here we examine a two-layer stacking model for classification, but from the discussion it should be clear how to obtain a similar model for regression.

In a two-layer stacking classifier, we train $M$ classifiers $L_1, \ldots, L_M$ in layer 1 on the set $X$. The predictions $\hat{y}_1, \ldots, \hat{y}_M$ from the first layer become the input variables $\tilde{X}$ of the classifier $L_S$ in layer 2,

$$\tilde{X} = \begin{bmatrix} | & | & & | \\ \hat{y}_1 & \hat{y}_2 & \cdots & \hat{y}_M \\ | & | & & | \end{bmatrix}.$$

This classifier learns to distinguish the more accurate prediction models from the less accurate ones. If the models in the first layer are very similar to each other, i.e., make similar mistakes, then there is little to be gained from blending their predictions. Stacking is thus most useful when the models in the first layer make different kinds of mistakes.

In training, $k$-fold cross-validation is used to calculate the values that go into $\tilde{X}$ in the following way. The first $\lfloor n/k \rfloor$ elements are set aside, model $L_i$ is trained on elements $\lfloor n/k \rfloor + 1, \ldots, n$, and predictions are calculated for the elements that were set aside. This gives the first $\lfloor n/k \rfloor$ values of $\hat{y}_i$ in $\tilde{X}$. Once all $k$ splits have been iterated over and all $M$ models have been trained in this way, an $n \times M$ matrix $\tilde{X}$ is obtained. The model $L_S$ is then trained on $(\tilde{X}, y_{\text{train}})$.

If a validation set is available, it is used to prepare an $n_{\text{val}} \times M$ matrix $\tilde{X}$ instead of using cross-validation, and $L_S$ is trained on $(\tilde{X}, y_{\text{val}})$. Note that it is not advisable to prepare $\tilde{X}$ directly from the training data: if any model in layer 1 overfits the training data, that model will dominate in $L_S$ and the accuracy of the stacking classifier will be low. Finally, the models in layer 1 are retrained on all the training data.

To calculate the prediction for a point $\hat{x}$, it is classified using the models $L_1, \ldots, L_M$. The predictions $\hat{y}_1, \ldots, \hat{y}_M$ (note that each $\hat{y}_i$ is a scalar) are fed into the classifier $L_S$, which returns the final prediction $\hat{y}$. To simplify training in layer 2, a logistic regression classifier, which requires little or no hyperparameter tuning, can be used. If hyperparameters of the classifiers in layer 1 are tuned with cross-validation, special attention must be paid to the risk of overfitting.
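scikit-learn's StackingClassifier automates this cross-validated construction of $\tilde{X}$; a minimal sketch (the choice of layer-1 models and cv = 5 is arbitrary):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=1000, random_state=0)

    stack = StackingClassifier(
        estimators=[("rf", RandomForestClassifier(random_state=0)),
                    ("svm", SVC(random_state=0))],  # layer 1: L_1, ..., L_M
        final_estimator=LogisticRegression(),       # layer 2: L_S
        cv=5,  # k-fold construction of the meta-features, as described above
    )
    print(cross_val_score(stack, X, y, cv=5).mean())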
8 Unsupervised Learning

In unsupervised learning we work with a dataset $X = \{x^{(1)}, \ldots, x^{(n)}\}$, usually with $x^{(i)} \in \mathbb{R}^p$. The dataset is unlabelled in the sense that there are no output variables associated with the $x$-values. We want to learn something about the structure or characteristics of $X$. Common tasks are dimensionality reduction and clustering. In clustering, the goal is to find clusters of points $x^{(i)}$ such that points within each cluster are similar in some way, while points in different clusters are dissimilar. In dimensionality reduction, the points in $X$ are mapped to a space $Y = \{y^{(1)}, \ldots, y^{(n)}\}$ where $y^{(i)} \in \mathbb{R}^q$ and $q < p$. The main reasons for performing dimensionality reduction are:

- Graphical representation of data ($q = 2$ or $q = 3$).
- Data compression (lossy compression).
- Noise filtering.
- Preprocessing for learning algorithms that are sensitive to a high number of dimensions (input variables).

8.1 Principal Component Analysis

In Principal Component Analysis (PCA), a subspace of dimension $q$ is found such that the data in $X$ lie approximately in that subspace.

Figure 16: Grades in Discrete Mathematics and Calculus.

Example: Let $x \in \mathbb{R}^2$ where $x_1$ represents students' grades in Discrete Mathematics and $x_2$ represents grades in Calculus (see Figure 16). The grades are highly correlated and we see that the data approximately lie in the subspace corresponding to the red line.

The first step in PCA is to shift the input variables so that the mean (over all examples) is zero:

$$\mu = \frac{1}{n} \sum_{i=1}^{n} x^{(i)}, \qquad x^{(i)} \leftarrow x^{(i)} - \mu.$$

As the input variables usually have different units, they are typically also scaled so that the standard deviation of each becomes one:

$$\sigma_j^2 = \frac{1}{n} \sum_{i=1}^{n} \left(x_j^{(i)}\right)^2, \qquad x_j^{(i)} \leftarrow \frac{x_j^{(i)}}{\sigma_j}.$$

The second step is sometimes omitted, e.g., in image processing when all pixels in each image have a similar standard deviation. The following derivation assumes that the data have been shifted and scaled.

If we take the variance of the data as a measure of their information content, it makes sense to look for a vector $u$ that maximizes the variance of the data after they have been projected onto $u$.

Figure 17: Example of projections.

Consider the two projections shown in Figure 17. The projection onto $u_1$ (left graph) preserves more of the variance in the original data than the projection onto $u_2$ (right graph). How do we find a vector like $u_1$?

Assume that $||u|| = 1$. The length of the projection of $x$ onto $u$ (see section 1 for the definition of the projection onto a vector) is then $\tilde{x} = x^T u$, and the variance of the data after the projection is

$$\frac{1}{n} \sum_{i=1}^{n} \left(\tilde{x}^{(i)}\right)^2 = \frac{1}{n} \sum_{i=1}^{n} \left((x^{(i)})^T u\right)^2 = \frac{1}{n} \sum_{i=1}^{n} u^T x^{(i)} (x^{(i)})^T u = u^T \left(\frac{1}{n} \sum_{i=1}^{n} x^{(i)} (x^{(i)})^T\right) u = u^T C u,$$

where the $p \times p$ matrix $C = \frac{1}{n} \sum_{i=1}^{n} x^{(i)} (x^{(i)})^T$ is called the covariance matrix of $X$. We determine $u$ by solving the maximization problem

$$\max_u \; u^T C u \quad \text{subject to} \quad ||u||^2 = 1.$$

Without the constraint the objective function would be unbounded, since for a given vector $u$ a larger objective value can always be obtained by multiplying $u$ by a constant $c > 1$.

Consider the general maximization problem with a single constraint (Figure 18),

$$\max_u \; f(u) \quad \text{subject to} \quad g(u) = 0.$$

It can be shown that if $f(u)$ is a concave function (i.e., $-f(u)$ is convex) and the constraint $g(u) = 0$ defines a convex set, then at the maximum point $u^*$ the following must hold:

$$\nabla f(u^*) = \lambda \nabla g(u^*), \qquad g(u^*) = 0,$$

where $\lambda$ is a new variable called a Lagrange multiplier.

Figure 18: Constrained optimization. The optimum lies on the curve $g(u) = 0$, where the gradients of $g$ and $f$ point in opposite directions.

With $f(u) = u^T C u$ and $g(u) = u^T u - 1$, it follows that the vector $u$ maximizing the variance satisfies $\nabla(u^T C u) = \lambda \nabla(u^T u - 1)$. The rules of matrix and vector differentiation give $2Cu = 2\lambda u$, i.e.,

$$Cu = \lambda u.$$

A vector $u$ satisfying this equation is called an eigenvector of the matrix $C$, and the corresponding value $\lambda$ is called an eigenvalue. There are $p$ eigenvectors, so we need to decide which one to choose. The variance after projection onto an eigenvector is

$$u^T C u = u^T (\lambda u) = \lambda (u^T u) = \lambda \cdot 1 = \lambda,$$

so we choose the eigenvector corresponding to the largest eigenvalue. From here on we assume that the eigenvectors and eigenvalues have been sorted in decreasing order of $\lambda$. It can be shown that to maximize the variance for a given value of $q$ (the dimension of the subspace), we should project $X$ onto the space spanned by the eigenvectors of the $q$ largest eigenvalues; this corresponds to solving the maximization problem above with the additional constraint that the vectors $u_1, \ldots, u_q$ are mutually orthogonal. After the projection, the coordinates of the points become

$$y^{(i)} = \begin{bmatrix} u_1^T x^{(i)} \\ \vdots \\ u_q^T x^{(i)} \end{bmatrix}, \qquad i = 1, \ldots, n.$$

Noise filtering and lossy compression are performed by projecting the data back into the $p$-dimensional space, simply omitting the contributions from the eigenvectors $q+1, \ldots, p$. The coordinates then become

$$\hat{x}_j^{(i)} = \sum_{k=1}^{q} y_k^{(i)} u_k^T e_j, \qquad j = 1, \ldots, p, \quad i = 1, \ldots, n,$$

where $e_j$ is the unit vector with a one in the $j$-th position and zeros elsewhere.

To select a suitable value for $q$, it can be helpful to plot the eigenvalues in decreasing order. Similarly, one can consider the ratio

$$\frac{\sum_{j=1}^{q} \lambda_j}{\sum_{j=1}^{p} \lambda_j}.$$

Example: Principal component analysis was performed on a small dataset with $n = 13$ and $p = 4$, and plots of the eigenvalues were created. On the left side of Figure 19 a "kink" can be seen at $\lambda_2$, indicating that little is gained by choosing $q$ larger than 2. The right-hand plot shows that the largest eigenvalue accounts for just over 86% of the variance in $X$, and the two largest eigenvalues together explain 98% of the variance.

Figure 19: Variance after projection.
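The whole procedure fits in a few lines of NumPy. A sketch under the assumption that the examples form the rows of X (the data here are random placeholders):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(13, 4))  # placeholder: n = 13 examples, p = 4 variables

    # Shift to zero mean and scale to unit standard deviation (1/n convention).
    X = (X - X.mean(axis=0)) / X.std(axis=0)

    # Covariance matrix and its eigendecomposition (C is symmetric).
    C = X.T @ X / len(X)
    lam, U = np.linalg.eigh(C)  # eigh returns eigenvalues in increasing order
    lam, U = lam[::-1], U[:, ::-1]  # sort in decreasing order instead

    q = 2
    Y = X @ U[:, :q]  # projected coordinates y^(i)
    X_hat = Y @ U[:, :q].T  # back-projection (noise filtering / compression)
    print("explained variance ratio:", lam[:q].sum() / lam.sum())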
8.2 Non-negative Matrix Factorization

In non-negative matrix factorization (NMF), it is assumed that all elements of the data matrix $X$ are greater than or equal to zero. Examples of data in this form include bitmap images, word frequencies in text documents, ratings given by consumers to products, and more. It is further assumed that the items we work with are stored as columns of $X$ rather than as rows (as before).

Let $X$ be an $n \times p$ matrix with $X_{ij} \geq 0$. Consider approximating $X$ with a matrix of rank $k < p$ of the form

$$X \approx WH,$$

where the weight matrix $W$ is an $n \times k$ matrix with $W_{ij} \geq 0$ and the coefficient matrix $H$ is a $k \times p$ matrix with $H_{ij} \geq 0$. Given that the value of $k$ is chosen appropriately, the matrices $W$ and $H$ can provide certain insights into the structure of the data. Note that column $j$ of $X$ can be written as

$$X_{:,j} \approx (WH)_{:,j} = H_{1j} W_{:,1} + H_{2j} W_{:,2} + \ldots + H_{kj} W_{:,k},$$

i.e., column $j$ of $X$ is approximately a linear combination of the $k$ columns of $W$, where column $\ell$ of $W$ receives the weight $H_{\ell j}$. The factorization is not uniquely determined.

For example, if the columns of $X$ correspond to news texts and the rows to the frequencies of individual words in the texts, the columns of $W$ can be interpreted as individual topics, e.g., environmental issues, technology, sports, etc. Column $j$ of $H$ then indicates how text $j$ is (approximately) composed of the $k$ topics; if the value $H_{\ell j}$ is small, topic $\ell$ plays a minimal role in document $j$. By examining the largest elements in the columns $W_{:,1}, \ldots, W_{:,k}$, one can see which words characterize each topic. For environmental issues, for example, it is likely that the largest elements correspond to words like sustainability, pollution, and warming.

NMF has been applied to bitmap images of faces, where the columns correspond to individual images and the rows to grayscale values of pixels. The columns of $W$ then correspond roughly to images of noses, ears, mouths, eyes, etc., and $H$ indicates how the individual faces are obtained by mixing these parts together.

To determine the matrices $W$ and $H$, one needs to define a cost function that measures the quality of the approximation $X \approx WH$. A simple measure of the distance between matrices $A$ and $B$ is the squared Euclidean (Frobenius) distance between their elements,

$$||A - B||_F^2 = \sum_i \sum_j (A_{ij} - B_{ij})^2.$$

The matrices $W$ and $H$ can then be determined by solving the optimization problem

$$\min_{W_{ij} \geq 0, \; H_{ij} \geq 0} ||X - WH||_F^2.$$

This optimization problem is challenging: it is not convex and has many local minima, so it is not realistic to seek the global minimum and we must be content with a local minimum instead. The problem is also constrained, since the elements of $W$ and $H$ cannot be negative. By examining the necessary conditions for minima in constrained optimization problems (the Karush-Kuhn-Tucker conditions), a simple iterative algorithm can be formulated that usually works well in practice.

Many other objective functions have been proposed for matrix factorization. For example, one can add regularization such as $\lambda(||W||_F^2 + ||H||_F^2)$ to the objective function, or $\lambda\left(\sum_{ij} |W_{ij}| + \sum_{ij} |H_{ij}|\right)$ to obtain a sparse solution.
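A minimal sketch with scikit-learn's NMF, which implements an iterative scheme of this kind (the matrix is random and merely stands in for a word-by-document matrix in the convention above; k = 5 and init="nndsvd" are arbitrary choices):

    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(0)
    X = rng.random((100, 30))  # placeholder: 100 words x 30 documents, all >= 0

    k = 5
    model = NMF(n_components=k, init="nndsvd", max_iter=500)
    W = model.fit_transform(X)  # 100 x k: word profile of each topic
    H = model.components_       # k x 30: topic weights of each document

    print("approximation error:", np.linalg.norm(X - W @ H, "fro"))
    # The words characterizing topic 0 are the largest entries of W[:, 0].
    print("top word indices, topic 0:", np.argsort(W[:, 0])[::-1][:5])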
9 Clustering

In clustering, the task is to find homogeneous groups (clusters) of examples, such that within each group all examples are similar to each other but different from those in other groups (see Figure 20).

Figure 20: An example of clustering.

The input to a clustering algorithm is the set $X$ and the output is a partition of the set into $K$ groups, $C = (c_1, \ldots, c_n)$, where $c_i$ is the number of the group to which $x^{(i)}$ belongs. A clustering method consists of:

1. A distance measure $d$ between points in $X$ such that $d(x^{(i)}, x^{(j)})$ is large if $x^{(i)}$ and $x^{(j)}$ are dissimilar and small otherwise.
2. A measure of the quality of a clustering.
3. An algorithm that maximizes the quality of the clustering.

A desirable property of clustering methods is that similar points end up in the same cluster and dissimilar points do not. This is clearly not the case in the clustering shown in Figure 21.

Figure 21: An example of questionable clustering.

Consider the quantity

$$T = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} d(x^{(i)}, x^{(j)}) = \frac{1}{2} \sum_{k=1}^{K} \sum_{i: c_i = k} \left( \sum_{j: c_j = k} d(x^{(i)}, x^{(j)}) + \sum_{j: c_j \neq k} d(x^{(i)}, x^{(j)}) \right) = W(C) + B(C),$$

where the first term, $W(C)$, measures how points are dispersed within clusters (within-cluster scatter) and the second term, $B(C)$, measures how points are dispersed between clusters (between-cluster scatter). We can write $W(C) = T - B(C)$. Since $T$ is a constant for a given $X$, minimizing $W(C)$ is equivalent to maximizing $B(C)$. This is a difficult combinatorial optimization problem (there are about $K^n$ different partitions), so approximation methods such as K-means are needed.

9.1 Clustering with K-means

The K-means algorithm assumes squared Euclidean distances between points, $d(x^{(i)}, x^{(j)}) = ||x^{(i)} - x^{(j)}||^2$. Then

$$W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{i: c_i = k} \sum_{j: c_j = k} ||x^{(i)} - x^{(j)}||^2.$$

Let $N_k$ denote the number of points in group $k$ and $z^{(k)}$ the centroid of group $k$,

$$z^{(k)} = \frac{1}{N_k} \sum_{i: c_i = k} x^{(i)}.$$

After some algebra we obtain

$$W(C) = \sum_{k=1}^{K} N_k \sum_{i: c_i = k} ||x^{(i)} - z^{(k)}||^2.$$

A local minimum of $W(C)$ is found with the following algorithm, called K-means, where $K$ denotes the number of clusters:

1. Choose centroids $z^{(1)}, \ldots, z^{(K)}$ at random.
2. Iterate until no changes occur between iterations:
   i) Assign each point $i = 1, \ldots, n$ to the group of the nearest centroid, $c_i = \arg\min_{k=1,\ldots,K} ||x^{(i)} - z^{(k)}||^2$.
   ii) Update the centroids, $z^{(k)} = \frac{1}{N_k} \sum_{i: c_i = k} x^{(i)}$, $k = 1, \ldots, K$.

The number of clusters $K$ is a hyperparameter, and the outcome depends largely on the choice of $K$ and on the initial centroids.
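A minimal NumPy sketch of the algorithm above (the data and K = 3 are placeholders; a practical implementation would also handle empty clusters and restart from several random initializations, since the result depends on the starting centroids):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 2))  # placeholder: n = 300 points in R^2
    K = 3

    # 1. Choose K distinct data points as the initial centroids.
    z = X[rng.choice(len(X), size=K, replace=False)]
    c = np.zeros(len(X), dtype=int)
    while True:
        # i) Assign each point to the group of the nearest centroid.
        dist = ((X[:, None, :] - z[None, :, :]) ** 2).sum(axis=2)  # n x K
        c_new = dist.argmin(axis=1)
        if np.array_equal(c, c_new):  # 2. stop when nothing changes
            break
        c = c_new
        # ii) Update each centroid as the mean of its assigned points.
        z = np.array([X[c == k].mean(axis=0) for k in range(K)])

    print("cluster sizes:", np.bincount(c, minlength=K))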