Machine Learning Notes - An Unofficial Guide

Document Details

Georgia Institute of Technology

2020

George Kudrayvtsev

Summary

This document is an unofficial guide to the Georgia Institute of Technology's CS7641: Machine Learning course, created by George Kudrayvtsev. It covers supervised, unsupervised, and reinforcement learning techniques, including topics like classification, regression, and computational learning theory. The guide is intended to help students learn Machine Learning concepts.

Full Transcript


Machine Learning
or: an Unofficial Guide to the Georgia Institute of Technology's CS7641: Machine Learning

George Kudrayvtsev
[email protected]
Last Updated: May 23, 2020

Creation of this guide was powered entirely by caffeine in its many forms. If you found it useful and are generously looking to fuel my stimulant addiction, feel free to shoot me a donation on Venmo @george_k_btw or PayPal [email protected] with whatever this guide was worth to you. Happy studying!

I'm just a student, so I can't really make any guarantees about the correctness of my content. If you encounter typos; incorrect, misleading, or poorly-worded information; or simply want to contribute a better explanation or extend a section, please raise an issue on my notes' GitHub repository.

Contents

I Supervised Learning
  0 Techniques
  1 Classification
    1.1 Decision Trees
      1.1.1 Getting Answers
      1.1.2 Asking Questions: The ID3 Algorithm
      1.1.3 Considerations
    1.2 Ensemble Learning
      1.2.1 Bagging
      1.2.2 Boosting
    1.3 Support Vector Machines
      1.3.1 There are lines and there are lines...
      1.3.2 Support Vectors
      1.3.3 Extending SVMs: The Kernel Trick
      1.3.4 Summary
  2 Regression
    2.1 Linear Regression
    2.2 Neural Networks
      2.2.1 Perceptron
      2.2.2 Sigmoids
      2.2.3 Structure
      2.2.4 Biases
    2.3 Instance-Based Learning
      2.3.1 Nearest Neighbors
  3 Computational Learning Theory
    3.1 Learning to Learn: Interactions
    3.2 Space Complexity
      3.2.1 Version Spaces
      3.2.2 Error
      3.2.3 PAC Learning
      3.2.4 Epsilon Exhaustion
    3.3 Infinite Hypothesis Spaces
      3.3.1 Intuition
      3.3.2 Vapnik-Chervonenkis Dimension
    3.4 Information Theory
      3.4.1 Entropy: Information Certainty
      3.4.2 Joint Entropy: Mutual Information
      3.4.3 Kullback-Leibler Divergence
  4 Bayesian Learning
    4.1 Bayesian Learning
      4.1.1 Finding the Best Hypothesis
      4.1.2 Finding the Best Label
    4.2 Bayesian Inference
      4.2.1 Bayesian Networks
      4.2.2 Making Inferences
      4.2.3 Naïve Bayes

II Unsupervised Learning
  5 Randomized Optimization
    5.1 Hill Climbing
    5.2 Simulated Annealing
    5.3 Genetic Algorithms
      5.3.1 High-Level Algorithm
      5.3.2 Cross-Over
      5.3.3 Challenges
    5.4 MIMIC
      5.4.1 High-Level Algorithm
      5.4.2 Estimating Distributions
      5.4.3 Practical Considerations
  6 Clustering
    6.1 Single Linkage Clustering
      6.1.1 Considerations
    6.2 k-Means Clustering
      6.2.1 Convergence
      6.2.2 Considerations
    6.3 Soft Clustering
      6.3.1 Expectation Maximization
      6.3.2 Considerations
    6.4 Analyzing Clustering Algorithms
      6.4.1 Properties
      6.4.2 How Many Clusters?
  7 Features
    7.1 Feature Selection
      7.1.1 Filtering
      7.1.2 Wrapping
      7.1.3 Describing Features
    7.2 Feature Transformation
      7.2.1 Motivation
      7.2.2 Principal Component Analysis
      7.2.3 Independent Component Analysis
      7.2.4 Alternatives

III Reinforcement Learning
  8 Markov Decision Processes
    8.1 Bellman Equation
    8.2 Finding Policies
    8.3 Q-Learning
  9 Game Theory
    9.1 Games
      9.1.1 Relaxation: Non-Determinism
      9.1.2 Relaxation: Hidden Information
      9.1.3 Prisoner's Dilemma
      9.1.4 Nash Equilibrium
      9.1.5 Summary
    9.2 Uncertainty
      9.2.1 Tit-for-Tat
      9.2.2 Folk Theorem
      9.2.3 Pavlov's Strategy
    9.3 Coming Full Circle
      9.3.1 Example: Grid World
      9.3.2 Generalization
      9.3.3 Solving Stochastic Games

Index of Terms

PART I: Supervised Learning

Our first minicourse will dive into supervised learning, which is a school of machine learning that relies on human input (or "supervision") to train a model. Examples of supervised learning include anything that has to do with labelling, and it occurs far more often than unsupervised learning. It's often reduced down to function approximation (think numpy's polyfit, for example): given enough predetermined pairs of (input, output)s, the trained model can eventually predict a never-before-seen input with reasonable accuracy.

An elementary example of supervised learning would be a model that "learns" that the dataset below represents f(x) = x².

    x    f(x)
    2    4
    9    81
    4    16
    7    49

Of course, there's no guarantee that this data really does represent f(x) = x². It certainly could just "look a lot like it." Thus, in supervised learning we will need to make a basal assumption about the world: that we have some well-behaved, consistent function behind the data we're seeing.

Contents
  0 Techniques
  1 Classification
  2 Regression
  3 Computational Learning Theory
  4 Bayesian Learning

Chapter 0: Techniques

Supervised learning is typically broken up into two main schools of algorithms. Classification involves mapping between complex inputs (like images of faces) and labels (like True or False, though they don't necessarily need to be binary) which we call "classes".
This is in contrast with regression, in which we map our complex inputs to an arbitrary, often-continuous, often-numeric value (rather than a discrete value that comes from a small set of labels). Classification leans more towards data with discrete values, whereas regression is more universal, being applicable to any numeric values.

Though data is everything in machine learning, it isn't perfect. Errors can come from a variety of places:

  - hardware (sensors, precision)
  - malicious intent (willful misrepresentation)
  - the human element (mistakes)
  - unmodeled influences

These hidden errors will factor into the resulting model if they're present in the training data. Similarly, they'll cause inaccuracies in evaluation if they're present in the testing data. We need to be careful about how accurately we "fit" the training data, ideally keeping it general enough to flourish on real-world data. A method for reducing this risk of overfitting is called cross-validation: we can use some of the training data as a "fake" testing set. The "Goldilocks zone" of training is between underfitting and overfitting, where the error across both the training data and the cross-validation data is relatively similar.

Chapter 1: Classification

    Science is the systematic classification of experience.
    — George Henry Lewes, Physical Basis of Mind

Breaking down a classification problem requires a number of important elements:

  - instances, representing the input data from which the overall model will "learn;"
  - the concept, which is the abstract concept that the data represents (hopefully representable by a well-formed function);
  - a target concept, which is the "answer" we want: the ability to classify based on our concept;
  - the hypotheses, which are all of the possible functions (ideally, we can restrict ourselves from literally all functions) we're willing to entertain that may describe our concept;
  - some input samples pulled from our instances and paired (by someone who "knows") with the correct output;
  - some candidate, which is a potential target concept; and
  - a testing set from our instances that our candidate concept has not yet seen, in order to evaluate how close it is to the ideal target concept.

1.1 Decision Trees

Decision trees are a form of classification learning (Quinlan '86). They are exactly what they sound like: trees of decisions, in which branches are particular questions (where each path down a branch represents a different answer to said question) based on the input, and leaves are final decisions to be made. A decision tree maps various choices to diverging paths that end with some decision.

To create a decision tree for a concept, we need to identify pertinent features that would describe it well. For example, if we wanted to decide whether or not to eat at a restaurant, we could use the weather, particular cuisine, average cost, atmosphere, or even occupancy level as features.

For a great example of "intelligence" being driven by a decision tree in popular culture, consider the famous "character guessing" AI Akinator. For each yes-or-no question it asks, each answer branches down a tree of further questions until it can make a confident guess. One could imagine the following (incredibly oversimplified) tree in Akinator's "brain:"

    Does your character really exist?
    ├─ Yes → Does your character play a sport?
    │        ├─ Yes → Steph Curry...
    │        └─ No  → ...
    └─ No  → Is your character an animal?
             ├─ Yes → Pumba
             └─ No  → Is your character's gender female?
                      ├─ Yes → Turanga Leela...
                      └─ No  → ...
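To make "branches are questions, leaves are decisions" concrete, here is a tiny sketch of the restaurant decision written out as a hand-built tree of nested questions. This is my own illustration, not course code, and the features (cost, cuisine, crowdedness) are just the hypothetical ones mentioned above.

    # A hand-built decision tree for the (hypothetical) restaurant example:
    # each `if` is a branch asking about one feature, each `return` is a leaf.
    def should_eat_out(cost: float, cuisine: str, crowded: bool) -> bool:
        if cost <= 10:                 # cheap places: almost always go
            return True
        if cuisine == "Mongolian":     # a favorite cuisine can override cost...
            return not crowded         # ...unless the place is packed
        return False                   # otherwise, just eat in

    # Example usage:
    print(should_eat_out(cost=8,  cuisine="Mexican",   crowded=True))   # True
    print(should_eat_out(cost=25, cuisine="Mongolian", crowded=False))  # True

The learning problem, covered next, is figuring out which questions to ask and in what order, rather than hand-writing them like this.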
It's important to note that decision trees are a representation of our features. Only after we've formed a representation can we start talking about the algorithm that will use the tree to make a decision.

The order in which we apply each feature to our decision tree should be correlated with its ability to reduce our space. Just like Akinator divides the space of characters in the world into fiction and non-fiction right off the bat, we should aim to start our decision tree with questions whose answers can sweep away swaths of finer decision-making. For our restaurant example, if we want to spend ≤ $10 no matter what, that would eliminate a massive number of restaurants immediately from the first question.

1.1.1 Getting Answers

The notion of a "best" question is obviously subjective, but we can make an attempt to define it with a little more mathematical rigor. Taking some inspiration from binary search, we could define a question as being good if it divides our data roughly in half. Regardless of our final decision (heh) regarding the definition of "best," the algorithm is roughly the same:

  1. Pick the "best" attribute.
  2. Ask the question.
  3. Follow the answer path.
  4. If we haven't hit a leaf, go to Step 1.

Of course, the flaw is that we want to learn our decision tree based on the data. The above algorithm is for using the tree to make decisions. How do we create the tree in the first place? Do we need to search over the (massive) space of all possible decision trees and use some criteria to filter out the best ones? Given n boolean attributes, there are 2^n possible ways to arrange the attributes, and 2^(2^n) possible answers (since there are 2 different decisions for each of those 2^n arrangements)... we probably want to be a little smarter than that.

1.1.2 Asking Questions: The ID3 Algorithm

If we approach the feature-ranking process greedily, a simple top-down approach emerges (ID3 Algorithm, Udacity):

  1. A ← the best attribute.
  2. Assign A as the decision attribute (the "question" we're asking) for the particular node n we're working with (initially, this would be the tree's root node).
  3. For each value v ∈ A, create a branch from n.
  4. Lump the training examples that correspond to the particular attribute value, v, into their respective branch.
  5. If the examples are perfectly classified with this arrangement (that is, we have one training example per leaf), we can stop. Otherwise, repeat this process on each of these branches.

The "information gain" from a particular attribute A can be a good metric for qualifying attributes. Namely, we measure how much the attribute can reduce the overall entropy:

    Gain(S, A) = Entropy(S) − Σ_{v∈A} (|S_v| / |S|) · Entropy(S_v)        (1.1)

where the entropy is calculated based on the probability of seeing values in A:

    Entropy(A) = − Σ_{v∈A} Pr[v] · log₂ Pr[v]                             (1.2)

These concepts come from Information Theory, which we'll dive into more when we discuss randomized optimization algorithms in chapter 5; for now, we should just think of this as a measure of how much information an attribute gives us about a system. Attributes that give a lot of information are more valuable, and should thus be higher on the decision tree. Then, the "best attribute" is the one that gives us the maximum information gain: max_A Gain(S, A), taken over all available attributes A.

Inductive Bias

There are two kinds of bias we need to worry about when designing any classifier:

  - restriction bias, which automatically occurs when we decide our hypothesis set, H.
    In this case, our bias comes from the fact that we're only considering functions that can be represented with a decision tree.
  - preference bias, which tells us what sort of hypotheses from our hypothesis set, h ∈ H, we prefer.

The latter of these is at the heart of inductive bias. Which decision trees, out of all of the possible decision trees in the universe that can represent our target concept, will the ID3 algorithm prefer?

Splits: Since it's greedily choosing the attributes with the most information gain from the top down, we can confidently say that it will prefer trees with good splits at the top.

Correctness: Critically, the ID3 algorithm repeats until the labels are correctly classified. And though it may be obvious, it's still important to note that it will hence prefer correct decision trees to incorrect ones.

Depth: This arises naturally out of the top-heavy split preference, but again, it's still worth noting that ID3 will prefer trees that are shallower or "shorter."

1.1.3 Considerations

Asking (Continuous) Questions

The careful reader may have noticed the explicit mention of branching on attributes based on every possible value of an attribute: v ∈ A. This is infeasible for many features, especially continuous ones. For our earlier restaurant example, we may have discretized our "cost" feature into one, two, or three dollar signs (à la Yelp), but what if we wanted to keep it as a raw average dish dollar value instead?

Well, if we're sticking to the "only ask Boolean questions" model, then binning is a viable approach. Instead of making decisions based on a precise cost, we instead make decisions based on a place being "cheap," which we might subjectively define as cost ∈ [0, 10), for example.
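As a concrete illustration of how ID3's greedy step scores attributes, here is a small sketch of the entropy and information-gain calculations from equations (1.1) and (1.2). The code and the toy restaurant dataset are my own, not the course's.

    import math
    from collections import Counter

    def entropy(labels):
        # Entropy (eq. 1.2) of a list of class labels, in bits.
        counts = Counter(labels)
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def information_gain(rows, labels, attribute):
        # Gain(S, A) (eq. 1.1): how much splitting on `attribute` reduces entropy.
        total = len(labels)
        gain = entropy(labels)
        for v in set(row[attribute] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attribute] == v]
            gain -= (len(subset) / total) * entropy(subset)
        return gain

    # Tiny hypothetical dataset: restaurant features -> "go?" label.
    rows = [
        {"cost": "$",  "weather": "sunny"},
        {"cost": "$",  "weather": "cloudy"},
        {"cost": "$$", "weather": "sunny"},
        {"cost": "$$", "weather": "cloudy"},
    ]
    labels = [True, True, False, False]

    # ID3 would greedily pick the highest-gain attribute ("cost" here).
    print(information_gain(rows, labels, "cost"))      # 1.0 bit
    print(information_gain(rows, labels, "weather"))   # 0.0 bits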
If we bubble up the decisions down the branch of a tree back up to its parent node, then prune the branch entirely, we can avoid overfitting. Of course, we’d need to make sure that the generalized decision does not increase our training error by too much. For example, if a cost-based branch had all of its children branches based on weather, and all but one of those resulted in the go-ahead to eat, we could generalize and say that we should always eat for the cost branch without incurring a very large penalty for the one specific “don’t eat” case. Adapting to Regression Decision trees as we’ve defined them here don’t transfer directly to regression prob- lems. We no longer have a useful notion of information gain, so our approach at at- tribute sorting falls through. Instead, we can rely on purely statistical methods (like variance and correlation) to determine how important an attribute is. For leaves, too, we can do averages, local linear fit, or a host of other approaches that mathematically generalize with no regard for the “meaning” of the data. 1.2 Ensemble Learning An ensemble is a fancy word for a collective that works together. In this section, we’re going to discuss combining learners into groups who will each contribute to the hypothesis individually and essentially corroborate towards the answer. Ensemble learning is powerful when there are features that may slightly be indicative of a result on their own, but definitely inconclusive, whereas a combination of some of the rules is far more conclusive. In essence, when the whole is greater than the sum of its parts. For example, it’s hard to consider an email containing the word “money” as spam, but containing a sketchy URL, the word “money,” and having mispelled words might be far more indicative. The general approach to ensemble learning algorithms is to learn rules over smaller subsets of the training data, then combine all of the rules into a collective, smarter decision-maker. A particular rule might apply well to a subset (such as a bunch of spam emails all containing the word “Viagra”), but might not be as prevalent in the whole; hence, each weak learner picks up simple rules that, when combined with the other learners, can make more-complex inferences about the overall dataset. This approach begs a few critical questions: we need to determine how to pick subsets and how to combine learners. Kudrayvtsev 12 MACHINE LEARNING Supervised Learning 1.2.1 Bagging Turns out, simply making choosing data uniformally randomly to form our subset (with replacement) works pretty well. Similarly-simply, combining the results with an average also works well. This technique is called boostrap aggregation, or more-commonly bagging. The reason why taking the average of a set of weak learners trained on subsets of the data can outperform a single learner trained on the entire dataset is because of overfitting, our mortal fear in machine learning. Overfitting a subset will not overfit the overall dataset, and the average will “smooth out” the specifics of each individual learner. 1.2.2 Boosting We must be able to pick subsets of the data a little more cleverly than randomly, right? The basic idea behind boosting is to prefer data that we’re not good at analyzing. We essentially craft learners that are specifically catered towards data that previous learners struggled with in order to form a cohesive picture of the entire dataset. Initially, all training examples are weighed equally. 
Then, in each “boosting round,” we find the weak learner that achieves the lowest error. Following that, we raise the weights of the training examples that it misclassified. In essence, we say, “learn these better next time.” Finally, we combine the weak learners from each step into our final learner with a simple weighted average: weight is directly proportional to accuracy. (a) Our set of training examples (b) Reweighing the error examples (c) Final classifier is a combination and the first boundary guess. and trying another boundary. of the weak learners. Figure 1.1: Iteratively applying weak learners to differentiate between the red and blue classes while boosting mistakes in each round. Let’s define this notion of a learner’s “error” a little more rigorously. Previously, when we pulled from the training set with uniform randomness, this was easy to define. The number of mismatches M from our model out of the N -element subset meant an error of M/N. However, since we’re now weighing certain training examples differently (incorrect =⇒ likelier to get sampled), our error is likewise different. Shouldn’t we punish an incorrect result on a data point that we are intentionally trying to learn more-so than an incorrect result that a dozen other learners got correct? Kudrayvtsev 13 CHAPTER 1: Classification Example 1.1: Understanding Check: Training Error Suppose our training subset is just 4 values, and our weak learner H(x) got two of them correct: x1 x2 x3 x4 ×  ×  What’s our training error? Trivially 1/2, you might say. Sure, but what if the probability of each xi being chosen for this subset was different? Suppose x1 x2 x3 x4 ×  ×  D: 1/2 1/20 2/5 1/20 Now what’s our error? Well getting x1 wrong is a way bigger deal now, isn’t it? It barely matters that we got x2 and x4 correct... So we need to weigh each incorrect answer accordingly: 1 2 1 1 9 ε= + =1− − = |2 {z 5} |20 {z 20} 10 incorrects corrects To drive the point in the above example home, we’ll now formally define our error as the probability of a learner H not getting a data point xi correct over the distribution of xi s. That is, εi = Pr [H(xi ) 6= yi ] D Our notion of a weak learner—a term we’ve been using so far to refer to a learner that does well on a subset of the training data—can now likewise be formally defined: it’s a learner that performs better than chance for any distribution of data (where ε is a number very close to zero): ∀D : Pr [·] ≤ 1/2 − ε D Note the implication here: if there is any distribution for which a set of hypotheses can’t do better than random chance, there’s no way to create a weak learner from those hypotheses. That makes this a pretty strong condition, actually, since you need a lot of good hypotheses to cover the various distributions. Boosting at a high level can be broken down into a simple loop: on iteration t, construct a distribution Dt , and find a weak classifier Ht (x) that minimizes the error over it. Then after the loop, combine the weak classifiers into a stronger one. Kudrayvtsev 14 MACHINE LEARNING Supervised Learning Specific boosting algorithms vary in how they perform these steps, but a well-known one is the adaptive boosting (or AdaBoost) algorithm outlined in algorithm 1.1. Let’s dig into its guts. AdaBoost From a very high level, this algorithm follows a very human approach for learning: The better we’re doing overall, the more we should focus on individual mistakes. 
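Example 1.1 above can also be checked mechanically. Here is a tiny sketch (my own illustration, not course code) of the weighted training error ε = Pr_D[H(x_i) ≠ y_i], using the numbers from that example:

    def weighted_error(predictions, truths, distribution):
        # Boosting-style error: total probability mass of the misclassified examples.
        return sum(d for p, y, d in zip(predictions, truths, distribution) if p != y)

    # The setup from Example 1.1: four samples, with x1 and x3 misclassified.
    predictions = [0, 1, 0, 1]
    truths      = [1, 1, 1, 1]

    # Under a uniform distribution, the error is simply 2/4.
    print(weighted_error(predictions, truths, [0.25] * 4))               # 0.5

    # The skewed distribution D = (1/2, 1/20, 2/5, 1/20) weighs the mistakes
    # on x1 and x3 far more heavily: 1/2 + 2/5 = 9/10.
    print(weighted_error(predictions, truths, [0.5, 0.05, 0.4, 0.05]))   # 0.9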
We return to the world of classification, where our training set maps from a feature Freund & vector to a “correct” or “incorrect” label, yi ∈ {−1, 1}: Schapire, ‘99 X = {(x1 , y1 ), (x2 , y2 ),... , (xn , yn )} Determining Distribution We start with a uniform distribution, so D1 (i) = 1/n. Then, the probability of a sample in the next distribution is weighed by its “correct- ness”: Dt (i) 1 1 − εt Dt+1 (i) = · exp(−αt yi Ht (xi )) where αt = ln zt 2 εt Uh, scrrrrrr... let’s break this down piece by piece. The leading fraction is the previous probability scaled by a normalization factor zt that is necessary to keep Dt+1 a proper probability distribution, so we can basically ignore it.1 Notice the clever trick here given our values for yi : when the classification is correct, yi = Ht (xi ) and their product is 1; when it’s incorrect, their product is always −1. Because α > 0 (shown later in the detailed math aside), the −α will always flip the sign of the “correctness” result. Thus, we’re doing e−α when the learner agrees and eα otherwise. A negative exponential is a fraction, so we should expect the whole term to make Dt (i) decrease when we get it right and increase when we get it wrong: ( ↑ if H(·) is incorrect Dt+1 (i) =⇒ ↓ otherwise On each iteration, our probability distribution adjusts to make D favor incorrect answers so that our classifier H can learn them better on the next round; it weighs incorrect results more and more as the overall model performance increases. 1 We don’t dive into this in the main text for brevity, but here zt would be the sum of the pre- normalized weights. In an algorithm (like in algorithm P 1.1), you might first calculate a D0t+1 that 0 didn’t divide any terms by zt , then calculate z = d∈D d and do Dt+1 (i) = Dt+1 (i)/z at the end. Kudrayvtsev 15 CHAPTER 1: Classification Deng, ‘07 Finding the Weak Classifier Notice that we kind-of glossed over determining Ht (x) for any given round of boosting. Weak learners encompass a large class of learners; a simple decision tree could be a weak learner. All it has to do is guarantee performance that is slightly better than random chance. Final Hypothesis As you can see in algorithm 1.1, the final classifier is just a weighted average of the individual weak learners, where the weight of a learner is its respective α. And remember, αt is in terms of εt , so it measures how well the tth round went overall; thus, a good round is weighed more than a bad round. The beauty of ensemble learning is that you can combine many simple weak classifiers that individually hardly do better than chance together into a final classifier that performs really well. Quick Maffs: Boosting, Beyond Intuition Let’s take a deeper look at the definition of Dt+1 (i) to understand how the probability of the ith sample changes. Recall that in our equation, the exponential simplifies to e∓α depending on whether H(·) guesses yi correctly or incorrectly, respectively. 
Dt (i) 1 1 − εt Dt+1 (i) = · exp(−αt yi Ht (xi )) where αt = ln zt | {z } 2 εt e∓α Let’s look at what eα simplifies to:   α 1 1−ε e = exp · ln 2 ε "   1 !# 1−ε 2 = exp ln power rule of logarithms ε r 1−ε recall that ln x = loge x = and aloga n = n ε (it should be obvious now why we said α ≥ 0) We can follow the same reasoning for e−α and get a flipped result: 1 − ε − /2     1  1/2 r −α 1 1−ε ε ε e = exp − · ln = = = 2 ε ε 1−ε 1−ε So we have two general outcomes depending on whether or not the weak Kudrayvtsev 16 MACHINE LEARNING Supervised Learning √ learner classified xi correctly (dropping the · for simplicity of analysis): 1−ε    if H(·) was wrong 2 f (ε) = exp (−αt yi Ht (xi )) = ε ε   if H(·) was right 1−ε Remember that εt is the total error of all incorrect answers that Ht gave; it’s a sum of probabilities, so 0 < ε < 1. But note that H is a weak learner, so it must do better than random chance, so in fact 0 < ε < 21. The functions are plotted in Figure 1.2; from them, we can draw some straightforward conclusions: 5 4 3 f (ε) 2 right 1 wrong 0 0 0.2 0.4 0.6 ε Figure 1.2: The two ways the exponential can go when boosting, depending on whether or not the classifier gets sample i right ( 1−ε ε , ε in red) or wrong ( 1−ε , in blue). When our learner is incorrect, the weight of sample i increases ex- ponentially as we get more and more confident (ε → 0) in our model. When our learner is correct, the weight of sample i will decrease (relatively) proportionally with our overall confidence. To summarize these two points in different words: a learner will always weigh incorrect results more than correct results, but will focus on incorrect results more and more as the learner’s overall performance increases. Considerations: Overfitting Boosting is a robust method that tries very hard to avoid overfitting. Testing per- formance often mimics training performance pretty closely. Why is this the case? Kudrayvtsev 17 CHAPTER 1: Classification Though we’ve been discussing error at length up to this point, and how minimizing error has been our overarching goal, it’s worth discussing the idea of confidence, as well. If, for example, you had a 5-nearest neighbor in which three neighbors strongly voted one way and two neighbors voted another way, you might have low error but also low confidence, relative to a scenario where all five voted the same way. Because boosting is insecure and suffers from social anxiety, it tries really hard to be confident and this lets it avoid overfitting. Once a boosted ensemble reaches a state at which it has low error and can separate positive and negative examples well, adding more and more weak learners will actually continue to spread the gap (similar to support vector machines which we’re about to discuss with respect to margins) at the boundary. −1 0 +1 −1 0 +1 However, nobody’s perfect. Boosting still tends to overfit in an oddly-specific case Dr. Isbell highlights: when its underlying weak learner uses a complex artificial neural network (one with many nodes and layers). More generally, if the underlying learners all overfit and can’t stop overfitting (like a complex neural net tends to do), boosting can’t do anything to prevent that. Boosting also suffers under pink noise—uniform noise—and tends to overfit.2 1.3 Support Vector Machines Let’s return to the notion of a dataset being linearly-separable. From the standpoint of human cognition, finding a line that cleanly divides two colors is pretty easy. 
In general when we’re trying to classify something into one category or another, there’s no reason for us to consider anything but the specific details that truly separate the two categories. We hardly pay attention to the bulk of the data, focusing on what defines the boundary between them. This is the motivation behind support vector machines. I covered SVMs briefly in my notes on computer vision. For a gentler introduction to the topic, refer there. This section will have a lot more assumed knowledge so that I can avoid repeating myself. Now, which of the green dashed lines below “best” separates the two colors? 2 “White noise” is Gaussian noise. Kudrayvtsev 18 MACHINE LEARNING Supervised Learning They’re all correct, but why does the middle one “feel” best? Aesthetics? Maybe. But more likely, it’s because the middle line does the best job at separating the data without making us commit too much to it. It leaves the biggest margin of potential error if some hidden dots got revealed. A support vector machine operates on this exact notion: it tries find the boundary that will maximize the margin from the nearest data points. The optimal margin lines will always have some special points that intersect the dashed lines: These points are called the support vectors. Just like a human, a support vector machine reduces computational complexity by focusing on examples near the bound- aries rather than the entire data set. So how can we use these vectors to maximize the margin? 1.3.1 There are lines and there are lines... Let’s briefly note that a line in 2D is typically represented in the form y = mx + b. However, we want to generalize to n dimensions. The standard form of a line in 2D is defined as: ax + by + c = 0, where a, b, c ∈ Z and a > 0. From this, we can imagine a “compact” representation of the line that only uses its constants, so if we want the original equation back, we can dot this vector Kudrayvtsev 19 CHAPTER 1: Classification with x y 1 :       a b c · x y 1 =0 Thus if we let s = a b c and w = x y 1 , then our line can be expressed simply     as: wT s = 0. This lets us representing a line in vector form and use any number of dimensions. Our w defines the parameters of the (hyper)plane; notice that here, w ⊥ the xy-plane. If we want to stay reminiscent of y = mx + b, we can drop the last term of w and use the raw constant: wT s + c = 0. 1.3.2 Support Vectors Similar to what we did with boosting, we’ll say that when y ≥ 1, the input value x was part of the class, whereas if y ≤ −1 it wasn’t (take care to differentiate the fact that y is the label now rather than something related to the y axis). Then a line in our “label space” is of the form y = wT x + b. What’s the output, then, of the line that divides the two classes? Well it’s exactly between all of the 1s and the -1s, so it must be the line wT x + b = 0. Similarly, the two decision boundaries are exactly ±1. Note that we don’t know w or b yet, but know we want to maximize d: wT x + b = 1 wT x + b = 0 wT x + b = −1 d Well if we call the two green support vectors above x1 and x2 , what’s the distance between them? Well, wT x1 + b = 1 −(wT x2 + b = −1) wT (x1 − x2 ) = 2 2 ŵT (x1 − x2 ) = kwk Kudrayvtsev 20 MACHINE LEARNING Supervised Learning Thus we want to maximize M = kwk 2 (M for margin), while also classifying all of our data points correctly. Mathematically-speaking, maximization is harder than mini- mization (thank u calculus), so we’re actually better off optimizing for min 12 kwk2. 
So if we define yi ∈ {−1, 1} for every training sample as we did when Boosting, then we arrive at a standard quadratic optimization problem: 1 Minimize: kwk2 2 Subject to: yi (wT xi + b) ≥ 1 which is a well-understood, always-solveable optimization problem whose solution is just a linear combination of the support vectors: X w= α i y i xi i where the αi s are “learned weights” that are only non-zero at the support vectors. Any support vector i is defined by yi = wT xi + b, so: X yi = αi yi xTi x + b = ±1 i | {z } wT ! X We can use this to build our classification function: f (x) = sign αi yi xTi x + b i Note the highlighted box: the entirety of the classification depends only on this dot product between some “new point” x and our support vectors xi s. The dot product is a metric of similarity: here, it’s the projection of each xi onto the new x, but it doesn’t have to be... 1.3.3 Extending SVMs: The Kernel Trick We’ve been working with nice, neat 2D plots and drawing a line to separate the data. This begs the question: what if the data isn’t linearly-separable? The answer to this question is the source of power of SVMs: we’ll use it to find separation boundaries between our data points in higher dimensions than our features provide. Obviously, we can find the optimal separator between the following group of points: 0 x But what about these? Kudrayvtsev 21 CHAPTER 1: Classification 0 x No such luck this time. But what if we mapped them to a higher-dimensional space? For example, if we map these to y = x2 , a wild linear separator appears! x2 x Figure 1.3: Finding a linear separator by mapping to a higher-dimensional space. This seems promising... how can we find such a mapping (like the arbitrary x 7→ x2 above) for other feature spaces? Notice that we added a dimension simply by manipulating the representation of our features. Let’s generalize this idea. We can call our mapping function Φ; it maps the xs in our feature space to another higher-dimensional space ϕ(x), so Φ : x 7→ ϕ(x). And recall the “similarity metric” in our classification function; let’s isolate it and define K(a, b) = aT b. Then, we have ! X f (x) = sign αi yi K(xi , x) + b i The kernel trick here is simple: it states that if there exists some Φ(·) that can represent K(·) as a dot product, we can actually use K in our linear classifier. We don’t actually need to find, define, or care about Φ, it just needs to exists. And in practice, it does exist for almost any function we can think of (though I won’t offer an explanation on how). That means we can apply almost any K and it’ll work out. For example, if our 2D dataset was separable by a circular boundary, we could use K(a, b) = (aT b)2 in our classifier and it would actually find the boundary, despite it not being linearly-separable in two dimensions. Soak that in for moment... we can almost-arbitrarily define a K that represents our data in a different way and it’ll still find a boundary that just so happens to be linear in a higher-dimensional space. Kudrayvtsev 22 MACHINE LEARNING Supervised Learning Example 1.2: A Simple Polynomial Kernel Function Let’s work through a proof that a particular kernel function is a dot product in some higher-dimensional space. Remember, we don’t actually care about what that space is when it comes to applying the kernel; that’s the beauty of the kernel trick. We’re working through this to demonstrate how you would show that some kernel function does have a higher-dimensional mapping.   x We have 2D vectors, so x = 1. 
x2 Let define the following kernel function: K(xi , xj ) = (1 + xTi xj )2 This is a simple polynomial kernel; it’s called that because we are creating a polynomial from the dot product. To prove that this is a valid kernel function, we need to show that K(xi , xj ) = ϕ(xj )T ϕ(xj ) for some ϕ. K(xi , xj ) = (1 + xTi xj )2         xj1   xj1 = 1 + xi1 xi2 1 + xi1 xi2 expand xj2 xj2 = 1 + x2i1 x2ij + 2xi1 xj1 xi2 xj2 + x2i2 x2j2 + 2xi1 xj1 + 2xi2 xj2 multiply it all out   1  x2  √ j1  √ √ √   2xj1 xj2   rewrite it as a = 1 x2i1 2xi1 xi2 x2i2  2xi1 2xi2    √x2j2  vector product     √2xj1  2xj2 At this point, we can see something magical and crucially important: each of the vectors only relies on terms from either xi or xj ! That means it’s a... wait for it... dot product! We can define ϕ as a mapping into this new 6-dimensional space: √ √ √ T ϕ(x) = 1 x21 2x1 xn2 x22  2x1 2x2 Which means now we can express K in terms of dot products in ϕ, exactly as we wanted: K(xi , xj ) = ϕ(xi )T ϕ(xj )  The choice of K(·) is much like the choice of d(·) in k-nearest neighbor: it encodes domain knowledge about the data in question that can help us classify it better. Some common kernels include polynomial kernels (like the one in the example) and Kudrayvtsev 23 CHAPTER 1: Classification the radial basis kernel which is essentially a Gaussian: K(a, b) =... = (aT b + c)p (polynomial) ! ka − bk2 = exp − (radial basis) 2σ 2 1.3.4 Summary In general, we’ve come to the conclusion that finding the linear separator of a dataset with the maximum margin is a good way to generalize and avoid overfitting. SVMs do this with a quadratic optimization problem to find the support vectors. Finally, we discussed how the kernel trick lets us find these linear separators in an arbitrary n-dimensional space by projecting the data to said space. Kudrayvtsev 24 MACHINE LEARNING Supervised Learning Algorithm 1.1: A simplified AdaBoost algorithm. Note that yi ∈ {−1, 1}, so we’re working with classification here, but it can just as easily be adapted to regression. (this algorithm comes from my notes on computer vision) Input: X = {x1 , x2 ,... , xm } and Y = {y1 , y2 ,... , ym }, a training set mapping both positive and negative training examples to their corresponding labels: xi 7→ yi. Input: H(·): a weak classifier type. Result: A boosted classifier, H ∗.   ŵ = 1/m 1/m... // a uniform weight distribution t≈0 // some small threshold value close to 0 foreach training stage j ∈ [1..n] do ŵ = w/kwk hj = H(X , Y, ŵ) // train a weak learner for the current weights ∀wi ∈ ŵ where hj (xi ) 6= yi PM εj = wi  i=0 1 1−ε αj = 2 ln εj j if ε > t then wi = wi · exp (−yi αj hj (xi )) ∀wi ∈ w else break end h PM i H ∗ (x) := sign j=0 αj hj (x) return H ∗ Kudrayvtsev 25 Regression We stand the risk of regression, because you refused to take risks. So life demands risks. — Sunday Adelaja hen we free ourselves from the limitation of small discrete quantities of la- W bels, we are open to approximate a much larger range of functions. Machine learning techniques that use regression can approximate real-valued and con- tinuous functions. In supervised learning, we are trying to perform inductive reasoning, in which we try to figure out the abstract bigger picture from tiny snapshots of the world in which we don’t know most of the rules (that is, approximate a function from input-output pairs). 
This is in stark contrast with deductive reasoning, through which individual facts are combined through strict, rigorous logic to come to bigger conclusions (think "if this, then that").

2.1 Linear Regression

You've likely heard of linear regression if you're reading these notes. Linear regression is the mathematical process of acquiring the "line of best fit" that we used to plot in grade school. Through the power of linear algebra, the line of best fit can be rigorously defined by solving a linear system.

One way to find this line is to find the sum of least squared error between the points and the chosen line; more specifically, a visual demonstration can show us this is the same as minimizing the projection error of the points on the line.

[Figure 2.1: A set of points with "no solution": no line passes through all of them. The set of errors (e0, e1, e2) is plotted in red.]

Suppose we have a set of points that don't exactly fit a line: {(1, 1), (2, 1), (3, 2)}, plotted in Figure 2.1. We want to find the best possible line y = mx + b that minimizes the total error. This corresponds to solving the following system of equations, forming y = Ax:

    1 = b + m·1
    1 = b + m·2        ⟹    y = Ax, where A = [[1, 1], [1, 2], [1, 3]], x = [b, m]ᵀ, y = [1, 1, 2]ᵀ
    2 = b + m·3

The lack of an exact solution to this system (algebraically) means that the vector of y-values isn't in the column space of A, or: y ∉ C(A). The vector can't be represented by a linear combination of the column vectors in A. We can imagine the column space as a plane in xyz-space, with y existing outside of it; then, the vector that is within the column space is the projection p of y onto the column-space plane: p = proj_C(A) y. This is the closest possible vector in the column space to y, which is exactly the distance we were trying to minimize! Thus, e = y − p.

[Figure 2.2: The vector y relative to the column space of A, and its projection p onto the column space.]

The projection isn't super convenient to calculate or determine, though. Through algebraic manipulation, calculus, and other magic, we learn that the way to find the least-squares approximation of the solution is:

    AᵀA x* = Aᵀ y
    x* = (AᵀA)⁻¹ Aᵀ y

where (AᵀA)⁻¹Aᵀ is the pseudoinverse; this is exactly what the linear regression algorithm calculates.

More Resources: This section basically summarizes and synthesizes this Khan Academy video, this lecture from the Computer Vision course (which goes through the full derivation), this section of Introduction to Linear Algebra, and this explanation from NYU. These links are provided in order of clarity.

2.2 Neural Networks

Let's dive right into the deep end and learn about how a neural network works. Neurons in the brain can be described relatively succinctly from a high level: a single neuron is connected to a bunch of other neurons. It accumulates energy, and once the amount is bigger than the "activation threshold," the neuron fires, sending energy to the neurons it's connected to (potentially causing a cascade of firings). Artificial neural networks take inspiration from biology and are somewhat analogous to neuron interactions in the brain, but there's really not much benefit to looking at the analogy further.
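Before building up artificial neurons, here is a short numerical sketch of the least-squares solution from section 2.1 above, applied to the three-point example. It uses numpy (which the guide already references for polyfit); the variable names are mine, and np.linalg.lstsq is shown only as a cross-check.

    import numpy as np

    # The three points {(1, 1), (2, 1), (3, 2)} from the example above.
    # Each row of A is [1, x_i] so that A @ [b, m] stacks b + m*x_i.
    A = np.array([[1, 1],
                  [1, 2],
                  [1, 3]], dtype=float)
    y = np.array([1, 1, 2], dtype=float)

    # Least-squares solution via the pseudoinverse: x* = (A^T A)^{-1} A^T y.
    b, m = np.linalg.inv(A.T @ A) @ A.T @ y
    print(f"best fit: y = {m:.3f}x + {b:.3f}")        # y = 0.500x + 0.333

    # np.linalg.lstsq computes the same thing more robustly.
    print(np.linalg.lstsq(A, y, rcond=None)[0])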
The basic model of an “artificial neuron” is a function powered by a series of inputs xi , and weights wi , that somehow run through F and produce an output y: x1 w1 w2 activation x2 function F , and y w3 threshold θ x3 Typically, the activation function “fires” based on a firing threshold θ. 2.2.1 Perceptron The simplest, most fundamental activation function procudes a binary output (so y ∈ {0, 1}) based on the weighted sum of the inputs: ( 1 if Pn i=1 wi xi ≥ θ F (x1 , x2 ,... , xn , w1 , w2 ,... , wn ) = 0 otherwise This is called a perceptron, and it’s the foundational building block of neural net- works going back to the 1950s. Kudrayvtsev 28 MACHINE LEARNING Supervised Learning Example 2.1: Basic Perceptron Let’s quickly validate our understanding. Given the following input state: 1 x1 = 1, w1 = 2 3 x2 = 0, w2 = 5 x3 = −1.5, w3 = 1 and the firing threshold θ = 0, what’s the output? Well, pretty simply, we get       1 3 1 3 y= 1 +0 + (−1.5) · 1 = − = −1 < 0 = 0 2 5 2 2 The perceptron doesn’t fire! What kind of functions can we represent with a perceptron? Suppose we have two inputs, x1 and x2 , along with equal weights w1 = w2 = 21 , θ = 34. For what values of xi will we get an activation? Well, we know that if x2 = 0, then we’ll get an activation for anything that makes x1 w1 ≥ θ, so x1 ≥ θ/w1 ≥ 1.5. The same rationale applies for x2 , and since we know that the inequality is linear, we can just connect the dots: 2 x2 1 x1 1 2 Thus, a perceptron is a linear function (each wi xi term is linear), and so it can be used to compute the half-plane boundary between its inputs. This very example actually computes an interesting function: if the inputs are binary (that is, x1 , x2 ∈ {0, 1}), then this actually computes binary AND operation. The only possible outputs are marked accordingly; only when x1 = x2 = 1 does y = 1! We can actually model OR the same way with different weights (like w1 = w2 = 32 ). Kudrayvtsev 29 CHAPTER 2: Regression 2 2 x2 x2 1 1 x1 x1 1 2 1 2 Note that we “derived” OR by adjusting w1,2 until it worked, though we could’ve also adjusted θ. This might trigger a small “aha!” moment if the idea of induction from stuck with you: if we have some known input/output pairs (like the truth tables for the binary operators), then we can use a computer to rapidly guess-and-check the weights of a perceptron (or perhaps an entire neural network... ?) until they accurately describe the training pairs as expected. x y x⊕y Combining Perceptrons 1 1 0 To build up an intuition for how perceptrons can be com- 1 0 1 bined, let’s binary XOR. It can’t be described by a single 0 1 1 perceptron because it’s not a linear function; however, 0 0 0 it can be described by several! Table 2.1: The truth table Intuitively, XOR is like OR, except when the inputs succeed for XOR. under AND... so we might imagine that XOR ≈ OR−AND. So if x1 and x2 are both “on,” we should take away the result of the AND perceptron so that we fall under the activation threshold. However, if only one of them is “on,” we can proceed as normal. Note that the “deactivation weight” needs to be equal to the sum of the other weights in order to effectively cancel them out, as shown Figure 2.3. Learning Perceptrons Let’s delve further into the notion of “training” a perceptron network that we alluded to earlier: twiddling the wi s and θs to fit some known inputs and outputs. 
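The AND/OR discussion translates directly into code. Here is a minimal sketch (my own; it reads the OR weights above as 3/2) of the thresholded perceptron evaluated on all binary inputs:

    def perceptron(inputs, weights, theta):
        # Fires (returns 1) when the weighted sum of inputs reaches the threshold.
        activation = sum(w * x for w, x in zip(weights, inputs))
        return 1 if activation >= theta else 0

    # AND: w1 = w2 = 1/2, theta = 3/4 -- only fires when both inputs are 1.
    # OR:  w1 = w2 = 3/2, same theta  -- fires when either input is 1.
    for x1 in (0, 1):
        for x2 in (0, 1):
            and_out = perceptron((x1, x2), (0.5, 0.5), 0.75)
            or_out  = perceptron((x1, x2), (1.5, 1.5), 0.75)
            print(f"x1={x1} x2={x2}  AND={and_out}  OR={or_out}")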
There are two approaches to this: the perceptron rule, which operates on the post-activated y-values, and gradient descent, which operates on the raw summation. Perceptron Rule Suppose ŷ is our perceptron’s current output, while y is its desired output. In order to “move towards” the goal y, we adjust our wi s accordingly Kudrayvtsev 30 MACHINE LEARNING Supervised Learning w1 = 1 x1 x2 AND y x1 1 1 1 1 · 1 + −2 · 1 + 1 · 1 = 0 w2 = −2 XOR AND y 1 0 0 1 · 1 + −2 · 0 + 1 · 0 = 1 θ=1 x2 0 1 0 1 · 0 + −2 · 0 + 1 · 1 = 1 0 0 0 1 · 0 + −2 · 0 + 1 · 0 = 0 w3 = 1 Figure 2.3: A neural network modeling XOR and its summations for all possi- ble bit inputs. The final column in the table is the summation expression for perceptron activation, w1 x1 + w2 FAND (x1 , x2 ) + w3 x2. based on the “error” between y and ŷ. That is, define ŷ = ( i wi xi ≥ 0) P then use ∆wi = η (y − ŷ) xi to adjust wi ← wi + ∆wi where η > 0 is a learning rate which influences the size of the ∆wi adjustment made every iteration. Notice the absense of the activation threshold, θ. To simplify the math, we can actually treat it as part of the summation: a “fake” input with a fixed weight of −1, since: Pn Pn i wi xi = θ =⇒ i w i xi − θ = 0 where wn+1 = −1 and xn+1 = θ Pn+1 =⇒ i w i xi = 0 This extra input value is now called the bias and its weight can be tweaked just like the other inputs. Notice that when our output is correct, y − ŷ = 0 so there is no effect on the weights. If ŷ is too big, then our ∆wi < 0, making that wi xi term smaller, and vice-versa if ŷ is too small. The learning rate controls our adjustments so that we take small steps in the right direction rather than overshooting, since we don’t know how close we are to fixing ŷ. Theorem 2.1 (Perceptron Rule). If a dataset is linearly-separable (that is, able to be separated by a line), then a perceptron can find it with a finite number of iterations by using the perceptron rule. Of course, it’s impossible to know whether a sufficiently-large “finite” number of a iterations has occurred, so we can’t use this fact to determine whether or not a dataset is linearly-separable, but it’s still good to know that it will terminate when it can. Kudrayvtsev 31 CHAPTER 2: Regression Gradient Descent We still need something in our toolkit for datasets that aren’t linearly-separable. This time, we’ll operate on the unthresholded summation since it gives us far more information about how close (or far) we P are from triggering an activation. We’ll use a as shorthand for the summation: a = i wi xi. Then we can define an error metric based on the difference between a and each expected output: we sum the error for each of our input/output pairs in the dataset D. That is, 1 X E(w) = (y − a)2 2 (x,y)∈D Let’s use our good old friend calculus to solve this via gradient descent. A function is at its minimum when its derivative is zero, so we’ll take the derivative with respect to a weight:   ∂E ∂ 1 X = (y − a)2  ∂wi ∂wi 2 (x,y)∈D " # X ∂ X chain rule, and only a = (y − a) · − w j xj is in terms of wi ∂wi j (x,y)∈D X when j 6= i, the derivative = (y − a)(−xi ) will be zero (x,y)∈D X rearranged to look like =− (y − a)xi the perceptron rule (x,y)∈D Notice that we essentially end up with a version of the perceptron rule where η = −1, except we now use the summation a instead of the binary output ŷ. Unlike the perceptron rule, we have no guarantees about finding a separation, but it is far more robust to non-separable data. 
In the limit (thank u Newton very cool), though, it will converge to a local optimum. To reiterate our learning rules, we have: ∆wi = η(y − ŷ)xi (2.1) ∆wi = η(y − a)xi (2.2) 2.2.2 Sigmoids The similarity between the learning rules in (2.1) and (2.2) begs the question, why didn’t we just use calculus on the thresholded ŷ? Kudrayvtsev 32 MACHINE LEARNING Supervised Learning y 1 a The simple answer is that the function isn’t differentiable. Wouldn’t it be nice, though if we had one that was very similar to it, but smooth at the hard corners that give us problems? Enter the sigmoid function: 1 σ(a) = (2.3) 1 + e−a By introducing this as our activation function, we can use gradient descent all over the place. Furthermore, the sigmoid’s derivative itself is beautiful (thanks to the fact that dx e = ex ): d x σ̇(a) = σ(a)(1 − σ(a)) Note that this isn’t the only function that smoothly transitions between 0 and 1; there are other activation functions out there that behave similarly. 2.2.3 Structure Now that we have the ability to put together differentiable individual neurons, let’s look at what a large neural network presents itself as. In Figure 2.4 we see a collec- tion of sigmoid units arranged in an arbitrary pattern. Just like we combined two perceptrons to represent the non-linear function XOR, we can combine a larger number of these sigmoid units to approximate an arbitrary function. x1 x2 x3 y x4 x5 input layer hidden layers output Figure 2.4: A 4-layer neural network with 5 inputs and 3 hidden layers. Because each individual unit is differentiable, the entire mapping from x 7→ y is differ- entiable! The overall error of the system enables a bidirectional flow of information: Kudrayvtsev 33 CHAPTER 2: Regression the error of y impacts the last hidden layer, which results in its own error, which im- pacts the second-to-last hidden layer, etc. This layout of computationally-beneficial organizations of the chain rule is called back-propagation: the error of the network propogates to adjust each unit’s weight individually. Optimization Methods It’s worth reiterating that we’ve departed from the guarantees of perceptrons since we’re using σ(a) instead of the binary activation function; this means that gradient descent can get stuck in local optima and not necessarily result in the best global approximation of the function in question. Gradient descent isn’t the only approach to training a neural network. Other, more advanced methods are researched heavily. Some of these include momentum, which allows gradient descent to “gain speed” if it’s descending down steep areas in the function; higher-order derivatives, which look at combinations of weight changes to try to grasp the bigger picture of how the function is changing; randomized optimization; and the idea of penalizing “complexity,” so that the network avoids overfitting with too many nodes, layers, or even too-large of weights. 2.2.4 Biases What kind of problems are neural networks appropriate for solving? Restriction Bias A neural network’s restriction bias (which, if you recall, is the representation’s ability to consider hypotheses) is basically non-existent if you use sigmoids, though certain models may require arbitrarily-complex structure. We can clearly represent Boolean functions with threshold-like units. Continuous functions with no “jumps” can actually be represented with a single hidden layer. We can think of a hidden layer as a way to stitch together “patches” of the function as they approach the output layer. 
Even arbitrary functions can be approximated with a neural network! They require two hidden layers, one stitching at seams and the other stitching patches. This lack of restriction does mean that there’s a significant danger of overfitting, but by carefully limiting things that add complexity (as before, this might be layers, nodes, or even the weights themselves), we can stay relatively generalized. Preference Bias On the other hand, we can’t yet answer the question of the pref- erence bias of a neural network (which, if you recall, describes which hypotheses from the restricted space are preferred). We discussed the algorithm for updating weights (gradient descent), but have yet to discuss how the weights should be initialized in the first place. Kudrayvtsev 34 MACHINE LEARNING Supervised Learning Common practice is choosing small, random values for our initial weights. The ran- domness allows for variability so that the algorithm doesn’t get stuck at the same local minima each time; the smallness allows for relative adjustments to be impactful and reduces complexity. Given this knowledge, we can say that neural networks—when all other things are equal—prefer simpler explanations to complex ones. This idea is an embodiment of Occam’s Razor: Entities should not be multiplied unnecessarily. More colloquially, it’s often expressed as the idea that the simpler explanation is likelier to be true. 2.3 Instance-Based Learning The learning algorithms presented in this section take a radically-different approach to modeling a function approximation. Instead of inducing an abstract model from the training set that compactly represents most of the data, these algorithms will actually regularly refer to the data itself to create approximations. For a high-level example, consider our fundamental “line of best fit” problem. Given some input points, linear regression will determine the line that minimizes the error with the data, and then we can use the line directly to predict new values. With instance-based learning methods, we instead would base our predictions from the points themselves. We might predict the output of a novel input x0 as being the same as the output of whatever known x is closest to it, or perhaps the average of the outputs of the three known inputs closest to it. There are some pretty clear benefits to this paradigm shift: we no longer need to actually spend time “learning” anything; the model perfectly-remembers the training data rather than remembering an abstract generalizing; and the approach is dead- simple. The downsides, of course, are that we have to store all of our training data (which might potentially require massive amounts of storage) and we are not really generalizing from it (we’re pretty obviously overfitting) at all. 2.3.1 Nearest Neighbors This idea of referring to similar known inputs for a novel input is exactly the intuition behind the k-nearest neighbor learning algorithm. While k is the number of knowns to consider, we also need a notion of “distance” to determine how close or similar an xi ∈ X is to the novel input x0. The distance is our expression of domain knowledge about the space. If we’re classify- ing restaurants into cheap, average, and expensive, our distance metric might be the difference between the average entreé price. 
2.3 Instance-Based Learning

The learning algorithms presented in this section take a radically-different approach to modeling a function approximation. Instead of inducing an abstract model from the training set that compactly represents most of the data, these algorithms regularly refer back to the data itself to create approximations.

For a high-level example, consider our fundamental "line of best fit" problem. Given some input points, linear regression will determine the line that minimizes the error with the data, and then we can use the line directly to predict new values. With instance-based learning methods, we would instead base our predictions on the points themselves. We might predict the output of a novel input x′ as being the same as the output of whatever known x is closest to it, or perhaps the average of the outputs of the three known inputs closest to it.

There are some pretty clear benefits to this paradigm shift: we no longer need to actually spend time "learning" anything; the model perfectly remembers the training data rather than remembering an abstract generalization; and the approach is dead-simple. The downsides, of course, are that we have to store all of our training data (which might potentially require massive amounts of storage) and that we are not really generalizing from it at all (we're pretty obviously overfitting).

2.3.1 Nearest Neighbors

This idea of referring to similar known inputs for a novel input is exactly the intuition behind the k-nearest neighbor learning algorithm. While k is the number of knowns to consider, we also need a notion of "distance" to determine how close or similar an xi ∈ X is to the novel input x′.

The distance is our expression of domain knowledge about the space. If we're classifying restaurants into cheap, average, and expensive, our distance metric might be the difference between the average entrée price. We might even be talking about literal distances: if we had some arbitrarily-colored dots (whose colors were determined by the features fm and fn) and wanted to determine the color of a novel dot (drawn in red in the figure below) with fm = 2, fn = 1, we'd use the standard Euclidean distance.

[Figure: a scatter of four colored dots x1 through x4 in the (fm, fn) plane, each a distance d1 through d4 away from the novel dot at (2, 1).]

Notice that x4 is closest, followed by x2. If we were doing classification, we would probably choose specifically blue or green (depending on our tie-breaking scheme), but in regression, we might instead color the novel dot by the weighted sum of its k = 2 nearest neighbors:

    y′ = (d4·y4 + d2·y2) / (d4 + d2) ≈ (2.8·blue + 1.8·green) / (2.8 + 1.8) = (0, 155, 100)

in RGB ∈ [0, 255] terms, which is a nice, dark blue-green color.

An extremely simple (and non-performant) version of the kNN algorithm is formalized in algorithm 2.1; notice that it has O(k|X|) complexity, meaning it grows linearly both in k, the number of neighbors to consider, and in |X|, the size of the training data. Obviously we can improve on this. With a sorted list (O(n log n) time), we can apply binary search (O(log n + k) time for k values) and query far more efficiently. For features with high dimensionality (i.e. when n ≫ 0 for xi = {f1, f2, ..., fn}), we can also leverage the kd-tree algorithm¹—which subdivides the data into sectors to search through them more efficiently—but this is still O(kn log n). This is the main downside of kNN: because we use the training data itself as part of the querying process, things can get slow and unwieldy (in both time and space) very quickly.

¹ Game developers might already be somewhat familiar with the algorithm: quadtrees rely on similar principles to efficiently perform collision detection, pathfinding, and other spatially-sensitive calculations. The game world is dynamically divided into recursively-halved quadrilaterals to group closer objects together.

This is in contrast with something like linear regression, which calculates a model up-front and makes querying very cheap (constant time, in fact); in this regard, kNN is referred to as a lazy learner, whereas linear regression would be an eager learner.

Algorithm 2.1: A naïve kNN learning algorithm. Both the number of neighbors, k, and the similarity metric, d(·), are assumed to be pre-defined.

    Input: A series of training samples, X := xi ↦ yi.
    Input: The novel input, x′.
    Result: The predicted value of y′.

    closest := {}
    dists := {∞, ...}
    foreach xi ∈ X do
        foreach j ∈ [1, k] do
            if d(xi, x′) < dists[j] then
                closest[j] = xi
                dists[j] = d(xi, x′)
            end
        end
    end
    // Return the distance-weighted average of the k closest neighbors' outputs.
    return (Σ over j ∈ [1, k] of dists[j] · y_closest[j]) / (Σ over j ∈ [1, k] of dists[j])

In the case of regression, we'll likely be taking the weighted average like in algorithm 2.1; in the case of classification, we'll likely instead have a "vote" and choose the label with plurality. We can also do more "sophisticated" tie-breaking: for regression, we can consider all data points that fall within a particular shortest distance (in other words, we consider the k shortest distances rather than the k closest points); for classification, we have more options. We could choose the label that occurs the most globally in X, or randomly, or...
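For comparison with algorithm 2.1, here is a compact Python rendition of brute-force kNN (my own sketch, not from the notes); for brevity it uses a plain average of the neighbors' outputs for regression and a simple majority vote for classification, rather than the distance-weighted scheme above, and ties are broken arbitrarily.

```python
import numpy as np
from collections import Counter

def knn_predict(X, y, x_new, k=3, mode="regression"):
    """Brute-force k-nearest-neighbor prediction.

    X: (n_samples, n_features) training inputs, y: (n_samples,) outputs,
    x_new: (n_features,) novel input. The full sort makes this O(n log n)
    per query; a kd-tree or partial selection would be faster.
    """
    dists = np.linalg.norm(X - x_new, axis=1)   # Euclidean distance to every sample
    nearest = np.argsort(dists)[:k]             # indices of the k closest samples
    if mode == "regression":
        return np.mean(y[nearest])              # plain average of the neighbors' outputs
    # classification: majority vote among the k labels
    return Counter(y[nearest].tolist()).most_common(1)[0][0]

# Toy usage with made-up points:
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
y_reg = np.array([10.0, 20.0, 30.0, 40.0])
y_cls = np.array(["blue", "blue", "green", "green"])

print(knn_predict(X, y_reg, np.array([1.0, 1.0]), k=2))                        # 15.0
print(knn_predict(X, y_cls, np.array([1.0, 1.0]), k=3, mode="classification")) # 'blue'
```

Note how all of the real work happens at query time and nothing is precomputed, which is exactly the "lazy learner" behavior described above.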
Biases

As always, it's important to discuss what sorts of problems a kNN representation of our data would cater towards.

Preference Bias Our belief about what makes a good hypothesis to explain the data relies heavily on locality—closer points (based on the domain-aware distance metric) are in fact similar—and smoothness—averaging neighbors makes sense, and feature behavior transitions smoothly between values.

Notice that we've also been treating the features of our training sample vector xi equally. The scales of the features may of course differ, but there is no notion of weight on a particular feature. This is critical: obviously, whether or not a restaurant is within your budget is far more important than whether the atmosphere inside fits your vibe (being the broke graduate students that we are). However, the kNN model has difficulty making this differentiation.

In general, we are encountering the curse of dimensionality in machine learning: as the number of features (the dimensionality of a single xi ∈ X) grows linearly, the amount of data we need to accurately generalize grows exponentially. If we have a lot of features and we treat them all with equal importance, we're going to need a LOT of data to determine which ones are more relevant than others. It's common to treat features as "hints" to the machine learning algorithm you're using, following the rationale that, "If I give it more features, it can approximate the model better since it has more information to work with;" however, this paradoxically makes things more difficult for the model, since now the feature space has grown and "importance" is harder rather than easier to determine.

Restriction Bias If you can somehow define a distance function that relates feature points to one another, you can represent the data with a kNN.

Computational Learning Theory

    People worry that computers will get too smart and take over the world, but the real problem is that they're too stupid and they've already taken over the world.
    — Pedro Domingos

Computational learning theory lets us define and understand learning problems better. It's a powerful tool that can let us show that specific algorithms work for specific problems (like, for example, when a kNN learner would be best for a problem), and it also lets us show when a certain problem is fundamentally "hard."

The kinds of methods used in analysing learning questions are often analogous to the methods used in algorithm analysis in a more "traditional" computer science setting (think big-O notation). There, we talk about an algorithm's complexity in time and space; analogously, in machine learning problems, while we still care about time and space, we also care about sample complexity. Does a particular algorithm do well with small amounts of data? How do metrics like prediction accuracy grow with increased data?

Generally speaking, inductive learning is learning from examples.
There are many factors in the way such a learning problem is set up that can affect the resulting inductions; these include:

- the number of samples to learn from: it should come as no surprise that the amount of data we have can affect our inductions significantly;
- the complexity of the hypothesis class: the more difficult a concept or idea, the harder it may be for most algorithms to model; similarly, the risk of overfitting is high when you model a complex hypothesis;
- the accuracy to which the target concept is approximated: obviously, whether or not a model is "good" correlates directly with how well it performs on novel data;
- how samples are presented: until now, we've been training models on entire "batches" of data at once, but this isn't the only option; online learning (one at a time) is another alternative; and
- how samples are selected: is random shuffling the best way to choose data to train on?

3.1 Learning to Learn: Interactions

We'll look at the last two items from the above list first because deep within them are some important subtleties. The way that training examples are selected to learn from can drastically affect the overall quantity of data necessary to learn something meaningful. There are a number of ways to approach the learner-teacher relationship:

1. query-based, where the learner "asks questions" that the teacher responds to. For example, "I have x; what's c(x)?"
2. friendly, where the teacher provides the example x and its resulting c(x), selecting x to be a helpful way to learn.
3. natural, where nobody chooses and there's just some nature-provided distribution of samples.
4. adversarial, where the teacher provides unhelpful (x, c(x)) pairings designed to make the learner fail (and thereby learn better when it figures out how not to fail; think "trick questions" on exams).
5. ... and more!

So if we have a query-based learner coming up with and asking questions, what should it ask? An effective question is one whose answer gives the most information. In a binary world of yes-or-no questions with n possible hypotheses, a question always eliminates some number ℓ of the possibilities, leaving the remaining n − ℓ.

[Figure 3.1: From the movie I, Robot, a holographic version of a scientist appears posthumously to help Will Smith solve the riddle of the robot uprising.]

A good question, then, ideally halves the space, so ℓ ≈ n − ℓ; then it'll take log₂ |H| questions overall to nail the answer and find h.

Teaching

If the teacher is providing friendly examples, wouldn't a very ambitious teacher—one that wants the learner to figure out the hypothesis h ∈ H as quickly as possible—simply tell the learner to ask, "Is the hypothesis h?" Though effective, that's not a very realistic way to learn: the question-space is never "the entire set of possible questions in the universe." Teachers have a constrained space of questions they can suggest.

Example 3.1: A Good Teacher

Suppose we have some Boolean function that operates on some (possibly proper) subset of the bits x1 through x5. Our job is to determine the conjunction (logical-AND) of literals or negations that will fulfill the table of examples:

    x1  x2  x3  x4  x5 | h
     1   0   1   1   0 | 1
     0   0   0   1   0 | 1
     1   1   1   1   0 | 0
     1   0   1   0   0 | 0
     1   0   1   1   1 | 0

How should we tackle this? Well, notice that in the first two rows, x1 is flipped yet h stays the same. We can infer from this that x1 is not relevant, and likewise for x3. What's consistent across these rows?
Well, both x2 = x5 = 0. Yet this is also the case in row 4, so this may be necessary but not sufficient. We also see that x4 = 1 in both top rows, so a reasonable guess overall for the Boolean function in question is:

    f(x1 · · · x5) = ¬x2 ∧ x4 ∧ ¬x5

This holds for all of the rows.

Learning: Constrained Queries

Notice that in the above example, we can come up with f from the first two rows (we can think of these as "positive" examples), then use the next three rows to verify it (these being "negative" examples). This isn't because we're some sort of genius, but rather because we were given very good examples to learn from: each one gave us a lot of information. This is exactly what it means to be a good teacher: it can teach us a generic f with just k + 2 samples.

What if we weren't provided good examples and had to ask questions blindly? Since we don't really learn anything from h = 0, there are many possible functions f that are extremely hard to guess. For a general k, it'll take 2^k questions in the worst case (consider when exactly one k-bit string gives h = 1).

Learning: Mistake Bounds

When the game is too hard, change the game. Instead of measuring how many samples we need, why not measure how many mistakes we make? We'll modify the teacher-learner interaction thus: the learner is now given an input and has to guess the output. If it guesses wrong, it gets p u n i s h e d. This is an example of online learning, since the learner adjusts with each sample. Its algorithm is then:

1. To start, assume it's possible for each variable to be both positive and negated:

       h0 = x1 ∧ ¬x1 ∧ ... ∧ xk ∧ ¬xk

   Obviously this is logically impossible, but it gives us a meaningful baseline for Step 3 to work from.
2. Given an input, compute the output based on our current hypothesis. For a while (that is, until we guess wrong), we will just output "false."
3. If we're wrong, set the positive variables (the xi's) that were 0 to absent and the negated variables (the ¬xi's) that were 1 to absent.
4. Go to Step 2.

Basically, whenever the learner guesses wrong, it knows it can eliminate some variables. Even though the number of examples needed to learn may be the same in the worst case (2^k), we will now never make more than k + 1 mistakes. A good teacher that knows its learner uses this algorithm can actually teach h with k + 1 samples, teaching it one bit every time.
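A rough Python rendering of that mistake-bounded learner follows (my own translation of the four steps, not code from the notes); storing the hypothesis as a set of (index, is_positive) literal pairs is purely an implementation convenience.

```python
def make_initial_hypothesis(k):
    """Step 1: start with every literal, both positive (x_i) and negated (not x_i)."""
    return {(i, True) for i in range(k)} | {(i, False) for i in range(k)}

def predict(hypothesis, x):
    """A conjunction is true iff every literal it contains is satisfied by x."""
    return all((x[i] == 1) if positive else (x[i] == 0)
               for (i, positive) in hypothesis)

def learn_online(samples, k):
    """Steps 2-4: predict each sample; on a mistake, prune contradicted literals.

    Returns the final hypothesis and the number of mistakes made.
    """
    h = make_initial_hypothesis(k)
    mistakes = 0
    for x, label in samples:
        if predict(h, x) != label:
            mistakes += 1
            if label == 1:
                # Keep only literals this positive example satisfies: positive
                # literals whose bit was 1, negated literals whose bit was 0.
                h = {(i, positive) for (i, positive) in h
                     if (x[i] == 1) == positive}
    return h, mistakes

# Toy usage: the hidden concept is x0 AND (not x2) over k = 3 bits.
target = lambda x: x[0] == 1 and x[2] == 0
inputs = [(1, 0, 0), (0, 1, 0), (1, 1, 0), (1, 0, 1)]
samples = [(x, int(target(x))) for x in inputs]
h, mistakes = learn_online(samples, k=3)
print(sorted(h), mistakes)   # [(0, True), (2, False)] 2  -> x0 AND not-x2
```

On this toy run the learner makes 2 mistakes, comfortably under the k + 1 = 4 bound.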
3.2 Space Complexity

As we've already established, data is king in machine learning. In this section we'll derive a theoretical way to approximate space complexity: how much data do we need to solve a problem to a particular degree of certainty?

First, we'll need to rigorously define some terminology:

- a training set: S ⊆ X is made up of some samples x;
- H is the hypothesis space;
- the true hypothesis or concept: c ∈ H is the thing we want to learn, while the
- candidate hypothesis: h ∈ H is what the learner is currently considering; and
- a consistent learner: one that produces the correct result for all of the training samples: ∀x ∈ S : c(x) = h(x).

3.2.1 Version Spaces

Given the above, the version space is all of the possible hypotheses that are consistent with the data; in other words, the only hypotheses worth considering at some point in time given the training data:

    VS(S) = {h ∈ H : h(x) = c(x), ∀x ∈ S}

Example 3.2: Version Spaces

Given the target concept of XOR (left) and some training data (right):

    target concept        training data
    x1  x2 | c(x)         x1  x2 | c(x)
     0   0 |  0            0   0 |  0
     0   1 |  1            0   1 |  1
     1   0 |  1            1   0 |  1
     1   1 |  0            1   1 |  ?

and the hypotheses:

    H = {x1, ¬x1, x2, ¬x2, T, F, OR, AND, XOR, EQUIV}

which of the hypotheses are in the version space? Well, which of the hypotheses would be consistent with the training samples? Only OR and XOR agree with every labeled training row, so the version space is {OR, XOR}.

3.2.2 Error

Given a candidate hypothesis h, how do we evaluate how good it is? We define the training error as the fraction of training examples misclassified by h, while the true error is the fraction of examples that would be misclassified on samples drawn from the underlying distribution.
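Tying the last two subsections together, here's a small Python sketch (mine, not the notes') that enumerates a toy hypothesis class like the one in Example 3.2, computes each hypothesis's training error, and keeps the zero-error ones as the version space; the dictionary of named functions is just an illustrative encoding.

```python
# Each hypothesis maps a 2-bit input (x1, x2) to a boolean output.
HYPOTHESES = {
    "x1":     lambda x1, x2: x1,
    "not x1": lambda x1, x2: not x1,
    "x2":     lambda x1, x2: x2,
    "not x2": lambda x1, x2: not x2,
    "T":      lambda x1, x2: True,
    "F":      lambda x1, x2: False,
    "OR":     lambda x1, x2: x1 or x2,
    "AND":    lambda x1, x2: x1 and x2,
    "XOR":    lambda x1, x2: x1 != x2,
    "EQUIV":  lambda x1, x2: x1 == x2,
}

# Labeled training samples from Example 3.2: ((x1, x2), c(x)).
S = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1)]

def training_error(h, samples):
    """Fraction of training samples that the hypothesis misclassifies."""
    wrong = sum(bool(h(*x)) != bool(c) for x, c in samples)
    return wrong / len(samples)

# The version space: exactly the hypotheses with zero training error.
version_space = [name for name, h in HYPOTHESES.items()
                 if training_error(h, S) == 0]
print(version_space)   # ['OR', 'XOR']

for name, h in HYPOTHESES.items():
    print(f"{name:>6}: training error = {training_error(h, S):.2f}")
```

The true error, by contrast, can't be computed this way from the training set alone; it depends on the distribution the samples are drawn from.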
