Introduction to Machine Learning PDF

Summary

This document introduces the foundational concepts of machine learning and its main tasks, including classification, regression, cluster analysis, and machine translation. It covers how machine learning algorithms are designed and how they solve various problems in the field of AI.

Full Transcript


INTRODUCTION - A BRIEF INTRODUCTION TO MACHINE LEARNING

1. Machine learning vs Artificial intelligence

Machine learning: A machine learning algorithm is an algorithm that is able to learn from data.

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." (Mitchell, 1997)

Artificial intelligence: "The theory and development of computer systems able to perform tasks normally requiring human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages." (Oxford Dictionary)

2. Task (T)

The task is the problem that we solve by learning, and learning is our means of attaining the ability to perform the task.
○ Thus, the process of learning is NOT the task.
Machine learning enables us to tackle tasks that are too difficult to solve with fixed programs written and designed by human beings.
E.g. we can write a program for a robot to walk, OR we can write an ML algorithm for the robot to learn to do it.
Machine learning tasks are usually described in terms of how the system should process an example.
An example is a collection of features that have been quantitatively measured from some object or event that we want the machine learning system to process.
We typically represent an example as a vector x ∈ ℝⁿ, where each entry xᵢ of the vector is a feature.
○ E.g. the features of an image are usually the values of the pixels in the image.

2.1 Classification

Goal → specify which of k categories some input belongs to.
○ To solve this task, the learning algorithm is usually asked to produce a function f: ℝⁿ → {1, …, k}.
○ When y = f(x), the model assigns an input described by vector x to a category identified by a numeric code y.
○ There are other variants of the classification task, for example, where f outputs a probability distribution over classes (e.g. when we use a softmax layer as the output of our DL model); a short sketch follows at the end of 2.1.1.

2.1.1 Classification with missing inputs

Classification becomes more challenging if the computer program is not guaranteed that every measurement in its input vector will always be provided.
When some of the inputs may be missing, rather than providing a single classification function, the learning algorithm must learn a set of functions.
○ Each function corresponds to classifying x with a different subset of its inputs missing.
This classification task is typical in medical diagnosis.
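To make 2.1 concrete, here is a minimal sketch of such a function f: ℝⁿ → {1, …, k}, in the softmax variant that outputs a probability distribution over the k classes. The weights W and b are stand-ins for parameters that would be learned from data:

```python
import numpy as np

def softmax(z):
    # Subtract the max before exponentiating, for numerical stability.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def classify(x, W, b):
    """Linear classifier sketch: maps x in R^n to a category in {1, ..., k}.

    W (k x n) and b (k,) are assumed to be parameters learned elsewhere.
    Returns the softmax distribution over classes and the code y = f(x).
    """
    p = softmax(W @ x + b)        # probability distribution over classes
    y = int(np.argmax(p)) + 1     # numeric category code in {1, ..., k}
    return p, y

# Illustrative usage: n = 4 features, k = 3 categories, random "weights".
rng = np.random.default_rng(0)
p, y = classify(rng.normal(size=4), rng.normal(size=(3, 4)), rng.normal(size=3))
print(p, y)
```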
2.2 Regression

Goal → predict a numerical value given some input.
○ The function produced for this task type: ℝⁿ → ℝ.
○ This type of task is similar to classification, except that the format of the output is different.

2.3 Cluster analysis

Goal → predict a group for some given input.
○ The function produced for this task type: ℝⁿ → {1, …, k}.
○ This type of task is similar to classification, except that the examples are not labeled.

2.4 Transcription

Goal → observe a relatively unstructured representation of some kind of data and transcribe the information into discrete textual form.

2.5 Machine translation

Goal → convert a sequence of symbols in some language into a sequence of symbols in another language.
○ The most common application is in Natural Language Processing (NLP).

2.6 Structured output

Involves any task whose output is a vector or other data structure with important relationships between its different elements.

2.7 Anomaly detection

Goal → detect or flag a set of events or objects that are unusual or atypical with respect to the global set of events or objects.

2.8 Synthesis and sampling

Goal → generate new examples that are similar to those in the training data.
○ Synthesis and sampling via machine learning can be useful for media applications when generating large volumes of content by hand would be expensive or require too much time.

2.9 Imputation of missing values

Goal → given an example with some values missing, the computer must provide a prediction of the missing values.

2.10 Denoising

Goal → correct a corrupted example.

3. Performance measure (P)

To evaluate the abilities of a machine learning algorithm, we must design a quantitative measure of its performance.
Usually this performance measure P is specific to the task T being carried out by the system.
For tasks such as classification, classification with missing inputs, and transcription, we often measure the accuracy of the model.
○ Accuracy can be a dangerous measurement.
○ Please, evaluate your models with more robust metrics: macro F-measure, AUC, g-mean, statistical tests…
We can also obtain equivalent information by measuring the error rate, the proportion of examples for which the model produces an incorrect output.
○ We often refer to the error rate as the expected 0-1 loss.
○ The 0-1 loss on a particular example is 0 if it is correctly classified and 1 if it is not.
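To make the relationship between accuracy, the error rate and the 0-1 loss concrete, a small sketch with illustrative labels (for the more robust metrics mentioned above, functions such as scikit-learn's f1_score with average='macro' are common choices):

```python
import numpy as np

y_true = np.array([1, 2, 2, 3, 1, 3])     # illustrative gold labels
y_pred = np.array([1, 2, 3, 3, 1, 1])     # illustrative model outputs

zero_one = (y_pred != y_true).astype(int) # 0 if correct, 1 if not
error_rate = zero_one.mean()              # expected 0-1 loss
accuracy = 1.0 - error_rate               # equivalent information
print(error_rate, accuracy)               # 0.333..., 0.666...
```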
Usually, we are interested in how well the machine learning algorithm performs on data that it has not seen before, since this determines how well it will work when deployed in the real world.
We evaluate these performance measures using a test set of data that is separate from the data used for training the machine learning system.
○ It is very important to avoid data bleeding. Data bleeding occurs when the test set is also used to train the model.
The choice of performance measure may seem straightforward and objective, but it is often difficult to choose a performance measure that corresponds well to the desired behavior of the system.
○ E.g. when performing a transcription task, should we measure the accuracy of the system at transcribing entire sequences, or should we use a more fine-grained performance measure that gives partial credit for getting some elements of the sequence correct?
○ When performing a regression task, should we penalize the system more if it frequently makes medium-sized mistakes or if it rarely makes very large mistakes?

4. Experience (E)

Machine learning algorithms can be broadly categorized as unsupervised or supervised by what kind of experience they are allowed to have during the learning process.
This will depend on the dataset.
○ A dataset is a collection of examples, sometimes also called datapoints.

4.1 Unsupervised learning

Unsupervised learning algorithms experience a dataset containing many features, then learn useful properties of the structure of this dataset.
○ The examples are not labeled.
Some unsupervised learning algorithms perform roles like clustering, which consists of dividing the dataset into clusters of similar examples.

4.2 Supervised learning

Supervised learning algorithms experience a dataset containing features, but each example is also associated with a label or target.
○ E.g. an eye infection dataset where each example is labeled as either a bacteriological or a fungal infection.
Unsupervised learning involves observing several examples of a random vector x and attempting to implicitly or explicitly learn the probability distribution p(x), or some interesting properties of that distribution.
Supervised learning involves observing several examples of a random vector x and an associated value or vector y, then learning to predict y from x, usually by estimating p(y|x).

(4. Experience (E), continued)

Some machine learning algorithms do not just experience a fixed dataset. For example, reinforcement learning algorithms interact with an environment, so there is a feedback loop between the learning system and its experiences.
Just as there is no formal definition of supervised and unsupervised learning, there is no rigid taxonomy of datasets or experiences.
For example, in deep learning, self-supervised learning is also used.
○ The basic idea is to automatically generate some kind of supervisory signal to solve some task (typically, to learn representations of the data or to automatically label a dataset).
○ E.g. autoencoders, sketched below.
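A minimal sketch of the autoencoder idea, assuming nothing beyond NumPy: the supervisory signal is generated automatically because the reconstruction target is the input itself, so no labels are needed. All sizes and constants here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))      # 200 unlabeled examples, 8 features

n, k, lr = X.shape[1], 3, 0.01     # k = size of the learned representation
W_enc = rng.normal(scale=0.1, size=(n, k))
W_dec = rng.normal(scale=0.1, size=(k, n))

for step in range(500):
    Z = X @ W_enc                  # encoding: the representation
    X_hat = Z @ W_dec              # decoding: the reconstruction
    err = X_hat - X                # the "label" is the input itself
    # Gradient descent on the mean squared reconstruction error.
    grad_dec = Z.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

print(np.mean(err ** 2))           # decreases as training proceeds
```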
5. Capacity, overfitting & underfitting

The central challenge in machine learning is that our algorithm must perform well on new, previously unseen inputs.
The ability to perform well on previously unobserved inputs is called generalization.
When training a machine learning model, we have access to a training set.
○ We can compute some error measure on the training set, called the training error, and we reduce this training error.
What separates machine learning from pure optimization is that we also want the generalization error, also called the test error, to be low.
Our objective is to minimize the generalization error or test error.
We typically estimate the generalization error of a machine learning model by measuring its performance on a test set of examples that were collected separately from the training set.
○ Obviously, if the training and the test set are collected arbitrarily, there is little we can do.
The training and test data are generated by a probability distribution over datasets called the data-generating process. We typically make a set of assumptions known collectively as the i.i.d. assumptions:
○ The examples in each dataset are independent from each other.
○ The training set and test set are identically distributed, drawn from the same probability distribution as each other.
We call that shared underlying distribution the data-generating distribution.
When we use a machine learning algorithm, we do not fix the parameters ahead of time and then sample both datasets.
○ We sample the training set, then use it to choose the parameters to reduce training set error, then sample the test set.
Under this process, the expected test error is greater than or equal to the expected value of the training error.
The factors determining how well a machine learning algorithm will perform are its ability to:
1. Make the training error small.
2. Make the gap between training and test error small.
These two factors correspond to the two central challenges in machine learning: underfitting and overfitting.
○ Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set. The model is unable to learn.
○ Overfitting occurs when the gap between the training error and test error is too large. The model is unable to generalize.
We can control whether a model is more likely to overfit or underfit by altering its capacity.
A model's capacity is its ability to fit a wide variety of functions.
Models with low capacity may struggle to fit the training set.
Models with high capacity can overfit by memorizing properties of the training set that do not serve them well on the test set.
In deep learning, for example, capacity is controlled by the number of layers and the size of each layer.
One way to control the capacity of a learning algorithm is by choosing its hypothesis space, the set of functions that the learning algorithm is allowed to select as being the solution.
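A small sketch of this trade-off, using polynomial degree as the capacity knob (the hypothesis space is "polynomials of degree d"). The data-generating function and sample sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-1, 1, 12))
y_train = np.sin(3 * x_train) + rng.normal(scale=0.1, size=12)
x_test = np.sort(rng.uniform(-1, 1, 100))
y_test = np.sin(3 * x_test) + rng.normal(scale=0.1, size=100)

for degree in (1, 4, 9):          # low, adequate, and high capacity
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # degree 1 underfits (large training error); degree 9 tends to
    # overfit (tiny training error, larger gap to the test error).
    print(degree, round(train_err, 4), round(test_err, 4))
```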
5.1 The No Free Lunch Theorem

The no free lunch theorem for machine learning states that:
○ Averaged over all possible data-generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points.
○ What does this mean? No machine learning algorithm is universally any better than any other.
There is no machine learning algorithm that will be the best solution for all problems.

5.2 Regularization

Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.
It is one of the main concerns of machine learning, rivaled only by optimization.
Different algorithms have different regularization options.
○ In deep learning, a commonly used one is dropout.

6. Hyperparameters and Validation Sets

Most machine learning algorithms have hyperparameters → settings that we can use to control the algorithm's behavior.
The values for these hyperparameters are not learned automatically; they are set manually. We need to try different values and explore the possibilities to find the best configuration.
Problem: the test examples must not be used in any way to make choices about the model, including its hyperparameters.
○ Be very strict with this.
Validation set: constructed from the training data, but these examples are not seen by the algorithm during training.
Sets used:
○ Training set: used to learn the parameters of the algorithm.
○ Validation set: used to test the generalization when adjusting the hyperparameters.
○ Test set: used to test the generalization of the final configuration.
We first split the data into the training and test sets; then we split the training data into the training and validation sets (see the sketch at the end of this section).
Since the validation set is used to "train" the hyperparameters, the validation set error will underestimate the generalization error, though typically by a smaller amount than the training error does.
After all hyperparameter optimization is complete, the generalization error may be estimated using the test set.
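A minimal sketch of the two-stage split described above, using scikit-learn's train_test_split; the split proportions are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)      # illustrative data
y = np.arange(100)

# First split off the test set; it is only touched at the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Then carve a validation set out of the remaining training data.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))    # 60 20 20
```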
7. The curse of dimensionality

Many machine learning problems become exceedingly difficult when the number of dimensions in the data is high.
○ This phenomenon is known as the curse of dimensionality.
Of particular concern is that the number of possible distinct configurations of a set of variables increases exponentially as the number of variables increases.
Problems:
○ Data gets spread out.
○ More data is needed.
○ Harder to compute.
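A quick numeric sketch of "data gets spread out": with a fixed budget of 1,000 points, the fraction that lands in a region covering half of each axis shrinks roughly as 0.5ⁿ with the number of dimensions n:

```python
import numpy as np

rng = np.random.default_rng(0)
for n_dims in (1, 2, 5, 10, 20):
    X = rng.uniform(size=(1000, n_dims))          # unit hypercube
    inside = np.all((X > 0.25) & (X < 0.75), axis=1).mean()
    print(n_dims, inside)   # covered fraction shrinks as ~0.5 ** n_dims
```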
INTRODUCTION - THE MACHINE LEARNING PROCESS

1. Choosing the Training Experience

1.1 First design choice

The first design choice we face is to choose the type of training experience from which our system will learn.
○ The type of training experience available can have a significant impact on the success or failure of the learner.
○ Not all experience types are adequate for all problems.
One key attribute is whether the training experience provides direct or indirect feedback regarding the choices made by the performance system.
In our example, the system might learn from direct training examples consisting of individual checkers board states and the correct move for each.
Alternatively, it might have available only indirect information consisting of the move sequences and final outcomes of various games played.
○ In this latter case, information about the correctness of specific moves early in the game must be inferred indirectly from the fact that the game was eventually won or lost.
○ Here the learner faces an additional problem of credit assignment, or determining the degree to which each move in the sequence deserves credit or blame for the final outcome.
○ Credit assignment can be a particularly difficult problem because the game can be lost even when early moves are optimal, if these are followed later by poor moves.
For this reason, learning from direct training feedback is typically easier than learning from indirect feedback.

1.2 Second design choice

A second important attribute of the training experience is the degree to which the learner controls the sequence of training examples.
○ The learner might rely on the teacher to select informative board states and to provide the correct move for each → Supervised Learning.
○ The learner might itself propose board states that it finds particularly confusing and ask the teacher for the correct move → Active Learning.
○ The learner may have complete control over both the board states and (indirect) training classifications, as it does when it learns by playing against itself with no teacher present → Reinforcement Learning.
Notice that in this last case the learner may choose between experimenting with novel board states that it has not yet considered, or honing its skill by playing minor variations of lines of play it currently finds most promising → Exploration vs Exploitation in Reinforcement Learning.

1.3 Third design choice

A third important attribute of the training experience is how well it represents the distribution of examples over which the final system performance P must be measured.
○ In general, learning is most reliable when the training examples follow a distribution similar to that of future test examples.
Machine learning models are learning a distribution from the examples; if they do not see examples for one class, they will not be able to identify new instances of that class.
○ This is not completely true, however → one-shot and zero-shot learning.
In our example, the performance metric P is the percentage of games the system wins in the world tournament.
If its training experience E consists only of games played against itself, there is an obvious danger that this training experience might not be fully representative of the distribution of situations over which it will later be tested.
○ For example, the learner might never encounter certain crucial board states that are very likely to be played by the human checkers champion.
In practice, it is often necessary to learn from a distribution of examples that is somewhat different from those on which the final system will be evaluated (e.g., the world checkers champion might not be interested in teaching the program!).
○ Training vs Validation vs Test Sets.
Such situations are problematic because mastery of one distribution of examples will not necessarily lead to strong performance over some other distribution.
We shall see that most current theory of machine learning rests on the crucial assumption that the distribution of training examples is identical to the distribution of test examples → the i.i.d. assumptions.
Despite our need to make this assumption in order to obtain theoretical results, it is important to keep in mind that this assumption must often be violated in practice.

1.4 Putting it all together

We decide that our system will train by playing games against itself.
This has the advantage that no external trainer need be present, and it therefore allows the system to generate as much training data as time permits.

2. Choosing the Target Function

The next design choice is to determine exactly what type of knowledge will be learned and how this will be used by the performance program.
We are going to assume that we have a checkers-playing program that can generate the legal moves from any board state.
The program needs only to learn how to choose the best move from among these legal moves.
This learning task is representative of a large class of tasks for which the legal moves that define some large search space are known a priori, but for which the best search strategy is not known.
Many optimization problems fall into this class, such as the problems of scheduling and controlling manufacturing processes, where the available manufacturing steps are well understood but the best strategy for sequencing them is not.
Given this setting where we must learn to choose among the legal moves, the most obvious choice for the type of information to be learned is a program, or function, that chooses the best move for any given board state.
Let us call it ChooseMove and use the notation ChooseMove: B → M.
○ This function accepts as input any board from the set of legal board states B.
○ This function produces as output some move from the set of legal moves M.
Throughout our discussion of machine learning we will find it useful to reduce the problem of improving performance P at task T to the problem of learning some particular target function such as ChooseMove.
The choice of the target function will therefore be a key design choice.
Although ChooseMove is an obvious choice for the target function in our example, this function will turn out to be very difficult to learn given the kind of indirect training experience available to our system.
An alternative target function, and one that will turn out to be easier to learn in this setting, is an evaluation function that assigns a numerical score to any given board state.
Let us call this target function V and again use the notation V: B → ℝ to denote that V maps any legal board state from the set B to some real value (we use ℝ to denote the set of real numbers).
We intend for this target function V to assign higher scores to better board states.
If the system can successfully learn such a target function V, then it can easily use it to select the best move from any current board position. This can be accomplished by generating the successor board state produced by every legal move, then using V to choose the best successor state and therefore the best legal move.
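In code, that selection rule is essentially one line. legal_moves, successor and v_hat below are hypothetical stand-ins for the checkers program's move generator, its board-update routine, and the learned evaluation function:

```python
def choose_move(board, legal_moves, successor, v_hat):
    """Greedy move selection with a learned evaluation function.

    legal_moves(board), successor(board, move) and v_hat(board) are
    hypothetical helpers; only the selection rule itself is shown.
    """
    return max(legal_moves(board),
               key=lambda m: v_hat(successor(board, m)))
```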
Let us therefore define the target value V(b) for an arbitrary board state b in B, as follows:
1. If b is a final board state that is won, then V(b) = 100.
2. If b is a final board state that is lost, then V(b) = -100.
3. If b is a final board state that is drawn, then V(b) = 0.
4. If b is not a final state in the game, then V(b) = V(b'), where b' is the best final board state that can be achieved starting from b and playing optimally until the end of the game.
While this recursive definition specifies a value of V(b) for every board state b, this definition is not usable by our checkers player because it is not efficiently computable.
○ In case 4, determining the value of V(b) for a particular board state requires searching ahead for the optimal line of play, all the way to the end of the game (see the sketch below).
Because this definition is not efficiently computable by our checkers-playing program, we say that it is a non-operational definition.
The goal of learning in this case is to discover an operational description of V; that is, a description that can be used by the checkers-playing program to evaluate states and select moves within realistic time bounds.
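A direct transcription of the four cases makes the non-operational character visible. Here game is a hypothetical interface to the checkers program, and "playing optimally until the end of the game" is rendered as a minimax recursion over both players' moves:

```python
def V(b, game, maximizing=True):
    """The (non-operational) target function V defined above.

    `game` is a hypothetical interface to the checkers program with
    is_final, is_won, is_lost, legal_moves and successor methods.
    """
    # Cases 1-3: final board states have fixed values.
    if game.is_final(b):
        return 100 if game.is_won(b) else -100 if game.is_lost(b) else 0
    # Case 4: the value of the best final state reachable under optimal
    # play.  Optimal play by both sides makes this a minimax recursion
    # that searches to the end of the game, which is exactly why the
    # definition is correct but not efficiently computable.
    values = [V(game.successor(b, m), game, not maximizing)
              for m in game.legal_moves(b)]
    return max(values) if maximizing else min(values)
```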
Thus, we have reduced the learning task in this case to the problem of discovering an operational description of the ideal target function V.
It may be very difficult in general to learn such an operational form of V perfectly.
We often expect learning algorithms to acquire only some approximation to the target function, and for this reason the process of learning the target function is often called function approximation.
In the current discussion we will use the symbol V̂ to refer to the function that is actually learned by our program, to distinguish it from the ideal target function V.

3. Choosing a Representation of the Target Function

Now that we have specified the ideal target function V, we must choose a representation that the learning program will use to describe the function V̂ that it will learn.
As with earlier design choices, we again have many options:
1. We could allow the program to represent V̂ using a large table with a distinct entry specifying the value for each distinct board state.
2. We could allow it to represent V̂ using a collection of rules that match against features of the board state.
3. We could use a quadratic polynomial function of predefined board features.
4. We could use an artificial neural network.
In general, this choice of representation involves a crucial tradeoff.
○ On one hand, we wish to pick a very expressive representation to allow representing as close an approximation as possible to the ideal target function V.
○ On the other hand, the more expressive the representation, the more training data the program will require in order to choose among the alternative hypotheses it can represent.
○ Remember our discussion of capacity in the previous unit.
We are just going to choose one for this example: for any given board state, the function V̂ will be calculated as a linear combination of the following board features:
○ x₁: the number of black pieces on the board.
○ x₂: the number of red pieces on the board.
○ x₃: the number of black queens on the board.
○ x₄: the number of red queens on the board.
○ x₅: the number of black pieces threatened by red (i.e., which can be captured on red's next turn).
○ x₆: the number of red pieces threatened by black.
Our learning program will represent V̂ as a linear function:

V̂(b) = w₀ + w₁x₁ + w₂x₂ + w₃x₃ + w₄x₄ + w₅x₅ + w₆x₆

where w₀ through w₆ are numerical coefficients, or weights, to be chosen by the learning algorithm.
Learned values for the weights w₁ through w₆ will determine the relative importance of the various board features in determining the value of the board.
The weight w₀ will provide an additive constant to the board value.
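A minimal sketch of this representation; the feature vector and the all-zero initial weights are illustrative:

```python
import numpy as np

def v_hat(x, w):
    """Linear evaluation function: V_hat(b) = w0 + w1*x1 + ... + w6*x6.

    `x` holds the six board features x1..x6 described above; `w` holds
    the seven weights w0..w6 to be chosen by the learning algorithm.
    """
    return w[0] + np.dot(w[1:], x)

w = np.zeros(7)                       # weights before any learning
x = np.array([3, 0, 0, 0, 0, 0])      # e.g. three black pieces, no red
print(v_hat(x, w))                    # 0.0 until the weights are learned
```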
4. Choosing a Function Approximation Algorithm

To learn the target function V̂ we require a set of training examples.
1. Each example describes a specific board state b and the training value Vtrain(b) for b.
2. Each training example is an ordered pair (b, Vtrain(b)).
In our checkers model, the following would be an example where black has won the game (x₂ = 0 because there are no more red pieces), and in which the value of the target function is Vtrain(b) = +100:

((x₁ = 3, x₂ = 0, x₃ = 0, x₄ = 0, x₅ = 0, x₆ = 0), +100)

4.1 Estimating Training Values

In our problem, we are going to use an approach that has been shown to work quite well in similar problems.
○ In most cases, we do not need to reinvent the wheel; studying the literature for similar problems is usually how you would approach the design.
The approach is to assign the training value Vtrain(b) for any intermediate board state b to be V̂(Successor(b)), where V̂ is the learner's current approximation to V and where Successor(b) denotes the next board state following b for which it is again the program's turn to move.
○ I.e. the board state following the program's move and the opponent's response.
This rule for estimating training values can be summarized as:

Vtrain(b) ← V̂(Successor(b))

It probably seems a bit strange that we are using the current version of V̂ to estimate the training values that will be used to further refine the same function.
○ We are learning the proper values of the weights.
○ With each example, we are looking for the values of the weights that will result in our model winning the match.
Notice that we are using estimates of the values of Successor(b) to estimate the value of board state b. Intuitively, we can see this will make sense if V̂ tends to be more accurate for board states closer to the game's end.
○ Under certain conditions, the approach of iteratively estimating training values based on estimates of successor state values can be proven to converge toward perfect estimates of Vtrain.
○ See chapter 13 of (Mitchell, 1997) for a more detailed discussion of how this is done in Reinforcement Learning.
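A sketch of this estimation rule applied along one game trace. trace, v_hat and final_value are hypothetical inputs standing in for the data the Critic (section 5.2) will actually receive:

```python
def estimate_training_values(trace, v_hat, final_value):
    """Apply Vtrain(b) <- V_hat(Successor(b)) along one game trace.

    `trace` lists the board states where it was the program's turn to
    move, oldest first; `final_value` is +100, -100 or 0 for the
    finished game.  Returns the (b, Vtrain(b)) training pairs.
    """
    examples = []
    for i, b in enumerate(trace):
        if i + 1 < len(trace):
            # Estimate from the learner's own evaluation of the successor.
            examples.append((b, v_hat(trace[i + 1])))
        else:
            examples.append((b, final_value))   # final board state
    return examples
```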
4.2 Adjusting the Weights

The only thing that we need now to finish our design is to select the algorithm that we will use to choose the weights wᵢ that best fit the set of training examples {(b, Vtrain(b))}.
As a first step we must define what we mean by the best fit to the training data.
○ One common approach is to define the best hypothesis, or set of weights, as that which minimizes the squared error E between the training values and the values predicted by the hypothesis V̂, i.e. E ≡ Σ (Vtrain(b) - V̂(b))², summed over the training examples (b, Vtrain(b)).
We seek the weights, or equivalently the V̂, that minimize E for the observed training examples.
○ Chapter 6 in Mitchell, 1997 discusses settings in which minimizing the sum of squared errors is equivalent to finding the most probable hypothesis given the observed training data.
Several algorithms are known for finding weights of a linear function that minimize E defined in this way.
○ In our case, we require an algorithm that will incrementally refine the weights as new training examples become available and that will be robust to errors in these estimated training values.
One such algorithm is called the least mean squares, or LMS training rule.
○ For each observed training example, it adjusts the weights a small amount in the direction that reduces the error on this training example.
○ As discussed in Chapter 4 of Mitchell, 1997, this algorithm can be viewed as performing a stochastic gradient-descent search through the space of possible hypotheses (weight values) to minimize the squared error E.
LMS weight update rule:
○ For each training example (b, Vtrain(b)):
○ Use the current weights to calculate V̂(b).
○ For each weight wᵢ, update it as:

wᵢ ← wᵢ + η (Vtrain(b) - V̂(b)) xᵢ

○ η (the Greek letter eta) is a small constant (e.g., 0.1) that moderates the size of the weight update → Why do we need this?
○ xᵢ is the feature value. Notice that if the value of some feature xᵢ is zero, then its weight is not altered regardless of the error, so the only weights updated are those whose features actually occur in the training example board.
To get an intuitive understanding of why this weight update rule works, notice that when the error (Vtrain(b) - V̂(b)) is zero, no weights are changed.
When (Vtrain(b) - V̂(b)) is positive (i.e., when V̂(b) is too low), then each weight is increased in proportion to the value of its corresponding feature. This will raise the value of V̂(b), reducing the error.
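The update rule above is a one-liner in code; this sketch applies it to the winning-position example from the start of this section (note that only w₀ and w₁ change, because x₂ through x₆ are zero):

```python
import numpy as np

def lms_update(w, x, v_train, eta=0.1):
    """One LMS step: w_i <- w_i + eta * (Vtrain(b) - V_hat(b)) * x_i."""
    x_full = np.concatenate(([1.0], x))     # x0 = 1 pairs with w0
    error = v_train - np.dot(w, x_full)     # Vtrain(b) - V_hat(b)
    return w + eta * error * x_full

w = np.zeros(7)                             # w0 .. w6 before learning
x = np.array([3, 0, 0, 0, 0, 0], dtype=float)
w = lms_update(w, x, v_train=100.0)
print(w)    # only w0 and w1 move; features with value 0 leave their
            # weights untouched, regardless of the error
```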
5. The final design

Now we are going to put it all together.
The final design of our checkers learning system can be naturally described by four distinct program modules that represent the central components in many learning systems.

5.1 Performance System

The Performance System is the module that must solve the given performance task, in this case playing checkers, by using the learned target function(s).
It takes an instance of a new problem (a new game) as input and produces a trace of its solution (the game history) as output.
In our example, the strategy used by the Performance System to select its next move at each step is determined by the learned V̂ evaluation function. Therefore, we expect its performance to improve as this evaluation function becomes increasingly accurate.

5.2 Critic

The Critic takes as input the history or trace of the game and produces as output a set of training examples of the target function.
○ Each training example in this case corresponds to some game state in the trace, along with an estimate Vtrain of the target function value for this example.
In our example, the Critic corresponds to the training rule given by the equation Vtrain(b) ← V̂(Successor(b)).

5.3 Generalizer

The Generalizer takes as input the training examples and produces an output hypothesis that is its estimate of the target function.
It generalizes from the specific training examples, hypothesizing a general function that covers these examples and other cases beyond the training examples.
In our example, the Generalizer corresponds to the LMS algorithm, and the output hypothesis is the function V̂ described by the learned weights wᵢ.

5.4 Experiment Generator

The Experiment Generator takes as input the current hypothesis (the currently learned function) and outputs a new problem (i.e., an initial board state) for the Performance System to explore.
Its role is to pick new practice problems that will maximize the learning rate of the overall system.
In our example, the Experiment Generator follows a very simple strategy: it always proposes the same initial game board to begin a new game.
○ More sophisticated strategies could involve creating board positions designed to explore particular regions of the state space.

(5. The final design, continued)

With our choices, we have ended up with a design somewhat similar to the Actor-Critic model employed in Reinforcement Learning.
There are multiple ways to solve the same problem; we will learn how to use different metrics to evaluate which of the possible solutions is the best one.
Remember, there is no one single algorithm or model that is the best solution for all problems (No Free Lunch theorem).

6. Designing a learning system: Steps

1. Data preprocessing
2. Choosing the Training Experience
3. Choosing the Target Function
4. Choosing a Representation of the Target Function
5. Choosing a Function Approximation Algorithm
6. The Final Design
7. Model deployment (MLOps)
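Putting the four modules together, a hedged sketch of how they interact, one game per iteration; all four arguments are callables standing in for the modules described above, not taken from any real library:

```python
def training_loop(n_games, experiment_generator, performance_system,
                  critic, generalizer):
    """Sketch of how the four modules interact, one game per iteration."""
    hypothesis = None                              # e.g. initial weights
    for _ in range(n_games):
        board = experiment_generator(hypothesis)   # propose a new game
        trace = performance_system(board, hypothesis)   # play it out
        examples = critic(trace)                   # (b, Vtrain(b)) pairs
        hypothesis = generalizer(examples, hypothesis)  # e.g. LMS steps
    return hypothesis
```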
