Introduction to Machine Learning PDF

Summary

This document introduces the foundational concepts of machine learning and its main tasks, including classification, regression, cluster analysis, and machine translation. It covers how machine learning algorithms are designed and how they solve various problems in the field of AI.

Full Transcript


INTRODUCTION - A BRIEF INTRODUCTION TO MACHINE LEARNING

1. Machine learning vs Artificial intelligence

Machine learning: A machine learning algorithm is an algorithm that is able to learn from data.

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." (Mitchell, 1997)

Artificial intelligence: "The theory and development of computer systems able to perform tasks normally requiring human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages." (Oxford Dictionary)

2. Task (T)

The task is the problem that we solve by learning, and learning is our means of attaining the ability to perform the task.
○ Thus, the process of learning is NOT the task.
Machine learning enables us to tackle tasks that are too difficult to solve with fixed programs written and designed by human beings.
E.g. we can write a program for a robot to walk, OR we can write an ML algorithm for the robot to learn to do it.
Machine learning tasks are usually described in terms of how the system should process an example.
An example is a collection of features that have been quantitatively measured from some object or event that we want the machine learning system to process.
We typically represent an example as a vector x ∈ ℝⁿ, where each entry xᵢ of the vector is a feature.
○ E.g. the features of an image are usually the values of the pixels in the image.

2.1 Classification

Goal → specify which of k categories some input belongs to.
○ To solve this task, the learning algorithm is usually asked to produce a function f: ℝⁿ → {1, …, k}.
○ When y = f(x), the model assigns an input described by vector x to a category identified by a numeric code y.
○ There are other variants of the classification task, for example, where f outputs a probability distribution over classes (e.g. when we use a softmax layer as the output of our DL model); a short sketch follows at the end of 2.1.1.

2.1.1 Classification with missing inputs

Classification becomes more challenging if the computer program is not guaranteed that every measurement in its input vector will always be provided.
When some of the inputs may be missing, rather than providing a single classification function, the learning algorithm must learn a set of functions.
○ Each function corresponds to classifying x with a different subset of its inputs missing.
This classification task is typical in medical diagnosis.
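To make 2.1 concrete, here is a minimal sketch of such a function f: ℝⁿ → {1, …, k}, in the softmax variant that outputs a probability distribution over the k classes. The weights W and b are stand-ins for parameters that would be learned from data:

```python
import numpy as np

def softmax(z):
    # Subtract the max before exponentiating, for numerical stability.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def classify(x, W, b):
    """Linear classifier sketch: maps x in R^n to a category in {1, ..., k}.

    W (k x n) and b (k,) are assumed to be parameters learned elsewhere.
    Returns the softmax distribution over classes and the code y = f(x).
    """
    p = softmax(W @ x + b)        # probability distribution over classes
    y = int(np.argmax(p)) + 1     # numeric category code in {1, ..., k}
    return p, y

# Illustrative usage: n = 4 features, k = 3 categories, random "weights".
rng = np.random.default_rng(0)
p, y = classify(rng.normal(size=4), rng.normal(size=(3, 4)), rng.normal(size=3))
print(p, y)
```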
2.2 Regression

Goal → predict a numerical value given some input.
○ The function produced for this task type: ℝⁿ → ℝ.
○ This type of task is similar to classification, except that the format of the output is different.

2.3 Cluster analysis

Goal → predict a group for some given input.
○ The function produced for this task type: ℝⁿ → {1, …, k}.
○ This type of task is similar to classification, except that the examples are not labeled.

2.4 Transcription

Goal → observe a relatively unstructured representation of some kind of data and transcribe the information into discrete textual form.

2.5 Machine translation

Goal → convert a sequence of symbols in some language into a sequence of symbols in another language.
○ The most common application is in Natural Language Processing (NLP).

2.6 Structured output

Involves any task whose output is a vector or other data structure with important relationships between its different elements.

2.7 Anomaly detection

Goal → detect or flag a set of events or objects that are unusual or atypical with respect to the global set of events or objects.

2.8 Synthesis and sampling

Goal → generate new examples that are similar to those in the training data.
○ Synthesis and sampling via machine learning can be useful for media applications when generating large volumes of content by hand would be expensive or require too much time.

2.9 Imputation of missing values

Goal → given an example with some values missing, the computer must provide a prediction of the missing values.

2.10 Denoising

Goal → correct a corrupted example.

3. Performance measure (P)

To evaluate the abilities of a machine learning algorithm, we must design a quantitative measure of its performance.
Usually this performance measure P is specific to the task T being carried out by the system.
For tasks such as classification, classification with missing inputs, and transcription, we often measure the accuracy of the model.
○ Accuracy can be a dangerous measurement.
○ Please, evaluate your models with more robust metrics: macro F-measure, AUC, g-mean, statistical tests…
We can also obtain equivalent information by measuring the error rate, the proportion of examples for which the model produces an incorrect output.
○ We often refer to the error rate as the expected 0-1 loss.
○ The 0-1 loss on a particular example is 0 if it is correctly classified and 1 if it is not.
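To make the relationship between accuracy, the error rate and the 0-1 loss concrete, a small sketch with illustrative labels (for the more robust metrics mentioned above, functions such as scikit-learn's f1_score with average='macro' are common choices):

```python
import numpy as np

y_true = np.array([1, 2, 2, 3, 1, 3])     # illustrative gold labels
y_pred = np.array([1, 2, 3, 3, 1, 1])     # illustrative model outputs

zero_one = (y_pred != y_true).astype(int) # 0 if correct, 1 if not
error_rate = zero_one.mean()              # expected 0-1 loss
accuracy = 1.0 - error_rate               # equivalent information
print(error_rate, accuracy)               # 0.333..., 0.666...
```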
Usually, we are interested in how well the machine learning algorithm performs on data that it has not seen before, since this determines how well it will work when deployed in the real world.
We evaluate these performance measures using a test set of data that is separate from the data used for training the machine learning system.
○ It is very important to avoid data bleeding. Data bleeding occurs when the test set is also used to train the model.
The choice of performance measure may seem straightforward and objective, but it is often difficult to choose a performance measure that corresponds well to the desired behavior of the system.
○ E.g. when performing a transcription task, should we measure the accuracy of the system at transcribing entire sequences, or should we use a more fine-grained performance measure that gives partial credit for getting some elements of the sequence correct?
○ When performing a regression task, should we penalize the system more if it frequently makes medium-sized mistakes or if it rarely makes very large mistakes?

4. Experience (E)

Machine learning algorithms can be broadly categorized as unsupervised or supervised by what kind of experience they are allowed to have during the learning process.
This will depend on the dataset.
○ A dataset is a collection of examples, sometimes also called datapoints.

4.1 Unsupervised learning

Unsupervised learning algorithms experience a dataset containing many features, then learn useful properties of the structure of this dataset.
○ The examples are not labeled.
Some unsupervised learning algorithms perform roles like clustering, which consists of dividing the dataset into clusters of similar examples.

4.2 Supervised learning

Supervised learning algorithms experience a dataset containing features, but each example is also associated with a label or target.
○ E.g. an eye infection dataset where each example is labeled as either a bacteriological or a fungal infection.
Unsupervised learning involves observing several examples of a random vector x and attempting to implicitly or explicitly learn the probability distribution p(x), or some interesting properties of that distribution.
Supervised learning involves observing several examples of a random vector x and an associated value or vector y, then learning to predict y from x, usually by estimating p(y|x).

(4. Experience (E), continued)

Some machine learning algorithms do not just experience a fixed dataset. For example, reinforcement learning algorithms interact with an environment, so there is a feedback loop between the learning system and its experiences.
Just as there is no formal definition of supervised and unsupervised learning, there is no rigid taxonomy of datasets or experiences.
For example, in deep learning, self-supervised learning is also used.
○ The basic idea is to automatically generate some kind of supervisory signal to solve some task (typically, to learn representations of the data or to automatically label a dataset).
○ E.g. autoencoders, sketched below.
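A minimal sketch of the autoencoder idea, assuming nothing beyond NumPy: the supervisory signal is generated automatically because the reconstruction target is the input itself, so no labels are needed. All sizes and constants here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))      # 200 unlabeled examples, 8 features

n, k, lr = X.shape[1], 3, 0.01     # k = size of the learned representation
W_enc = rng.normal(scale=0.1, size=(n, k))
W_dec = rng.normal(scale=0.1, size=(k, n))

for step in range(500):
    Z = X @ W_enc                  # encoding: the representation
    X_hat = Z @ W_dec              # decoding: the reconstruction
    err = X_hat - X                # the "label" is the input itself
    # Gradient descent on the mean squared reconstruction error.
    grad_dec = Z.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

print(np.mean(err ** 2))           # decreases as training proceeds
```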
5. Capacity, overfitting & underfitting

The central challenge in machine learning is that our algorithm must perform well on new, previously unseen inputs.
The ability to perform well on previously unobserved inputs is called generalization.
When training a machine learning model, we have access to a training set.
○ We can compute some error measure on the training set, called the training error, and we reduce this training error.
What separates machine learning from pure optimization is that we also want the generalization error, also called the test error, to be low.
Our objective is to minimize the generalization error or test error.
We typically estimate the generalization error of a machine learning model by measuring its performance on a test set of examples that were collected separately from the training set.
○ Obviously, if the training and the test set are collected arbitrarily, there is little we can do.
The training and test data are generated by a probability distribution over datasets called the data-generating process. We typically make a set of assumptions known collectively as the i.i.d. assumptions:
○ The examples in each dataset are independent from each other.
○ The training set and test set are identically distributed, drawn from the same probability distribution as each other.
We call that shared underlying distribution the data-generating distribution.
When we use a machine learning algorithm, we do not fix the parameters ahead of time and then sample both datasets.
○ We sample the training set, then use it to choose the parameters to reduce training set error, then sample the test set.
Under this process, the expected test error is greater than or equal to the expected value of the training error.
The factors determining how well a machine learning algorithm will perform are its ability to:
1. Make the training error small.
2. Make the gap between training and test error small.
These two factors correspond to the two central challenges in machine learning: underfitting and overfitting.
○ Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set. The model is unable to learn.
○ Overfitting occurs when the gap between the training error and test error is too large. The model is unable to generalize.
We can control whether a model is more likely to overfit or underfit by altering its capacity.
A model's capacity is its ability to fit a wide variety of functions.
Models with low capacity may struggle to fit the training set.
Models with high capacity can overfit by memorizing properties of the training set that do not serve them well on the test set.
In deep learning, for example, capacity is controlled by the number of layers and the size of each layer.
One way to control the capacity of a learning algorithm is by choosing its hypothesis space, the set of functions that the learning algorithm is allowed to select as being the solution.
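A small sketch of this trade-off, using polynomial degree as the capacity knob (the hypothesis space is "polynomials of degree d"). The data-generating function and sample sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-1, 1, 12))
y_train = np.sin(3 * x_train) + rng.normal(scale=0.1, size=12)
x_test = np.sort(rng.uniform(-1, 1, 100))
y_test = np.sin(3 * x_test) + rng.normal(scale=0.1, size=100)

for degree in (1, 4, 9):          # low, adequate, and high capacity
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # degree 1 underfits (large training error); degree 9 tends to
    # overfit (tiny training error, larger gap to the test error).
    print(degree, round(train_err, 4), round(test_err, 4))
```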
5.1 The No Free Lunch Theorem

The no free lunch theorem for machine learning states that:
○ Averaged over all possible data-generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points.
○ What does this mean? No machine learning algorithm is universally any better than any other.
There is no machine learning algorithm that will be the best solution for all problems.

5.2 Regularization

Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.
It is one of the main concerns of machine learning, rivaled only by optimization.
Different algorithms have different regularization options.
○ In deep learning, a commonly used one is dropout.

6. Hyperparameters and Validation Sets

Most machine learning algorithms have hyperparameters → settings that we can use to control the algorithm's behavior.
The values for these hyperparameters are not learned automatically; they are set manually. We need to try different values and explore the possibilities to find the best configuration.
Problem: the test examples must not be used in any way to make choices about the model, including its hyperparameters.
○ Be very strict with this.
Validation set: constructed from the training data, but these examples are not seen by the algorithm during training.
Sets used:
○ Training set: used to learn the parameters of the algorithm.
○ Validation set: used to test the generalization when adjusting the hyperparameters.
○ Test set: used to test the generalization of the final configuration.
We first split the data into the training and test sets; then we split the training data into the training and validation sets (see the sketch at the end of this section).
Since the validation set is used to "train" the hyperparameters, the validation set error will underestimate the generalization error, though typically by a smaller amount than the training error does.
After all hyperparameter optimization is complete, the generalization error may be estimated using the test set.
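A minimal sketch of the two-stage split described above, using scikit-learn's train_test_split; the split proportions are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)      # illustrative data
y = np.arange(100)

# First split off the test set; it is only touched at the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Then carve a validation set out of the remaining training data.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))    # 60 20 20
```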
7. The curse of dimensionality

Many machine learning problems become exceedingly difficult when the number of dimensions in the data is high.
○ This phenomenon is known as the curse of dimensionality.
Of particular concern is that the number of possible distinct configurations of a set of variables increases exponentially as the number of variables increases.
Problems:
○ Data gets spread out.
○ More data is needed.
○ Harder to compute.
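A quick numeric sketch of "data gets spread out": with a fixed budget of 1,000 points, the fraction that lands in a region covering half of each axis shrinks roughly as 0.5ⁿ with the number of dimensions n:

```python
import numpy as np

rng = np.random.default_rng(0)
for n_dims in (1, 2, 5, 10, 20):
    X = rng.uniform(size=(1000, n_dims))          # unit hypercube
    inside = np.all((X > 0.25) & (X < 0.75), axis=1).mean()
    print(n_dims, inside)   # covered fraction shrinks as ~0.5 ** n_dims
```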
INTRODUCTION - THE MACHINE LEARNING PROCESS

1. Choosing the Training Experience

1.1 First design choice

The first design choice we face is to choose the type of training experience from which our system will learn.
○ The type of training experience available can have a significant impact on the success or failure of the learner.
○ Not all experience types are adequate for all problems.
One key attribute is whether the training experience provides direct or indirect feedback regarding the choices made by the performance system.
In our example, the system might learn from direct training examples consisting of individual checkers board states and the correct move for each.
Alternatively, it might have available only indirect information consisting of the move sequences and final outcomes of various games played.
○ In this latter case, information about the correctness of specific moves early in the game must be inferred indirectly from the fact that the game was eventually won or lost.
○ Here the learner faces an additional problem of credit assignment, or determining the degree to which each move in the sequence deserves credit or blame for the final outcome.
○ Credit assignment can be a particularly difficult problem because the game can be lost even when early moves are optimal, if these are followed later by poor moves.
For this reason, learning from direct training feedback is typically easier than learning from indirect feedback.

1.2 Second design choice

A second important attribute of the training experience is the degree to which the learner controls the sequence of training examples.
○ The learner might rely on the teacher to select informative board states and to provide the correct move for each → Supervised Learning.
○ The learner might itself propose board states that it finds particularly confusing and ask the teacher for the correct move → Active Learning.
○ The learner may have complete control over both the board states and (indirect) training classifications, as it does when it learns by playing against itself with no teacher present → Reinforcement Learning.
Notice that in this last case the learner may choose between experimenting with novel board states that it has not yet considered, or honing its skill by playing minor variations of lines of play it currently finds most promising → Exploration vs Exploitation in Reinforcement Learning.

1.3 Third design choice

A third important attribute of the training experience is how well it represents the distribution of examples over which the final system performance P must be measured.
○ In general, learning is most reliable when the training examples follow a distribution similar to that of future test examples.
Machine learning models are learning a distribution from the examples; if they do not see examples for one class, they will not be able to identify new instances of that class.
○ This is not completely true, however → one-shot and zero-shot learning.
In our example, the performance metric P is the percentage of games the system wins in the world tournament.
If its training experience E consists only of games played against itself, there is an obvious danger that this training experience might not be fully representative of the distribution of situations over which it will later be tested.
○ For example, the learner might never encounter certain crucial board states that are very likely to be played by the human checkers champion.
In practice, it is often necessary to learn from a distribution of examples that is somewhat different from those on which the final system will be evaluated (e.g., the world checkers champion might not be interested in teaching the program!).
○ Training vs Validation vs Test Sets.
Such situations are problematic because mastery of one distribution of examples will not necessarily lead to strong performance over some other distribution.
We shall see that most current theory of machine learning rests on the crucial assumption that the distribution of training examples is identical to the distribution of test examples → the i.i.d. assumptions.
Despite our need to make this assumption in order to obtain theoretical results, it is important to keep in mind that this assumption must often be violated in practice.

1.4 Putting it all together

We decide that our system will train by playing games against itself.
This has the advantage that no external trainer need be present, and it therefore allows the system to generate as much training data as time permits.

2. Choosing the Target Function

The next design choice is to determine exactly what type of knowledge will be learned and how this will be used by the performance program.
We are going to assume that we have a checkers-playing program that can generate the legal moves from any board state.
The program needs only to learn how to choose the best move from among these legal moves.
This learning task is representative of a large class of tasks for which the legal moves that define some large search space are known a priori, but for which the best search strategy is not known.
Many optimization problems fall into this class, such as the problems of scheduling and controlling manufacturing processes, where the available manufacturing steps are well understood but the best strategy for sequencing them is not.
Given this setting where we must learn to choose among the legal moves, the most obvious choice for the type of information to be learned is a program, or function, that chooses the best move for any given board state.
Let us call it ChooseMove and use the notation ChooseMove: B → M.
○ This function accepts as input any board from the set of legal board states B.
○ This function produces as output some move from the set of legal moves M.
Throughout our discussion of machine learning we will find it useful to reduce the problem of improving performance P at task T to the problem of learning some particular target function such as ChooseMove.
The choice of the target function will therefore be a key design choice.
Although ChooseMove is an obvious choice for the target function in our example, this function will turn out to be very difficult to learn given the kind of indirect training experience available to our system.
An alternative target function, and one that will turn out to be easier to learn in this setting, is an evaluation function that assigns a numerical score to any given board state.
Let us call this target function V and again use the notation V: B → ℝ to denote that V maps any legal board state from the set B to some real value (we use ℝ to denote the set of real numbers).
We intend for this target function V to assign higher scores to better board states.
If the system can successfully learn such a target function V, then it can easily use it to select the best move from any current board position. This can be accomplished by generating the successor board state produced by every legal move, then using V to choose the best successor state and therefore the best legal move.
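In code, that selection rule is essentially one line. legal_moves, successor and v_hat below are hypothetical stand-ins for the checkers program's move generator, its board-update routine, and the learned evaluation function:

```python
def choose_move(board, legal_moves, successor, v_hat):
    """Greedy move selection with a learned evaluation function.

    legal_moves(board), successor(board, move) and v_hat(board) are
    hypothetical helpers; only the selection rule itself is shown.
    """
    return max(legal_moves(board),
               key=lambda m: v_hat(successor(board, m)))
```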
Let us therefore define the target value V(b) for an arbitrary board state b in B, as follows:
1. If b is a final board state that is won, then V(b) = 100.
2. If b is a final board state that is lost, then V(b) = -100.
3. If b is a final board state that is drawn, then V(b) = 0.
4. If b is not a final state in the game, then V(b) = V(b'), where b' is the best final board state that can be achieved starting from b and playing optimally until the end of the game.
While this recursive definition specifies a value of V(b) for every board state b, this definition is not usable by our checkers player because it is not efficiently computable.
○ In case 4, determining the value of V(b) for a particular board state requires searching ahead for the optimal line of play, all the way to the end of the game (see the sketch below).
Because this definition is not efficiently computable by our checkers-playing program, we say that it is a non-operational definition.
The goal of learning in this case is to discover an operational description of V; that is, a description that can be used by the checkers-playing program to evaluate states and select moves within realistic time bounds.
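A direct transcription of the four cases makes the non-operational character visible. Here game is a hypothetical interface to the checkers program, and "playing optimally until the end of the game" is rendered as a minimax recursion over both players' moves:

```python
def V(b, game, maximizing=True):
    """The (non-operational) target function V defined above.

    `game` is a hypothetical interface to the checkers program with
    is_final, is_won, is_lost, legal_moves and successor methods.
    """
    # Cases 1-3: final board states have fixed values.
    if game.is_final(b):
        return 100 if game.is_won(b) else -100 if game.is_lost(b) else 0
    # Case 4: the value of the best final state reachable under optimal
    # play.  Optimal play by both sides makes this a minimax recursion
    # that searches to the end of the game, which is exactly why the
    # definition is correct but not efficiently computable.
    values = [V(game.successor(b, m), game, not maximizing)
              for m in game.legal_moves(b)]
    return max(values) if maximizing else min(values)
```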
Thus, we have reduced the learning task in this case to the problem of discovering an operational description of the ideal target function V.
It may be very difficult in general to learn such an operational form of V perfectly.
We often expect learning algorithms to acquire only some approximation to the target function, and for this reason the process of learning the target function is often called function approximation.
In the current discussion we will use the symbol V̂ to refer to the function that is actually learned by our program, to distinguish it from the ideal target function V.

3. Choosing a Representation of the Target Function

Now that we have specified the ideal target function V, we must choose a representation that the learning program will use to describe the function V̂ that it will learn.
As with earlier design choices, we again have many options:
1. We could allow the program to represent V̂ using a large table with a distinct entry specifying the value for each distinct board state.
2. We could allow it to represent V̂ using a collection of rules that match against features of the board state.
3. We could use a quadratic polynomial function of predefined board features.
4. We could use an artificial neural network.
In general, this choice of representation involves a crucial tradeoff.
○ On one hand, we wish to pick a very expressive representation to allow representing as close an approximation as possible to the ideal target function V.
○ On the other hand, the more expressive the representation, the more training data the program will require in order to choose among the alternative hypotheses it can represent.
○ Remember our discussion of capacity in the previous unit.
We are just going to choose one for this example: for any given board state, the function V̂ will be calculated as a linear combination of the following board features:
○ x₁: the number of black pieces on the board.
○ x₂: the number of red pieces on the board.
○ x₃: the number of black queens on the board.
○ x₄: the number of red queens on the board.
○ x₅: the number of black pieces threatened by red (i.e., which can be captured on red's next turn).
○ x₆: the number of red pieces threatened by black.
Our learning program will represent V̂ as a linear function:

V̂(b) = w₀ + w₁x₁ + w₂x₂ + w₃x₃ + w₄x₄ + w₅x₅ + w₆x₆

where w₀ through w₆ are numerical coefficients, or weights, to be chosen by the learning algorithm.
Learned values for the weights w₁ through w₆ will determine the relative importance of the various board features in determining the value of the board.
The weight w₀ will provide an additive constant to the board value.
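A minimal sketch of this representation; the feature vector and the all-zero initial weights are illustrative:

```python
import numpy as np

def v_hat(x, w):
    """Linear evaluation function: V_hat(b) = w0 + w1*x1 + ... + w6*x6.

    `x` holds the six board features x1..x6 described above; `w` holds
    the seven weights w0..w6 to be chosen by the learning algorithm.
    """
    return w[0] + np.dot(w[1:], x)

w = np.zeros(7)                       # weights before any learning
x = np.array([3, 0, 0, 0, 0, 0])      # e.g. three black pieces, no red
print(v_hat(x, w))                    # 0.0 until the weights are learned
```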
4. Choosing a Function Approximation Algorithm

To learn the target function V̂ we require a set of training examples.
1. Each example describes a specific board state b and the training value Vtrain(b) for b.
2. Each training example is an ordered pair (b, Vtrain(b)).
In our checkers model, the following would be an example where black has won the game (x₂ = 0 because there are no more red pieces), and in which the value of the target function is Vtrain(b) = +100:

((x₁ = 3, x₂ = 0, x₃ = 0, x₄ = 0, x₅ = 0, x₆ = 0), +100)

4.1 Estimating Training Values

In our problem, we are going to use an approach that has been shown to work quite well in similar problems.
○ In most cases, we do not need to reinvent the wheel; studying the literature for similar problems is usually how you would approach the design.
The approach is to assign the training value Vtrain(b) for any intermediate board state b to be V̂(Successor(b)), where V̂ is the learner's current approximation to V and where Successor(b) denotes the next board state following b for which it is again the program's turn to move.
○ I.e. the board state following the program's move and the opponent's response.
This rule for estimating training values can be summarized as:

Vtrain(b) ← V̂(Successor(b))

It probably seems a bit strange that we are using the current version of V̂ to estimate the training values that will be used to further refine the same function.
○ We are learning the proper values of the weights.
○ With each example, we are looking for the values of the weights that will result in our model winning the match.
Notice that we are using estimates of the values of Successor(b) to estimate the value of board state b. Intuitively, we can see this will make sense if V̂ tends to be more accurate for board states closer to the game's end.
○ Under certain conditions, the approach of iteratively estimating training values based on estimates of successor state values can be proven to converge toward perfect estimates of Vtrain.
○ See chapter 13 of (Mitchell, 1997) for a more detailed discussion of how this is done in Reinforcement Learning.
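A sketch of this estimation rule applied along one game trace. trace, v_hat and final_value are hypothetical inputs standing in for the data the Critic (section 5.2) will actually receive:

```python
def estimate_training_values(trace, v_hat, final_value):
    """Apply Vtrain(b) <- V_hat(Successor(b)) along one game trace.

    `trace` lists the board states where it was the program's turn to
    move, oldest first; `final_value` is +100, -100 or 0 for the
    finished game.  Returns the (b, Vtrain(b)) training pairs.
    """
    examples = []
    for i, b in enumerate(trace):
        if i + 1 < len(trace):
            # Estimate from the learner's own evaluation of the successor.
            examples.append((b, v_hat(trace[i + 1])))
        else:
            examples.append((b, final_value))   # final board state
    return examples
```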
4.2 Adjusting the Weights

The only thing that we need now to finish our design is to select the algorithm that we will use to choose the weights wᵢ that best fit the set of training examples {(b, Vtrain(b))}.
As a first step we must define what we mean by the best fit to the training data.
○ One common approach is to define the best hypothesis, or set of weights, as that which minimizes the squared error E between the training values and the values predicted by the hypothesis V̂, i.e. E ≡ Σ (Vtrain(b) - V̂(b))², summed over the training examples (b, Vtrain(b)).
We seek the weights, or equivalently the V̂, that minimize E for the observed training examples.
○ Chapter 6 in Mitchell, 1997 discusses settings in which minimizing the sum of squared errors is equivalent to finding the most probable hypothesis given the observed training data.
Several algorithms are known for finding weights of a linear function that minimize E defined in this way.
○ In our case, we require an algorithm that will incrementally refine the weights as new training examples become available and that will be robust to errors in these estimated training values.
One such algorithm is called the least mean squares, or LMS training rule.
○ For each observed training example, it adjusts the weights a small amount in the direction that reduces the error on this training example.
○ As discussed in Chapter 4 of Mitchell, 1997, this algorithm can be viewed as performing a stochastic gradient-descent search through the space of possible hypotheses (weight values) to minimize the squared error E.
LMS weight update rule:
○ For each training example (b, Vtrain(b)):
○ Use the current weights to calculate V̂(b).
○ For each weight wᵢ, update it as:

wᵢ ← wᵢ + η (Vtrain(b) - V̂(b)) xᵢ

○ η (the Greek letter eta) is a small constant (e.g., 0.1) that moderates the size of the weight update → Why do we need this?
○ xᵢ is the feature value. Notice that if the value of some feature xᵢ is zero, then its weight is not altered regardless of the error, so the only weights updated are those whose features actually occur in the training example board.
To get an intuitive understanding of why this weight update rule works, notice that when the error (Vtrain(b) - V̂(b)) is zero, no weights are changed.
When (Vtrain(b) - V̂(b)) is positive (i.e., when V̂(b) is too low), then each weight is increased in proportion to the value of its corresponding feature. This will raise the value of V̂(b), reducing the error.
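The update rule above is a one-liner in code; this sketch applies it to the winning-position example from the start of this section (note that only w₀ and w₁ change, because x₂ through x₆ are zero):

```python
import numpy as np

def lms_update(w, x, v_train, eta=0.1):
    """One LMS step: w_i <- w_i + eta * (Vtrain(b) - V_hat(b)) * x_i."""
    x_full = np.concatenate(([1.0], x))     # x0 = 1 pairs with w0
    error = v_train - np.dot(w, x_full)     # Vtrain(b) - V_hat(b)
    return w + eta * error * x_full

w = np.zeros(7)                             # w0 .. w6 before learning
x = np.array([3, 0, 0, 0, 0, 0], dtype=float)
w = lms_update(w, x, v_train=100.0)
print(w)    # only w0 and w1 move; features with value 0 leave their
            # weights untouched, regardless of the error
```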
5. The final design

Now we are going to put it all together.
The final design of our checkers learning system can be naturally described by four distinct program modules that represent the central components in many learning systems.

5.1 Performance System

The Performance System is the module that must solve the given performance task, in this case playing checkers, by using the learned target function(s).
It takes an instance of a new problem (a new game) as input and produces a trace of its solution (the game history) as output.
In our example, the strategy used by the Performance System to select its next move at each step is determined by the learned V̂ evaluation function. Therefore, we expect its performance to improve as this evaluation function becomes increasingly accurate.

5.2 Critic

The Critic takes as input the history or trace of the game and produces as output a set of training examples of the target function.
○ Each training example in this case corresponds to some game state in the trace, along with an estimate Vtrain of the target function value for this example.
In our example, the Critic corresponds to the training rule given by the equation Vtrain(b) ← V̂(Successor(b)).

5.3 Generalizer

The Generalizer takes as input the training examples and produces an output hypothesis that is its estimate of the target function.
It generalizes from the specific training examples, hypothesizing a general function that covers these examples and other cases beyond the training examples.
In our example, the Generalizer corresponds to the LMS algorithm, and the output hypothesis is the function V̂ described by the learned weights wᵢ.

5.4 Experiment Generator

The Experiment Generator takes as input the current hypothesis (the currently learned function) and outputs a new problem (i.e., an initial board state) for the Performance System to explore.
Its role is to pick new practice problems that will maximize the learning rate of the overall system.
In our example, the Experiment Generator follows a very simple strategy: it always proposes the same initial game board to begin a new game.
○ More sophisticated strategies could involve creating board positions designed to explore particular regions of the state space.

(5. The final design, continued)

With our choices, we have ended up with a design somewhat similar to the Actor-Critic model employed in Reinforcement Learning.
There are multiple ways to solve the same problem; we will learn how to use different metrics to evaluate which of the possible solutions is the best one.
Remember, there is no one single algorithm or model that is the best solution for all problems (No Free Lunch theorem).

6. Designing a learning system: Steps

1. Data preprocessing
2. Choosing the Training Experience
3. Choosing the Target Function
4. Choosing a Representation of the Target Function
5. Choosing a Function Approximation Algorithm
6. The Final Design
7. Model deployment (MLOps)
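Putting the four modules together, a hedged sketch of how they interact, one game per iteration; all four arguments are callables standing in for the modules described above, not taken from any real library:

```python
def training_loop(n_games, experiment_generator, performance_system,
                  critic, generalizer):
    """Sketch of how the four modules interact, one game per iteration."""
    hypothesis = None                              # e.g. initial weights
    for _ in range(n_games):
        board = experiment_generator(hypothesis)   # propose a new game
        trace = performance_system(board, hypothesis)   # play it out
        examples = critic(trace)                   # (b, Vtrain(b)) pairs
        hypothesis = generalizer(examples, hypothesis)  # e.g. LMS steps
    return hypothesis
```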
