Chapter 5: Machine Learning Basics
Summary
This chapter provides a brief overview of machine learning basics, including definitions and examples. It covers the general principles of learning algorithms, discusses how fitting the training data differs from generalizing to new data, and introduces the supervised and unsupervised learning categories.
Full Transcript
Chapter 5
Machine Learning Basics

Deep learning is a specific kind of machine learning. To understand deep learning well, one must have a solid understanding of the basic principles of machine learning. This chapter provides a brief course in the most important general principles that are applied throughout the rest of the book. Novice readers or those who want a wider perspective are encouraged to consider machine learning textbooks with a more comprehensive coverage of the fundamentals, such as Murphy (2012) or Bishop (2006). If you are already familiar with machine learning basics, feel free to skip ahead to section 5.11. That section covers some perspectives on traditional machine learning techniques that have strongly influenced the development of deep learning algorithms.

We begin with a definition of what a learning algorithm is and present an example: the linear regression algorithm. We then proceed to describe how the challenge of fitting the training data differs from the challenge of finding patterns that generalize to new data. Most machine learning algorithms have settings called hyperparameters, which must be determined outside the learning algorithm itself; we discuss how to set these using additional data.
Machine learning is essentially a form of applied statistics with increased emphasis on the use of computers to statistically estimate complicated functions and a decreased emphasis on proving confidence intervals around these functions; we therefore present the two central approaches to statistics: frequentist estimators and Bayesian inference. Most machine learning algorithms can be divided into the categories of supervised learning and unsupervised learning; we describe these categories and give some examples of simple learning algorithms from each category. Most deep learning algorithms are based on an optimization algorithm called stochastic gradient descent. We describe how to combine various algorithm components, such as an optimization algorithm, a cost function, a model, and a dataset, to build a machine learning algorithm. Finally, in section 5.11, we describe some of the factors that have limited the ability of traditional machine learning to generalize. These challenges have motivated the development of deep learning algorithms that overcome these obstacles.

5.1 Learning Algorithms

A machine learning algorithm is an algorithm that is able to learn from data. But what do we mean by learning? Mitchell (1997) provides a succinct definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." One can imagine a wide variety of experiences E, tasks T, and performance measures P, and we do not attempt in this book to formally define what may be used for each of these entities. Instead, in the following sections, we provide intuitive descriptions and examples of the different kinds of tasks, performance measures, and experiences that can be used to construct machine learning algorithms.
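To make Mitchell's definition concrete, here is a small illustrative sketch (not from the text; the data generator and the threshold-learning rule are invented for the demonstration). The task T is classifying points on the real line, the performance measure P is accuracy on held-out points, and the experience E is a training set whose size we vary:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Task T: classify points on the real line; the true boundary is 0.5."""
    x = np.concatenate([rng.uniform(0.0, 0.5, n // 2),   # class 0
                        rng.uniform(0.5, 1.0, n // 2)])  # class 1
    return x, (x > 0.5).astype(int)

def learn_threshold(x, y):
    """Learn a decision threshold as the midpoint between the class means."""
    return (x[y == 0].mean() + x[y == 1].mean()) / 2

x_test, y_test = make_data(1000)   # held-out data used by the measure P
for n in (4, 16, 64, 256):         # experience E: more training examples
    x_train, y_train = make_data(n)
    t = learn_threshold(x_train, y_train)
    accuracy = ((x_test > t).astype(int) == y_test).mean()
    print(f"E = {n:3d} examples -> P (accuracy) = {accuracy:.3f}")
```

On average, the learned threshold tightens around the true boundary as E grows, so the performance P tends toward 1, which is exactly the "improves with experience" clause of the definition.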
5.1.1 The Task, T

Machine learning enables us to tackle tasks that are too difficult to solve with fixed programs written and designed by human beings. From a scientific and philosophical point of view, machine learning is interesting because developing our understanding of it entails developing our understanding of the principles that underlie intelligence.

In this relatively formal definition of the word "task," the process of learning itself is not the task. Learning is our means of attaining the ability to perform the task. For example, if we want a robot to be able to walk, then walking is the task. We could program the robot to learn to walk, or we could attempt to directly write a program that specifies how to walk manually.

Machine learning tasks are usually described in terms of how the machine learning system should process an example. An example is a collection of features that have been quantitatively measured from some object or event that we want the machine learning system to process. We typically represent an example as a vector x ∈ R^n where each entry x_i of the vector is another feature. For example, the features of an image are usually the values of the pixels in the image.

Many kinds of tasks can be solved with machine learning. Some of the most common machine learning tasks include the following:

Classification: In this type of task, the computer program is asked to specify which of k categories some input belongs to. To solve this task, the learning algorithm is usually asked to produce a function f : R^n → {1, ..., k}. When y = f(x), the model assigns an input described by vector x to a category identified by numeric code y. There are other variants of the classification task, for example, where f outputs a probability distribution over classes. An example of a classification task is object recognition, where the input is an image (usually described as a set of pixel brightness values), and the output is a numeric code identifying the object in the image. For example, the Willow Garage PR2 robot is able to act as a waiter that can recognize different kinds of drinks and deliver them to people on command (Goodfellow et al., 2010). Modern object recognition is best accomplished with deep learning (Krizhevsky et al., 2012; Ioffe and Szegedy, 2015). Object recognition is the same basic technology that enables computers to recognize faces (Taigman et al., 2014), which can be used to automatically tag people in photo collections and for computers to interact more naturally with their users.
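To make the signature f : R^n → {1, ..., k} concrete, here is a minimal sketch of one possible classifier, a nearest-centroid rule; the rule and all names are illustrative choices rather than anything prescribed by the text (the code indexes categories from 0 rather than 1, as is conventional in Python):

```python
import numpy as np

def fit_centroids(X, y, k):
    """Compute one mean vector ("centroid") per class.
    X: (m, n) array, one example x in R^n per row; y: labels in {0, ..., k-1}."""
    return np.stack([X[y == c].mean(axis=0) for c in range(k)])

def f(x, centroids):
    """The learned classifier f: assign x to the class with the nearest centroid."""
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

# Tiny worked example with n = 2 features and k = 2 categories.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
centroids = fit_centroids(X, y, k=2)
print(f(np.array([0.1, 0.0]), centroids))  # -> 0
print(f(np.array([1.0, 0.9]), centroids))  # -> 1
```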
Classification with missing inputs: Classification becomes more challenging if the computer program is not guaranteed that every measurement in its input vector will always be provided. To solve the classification task, the learning algorithm only has to define a single function mapping from a vector input to a categorical output. When some of the inputs may be missing, rather than providing a single classification function, the learning algorithm must learn a set of functions. Each function corresponds to classifying x with a different subset of its inputs missing. This kind of situation arises frequently in medical diagnosis, because many kinds of medical tests are expensive or invasive. One way to efficiently define such a large set of functions is to learn a probability distribution over all the relevant variables, then solve the classification task by marginalizing out the missing variables. With n input variables, we can now obtain all 2^n different classification functions needed for each possible set of missing inputs, but the computer program needs to learn only a single function describing the joint probability distribution. See Goodfellow et al. (2013b) for an example of a deep probabilistic model applied to such a task in this way. Many of the other tasks described in this section can also be generalized to work with missing inputs; classification with missing inputs is just one example of what machine learning can do.
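The marginalization idea can be made concrete with a toy joint table (an illustration with invented numbers, not from the text): a single stored distribution p(x1, x2, y) over two binary features and a binary class supports classification under every pattern of missing inputs, standing in for all 2^n classification functions:

```python
import numpy as np

# Hypothetical joint distribution p(x1, x2, y), indexed as joint[x1, x2, y];
# the entries sum to 1.
joint = np.array([[[0.30, 0.02], [0.10, 0.08]],
                  [[0.05, 0.15], [0.02, 0.28]]])

def classify(observed):
    """Classify given a dict of observed features, e.g. {'x1': 1}.
    Missing features are marginalized out by summing over their axes."""
    p = joint
    for axis, name in [(1, 'x2'), (0, 'x1')]:  # reduce the higher axis first
        if name in observed:
            p = np.take(p, observed[name], axis=axis)
        else:
            p = p.sum(axis=axis)
    return int(np.argmax(p))  # p is now an unnormalized distribution over y

print(classify({'x1': 0, 'x2': 0}))  # both features observed -> class 0
print(classify({'x1': 1}))           # x2 missing, marginalized out -> class 1
print(classify({}))                  # all inputs missing -> argmax of p(y)
```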
Regression: In this type of task, the computer program is asked to predict a numerical value given some input. To solve this task, the learning algorithm is asked to output a function f : R^n → R. This type of task is similar to classification, except that the format of output is different. An example of a regression task is the prediction of the expected claim amount that an insured person will make (used to set insurance premiums), or the prediction of future prices of securities. These kinds of predictions are also used for algorithmic trading.
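The chapter introduction names linear regression as its worked example of a learning algorithm; as a preview, here is a minimal sketch of a regression function f : R^n → R fit by ordinary least squares on synthetic data (all names and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: m examples with n = 3 features and a noisy linear target.
m, n = 100, 3
X = rng.normal(size=(m, n))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=m)

# Fit f(x) = w^T x by ordinary least squares.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

def f(x):
    """The learned regression function f: R^n -> R."""
    return float(w @ x)

print(np.round(w, 2))                # close to [1.5, -2.0, 0.5]
print(f(np.array([1.0, 0.0, 0.0])))  # approximately 1.5
```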
Transcription: In this type of task, the machine learning system is asked to observe a relatively unstructured representation of some kind of data and transcribe the information into discrete textual form. For example, in optical character recognition, the computer program is shown a photograph containing an image of text and is asked to return this text in the form of a sequence of characters (e.g., in ASCII or Unicode format). Google Street View uses deep learning to process address numbers in this way (Goodfellow et al., 2014d). Another example is speech recognition, where the computer program is provided an audio waveform and emits a sequence of characters or word ID codes describing the words that were spoken in the audio recording. Deep learning is a crucial component of modern speech recognition systems used at major companies, including Microsoft, IBM and Google (Hinton et al., 2012b).

Machine translation: In a machine translation task, the input already consists of a sequence of symbols in some language, and the computer program must convert this into a sequence of symbols in another language. This is commonly applied to natural languages, such as translating from English to French. Deep learning has recently begun to have an important impact on this kind of task (Sutskever et al., 2014; Bahdanau et al., 2015).

Structured output: Structured output tasks involve any task where the output is a vector (or other data structure containing multiple values) with important relationships between the different elements. This is a broad category and subsumes the transcription and translation tasks described above, as well as many other tasks. One example is parsing: mapping a natural language sentence into a tree that describes its grammatical structure by tagging nodes of the tree as being verbs, nouns, adverbs, and so on. See Collobert (2011) for an example of deep learning applied to a parsing task. Another example is pixel-wise segmentation of images, where the computer program assigns every pixel in an image to a specific category. For example, deep learning can be used to annotate the locations of roads in aerial photographs (Mnih and Hinton, 2010). The output form need not mirror the structure of the input as closely as in these annotation-style tasks. For example, in image captioning, the computer program observes an image and outputs a natural language sentence describing the image (Kiros et al., 2014a,b; Mao et al., 2015; Vinyals et al., 2015b; Donahue et al., 2014; Karpathy and Li, 2015; Fang et al., 2015; Xu et al., 2015). These tasks are called structured output tasks because the program must output several values that are all tightly interrelated.
For example, the words produced by an image captioning program must form a valid sentence.

Anomaly detection: In this type of task, the computer program sifts through a set of events or objects and flags some of them as being unusual or atypical. An example of an anomaly detection task is credit card fraud detection. By modeling your purchasing habits, a credit card company can detect misuse of your cards. If a thief steals your credit card or credit card information, the thief's purchases will often come from a different probability distribution over purchase types than your own. The credit card company can prevent fraud by placing a hold on an account as soon as that card has been used for an uncharacteristic purchase. See Chandola et al. (2009) for a survey of anomaly detection methods.
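The fraud example above amounts to comparing new events against a model of the cardholder's purchase distribution. Below is a deliberately simplified sketch of that idea (not the book's method; the data and threshold are invented): fit a Gaussian to one feature of past purchases and flag events far out in its tails.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical history of one cardholder's purchase amounts (in dollars).
history = rng.normal(loc=40.0, scale=10.0, size=500)

# Model the purchasing habits with a single Gaussian.
mu, sigma = history.mean(), history.std()

def is_anomalous(amount, z_threshold=4.0):
    """Flag a purchase whose amount is improbably far from the mean."""
    return abs(amount - mu) / sigma > z_threshold

print(is_anomalous(45.0))   # typical purchase -> False
print(is_anomalous(900.0))  # uncharacteristic purchase -> True
```

Real systems model far richer distributions over purchase types, merchants, and times, but the underlying idea is the same: low probability under the learned distribution triggers a hold.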
Synthesis and sampling: In this type of task, the machine learning algorithm is asked to generate new examples that are similar to those in the training data. Synthesis and sampling via machine learning can be useful for media applications when generating large volumes of content by hand would be expensive, boring, or require too much time. For example, video games can automatically generate textures for large objects or landscapes, rather than requiring an artist to manually label each pixel (Luo et al., 2013). In some cases, we want the sampling or synthesis procedure to generate a specific kind of output given the input. For example, in a speech synthesis task, we provide a written sentence and ask the program to emit an audio waveform containing a spoken version of that sentence. This is a kind of structured output task, but with the added qualification that there is no single correct output for each input, and we explicitly desire a large amount of variation in the output, in order for the output to seem more natural and realistic.

Imputation of missing values: In this type of task, the machine learning algorithm is given a new example x ∈ R^n, but with some entries x_i of x missing. The algorithm must provide a prediction of the values of the missing entries.

Denoising: In this type of task, the machine learning algorithm is given as input a corrupted example x̃ ∈ R^n obtained by an unknown corruption process from a clean example x ∈ R^n. The learner must predict the clean example x from its corrupted version x̃, or more generally predict the conditional probability distribution p(x | x̃).

Density estimation or probability mass function estimation: In the density estimation problem, the machine learning algorithm is asked to learn a function p_model : R^n → R, where p_model(x) can be interpreted as a probability density function (if x is continuous) or a probability mass function (if x is discrete) on the space that the examples were drawn from. To do such a task well (we will specify exactly what that means when we discuss performance measures P), the algorithm needs to learn the structure of the data it has seen. It must know where examples cluster tightly and where they are unlikely to occur. Most of the tasks described above require the learning algorithm to at least implicitly capture the structure of the probability distribution. Density estimation enables us to explicitly capture that distribution. In principle, we can then perform computations on that distribution to solve the other tasks as well. For example, if we have performed density estimation to obtain a probability distribution p(x), we can use that distribution to solve the missing value imputation task. If a value x_i is missing, and all the other values, denoted x_{-i}, are given, then we know the distribution over it is given by p(x_i | x_{-i}). In practice, density estimation does not always enable us to solve all these related tasks, because in many cases the required operations on p(x) are computationally intractable.
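Continuing the same toy style (invented numbers, not from the text), this sketch uses an estimated joint distribution over two binary variables for missing value imputation: the conditional p(x_1 | x_{-1}) is read off the table by slicing at the observed value and renormalizing.

```python
import numpy as np

# Hypothetical estimated joint p(x1, x2) over two binary variables,
# stored as a table indexed by joint[x1, x2]; the entries sum to 1.
joint = np.array([[0.40, 0.10],
                  [0.05, 0.45]])

def p_x1_given_x2(x2):
    """p(x1 | x_{-1}): slice the joint at the observed x2, then renormalize."""
    column = joint[:, x2]
    return column / column.sum()

print(p_x1_given_x2(0))  # [0.889, 0.111]: x1 = 0 is the likely value
print(p_x1_given_x2(1))  # [0.182, 0.818]: x1 = 1 is the likely value

# Impute the missing entry with its most probable value given the rest.
x1_hat = int(np.argmax(p_x1_given_x2(1)))  # -> 1
```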
Of course, many other tasks and types of tasks are possible. The types of tasks we list here are intended only to provide examples of what machine learning can do, not to define a rigid taxonomy of tasks.

5.1.2 The Performance Measure, P

To evaluate the abilities of a machine learning algorithm, we must design a quantitative measure of its performance. Usually this performance measure P is specific to the task T being carried out by the system.

For tasks such as classification, classification with missing inputs, and transcription, we often measure the accuracy of the model. Accuracy is just the proportion of examples for which the model produces the correct output. We can also obtain equivalent information by measuring the error rate, the proportion of examples for which the model produces an incorrect output. We often refer to the error rate as the expected 0-1 loss. The 0-1 loss on a particular example is 0 if it is correctly classified and 1 if it is not.
For tasks such as density estimation, it does not make sense to measure accuracy, error rate, or any other kind of 0-1 loss. Instead, we must use a different performance metric that gives the model a continuous-valued score for each example. The most common approach is to report the average log-probability the model assigns to some examples.

Usually we are interested in how well the machine learning algorithm performs on data that it has not seen before, since this determines how well it will work when deployed in the real world. We therefore evaluate these performance measures using a test set of data that is separate from the data used for training the machine learning system.
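As a small illustration of the measures just described (hypothetical data and names), the sketch below computes accuracy, the error rate as an average 0-1 loss, and the average log-probability a density model assigns to held-out examples:

```python
import numpy as np

# Hypothetical test-set predictions for a classifier.
y_true = np.array([0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1])

accuracy = (y_pred == y_true).mean()        # proportion correct
zero_one_loss = (y_pred != y_true).astype(float)
error_rate = zero_one_loss.mean()           # expected 0-1 loss
print(accuracy, error_rate)                 # ~0.667 and ~0.333

# Hypothetical probabilities a density model assigns to test examples.
p_model = np.array([0.20, 0.05, 0.31, 0.12])
avg_log_prob = np.log(p_model).mean()       # continuous-valued score
print(avg_log_prob)
```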
The choice of performance measure may seem straightforward and objective, but it is often difficult to choose a performance measure that corresponds well to the desired behavior of the system.

In some cases, this is because it is difficult to decide what should be measured. For example, when performing a transcription task, should we measure the accuracy of the system at transcribing entire sequences, or should we use a more fine-grained performance measure that gives partial credit for getting some elements of the sequence correct? When performing a regression task, should we penalize the system more if it frequently makes medium-sized mistakes or if it rarely makes very large mistakes? These kinds of design choices depend on the application.

In other cases, we know what quantity we would ideally like to measure, but measuring it is impractical. For example, this arises frequently in the context of density estimation. Many of the best probabilistic models represent probability distributions only implicitly. Computing the actual probability value assigned to a specific point in space in many such models is intractable. In these cases, one must design an alternative criterion that still corresponds to the design objectives, or design a good approximation to the desired criterion.

5.1.3 The Experience, E

Machine learning algorithms can be broadly categorized as unsupervised or supervised by what kind of experience they are allowed to have during the learning process.

Most of the learning algorithms in this book can be understood as being allowed to experience an entire dataset. A dataset is a collection of many examples, as defined in section 5.1.1. Sometimes we call examples data points.

One of the oldest datasets studied by statisticians and machine learning researchers is the Iris dataset (Fisher, 1936). It is a collection of measurements of different parts of 150 iris plants. Each individual plant corresponds to one example. The features within each example are the measurements of each part of the plant: the sepal length, sepal width, petal length and petal width. The dataset also records which species each plant belonged to. Three different species are represented in the dataset.

Unsupervised learning algorithms experience a dataset containing many features, then learn useful properties of the structure of this dataset. In the context of deep learning, we usually want to learn the entire probability distribution that generated a dataset, whether explicitly, as in density estimation, or implicitly, for tasks like synthesis or denoising. Some other unsupervised learning algorithms perform other roles, like clustering, which consists of dividing the dataset into clusters of similar examples.

Supervised learning algorithms