Introduction to Machine Learning
INTRODUCTION - A BRIEF INTRODUCTION TO MACHINE LEARNING

1. Machine learning vs Artificial intelligence
Machine learning: A machine learning algorithm is an algorithm that is able to learn from data.
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." Mitchell, 1997
Artificial intelligence: "The theory and development of computer systems able to perform tasks normally requiring human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages." Oxford Dictionary

2. Task (T)
The task is the problem that we solve by learning, and learning is our means of attaining the ability to perform the task.
○ Thus, the process of learning is NOT the task.
Machine learning enables us to tackle tasks that are too difficult to solve with fixed programs written and designed by human beings.
E.g. We can write a program for a robot to walk, OR we can write a ML algorithm for the robot to learn to do it.
Machine learning tasks are usually described in terms of how the system should process an example.
An example is a collection of features that have been quantitatively measured from some object or event that we want the machine learning system to process.
We typically represent an example as a vector x ∈ ℝ^n, where each entry xi of the vector is a feature.
○ E.g. the features of an image are usually the values of the pixels in the image.

2.1 Classification
Goal → specify which of k categories some input belongs to.
○ To solve this task, the learning algorithm is usually asked to produce a function f: ℝ^n → {1, …, k} (see the sketch after this list of tasks).
○ When y = f(x), the model assigns an input described by vector x to a category identified by a numeric code y.
○ There are other variants of the classification task, for example, where f outputs a probability distribution over classes (e.g. when we use a softmax layer as the output of our DL model).

2.1.1 Classification with missing inputs
Classification becomes more challenging if the computer program is not guaranteed that every measurement in its input vector will always be provided.
When some of the inputs may be missing, rather than providing a single classification function, the learning algorithm must learn a set of functions.
○ Each function corresponds to classifying x with a different subset of its inputs missing.
This classification task is typical in medical diagnosis.

2.2 Regression
Goal → predict a numerical value given some input.
○ The function produced for this task has type f: ℝ^n → ℝ.
○ This type of task is similar to classification, except that the format of the output is different.

2.3 Cluster analysis
Goal → predict a group for some given input.
○ The function produced for this task has type f: ℝ^n → {1, …, k}.
○ This type of task is similar to classification, except that the examples are not labeled.

2.4 Transcription
Goal → Observe a relatively unstructured representation of some kind of data and transcribe the information into discrete textual form.

2.5 Machine translation
Goal → Convert a sequence of symbols in some language into a sequence of symbols in another language.
○ The most common application is in Natural Language Processing (NLP).

2.6 Structured output
Involves any task whose output is a vector or other data structure with important relationships between the different elements.

2.7 Anomaly detection
Goal → Detect or flag a set of events or objects that are unusual or atypical with respect to the global set of events or objects.

2.8 Synthesis and sampling
Goal → Generate new examples that are similar to those in the training data.
○ Synthesis and sampling via machine learning can be useful for media applications when generating large volumes of content by hand would be expensive or require too much time.

2.9 Imputation of missing values
Goal → Given an example with some values missing, the computer must provide a prediction of the missing values.

2.10 Denoising
Goal → Correct a corrupted example.
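To make the shapes of the two most common task types concrete, here is a minimal Python sketch (not from the notes; the dimensions n and k and the weights are illustrative placeholders that a learning algorithm would normally fit from data):

```python
import numpy as np

# Illustrative dimensions and weights (assumptions): n features, k classes.
n, k = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(k, n))  # per-class score weights; a learner would fit these
w = rng.normal(size=n)       # regression weights; likewise learned, not chosen

def classify(x):
    """Classification: f maps R^n to {1, ..., k}, here via argmax of linear scores."""
    return int(np.argmax(W @ x)) + 1  # category as a numeric code y in {1, ..., k}

def classify_proba(x):
    """Classification variant: f outputs a probability distribution over classes."""
    s = W @ x
    e = np.exp(s - s.max())  # softmax; subtracting the max avoids overflow
    return e / e.sum()

def regress(x):
    """Regression: f maps R^n to R."""
    return float(w @ x)

x = rng.normal(size=n)  # one example: a vector of n features
print(classify(x), classify_proba(x).round(2), regress(x))
```

Note how classification and regression differ only in the codomain of f; cluster analysis would reuse the classification signature over unlabeled examples.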
3. Performance measure (P)
To evaluate the abilities of a machine learning algorithm, we must design a quantitative measure of its performance.
Usually this performance measure P is specific to the task T being carried out by the system.
For tasks such as classification, classification with missing inputs, and transcription, we often measure the accuracy of the model (see the sketch after this section).
○ Accuracy can be a dangerous measurement.
○ Please, evaluate your models with more robust metrics: macro f-measure, AUC, g-mean, statistical tests…
We can also obtain equivalent information by measuring the error rate, the proportion of examples for which the model produces an incorrect output.
○ We often refer to the error rate as the expected 0-1 loss.
○ The 0-1 loss on a particular example is 0 if it is correctly classified and 1 if it is not.
Usually, we are interested in how well the machine learning algorithm performs on data that it has not seen before, since this determines how well it will work when deployed in the real world.
We evaluate these performance measures using a test set of data that is separate from the data used for training the machine learning system.
○ It is very important to avoid data bleeding. Data bleed occurs when the test set is also used to train the model.
The choice of performance measure may seem straightforward and objective, but it is often difficult to choose a performance measure that corresponds well to the desired behavior of the system.
○ E.g. When performing a transcription task, should we measure the accuracy of the system at transcribing entire sequences, or should we use a more fine-grained performance measure that gives partial credit for getting some elements of the sequence correct?
○ When performing a regression task, should we penalize the system more if it frequently makes medium-sized mistakes or if it rarely makes very large mistakes?
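As a minimal illustration of these metrics, the sketch below computes the per-example 0-1 loss, the error rate, and the accuracy on a held-out test set; the arrays are made-up stand-ins for real model predictions and labels:

```python
import numpy as np

# Hypothetical model outputs and ground-truth labels on a held-out test set.
y_true = np.array([1, 2, 2, 3, 1, 3, 2])
y_pred = np.array([1, 2, 3, 3, 1, 2, 2])

zero_one_loss = (y_pred != y_true).astype(int)  # 0 if correct, 1 if not
error_rate = zero_one_loss.mean()               # expected 0-1 loss on this set
accuracy = 1.0 - error_rate

print(f"accuracy={accuracy:.3f}, error rate={error_rate:.3f}")
# Accuracy alone can mislead on imbalanced data: a classifier that always
# predicts the majority class of a 95/5 dataset scores 0.95 accuracy while
# learning nothing, hence the advice to also report macro F-measure, AUC, etc.
```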
4. Experience (E)
Machine learning algorithms can be broadly categorized as unsupervised or supervised by what kind of experience they are allowed to have during the learning process.
This will depend on the dataset.
○ A collection of examples, sometimes also called data points.

4.1 Unsupervised learning
Unsupervised learning algorithms experience a dataset containing many features, then learn useful properties of the structure of this dataset.
○ The examples are not labeled.
Some unsupervised learning algorithms perform roles like clustering, which consists of dividing the dataset into clusters of similar examples.

4.2 Supervised learning
Supervised learning algorithms experience a dataset containing features, but each example is also associated with a label or target.
○ E.g. an eye infection dataset containing either bacteriological or fungal infections.
Unsupervised learning involves observing several examples of a random vector x and attempting to implicitly or explicitly learn the probability distribution p(x), or some interesting properties of that distribution.
Supervised learning involves observing several examples of a random vector x and an associated value or vector y, then learning to predict y from x, usually by estimating p(y|x).

(4. Experience (E))
Some machine learning algorithms do not just experience a fixed dataset. For example, reinforcement learning algorithms interact with an environment, so there is a feedback loop between the learning system and its experiences.
Just as there is no formal definition of supervised and unsupervised learning, there is no rigid taxonomy of datasets or experiences. For example, in deep learning, self-supervised learning is also used.
○ The basic idea is to automatically generate some kind of supervisory signal to solve some task (typically, to learn representations of the data or to automatically label a dataset).
○ E.g. Autoencoders

5. Capacity, overfitting & underfitting
The central challenge in machine learning is that our algorithm must perform well on new, previously unseen inputs. The ability to perform well on previously unobserved inputs is called generalization.
When training a machine learning model, we have access to a training set.
○ We can compute some error measure on the training set, called the training error, and we reduce this training error.
What separates machine learning from pure optimization is that we also care about the generalization error, also called the test error. Our objective is to minimize the generalization error or test error.
We typically estimate the generalization error of a machine learning model by measuring its performance on a test set of examples that were collected separately from the training set.
○ Obviously, if the training set and the test set are collected arbitrarily, there is little we can do.
The training and test data are generated by a probability distribution over datasets called the data-generating process. We typically make a set of assumptions known collectively as the i.i.d. assumptions:
○ The examples in each dataset are independent from each other.
○ The training set and test set are identically distributed, drawn from the same probability distribution as each other.
We call that shared underlying distribution the data-generating distribution.
When we use a machine learning algorithm, we do not fix the parameters ahead of time and then sample both datasets.
We sample the training set, then use it to choose the parameters to reduce training set error, then sample the test set.
Under this process, the expected test error is greater than or equal to the expected value of the training error.
The factors determining how well a machine learning algorithm will perform are its ability to:
1. Make the training error small.
2. Make the gap between training and test error small.
These two factors correspond to the two central challenges in machine learning: underfitting and overfitting.
○ Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set. The model is unable to learn.
○ Overfitting occurs when the gap between the training error and test error is too large. The model is unable to generalize.
We can control whether a model is more likely to overfit or underfit by altering its capacity. A model's capacity is its ability to fit a wide variety of functions.
Models with low capacity may struggle to fit the training set.
Models with high capacity can overfit by memorizing properties of the training set that do not serve them well on the test set.
In deep learning, for example, this is controlled by the number of layers and the size of each layer.
One way to control the capacity of a learning algorithm is by choosing its hypothesis space, the set of functions that the learning algorithm is allowed to select as being the solution (the sketch below illustrates this with polynomials of different degrees).
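A classic way to see underfitting, overfitting, and the train/test gap at once is to vary the capacity of a polynomial fit; the sketch below is a hypothetical illustration (the sine-plus-noise data-generating process and the chosen degrees are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating process: a sine curve plus noise (i.i.d. samples).
def sample(m):
    x = rng.uniform(0, 1, m)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.2, m)

x_train, y_train = sample(15)
x_test, y_test = sample(200)  # separate test set, same data-generating distribution

for degree in (1, 3, 9):  # low, adequate, and high capacity
    coeffs = np.polyfit(x_train, y_train, degree)  # fit on the training set only
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
# Expected pattern: degree 1 underfits (both errors high), degree 9 overfits
# (training error near zero, large gap to test error), degree 3 sits in between.
```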
5.1 The No Free Lunch Theorem
The no free lunch theorem for machine learning states that:
○ Averaged over all possible data-generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points.
○ What does this mean? No machine learning algorithm is universally any better than any other.
There is no machine learning algorithm that will be the best solution for all problems.

5.2 Regularization
Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.
It is one of the main concerns of machine learning, rivaled only by optimization.
Different algorithms have different regularization options.
○ In deep learning a commonly used one is dropout.

6. Hyperparameters and Validation Sets
Most machine learning algorithms have hyperparameters → settings that we can use to control the algorithm's behavior.
The values for these hyperparameters are not learned automatically; they are set manually.
We need to try different values and explore the possibilities to find the best configuration.
Problem: the test examples must not be used in any way to make choices about the model, including its hyperparameters.
○ Be very strict with this.
Validation set: constructed from the training data, but these examples are not seen by the algorithm during training.
Sets used:
○ Training set: used to learn the parameters of the algorithm.
○ Validation set: used to test the generalization when adjusting the hyperparameters.
○ Test set: used to test the generalization of the final configuration.
We first split the data into the training and test sets. Then we split the training data into the training and validation sets (see the sketch after this section).
Since the validation set is used to "train" the hyperparameters, the validation set error will underestimate the generalization error, though typically by a smaller amount than the training error does.
After all hyperparameter optimization is complete, the generalization error may be estimated using the test set.
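A minimal sketch of this protocol, using ridge regression as a hypothetical stand-in model with its regularization strength lambda as the hyperparameter (the dataset and candidate values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up regression dataset (the true weights and noise level are arbitrary).
n, d = 300, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(0, 0.5, n)

def ridge_fit(X, y, lam):
    """Ridge regression: w = (X^T X + lam*I)^-1 X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

# First split: a held-out test set, untouched until the very end.
idx = rng.permutation(n)
test, train = idx[:60], idx[60:]
# Second split: carve a validation set out of the remaining training data.
val, fit = train[:60], train[60:]

best_lam, best_val = None, np.inf
for lam in (0.01, 0.1, 1.0, 10.0):       # candidate hyperparameter values
    w = ridge_fit(X[fit], y[fit], lam)   # parameters: learned on the training split
    v = mse(w, X[val], y[val])           # hyperparameter: judged on the validation split
    if v < best_val:
        best_lam, best_val = lam, v

w = ridge_fit(X[train], y[train], best_lam)  # refit with the chosen hyperparameter
print(f"chosen lambda={best_lam}, test MSE={mse(w, X[test], y[test]):.3f}")  # final estimate
```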
7. The curse of dimensionality
Many machine learning problems become exceedingly difficult when the number of dimensions in the data is high. This phenomenon is known as the curse of dimensionality.
Of particular concern is that the number of possible distinct configurations of a set of variables increases exponentially as the number of variables increases.
Problems: Data gets spread out. More data is needed. Harder to compute.

INTRODUCTION - THE MACHINE LEARNING PROCESS

1. Choosing the Training Experience

1.1 First design choice
The first design choice we face is to choose the type of training experience from which our system will learn.
○ The type of training experience available can have a significant impact on the success or failure of the learner.
○ Not all experience types are adequate for all problems.
One key attribute is whether the training experience provides direct or indirect feedback regarding the choices made by the performance system.
In our example, the system might learn from direct training examples consisting of individual checkers board states and the correct move for each.
Alternatively, it might have available only indirect information consisting of the move sequences and final outcomes of various games played.
○ In this latter case, information about the correctness of specific moves early in the game must be inferred indirectly from the fact that the game was eventually won or lost.
○ Here the learner faces an additional problem of credit assignment, or determining the degree to which each move in the sequence deserves credit or blame for the final outcome.
○ Credit assignment can be a particularly difficult problem because the game can be lost even when early moves are optimal, if these are followed later by poor moves.
For this reason, learning from direct training feedback is typically easier than learning from indirect feedback.

1.2 Second design choice
A second important attribute of the training experience is the degree to which the learner controls the sequence of training examples.
○ The learner might rely on the teacher to select informative board states and to provide the correct move for each → Supervised Learning.
○ The learner might itself propose board states that it finds particularly confusing and ask the teacher for the correct move → Active Learning.
○ The learner may have complete control over both the board states and (indirect) training classifications, as it does when it learns by playing against itself with no teacher present → Reinforcement Learning.
Notice that in this last case the learner may choose between experimenting with novel board states that it has not yet considered, or honing its skill by playing minor variations of lines of play it currently finds most promising → Exploration vs Exploitation in Reinforcement Learning.

1.3 Third design choice
A third important attribute of the training experience is how well it represents the distribution of examples over which the final system performance P must be measured.
○ In general, learning is most reliable when the training examples follow a distribution similar to that of future test examples.
Machine learning models are learning a distribution from the examples; if they do not see examples for one class, they will not be able to identify new instances of that class.
○ This is not completely true, however → one-shot and zero-shot learning.
In our example, the performance metric P is the percent of games the system wins in the world tournament.
If its training experience E consists only of games played against itself, there is an obvious danger that this training experience might not be fully representative of the distribution of situations over which it will later be tested.
○ For example, the learner might never encounter certain crucial board states that are very likely to be played by the human checkers champion.
In practice, it is often necessary to learn from a distribution of examples that is somewhat different from those on which the final system will be evaluated (e.g., the world checkers champion might not be interested in teaching the program!).
○ Training vs Validation vs Test Sets.
Such situations are problematic because mastery of one distribution of examples will not necessarily lead to strong performance over some other distribution.
We shall see that most current theory of machine learning rests on the crucial assumption that the distribution of training examples is identical to the distribution of test examples → i.i.d. assumptions.
Despite our need to make this assumption in order to obtain theoretical results, it is important to keep in mind that this assumption must often be violated in practice.
1.4 Putting it all together
We decide that our system will train by playing games against itself.
This has the advantage that no external trainer need be present, and it therefore allows the system to generate as much training data as time permits.

2. Choosing the Target Function
The next design choice is to determine exactly what type of knowledge will be learned and how this will be used by the performance program.
We are going to assume that we have a checkers-playing program that can generate the legal moves from any board state.
The program needs only to learn how to choose the best move from among these legal moves.
This learning task is representative of a large class of tasks for which the legal moves that define some large search space are known a priori, but for which the best search strategy is not known.
Many optimization problems fall into this class, such as the problems of scheduling and controlling manufacturing processes where the available manufacturing steps are well understood, but the best strategy for sequencing them is not.
Given this setting where we must learn to choose among the legal moves, the most obvious choice for the type of information to be learned is a program, or function, that chooses the best move for any given board state.
Let us call it ChooseMove and use the notation ChooseMove: B → M.
○ This function accepts as input any board from the set of legal board states B.
○ This function produces as output some move from the set of legal moves M.
Throughout our discussion of machine learning we will find it useful to reduce the problem of improving performance P at task T to the problem of learning some particular target function such as ChooseMove.
The choice of the target function will therefore be a key design choice.
Although ChooseMove is an obvious choice for the target function in our example, this function will turn out to be very difficult to learn given the kind of indirect training experience available to our system.
An alternative target function, and one that will turn out to be easier to learn in this setting, is an evaluation function that assigns a numerical score to any given board state.
Let us call this target function V and again use the notation V: B → ℝ to denote that V maps any legal board state from the set B to some real value (we use ℝ to denote the set of real numbers).
We intend for this target function V to assign higher scores to better board states.
If the system can successfully learn such a target function V, then it can easily use it to select the best move from any current board position. This can be accomplished by generating the successor board state produced by every legal move, then using V to choose the best successor state and therefore the best legal move.
Let us therefore define the target value V(b) for an arbitrary board state b in B, as follows:
1. if b is a final board state that is won, then V(b) = 100.
2. if b is a final board state that is lost, then V(b) = -100.
3. if b is a final board state that is drawn, then V(b) = 0.
4. if b is not a final state in the game, then V(b) = V(b'), where b' is the best final board state that can be achieved starting from b and playing optimally until the end of the game.
While this recursive definition specifies a value of V(b) for every board state b, this definition is not usable by our checkers player because it is not efficiently computable.
○ In case 4, determining the value of V(b) for a particular board state requires searching ahead for the optimal line of play, all the way to the end of the game.
Because this definition is not efficiently computable by our checkers-playing program, we say that it is a non-operational definition.
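To see concretely why case 4 is the problem, here is a sketch of the recursion on a tiny made-up game tree standing in for checkers; the tree, the state names, and the outcome labels are all toy assumptions, and only the exhaustive look-ahead is the point:

```python
# A toy game tree (entirely made up): each non-final "board" maps to its
# successor states; leaves are labeled with their outcome for our player.
TREE = {
    "b0": ["b1", "b2"],
    "b1": ["won", "drawn"],
    "b2": ["lost", "won"],
}
OUTCOMES = {"won": 100, "lost": -100, "drawn": 0}

def V(b, our_turn=True):
    """Ideal target value V(b) under the recursive definition above.

    Non-operational in general: case 4 searches every line of play to the end
    of the game, so its cost grows exponentially with the remaining depth.
    """
    if b in OUTCOMES:                       # cases 1-3: final board states
        return OUTCOMES[b]
    # Case 4: value under optimal play by BOTH sides, i.e. a minimax search.
    values = [V(s, not our_turn) for s in TREE[b]]
    return max(values) if our_turn else min(values)

print(V("b0"))  # 0: our best move leads to b1, where the opponent forces a draw
```

On a real checkers board the branching factor and game length make this recursion hopeless, which is exactly why an operational approximation is needed.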
The goal of learning in this case is to discover an operational description of V; that is, a description that can be used by the checkers-playing program to evaluate states and select moves within realistic time bounds.
Thus, we have reduced the learning task in this case to the problem of discovering an operational description of the ideal target function V.
It may be very difficult in general to learn such an operational form of V perfectly.
We often expect learning algorithms to acquire only some approximation to the target function, and for this reason the process of learning the target function is often called function approximation.
In the current discussion we will use the symbol V̂ to refer to the function that is actually learned by our program, to distinguish it from the ideal target function V.

3. Choosing a Representation of the Target Function
Now that we have specified the ideal target function V, we must choose a representation that the learning program will use to describe the function V̂ that it will learn.
As with earlier design choices, we again have many options:
1. We could allow the program to represent V̂ using a large table with a distinct entry specifying the value for each distinct board state.
2. We could allow it to represent V̂ using a collection of rules that match against features of the board state.
3. We could use a quadratic polynomial function of predefined board features.
4. We could use an artificial neural network.
In general, this choice of representation involves a crucial tradeoff.
○ On one hand, we wish to pick a very expressive representation to allow representing as close an approximation as possible to the ideal target function V.
○ On the other hand, the more expressive the representation, the more training data the program will require in order to choose among the alternative hypotheses it can represent.
○ Remember our discussion on capacity in the previous unit.
We are just going to choose one for this example: for any given board state, the function V̂ will be calculated as a linear combination of the following board features:
○ x1: the number of black pieces on the board.
○ x2: the number of red pieces on the board.
○ x3: the number of black queens on the board.
○ x4: the number of red queens on the board.
○ x5: the number of black pieces threatened by red (i.e., which can be captured on red's next turn).
○ x6: the number of red pieces threatened by black.
Our learning program will represent V̂(b) as a linear function like:
V̂(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6
Where w0 through w6 are numerical coefficients, or weights, to be chosen by the learning algorithm.
Learned values for the weights w1 through w6 will determine the relative importance of the various board features in determining the value of the board.
The weight w0 will provide an additive constant to the board value.

4. Choosing a Function Approximation Algorithm
To learn the target function V̂ we require a set of training examples.
1. These examples describe a specific board state b and the training value Vtrain(b) for b.
2. Each training example is an ordered pair (b, Vtrain(b)).
In our checkers model, the following would be an example where black has won the game (x2 = 0 because there are no more red pieces), and in which the value of the target function is Vtrain(b) = +100:
((x1 = 3, x2 = 0, x3 = 0, x4 = 0, x5 = 0, x6 = 0), +100)
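A minimal sketch of this representation and training-example format in Python, assuming a board is summarized directly by its feature vector (x1, …, x6); the initial weights are placeholders for the learning algorithm to adjust:

```python
import numpy as np

# Weights w0..w6; the zeros here are arbitrary starting values that the
# learning algorithm of Section 4 will refine from training examples.
w = np.zeros(7)

def v_hat(x):
    """Linear evaluation function: V̂(b) = w0 + w1*x1 + ... + w6*x6,
    where x holds the six board features described above."""
    return float(w[0] + w[1:] @ x)

# The training example from the text: black has won (x2 = 0, no red pieces
# remain), so Vtrain(b) = +100.
b_features = np.array([3.0, 0, 0, 0, 0, 0])
example = (b_features, +100)
print(v_hat(b_features))  # 0.0 until the weights are learned
```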
4.1 Estimating Training Values
In our problem, we are going to use an approach that has been shown to work quite well in similar problems.
○ In most cases, we do not need to reinvent the wheel; studying the literature for similar problems is usually how you would approach the design.
The approach is to assign the training value Vtrain(b) for any intermediate board state b to be V̂(Successor(b)), where V̂ is the learner's current approximation to V and where Successor(b) denotes the next board state following b for which it is again the program's turn to move.
○ I.e. the board state following the program's move and the opponent's response.
This rule for estimating training values can be summarized as:
Vtrain(b) ← V̂(Successor(b))
It probably seems a bit strange that we are using the current version of V̂ to estimate the training values that will be used to further refine the same function.
○ We are learning the proper values of the weights.
○ With each example, we are looking for the values of the weights that will result in our model winning the match.
Notice that we are using estimates of the values of Successor(b) to estimate the value of board state b. Intuitively, we can see this will make sense if V̂ tends to be more accurate for board states closer to the game's end.
○ Under certain conditions, the approach of iteratively estimating training values based on estimates of successor state values can be proven to converge toward perfect estimates of Vtrain.
○ See chapter 13 of [Mitchell, 1997] for a more detailed discussion on how this is done in Reinforcement Learning.

4.2 Adjusting the Weights
The only thing that we need now to finish our design is to select the algorithm that we will use to choose the weights wi to best fit the set of training examples {(b, Vtrain(b))}.
As a first step we must define what we mean by the best fit to the training data.
○ One common approach is to define the best hypothesis, or set of weights, as that which minimizes the squared error E between the training values and the values predicted by the hypothesis V̂:
E = Σ (Vtrain(b) - V̂(b))², where the sum runs over the observed training examples (b, Vtrain(b)).
We seek the weights, or equivalently the V̂, that minimize E for the observed training examples.
○ Chapter 6 in Mitchell, 1997 discusses settings in which minimizing the sum of squared errors is equivalent to finding the most probable hypothesis given the observed training data.
Several algorithms are known for finding weights of a linear function that minimize E defined in this way.
○ In our case, we require an algorithm that will incrementally refine the weights as new training examples become available and that will be robust to errors in these estimated training values.
One such algorithm is called the least mean squares, or LMS training rule.
○ For each observed training example, it adjusts the weights a small amount in the direction that reduces the error on this training example.
○ As discussed in Chapter 4 in Mitchell, 1997, this algorithm can be viewed as performing a stochastic gradient-descent search through the space of possible hypotheses (weight values) to minimize the squared error E.
LMS weight update rule:
○ For each training example (b, Vtrain(b)):
○ Use the current weights to calculate V̂(b).
○ For each weight wi, update it as:
wi ← wi + η (Vtrain(b) - V̂(b)) xi
○ η (the Greek letter eta) is a small constant (e.g., 0.1) that moderates the size of the weight update → Why do we need this?
○ xi is the feature value.
Notice that if the value of some feature xi is zero, then its weight is not altered regardless of the error, so that the only weights updated are those whose features actually occur in the training example board.
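Below is a minimal sketch of the LMS rule, reusing the linear V̂ from the previous sketch; folding the constant term w0 in as a weight on a fixed feature x0 = 1 is a convention assumed here, and the training examples are made up:

```python
import numpy as np

ETA = 0.1        # learning rate eta: keeps each update small, so noisy Vtrain
                 # estimates nudge the weights instead of overwriting them
w = np.zeros(7)  # w0..w6, starting from an arbitrary initial hypothesis

def v_hat(x):
    """V̂(b) = w0 + w1*x1 + ... + w6*x6, with the bias folded in as x0 = 1."""
    return float(w @ np.concatenate(([1.0], x)))

def lms_update(x, v_train):
    """One LMS step: wi <- wi + eta * (Vtrain(b) - V̂(b)) * xi."""
    global w
    error = v_train - v_hat(x)
    # Features with xi = 0 leave their weight wi unchanged, as noted above.
    w += ETA * error * np.concatenate(([1.0], x))

# Made-up training examples (feature vector, Vtrain(b)):
examples = [
    (np.array([3.0, 0, 0, 0, 0, 0]), +100),  # black has won
    (np.array([0.0, 2, 0, 1, 0, 0]), -100),  # black has lost
]
for x, v in examples * 50:  # several passes over the data
    lms_update(x, v)
print(np.round(w, 2))
```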
To get an intuitive understanding of why this weight update rule works, notice that when the error (Vtrain(b) - V̂(b)) is zero, no weights are changed.
When (Vtrain(b) - V̂(b)) is positive (i.e., when V̂(b) is too low), then each weight is increased in proportion to the value of its corresponding feature. This will raise the value of V̂(b), reducing the error.

5. The final design
Now we are going to put it all together.
The final design of our checkers learning system can be naturally described by four distinct program modules that represent the central components in many learning systems.

5.1 Performance system
The Performance System is the module that must solve the given performance task, in this case playing checkers, by using the learned target function(s).
It takes an instance of a new problem (new game) as input and produces a trace of its solution (game history) as output.
In our example, the strategy used by the Performance System to select its next move at each step is determined by the learned V̂ evaluation function. Therefore, we expect its performance to improve as this evaluation function becomes increasingly accurate.

5.2 Critic
The Critic takes as input the history or trace of the game and produces as output a set of training examples of the target function.
○ Each training example in this case corresponds to some game state in the trace, along with an estimate Vtrain of the target function value for this example.
In our example, the Critic corresponds to the training rule given by the equation Vtrain(b) ← V̂(Successor(b)).

5.3 Generalizer
The Generalizer takes as input the training examples and produces an output hypothesis that is its estimate of the target function.
It generalizes from the specific training examples, hypothesizing a general function that covers these examples and other cases beyond the training examples.
In our example, the Generalizer corresponds to the LMS algorithm, and the output hypothesis is the function V̂ described by the learned weights wi.

5.4 Experiment Generator
The Experiment Generator takes as input the current hypothesis (currently learned function) and outputs a new problem (i.e., initial board state) for the Performance System to explore.
Its role is to pick new practice problems that will maximize the learning rate of the overall system.
In our example, the Experiment Generator follows a very simple strategy: it always proposes the same initial game board to begin a new game.
○ More sophisticated strategies could involve creating board positions designed to explore particular regions of the state space.

(5. The final design)
With our choices, we have ended up with a design somewhat similar to the Actor-Critic model employed in Reinforcement Learning. A sketch of how the four modules fit together appears at the end of this unit.
There are multiple ways to solve the same problem; we will learn how to use different metrics to evaluate which of the possible solutions is the best one.
Remember, there is no single algorithm or model that is the best solution for all problems (No free lunch theorem).

6. Designing a learning system: Steps
1. Data preprocessing
2. Choosing the Training Experience
3. Choosing the Target Function
4. Choosing a Representation of the Target Function
5. Choosing a Function Approximation Algorithm
6. The Final Design
7. Model deployment (MLOps)
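To show how the four modules fit together (as referenced in section 5), here is a hypothetical end-to-end sketch; the checkers engine is replaced by toy stubs (random feature perturbations and a made-up outcome rule), so only the flow between Performance System, Critic, Generalizer, and Experiment Generator is meaningful:

```python
import numpy as np

rng = np.random.default_rng(0)
ETA = 0.001      # a small learning rate keeps the LMS updates stable here
w = np.zeros(7)  # weights w0..w6 of the learned evaluation function V̂

def v_hat(x):
    return float(w @ np.concatenate(([1.0], x)))

# --- Toy stand-ins for a real checkers engine (entirely hypothetical) -------
def initial_board():
    return rng.integers(0, 4, 6).astype(float)   # fake feature vector x1..x6

def successors(x):
    # Fake "legal moves": bounded random perturbations of the features.
    return [np.clip(x + rng.integers(-1, 2, 6), 0, 4) for _ in range(3)]

def final_value(x):
    # Toy outcome rule: more black pieces than red means black (us) won.
    return 100.0 if x[0] > x[1] else (-100.0 if x[0] < x[1] else 0.0)

# --- The four modules --------------------------------------------------------
def performance_system(board):
    """Plays one game greedily with V̂ and returns the trace of board states."""
    trace = [board]
    for _ in range(10):                            # toy games last 10 plies
        board = max(successors(board), key=v_hat)  # pick the best-scoring move
        trace.append(board)
    return trace

def critic(trace):
    """Turns a trace into training examples via Vtrain(b) <- V̂(Successor(b)).
    (In a real two-player game, Successor(b) would skip the opponent's reply.)"""
    examples = [(trace[i], v_hat(trace[i + 1])) for i in range(len(trace) - 1)]
    examples.append((trace[-1], final_value(trace[-1])))  # terminal: true value
    return examples

def generalizer(examples):
    """Refines the hypothesis V̂ by applying the LMS rule to each example."""
    global w
    for x, v_train in examples:
        w += ETA * (v_train - v_hat(x)) * np.concatenate(([1.0], x))

def experiment_generator():
    """Proposes the next practice problem: here, always a fresh initial board."""
    return initial_board()

for _ in range(100):  # the overall learning loop
    generalizer(critic(performance_system(experiment_generator())))
print(np.round(w, 2))  # the learned weights after 100 self-play games
```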