Machine Learning: Learning Model

Summary

This document is a presentation on Machine Learning, covering the formal learning model and related concepts such as empirical risk minimization and PAC learning. The presentation was given by Fabio Vandin on October 7, 2024.

Full Transcript


Machine Learning: Learning Model
Fabio Vandin, October 7th, 2024

A Formal Model (Statistical Learning)

We have a learner (us, or the machine), which has access to:

1. Domain set X: the set of all possible objects to make predictions about. A domain point x ∈ X is an instance, usually represented by a vector of features; X is the instance space.
2. Label set Y: the set of possible labels. Often there are two labels, e.g. {−1, +1} or {0, 1}.
3. Training data S = ((x_1, y_1), ..., (x_m, y_m)): a finite sequence of labeled domain points, i.e. pairs in X × Y. This is the learner's input; S is the training set, and each pair in it is a training example.
4. Learner's output h: a prediction rule h : X → Y, also called a predictor, hypothesis, or classifier. A(S) denotes the prediction rule produced by learning algorithm A when the training set S is given to it; sometimes $\hat{f}$ is used instead of h.
5. Data-generation model: instances are generated by some probability distribution and labeled according to a function:
   - D: a probability distribution over X (NOT KNOWN TO THE LEARNER!)
   - labeling function f : X → Y (NOT KNOWN TO THE LEARNER!)
   - the label y_i of instance x_i is y_i = f(x_i), for all i = 1, ..., m
   - each point in the training set S is obtained by first sampling x_i according to D, then labeling it as y_i = f(x_i)
6. Measure of success: the error of a classifier is the probability that it does not predict the correct label on a random data point generated by the distribution D.

Loss

Given a domain subset A ⊂ X, D(A) is the probability of observing a point x ∈ A. Let A be defined by a function π : X → {0, 1}, i.e. A = {x ∈ X : π(x) = 1}. In this case we have $P_{x \sim D}[\pi(x)] = D(A)$.

The error of a prediction rule h : X → Y is

$$L_{D,f}(h) \stackrel{\text{def}}{=} P_{x \sim D}[h(x) \neq f(x)] = D(\{x : h(x) \neq f(x)\})$$

Notes: L_{D,f}(h) goes by many different names: generalization error, true error, risk, loss, ... Often f is clear from context, so it is omitted: L_D(h).

Empirical Risk Minimization

The learner outputs h_S : X → Y. Goal: find an h_S that minimizes the generalization error L_{D,f}(h). But L_{D,f}(h) is unknown! What about considering the error on the training data instead, i.e. outputting an h_S that minimizes the error on the training data? The training error is

$$L_S(h) \stackrel{\text{def}}{=} \frac{|\{i : h(x_i) \neq y_i,\ 1 \le i \le m\}|}{m}$$

Note: the training error is also called empirical error or empirical risk.

Empirical Risk Minimization (ERM): output an h minimizing L_S(h).

What can go wrong with ERM?

Consider our simplified movie ratings prediction problem, with data as follows. Assume D and f are such that:
- an instance x is taken uniformly at random in the larger square (D)
- the label is 1 if x is inside the inner square, 0 otherwise (f)
- the inner square has area 1, the larger square has area 2

Consider the classifier

$$h_S(x) = \begin{cases} y_i & \text{if } \exists i \in \{1, \dots, m\} : x_i = x \\ 0 & \text{otherwise} \end{cases}$$

Is it a good predictor? (A numerical sketch of this example appears at the end of this section.)

Hypothesis Class and ERM

Apply ERM over a restricted set of hypotheses H, the hypothesis class; each h ∈ H is a function h : X → Y. The ERM_H learner outputs

$$\mathrm{ERM}_H(S) \in \operatorname*{arg\,min}_{h \in H} L_S(h)$$

Which hypothesis classes H do not lead to overfitting?

Finite Hypothesis Classes

Assume H is a finite class: |H| < ∞. Let h_S be the output of ERM_H(S), i.e. $h_S \in \operatorname*{arg\,min}_{h \in H} L_S(h)$.

Assumptions:
- Realizability: there exists h* ∈ H such that L_{D,f}(h*) = 0.
- i.i.d.: the examples in the training set are independent and identically distributed (i.i.d.) according to D, that is, S ∼ D^m.

Observation: the realizability assumption implies that L_S(h*) = 0.

Can we learn (i.e., find using ERM) h*?
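Before turning to that question, here is a minimal numerical sketch (not from the lecture) of the earlier "what can go wrong with ERM" example: the memorizing classifier drives the training error to 0 while its true error stays near 1/2. The placement of the inner square, the sample sizes, and all function names are illustrative assumptions.

```python
import random

# Sketch of the two-squares overfitting example: D is uniform over an outer
# square of area 2, f labels x with 1 iff x falls in an inner square of
# area 1 (corner-aligned here, an assumption made for concreteness).
random.seed(0)
SIDE_OUT = 2 ** 0.5  # outer square side, so its area is 2

def sample_x():
    """Draw a domain point x ~ D (uniform over the outer square)."""
    return (random.uniform(0, SIDE_OUT), random.uniform(0, SIDE_OUT))

def f(x):
    """Labeling function: 1 iff x lies in the unit inner square."""
    return 1 if x[0] <= 1.0 and x[1] <= 1.0 else 0

# Training set S = ((x_1, y_1), ..., (x_m, y_m)) with y_i = f(x_i)
m = 100
S = []
for _ in range(m):
    x = sample_x()
    S.append((x, f(x)))

def h_S(x):
    """Memorizing classifier: y_i if x equals some training x_i, else 0."""
    for xi, yi in S:
        if xi == x:
            return yi
    return 0

# Training error L_S(h_S) is 0 by construction ...
L_S = sum(h_S(x) != y for x, y in S) / m
# ... but the true error is ~ 1/2: a fresh continuous sample almost surely
# matches no memorized x_i, so h_S predicts 0 while half the mass has label 1.
test = [sample_x() for _ in range(10_000)]
L_D_est = sum(h_S(x) != f(x) for x in test) / len(test)
print(f"L_S(h_S) = {L_S:.3f}, estimated L_D(h_S) = {L_D_est:.3f}")
```

Restricting ERM to a suitably chosen hypothesis class H, as the slides do next, is exactly what rules out this failure mode.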
(Simplified) PAC Learning

Probably Approximately Correct (PAC) learning. Since the training data comes from D:
- we can only be approximately correct
- we can only be probably correct

Parameters:
- accuracy parameter ε: we are satisfied with a good h_S, i.e. one with L_{D,f}(h_S) ≤ ε
- confidence parameter δ: we want h_S to be a good hypothesis with probability ≥ 1 − δ

Theorem. Let H be a finite hypothesis class. Let δ ∈ (0, 1), ε ∈ (0, 1), and m ∈ ℕ be such that

$$m \ge \frac{\log(|H|/\delta)}{\varepsilon}.$$

Then for any f and any D for which the realizability assumption holds, with probability ≥ 1 − δ (over the choice of S), every ERM hypothesis h_S satisfies L_{D,f}(h_S) ≤ ε. (A numeric sketch of this bound appears at the end of this section.)

Note: log denotes the natural logarithm.

Proof: see the book as well, Corollary 2.3. [The proof slides are not reproduced in this transcript.]

PAC Learning

Definition (PAC learnability). A hypothesis class H is PAC learnable if there exist a function m_H : (0, 1)² → ℕ and a learning algorithm such that for every δ, ε ∈ (0, 1), every distribution D over X, and every labeling function f : X → {0, 1}, if the realizability assumption holds with respect to H, D, f, then when running the learning algorithm on m ≥ m_H(ε, δ) i.i.d. examples generated by D and labeled by f, the algorithm returns a hypothesis h such that, with probability ≥ 1 − δ (over the choice of examples), L_{D,f}(h) ≤ ε.

m_H : (0, 1)² → ℕ is the sample complexity of learning H; m_H is the minimal integer that satisfies the requirements.

Corollary. Every finite hypothesis class H is PAC learnable with sample complexity

$$m_H(\varepsilon, \delta) \le \left\lceil \frac{\log(|H|/\delta)}{\varepsilon} \right\rceil.$$

A More General Learning Model: Remove the Realizability Assumption (Agnostic PAC Learning)

Realizability assumption: there exists h* ∈ H such that L_{D,f}(h*) = 0. Informally: the label is fully determined by the instance x. This is too strong in many applications!

Relaxation: D is a probability distribution over X × Y, i.e. D is the joint distribution over domain points and labels. D can be seen as having two components:
- D_x: the (marginal) distribution over domain points
- D((x, y) | x): the conditional distribution over labels for each domain point

Given x, the label y is obtained according to a conditional probability P[y | x].

The Empirical and True Error

With D a probability distribution over X × Y, the true error (or risk) is

$$L_D(h) \stackrel{\text{def}}{=} P_{(x,y) \sim D}[h(x) \neq y]$$

As before, D is not known to the learner; the learner only knows the training data S. The empirical risk is defined as before:

$$L_S(h) \stackrel{\text{def}}{=} \frac{|\{i : 1 \le i \le m,\ h(x_i) \neq y_i\}|}{m}$$

Note: L_S(h) is the probability that, for a pair (x_i, y_i) taken uniformly at random from S, the event "h(x_i) ≠ y_i" holds.

An Optimal Predictor

Learner's goal: find h : X → Y minimizing L_D(h). Is there a best predictor?

Agnostic PAC Learnability

Consider only predictors from a hypothesis class H. We are going to be ok with not finding the best predictor, as long as we are not too far off.

Definition. A hypothesis class H is agnostic PAC learnable if there exist a function m_H : (0, 1)² → ℕ and a learning algorithm such that for every δ, ε ∈ (0, 1) and every distribution D over X × Y, when running the learning algorithm on m ≥ m_H(ε, δ) i.i.d. examples generated by D, the algorithm returns a hypothesis h such that, with probability ≥ 1 − δ (over the choice of the m training examples),

$$L_D(h) \le \min_{h' \in H} L_D(h') + \varepsilon.$$

Note: this is a generalization of the previous learning model.

A More General Learning Model: Beyond Binary Classification

Binary classification: Y = {0, 1}. Other learning problems:
- multiclass classification: classification with more than 2 labels
- regression: Y = ℝ

Multiclass classification: the formal model is the same as before!
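Returning to the finite-class theorem above, here is a quick sketch of how the sample size bound can be evaluated. The function name and the example numbers are illustrative assumptions, not from the lecture.

```python
import math

def sample_complexity(H_size: int, eps: float, delta: float) -> int:
    """Smallest integer m satisfying m >= log(|H|/delta) / eps (natural log),
    matching the corollary's bound m_H(eps, delta) <= ceil(log(|H|/delta)/eps)."""
    return math.ceil(math.log(H_size / delta) / eps)

# Example: |H| = 2**20 hypotheses, target error eps = 0.01, confidence
# delta = 0.01 (i.e. success probability >= 0.99).
print(sample_complexity(2 ** 20, eps=0.01, delta=0.01))  # -> 1847
```

Note that the bound grows only logarithmically in |H|, which is why even very large finite classes remain learnable from a modest number of examples.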
Regression

Domain set: X is usually ℝ^p for some p. Target set: Y = ℝ. Training data (as before): S = ((x_1, y_1), ..., (x_m, y_m)). Learner's output (as before): h : X → Y. Loss: the previous one does not make much sense here...

(Generalized) Loss Functions

Definition. Given any hypothesis set H and some domain Z, a loss function is any function ℓ : H × Z → ℝ₊.

The risk function is the expected loss of a hypothesis h ∈ H with respect to a distribution D over Z:

$$L_D(h) \stackrel{\text{def}}{=} E_{z \sim D}[\ell(h, z)]$$

The empirical risk is the average loss over a given sample S = (z_1, ..., z_m) ∈ Z^m:

$$L_S(h) \stackrel{\text{def}}{=} \frac{1}{m} \sum_{i=1}^{m} \ell(h, z_i)$$

Some Common Loss Functions

0-1 loss, with Z = X × Y:

$$\ell_{0\text{-}1}(h, (x, y)) \stackrel{\text{def}}{=} \begin{cases} 0 & \text{if } h(x) = y \\ 1 & \text{if } h(x) \neq y \end{cases}$$

Commonly used in binary or multiclass classification.

Squared loss, with Z = X × Y:

$$\ell_{\mathrm{sq}}(h, (x, y)) \stackrel{\text{def}}{=} (h(x) - y)^2$$

Commonly used in regression. (A code sketch of these definitions appears at the end of this transcript.)

Note: in general, the loss function may depend on the application! But computational considerations play a role...

How to Choose the Loss Function?

[Discussion slides not reproduced in this transcript.]

Agnostic PAC Learnability for General Loss Functions

Definition. A hypothesis class H is agnostic PAC learnable with respect to a set Z and a loss function ℓ : H × Z → ℝ₊ if there exist a function m_H : (0, 1)² → ℕ and a learning algorithm such that for every δ, ε ∈ (0, 1) and every distribution D over Z, when running the learning algorithm on m ≥ m_H(ε, δ) i.i.d. examples generated by D, the algorithm returns a hypothesis h such that, with probability ≥ 1 − δ (over the choice of the m training examples),

$$L_D(h) \le \min_{h' \in H} L_D(h') + \varepsilon,$$

where $L_D(h) = E_{z \sim D}[\ell(h, z)]$.

Leslie Valiant, Turing Award 2010

"For transformative contributions to the theory of computation, including the theory of probably approximately correct (PAC) learning, the complexity of enumeration and of algebraic computation, and the theory of parallel and distributed computing."

Bibliography

Up to now: [UML] Chapters 2 and 3.
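To close, a minimal sketch making the generalized loss definitions above concrete. The function names (zero_one_loss, squared_loss, empirical_risk) and the toy sample are illustrative assumptions, not from the slides.

```python
def zero_one_loss(h, z):
    """l_{0-1}(h, (x, y)): 0 if h(x) = y, else 1 (classification)."""
    x, y = z
    return 0 if h(x) == y else 1

def squared_loss(h, z):
    """l_sq(h, (x, y)) = (h(x) - y)**2 (regression)."""
    x, y = z
    return (h(x) - y) ** 2

def empirical_risk(h, S, loss):
    """L_S(h) = (1/m) * sum_i loss(h, z_i) over the sample S."""
    return sum(loss(h, z) for z in S) / len(S)

# Usage: empirical squared-loss risk of a constant predictor on a toy sample.
S = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.1)]  # pairs z_i = (x_i, y_i)
h = lambda x: 2.0
print(empirical_risk(h, S, squared_loss))  # (0.8**2 + 0.1**2 + 1.1**2)/3 ~ 0.62
```

The same empirical_risk helper works for both losses, mirroring how the generalized definition covers classification and regression with a single formula.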
