Document Details


Uploaded by ConsiderateChrysoberyl

California Institute of Technology

Yaser S. Abu-Mostafa

Tags

VC dimension, machine learning, learning from data, statistics

Summary

These lecture notes cover the VC dimension, a crucial concept in machine learning. They give its definition, compute it for perceptrons, interpret it as effective degrees of freedom, and use it to derive a generalization bound. The notes also review the growth function and the VC inequality.

Full Transcript


Review of Lecture 6

The VC inequality: if H has a break point k, then m_H(N) is polynomial.

Table bounding the growth function when k is a break point (rows N, columns k):

    N\k |  1   2   3    4   5   6
    ----+-------------------------
     1  |  1   2   2    2   2   2
     2  |  1   3   4    4   4   4
     3  |  1   4   7    8   8   8
     4  |  1   5  11  ...
     5  |  1   6  ...
     6  |  1   7  ...

(Figure: the space of data sets under (a) the Hoeffding inequality, (b) the union bound, and (c) the VC bound.)

From the union bound to the VC bound:

P[|E_in(g) - E_out(g)| > \epsilon] \le 2M e^{-2\epsilon^2 N}
P[|E_in(g) - E_out(g)| > \epsilon] \le 4\, m_H(2N)\, e^{-\frac{1}{8}\epsilon^2 N}

with m_H(N) \le \sum_{i=0}^{k-1} \binom{N}{i}, whose maximum power is N^{k-1}.

Learning From Data
Yaser S. Abu-Mostafa
California Institute of Technology

Lecture 7: The VC Dimension

Sponsored by Caltech's Provost Office, E&AS Division, and IST
Tuesday, April 24, 2012

AML Creator: Yaser Abu-Mostafa - LFD Lecture 7

Outline
- The definition
- VC dimension of perceptrons
- Interpreting the VC dimension
- Generalization bounds

Definition of VC dimension

The VC dimension of a hypothesis set H, denoted by d_vc(H), is the largest value of N for which m_H(N) = 2^N, i.e., the most points H can shatter.

- N \le d_vc(H) \Longrightarrow H can shatter N points
- k > d_vc(H) \Longrightarrow k is a break point for H

The growth function

In terms of a break point k: m_H(N) \le \sum_{i=0}^{k-1} \binom{N}{i}

In terms of the VC dimension d_vc: m_H(N) \le \sum_{i=0}^{d_vc} \binom{N}{i}, whose maximum power is N^{d_vc}.

Examples
- H is positive rays: d_vc = 1
- H is 2D perceptrons: d_vc = 3
- H is convex sets: d_vc = \infty

VC dimension and learning

If d_vc(H) is finite, then g \in H will generalize. This conclusion is independent of the learning algorithm.

(Figure: the learning diagram: an unknown target function f: X \to Y, a probability distribution P on X, training examples (x_1, y_1), \ldots, (x_N, y_N), and a learning algorithm A that selects the final hypothesis g \approx f from the hypothesis set H.)
The conclusion is also independent of the input distribution and independent of the target function.

VC dimension of perceptrons

For d = 2, d_vc = 3. In general, d_vc = d + 1.

We will prove the two directions: d_vc \le d + 1 and d_vc \ge d + 1.

Here is one direction

A set of N = d + 1 points in R^d shattered by the perceptron:

X = \begin{pmatrix} x_1^T \\ x_2^T \\ x_3^T \\ \vdots \\ x_{d+1}^T \end{pmatrix}
  = \begin{pmatrix}
      1 & 0 & 0 & \cdots & 0 \\
      1 & 1 & 0 & \cdots & 0 \\
      1 & 0 & 1 & \cdots & 0 \\
      \vdots & & & \ddots & \vdots \\
      1 & 0 & \cdots & 0 & 1
    \end{pmatrix}

X is invertible.

Can we shatter this data set?

For any y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_{d+1} \end{pmatrix} = \begin{pmatrix} \pm 1 \\ \pm 1 \\ \vdots \\ \pm 1 \end{pmatrix}, can we find a vector w satisfying sign(Xw) = y?

Easy! Just make Xw = y, which means w = X^{-1} y.

We can shatter these d + 1 points. This implies what?
[a] d_vc = d + 1
[b] d_vc \ge d + 1   ✓
[c] d_vc \le d + 1
[d] No conclusion

Now, to show that d_vc \le d + 1, we need to show that:
[a] There are d + 1 points we cannot shatter
[b] There are d + 2 points we cannot shatter
[c] We cannot shatter any set of d + 1 points
[d] We cannot shatter any set of d + 2 points   ✓

Take any d + 2 points

For any d + 2 points x_1, \ldots, x_{d+1}, x_{d+2}: more points than dimensions \Longrightarrow we must have

x_j = \sum_{i \ne j} a_i x_i

where not all of the a_i's are zero.

So?

Consider the following dichotomy: the x_i's with non-zero a_i get y_i = sign(a_i), and x_j gets y_j = -1.

No perceptron can implement such a dichotomy!
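The shattering construction for d + 1 points can be checked numerically. A minimal sketch, assuming NumPy; it builds the slide's matrix X (a column of ones next to the origin-plus-axes pattern) and solves w = X^{-1} y for every dichotomy:

```python
import numpy as np
from itertools import product

d = 3                       # input dimension (illustrative choice)
N = d + 1                   # number of points to shatter

# Rows are the points from the slide: a constant-1 first coordinate,
# then x_1 at the origin and x_2, ..., x_{d+1} on the coordinate axes.
X = np.hstack([np.ones((N, 1)), np.vstack([np.zeros(d), np.eye(d)])])

# For every dichotomy y in {-1, +1}^{d+1}, solve Xw = y, then verify sign(Xw) = y.
shattered = True
for y in product([-1.0, 1.0], repeat=N):
    y = np.array(y)
    w = np.linalg.solve(X, y)          # w = X^{-1} y
    if not np.array_equal(np.sign(X @ w), y):
        shattered = False
print(shattered)  # True: the perceptron realizes all 2^(d+1) dichotomies
```

Because X is invertible, Xw = y holds exactly, so the sign check succeeds for all 2^{d+1} dichotomies, which is precisely what "shattering" means.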
Why?

x_j = \sum_{i \ne j} a_i x_i \Longrightarrow w^T x_j = \sum_{i \ne j} a_i w^T x_i

If y_i = sign(w^T x_i) = sign(a_i), then a_i w^T x_i > 0.

This forces w^T x_j = \sum_{i \ne j} a_i w^T x_i > 0.

Therefore y_j = sign(w^T x_j) = +1, which contradicts the required y_j = -1.

Putting it together

We proved d_vc \le d + 1 and d_vc \ge d + 1, so d_vc = d + 1.

What is d + 1 in the perceptron? It is the number of parameters: w_0, w_1, \ldots, w_d.

Outline
- The definition
- VC dimension of perceptrons
- Interpreting the VC dimension
- Generalization bounds

1. Degrees of freedom

Parameters create degrees of freedom.
- # of parameters: analog degrees of freedom
- d_vc: equivalent 'binary' degrees of freedom

The usual suspects
- Positive rays (d_vc = 1): h(x) = -1 to the left of the threshold a and h(x) = +1 to the right, over points x_1, x_2, x_3, \ldots, x_N.
- Positive intervals (d_vc = 2): h(x) = -1, then h(x) = +1 inside the interval, then h(x) = -1 again, over points x_1, x_2, x_3, \ldots, x_N.

Not just parameters

Parameters may not contribute degrees of freedom. (Figure: a model mapping x to y through a chain of redundant parameters.)

d_vc measures the effective number of parameters.

2. Number of data points needed

Two small quantities appear in the VC inequality, the tolerance \epsilon and the probability bound \delta:

P[|E_in(g) - E_out(g)| > \epsilon] \le 4\, m_H(2N)\, e^{-\frac{1}{8}\epsilon^2 N} = \delta

If we want a certain \epsilon and \delta, how does N depend on d_vc? Let us look at N^d e^{-N}.

N^d e^{-N}

Fix N^d e^{-N} = small value. How does N change with d? (Figure: semilog plot of N^d e^{-N} versus N, e.g. the curve N^{30} e^{-N}.)
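The behavior in the plot can be reproduced in a few lines. A rough sketch, assuming only the standard library; the helper name smallest_N and the threshold value 1e-9 are illustrative choices, not from the lecture. For each d it finds the smallest N with N^d e^{-N} below the threshold, and the results grow roughly in proportion to d:

```python
from math import log

def smallest_N(d, threshold=1e-9):
    """Smallest integer N with N**d * exp(-N) < threshold.

    Works in log space (d*log(N) - N) to avoid overflow for large N and d.
    The threshold is an arbitrary small value chosen for illustration.
    """
    N = 1
    while d * log(N) - N >= log(threshold):
        N += 1
    return N

# How many examples N are needed as the dimension d grows?
for d in (5, 10, 20, 40):
    print(d, smallest_N(d))
```

The roughly linear growth of N with d is what motivates the rule of thumb on the next slide.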
Rule of thumb: N \ge 10\, d_vc

Outline
- The definition
- VC dimension of perceptrons
- Interpreting the VC dimension
- Generalization bounds

Rearranging things

Start from the VC inequality:

P[|E_out - E_in| > \epsilon] \le 4\, m_H(2N)\, e^{-\frac{1}{8}\epsilon^2 N} = \delta

Get \epsilon in terms of \delta:

\delta = 4\, m_H(2N)\, e^{-\frac{1}{8}\epsilon^2 N} \Longrightarrow \epsilon = \sqrt{\frac{8}{N} \ln \frac{4\, m_H(2N)}{\delta}} \equiv \Omega

With probability \ge 1 - \delta: |E_out - E_in| \le \Omega(N, H, \delta)

Generalization bound

With probability \ge 1 - \delta: |E_out - E_in| \le \Omega(N, H, \delta)

\Longrightarrow With probability \ge 1 - \delta: E_out \le E_in + \Omega
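The error bar \Omega can be evaluated once m_H(2N) is replaced by its polynomial bound. A small sketch, assuming only the standard library; the helper names growth_bound and omega are illustrative, and m_H(2N) is bounded by \sum_{i=0}^{d_vc} \binom{2N}{i} exactly as in the growth-function slide:

```python
from math import comb, log, sqrt

def growth_bound(N, dvc):
    """Polynomial upper bound on the growth function: sum_{i=0}^{dvc} C(N, i)."""
    return sum(comb(N, i) for i in range(dvc + 1))

def omega(N, dvc, delta):
    """Error bar Omega = sqrt((8/N) * ln(4 * m_H(2N) / delta)), with m_H bounded above."""
    return sqrt(8.0 / N * log(4.0 * growth_bound(2 * N, dvc) / delta))

# With probability >= 1 - delta:  E_out <= E_in + omega(N, dvc, delta).
# The bar shrinks as N grows (here for the 2D perceptron, d_vc = 3).
for N in (100, 1000, 10000):
    print(N, omega(N, dvc=3, delta=0.05))
```

Because the bound on m_H(2N) grows only polynomially in N while the exponent decays, \Omega goes to zero as N increases, which is the content of the generalization bound.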
