Machine Learning (CSI 4145)
Tom Cesari
Last update: April 3, 2025

Contents

I Preliminaries
1 Math Essentials
  1.1 Elementary Math
  1.2 Calculus
  1.3 Linear Algebra
  1.4 (Discrete) Probability Theory
  1.5 Information Theory
2 Before We Begin...
II Unsupervised Learning
3 Density estimation
  3.1 Learning Algorithm: Maximum Likelihood Estimation (MLE)
4 Clustering
  4.1 Learning Algorithm: k-means++
III Supervised Learning
5 Classification
  5.1 Learning Algorithm: Nearest Neighbor (NN)
  5.2 Learning Algorithm: Empirical Risk Minimization (ERM)
    5.2.1 ERM for Finite Hypothesis Classes
    5.2.2 ERM for Linear Predictors (Perceptron)
  5.3 Learning Algorithm: Adaptive Boosting (AdaBoost)
6 Regression
  6.1 Learning Algorithm: Empirical Risk Minimization (ERM)
    6.1.1 ERM for Linear Predictors (Ordinary Least Squares (OLS))
7 A Broader Perspective on Statistical Risk Guarantees
  7.1 Classification: VC-Dimension
  7.2 Regression: Pseudo-Dimension
IV Online Learning
8 Online Learning with Full Feedback
  8.1 Learning Algorithm: Online Gradient Descent (OGD)
V Appendix
A Questions (Full List)
B Questions to Prepare for the Final
  B.1 Short Questions (No proofs)
  B.2 Long Questions (With Proofs)

Part I: Preliminaries

1 Math Essentials

In this introductory section, we collect all mathematical definitions, notation, and results that are needed to understand the main content of this course.

1.1 Elementary Math

Definition 1 (Cartesian product). Given two sets A and B, the Cartesian product A × B is the set of all pairs (a, b) with a ∈ A and b ∈ B.

Notation 2 (Arbitrary union/intersection). Given a set of indices Λ and a family of sets (Aλ)λ∈Λ indexed by the elements λ of Λ, we denote their union by ∪λ∈Λ Aλ and their intersection by ∩λ∈Λ Aλ.

Notation 3 (A*). Given a set A, we denote by A* := ∪n∈N Aⁿ the set of all finite sequences of elements in A.

Notation 4 (Function). Given two sets A and B, the notation f : A → B denotes a function f defined on A and taking values in B.

Definition 5 (Indicator Function). Let Ω be a set and, for all ω ∈ Ω, let P(ω) be a proposition. Then, the indicator function I{P} : Ω → {0, 1} is defined, for all ω ∈ Ω, by

    I{P}(ω) := 1 if P(ω) is true, and I{P}(ω) := 0 if P(ω) is false.

Notation 6 (Preimage). Let A, B be two sets, f : A → B a function, and B′ ⊆ B. We denote the preimage of B′, i.e., f⁻¹(B′) := {a ∈ A : f(a) ∈ B′}, simply by {f ∈ B′}.

Definition 7 (Power set). Given a set Ω, its power set, denoted by 2^Ω, is the set of all subsets of Ω.

Definition 8 (Intervals). If a, b ∈ R, with a < b, then the interval (a, b) (resp., [a, b), (a, b], or [a, b]) is the set of all points x ∈ R such that a < x < b (resp., a ≤ x < b, a < x ≤ b, or a ≤ x ≤ b).
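For instance, the indicator function and the preimage notation can be made concrete on a small finite set. The following Python sketch (the sets and propositions below are arbitrary choices, not taken from the notes) mirrors Definitions 5 and 6:

```python
omega = {1, 2, 3, 4, 5, 6}                 # an arbitrary finite set Omega

def indicator(P, w):
    """I{P}(w): 1 if the proposition P holds at w, 0 otherwise (Definition 5)."""
    return 1 if P(w) else 0

is_even = lambda w: w % 2 == 0
print([indicator(is_even, w) for w in sorted(omega)])   # [0, 1, 0, 1, 0, 1]

# Preimage {f in B'}: the points of the domain that f maps into B' (Notation 6).
f = lambda w: w % 3
print({w for w in omega if f(w) in {0}})                # {3, 6}
```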
1.2 Calculus

Definition 9 (Gradient). Let f : R^d → R be a differentiable function. The gradient of f at a point x ∈ R^d is the vector

    ∇f(x) := (∂f/∂x₁(x), ..., ∂f/∂x_d(x)).

Fact 10 (Taylor's formula). If f : R^d → R is a twice differentiable function, then, for all w, u ∈ R^d,

    f(u) = f(w) + ∇f(w)⊤(u − w) + ½ (u − w)⊤ ∇²f(ξ) (u − w),

where ∇²f(ξ) is the Hessian matrix of f evaluated at a point ξ on the segment joining u and w.

Definition 11 (Convex function). If C ⊆ R^d is a convex set,¹ we say that a function f : C → R is convex if, for any two points x, y ∈ C and any λ ∈ [0, 1], it holds that

    f((1 − λ)x + λy) ≤ (1 − λ)f(x) + λf(y).

Fact 12 (Hessian of a convex function). If f : R^d → R is a twice differentiable convex function, then the Hessian ∇²f is positive semidefinite, i.e., for all z, ξ ∈ R^d, z⊤∇²f(ξ)z ≥ 0.

Corollary 13 (Twice differentiable convex functions). If f : R^d → R is a twice differentiable convex function, then f(w) − f(u) ≤ ∇f(w)⊤(w − u).

Proof. It's an immediate consequence of Facts 10 and 12.

¹ I.e., if for any two points x, y ∈ C, the segment joining x and y is included in C.

1.3 Linear Algebra

Definition 14 (Euclidean norm, unit vector). We denote by ∥·∥ : R^d → [0, +∞) the d-dimensional Euclidean norm, defined, for all x := (x₁, ..., x_d) ∈ R^d, by ∥x∥ := √(Σᵢ₌₁^d xᵢ²). A vector w ∈ R^d is called a unit vector if ∥w∥ = 1.

Definition 15 (Dot product). Given two vectors x, u ∈ R^d, we denote their dot (a.k.a. scalar, a.k.a. inner) product by x⊤u := Σᵢ₌₁^d xᵢuᵢ = ∥x∥∥u∥cos(θ), where θ is the angle between x and u.

Definition 16 (Hyperplane). Let w ∈ R^d be a unit vector² and c ∈ R. The set {x ∈ R^d : w⊤x = c} is called a hyperplane with parameters (w, c). If c = 0, the set {x ∈ R^d : w⊤x = 0} is called a homogeneous hyperplane with normal (or weight) vector w. Since a homogeneous hyperplane is entirely determined by its normal vector w, with a slight abuse of notation, we will sometimes say that w is the (homogeneous) hyperplane.

Remark 17 (Hyperplanes and their coefficients). A hyperplane with coefficients (w, c) is orthogonal to the line {λw}λ∈R and intersects it at distance c from the origin. For example, in dimension d = 2, if w = (1/√2, 1/√2) and c = 3, then

    {x ∈ R² : w⊤x = c} = {(x₁, x₂) ∈ R² : x₁/√2 + x₂/√2 = 3} = {(x₁, x₂) ∈ R² : x₂ = −x₁ + 3√2},

which is indeed orthogonal to the line {(λ/√2, λ/√2)}λ∈R and intersects it at the point (3/√2, 3/√2), which has distance 3 from the origin.

Remark 18 (Homogeneous hyperplanes are enough). Without loss of generality, we can (and will!) assume that all the hyperplanes we consider are homogeneous. Indeed, given any unit vector w := (w₁, ..., w_d) ∈ R^d and any non-zero c ∈ R, the non-homogeneous hyperplane {x ∈ R^d : w⊤x = c} is equivalent to the homogeneous hyperplane {x′ ∈ R^{d+1} : v⊤x′ = 0} with v := (1/√(1 + c²)) · (w₁, ..., w_d, −c) in the following sense: for all x := (x₁, ..., x_d) ∈ R^d, letting x′ := (x₁, ..., x_d, 1), it holds that

    √(1 + c²) · v⊤x′ = Σⱼ₌₁^{d+1} (√(1 + c²) vⱼ) x′ⱼ = Σⱼ₌₁^d wⱼxⱼ + (−c) · 1 = w⊤x − c;

thus w⊤x = c ⟺ v⊤x′ = 0. In machine learning, this simply amounts to adding an extra feature with a default value of 1.

² Hyperplanes are sometimes defined for any w ∈ R^d \ {0}. Although formally correct, we would gain nothing from such a (seemingly) more general definition other than clutter in statements and proofs.
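The reduction of Remark 18 is easy to check numerically. Here is a minimal Python sketch (the particular w, c, and x below are arbitrary choices, not taken from the notes):

```python
import numpy as np

# The non-homogeneous hyperplane (w, c) in R^d becomes the homogeneous
# hyperplane with normal v in R^{d+1} after appending the constant feature 1.
w = np.array([1.0, 1.0]) / np.sqrt(2.0)        # unit normal vector in R^2
c = 3.0                                         # offset

v = np.append(w, -c) / np.sqrt(1.0 + c**2)      # normal of the lifted hyperplane

x = np.array([np.sqrt(2.0), 2.0 * np.sqrt(2.0)])   # a point with w.x = 3
x_lifted = np.append(x, 1.0)                       # add the extra feature 1

print(np.isclose(w @ x, c))              # True: x lies on the original hyperplane
print(np.isclose(v @ x_lifted, 0.0))     # True: x' lies on the homogeneous one
```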
1.4 (Discrete) Probability Theory

Note 19. In this section, we give an elementary introduction to discrete probability theory. Up to simple reductions and identifications, all definitions and results given here can be traced back to the general field of measure-theoretic probability theory. For a comprehensive introduction to the field, see Chapters 1-10 of the classic Probability with Martingales textbook by David Williams.

Notation 20 (Infinite sums). Let Ω be an arbitrary set, S a finite subset of Ω, R a subset of R ∪ {+∞} (or R ⊆ R ∪ {−∞}), and f : Ω → R a function such that f(ω) ≠ 0 if and only if ω ∈ S. Then, for any E ⊆ Ω, we introduce the handy infinite sum notation:

    Σ_{ω∈E} f(ω) := Σ_{ω∈S∩E} f(ω)  if S ∩ E ≠ ∅,   and   Σ_{ω∈E} f(ω) := 0  if S ∩ E = ∅.

Exercise 21. Let Ω₁ := N and let f₁ : Ω₁ → R be defined, for all ω ∈ Ω₁, by f₁(ω) := 0.1 · I{ω = 3} + 0.4 · I{ω = √120}. Moreover, let Ω₂ := R and let f₂ : Ω₂ → R be defined, for all ω ∈ Ω₂, by f₂(ω) := 0.3 · I{ω = 23} + 0.2 · I{ω = π} + 0.3 · I{ω = 29} + 0.25 · I{ω = 50}. For all i ∈ {1, 2}, compute Σ_{ω∈Ωᵢ} fᵢ(ω). Answers: 0.7 and 0.85.

Definition 22 (Discrete density on a set). Let Ω be a set. We say that a function p : Ω → [0, 1] is a discrete density on Ω if there exists a finite subset S_p of Ω, called the support of p, such that, for all ω ∈ Ω, p(ω) ≠ 0 if and only if ω ∈ S_p, and Σ_{ω∈Ω} p(ω) = 1.

Definition 23 (Discrete probability measure/space—See Figure 1). Given a set Ω, which we call the outcome (or sample) space, we say that P : 2^Ω → [0, 1] is a discrete probability measure on Ω if there exists a discrete density p : Ω → [0, 1] on Ω such that, for all subsets E ⊆ Ω (subsets E of Ω are called events), it holds that

    P[E] = Σ_{ω∈E} p(ω),

and we say that the support S_p of this discrete density p is the support S_P of P. The pair (Ω, P) is called a discrete probability space.

Figure 1: An illustration of a discrete probability measure P. The outcome space Ω is the set of all (gray and red) dots in the rectangle. Of all the outcomes ω ∈ Ω, only six have non-zero probability of happening: the six red dots, which constitute the support S_P of P (the probabilities P[{ω}] of the outcomes ω ∈ S_P are written next to the corresponding red dots). All other outcomes (the gray dots) have probability zero. To compute the probability P[E] of the set E ⊆ Ω in the picture (the blue blob), we simply sum the probabilities of all the (three) outcomes in E that have non-zero probability, obtaining P[E] = 0.2 + 0.1 + 0.1 = 0.4. Similarly, to compute the probability P[Ω] of the entire outcome space, we sum the probabilities of all the (six) outcomes in Ω with non-zero probability of happening, obtaining P[Ω] = 1.

Exercise 24. Let Ω := R. Which of the following is a discrete probability measure on Ω?

- P₁ : Ω → [0, 1] is defined, for all ω ∈ Ω, by P₁(ω) := I{ω = 0}.
- P₂ : 2^Ω → [−1, 1] satisfies, for all ω ∈ Ω, P₂[{ω}] = −I{ω = 0}.
- P₃ : 2^Ω → [0, 1] satisfies, for all ω ∈ Ω, P₃[{ω}] = 1.
- P₄ : 2^Ω → [0, 1] satisfies, for all ω ∈ Ω, P₄[{ω}] = 0, and P₄[Ω] = 1.
- P₅ : 2^Ω → [0, 1] satisfies, for all ω ∈ Ω, P₅[{ω}] = I{ω ∈ {0, 1}}, and, for all E ⊆ Ω, P₅[E] = Σ_{ω∈E} P₅[{ω}].
- P₆ : 2^Ω → [0, 1] satisfies, for all ω ∈ Ω, P₆[{ω}] = I{ω = 1}, and, for all E ⊆ Ω, P₆[E] = Σ_{ω∈E} P₆[{ω}].

Answer: only P₆.
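The following Python sketch mirrors the computation in Figure 1: a discrete density with a six-point support, and P[E] obtained by summing the density over E. The outcome names and the last three probabilities are made-up stand-ins for the dots in the figure (only 0.2 + 0.1 + 0.1 = 0.4 is taken from the caption):

```python
# Six outcomes with non-zero probability; every other outcome has p = 0.
density = {"w1": 0.2, "w2": 0.1, "w3": 0.1, "w4": 0.25, "w5": 0.25, "w6": 0.1}

def prob(event):
    """P[E] = sum of p(omega) over omega in E (Definition 23)."""
    return sum(density.get(w, 0.0) for w in event)

E = {"w1", "w2", "w3", "gray_dot"}          # the gray dot contributes zero
print(round(prob(E), 10))                   # 0.4, as in Figure 1
print(round(prob(density.keys()), 10))      # 1.0 = P[Omega]
```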
Remark 25 (Discrete probability measures and discrete densities). A discrete probability measure uniquely determines its discrete density, and vice versa. Indeed, fix a set Ω. Given a probability measure P : 2^Ω → [0, 1] on Ω, the function p_P : Ω → [0, 1] defined, for all ω ∈ Ω, by p_P(ω) := P[{ω}] is a discrete density on Ω, and it is the only discrete density p : Ω → [0, 1] such that, for any event E ⊆ Ω, P[E] = Σ_{ω∈E} p(ω). Vice versa, given a discrete density p : Ω → [0, 1] on Ω, the function P_p : 2^Ω → [0, 1] defined, for all E ⊆ Ω, by P_p[E] := Σ_{ω∈E} p(ω) is a probability measure on Ω, and it is the only probability measure P : 2^Ω → [0, 1] such that, for all ω ∈ Ω, P[{ω}] = p(ω). For this reason, we can interchangeably define discrete probability measures or their discrete densities, depending on what is more convenient.

Figure 2: Illustration of the four properties 1-4 in Proposition 26 (monotonicity, probability of a disjoint union, union bound, probability of the complement).

Proposition 26 (Elementary properties of probability measures—See Figure 2). Let (Ω, P) be a discrete probability space. Then:

1. (Monotonicity) If A, B ⊆ Ω and A ⊆ B, then P[A] ≤ P[B].
2. (Probability of a Disjoint Union) If A, B ⊆ Ω and A ∩ B = ∅, then P[A ∪ B] = P[A] + P[B].
3. (Union Bound) If A, B ⊆ Ω, then P[A ∪ B] ≤ P[A] + P[B].
4. (Probability of the Complement) If A ⊆ Ω, then P[Ω \ A] = 1 − P[A].

Proof. We prove the four points separately.

1. If an event A is included in an event B, then

    P[A] = Σ_{ω∈A} P[{ω}] ≤ Σ_{ω∈A} P[{ω}] + Σ_{ω′∈B\A} P[{ω′}] = Σ_{ω∈B} P[{ω}] = P[B].

2. If A and B are two disjoint events, then

    P[A ∪ B] = Σ_{ω∈A∪B} P[{ω}] = Σ_{ω∈A} P[{ω}] + Σ_{ω′∈B} P[{ω′}] = P[A] + P[B].

3. If A and B are two events, then

    P[A ∪ B] = Σ_{ω∈A∪B} P[{ω}] = Σ_{ω∈A} P[{ω}] + Σ_{ω′∈B\A} P[{ω′}] ≤ Σ_{ω∈A} P[{ω}] + Σ_{ω′∈B} P[{ω′}] = P[A] + P[B].

4. If A is an event, since A and Ω \ A are disjoint, then

    1 = P[Ω] = P[A ∪ (Ω \ A)] = P[A] + P[Ω \ A],

and the claimed property follows by rearranging the left- and right-hand sides.

Definition 27 (Discrete random variable/distribution). Let (Ω, P) be a discrete probability space and Z a set. We say that:

- A function Z : Ω → Z is a Z-valued discrete random variable.³
- The discrete probability measure on Z, D : 2^Z → [0, 1], defined, for all events E ⊆ Z, by D(E) := P[Z ∈ E] := P[{ω ∈ Ω : Z(ω) ∈ E}], is the (discrete) distribution of Z.⁴
- The function p_Z : Z → [0, 1] defined, for all z ∈ Z, by p_Z(z) := P[Z = z] is the density of Z.
- The pair (Ω, P) is the probability space that carries Z.

Remark 28 (Important). Given a set Z and a discrete probability measure D on Z, it is common to read sentences like: "Let Z be a Z-valued discrete random variable with distribution D", with no explicit mention of the probability space (Ω, P) that carries Z and no explicit formula that defines Z. This is acceptable because, given a set Z and a discrete probability measure D on Z, there is a canonical way to define them: one can always take Ω := Z as the outcome space, P := D as the discrete probability measure on Ω, and Z as the identity function Z : Ω → Z, ω ↦ Z(ω) := ω.

Notation 29 (R̄). We denote the set R ∪ {−∞, +∞}, called the extended real line, by R̄.

Definition 30 (Non-negative and L1 random variables). Let (Ω, P) be a discrete probability space, R a subset of R̄, and V : Ω → R a discrete random variable. We say that:

- V is a non-negative discrete random variable if R ⊆ [0, +∞].
- V is an L1 discrete random variable if P[V = ±∞] = 0.

³ If Z ⊆ R ∪ {−∞, +∞}, we will often omit "Z-valued" and simply say that Z is a discrete random variable.
⁴ As an exercise, prove that D is indeed a discrete probability measure on Z.
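As a concrete instance of Definition 27, the following sketch (with a made-up random variable on a fair die) computes the distribution D of Z as a pushforward of P:

```python
from fractions import Fraction

p = {w: Fraction(1, 6) for w in range(1, 7)}   # density of P on Omega = {1,...,6}
Z = lambda w: w % 2                            # a {0,1}-valued random variable

def distribution(event):
    """D(E) := P[Z in E] = P[{w in Omega : Z(w) in E}] (Definition 27)."""
    return sum(p[w] for w in p if Z(w) in event)

print(distribution({0}), distribution({1}))    # 1/2 1/2
```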
Definition 31 (Expectation). Let (Ω, P) be a discrete probability space and V a non-negative or L1 discrete random variable. We define the expected value (or expectation) of V as

    E[V] := Σ_{v∈R̄} v P[V = v],

with the convention that ±∞ · 0 = 0.

Notation 32 (1/V and ln(V)). Let (Ω, P) be a discrete probability space and V a non-negative discrete random variable. Then, we denote by 1/V (or 1⁄V) the non-negative discrete random variable defined, for all ω ∈ Ω, by

    (1/V)(ω) := +∞ if V(ω) = 0;   1/V(ω) if V(ω) ∈ (0, +∞);   0 if V(ω) = +∞,

and we denote by ln V (or ln(V)) the discrete random variable defined, for all ω ∈ Ω, by

    (ln V)(ω) := −∞ if V(ω) = 0;   ln(V(ω)) if V(ω) ∈ (0, +∞);   +∞ if V(ω) = +∞.

Theorem 33 (Law Of The Unconscious Statistician—LOTUS). Let (Ω, P) be a discrete probability space, V a set, V : Ω → V a V-valued discrete random variable, R a subset of R̄, and f : V → R such that f(V) is a non-negative or L1 discrete random variable. Then

    E[f(V)] = Σ_{v∈V} f(v) P[V = v].

Proof. By definition of the expectation of the random variable f(V), we get

    E[f(V)] = Σ_{y∈R̄} y P[f(V) = y] = Σ_{y∈R̄} y Σ_{v∈V : f(v)=y} P[V = v] = Σ_{y∈R̄} Σ_{v∈V : f(v)=y} y P[V = v]
            = Σ_{y∈R̄} Σ_{v∈V : f(v)=y} f(v) P[V = v] = Σ_{v∈V} f(v) P[V = v].

Definition 34 (Variance). Let (Ω, P) be a discrete probability space, and V an L1 discrete random variable. We define the variance of V as

    var(V) := E[(V − E[V])²] = E[V²] − (E[V])².

Definition 35 (Independence). Let (Ω, P) be a discrete probability space, Λ a (possibly infinite) set of indices, Z a set, and (Zλ)λ∈Λ a family of Z-valued discrete random variables indexed by the elements λ of Λ. We say that (Zλ)λ∈Λ is an independent family if, for any finite subset I ⊆ Λ and any family (Eᵢ)ᵢ∈I of subsets of Z, it holds that

    P[∩ᵢ∈I {Zᵢ ∈ Eᵢ}] = Πᵢ∈I P[Zᵢ ∈ Eᵢ].

Theorem 36 (Expectation of a product of independent random variables). Let (Ω, P) be a discrete probability space, k ∈ N, and (Z₁, ..., Z_k) an independent family of L1 discrete random variables. Then

    E[Πᵢ₌₁^k Zᵢ] = Πᵢ₌₁^k E[Zᵢ].

Proof. Define the "product" function f : R^k → R, for all (z₁, ..., z_k) ∈ R^k, by f(z₁, ..., z_k) := z₁ · ... · z_k. Noting that Πᵢ₌₁^k Zᵢ = f(Z₁, ..., Z_k) is an L1 discrete random variable, by the law of the unconscious statistician,

    E[Πᵢ₌₁^k Zᵢ] = E[f(Z₁, ..., Z_k)] = Σ_{(z₁,...,z_k)∈R^k} z₁ · ... · z_k P[(Z₁, ..., Z_k) = (z₁, ..., z_k)]
    = Σ_{z₁∈R} ... Σ_{z_k∈R} z₁ · ... · z_k P[∩ᵢ₌₁^k {Zᵢ = zᵢ}]
    =(I) Σ_{z₁∈R} ... Σ_{z_k∈R} z₁ · ... · z_k Πᵢ₌₁^k P[Zᵢ = zᵢ]
    = (Σ_{z₁∈R} z₁ P[Z₁ = z₁]) · (Σ_{z₂∈R} z₂ P[Z₂ = z₂]) · ... · (Σ_{z_k∈R} z_k P[Z_k = z_k]) = Πᵢ₌₁^k E[Zᵢ],

where (I) followed from the independence of Z₁, ..., Z_k.

Definition 37 (Conditional Probability). Let (Ω, P) be a discrete probability space and C ⊆ Ω an event such that P[C] > 0. The function P[· | C] : 2^Ω → [0, 1] defined, for all events E ⊆ Ω, by

    P[E | C] := P[E ∩ C] / P[C]

is called the conditional probability P given C (and we say that P[E | C] is the conditional probability of E given C).
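A quick numeric sketch of Definition 37 (the die and the events below are arbitrary choices):

```python
from fractions import Fraction

p = {w: Fraction(1, 6) for w in range(1, 7)}   # fair die

def prob(event):
    return sum(p[w] for w in event)

C = {2, 4, 6}          # conditioning event: the outcome is even
E = {4, 5, 6}          # event of interest: the outcome is at least 4

print(prob(E & C) / prob(C))   # P[E | C] = (2/6)/(3/6) = 2/3
```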
Proposition 38 (P[· | C] is a discrete probability measure). Let (Ω, P) be a discrete probability space and C ⊆ Ω an event such that P[C] > 0. Then P[· | C] is a discrete probability measure.

Proof. Let S_P be the support of P. Then the finite set S_{P[·|C]} := S_P ∩ C is the support of P[· | C], i.e., S_{P[·|C]} is a finite set such that P[{ω} | C] = P[{ω} ∩ C]/P[C] ≠ 0 if and only if ω ∈ S_{P[·|C]}. This proves the first property. For the second one, note that, for any event E ⊆ Ω,

    P[E | C] = P[E ∩ C]/P[C] = (Σ_{ω∈E∩C} P[{ω}])/P[C] = Σ_{ω∈E∩C} P[{ω} ∩ C]/P[C] = Σ_{ω∈E} P[{ω} ∩ C]/P[C] = Σ_{ω∈E} P[{ω} | C].

Finally, P[Ω | C] = P[Ω ∩ C]/P[C] = P[C]/P[C] = 1.

Definition 39 (Conditional Expectation). Let (Ω, P) be a discrete probability space, V a non-negative or L1 discrete random variable, W a set, and W : Ω → W a W-valued discrete random variable. The conditional expectation of V given W is the discrete random variable E[V | W] : Ω → R̄ defined, for all ω ∈ Ω, by E[V | W](ω) := E[V | W = W(ω)], where, for all w ∈ W,

    E[V | W = w] := Σ_{v∈R̄} v P[V = v | W = w]  if P[W = w] > 0,   and   E[V | W = w] := 0  if P[W = w] = 0.

(In words, for all w ∈ W such that P[W = w] > 0, E[V | W = w] is simply the expectation computed with respect to the conditional probability P[· | W = w].)

Theorem 40 (Elementary properties of conditional expectations). Let (Ω, P) be a discrete probability space, V, V′ two L1 discrete random variables, W, W′ two sets, and W : Ω → W, W′ : Ω → W′ two (respectively, W- and W′-valued) discrete random variables. The following properties hold:

1. (Linearity) If α, β ∈ R, then E[αV + βV′ | W] = αE[V | W] + βE[V′ | W].
2. (Monotonicity) If V ≥ 0, then E[V | W] ≥ 0.
3. (Constancy on level sets) If f : R × W → R and w ∈ W, then E[f(V, W) | W = w] = E[f(V, w) | W = w].
4. (Taking out what is known) If f : W → R, then E[V f(W) | W] = f(W) E[V | W].
5. (Tower property) E[V | W] is an L1 discrete random variable and E[E[V | W]] = E[V].

Proof. First, note that, for any discrete random variable X, E[X] = Σ_{ω∈Ω} X(ω) P[{ω}]; indeed:

    E[X] = Σ_{x∈R̄} x P[X = x] = Σ_{x∈R̄} x Σ_{ω∈{X=x}} P[{ω}] = Σ_{x∈R̄} Σ_{ω∈Ω} x I{X(ω) = x} P[{ω}]
         = Σ_{ω∈Ω} Σ_{x∈R̄} I{X(ω) = x} X(ω) P[{ω}] = Σ_{ω∈Ω} X(ω) P[{ω}].

(Applied to the probability measure P[· | W = w], this also gives E[X | W = w] = Σ_{ω∈Ω} X(ω) P[{ω} | W = w] whenever P[W = w] > 0.) Having established this equivalent way of expressing expectations, we prove each point separately.

1. If α, β ∈ R, then, for any ω′ ∈ Ω such that P[W = W(ω′)] > 0, letting w′ := W(ω′),

    E[αV + βV′ | W = w′] = Σ_{ω∈Ω} (αV(ω) + βV′(ω)) P[{ω} | W = w′]
    = α Σ_{ω∈Ω} V(ω) P[{ω} | W = w′] + β Σ_{ω∈Ω} V′(ω) P[{ω} | W = w′] = αE[V | W = w′] + βE[V′ | W = w′],

and, for any ω″ ∈ Ω such that P[W = W(ω″)] = 0, letting w″ := W(ω″),

    E[αV + βV′ | W = w″] = 0 = 0 + 0 = αE[V | W = w″] + βE[V′ | W = w″];

therefore, for any ω ∈ Ω, letting w := W(ω), it holds that E[αV + βV′ | W = w] = αE[V | W = w] + βE[V′ | W = w], which in turn implies that E[αV + βV′ | W] = αE[V | W] + βE[V′ | W].

2. If V ≥ 0, then, for any w ∈ W such that P[W = w] > 0,

    E[V | W = w] = Σ_{ω∈Ω} V(ω) P[{ω} | W = w] ≥ 0,

which in turn implies that E[V | W] ≥ 0.

3. If f : R × W → R, then, for all w ∈ W such that P[W = w] > 0, noting that P[{ω} | W = w] = 0 whenever ω ∉ {W = w},

    E[f(V, W) | W = w] = Σ_{ω∈Ω} f(V(ω), W(ω)) P[{ω} | W = w]
    = Σ_{ω∈Ω} f(V(ω), W(ω)) I{W(ω) = w} P[{ω} | W = w]
    = Σ_{ω∈Ω} f(V(ω), w) I{W(ω) = w} P[{ω} | W = w]
    = Σ_{ω∈Ω} f(V(ω), w) P[{ω} | W = w] = E[f(V, w) | W = w].

4. If f : W → R, we have to prove that, for all ω ∈ Ω, E[V f(W) | W = W(ω)] = f(W(ω)) E[V | W = W(ω)], which is trivially true if P[W = W(ω)] = 0, and, if P[W = W(ω)] > 0, follows from the "constancy on level sets" property (C) of conditional expectations and the linearity (L) of conditional expectations; indeed:

    E[V f(W) | W = W(ω)] =(C) E[V f(W(ω)) | W = W(ω)] =(L) f(W(ω)) E[V | W = W(ω)].

5. Since V is an L1 discrete random variable, then, for all w ∈ W, E[V | W = w] ≠ ±∞; hence, E[V | W] is also an L1 discrete random variable. By the law of the unconscious statistician applied to the function f : W → R defined, for all w ∈ W, by f(w) := E[V | W = w], we get

    E[E[V | W]] = E[f(W)] = Σ_{w∈W} f(w) P[W = w] = Σ_{w∈W} E[V | W = w] P[W = w]
    = Σ_{w∈W : P[W=w]>0} E[V | W = w] P[W = w] = Σ_{w∈W : P[W=w]>0} Σ_{v∈R̄} v P[V = v | W = w] P[W = w]
    = Σ_{w∈W : P[W=w]>0} Σ_{v∈R̄} v P[{V = v} ∩ {W = w}] =(∗) Σ_{w∈W} Σ_{v∈R̄} v P[{V = v} ∩ {W = w}]
    = Σ_{v∈R̄} v Σ_{w∈W} P[{V = v} ∩ {W = w}] = Σ_{v∈R̄} v P[V = v] = E[V],

where (∗) follows from the monotonicity of probabilities, noting that P[{V = v} ∩ {W = w}] = 0 whenever P[W = w] = 0.
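The tower property can be checked numerically. The sketch below does so on the two-dice space that Exercise 43 will formalize shortly (V is the sum of the rolls, W the first roll):

```python
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # Omega: 36 equiprobable outcomes

def E(f):                                          # expectation on this space
    return sum(f(w) for w in outcomes) / 36

V = lambda w: w[0] + w[1]                          # sum of the two rolls
W = lambda w: w[0]                                 # first roll

def cond_E_V(i):                                   # E[V | W = i]
    rel = [w for w in outcomes if W(w) == i]
    return sum(V(w) for w in rel) / len(rel)

print([cond_E_V(i) for i in range(1, 7)])          # [4.5, 5.5, ..., 9.5] = 3.5 + i
print(E(lambda w: cond_E_V(W(w))), E(V))           # both 7.0: E[E[V|W]] = E[V]
```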
Definition 41 (I.I.D.). Let (Ω, P) be a discrete probability space, Λ a (possibly infinite) set of indices, V a set, and (Vλ)λ∈Λ a family of V-valued discrete random variables indexed by the elements λ of Λ. We say that (Vλ)λ∈Λ is an i.i.d. (independent and identically distributed) family if it is an independent family and all its random variables have the same distribution.

Fact 42 (Hoeffding's Inequality). Let (Ω, P) be a discrete probability space, and V, V₁, ..., V_n an i.i.d. sequence of L1 discrete random variables taking values in the interval [0, 1]. Then

    ∀ε > 0,  P[ |(1/n) Σ_{t=1}^n V_t − E[V]| ≥ ε ] ≤ 2 exp(−2nε²),

or, equivalently,

    ∀δ > 0,  P[ |(1/n) Σ_{t=1}^n V_t − E[V]| ≤ √(ln(2/δ)/(2n)) ] ≥ 1 − δ.

Exercise 43. Using the content of this section, verify that the following claims are true. Consider the random experiment of rolling two fair dice and adding together the results. Formally, let Ω := {1, ..., 6}² be the space containing the 36 possible outcomes that can be obtained by rolling two dice, let P be the unique discrete probability measure on Ω such that, for all ω ∈ Ω, P[{ω}] = 1/36, and let V : Ω → R be the random variable defined, for all (i, j) ∈ Ω, by V(i, j) := i + j. The distribution D of V is the unique discrete probability measure on R such that, for all v ∈ {2, ..., 12},

    D({v}) = min{(v − 1)/36, (13 − v)/36}.

Moreover, E[V] = 7. Now let W : Ω → R be the random variable defined, for all (i, j) ∈ Ω, by W(i, j) := i, and let Z : Ω → R be the random variable defined, for all (i, j) ∈ Ω, by Z(i, j) := j. Then W and Z are independent, and the conditional expectation E[V | W] : Ω → R̄ satisfies, for all (i, j) ∈ Ω, E[V | W = W(i, j)] = 3.5 + i.

1.5 Information Theory

Definition 44 (Entropy). Let (Ω, P) be a discrete probability space, X a discrete random variable with density p : R → [0, 1], and let S_p be the support of p. We define the entropy of X as

    H(X) := E[−ln p(X)] = −Σ_{x∈S_p} p(x) ln p(x).

Remark 45 (Entropy). The entropy quantifies the inherent level of uncertainty of a random variable, with constant (no uncertainty) and uniform (maximum uncertainty) random variables being the ones with lowest and highest entropy.

Definition 46 (KL Divergence, a.k.a. Relative Entropy). Let p and q be two discrete densities on R and let S_p be the support of p. We define the KL divergence (or relative entropy) between p and q as

    KL(p∥q) := Σ_{x∈S_p} p(x) ln(p(x)/q(x)),

with the convention that if there exists x ∈ S_p such that q(x) = 0, then KL(p∥q) := ∞.

Remark 47 (KL Divergence). The KL divergence measures how different a density p is from another density q.
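Both quantities are straightforward to compute for densities with small support. A minimal sketch (the parameters below are arbitrary; the nonnegativity observed at the end is exactly Gibbs' inequality, Theorem 48 below):

```python
import math

def entropy(p):
    """H = -sum p(x) ln p(x) over the support of p (Definition 44)."""
    return -sum(px * math.log(px) for px in p.values() if px > 0)

def kl(p, q):
    """KL(p||q); +inf if q vanishes somewhere on the support of p (Definition 46)."""
    if any(q.get(x, 0) == 0 for x, px in p.items() if px > 0):
        return math.inf
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

p = {0: 0.7, 1: 0.3}
q = {0: 0.5, 1: 0.5}
print(entropy(p))          # ~0.6109
print(kl(p, q), kl(p, p))  # ~0.0823 and 0.0: KL >= 0, vanishing iff p = q
```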
Theorem 48 (Gibbs' Inequality). For any two discrete densities p and q on R, KL(p∥q) ≥ 0, and KL(p∥q) = 0 if and only if p = q.

Proof. Fix any two discrete densities p and q on R, and let S_p be the support of p. Since KL(p∥q) = ∞ if and only if there exists x ∈ S_p such that q(x) = 0, KL(p∥q) = ∞ implies that p ≠ q. Hence, assume without loss of generality that KL(p∥q) < ∞, i.e., that q(x) > 0 for all x ∈ S_p. Since, for all x > 0, −ln(x) ≥ 1 − x, with equality if and only if x = 1, then

    KL(p∥q) = Σ_{x∈S_p} p(x) ln(p(x)/q(x)) = Σ_{x∈S_p} p(x) (−ln(q(x)/p(x))) ≥ Σ_{x∈S_p} p(x) (1 − q(x)/p(x)) = 1 − Σ_{x∈S_p} q(x) ≥ 0.

This shows that it is always the case that KL(p∥q) ≥ 0. Now, if p = q, we immediately have that KL(p∥q) = 0. Vice versa, if KL(p∥q) = 0, then the two inequalities in the display above must hold with equality. The first inequality being an equality implies that q(x) = p(x) for all x ∈ S_p, which, together with the assumption that q is a density, implies that q(x) = 0 for all x ∈ R \ S_p, i.e., that q = p.

2 Before We Begin...

Note 49 (Discrete Distributions). We will only consider discrete distributions in this course. This is merely to simplify the presentation. All the results we will see here can be greatly generalized by leveraging more sophisticated mathematical machinery.

Note 50 (Disclaimer). Do not overfit to this course. Machine learning is still in its infancy and growing faster than any other scientific field. Definitions and notations are constantly evolving and might vary from cohort to cohort.

Notation 51 (Probability Spaces). We will always use the pair (Ω, P) (without explicitly mentioning it every time) to denote the probability space that carries all the discrete random variables we will encounter.

Definition 52 (High-probability and in-expectation bounds). Let X be a discrete random variable. If there exists c > 0 such that E[X] ≤ c, we call the inequality E[X] ≤ c an in-expectation bound (equivalently, we say that X is upper bounded by c in expectation, or that c upper bounds X in expectation). If for all δ > 0 there exists a c_δ > 0 such that P[X ≤ c_δ] ≥ 1 − δ, we call the inequality X ≤ c_δ a high-probability bound (equivalently, we say with a slight abuse of notation that X is upper bounded by c_δ with high probability, or that c_δ upper bounds X with high probability).⁵

⁵ With a common abuse of notation, we will often write high-probability bounds as follows: "For all δ > 0 there exists a c_δ > 0 such that, with probability at least 1 − δ, it holds that X ≤ c_δ".

Remark 53 (High-probability vs in-expectation bounds). In-expectation bounds describe the average behavior of random variables. High-probability bounds describe how random variables behave most of the time. Therefore, in-expectation results are meaningful for random experiments that can be repeated many times (e.g., monetary investments made by a firm with a large budget, where it might still be OK to lose money in most of the investments as long as a large profit is made when the investments are successful). High-probability bounds are important in scenarios where it's crucial to limit the number of mistakes (e.g., choosing a therapy to cure a disease). Both types of bounds have their place in the machine learning landscape. In this course, we will showcase both.

Note 54 (Machine Learning—Informal). A machine learns to perform a task if, given enough experience, its performance reaches a sufficient level.

Remark 55 (Checklist). These notes are organized hierarchically as follows.

1. Types of machine learning problems (we will see unsupervised, supervised, and online learning).
2. For each type, we will see at least two machine learning problems. (Unsupervised: density estimation and clustering. Supervised: classification and regression. Online: under full and partial feedback.)
3. For each machine learning problem, we will discuss:
   (a) Motivating real-life examples.
   (b) A learning algorithm.
   (c) The computational complexity of the algorithm.
   (d) The analysis of the algorithm.
Part II: Unsupervised Learning

Note 56 (Unsupervised Learning—Informal). The goal of unsupervised learning is to reliably infer a hidden structure from observed data.

Problem 57 (Unsupervised Learning). An unsupervised learning problem is parameterized by:

1. A set X, called the input space, whose elements are called data points.
2. A set H, called the hypothesis class, whose elements are called hypotheses.
3. A function ℓ : H × X → [0, +∞], called the loss (function).

Given an input space X, a hypothesis class H, and a loss function ℓ, the goal of unsupervised learning is to design (unsupervised) learning algorithms, i.e., functions A : X* → H, and control their (unsupervised) statistical risk, defined, for any n ∈ N and any i.i.d. sequence of discrete random variables X₁, ..., X_n, X taking values in X, by

    E[ℓ(A(X₁, ..., X_n), X) | X₁, ..., X_n],

where n is called the sample size, X₁, ..., X_n are called samples, and the sequence (X₁, ..., X_n) is called the training set.

Note 58 (Intuition). Loss functions measure how effective hypotheses are at describing the structure of data points. The statistical risk measures how effectively an algorithm can learn hidden structures of new "average" data points after observing a sequence of sample data points.

Note 59 (Deterministic vs randomized algorithms). While for most of our problems it is sufficient to consider deterministic algorithms, for some learning problems it is helpful to consider randomized algorithms, which we define below.

Definition 60 (Randomized algorithm). With the same notation as Problem 57 (Unsupervised Learning), a randomized (unsupervised) learning algorithm is a function A : X* × [0, 1] → H, and its (unsupervised) statistical risk is defined, for any n ∈ N, any i.i.d. sequence of discrete random variables X₁, ..., X_n, X taking values in X, and any [0, 1]-valued random variable U independent of X₁, ..., X_n, X, by

    E[ℓ(A(X₁, ..., X_n, U), X) | X₁, ..., X_n],

where U is called a random seed.

Remark 61 (Random seeds). Random seeds U are typically chosen as uniform random variables on [0, 1]. Since these are continuous random variables, and we only reviewed discrete random variables in Section 1, we will not rely too heavily on their properties, but only recall here two important facts.

1. There exists a way to transform any uniform random variable U on [0, 1] into any number k ∈ N of independent uniform random variables U₁, ..., U_k on [0, 1] such that, if U was independent of a set S of random variables, then U₁, ..., U_k are still independent of the random variables in S.
2. For any set Z and any discrete distribution D, there exists a way to transform any uniform random variable U on [0, 1] into a Z-valued random variable Z with distribution D such that, if U was independent of a set S of random variables, then Z is still independent of the random variables in S.
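Point 2 of Remark 61 is often realized via the inverse-CDF transform. The following is a minimal sketch of that construction (the distribution D below is an arbitrary example):

```python
import random

def sample_from(density, u):
    """Map a uniform seed u in [0,1] to a value z with P[Z = z] = density[z]."""
    cumulative = 0.0
    for z, pz in density.items():
        cumulative += pz
        if u <= cumulative:
            return z
    return z   # guard against floating-point round-off when u is ~1

D = {"a": 0.5, "b": 0.3, "c": 0.2}
u = random.random()          # the random seed U
print(sample_from(D, u))     # "a" w.p. 0.5, "b" w.p. 0.3, "c" w.p. 0.2
```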
3 Density estimation

Note 62 (Density Estimation—Informal). The goal of density estimation is to satisfactorily infer a hidden probability density function from observed data.

Note 63 (Some applications). Anomaly detection in network security (in cybersecurity, detecting unusual patterns in network traffic is crucial for identifying potential threats or attacks; density estimation is often used to model normal network behavior), customer behavior modeling (running an online store and tracking whether customers make a purchase), medical diagnosis (estimating the probability density of certain biological markers or test results in a healthy population to identify abnormalities), speech recognition (learning the probability distribution of different sound patterns to correctly interpret speech), stock market modeling (in finance, estimating the probability density of stock price movements helps investors understand market volatility or predict price jumps), weather forecasting (estimating the probability density of different weather conditions, like temperature, rainfall, etc.), website user behavior (tracking specific user interactions, such as whether users click on a specific advertisement).

Note 64 (Bernoulli Density Estimation). For the sake of simplicity, we will focus on the case where the densities are supported on only two values (Bernoulli density estimation).

Problem 65 (Bernoulli Density Estimation). A Bernoulli density estimation problem is an unsupervised learning problem where:

1. The input space is X := {0, 1}.
2. The hypothesis class is H := {p_b}_{b∈[0,1]}, where, for all b ∈ [0, 1], p_b : X → [0, 1] is defined, for all x ∈ X, by

    p_b(x) := b if x = 1,   and   p_b(x) := 1 − b if x = 0,

i.e., p_b is the density of a Bernoulli distribution with parameter b.

3. The loss ℓ : H × X → [0, +∞] is defined, for all (p, x) ∈ H × X, by

    ℓ(p, x) := −ln(p(x)) if p(x) > 0,   and   ℓ(p, x) := +∞ if p(x) = 0.

Remark 66 (Intuition behind minimizing this loss). The loss ℓ(p, x) of a pair (p, x) ∈ H × X should be high whenever the hypothesis p does a poor job of representing the structure of the data point x. In the Bernoulli density estimation problem described above, the loss −ln(p(x)) is very high whenever p(x) is close to 0. This makes sense because, if our hypothesis that "the probability p(x) of observing x is close to zero" were correct, then it would be highly unlikely to observe x. Therefore, if we do observe x, it means that our hypothesis that p(x) ≈ 0 was probably way off, and we should pay a high loss for our bad hypothesis.

Remark 67 (Intuition behind the next theorem). In addition to making intuitive sense, the loss above makes formal sense because, as the next theorem shows, the discrete density p : X → [0, 1] that minimizes the risk E[ℓ(p, X)] = E[−ln p(X)] is precisely the density of X. Therefore, finding algorithms with small risk translates exactly into our initial goal of inferring the density that generates the data points. Moreover, the theorem establishes a lower limit on what is achievable: no algorithm can have zero risk in general; even the density that generates the data incurs a high expected loss if the amount of uncertainty of that distribution (i.e., its entropy) is large.
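The claim of Remark 67 can be observed numerically: for a Bernoulli density, the risk of a candidate density p_c is E[−ln p_c(X)] = −b ln(c) − (1 − b) ln(1 − c), which is smallest at c = b, where it equals H(X). A sketch with an arbitrarily chosen true parameter b = 0.3:

```python
import math

b = 0.3                                               # true (hidden) parameter
risk = lambda c: -b * math.log(c) - (1 - b) * math.log(1 - c)

candidates = [0.1, 0.2, 0.3, 0.4, 0.5]
for c in candidates:
    print(c, round(risk(c), 4))                       # risk of hypothesis p_c

best = min(candidates, key=risk)
print(best, round(risk(best), 4))                     # 0.3, ~0.6109 = H(X)
```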
Theorem 68 (Optimal density estimation). Consider the Bernoulli density estimation problem (Problem 65), fix b ∈ [0, 1], and let X be a Bernoulli random variable with parameter b. Then, recalling that the entropy of X is denoted by H(X), we have

    argmin_{p∈H} E[−ln p(X)] = {p_b}   and   E[−ln p_b(X)] = H(X).

Proof. Consider first the case b = 0. Recalling the conventions −ln(0) := +∞ and ±∞ · 0 := 0 that we introduced to compute expectations, the LOTUS yields

    E[−ln p₀(X)] = (−ln p₀(0)) p₀(0) + (−ln p₀(1)) p₀(1) = −0 · 1 + ∞ · 0 = 0

and, for all p ∈ H \ {p₀},

    E[−ln p(X)] = (−ln p(0)) p₀(0) + (−ln p(1)) p₀(1) = −ln(p(0)) · 1 − ln(p(1)) · 0 > 0.

This proves the theorem when b = 0. Next, if b = 1, analogous computations show that

    E[−ln p₁(X)] = 0   and   ∀p ∈ H \ {p₁}, E[−ln p(X)] > 0.

This proves the theorem when b = 1. Finally, in the general case of b ∈ (0, 1), we have that E[−ln p₀(X)] = +∞, E[−ln p₁(X)] = +∞, and, for all densities p ∈ H \ {p₀, p₁},

    E[−ln p(X)] = E[−ln p_b(X)] + E[ln p_b(X) − ln p(X)] = E[−ln p_b(X)] + E[ln(p_b(X)/p(X))] = H(X) + KL(p_b∥p);

noting that the first expectation on the right-hand side is the entropy H(X) of X, which is constant in p, and the second one is the KL divergence between p_b and p, which is minimized (and vanishes) if and only if p = p_b by Gibbs' inequality (Theorem 48), we conclude that the unique discrete density p that minimizes the left-hand side is precisely p = p_b.

3.1 Learning Algorithm: Maximum Likelihood Estimation (MLE)

Algorithm 1 Maximum Likelihood Estimation (MLE)
1: input: (x₁, ..., x_n) ∈ X*
2: compute the estimated parameter b̂(x₁, ..., x_n) ∈ argmax_{b∈[0,1]} p_b(x₁) · ... · p_b(x_n)
3: output: p_{b̂(x₁,...,x_n)} ∈ H

Remark 69 (Intuition behind Maximum Likelihood Estimation). If X₁, ..., X_n is an i.i.d. sequence of X-valued discrete random variables with common discrete density p, then, for any sequence of realizations x₁, ..., x_n ∈ X, the probability of observing these n realizations is

    P[X₁ = x₁, ..., X_n = x_n] = P[X₁ = x₁] · ... · P[X_n = x_n] = p(x₁) · ... · p(x_n),

where the right-hand side is called the likelihood of x₁, ..., x_n, since it quantifies exactly how likely it is to observe x₁, ..., x_n by drawing n samples according to p. The maximum likelihood estimation algorithm (Algorithm 1), then, simply selects, among all possible densities, one that maximizes the likelihood that the observed data is drawn i.i.d. according to that density. Intuitively, this should be a good guess of the hidden density.
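A direct, if naive, implementation of line 2 of Algorithm 1 is to maximize the log-likelihood over a grid of candidate parameters. The sketch below (with a made-up sample of size 1000 drawn with true parameter 0.3) does exactly that; the maximizer coincides with the empirical average of the samples, as Lemma 70 below proves:

```python
import math, random

def log_likelihood(b, xs):
    """ln L(b) for Bernoulli samples xs in {0,1}."""
    if b in (0.0, 1.0):                    # avoid log(0) at the endpoints
        return 0.0 if all(x == b for x in xs) else -math.inf
    return sum(x * math.log(b) + (1 - x) * math.log(1 - b) for x in xs)

random.seed(0)
xs = [1 if random.random() < 0.3 else 0 for _ in range(1000)]   # true b = 0.3

grid = [i / 1000 for i in range(1001)]                # candidate parameters
b_hat = max(grid, key=lambda b: log_likelihood(b, xs))
print(b_hat, sum(xs) / len(xs))                       # the two numbers coincide
```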
Lemma 70 (MLE returns the mean). Consider the Bernoulli density estimation problem (Problem 65) and let X₁, ..., X_n be an i.i.d. sequence of Bernoulli random variables. Then, the parameter b̂(X₁, ..., X_n) of the output density p_{b̂(X₁,...,X_n)} of the Maximum Likelihood Estimation algorithm (Algorithm 1) run with training samples (X₁, ..., X_n) is the empirical average of the samples; i.e.:

    b̂(X₁, ..., X_n) = (1/n) Σ_{t=1}^n X_t.

Proof. With the convention that 0⁰ := 1, let L : [0, 1] → [0, 1] be the so-called likelihood function defined, for all b ∈ [0, 1], by

    L(b) := Π_{t=1}^n p_b(X_t) = Π_{t=1}^n b^{X_t} (1 − b)^{1−X_t}.

The MLE algorithm picks a maximizer of L to use as the parameter of its predicted Bernoulli density. Therefore, to complete the proof, we simply have to prove that the likelihood function L is always maximized at b = (1/n) Σ_{t=1}^n X_t.

1. Consider first the special case where X₁ = ... = X_n = 0. In this case, for all b ∈ [0, 1], L(b) = (1 − b)ⁿ. Hence, the likelihood function L is maximized at b = 0 = (1/n) Σ_{t=1}^n X_t.

2. Secondly, consider the special case where X₁ = ... = X_n = 1. In this case, for all b ∈ [0, 1], L(b) = bⁿ. Hence, the likelihood function L is maximized at b = 1 = (1/n) Σ_{t=1}^n X_t.

3. Now, consider the general case where there exist two distinct i, j ∈ {1, ..., n} such that X_i = 0 and X_j = 1. In this case the likelihood function L satisfies L(0) = 0 = L(1) and, for all b ∈ (0, 1), L(b) > 0. Therefore, L is maximized at a point b ∈ (0, 1). Since the natural logarithm is an increasing function, maximizing L on (0, 1) is equivalent to maximizing its logarithm, i.e., the function l : (0, 1) → R defined, for all b ∈ (0, 1), by

    l(b) := ln(L(b)) = ln(Π_{t=1}^n b^{X_t} (1 − b)^{1−X_t}) = Σ_{t=1}^n (X_t ln(b) + (1 − X_t) ln(1 − b)) = (Σ_{t=1}^n X_t) ln(b) + (n − Σ_{t=1}^n X_t) ln(1 − b).

Then, to maximize the function l, we simply study the sign of its derivative, obtaining, for all b ∈ (0, 1),

    l′(b) = (Σ_{t=1}^n X_t)/b − (n − Σ_{t=1}^n X_t)/(1 − b) ≥ 0   ⟺   b ≤ (1/n) Σ_{t=1}^n X_t.

This shows that l (and, therefore, L) is maximized at b = (1/n) Σ_{t=1}^n X_t, completing the proof.

Remark 71 (Complexity of MLE). Since the parameter of the Bernoulli density returned by MLE is simply the empirical average of the samples, the computational complexity of MLE is O(n), where n is the sample size.

Theorem 72 (Analysis of MLE). Consider the Bernoulli density estimation problem (Problem 65) and let X₁, ..., X_n be an i.i.d. sequence of Bernoulli random variables with (unknown) parameter b ∈ (0, 1). Then, for any δ ∈ (0, 1), if n ≥ 2 ln(2/δ) max{b⁻², (1 − b)⁻²}, the unsupervised statistical risk of MLE (Algorithm 1) satisfies, with probability at least 1 − δ,

    E[−ln p_{b̂(X₁,...,X_n)}(X) | X₁, ..., X_n] ≤ H(X) + √((8/n) ln(2/δ)),

where we recall that H(X) is the entropy of X.

Proof. Fix any δ ∈ (0, 1) and n ≥ 2 ln(2/δ) max{b⁻², (1 − b)⁻²}. To lighten the notation, let X := (X₁, ..., X_n). By Lemma 70, we have that b̂(X) = (1/n) Σ_{t=1}^n X_t. Hence, Hoeffding's inequality (Fact 42) applied with ε = √((1/2n) ln(2/δ)) yields that

    P[ |b̂(X) − b| ≤ √((1/2n) ln(2/δ)) ] ≥ 1 − δ.

Using this fact, to prove the result, we only need to show that

    E[−ln p_{b̂(X)}(X) | X](ω) ≤ H(X) + √((8/n) ln(2/δ))   for all ω ∈ { |b̂(X) − b| ≤ √((1/2n) ln(2/δ)) }.

To do so, fix an arbitrary outcome ω ∈ { |b̂(X) − b| ≤ √((1/2n) ln(2/δ)) } and let x := X(ω). Then, by our assumption on n,

    |b̂(x) − b| ≤ √((1/2n) ln(2/δ)) ≤ ½ min{b, 1 − b}.    (1)

Therefore, the properties of conditional expectations (see Theorem 40) and the independence of X and X imply

    E[−ln p_{b̂(X)}(X) | X](ω)
    = E[−ln p_{b̂(X)}(X) | X = x]                                        (definition of conditional expectation)
    = E[−ln p_{b̂(x)}(X) | X = x]                                        (constancy on level sets)
    = E[−ln p_{b̂(x)}(X)]                                                (P[X = x] > 0 + independence)
    = E[−ln p_b(X)] + E[ln p_b(X) − ln p_{b̂(x)}(X)]                     (linearity)
    = H(X) + E[ln(p_b(X)/p_{b̂(x)}(X))]                                  (definition of entropy)
    = H(X) + b ln(b/b̂(x)) + (1 − b) ln((1 − b)/(1 − b̂(x)))             (LOTUS)
    = H(X) + b ln(1 + (b − b̂(x))/b̂(x)) + (1 − b) ln(1 + (b̂(x) − b)/(1 − b̂(x)))
    ≤ H(X) + b (b − b̂(x))/b̂(x) + (1 − b)(b̂(x) − b)/(1 − b̂(x))         (∀x > −1, ln(1 + x) ≤ x)
    ≤ H(X) + b |b − b̂(x)|/(b − |b̂(x) − b|) + (1 − b)|b̂(x) − b|/(1 − b − |b − b̂(x)|)
    ≤ H(X) + b |b − b̂(x)|/(b/2) + (1 − b)|b̂(x) − b|/((1 − b)/2)        (Equation (1))
    = H(X) + 4|b̂(x) − b| ≤ H(X) + √((8/n) ln(2/δ)).                     (Equation (1) again)

Remark 73 (MLE is asymptotically optimal). The previous result, together with Theorem 68, shows that, if b ∈ (0, 1), then, as the number of samples n grows, the performance (i.e., the risk) of the hypothesis returned by MLE becomes closer and closer to that of the density that generates the data. In other words, when n approaches infinity, MLE learns to behave optimally. What if b ∈ {0, 1}?
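The guarantee of Theorem 72 can be simulated. The sketch below (with arbitrarily chosen b, n, and δ satisfying the theorem's condition on n) computes the conditional risk in closed form, E[−ln p_{b̂}(X) | X₁, ..., X_n] = −b ln(b̂) − (1 − b) ln(1 − b̂), and compares it to the bound:

```python
import math, random

b, n, delta = 0.4, 10_000, 0.05
H = -b * math.log(b) - (1 - b) * math.log(1 - b)        # entropy of X

random.seed(1)
b_hat = sum(random.random() < b for _ in range(n)) / n  # MLE = empirical mean

risk = -b * math.log(b_hat) - (1 - b) * math.log(1 - b_hat)
bound = H + math.sqrt(8 / n * math.log(2 / delta))
print(risk <= bound, round(risk, 4), round(bound, 4))   # True, w.p. >= 1 - delta
```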
4 Clustering

Note 74 (Clustering—Informal). The goal of clustering is to satisfactorily partition the input space into a finite number of subsets based on some notion of closeness of the observed data.

Note 75 (Some applications). Customer segmentation in e-commerce (grouping customers based on purchasing behavior, demographics, or interests to tailor marketing strategies) or banking (identifying customer segments for personalized financial products or credit risk assessment), image segmentation for medical imaging (separating different regions in medical scans; e.g., tumors vs. healthy tissue) or computer vision (dividing images into segments for object detection or recognition tasks), social network analysis (community detection or influencer identification), recommender systems (movie, content, or product recommendations), genomics and bioinformatics (clustering genes with similar expression patterns for biological research, or grouping patients based on genetic markers or clinical symptoms).

Problem 76 (k-means clustering). A (k-means) clustering problem is an unsupervised learning problem where:

1. The input space X is a bounded subset of R^d, where d ∈ N is the known problem dimension.
2. The hypothesis class is the set H containing all k-dimensional vectors (c₁, ..., c_k) ∈ X^k with distinct components (called centers), where k is the known number of clusters.
3. The loss ℓ : H × X → [0, +∞] is defined, for all h := (c₁, ..., c_k) ∈ H and x ∈ X, by

    ℓ(h, x) := min_{i∈{1,...,k}} ∥x − cᵢ∥².

Remark 77 (Intuition behind minimizing this loss). A good hypothesis should place one center in each cluster, at the "center" of that cluster, thus guaranteeing that each element of each cluster has a center close to it.

4.1 Learning Algorithm: k-means++

Algorithm 2 k-means++
1: input: (x₁, ..., x_n) ∈ X*, random seed u ∈ [0, 1]
2: initialization:
3:   let n′ := |{x₁, ..., x_n}| and x′₁, ..., x′_{n′} be the n′ distinct elements of {x₁, ..., x_n}
4:   if n′ ≤ k then
5:     let c₁⁽⁰⁾ := x′₁, ..., c_{n′}⁽⁰⁾ := x′_{n′}, and let c_{n′+1}⁽⁰⁾, ..., c_k⁽⁰⁾ be arbitrary points in X such that c₁⁽⁰⁾, ..., c_k⁽⁰⁾ are all distinct
6:   else
7:     let d(x) denote the shortest distance from a data point x ∈ X to the closest center we have already chosen
8:     choose an initial center c₁⁽⁰⁾ uniformly at random from x′₁, ..., x′_{n′}
9:     for i = 2, ..., k do
10:      let Cᵢ := {x′₁, ..., x′_{n′}} \ {c₁⁽⁰⁾, ..., c_{i−1}⁽⁰⁾}
11:      let pᵢ : Cᵢ → [0, 1] be the discrete density defined, for all x ∈ Cᵢ, by pᵢ(x) := d(x)² / Σ_{x′∈Cᵢ} d(x′)²
12:      draw cᵢ⁽⁰⁾ at random from Cᵢ according to pᵢ
13:   let K⁽⁰⁾ := {c₁⁽⁰⁾, ..., c_k⁽⁰⁾} be the set of cluster centers after the initialization phase
14: main loop:
15: for iteration t = 1, 2, ... do
16:   for each cluster i = 1, ..., k do
17:     let Cᵢ⁽ᵗ⁾ be the set of points in {x′₁, ..., x′_{n′}} that are closer to cᵢ⁽ᵗ⁻¹⁾ than to any other center c ∈ K⁽ᵗ⁻¹⁾
18:     let cᵢ⁽ᵗ⁾ be the center of mass of all points in Cᵢ⁽ᵗ⁾, i.e., cᵢ⁽ᵗ⁾ := (1/|Cᵢ⁽ᵗ⁾|) Σ_{x∈Cᵢ⁽ᵗ⁾} x
19:   let K⁽ᵗ⁾ := {c₁⁽ᵗ⁾, ..., c_k⁽ᵗ⁾} be the updated set of cluster centers
20:   if K⁽ᵗ⁾ = K⁽ᵗ⁻¹⁾ (i.e., if no changes in the clusters' centers occurred) then break
21: output: (c₁⁽ᵗ⁾, ..., c_k⁽ᵗ⁾) ∈ H
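A compact Python sketch of Algorithm 2 follows. It assumes the inputs are distinct and n′ > k (so the degenerate initialization branch is skipped), models the random seed with numpy's generator, ignores the possibility of a cluster becoming empty during the main loop, and runs on a made-up dataset of three blobs:

```python
import numpy as np

def kmeans_pp(xs, k, rng):
    n = len(xs)
    # initialization (lines 7-12): D^2-weighted sampling of the k starting
    # centers; already-chosen centers have d(x)^2 = 0, so they are not redrawn
    centers = [xs[rng.integers(n)]]
    for _ in range(2, k + 1):
        d2 = np.min([((xs - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(xs[rng.choice(n, p=d2 / d2.sum())])
    centers = np.array(centers)
    # main loop (lines 15-20): assign each point to its closest center,
    # then move each center to the center of mass of its cluster
    while True:
        dists = ((xs[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new = np.array([xs[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new, centers):
            return new
        centers = new

rng = np.random.default_rng(0)
blobs = np.vstack([rng.normal(m, 0.2, size=(50, 2)) for m in (0.0, 3.0, 6.0)])
print(kmeans_pp(blobs, k=3, rng=rng).round(2))   # one center near each blob
```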
Remark 78 (Intuition behind k-means++). The k-means++ algorithm is built around two ideas. First, initialize your "candidate centers" at random, favoring configurations where the centers are far away from each other. This way, there is a reasonable chance you will end up with one center per cluster immediately after initialization. Then, refine the position of each candidate center by computing the center of mass of the cluster consisting of all points in the training set closest to it, and replacing it with this new candidate center. This way, if there is only one candidate center per cluster, all these candidate centers will converge to a central position within their cluster. If, instead, there is a cluster with, say, two candidate centers and a second cluster with none, then one of the two candidate centers (the one closest to the cluster with no candidate centers) will be pulled towards the second cluster and eventually to a central position within it.

Fact 79 (Complexity of k-means++). The worst-case computational complexity of k-means++ is exponential in n.

Note 80 (Complexity of k-means++). In practice, k-means++ terminates quickly. It is not fully known why.

Fact 81 (Analysis of k-means++). Consider the k-means clustering problem (Problem 76), and let r be the diameter of X. Then, there exist two numerical constants c₁, c₂ > 0 such that, for any n ∈ N and δ ∈ (0, 1), the unsupervised statistical risk of k-means++ (Algorithm 2, denoted by A below) satisfies, with probability at least 1 − δ,

    E[ℓ(A(X₁, ..., X_n, U), X) | X₁, ..., X_n] ≤ (c₁ log k) inf_{h∈H} E[ℓ(h, X)] + c₂ r² √(k log(n/δ)/n).

Note 82 (Performance of k-means++). k-means++ is asymptotically optimal up to logarithmic terms.

Part III: Supervised Learning

Note 83 (Supervised Learning—Informal). The goal of supervised learning is to reliably predict the hidden label of data points after observing a finite amount of examples of labeled data.

Problem 84 (Supervised Learning). A supervised learning problem is parameterized by:

1. A set X, called the input space, whose elements are called data points.
2. A set Y, called the label set, whose elements are called labels.
3. A set H of functions h : X → Y, called the hypothesis class, whose elements are called hypotheses.⁶
4. A function ℓ : H × (X × Y) → [0, +∞], called the loss (function).

Given an input space X, a label set Y, a hypothesis class H, and a loss function ℓ, the goal of supervised learning is to design (supervised) learning algorithms, i.e., functions A : (X × Y)* → H, and control their (supervised) statistical risk, defined, for any n ∈ N and any i.i.d. sequence of discrete random variables (X₁, Y₁), ..., (X_n, Y_n), (X, Y) taking values in X × Y, by

    E[ℓ(A((X₁, Y₁), ..., (X_n, Y_n)), (X, Y)) | (X₁, Y₁), ..., (X_n, Y_n)],

where the pairs (X₁, Y₁), ..., (X_n, Y_n) are called examples and the sequence S_n := ((X₁, Y₁), ..., (X_n, Y_n)) is called the training set.

⁶ Functions h : X → Y are also called predictors.

Note 85 (Intuition). Loss functions measure how effective hypotheses are at predicting the labels of data points. The statistical risk measures how effectively an algorithm can learn to label never-before-seen "average" data points after observing a sequence of examples of labeled data points.

Definition 86 (Bayes optimal predictor). Consider a supervised learning problem (Problem 84) and let F be the set of all functions f : X → Y. We say that f* : X → Y is a Bayes optimal predictor if, for all x ∈ X, it holds that

    f* ∈ argmin_{f∈F} E[ℓ(f, (X, Y)) | X = x],

and we call its average loss E[ℓ(f*, (X, Y))] the Bayes risk.

Remark 87 (Existence of Bayes optimal predictors). Note that f* might not exist. For example, for some x ∈ X, it might happen that argmin_{f∈F} E[ℓ(f, (X, Y)) | X = x] = ∅.
Proposition 88 (Optimality of f*). Consider a supervised learning problem (Problem 84) and let f* be a Bayes optimal predictor. Then, for any other predictor h : X → Y, it holds that

    E[ℓ(f*, (X, Y))] ≤ E[ℓ(h, (X, Y))].

Proof. Fix an arbitrary h : X → Y. By definition of Bayes optimal predictor, we get, for all x ∈ X,

    E[ℓ(f*, (X, Y)) | X = x] ≤ E[ℓ(h, (X, Y)) | X = x].

Hence, by definition of conditional expectation,

    E[ℓ(f*, (X, Y)) | X] ≤ E[ℓ(h, (X, Y)) | X].

Putting this inequality together with the monotonicity of the expectation (M), and recalling the tower property (T), we obtain

    E[ℓ(f*, (X, Y))] =(T) E[E[ℓ(f*, (X, Y)) | X]] ≤(M) E[E[ℓ(h, (X, Y)) | X]] =(T) E[ℓ(h, (X, Y))].

Remark 89 (If f* is so good, why don't we just use it?). Note that, to compute f*, we would need to know the distribution of (X, Y), which we do not. This is where optimization and machine learning differ. We want to be able to output predictors as good as the best one without having access to all the information needed to solve the corresponding optimization problem, relying instead only on examples (X₁, Y₁), ..., (X_n, Y_n).

5 Classification

Note 90 (Classification—Informal). The goal of classification is to satisfactorily predict the correct label (out of finitely many possible labels) of data points after observing a finite amount of examples of labeled data.

Note 91 (Some applications). Spam detection, sentiment analysis, image recognition (object classification), medical diagnosis, credit scoring (loan approval), handwritten digit recognition, language identification.

Note 92 (Binary classification). For the sake of simplicity, we will focus on the case where the label set contains only two labels (binary classification).

Problem 93 (Binary classification). A binary classification problem is a supervised learning problem where:

1. The input space X is a subset of R^d.
2. The label set Y is the set {−1, +1}.
3. The hypothesis class H is a set of functions h : X → {−1, +1} (whose elements are also called classifiers).
4. The loss ℓ : H × (X × {−1, +1}) → [0, +∞] is the so-called zero-one loss, defined, for any h ∈ H and (x, y) ∈ X × {−1, +1}, by ℓ(h, (x, y)) := I{h(x) ≠ y}.

Proposition 94 (Bayes optimal predictor/risk in classification). Consider a binary classification problem (Problem 93) and let f* be a Bayes optimal predictor. Denoting by η : X → [0, 1] the function defined, for all x ∈ X, by η(x) := P[Y = +1 | X = x], it holds that:

1. For any x ∈ X,

    f*(x) = +1 if η(x) ≥ 1/2,   and   f*(x) = −1 if η(x) < 1/2.

2. The Bayes risk is E[min{η(X), 1 − η(X)}] ≤ 1/2.
Proof. Fix an arbitrary x ∈ X. By the "constancy on level sets" property of conditional expectations (C), the definitions of Bayes optimal predictor (B) and zero-one loss (Z), and the fact that functions f : X → Y take only the two values −1 and +1 (V), we have

    E[I{f*(x) ≠ Y} | X = x] =(C) E[ℓ(f*, (X, Y)) | X = x] =(B) min_{f∈F} E[ℓ(f, (X, Y)) | X = x]
    =(C) min_{f∈F} E[ℓ(f, (x, Y)) | X = x] =(Z) min_{f∈F} E[I{f(x) ≠ Y} | X = x] =(V) min_{y∈{−1,+1}} E[I{y ≠ Y} | X = x].

This chain of equalities shows that f*(x) satisfies E[I{f*(x) ≠ Y} | X = x] = min_{y∈{−1,+1}} E[I{y ≠ Y} | X = x], i.e., that

    f*(x) ∈ argmin_{y∈{−1,+1}} E[I{y ≠ Y} | X = x].

To conclude the proof, then, it suffices to show that argmin_{y∈{−1,+1}} E[I{y ≠ Y} | X = x] is a singleton containing only the number +1, if η(x) ≥ 1/2, and −1, if η(x) < 1/2. Indeed, we have that

    argmin_{y∈{−1,+1}} E[I{y ≠ Y} | X = x]
    = argmin_{y∈{−1,+1}} E[I{Y = +1}I{y = −1} + I{Y = −1}I{y = +1} | X = x]
    = argmin_{y∈{−1,+1}} ( I{y = −1} E[I{Y = +1} | X = x] + I{y = +1} E[I{Y = −1} | X = x] )      (by linearity)
    = argmin_{y∈{−1,+1}} ( I{y = −1} P[Y = +1 | X = x] + I{y = +1} P[Y = −1 | X = x] )
    = argmin_{y∈{−1,+1}} ( I{y = −1} η(x) + I{y = +1}(1 − η(x)) )                                 (by definition of η)
    = {+1} if η(x) ≥ 1/2,   and   {−1} if η(x) < 1/2.

This proves the first point. For the second point, proceeding in a similar fashion, we get

    E[ℓ(f*, (X, Y))] = E[I{f*(X) ≠ Y}]                                                            (definition of zero-one loss)
    = E[E[I{Y = +1}I{f*(X) = −1} + I{Y = −1}I{f*(X) = +1} | X]]                                   (tower property)
    = E[E[I{Y = +1}I{f*(X) = −1} | X]] + E[E[I{Y = −1}I{f*(X) = +1} | X]]                         (linearity)
    = E[I{f*(X) = −1} E[I{Y = +1} | X]] + E[I{f*(X) = +1} E[I{Y = −1} | X]]                       ("taking out what is known")
    = E[I{f*(X) = −1} P[Y = +1 | X]] + E[I{f*(X) = +1} P[Y = −1 | X]]
    = E[I{f*(X) = −1} η(X)] + E[I{f*(X) = +1}(1 − η(X))]                                          (definition of η)
    = E[I{η(X) < 1/2} η(X) + I{η(X) ≥ 1/2}(1 − η(X))]                                             (linearity & point 1)
    =(◦) E[I{η(X) < 1/2} min{η(X), 1 − η(X)} + I{η(X) ≥ 1/2} min{η(X), 1 − η(X)}]
    = E[min{η(X), 1 − η(X)}],                                                                     (I{η(X) < 1/2} + I{η(X) ≥ 1/2} = 1)

where (◦) follows from the fact that min{η(X), 1 − η(X)} = η(X) if η(X) < 1/2 and min{η(X), 1 − η(X)} = 1 − η(X) if η(X) ≥ 1/2. The proof is concluded by observing that, for all a ∈ [0, 1], min{a, 1 − a} ≤ 1/2.

5.1 Learning Algorithm: Nearest Neighbor (NN)

Algorithm 3 Nearest Neighbor (NN)
1: input: s_n := ((x₁, y₁), ..., (x_n, y_n)) ∈ (X × Y)*
2: for all x ∈ X, pick π(s_n, x) ∈ argmin_{t∈{1,...,n}} ∥x − x_t∥
3: output: h_{NN,s_n} : X → Y defined, for all x ∈ X, by h_{NN,s_n}(x) := y_{π(s_n,x)}

Remark 95 (Intuition behind NN). The NN algorithm is built around the idea that the closer two data points are, the more likely it is that their labels coincide.

Remark 96 (Complexity of NN). One can prove that the worst-case computational complexity of NN is of order n^d, i.e., exponential in the dimension d. We will not present a proof of this claim, but only mention that this is due to the fact that, to build a Nearest-Neighbor classifier, one has to partition X into so-called Voronoi cells (see Figure 3), where each training data point x_t is contained in a cell and the border between cells is the set of points in X that are equidistant from the closest data points. To alleviate this issue, one can simply store the training set s_n and, when asked to evaluate h_{NN,s_n} at a new data point x, find the label y_t of the closest point x_t to x in O(nd) time, where O(d) is the time needed to compute each distance ∥x − x_t∥ and n is the number of times that this operation needs to be repeated.

Figure 3: Voronoi diagram for a training set in [−1, 1]².
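A short Python sketch of Algorithm 3, implemented with the store-the-training-set strategy of Remark 96 (O(nd) per evaluation); the toy training set below is made up:

```python
import numpy as np

def nn_classifier(xs, ys):
    """Return h_{NN,s_n}: the function labeling x with the label of its
    nearest training point (ties broken by the smallest index t)."""
    xs, ys = np.asarray(xs, dtype=float), np.asarray(ys)
    def h(x):
        t = np.argmin(np.linalg.norm(xs - np.asarray(x, dtype=float), axis=1))
        return ys[t]                       # y_{pi(s_n, x)}
    return h

xs = [(-0.8, -0.5), (-0.6, -0.9), (0.7, 0.4), (0.9, 0.8)]
ys = [-1, -1, +1, +1]
h = nn_classifier(xs, ys)
print(h((-0.7, -0.7)), h((0.8, 0.6)))      # -1 +1
```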
Theorem 97 (Analysis of NN). Consider the binary classification problem (Problem 93), and let X := [−1, 1]^d and H be the set of all functions h : X → Y. Then, there exists a vanishing sequence of positive numbers ε₁, ε₂, ... such that, for any n ∈ N, the supervised statistical risk of NN (Algorithm 3) satisfies

    E[ℓ(h_{NN,S_n}, (X, Y))] ≤ 2 E[ℓ(f*, (X, Y))] + ε_n.

Proof. Define the auxiliary function η : X → [0, 1], for all x ∈ X, by η(x) := P[Y = +1 | X = x]. In words, η(x) returns the probability that the test label is +1, given that the corresponding test data point is x. Let also S ⊆ X be the support of the distribution of X. We prove this result in three steps (Claims 98, 99, and 101). We begin by proving that the restriction of η to S is Lipschitz.

Claim 98. There exists a constant c > 0 such that, for all x, x′ ∈ S, |η(x) − η(x′)| ≤ c ∥x − x′∥.

To prove the claim, note that since X is discrete, S is finite; therefore min_{x,x′∈S, x≠x′} ∥x − x′∥ > 0. Hence, defining c := 1/min_{x,x′∈S, x≠x′} ∥x − x′∥, we immediately obtain that, for all x, x′ ∈ S such that x ≠ x′ (if x = x′ the claim is trivially true),

    |η(x) − η(x′)| = |η(x) − η(x′)| · (1/∥x − x′∥) · ∥x − x′∥ ≤ c ∥x − x′∥,

since |η(x) − η(x′)| ≤ 1 and 1/∥x − x′∥ ≤ c. This proves Claim 98.

Next, we prove that the supervised statistical risk of NN can be upper bounded (in expectation) by twice the Bayes risk plus a constant times the expected distance between the test data point X and one of the training data points X_t closest to X.

Claim 99. E[ℓ(h_{NN,S_n}, (X, Y))] ≤ 2 E[ℓ(f*, (X, Y))] + c E[∥X_{π(S_n,X)} − X∥].

To prove the claim, we will leverage the following fact.

Fact 100. E[ℓ(h_{NN,S_n}, (X, Y))] = E[η(X_{π(S_n,X)})(1 − η(X)) + (1 − η(X_{π(S_n,X)}))η(X)].

By the Lipschitzness of η, for all x, x′ ∈ S, we immediately get the two inequalities

    η(x′) ≤ η(x) + c ∥x − x′∥   and   1 − η(x′) ≤ 1 − η(x) + c ∥x − x′∥,

which in turn give

    η(x′)(1 − η(x)) + (1 − η(x′))η(x) ≤ 2η(x)(1 − η(x)) + c ∥x − x′∥.