Lecture1_annotated-merged.pdf
INTRODUCTION TO MACHINE LEARNING
COMPSCI 4ML3, LECTURE 1
HASSAN ASHTIANI

CURVE-FITTING
PREDICT THE HEIGHT OF A PERSON GIVEN HER/HIS AGE?
COLLECT A SET OF “DATA POINTS”
REPRESENT DATA POINT $i$ BY $(x^i, y^i)$
E.G., IF INDIVIDUAL $i$ IS 25 YEARS OLD AND 175 CM TALL, THEN WE CAN WRITE $(x^i, y^i) = (25, 175)$
WE HAVE COLLECTED $n$ DATA POINTS, $S = \{(x^i, y^i)\}_{i=1}^n$
GIVEN A NEW $x$, “PREDICT” ITS $y$?

CURVE-FITTING
[FIGURE: HEIGHT VS. AGE]

LINEAR CURVE-FITTING
[FIGURE: HEIGHT VS. AGE]

LINEAR CURVE-FITTING
$x$ IS CALLED AN INPUT, $y$ IS CALLED AN OUTPUT/RESPONSE
[FIGURE: HEIGHT ($y$) VS. AGE ($x$)]

NON-LINEAR CURVE-FITTING
WHICH ONE IS BETTER?
[FIGURE: TWO PANELS, HEIGHT VS. AGE]
LINEAR VS NON-LINEAR?

MULTIDIMENSIONAL CURVE-FITTING
$x$ AND/OR $y$ COULD BE MULTI-DIMENSIONAL
FOR EXAMPLE, PREDICT THE HEIGHT BASED ON THE AGE AND WEIGHT
E.G., IN THE RIGHT PICTURE, $x \in \mathbb{R}^2$, $y \in \mathbb{R}$

CURVE-FITTING IS EVERYWHERE
FITTING A CURVE ENABLES INTERPOLATION AND EXTRAPOLATION
THIS IS A TYPE OF SUPERVISED LEARNING/PREDICTION
PREDICTION, BECAUSE WE PREDICT $y$ GIVEN $x$
SUPERVISED, BECAUSE $\{(x^i, y^i)\}_{i=1}^n$ IS GIVEN

PREDICTION IS EVERYWHERE
FACE RECOGNITION. $x$: IMAGE, $x \in$ ? $y$: IDENTITY, $y \in$ ?
BIOMEDICAL IMAGING. $x$: MRI IMAGE, $x \in$ ? $y$: CANCEROUS?, $y \in$ ?
SPAM DETECTION. $x \in$ ? $y \in$ ?
OIL/STOCK PRICE PREDICTION. E.G., GIVEN THE PRICE FOR $t = 1, 2, \dots, 1000$, PREDICT THE PRICE FOR $t = 1001$. $x \in$ ? $y \in$ ?
TRANSLATING FRENCH TEXT TO ENGLISH TEXT (INPUT? OUTPUT?)
SPEECH RECOGNITION (INPUT? OUTPUT?)
TEXT TO SPEECH (INPUT? OUTPUT?)
NETFLIX RECOMMENDATIONS (INPUT? OUTPUT?)
IS EVERYTHING IN AI JUST PREDICTION?! NOT EXACTLY (E.G., PREDICTION VS CONTROL)
PREDICTION IS EVERYWHERE AND … PREDICTION METHODS CAN BE QUITE DIFFERENT FOR EACH APPLICATION

LINEAR REGRESSION
PREDICTION WHERE $x \in \mathbb{R}^d$, $y \in \mathbb{R}$ (COULD BE $\mathbb{R}^k$)
THE CASE $x, y \in \mathbb{R}$ IS CALLED SIMPLE LINEAR REGRESSION
BEST WAY OF FITTING A LINE?
[FIGURE: HEIGHT VS. AGE]

LINEAR REGRESSION
WHICH LINE IS BETTER?
MAYBE THE ONE THAT “FITS THE DATA” BETTER?
[FIGURE: HEIGHT VS. AGE]
LINEAR REGRESSION
$\hat{y}^i$ IS THE PREDICTION OF THE MODEL
LET $d^i = |\hat{y}^i - y^i|$
BEST LINE MINIMIZES $\sum_{i=1}^n (d^i)^2$?
OTHER OPTIONS?

LINEAR REGRESSION
ORDINARY LEAST SQUARES (OLS) METHOD
MINIMIZE $\sum_{i=1}^n (d^i)^2 = \sum_{i=1}^n (y^i - \hat{y}^i)^2$
HOW ABOUT?
MINIMIZE $\sum_{i=1}^n |y^i - \hat{y}^i|$
MINIMIZE $\sum_{i=1}^n \left( \frac{\hat{y}^i}{y^i} + \frac{y^i}{\hat{y}^i} \right)$ OR MINIMIZE $\sum_{i=1}^n \left( \frac{\hat{y}^i + a}{y^i + a} + \frac{y^i + a}{\hat{y}^i + a} \right)$
MINIMIZE $\sum_{i=1}^n \log \frac{\hat{y}^i + 0.0001}{y^i + 0.0001}$
…

1-D ORDINARY LEAST SQUARES
$x, y \in \mathbb{R}$
FIND $a, b \in \mathbb{R}$ SUCH THAT $\hat{y} = ax + b \approx y$
WE ARE GIVEN $\{(x^i, y^i)\}_{i=1}^n$
FIND/LEARN $a, b$ FROM THE DATA
$\min_{a,b} \sum_{i=1}^n (a x^i + b - y^i)^2$
MINIMIZE $f(a, b) = \sum_{i=1}^n (a x^i + b - y^i)^2$

1-D ORDINARY LEAST SQUARES
OPTIMAL $a$ AND $b$:
$a = \frac{\overline{xy} - \bar{x} \cdot \bar{y}}{\overline{x^2} - \bar{x}^2} = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}$, $b = \bar{y} - a\bar{x}$
$\bar{x} = \frac{1}{N} \sum_i x^i$, $\bar{y} = \frac{1}{N} \sum_i y^i$, $\overline{xy} = \frac{1}{N} \sum_i x^i y^i$, $\overline{x^2} = \frac{1}{N} \sum_i (x^i)^2$

INTRODUCTION TO MACHINE LEARNING
COMPSCI 4ML3, LECTURE 2
HASSAN ASHTIANI

LINEAR CURVE-FITTING (REVIEW)
[FIGURE: HEIGHT ($y$) VS. AGE ($x$). SOURCE: HTTPS://BRADLEYBOEHMKE.GITHUB.IO/HOML/REGULARIZED-REGRESSION.HTML]

ORDINARY LEAST SQUARES (1 DIMENSION)
$\{(x^i, y^i)\}_{i=1}^n$, $x^i \in \mathbb{R}$, $y^i \in \mathbb{R}$
$\min_{a,b} \sum_{i=1}^n (a x^i + b - y^i)^2$
$a = \frac{\overline{xy} - \bar{x} \cdot \bar{y}}{\overline{x^2} - \bar{x}^2} = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}$, $b = \bar{y} - a\bar{x}$

ORDINARY LEAST SQUARES (D DIMENSIONS)
ASSUME $x \in \mathbb{R}^d$, $y \in \mathbb{R}$
INSTEAD OF A LINE, WE NEED TO FIT A HYPERPLANE!
HYPERPLANE EQUATION:
$\hat{y} = w_0 + \sum_{j=1}^d w_j x_j = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_d x_d$
$w_0$: THE $y$-INTERCEPT (THE BIAS)
[FIGURE SOURCE: http://www.cs.cornell.edu/courses/cs4758/2013sp/materials/cs4758-lin-regression.pdf]

EXAMPLE
ESTIMATE THE PRICE OF OIL BASED ON TWO PROPERTIES: (1) PRICE OF GOLD AND (2) WORLD GDP
$x \in$ ?
INPUT DATA: $\{(x^i, y^i)\}_{i=1}^n$
$\hat{y} = w_0 + w_1 x_1 + w_2 x_2$
FIND $w_0, w_1, w_2$ THAT GIVE THE BEST ESTIMATE

ORDINARY LEAST SQUARES (D-DIMENSIONS)
SIMPLIFICATION: HOMOGENEOUS HYPERPLANES, $w_0 = 0$
$\hat{y} = w_1 x_1 + w_2 x_2 + \dots + w_d x_d$
$\hat{y} = \langle w, x \rangle = w^T x = x^T w$, $w = (w_1, \dots, w_d)$
FIND/LEARN THE $w_j$’S FROM THE DATA
$\min_{w_1, \dots, w_d \in \mathbb{R}} \sum_{i=1}^n (\hat{y}^i - y^i)^2$
OPTIMIZE DIRECTLY?
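The closed form for the 1-D case ($a = \mathrm{Cov}(x, y) / \mathrm{Var}(x)$, $b = \bar{y} - a\bar{x}$) can be checked numerically. The following is a minimal sketch in plain Python; the function name `fit_line` and the age/height numbers are made up for illustration and are not part of the lecture.

```python
def fit_line(xs, ys):
    """Return (a, b) minimizing sum((a*x + b - y)**2) via the closed form."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    xy_bar = sum(x * y for x, y in zip(xs, ys)) / n
    x2_bar = sum(x * x for x in xs) / n
    var_x = x2_bar - x_bar ** 2           # Var(x)
    if var_x == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    a = (xy_bar - x_bar * y_bar) / var_x  # Cov(x, y) / Var(x)
    b = y_bar - a * x_bar
    return a, b


# Hypothetical (age, height) pairs that lie exactly on height = 8 * age + 70.
ages = [2.0, 5.0, 10.0, 15.0]
heights = [86.0, 110.0, 150.0, 190.0]
a, b = fit_line(ages, heights)
```

Because the made-up points lie exactly on a line, the fit recovers its slope and intercept; with noisy data the same formulas return the least-squares line.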
$\min_{w_1, \dots, w_d \in \mathbb{R}} \sum_{i=1}^n (\hat{y}^i - y^i)^2 =$ ?

MATRIX FORM OLS (ORDINARY LEAST SQUARES)
$X_{n \times d} = \begin{bmatrix} x_1^1 & \cdots & x_d^1 \\ \vdots & \ddots & \vdots \\ x_1^n & \cdots & x_d^n \end{bmatrix}$, $Y_{n \times 1} = \begin{bmatrix} y^1 \\ \vdots \\ y^n \end{bmatrix}$, $W_{d \times 1} = \begin{bmatrix} w_1 \\ \vdots \\ w_d \end{bmatrix}$

PREDICTION IN VECTOR FORM
FIND/LEARN $W_{d \times 1}$ FROM THE DATA (SOON)
GIVEN $x$ AND $W = W_{d \times 1}$, WHAT SHOULD $\hat{y}$ BE?
$\hat{y} =$ ?

FINDING W
OBJECTIVE: $\sum_{i=1}^n (\hat{y}^i - y^i)^2 = \sum_{i=1}^n (\langle w, x^i \rangle - y^i)^2$
DEFINE $\Delta = \begin{bmatrix} \Delta_1 \\ \vdots \\ \Delta_n \end{bmatrix} = XW - Y = \begin{bmatrix} \hat{y}^1 - y^1 \\ \vdots \\ \hat{y}^n - y^n \end{bmatrix}$

FINDING W
OBJECTIVE FUNCTION: $\sum_{i=1}^n (\Delta_i)^2$
$\min_{W \in \mathbb{R}^{d \times 1}} \sum_{i=1}^n \Delta_i^2 = \min_{W \in \mathbb{R}^{d \times 1}} \langle \Delta, \Delta \rangle = \min_{W \in \mathbb{R}^{d \times 1}} \|\Delta\|_2^2 = \min_{W \in \mathbb{R}^{d \times 1}} \|XW - Y\|_2^2$

OLS SOLUTION
$W^{LS} = (X^T X)^{-1} X^T Y$
VERIFY DIMENSIONS
COMPARE TO $a = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}$ FOR $d = 1$
WHAT IF $X^T X$ IS NOT INVERTIBLE?

INTRODUCTION TO MACHINE LEARNING
COMPSCI 4ML3, LECTURE 3
HASSAN ASHTIANI

ORDINARY LEAST SQUARES (D-DIMENSIONS)
ASSUME $x \in \mathbb{R}^d$, $y \in \mathbb{R}$
INSTEAD OF A LINE, WE NEED TO FIT A HYPERPLANE!
WHY ARE THE LINES VERTICAL?
ANY DIFFERENT IF WE MINIMIZE THE DISTANCE TO THE HYPERPLANE?
[FIGURE SOURCE: http://www.cs.cornell.edu/courses/cs4758/2013sp/materials/cs4758-lin-regression.pdf]

TAKING THE “DERIVATIVE”
REAL-VALUED FUNCTION OF A VECTOR: GRADIENT
VECTOR-VALUED FUNCTION OF A VECTOR: JACOBIAN

MATRIX/VECTOR CALCULUS
$u, v \in \mathbb{R}^n$, $g(u) = u^T v$, $\nabla_u(g) =$ ?
$A \in \mathbb{R}^{m \times n}$, $u \in \mathbb{R}^n$, $g(u) = Au$, $\nabla_u(g) =$ ?
$A \in \mathbb{R}^{n \times n}$, $u \in \mathbb{R}^n$, $g(u) = u^T A u$, $\nabla_u(g) =$ ?

SOLVING OLS
$f(W) = \|XW - Y\|_2^2$, TO BE MINIMIZED OVER $W \in \mathbb{R}^{d \times 1}$. WHAT IS $\nabla f$?

SOLVING OLS
$W^{LS} = (X^T X)^{-1} X^T Y$
DEGENERATE CASE: WHEN IS $X^T X$ NOT INVERTIBLE?

BIAS/INTERCEPT TERM
WE ARE MISSING THE BIAS TERM ($w_0$)
$\min_{w_0, w_1, \dots, w_d \in \mathbb{R}} \sum_{i=1}^n (w_1 x_1^i + \dots + w_d x_d^i + w_0 - y^i)^2$
MATRIX FORM WITH THE BIAS TERM?
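For the SOLVING OLS slides, one way to answer “WHAT IS $\nabla f$?” is sketched below. It uses two standard identities, $\nabla_u (u^T v) = v$ and $\nabla_u (u^T A u) = (A + A^T) u$, and assumes $X^T X$ is invertible; it is a sketch of the usual derivation, not necessarily the one worked in class.

```latex
\begin{aligned}
f(W) &= \|XW - Y\|_2^2 = W^T X^T X W - 2\, Y^T X W + Y^T Y \\
\nabla_W f(W) &= 2\, X^T X W - 2\, X^T Y \\
\nabla_W f(W) = 0 \;&\Longrightarrow\; X^T X W = X^T Y
\;\Longrightarrow\; W^{LS} = (X^T X)^{-1} X^T Y
\end{aligned}
```

Checking dimensions: $X^T X$ is $d \times d$ and $X^T Y$ is $d \times 1$, so $W^{LS}$ is $d \times 1$ as required.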
$\min_{W \in \mathbb{R}^{d \times 1},\, w_0 \in \mathbb{R}} \left\| XW + \begin{bmatrix} w_0 \\ \vdots \\ w_0 \end{bmatrix} - Y \right\|_2^2$
EXAMPLE

BIAS/INTERCEPT TERM
ADD A NEW AUXILIARY DIMENSION TO THE DATA
$X'_{n \times (d+1)} = \begin{bmatrix} x_1^1 & \cdots & x_d^1 & 1 \\ \vdots & \ddots & \vdots & 1 \\ x_1^n & \cdots & x_d^n & 1 \end{bmatrix}$, $W'_{(d+1) \times 1} = \begin{bmatrix} w_1 \\ \vdots \\ w_d \\ w_0 \end{bmatrix}$
SOLVE OLS: $\min_{W' \in \mathbb{R}^{(d+1) \times 1}} \|X'W' - Y\|_2^2$
$w_0$ WILL BE THE BIAS TERM!

SOME EXAMPLES
OLS NOTEBOOK

INTRODUCTION TO MACHINE LEARNING
COMPSCI 4ML3, LECTURE 4
HASSAN ASHTIANI

“NON-LINEAR” DATA?
FOR EXAMPLE, WHAT IS THE BEST DEGREE 2 POLYNOMIAL?
[FIGURE: TWO PANELS, HEIGHT VS. AGE]
HOW CAN WE REUSE THE “LEAST-SQUARES MACHINERY”?

IDEA: DATA TRANSFORMATION
WE INCREASED THE FLEXIBILITY OF OUR PREDICTOR BY A FORM OF DATA TRANSFORMATION/AUGMENTATION
$X'_{n \times (d+1)} = \begin{bmatrix} x_1^1 & \cdots & x_d^1 & 1 \\ \vdots & \ddots & \vdots & 1 \\ x_1^n & \cdots & x_d^n & 1 \end{bmatrix}$
CAN WE USE THE SAME IDEA TO MAKE OUR PREDICTOR EVEN MORE FLEXIBLE (NON-LINEAR)?
EXAMPLE

LEAST-SQUARES FOR POLYNOMIALS
IDEA: $ax^2 + bx + c$ IS STILL LINEAR WITH RESPECT TO THE PARAMETERS! (W.R.T. $a$, $b$ AND $c$)
INSTEAD OF $X_{n \times 1} = \begin{bmatrix} x^1 \\ \vdots \\ x^n \end{bmatrix}$ USE $X'_{n \times 3} = \begin{bmatrix} x^1 & (x^1)^2 & 1 \\ \vdots & \vdots & \vdots \\ x^n & (x^n)^2 & 1 \end{bmatrix}$
TREAT $X'_{n \times 3}$ AS IF IT WERE YOUR ORIGINAL INPUT DATA
WE CAN EXTEND THIS TO HIGHER DEGREE POLYNOMIALS SIMILARLY, E.G., $ax^3 + bx^2 + cx + d$
NOTEBOOK EXAMPLE

MULTIVARIATE POLYNOMIALS
HOW ABOUT WHEN $x$ IS MULTIVARIATE ITSELF?
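The LEAST-SQUARES FOR POLYNOMIALS idea (augment each input to $(x, x^2, 1)$, then run ordinary least squares) can be sketched without any libraries. The helper names `solve` and `ols`, and the toy data, are assumptions made for this illustration; the course notebook may well use a numerical library instead. The normal equations $X^T X W = X^T Y$ are solved by Gaussian elimination rather than by forming an explicit inverse.

```python
def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (X^T X is not invertible)")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back substitution
        s = sum(M[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (M[r][n] - s) / M[r][r]
    return w


def ols(X, Y):
    """Return W solving X^T X W = X^T Y for a list-of-rows design matrix X."""
    d = len(X[0])
    XtX = [[sum(row[j] * row[k] for row in X) for k in range(d)] for j in range(d)]
    XtY = [sum(row[j] * y for row, y in zip(X, Y)) for j in range(d)]
    return solve(XtX, XtY)


# Hypothetical 1-D inputs with targets generated from y = 3x^2 - 2x + 1.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [3 * x * x - 2 * x + 1 for x in xs]

# Augmented rows (x, x^2, 1): the trailing 1 carries the bias term.
X_aug = [[x, x * x, 1.0] for x in xs]
b_coef, a_coef, c_coef = ols(X_aug, ys)   # weights for x, x^2 and the constant
```

The weights come back in the column order of the augmented matrix, so the coefficient of $x$ is first and the bias is last.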
$w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + w_5 x_2^2 + w_6$
INSTEAD OF $(x_1, x_2)$ USE $(x_1, x_2, x_1 x_2, x_1^2, x_2^2, 1)$
TREAT THE NEW $X$ AS (A HIGHER-DIMENSIONAL) INPUT
INPUT DIMENSION: $d$
DEGREE OF POLYNOMIAL: $M$
NUMBER OF TERMS (MONOMIALS) OF DEGREE AT MOST $M$: $\binom{M+d}{d} = \binom{M+d}{M}$

OVERFITTING
DIVIDE THE DATA RANDOMLY INTO “TRAIN” AND “TEST” SETS
ROOT-MEAN-SQUARE ERROR FOR EACH SET:
$\sqrt{\frac{\|\hat{Y} - Y\|_2^2}{n}} = \sqrt{\frac{\sum_{i=1}^n (\hat{y}^i - y^i)^2}{n}}$
MORE DATA, LESS OVER-FITTING

THE TRADE-OFF
A POWERFUL/FLEXIBLE CURVE-FITTING METHOD:
SMALL TRAINING ERROR
REQUIRES MORE TRAINING DATA TO GENERALIZE
OTHERWISE LARGE TEST ERROR
A LESS FLEXIBLE CURVE-FITTING METHOD:
LARGER TRAINING ERROR
REQUIRES LESS TRAINING DATA
SMALLER DIFFERENCE BETWEEN TRAINING AND TEST ERROR
THE SO-CALLED “BIAS-VARIANCE” TRADE-OFF

THE CASE OF MULTIVARIATE POLYNOMIALS
ASSUME $M \gg d$
NUMBER OF TERMS (MONOMIALS): $\approx \left(\frac{M}{d}\right)^d$
#TRAINING SAMPLES $\approx$ #PARAMETERS $\approx \left(\frac{M}{d}\right)^d$
#TRAINING SAMPLES SHOULD INCREASE EXPONENTIALLY WITH $d$
SUSCEPTIBLE TO OVER-FITTING…
AN EXAMPLE OF THE CURSE OF DIMENSIONALITY!
WE CAN SAY THE SAMPLE COMPLEXITY OF LEARNING MULTIVARIATE POLYNOMIALS IS EXPONENTIAL IN $d$
ORTHOGONAL TO COMPUTATIONAL COMPLEXITY

INTRODUCTION TO MACHINE LEARNING
COMPSCI 4ML3, LECTURE 5
HASSAN ASHTIANI

MODEL SELECTION: HOW TO AVOID OVERFITTING?
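Looking back at the monomial count for multivariate polynomials, the growth in the number of parameters is easy to tabulate. This is a small sketch using only the Python standard library; the function names are invented for illustration.

```python
from itertools import combinations_with_replacement
from math import comb


def num_monomials(d, M):
    """Number of monomials of degree at most M in d variables: C(M + d, d)."""
    return comb(M + d, d)


def enumerate_monomials(d, M):
    """List monomials as tuples of variable indices, e.g. (0, 1) means x1*x2."""
    terms = []
    for degree in range(M + 1):
        terms.extend(combinations_with_replacement(range(d), degree))
    return terms


# d = 2, M = 2 gives the six terms 1, x1, x2, x1^2, x1*x2, x2^2.
terms_2_2 = enumerate_monomials(2, 2)
count_2_2 = num_monomials(2, 2)

# Growth for a fixed degree M = 3 as the input dimension d increases.
growth = [(d, num_monomials(d, 3)) for d in (1, 2, 5, 10, 20)]
```

The enumeration agrees with the binomial formula for the two-variable, degree-2 example, and the `growth` list shows how quickly the parameter count rises with the input dimension even at a fixed, small degree.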
SELECTING $M$ (THE COMPLEXITY OF THE MODEL) BASED ON $d$ (DIMENSION) AND $n$ (NUMBER OF SAMPLES)
MORE PRACTICALLY, TRY SEVERAL OPTIONS FOR $M$
USE A HOLDOUT (EVALUATION) SAMPLE
NEVER USE TEST DATA TO TUNE PARAMETERS!

AVOID OVERFITTING WITH REGULARIZED LEAST SQUARES
$\min_{W \in \mathbb{R}^d} \|XW - Y\|_2^2 + \lambda \|W\|_2^2$
ENCOURAGE A SOLUTION WITH A SMALLER NORM
$W^{RLS} = (X^T X + \lambda I)^{-1} X^T Y$
EXERCISE: PROVE THAT THIS IS THE OPTIMAL SOLUTION
DOES THE INVERSE ALWAYS EXIST? YES, FOR $\lambda > 0$! (EXERCISE: PROVE)
HOW TO CHOOSE $\lambda$?

POLYNOMIAL CURVE-FITTING REVISITED
MAP THE INPUTS $x^i$ TO A HIGHER DIMENSIONAL SPACE
A KIND OF “PRE-PROCESSING” OF THE DATA
DO LINEAR REGRESSION IN THE HIGH-DIMENSIONAL SPACE
EQUIVALENT TO PERFORMING NON-LINEAR REGRESSION IN THE ORIGINAL SPACE
MAP $\phi(x): \mathbb{R}^{d_1} \mapsto \mathbb{R}^{d_2}$ WHERE $d_2 \gg d_1$
$\phi(x) = \begin{bmatrix} \phi_1(x) \\ \vdots \\ \phi_{d_2}(x) \end{bmatrix}$ IS NONLINEAR, E.G., $x \in \mathbb{R}$ AND $\phi(x) = \begin{bmatrix} x \\ x^2 \\ x^3 \\ \vdots \\ x^{d_2} \end{bmatrix}$
WHAT IF $d_2$ IS MUCH LARGER THAN THE NUMBER OF SAMPLES?

CURVE-FITTING WITH BASIS FUNCTIONS
FEATURE MAP: $\phi(x): \mathbb{R}^{d_1} \mapsto \mathbb{R}^{d_2}$, $d_2 \gg d_1$
$\Phi_{n \times d_2} = \begin{bmatrix} \phi(x^1) & \cdots & \phi(x^n) \end{bmatrix}^T$
TRAINING
$W^* = \arg\min_W \|\Phi W - Y\|_2^2 + \lambda \|W\|_2^2$
$W^* = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T Y$
PREDICTION
$\hat{y} = \langle W^*, \phi(x) \rangle = {W^*}^T \phi(x)$

OTHER CHOICES OF $\phi(x)$
PICK A FIXED (NONLINEAR) $\phi(x)$
ENCODES YOUR PRIOR KNOWLEDGE ABOUT THE DATA
FEATURE ENGINEERING!
POLYNOMIAL BASIS FUNCTIONS
GAUSSIAN BASIS FUNCTIONS: $\phi_i(x) = e^{-\frac{\|x - \mu_i\|_2^2}{2\sigma^2}}$
DFT (FFT), WAVELET FOR TIME SERIES
IS IT POSSIBLE TO LEARN THE MAPPING $\phi_i(x)$ ITSELF?
LATER, E.G., NEURAL NETWORKS

COMPUTATIONAL COMPLEXITY OF NAÏVE RLS
TRAINING: CALCULATE $W^{RLS} = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T Y$
BOTTLENECK: MATRIX INVERSION
HOW MANY OPERATIONS?
PREDICTION: $\hat{y} = \langle \phi(x), W^{RLS} \rangle$
HOW MANY OPERATIONS?
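As a rough answer to “HOW MANY OPERATIONS?”, here is one possible accounting. It assumes $\Phi$ is a dense $n \times d_2$ matrix and that only the naïve cubic-time methods are used; faster matrix algorithms would change the exponents.

```latex
\begin{aligned}
\text{form } \Phi^T \Phi + \lambda I &: \; O(n\, d_2^2) \\
\text{form } \Phi^T Y &: \; O(n\, d_2) \\
\text{invert (or solve with) the } d_2 \times d_2 \text{ matrix} &: \; O(d_2^3) \\
\text{training, total} &: \; O(n\, d_2^2 + d_2^3) \\
\text{prediction } \langle \phi(x), W^{RLS} \rangle \text{ for one } x &: \; O(d_2) \text{ once } \phi(x) \text{ is computed}
\end{aligned}
```

Both costs grow with $d_2$, which is why a very high-dimensional feature map is expensive even when regularization controls overfitting.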
REGULARIZATION ALLOWS US TO GO INTO HIGH-DIMENSIONAL SPACE WITHOUT OVERFITTING, BUT IT DOES NOT SOLVE THE COMPUTATIONAL PROBLEM

COMPUTATIONAL COMPLEXITY
MATRIX MULTIPLICATION ($N$-BY-$N$ MATRICES)
NAIVE METHOD: $O(N^3)$
STRASSEN’S ALGORITHM: $O(N^{2.8074})$
COPPERSMITH–WINOGRAD-LIKE ALGORITHMS [CURRENT BEST $O(N^{2.3728639})$]
MATRIX INVERSION
GAUSSIAN ELIMINATION: $O(N^3)$
POSSIBLE TO REDUCE IT TO MULTIPLICATION

THE COMPUTATIONAL PROBLEM
CAN WE SOLVE THE REGULARIZED LEAST SQUARES IN $\mathbb{R}^{d_2}$ WITHOUT EXPLICITLY MAPPING THE DATA INTO $\mathbb{R}^{d_2}$?
$W^* = \arg\min_{W \in \mathbb{R}^{d_2}} \|\Phi W - Y\|_2^2 + \lambda \|W\|_2^2$
SOMETHING LIKE MULTIPLICATION USING FFT
IF SO, WE COULD EVEN MAP THE DATA TO AN INFINITE DIMENSIONAL SPACE!!

FFT AND MULTIPLICATION

INTRODUCTION TO MACHINE LEARNING
COMPSCI 4ML3, LECTURE 6
HASSAN ASHTIANI
THE KERNEL TRICK
COMPUTE THE HIGH-DIMENSIONAL INNER PRODUCT EFFICIENTLY
$K(x^i, x^j) = \langle \phi(x^i), \phi(x^j) \rangle$
USE THIS AS A BUILDING-BLOCK FOR PERFORMING OTHER OPERATIONS
REWRITE THE LEAST SQUARES SOLUTION SO THAT IT ONLY USES THE INNER PRODUCT OF THE FEATURE MAPS?!
$(\Phi^T \Phi + \lambda I)^{-1} \Phi^T Y$

THE KERNEL FUNCTION
KERNEL FUNCTION FOR A MAPPING $\phi$:
EXAMPLE: $\phi(x)^T = [1, \sqrt{2} x_1, \sqrt{2} x_2, \dots, \sqrt{2} x_d, x_1^2, x_1 x_2, x_1 x_3, \dots, x_1 x_d, x_2 x_1, \dots, x_d x_d]$
SO POLYNOMIAL BASIS FUNCTIONS WITH $M = 2$
COMPUTING $K(u, v) = \langle \phi(u), \phi(v) \rangle$
COMPLEXITY OF NAÏVE CALCULATION?
BETTER APPROACH?
$k(x^i, x^j) = (1 + {x^i}^T x^j)^2 = (1 + \langle x^i, x^j \rangle)^2$
NUMBER OF OPERATIONS?

DEGREE M POLYNOMIALS
FOR HIGHER DEGREE POLYNOMIALS, WE CAN USE
$k(x^i, x^j) = (1 + {x^i}^T x^j)^M$
HOW MANY OPERATIONS?

ROADMAP
ASSUME $d_2$ IS VERY LARGE, EVEN $d_2 \gg n$
INSTEAD OF FINDING $W$, TRY TO INTRODUCE A NEW PARAMETER $a$ WHOSE SIZE IS $n$ RATHER THAN $d_2$
NOW WE HAVE $n$ PARAMETERS
FIND THE OPTIMAL $a$ AS A FUNCTION OF $K$

INTRODUCTION TO MACHINE LEARNING
COMPSCI 4ML3, LECTURE 7
HASSAN ASHTIANI

CALCULATING OLS WITH FEATURE MAPS
FEATURE MAP: $\phi(x): \mathbb{R}^{d_1} \mapsto \mathbb{R}^{d_2}$
TO CALCULATE $W^* = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T Y$ WE NEED TO INVERT A $d_2 \times d_2$ MATRIX
$d_2$ CAN BE VERY LARGE, AND EVEN INFINITE!
KERNEL TRICK: COMPUTE THE HIGH-DIMENSIONAL INNER PRODUCT EFFICIENTLY
$K(x^i, x^j) = \langle \phi(x^i), \phi(x^j) \rangle$
USE THIS AS A BUILDING-BLOCK FOR PERFORMING OTHER OPERATIONS
REWRITE THE LEAST SQUARES SOLUTION SO THAT IT ONLY USES INNER PRODUCTS IN THE FEATURE MAP!
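The claim that the degree-2 kernel equals the inner product of explicit feature vectors can be checked directly. The sketch below assumes the feature map $[1, \sqrt{2} x_i, x_i x_j]$ (all ordered pairs $i, j$), which is the map for which $(1 + \langle u, v \rangle)^2$ expands term by term; the names `phi` and `poly_kernel` and the sample points are illustrative only.

```python
from math import sqrt


def phi(x):
    """Explicit degree-2 feature map: [1, sqrt(2)*x_i ..., x_i*x_j for all i, j]."""
    d = len(x)
    feats = [1.0]
    feats += [sqrt(2.0) * xi for xi in x]
    feats += [x[i] * x[j] for i in range(d) for j in range(d)]
    return feats


def dot(u, v):
    """Plain inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(u, v, M=2):
    """k(u, v) = (1 + <u, v>)^M, computed in the original d-dimensional space."""
    return (1.0 + dot(u, v)) ** M


# Hypothetical 3-dimensional points, chosen only for illustration.
u = [1.0, 2.0, -1.0]
v = [0.5, -1.0, 3.0]
explicit = dot(phi(u), phi(v))   # touches 1 + d + d^2 features
implicit = poly_kernel(u, v)     # one d-dimensional inner product and a power
```

Both routes give the same number, but the explicit route builds vectors of length $1 + d + d^2$, while the kernel needs only one inner product in $\mathbb{R}^d$.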
ROADMAP
OPTIMAL OLS: $W^* = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T Y$
ASSUME $d_2$ IS VERY LARGE, EVEN $d_2 \gg n$
INSTEAD OF FINDING $W$, TRY TO INTRODUCE A NEW PARAMETER $a$ WHOSE SIZE IS $n$ RATHER THAN $d_2$
NOW WE HAVE $n$ PARAMETERS
FIND THE OPTIMAL $a$ AS A FUNCTION OF $K$

KERNELIZED LEAST SQUARES
$W^* = \arg\min_W \|\Phi W - Y\|_2^2 + \lambda \|W\|_2^2$
STEP 1: SHOW THERE EXISTS $a \in \mathbb{R}^n$ SUCH THAT $W^* = \Phi^T a$
IN OTHER WORDS, $W^* = \sum_i a_i \phi(x^i)$
NUMBER OF PARAMETERS? $n$ INSTEAD OF $d_2$ …
PROOF?

KERNEL FUNCTION NOTATIONS
$k(x^i, x^j) = \langle \phi(x^i), \phi(x^j) \rangle$
KERNEL OR GRAM MATRIX OF A DATA SET: $K_{n \times n} = [k(x^i, x^j)]_{i,j=1}^n = \Phi \Phi^T$
$k(x) = \Phi \phi(x) = [k(x, x^1), k(x, x^2), \dots, k(x, x^n)]^T$

PREDICTION, GIVEN $a$
$W^* = \Phi^T a$
PREDICTION ON TRAINING POINTS: $\hat{Y} = \Phi W^* =$ ?
PREDICTION FOR A NEW TEST POINT $x$: $\hat{y}(x) = \langle \phi(x), W^* \rangle =$ ?

FINDING $a$ USING DUAL FORM
$W^* = \arg\min_W \|\Phi W - Y\|_2^2 + \lambda \|W\|_2^2$
STEP 2: USE $W^* = \Phi^T a$ TO REFORMULATE THE PROBLEM IN TERMS OF FINDING $a$ (DUAL FORM)
$\min_{a \in \mathbb{R}^n} \|\Phi \Phi^T a - Y\|_2^2 + \lambda \|\Phi^T a\|_2^2$ OR…
$\min_a \|Ka - Y\|_2^2 + \lambda a^T K a$
$a^* = (K + \lambda I)^{-1} Y$ (PROOF?)
FASTER WHEN $d_2 \gg n$

CHOICE OF KERNEL
KERNEL ENCODES SIMILARITY OF POINTS $x^i$ AND $x^j$
POLYNOMIAL: $k(x, z) = (1 + x^T z)^M$
GAUSSIAN: $k(x, z) = e^{-\frac{1}{2\sigma^2} \|x - z\|_2^2} = e^{-\alpha \|x - z\|_2^2}$
$\phi(x)$ IS INFINITE DIMENSIONAL
HOW TO CHOOSE A KERNEL?
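The dual solution $a^* = (K + \lambda I)^{-1} Y$ and the prediction rule $\hat{y}(x) = k(x)^T a$ can be sketched with a Gaussian kernel in plain Python. The helper names, the choice $\alpha = 1$, the tiny $\lambda$ and the toy data are all assumptions made for this illustration; the linear system is solved by Gaussian elimination instead of forming the inverse explicitly.

```python
from math import exp


def gaussian_kernel(x, z, alpha=1.0):
    """k(x, z) = exp(-alpha * ||x - z||^2)."""
    return exp(-alpha * sum((a - b) ** 2 for a, b in zip(x, z)))


def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (M[r][n] - s) / M[r][r]
    return w


def fit_kernel_ls(X, Y, kernel, lam):
    """Return a solving (K + lam * I) a = Y, where K[i][j] = kernel(x^i, x^j)."""
    n = len(X)
    K_reg = [[kernel(X[i], X[j]) + (lam if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    return solve(K_reg, Y)


def predict(x, X, a, kernel):
    """y_hat(x) = k(x)^T a = sum_i a_i * kernel(x, x^i)."""
    return sum(ai * kernel(x, xi) for ai, xi in zip(a, X))


# Hypothetical 1-D training set (each input is a length-1 list).
X_train = [[0.0], [1.0], [2.0], [3.0]]
Y_train = [0.0, 1.0, 4.0, 9.0]
a = fit_kernel_ls(X_train, Y_train, gaussian_kernel, lam=1e-6)
train_preds = [predict(x, X_train, a, gaussian_kernel) for x in X_train]
```

Note that the feature map $\phi$ never appears: training solves an $n \times n$ system, and each prediction costs $n$ kernel evaluations. With a very small $\lambda$ the fit nearly interpolates the training targets; larger values trade training accuracy for a smoother predictor.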
IT SHOULD BE VALID (THERE MUST EXIST A $\phi$)
DOMAIN KNOWLEDGE
KERNEL FUNCTION CAPTURES “SIMILARITY” BETWEEN POINTS

GAUSSIAN KERNEL: INTUITION
JUPYTER NOTEBOOK

COMPUTATIONAL COMPLEXITY
MATRIX MULTIPLICATION ($N$-BY-$N$ MATRICES)
NAIVE METHOD: $O(N^3)$
STRASSEN’S ALGORITHM: $O(N^{2.8074})$
CURRENTLY BEST KNOWN METHOD: COPPERSMITH–WINOGRAD ALGORITHM, $O(N^{2.3755})$
MATRIX INVERSION
GAUSSIAN ELIMINATION: $O(N^3)$
POSSIBLE TO REDUCE IT TO MULTIPLICATION (SO $O(N^{2.3755})$)

COMPUTATIONAL COMPLEXITY
TRAINING COMPLEXITY ($n$ TRAINING POINTS)
REGULARIZED LEAST SQUARES: $W = (X^T X + \lambda I)^{-1} X^T Y$
KERNEL LEAST SQUARES: $a = (K + \lambda I)^{-1} Y$
TEST COMPLEXITY (FOR A SINGLE TEST POINT)
REGULARIZED LEAST SQUARES: $x^T W$
KERNEL LEAST SQUARES: $k(x)^T a$

INTRODUCTION TO MACHINE LEARNING
COMPSCI 4ML3, LECTURE 8
HASSAN ASHTIANI

HIGH-DIMENSIONAL DATA VISUALIZATION
WE HAVE $n$ DATA POINTS, EACH $d$-DIMENSIONAL
HOW CAN WE VISUALIZE $X_{n \times d}$?
FOR NOW ASSUME THERE IS NO $Y$ VALUE
IF $d = 1$ OR $d = 2$ (OR MAYBE $d = 3$)?

HIGH-DIMENSIONAL DATA
SAY YOU HAVE A DATA SET OF IMAGES
HOW TO VISUALIZE?
MAP THE DATA SET TO A LOW DIMENSIONAL SPACE
OPPOSITE OF WHAT WE DID FOR NON-LINEAR CURVE-FITTING!
HOW TO MAP THE DATA?

FINDING A GOOD MAPPING
SIMPLE CASE:
THE ORIGINAL SPACE IS 2D
THE MAPPED SPACE IS 1D
THE MAPPING IS LINEAR
EXAMPLE
IN CONTRAST TO LS? (EXAMPLE)

PROBLEM FORMULATION
MAP $x \in \mathbb{R}^d$ TO $z \in \mathbb{R}^q$ WITH $q$ [...]

[...] $= y^i$
OR EQUIVALENTLY, FOR EVERY $i \in [n]$ WE HAVE $({W^*}^T x^i) y^i > 0$
IN OTHER WORDS, THE CLASSIFICATION ERROR ON $Z$ IS 0
CAN WE FIND $W^*$ EFFICIENTLY FOR LINEARLY SEPARABLE DATA?

LINEAR PROGRAMMING
STANDARD LP PROBLEM:
$\max_{w \in \mathbb{R}^d} \langle u, w \rangle$ S.T. $Aw \geq v$
LP PROBLEMS CAN BE SOLVED EFFICIENTLY!

LP FOR CLASSIFICATION
DATA IS LINEARLY SEPARABLE
SO $\exists W^*$ S.T. $\forall i \in [n]$, $({W^*}^T x^i) y^i > 0$
SO $\exists W^*, \gamma > 0$ S.T. $\forall i \in [n]$, $({W^*}^T x^i) y^i \geq \gamma$
SO $\exists W^*$ S.T. $\forall i \in [n]$, $({W^*}^T x^i) y^i \geq 1$

LP FOR LINEAR CLASSIFICATION
DEFINE $A = [x_j^i y^i]_{n \times d}$
THEN FINDING THE OPTIMAL $W$ IS EQUIVALENT TO
$\max_{w \in \mathbb{R}^d} \langle 0, w \rangle$ S.T. $Aw \geq 1$
WE CAN USE OFF-THE-SHELF LP SOLVERS.
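The step from a margin of $\gamma > 0$ to a margin of 1 is just a rescaling of $W$, and the matrix $A$ simply stacks the rows $y^i x^i$. The sketch below illustrates only those two points; it does not solve the LP (that would be handed to an LP solver), and the function names and data are hypothetical.

```python
def constraint_matrix(X, Y):
    """Build A with rows y^i * x^i, so that (A w)_i = y^i * <w, x^i>."""
    return [[y * xj for xj in x] for x, y in zip(X, Y)]


def margins(A, w):
    """Return the vector A w."""
    return [sum(aij * wj for aij, wj in zip(row, w)) for row in A]


def rescale_to_unit_margin(A, w):
    """If A w > 0 elementwise, scale w so the smallest entry of A w becomes 1."""
    gamma = min(margins(A, w))
    if gamma <= 0:
        raise ValueError("w does not separate the data")
    return [wj / gamma for wj in w]


# Hypothetical separable data in R^2 with labels in {-1, +1}.
X = [[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-3.0, -1.0]]
Y = [1, 1, -1, -1]
A = constraint_matrix(X, Y)
w_sep = [0.1, 0.1]                        # separates the data with margin < 1
w_lp = rescale_to_unit_margin(A, w_sep)   # satisfies A w >= 1 up to rounding
```

Since the objective $\langle 0, w \rangle$ is constant, the LP is a pure feasibility problem: any $w$ satisfying $Aw \geq 1$, such as `w_lp` here, is an optimal solution.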
WHAT IF THE BEST $W$ DOES NOT GO THROUGH THE ORIGIN (I.E., IT HAS A BIAS OR INTERCEPT)?

APPROACH 2: PERCEPTRON
PROPOSED IN THE 50’S BY ROSENBLATT
PREDECESSOR OF NEURAL NETWORKS
MULTI-LAYER PERCEPTRON!

ROSENBLATT'S PERCEPTRON
IN EACH UPDATE, $W$ BECOMES “MORE CORRECT” ON $x^i$
HTTPS://PHIRESKY.GITHUB.IO/KOGSYS-DEMOS/NEURAL-NETWORK-DEMO/?PRESET=ROSENBLATT+PERCEPTRON

THE GREEDY UPDATE
IN EACH UPDATE, $W$ BECOMES “MORE CORRECT” ON $x^i$
WHAT ABOUT THE OTHER $x^j$’S?

INTRODUCTION TO MACHINE LEARNING
COMPSCI 4ML3, LECTURE 15
HASSAN ASHTIANI

LINEARLY SEPARABLE DATA
A BINARY CLASSIFICATION DATA SET $Z = \{(x^i, y^i)\}_{i=1}^n$ IS LINEARLY SEPARABLE IF THERE EXISTS $W^*$ SUCH THAT FOR EVERY $i \in [n]$ WE HAVE $\mathrm{sgn}(\langle x^i, W^* \rangle) = y^i$
OR EQUIVALENTLY, FOR EVERY $i \in [n]$ WE HAVE $({W^*}^T x^i) y^i > 0$
IN OTHER WORDS, THE CLASSIFICATION ERROR ON $Z$ IS 0
CAN WE FIND $W^*$ EFFICIENTLY FOR LINEARLY SEPARABLE DATA?

LP FOR LINEAR CLASSIFICATION
DEFINE $A = [x_j^i y^i]_{n \times d}$
THEN FINDING THE OPTIMAL $W$ IS EQUIVALENT TO
$\max_{w \in \mathbb{R}^d} \langle 0, w \rangle$ S.T. $Aw \geq 1$
WE CAN USE OFF-THE-SHELF LP SOLVERS!

NOVIKOFF, 1962
CONVERGENCE OF PERCEPTRON
#STEPS DOES NOT EXPLICITLY DEPEND ON $d$
YOU CAN FIND MORE DETAILS ABOUT THIS LECTURE IN UNDERSTANDING MACHINE LEARNING, CHAPTER 9
HTTPS://WWW.CS.HUJI.AC.IL/~SHAIS/UNDERSTANDINGMACHINELEARNING/UNDERSTANDING-MACHINE-LEARNING-THEORY-ALGORITHMS.PDF

IN 1969, MARVIN MINSKY AND SEYMOUR PAPERT ARGUED THAT IT IS IMPOSSIBLE TO LEARN THE XOR FUNCTION USING A (SINGLE-LAYER) PERCEPTRON…
ONLY GOOD FOR LINEARLY SEPARABLE DATA
STACKING PERCEPTRONS?
70’S: AI (CONNECTIONISM) WINTER

SUPPORT VECTOR MACHINES
AMONG PERFECT LINEAR SEPARATORS, WHICH ONE SHOULD WE CHOOSE?
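The perceptron slides describe the update only in words, so the following is a sketch of the textbook version of the algorithm, assumed here rather than copied from the lecture: start at $W = 0$ and, whenever a point has $y^i \langle W, x^i \rangle \leq 0$, set $W \leftarrow W + y^i x^i$. The function name, the epoch cap and the data are illustrative.

```python
def perceptron(X, Y, max_epochs=1000):
    """Perceptron for homogeneous separators (no bias term).

    Starts at w = 0 and, for any point with y^i * <w, x^i> <= 0, applies the
    update w <- w + y^i * x^i. Returns (w, number_of_updates).
    """
    w = [0.0] * len(X[0])
    updates = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(X, Y):
            if y * sum(wj * xj for wj, xj in zip(w, x)) <= 0:
                w = [wj + y * xj for wj, xj in zip(w, x)]
                updates += 1
                mistakes += 1
        if mistakes == 0:      # every point satisfies y^i * <w, x^i> > 0
            return w, updates
    raise RuntimeError("no separator found; data may not be linearly separable")


# Hypothetical separable data in R^2 with labels in {-1, +1}.
X = [[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-3.0, -1.0]]
Y = [1, 1, -1, -1]
w, updates = perceptron(X, Y)
```

The update is “greedy” in the sense used on the slides: after it, $y^i \langle W, x^i \rangle$ increases by exactly $\|x^i\|_2^2$, so $W$ is more correct on $x^i$, while its effect on the other points is not controlled. The epoch cap is a safeguard for non-separable data, where the loop would otherwise never stop.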
SUPPORT VECTOR MACHINES
PICK THE LINEAR SEPARATOR THAT MAXIMIZES THE “MARGIN”
MORE ROBUST TO “PERTURBATION”
LESS PRONE TO OVERFITTING
WORKS WELL FOR HIGH-DIMENSIONAL DATA (?)
MORE ON THAT LATER!

DISTANCE OF A POINT TO A HYPERPLANE
THE EUCLIDEAN DISTANCE BETWEEN A POINT $x$ AND THE HYPERPLANE PARAMETRIZED BY $W$ IS (WHY?)
$\frac{|W^T x + b|}{\|W\|_2}$
THE DECISION BOUNDARY OF A LINEAR CLASSIFIER IS DETERMINED BY THE DIRECTION OF $W$ (NOT $\|W\|_2$)
ASSUME $\|W\|_2 = 1$, THEN THE DISTANCE IS $|W^T x + b|$

MAXIMUM MARGIN HYPERPLANE
LET THE HYPERPLANE BE PARAMETRIZED BY $W$
ASSUME $\|W\|_2 = 1$
$W$ HAS A $\gamma$ MARGIN IF
$W^T x + b > \gamma$ FOR EVERY BLUE $x$, AND
$W^T x + b < -\gamma$ FOR EVERY RED $x$

THE MARGIN
$Z = \{(x^i, y^i)\}_{i=1}^n$, $y \in \{-1, +1\}$, $\|W\|_2 = 1$

MAXIMIZING THE MARGIN

THE VERSION WITH “BIAS”
WE COULD HAVE ALSO ADDED A DUMMY “1” FEATURE TO ALL POINTS SO AS TO ACCOUNT FOR THE BIAS/INTERCEPT

SENSITIVITY TO OUTLIERS
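The distance formula $|W^T x + b| / \|W\|_2$ and the resulting margin of a separator on a labelled data set can be computed in a few lines. This is a sketch with invented function names and made-up numbers; it evaluates the margin of a given hyperplane and does not search for the maximum-margin one.

```python
from math import sqrt


def distance_to_hyperplane(w, b, x):
    """Euclidean distance from x to the hyperplane {z : <w, z> + b = 0}."""
    norm_w = sqrt(sum(wj * wj for wj in w))
    return abs(sum(wj * xj for wj, xj in zip(w, x)) + b) / norm_w


def margin(w, b, X, Y):
    """Smallest value of y^i * (<w, x^i> + b) / ||w||_2 over the data set.

    The result is positive only if (w, b) classifies every point correctly.
    """
    norm_w = sqrt(sum(wj * wj for wj in w))
    return min(y * (sum(wj * xj for wj, xj in zip(w, x)) + b) / norm_w
               for x, y in zip(X, Y))


# Hypothetical example: the hyperplane 3*x1 + 4*x2 - 5 = 0 in R^2.
w, b = [3.0, 4.0], -5.0
dist = distance_to_hyperplane(w, b, [3.0, 4.0])   # |9 + 16 - 5| / 5

X = [[3.0, 4.0], [2.0, 1.0], [0.0, 0.0], [-1.0, 1.0]]
Y = [1, 1, -1, -1]
gamma = margin(w, b, X, Y)
```

Dividing by $\|W\|_2$ makes the result independent of how $W$ is scaled, which is why the slides can assume $\|W\|_2 = 1$ without loss of generality.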