Lecture 3 ML Foundations PDF
Document Details
AI6103
Tags
Related
- Human Centric ML Ops Canvas PDF
- cxo-ai-ml-indicator-cross-persona-emea-report-v1-de.pdf
- Artificial Intelligence and Machine Learning Applications in Smart Production (PDF)
- Machine Learning Engineer II - GSK Job Spec PDF
- AI Enabled IoT: The Future of Connected Devices PDF
- AI for Software Engineers Lecture 2 PDF
Summary
This document is a lecture outline for an AI6103 Machine Learning course. It covers fundamental concepts of machine learning, including information theory, basics of machine learning, linear regression, ridge regression, logistic regression, and optimization via gradient descent.
Full Transcript
AI6103 Machine Learning Fundamentals

Lecture Outline
- Information Theory
- Basics of Machine Learning
- Linear Regression
- Ridge Regression
- Logistic Regression

Information Theory
The scientific study of the quantification, storage, and communication of digital information. How can we communicate through a noisy channel? How can we encode information into binary form efficiently? We will only scratch the surface of this vast and rich subject.

Entropy
The degree of uncertainty, or "chaos / surprise / information", in a random variable: the expectation of the negative log probability,
$H(P(X)) = E[-\log P(X)] = -\int P(x) \log P(x)\, dx$
For discrete variables,
$H(P(X)) = -\sum_i P(X = x_i) \log P(X = x_i)$
Example: a fair die with probabilities {1/6, 1/6, 1/6, 1/6, 1/6, 1/6} has
$H = -\tfrac{1}{6} \log \tfrac{1}{6} \times 6 = 0.78$
while a biased die with probabilities {1/12, 1/12, 1/12, 1/12, 1/3, 1/3} has
$H = -\tfrac{1}{12} \log \tfrac{1}{12} \times 4 - \tfrac{1}{3} \log \tfrac{1}{3} \times 2 = 0.67$
(base-10 logarithms). There is more uncertainty when the distribution is closer to uniform.

Entropy: Relation to Event Encoding
If we observe 26 letters with equal probability, we can use $\log_2 26 = -\log_2 \tfrac{1}{26}$ bits to encode each character. Since we cannot use fractional bits, we round up to $\lceil \log_2 26 \rceil = 5$: A = 00000, B = 00001, C = 00010, D = 00011, etc. To encode three letters, we need 15 bits.
However, if one letter is more common than others, we can design the encoding so that more frequent letters use fewer bits, e.g. A = 001, B = 0001, C = 10100, D = 10011, etc. Then BAD = 000100110011 (12 bits). We use fewer bits in expectation because less frequent letters have longer encodings.
$-\log_2 P(A)$ is the "information content" of event A: the number of bits we need to tell people that this event happened. Its expectation is the entropy,
$H(P) = E[-\log P(X)] = -\sum_i P(X = x_i) \log P(X = x_i)$

Cross-Entropy
For two probability distributions $P$ and $Q$, the cross-entropy is
$H(Q, P) = E_Q[-\log P(X)] = -\sum_i Q(X = x_i) \log P(X = x_i)$
Interpretation: we designed an encoding scheme for the probability distribution $P$, but the actual distribution is $Q$. How many bits do we need to encode the information?

Kullback–Leibler Divergence (Relative Entropy)
A measure of the difference between distributions:
$KL(P \,\|\, Q) = E_P\left[\log \tfrac{P(X)}{Q(X)}\right] = \sum_i P(X = x_i) \log \tfrac{P(X = x_i)}{Q(X = x_i)}$
For continuous distributions with probability density functions $P$ and $Q$,
$KL(P \,\|\, Q) = \int P(x) \log \tfrac{P(x)}{Q(x)}\, dx$

KL Divergence and Cross-Entropy
$H(Q, P) = -\sum_i Q(X = x_i) \log P(X = x_i)$
$= -\sum_i Q(X = x_i) \log P(X = x_i) + \sum_i Q(X = x_i) \log Q(X = x_i) - \sum_i Q(X = x_i) \log Q(X = x_i)$
$= \sum_i Q(X = x_i) \log \tfrac{Q(X = x_i)}{P(X = x_i)} + H(Q(X))$
$= H(Q) + KL(Q \,\|\, P)$

KL Divergence Is Asymmetric
Forward KL:
$KL(P \,\|\, Q) = \sum_i P(X = x_i) \log \tfrac{P(X = x_i)}{Q(X = x_i)}$
Where $P$ is large and $Q$ is small, the divergence is large; where $Q$ is large and $P$ is small, the divergence is small.

Jensen–Shannon Divergence
A symmetric measure of the difference between distributions:
$M = \tfrac{1}{2}(P + Q)$
$JS(Q \,\|\, P) = \tfrac{1}{2} KL(P \,\|\, M) + \tfrac{1}{2} KL(Q \,\|\, M)$
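As a concrete illustration (not part of the original slides), the following minimal NumPy sketch reproduces the entropy values for the fair and biased dice and checks the identity $H(Q, P) = H(Q) + KL(Q \,\|\, P)$. The function names and the explicit use of base-10 logarithms (matching the 0.78 and 0.67 figures above) are my own choices.

```python
import numpy as np

def entropy(p, log=np.log10):
    """H(P) = -sum_i P(x_i) log P(x_i); zero-probability terms contribute 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * log(p[nz]))

def cross_entropy(q, p, log=np.log10):
    """H(Q, P) = -sum_i Q(x_i) log P(x_i)."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return -np.sum(q * log(p))

def kl_divergence(p, q, log=np.log10):
    """KL(P || Q) = sum_i P(x_i) log(P(x_i) / Q(x_i))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * log(p[nz] / q[nz]))

fair = np.ones(6) / 6
biased = np.array([1/12, 1/12, 1/12, 1/12, 1/3, 1/3])

print(entropy(fair))    # ~0.78, as on the slide (base-10 log)
print(entropy(biased))  # ~0.67

# Cross-entropy decomposes as H(Q, P) = H(Q) + KL(Q || P)
lhs = cross_entropy(biased, fair)
rhs = entropy(biased) + kl_divergence(biased, fair)
print(np.isclose(lhs, rhs))  # True
```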
Machine Learning Basics

Intelligence?
Human intelligence has an innate aspect and an environmental aspect. Chimpanzees, dolphins, or parrots can demonstrate some level of intelligence, but they cannot reach the human level of intelligence even if training starts very early. As humans, we observe and interact with the world for a long time and learn knowledge accumulated over thousands of years.

Machine Learning: Analogies
The "innate" aspect is to specify a machine learning model, which defines the parameters that can be learned and the parameters that are determined before learning ("hyperparameters"). The "experiential" aspect is to learn the model parameters from data, a.k.a. training.

Machine Learning Basics
There is a function $y = f(x)$ that we want to approximate. $x$ is the input to the machine learning model; $y$ is what the machine learning model tries to predict. The exact function is unknown, but we have access to historic data $(\boldsymbol{x}^{(1)}, y^{(1)}), (\boldsymbol{x}^{(2)}, y^{(2)}), \ldots, (\boldsymbol{x}^{(n)}, y^{(n)})$. Our goal is to find this function from data.
- Example: automobile insurance risk. $x$ = characteristics of the car, such as make, model, year, safety features, engine type, length, weight, height, fuel efficiency, etc.; $y$ = probability of an accident in a year, or average cost of repairs.
- Example: heart disease diagnosis. $x$ = characteristics of the patient, such as age, sex, chest pain location, cholesterol level, blood sugar, etc.; $y$ = medical diagnosis made by a human doctor.
- Example: image classification. $x$ = image pixels; $y$ = predefined classes, such as dog, cat, truck, airplane, apple, orange, etc.
- Example: tweet emotion recognition. $x$ = text of the tweet; $y$ = human label of the reflected emotion: fear, anger, joy, sadness, contempt, disgust, and surprise (Ekman's basic emotions).

Classification vs. Regression
Classification: the output $y$ is discrete and represents distinct categories. Examples: image categories (dog, cat, truck, airplane, apple, orange); emotion categories (fear, anger, joy, sadness, contempt, disgust, and surprise).

MNIST Classification

Classification for Skin Problems

Classification vs. Regression
Regression: the output $y$ represents a continuously varying quantity, typically a real number. Examples: stock price in a week; heart disease severity: no disease (0), very mild (1), mild (2), severe (3), immediate danger (4). Key difference: the measurement of error.

Ranking
Recommender systems provide a list of recommended items. We care not only about the first item, but about the first N items.

Linear Regression
We have a p-dimensional feature vector $\boldsymbol{x} = (x_1, x_2, \ldots, x_p)$ and a scalar output $y$. Our model is linear:
$y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \beta_0$
In vector form,
$y = \boldsymbol{\beta}^\top \boldsymbol{x} + \beta_0$
We have $n$ data points $(\boldsymbol{x}^{(1)}, y^{(1)}), (\boldsymbol{x}^{(2)}, y^{(2)}), \ldots, (\boldsymbol{x}^{(n)}, y^{(n)})$. Assumption: every data point follows the same model,
$y^{(i)} = \boldsymbol{\beta}^\top \boldsymbol{x}^{(i)} + \beta_0$
The central question of machine learning: how do we find the parameters $\boldsymbol{\beta}$ and $\beta_0$?
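Before answering that question, here is a small NumPy sketch (not from the slides) of what such a dataset and linear model look like. The ground-truth parameters beta_true and beta0_true, the noise level, and the function name predict are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth parameters (for illustration only)
beta_true = np.array([2.0, -1.0, 0.5])   # beta_1 ... beta_p, with p = 3
beta0_true = 4.0                          # intercept beta_0

n, p = 100, 3
X = rng.normal(size=(n, p))                                       # n feature vectors x^(i)
y = X @ beta_true + beta0_true + rng.normal(scale=0.1, size=n)    # noisy labels y^(i)

def predict(x, beta, beta0):
    """Linear model: y_hat = beta^T x + beta_0."""
    return beta @ x + beta0

print(predict(X[0], beta_true, beta0_true), y[0])  # prediction vs. observed label
```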
One Small Tweak ...
Adding one dimension to $\boldsymbol{x}$: $\boldsymbol{x} = (x_1, x_2, \ldots, x_p, 1)^\top$. Adding one dimension to $\boldsymbol{\beta}$: $\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_p, \beta_0)^\top$. Our model becomes
$y^{(i)} = \boldsymbol{\beta}^\top \boldsymbol{x}^{(i)}$

Geometric Intuition
Find the straight line that is closest to the observed data points. In 2D, find the plane that is closest to the observed data points.

Higher Dimensions?
We can't visualize them because we live in a 3D world. Geometric intuition: find the hyperplane that is closest to the observed data points. Key point: the function we fit is linear. A unit change in $x_i$ always causes a change in $y$ of magnitude $\beta_i$, no matter the value of $\boldsymbol{x}$ or $y$.

The Loss Function
We must define a measure of error: how wrong is our model? Mean squared error: the average squared distance between $y^{(i)}$ and $\hat{y}^{(i)}$,
$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - \hat{y}^{(i)}\right)^2 = \frac{1}{n}\left\|\boldsymbol{y} - \hat{\boldsymbol{y}}\right\|^2$

Linear Regression
One central tenet of supervised learning: find the model parameters that lead to the smallest error on all possible data. We only observe a limited amount of data, so this is usually taken as minimizing error on training data while hoping to achieve low error on test data (data unseen during training). When training data are too few or not very representative, we need to use regularization.

A Closed-form Solution
In matrix form, the loss function is
$\mathrm{MSE} = \frac{1}{n}(X\boldsymbol{\beta} - \boldsymbol{y})^\top (X\boldsymbol{\beta} - \boldsymbol{y})$
To find the minimum, we take the derivative with respect to $\boldsymbol{\beta}$:
$\frac{\partial\, \mathrm{MSE}}{\partial \boldsymbol{\beta}} = \frac{2}{n}\left(X^\top X\boldsymbol{\beta} - X^\top \boldsymbol{y}\right)$
A necessary condition for a minimum is that the derivative is zero. Setting the above to zero and simplifying, we get
$\boldsymbol{\beta}^* = (X^\top X)^{-1} X^\top \boldsymbol{y}$

Wait a Minute ...
$\boldsymbol{\beta}^* = (X^\top X)^{-1} X^\top \boldsymbol{y}$. Is this a local minimum? What about the second-order condition?

Second-order Conditions
Consider the 1D function $y = ax^2 + bx + c$. When $a > 0$, it has a minimum but no maximum. When $a < 0$, it has a maximum but no minimum. When $a = 0$ and $b \neq 0$, it has neither a minimum nor a maximum. For multidimensional functions, we consider the Hessian matrix, which is a symmetric matrix:
$H = \begin{bmatrix} \frac{\partial^2 L}{\partial \beta_1^2} & \frac{\partial^2 L}{\partial \beta_1 \partial \beta_2} & \cdots & \frac{\partial^2 L}{\partial \beta_1 \partial \beta_p} \\ \frac{\partial^2 L}{\partial \beta_2 \partial \beta_1} & \frac{\partial^2 L}{\partial \beta_2^2} & \cdots & \frac{\partial^2 L}{\partial \beta_2 \partial \beta_p} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 L}{\partial \beta_p \partial \beta_1} & \frac{\partial^2 L}{\partial \beta_p \partial \beta_2} & \cdots & \frac{\partial^2 L}{\partial \beta_p^2} \end{bmatrix}$

Linear Algebra: Positive Definiteness
A matrix $A$ is positive definite (PD) iff for all vectors $\boldsymbol{x} \neq 0$, $\boldsymbol{x}^\top A \boldsymbol{x} > 0$. Recall $\boldsymbol{x}^\top A \boldsymbol{x} = \sum_i \sum_j A_{ij} x_i x_j$. Example:
$\boldsymbol{x}^\top \begin{bmatrix} 2 & -1 & 0 \\ -1 & 3 & -1 \\ 0 & -1 & 2 \end{bmatrix} \boldsymbol{x} = x_1^2 + x_2^2 + x_3^2 + (x_1 - x_2)^2 + (x_3 - x_2)^2 > 0, \quad \forall \boldsymbol{x} \neq 0$
A matrix $A$ is positive semi-definite (PSD) iff for all vectors $\boldsymbol{x} \neq 0$, $\boldsymbol{x}^\top A \boldsymbol{x} \geq 0$. A matrix $A$ is negative definite (ND) iff for all vectors $\boldsymbol{x} \neq 0$, $\boldsymbol{x}^\top A \boldsymbol{x} < 0$. A matrix $A$ is negative semi-definite (NSD) iff for all vectors $\boldsymbol{x} \neq 0$, $\boldsymbol{x}^\top A \boldsymbol{x} \leq 0$.

Second-order Conditions
$L = \frac{1}{n}(X\boldsymbol{\beta} - \boldsymbol{y})^\top (X\boldsymbol{\beta} - \boldsymbol{y})$
$H = \frac{\partial^2 L}{\partial \boldsymbol{\beta}^2} = \frac{2}{n} X^\top X$
The three cases: the Hessian is PSD (a minimum), the Hessian is NSD (a maximum), or the Hessian is neither PSD nor NSD (a saddle point).

Conditions for Minimum
The second-order derivative is $\frac{\partial^2\, \mathrm{MSE}}{\partial \boldsymbol{\beta}^2} = \frac{2}{n} X^\top X$. It is positive semi-definite because for an arbitrary vector $\boldsymbol{z} \neq 0$,
$\boldsymbol{z}^\top X^\top X \boldsymbol{z} = (X\boldsymbol{z})^\top X\boldsymbol{z}$
Letting $\boldsymbol{a} = X\boldsymbol{z}$, $\boldsymbol{a}^\top \boldsymbol{a}$ is always greater than or equal to zero. So this is truly a local minimum. Since the loss function is convex (which we will not show), the local minimum is also the global minimum.
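A minimal NumPy sketch of the closed-form solution and the positive semi-definiteness check described above, using synthetic data; beta_true and the noise scale are made up for illustration, and np.linalg.solve is used rather than forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X_raw = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5, 4.0])        # last entry plays the role of beta_0
X = np.hstack([X_raw, np.ones((n, 1))])            # the "one small tweak": append a column of ones
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Closed-form OLS solution: beta* = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                                     # close to beta_true

# Second-order condition: the Hessian (2/n) X^T X is positive semi-definite,
# so all of its eigenvalues are >= 0 (up to numerical error)
H = (2 / n) * X.T @ X
print(np.all(np.linalg.eigvalsh(H) >= -1e-10))      # True
```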
Solution to Linear Regression
$\boldsymbol{\beta}^* = (X^\top X)^{-1} X^\top \boldsymbol{y}$ is the model parameter that minimizes the loss
$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - \hat{y}^{(i)}\right)^2 = \frac{1}{n}(X\boldsymbol{\beta} - \boldsymbol{y})^\top (X\boldsymbol{\beta} - \boldsymbol{y})$
This is also known as Ordinary Least Squares (OLS).

Recap: What Did We Do?
- Collected some data $(\boldsymbol{x}^{(1)}, y^{(1)}), (\boldsymbol{x}^{(2)}, y^{(2)}), \ldots, (\boldsymbol{x}^{(n)}, y^{(n)})$
- Specified a model $y^{(i)} = \boldsymbol{\beta}^\top \boldsymbol{x}^{(i)}$ [innate, human design]
- Defined a loss function $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y^{(i)} - \hat{y}^{(i)})^2$ [innate, human design]
- Found the parameters $\boldsymbol{\beta}$ that minimize the loss function [experiential, data driven]

Ridge Regression

Matrix Inverse
$A\boldsymbol{x} = \boldsymbol{b} \Rightarrow \boldsymbol{x} = A^{-1}\boldsymbol{b}$
Matrix $A$ has an inverse if and only if both conditions hold: $A$ is a square matrix, i.e., $A \in \mathbb{R}^{n \times n}$, and all rows (and columns) of $A$ are linearly independent. The number of linearly independent rows (or columns) is called the rank of $A$. $A \in \mathbb{R}^{n \times n}$ has an inverse if and only if $\mathrm{rank}(A) = n$, i.e., $A$ is full-rank. If $A = B^\top B$ with $B \in \mathbb{R}^{m \times n}$, then $\mathrm{rank}(A) \leq \min(m, n)$; in general, $\mathrm{rank}(AB) \leq \min(\mathrm{rank}(A), \mathrm{rank}(B))$.

Another Detail: Is $X^\top X$ Invertible?
$\boldsymbol{\beta}^* = (X^\top X)^{-1} X^\top \boldsymbol{y}$
For $X \in \mathbb{R}^{n \times (p+1)}$, $X^\top X \in \mathbb{R}^{(p+1) \times (p+1)}$ is invertible iff the number of data points $n \geq$ the feature dimension $p + 1$, and at least $p + 1$ data points are linearly independent. When we have more features than data points ($p + 1 > n$), we have a problem!

Ridge Regression
If $A$ is low-rank, $A\boldsymbol{x} = \boldsymbol{b}$ is under-constrained and does not have a unique solution. If $n < p + 1$, $X^\top X$ is low-rank and not invertible. However, $X^\top X + \lambda I$ is invertible, where $\lambda$ is a small positive number. This is known as ridge regression:
$\boldsymbol{\beta}_{\mathrm{RR}} = (X^\top X + \lambda I)^{-1} X^\top \boldsymbol{y}$

Ridge Regression Is L2-regularized Linear Regression
Ridge regression is the solution to a different optimization problem:
$L = \frac{1}{n}(X\boldsymbol{\beta} - \boldsymbol{y})^\top (X\boldsymbol{\beta} - \boldsymbol{y}) + \frac{1}{n}\lambda\|\boldsymbol{\beta}\|^2$
We take the derivative with respect to $\boldsymbol{\beta}$ and set it to zero:
$\frac{\partial L}{\partial \boldsymbol{\beta}} = \frac{2}{n}\left(X^\top X\boldsymbol{\beta} - X^\top \boldsymbol{y} + \lambda\boldsymbol{\beta}\right) = 0$
$\boldsymbol{\beta}_{\mathrm{RR}} = (X^\top X + \lambda I)^{-1} X^\top \boldsymbol{y}$
In other words, if there are multiple solutions for $\boldsymbol{\beta}$, we prefer the one with the least L2 norm.

Recap: Regularization
We can show (but will not) that the ridge regression estimator for $\boldsymbol{\beta}$ is biased but has lower variance than the OLS estimator [bias-variance tradeoff]. This is especially useful when we have limited data. We will see many other forms of regularization later.

Probabilistic Perspective
The model is parameterized by $\boldsymbol{\beta}$ and takes input $\boldsymbol{x}$; we write its output as $f_{\boldsymbol{\beta}}(\boldsymbol{x})$. We interpret $f_{\boldsymbol{\beta}}(\boldsymbol{x})$ as the (input-dependent) parameter $\mu$ of a Gaussian distribution with unit standard deviation ($\sigma = 1$). The ground truth $y^{(i)}$ is drawn from this distribution:
$y^{(i)} \sim \mathcal{N}\left(f_{\boldsymbol{\beta}}(\boldsymbol{x}^{(i)}), 1\right)$
Equivalently,
$y^{(i)} = f_{\boldsymbol{\beta}}(\boldsymbol{x}^{(i)}) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1)$

Maximum Likelihood for Gaussian
Data: $y^{(1)}, \ldots, y^{(N)}$. The Gaussian probability is
$P(y \mid \mu, \sigma) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \mu)^2}{2\sigma^2}\right)$
Taking the log and removing anything unrelated to $\mu$,
$\mu^* = \arg\max_{\mu} \sum_{i=1}^{N} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \mu)^2}{2\sigma^2}\right) = \arg\max_{\mu} \sum_{i=1}^{N} -\frac{(y^{(i)} - \mu)^2}{2\sigma^2} = \arg\min_{\mu} \sum_{i=1}^{N} \left(y^{(i)} - \mu\right)^2$

Plugging In ...
$\mu^{(i)} = \hat{y}^{(i)} = f_{\boldsymbol{\beta}}(\boldsymbol{x}^{(i)})$
$\boldsymbol{\beta}^* = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{N} \left(y^{(i)} - f_{\boldsymbol{\beta}}(\boldsymbol{x}^{(i)})\right)^2$
Linear regression can be understood as MLE if we assume the label contains noise from the Gaussian distribution:
$y^{(i)} = f_{\boldsymbol{\beta}}(\boldsymbol{x}^{(i)}) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma)$

Ridge Regression [Optional]
Ridge regression can be understood as Bayesian maximum a posteriori (MAP) estimation with a Gaussian prior $\mathcal{N}(0, \frac{1}{\lambda})$ for the model parameters $\boldsymbol{\beta}$. We omit the details for the purpose of this course.
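The following sketch (not from the slides) illustrates the under-determined case $p + 1 > n$ and the ridge solution $(X^\top X + \lambda I)^{-1} X^\top \boldsymbol{y}$; the dimensions, the value of $\lambda$, and the random data are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Deliberately under-determined: more features (p + 1 = 51) than data points (n = 20)
n, p = 20, 50
X = np.hstack([rng.normal(size=(n, p)), np.ones((n, 1))])
y = rng.normal(size=n)

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))   # at most n = 20 < p + 1 = 51, so X^T X is singular

# Ridge regression: X^T X + lambda I is full-rank for any lambda > 0
lam = 0.1
beta_rr = np.linalg.solve(XtX + lam * np.eye(p + 1), X.T @ y)

# The same beta_rr minimizes the L2-regularized loss
loss = (X @ beta_rr - y) @ (X @ beta_rr - y) / n + lam * (beta_rr @ beta_rr) / n
print(loss)
```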
Perspectives to Understand Ridge Regression
- Matrix inverse perspective: a trick to make sure an inverse exists.
- Linear equation perspective: we don't have an inverse because the problem is under-constrained and does not have a unique solution. Adding one soft constraint leads to a unique solution.
- Variance reduction: when we have too few data points, the contribution of each data point to $\boldsymbol{\beta}$ is large, leading to large variance. Adding regularization reduces variance.
- Bayesian statistics: adding a Gaussian prior to $\boldsymbol{\beta}$.

Recap: What Did We Do?
- Collected some data $(\boldsymbol{x}^{(1)}, y^{(1)}), (\boldsymbol{x}^{(2)}, y^{(2)}), \ldots, (\boldsymbol{x}^{(n)}, y^{(n)})$
- Specified a model $y^{(i)} = \boldsymbol{\beta}^\top \boldsymbol{x}^{(i)}$ [innate, human design]
- Defined a loss function $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y^{(i)} - \hat{y}^{(i)})^2$, or $\frac{1}{n}\sum_{i=1}^{n}(y^{(i)} - \hat{y}^{(i)})^2 + \frac{1}{n}\lambda\boldsymbol{\beta}^\top\boldsymbol{\beta}$ [innate, human design]
- Found the parameters $\boldsymbol{\beta}$ that minimize the loss function [experiential, data driven]

Linear Models Are Limited
Model misspecification. Incomplete training data.

Logistic Regression: Not So Linear

Logistic Regression: a Single "Neuron"
The simplest non-linear model. Sometimes referred to as a "generalized linear model", as the decision boundary is still linear. Here we emphasize the fact that the model function has a non-linear form.

The Supervised Learning Recipe
- Collect some data $(\boldsymbol{x}^{(1)}, y^{(1)}), (\boldsymbol{x}^{(2)}, y^{(2)}), \ldots, (\boldsymbol{x}^{(n)}, y^{(n)})$
- Specify a model $\hat{y}^{(i)} = f(\boldsymbol{x}^{(i)})$
- Define a loss function
- Find the parameters $\boldsymbol{\beta}$ that minimize the loss function

The Data
Collect some data $(\boldsymbol{x}^{(1)}, y^{(1)}), (\boldsymbol{x}^{(2)}, y^{(2)}), \ldots, (\boldsymbol{x}^{(n)}, y^{(n)})$. $\boldsymbol{x}^{(i)}$ is a p-vector; $y^{(i)}$ is either 0 or 1, denoting the two classes.

Logistic Regression: a Single "Neuron"
Model:
$\hat{y}^{(i)} = \sigma\left(\boldsymbol{\beta}^\top \boldsymbol{x}^{(i)}\right)$
Activation function:
$\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{1 + e^z}$

Activation Function: Sigmoid
The sigmoid squashes all real numbers into the range (0, 1) and is thus good for binary classification: $\sigma(z)$ denotes the probability of one of the two classes.

The Cross-entropy Loss
The label $y^{(i)}$ is either 0 or 1; $\hat{y}^{(i)} \in (0, 1)$ is the output of the model. The cross-entropy loss is
$L = -\frac{1}{N}\sum_{i=1}^{N}\left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]$
For each data point $i$, only one of the two terms is non-zero.

Why the Name?
Do they look very similar to you?
$L_{\mathrm{XE}} = -\frac{1}{N}\sum_{i=1}^{N}\left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]$
$H(P, Q) = E_P[-\log Q(X)]$

Monte Carlo Expectation
Monte Carlo estimation of an expectation:
$E_P[f(y)] = \int f(y)\, P(Y = y)\, dy$
The integral can be approximated if we can draw samples $y^{(1)}, \ldots, y^{(K)}$ from $P(Y)$:
$E_P[f(y)] \approx \frac{1}{K}\sum_{i=1}^{K} f\left(y^{(i)}\right)$
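A minimal NumPy sketch (not from the slides) of the sigmoid model and the cross-entropy loss defined above; the function names, the tiny dataset, the parameter vector beta, and the eps clipping used to avoid log(0) are my own illustrative choices.

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z)); squashes reals into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, beta):
    """y_hat^(i) = sigma(beta^T x^(i)) for every row of X."""
    return sigmoid(X @ beta)

def cross_entropy_loss(y, y_hat, eps=1e-12):
    """L = -(1/N) sum [ y log y_hat + (1 - y) log(1 - y_hat) ]."""
    y_hat = np.clip(y_hat, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Tiny illustrative example with made-up parameters
X = np.array([[0.5, 1.0], [2.0, -1.0], [-1.5, 0.5]])
y = np.array([1, 0, 1])
beta = np.array([0.3, -0.2])               # hypothetical parameter vector
print(cross_entropy_loss(y, predict_proba(X, beta)))
```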
Why the Name?
Cross-entropy: $H(P, Q) = E_P[-\log Q(X)]$. Here, $y^{(i)}$ is drawn from an unknown distribution $P(Y \mid \boldsymbol{X})$, $\hat{y}^{(i)}$ is the probability $Q(Y = 1 \mid \boldsymbol{x}^{(i)}, \boldsymbol{\beta})$, $1 - \hat{y}^{(i)}$ is the probability $Q(Y = 0 \mid \boldsymbol{x}^{(i)}, \boldsymbol{\beta})$, and $\delta(\cdot)$ is the indicator function:
$-\frac{1}{N}\sum_{i=1}^{N}\left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]$
$= -\frac{1}{N}\sum_{i=1}^{N}\left[ \delta\left(y^{(i)} = 1\right)\log Q\left(Y = 1 \mid \boldsymbol{x}^{(i)}, \boldsymbol{\beta}\right) + \delta\left(y^{(i)} = 0\right)\log Q\left(Y = 0 \mid \boldsymbol{x}^{(i)}, \boldsymbol{\beta}\right) \right]$
$= -\frac{1}{N}\sum_{i=1}^{N} \log Q\left(Y = y^{(i)} \mid \boldsymbol{x}^{(i)}, \boldsymbol{\beta}\right)$
$\approx E_P\left[-\log Q\left(Y \mid \boldsymbol{x}^{(i)}, \boldsymbol{\beta}\right)\right]$

Information Theoretical Perspective
The cross-entropy is related to the KL divergence:
$H(P, Q) = -E_P\left[\log Q(x)\right] = H(P) + KL(P \,\|\, Q)$
Minimizing the loss minimizes the distance between the ground-truth distribution $P(y^{(i)} \mid \boldsymbol{x}^{(i)})$ and the estimated distribution $Q(y^{(i)} \mid \boldsymbol{x}^{(i)}, \boldsymbol{\beta})$.

MLE Perspective
Optimizing the cross-entropy can also be seen as maximum likelihood estimation of $\boldsymbol{\beta}$ under the binomial distribution. We omit the details.

Logistic Regression
- Collect some data $(\boldsymbol{x}^{(1)}, y^{(1)}), (\boldsymbol{x}^{(2)}, y^{(2)}), \ldots, (\boldsymbol{x}^{(n)}, y^{(n)})$
- Specify a model $\hat{y}^{(i)} = \sigma(\boldsymbol{\beta}^\top \boldsymbol{x}^{(i)})$
- Define a loss function: cross-entropy
- Find the parameters $\boldsymbol{\beta}$ that minimize the loss function

Optimization via Gradient Descent

Optimization
We seek $\boldsymbol{\beta}$ that minimizes a function $L(\boldsymbol{\beta})$. Assumption: we can evaluate the function and its first-order derivative $\frac{dL(\boldsymbol{\beta})}{d\boldsymbol{\beta}}$.

What Is a Minimum?
$\boldsymbol{x}$ is called a local minimum of a function $f(\boldsymbol{x})$ if there is an $\epsilon$ such that $f(\boldsymbol{x}) \leq f(\boldsymbol{x} + \boldsymbol{y})$ for all $\|\boldsymbol{y}\| < \epsilon$. $\boldsymbol{x}$ is called a global minimum of $f(\boldsymbol{x})$ if $f(\boldsymbol{x}) \leq f(\boldsymbol{y})$ for all $\boldsymbol{y}$ in the domain of $f$. A global minimum must be a local minimum, but a local minimum may not be a global minimum. Multiple local minima cause problems for optimization.

Optimization: Convex Functions
The line segment connecting $f(\boldsymbol{a})$ and $f(\boldsymbol{b})$ always lies above the function between $\boldsymbol{a}$ and $\boldsymbol{b}$. We can understand a convex function as a function where any local minimum is also a global minimum: easy optimization! Logistic regression has a convex loss function. Deep neural networks usually have non-convex loss functions that are difficult to optimize. We will worry about that later!

The Gradient Descent Algorithm
Input: a loss function $L(\boldsymbol{\beta})$ and an initial position $\boldsymbol{\beta}_0$. Repeat for a predefined amount of time (or until convergence): move in the direction of the negative gradient,
$\boldsymbol{\beta}_t = \boldsymbol{\beta}_{t-1} - \eta \left.\frac{dL(\boldsymbol{\beta})}{d\boldsymbol{\beta}}\right|_{\boldsymbol{\beta}_{t-1}}$
where $\eta$ is a small constant called the learning rate. When the gradient is negative, we should move to the right; when the gradient is positive, we should move to the left.

Optimization: Gradient Descent
Starting from a given initial position $\boldsymbol{\beta}_0$ and repeating the update above produces a sequence $\boldsymbol{\beta}_0, \boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_T$ that gets increasingly closer to the optimal value $\boldsymbol{\beta}^*$. If, as $T \to \infty$, $\boldsymbol{\beta}_T \to \boldsymbol{\beta}^*$, we say that the sequence converges to $\boldsymbol{\beta}^*$.
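A minimal sketch (not from the slides) of the gradient descent loop above, applied to logistic regression. The gradient expression X.T @ (y_hat - y) / N is the standard derivative of the cross-entropy loss, which the slides do not derive explicitly; the toy separable dataset, learning rate, and step count are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, eta=0.1, steps=2000):
    """beta_t = beta_{t-1} - eta * dL/dbeta evaluated at beta_{t-1}."""
    beta = np.zeros(X.shape[1])                 # initial position beta_0
    for _ in range(steps):
        y_hat = sigmoid(X @ beta)
        grad = X.T @ (y_hat - y) / len(y)       # gradient of the cross-entropy loss
        beta = beta - eta * grad                # move against the gradient
    return beta

# Linearly separable toy data: the class is decided by the sign of x_1 - x_2
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(200, 2))
y = (X_raw[:, 0] - X_raw[:, 1] > 0).astype(float)
X = np.hstack([X_raw, np.ones((200, 1))])       # append 1 for the bias term

beta = gradient_descent(X, y, eta=0.5)
accuracy = np.mean((sigmoid(X @ beta) > 0.5) == y)
print(accuracy)                                 # high accuracy on this separable data
```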
2D Functions
A loss surface in 2D can be drawn as contour diagrams / level sets, $L_a(f) = \{\boldsymbol{x} \mid f(\boldsymbol{x}) = a\}$, and all local minima can be identified on such a diagram. The gradient direction is the direction along which the function value changes the fastest (for a small change of $\boldsymbol{x}$ in Euclidean norm); along a level set, the function value doesn't change. For a differentiable function $f(\boldsymbol{x})$, its gradient $\frac{df}{d\boldsymbol{x}}$ at any point is either zero or perpendicular to the level set at that point.

Gradient Descent on Convex Functions
$\boldsymbol{\beta}_t = \boldsymbol{\beta}_{t-1} - \eta \left.\frac{df(x, \boldsymbol{\beta})}{d\boldsymbol{\beta}}\right|_{\boldsymbol{\beta}_{t-1}}$
The learning rate $\eta$ determines how much we move at each step. We cannot move too much because the gradient is a local approximation of the function, so the learning rate is usually small. Too small a learning rate: slow convergence. Too large a learning rate: oscillation and overshooting. As we move closer to the minimum, we often decrease $\eta$ so that we do not overshoot.

Logistic Regression
- Collect some data $(\boldsymbol{x}^{(1)}, y^{(1)}), (\boldsymbol{x}^{(2)}, y^{(2)}), \ldots, (\boldsymbol{x}^{(n)}, y^{(n)})$
- Specify a model $\hat{y}^{(i)} = \sigma(\boldsymbol{\beta}^\top \boldsymbol{x}^{(i)})$
- Define a loss function: cross-entropy
- Find the parameters $\boldsymbol{\beta}$ that minimize the loss function

A Single Neuron Is Still VERY Limited
It only works well when there is a straight line that can separate the two classes. The perceptron (similar to logistic regression) infamously fails to represent the XOR function (Minsky & Papert, 1969): no line separates the points (0, 0) and (1, 1) from the points (0, 1) and (1, 0).
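To close, a small sketch (not from the slides) showing the XOR failure empirically: gradient descent on the cross-entropy loss drives all four predicted probabilities toward 0.5, because no linear decision boundary $\boldsymbol{\beta}^\top \boldsymbol{x} = 0$ separates the two XOR classes. The initialization, learning rate, and step count are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The four XOR points: {(0,0), (1,1)} -> class 0, {(0,1), (1,0)} -> class 1
X_raw = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)
X = np.hstack([X_raw, np.ones((4, 1))])          # append 1 for the bias term

rng = np.random.default_rng(0)
beta = rng.normal(size=3)                        # random initial position beta_0

# Gradient descent on the cross-entropy loss
for _ in range(20000):
    y_hat = sigmoid(X @ beta)
    beta -= 0.5 * X.T @ (y_hat - y) / 4

print(sigmoid(X @ beta))  # all four probabilities drift toward 0.5:
                          # a single neuron cannot represent XOR
```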