Neural Networks Lecture 1 PDF
Summary
This document contains the first lecture of a neural networks course: an introduction covering linear regression, logistic regression, backpropagation, and the multi-layer perceptron.
Full Transcript
Neural Networks - Lecture 1
Introduction, Logistic Regression, Backpropagation, MLP

Outline
● Housekeeping :-)
● Linear Regression, Objective Function, MSE Loss
● Logistic Regression, Activation Function, Cross-Entropy Loss
● Backpropagation
● Multi-Layered Perceptron

Housekeeping – Semester Organization
● 14 lectures
● 4 Programming and Analysis Homework Assignments (30p)
● 1 project in 3 steps (60p)
  – Step 1: Topic selection and SotA presentation
    ● Presentation + paper (Introduction and Related Work sections)
  – 3 intermediary short presentations (as slides), set 2 weeks apart – 20p
  – Step 2: Intermediary project evaluation (e.g. first results) – 20p
    ● Presentation + continued paper (Method description and first results)
  – Step 3: Final Project Poster Presentation (counts as part of the exam) – 20p
    ● Poster + final paper (Final experiments, Final results, Conclusion)
● Project hours to be used as Office Hours + short tutorial coverage
● 1 written final exam (10p)
● Passing conditions:
  – 50% of the semester activity required for exam entry (grades for HW + intermediary presentations + Step 2 of the project)
  – 50% of the final project, 30% of the written exam

Housekeeping – Syllabus

Project Guidelines
● To be discussed after the lecture

Recap: Linear Regression, Objective Functions and MSE Loss

The case of linear regression
Regression = predict the value of a continuous variable.
Example: cricket chirps as a function of temperature.

Temperature (°C) | No. chirps
20               | 88.6
19.8             | 93.3
18.4             | 84.3
17.1             | 80.6
15.5             | 75.2
17.1             | 82
15.4             | 69.4
16.2             | 83.3
15               | 79.6
17.2             | 82.6
16               | 80.6
17               | 83.5
14.4             | 76.3

[Figure: scatter plot of no. chirps vs. degrees Celsius, with a linear fit.]

The case of linear regression – General Case
● Training dataset: n instances of input-output pairs $\{x_{1:n}, y_{1:n}\}$, with $x_i \in \mathbb{R}^{1 \times d}$ and $y_i \in \mathbb{R}$ (for these examples)
● Training: $\{x_{1:n}, y_{1:n}\}$ → LEARNER → model params. $\Theta$
● Testing: $x_{n+1}, \Theta$ → PREDICTOR → $\hat{y}_{n+1}$

The case of linear regression
Linear predictor:
$\hat{y}_i = \hat{y}(x_i) = \theta_1 + x_i \theta_2$
Loss function – Mean Squared Error (MSE) / Quadratic Loss:
$J(\theta) = \frac{1}{n} \sum_{i=1..n} (y_i - \hat{y}_i)^2$
In our example:
$J(\theta) = \frac{1}{n} \sum_{i=1..n} (y_i - \theta_1 - x_i \theta_2)^2$

In the general case:
$\hat{y}_i = \sum_{j=1}^{d} x_{ij} \theta_j = x_{i1}\theta_1 + x_{i2}\theta_2 + \ldots + x_{id}\theta_d$, where $x_{i1} = 1$
$\hat{y} = X\theta$, with $\hat{y} \in \mathbb{R}^{n \times 1}$, $X \in \mathbb{R}^{n \times d}$ and $\theta \in \mathbb{R}^{d \times 1}$
$\begin{bmatrix} \hat{y}_1 \\ \vdots \\ \hat{y}_n \end{bmatrix} = \begin{bmatrix} x_{11} & \cdots & x_{1d} \\ x_{21} & \cdots & x_{2d} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{nd} \end{bmatrix} \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_d \end{bmatrix}$
$J(\theta) = (y - X\theta)^T (y - X\theta) = \sum_{i=1}^{n} (y_i - x_i^T \theta)^2$

Finding the solution → minimize the loss:
$\frac{\partial J(\theta)}{\partial \theta} = \frac{\partial}{\partial \theta}\left(y^T y - 2 y^T X \theta + \theta^T X^T X \theta\right) = -2 X^T y + 2 X^T X \theta = 0$
$\Rightarrow \theta = (X^T X)^{-1} X^T y$
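Below is a minimal NumPy sketch (not part of the lecture) of the closed-form solution $\theta = (X^T X)^{-1} X^T y$, applied to the chirps table above; the variable names and the use of np.linalg.solve instead of an explicit matrix inverse are my own choices.

```python
import numpy as np

# Cricket-chirps data from the table above: temperature (deg C) and no. chirps.
temp = np.array([20.0, 19.8, 18.4, 17.1, 15.5, 17.1, 15.4,
                 16.2, 15.0, 17.2, 16.0, 17.0, 14.4])
chirps = np.array([88.6, 93.3, 84.3, 80.6, 75.2, 82.0, 69.4,
                   83.3, 79.6, 82.6, 80.6, 83.5, 76.3])

# Design matrix with a leading column of ones (x_{i1} = 1), so theta_1 is the intercept.
X = np.column_stack([np.ones_like(temp), temp])     # shape (n, 2)

# Closed-form least-squares solution: theta = (X^T X)^{-1} X^T y,
# computed by solving the normal equations rather than inverting X^T X.
theta = np.linalg.solve(X.T @ X, X.T @ chirps)

y_hat = X @ theta
mse = np.mean((chirps - y_hat) ** 2)
print("theta =", theta, " MSE =", mse)
```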
The case of linear regression – How about classification?
What if we try to use linear regression to classify the points?
Convention: y = 1 for the positive class, y = 0 for the negative class.
$\Rightarrow$ find $\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \end{bmatrix}$ such that $\theta^T x_i = 1$ if $x_i$ is positive and $\theta^T x_i = 0$ otherwise, where $x_i = \begin{bmatrix} 1 \\ x_{i1} \\ x_{i2} \end{bmatrix}$
What if we introduce an "unproblematic" set of points for one of the classes?

Recap: Logistic Regression, Activation Functions, Cross-Entropy Loss

The case of logistic regression
The classification problem is ill-posed for linear regression.
We need a function that "squeezes" the weighted input into a probability space:
$\sigma(\eta) = \frac{1}{1 + e^{-\eta}}$ – the Logistic Sigmoid Function

The case of logistic regression - 2
Formulate our classification problem in terms of a probability-based model.
Bernoulli random variable: X takes values in {0, 1}
$p(x \mid \theta) = \begin{cases} \theta, & \text{if } x = 1 \\ 1 - \theta, & \text{if } x = 0 \end{cases}$
$p(x \mid \theta) = \mathrm{Ber}(x \mid \theta) = \theta^x (1 - \theta)^{1-x}$

Entropy (H) – a measure of the uncertainty associated with a random variable:
$H(X) = -\sum_{x \in X} p(x \mid \theta) \log p(x \mid \theta)$
For a Bernoulli variable:
$H(X) = -\sum_{x=0}^{1} \theta^x (1-\theta)^{1-x} \log\left[\theta^x (1-\theta)^{1-x}\right] = -\left[\theta \log(\theta) + (1-\theta)\log(1-\theta)\right]$

The logistic regression model specifies the probability of a binary output $y_i \in \{0, 1\}$ given the input $x_i$ as:
$p(y \mid X, \theta) = \prod_{i=1}^{n} p(y_i \mid x_i, \theta) = \prod_{i=1}^{n} \mathrm{Ber}(y_i \mid \sigma(x_i \theta)) = \prod_{i=1}^{n} \left[\frac{1}{1 + e^{-x_i\theta}}\right]^{y_i} \left[1 - \frac{1}{1 + e^{-x_i\theta}}\right]^{1 - y_i}$
where $x_i \theta = \theta_0 + \sum_{j=1}^{d} \theta_j x_{ij}$

The case of logistic regression – Maximizing Likelihood
Maximum Likelihood Estimation (MLE) – a method of estimating the parameters of a statistical model given observations, by finding the parameter values that maximize the likelihood of making those observations given the parameters.
MLE property: for i.i.d. data, MLE minimizes Cross-Entropy.
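A small sketch (not from the slides) of the two ingredients just defined: the logistic sigmoid and the Bernoulli log-likelihood, whose negative is the cross-entropy that MLE ends up minimizing. The toy labels and scores below are hypothetical.

```python
import numpy as np

def sigmoid(eta):
    """Logistic sigmoid: sigma(eta) = 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + np.exp(-eta))

def bernoulli_log_likelihood(y, pi):
    """Sum over the dataset of log Ber(y_i | pi_i) = y_i*log(pi_i) + (1 - y_i)*log(1 - pi_i)."""
    eps = 1e-12                       # clip to avoid log(0)
    pi = np.clip(pi, eps, 1.0 - eps)
    return np.sum(y * np.log(pi) + (1.0 - y) * np.log(1.0 - pi))

# Hypothetical labels and scores x_i . theta for five examples.
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
scores = np.array([2.3, -1.1, 0.4, 3.0, -0.2])
pi = sigmoid(scores)                  # pi_i = p(y_i = 1 | x_i, theta)

log_lik = bernoulli_log_likelihood(y, pi)
print("log-likelihood :", log_lik)
print("cross-entropy  :", -log_lik)   # maximizing one = minimizing the other
```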
The case of logistic regression – Maximizing Likelihood (cont.)
Denote $\pi_i = \sigma(x_i \theta) = \frac{1}{1 + e^{-x_i\theta}}$, so that $p(y \mid X, \theta) = \prod_{i=1}^{n} \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}$.
Maximizing the likelihood $p(y \mid X, \theta)$ → minimizing the negative log likelihood:
$J(\theta) = -\log p(y \mid X, \theta) = -\sum_{i=1}^{n} \left[ y_i \log \pi_i + (1 - y_i) \log(1 - \pi_i) \right]$ (the cross-entropy)

The case of logistic regression – Distribution Divergence
Denote $\pi_i = p(y_i = 1 \mid x_i, \theta)$ and $1 - \pi_i = p(y_i = 0 \mid x_i, \theta)$.
D = {(x_i, y_i), i = 1..n} is the empirical data distribution.
D_m = {(x_i, π_i), i = 1..n} is the modeled data distribution.
We are trying to minimize the "discrepancy" between the modeled data distribution and the empirical one.

Kullback-Leibler Divergence = a measure of the difference (not a proper distance) between two probability distributions:
$KL(P \| Q) = \sum_x P(x) \log\left(\frac{P(x)}{Q(x)}\right) = \sum_x P(x) \log(P(x)) - \sum_x P(x) \log(Q(x)) = -H(P(x)) + H(P(x), Q(x))$
where $H(P(x), Q(x))$ is the Cross-Entropy measure of $P(x)$ and $Q(x)$.
⇒ minimizing the KL divergence = minimizing the cross-entropy

The case of logistic regression – Solve with Gradient Descent
The cross-entropy loss $J(\theta)$ no longer has a closed-form solution ⇒ apply a gradient-descent based method.
Derivative of the logistic function: $\sigma'(x) = \sigma(x)(1 - \sigma(x))$
$\frac{\partial J(\theta)}{\partial \theta} = \frac{\partial}{\partial \theta} \sum_{i=1}^{n} \left( -y_i \log \sigma(x_i \theta) - (1 - y_i) \log(1 - \sigma(x_i \theta)) \right) = \sum_{i=1}^{n} x_i^T (\pi_i - y_i) = X^T (\pi - y)$
The gradient vector points in the direction of steepest ascent ⇒ for minimization, take the opposite direction:
$\theta^{t+1} = \theta^{t} - \lambda \frac{\partial J(\theta)}{\partial \theta}(\theta^t)$, where $\lambda$ is the learning rate.
Source: Wikimedia Commons (https://bit.ly/2lOcCi9)

Logistic regression as a NN
Softmax – generalization of the neural network from binary classification to multiclass; (forced) binary case:
$\pi_{i1} = p(y_i = 0 \mid x_i, \theta) = \frac{e^{x_i \theta_1}}{e^{x_i \theta_1} + e^{x_i \theta_2}}$
$\pi_{i2} = p(y_i = 1 \mid x_i, \theta) = \frac{e^{x_i \theta_2}}{e^{x_i \theta_1} + e^{x_i \theta_2}}$
Source: Sebastian Raschka (https://bit.ly/2lONuI5)

Recap: Backpropagation

Logistic regression and backprop
With $z_1 = x$ (the inputs), $z_2 = \sum_{i=1}^{d} x_i \theta_i$, $z_3 = \sigma(z_2)$ and $z_4 = J(\theta)$:
$\delta_4 = J(\theta)$
$\delta_3 = \frac{\partial J}{\partial z_3} = \frac{\partial J}{\partial z_4} \frac{\partial z_4}{\partial z_3}$, with $\frac{\partial J}{\partial z_4} = 1$
$\delta_3 = \frac{\partial \left[ -(y \log(z_3) + (1 - y) \log(1 - z_3)) \right]}{\partial z_3} = -\frac{y}{z_3} + \frac{1 - y}{1 - z_3}$
$\delta_2 = \frac{\partial J}{\partial z_2} = \frac{\partial J}{\partial z_3} \frac{\partial z_3}{\partial z_2} = \delta_3 \frac{\partial \sigma(z_2)}{\partial z_2} = \delta_3 \, \sigma(z_2)(1 - \sigma(z_2))$
$\delta_1^2 = \frac{\partial J}{\partial z_1^2} = \delta_2 \frac{\partial \sum_{i=1}^{d} x_i \theta_i}{\partial x_2} = \delta_2 \theta_2$

Backprop – Layer specification
In general:
Forward pass: $z_{k+1} = f_k(z_k)$
Backward pass:
$\delta_k^i = \frac{\partial J}{\partial z_k^i} = \sum_j \frac{\partial J}{\partial z_{k+1}^j} \frac{\partial z_{k+1}^j}{\partial z_k^i} = \sum_j \delta_{k+1}^j \frac{\partial z_{k+1}^j}{\partial z_k^i}$
$\frac{\partial J}{\partial \theta_k} = \sum_j \frac{\partial J}{\partial z_{k+1}^j} \frac{\partial z_{k+1}^j}{\partial \theta_k} = \sum_j \delta_{k+1}^j \frac{\partial z_{k+1}^j}{\partial \theta_k}$
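Putting the gradient-descent recipe above into code: a short sketch, under my own assumptions about the data and hyperparameters, that trains logistic regression with the gradient $\partial J/\partial\theta = X^T(\pi - y)$ and the update $\theta^{t+1} = \theta^t - \lambda \, \partial J/\partial\theta$ from the slides (the extra $1/n$ scaling of the gradient is a practical tweak, not part of the lecture).

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def fit_logistic_gd(X, y, lam=0.5, steps=2000):
    """Minimize the cross-entropy J(theta) by gradient descent.

    Gradient: dJ/dtheta = X^T (pi - y); update: theta <- theta - lam * dJ/dtheta.
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        pi = sigmoid(X @ theta)        # pi_i = p(y_i = 1 | x_i, theta)
        grad = X.T @ (pi - y) / n      # averaged over the n examples (assumption)
        theta -= lam * grad
    return theta

# Hypothetical 1-D toy data with a leading bias feature (x_{i1} = 1).
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 200)
y = (x + 0.3 * rng.normal(size=200) > 0.0).astype(float)
X = np.column_stack([np.ones_like(x), x])

theta = fit_logistic_gd(X, y)
print("theta =", theta)
print("training accuracy =", ((sigmoid(X @ theta) > 0.5) == y.astype(bool)).mean())
```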
Recap: Multi-Layered Perceptron

Multi-Layer Perceptron
Consider the rings dataset.
[Figures: the rings dataset and the logistic regression result.]
Logistic regression is still insufficient for classification, because the dataset is not linearly separable.

MLP example: 2 layers with 3 and 2 neurons, respectively.
First layer ($k = 1..3$): $u_k^1 = \sum_{j=1}^{2} x_{ij} \theta_{jk}^1$
Second layer ($k = 1, 2$): $u_k^2 = \sum_{j=1}^{3} a_j^1 \theta_{jk}^2 = \sum_{j=1}^{3} \sigma(u_j^1) \, \theta_{jk}^2$
Objective: minimize the cross-entropy error.

Multi-Layer Perceptron
● Each neuron computes a separation plane on the space of its inputs.
● The 3 neurons of the first layer create "cuts" that can isolate the inner ring.
● The 2 output neurons decide on which "side" of each cut the positive versus the negative examples lie.

Multi-Layer Perceptron
● Neural Net Playground (https://playground.tensorflow.org/): the simple MLP
● Observe the influence of:
  – Activation functions
  – The relation between the absolute gradient value and the learning rate
  – Non-normalized layer inputs
  – Batch size

Summary
● Take-aways:
  – Linear regression + cost function for regression → MSE loss
  – Logistic regression + cost function for classification (binary or multi-class) → cross-entropy loss
  – Activation functions – the role of non-linearities
  – Backpropagation
    ● The chain rule for derivatives
    ● The gradient-based update rule
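As a closing illustration (not part of the original slides), here is a self-contained NumPy sketch of a 2-3-2 MLP of the kind described above, trained with backpropagation on the cross-entropy loss. The rings-like data generation, the biases, and the hyperparameters are my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Rings-like toy data (an assumption; not the lecture's actual dataset):
# class 1 = inner disc, class 0 = surrounding ring.
n = 400
inner = np.arange(n) < n // 2
radius = np.where(inner, rng.uniform(0.0, 1.0, n), rng.uniform(1.5, 2.5, n))
angle = rng.uniform(0.0, 2.0 * np.pi, n)
X = np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])  # (n, 2)
y = inner.astype(int)
Y = np.eye(2)[y]                     # one-hot targets, shape (n, 2)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def softmax(u):
    e = np.exp(u - u.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Parameters: first layer 2 -> 3, second layer 3 -> 2.
# Biases are kept as separate vectors here (the slides fold them into x_{i1} = 1).
W1, b1 = rng.normal(0.0, 1.0, (2, 3)), np.zeros(3)
W2, b2 = rng.normal(0.0, 1.0, (3, 2)), np.zeros(2)

lam = 1.0                            # learning rate (lambda in the slides)
for step in range(5000):
    # Forward pass: u1 -> sigmoid -> u2 -> softmax -> class probabilities.
    u1 = X @ W1 + b1                 # (n, 3)
    a1 = sigmoid(u1)
    u2 = a1 @ W2 + b2                # (n, 2)
    pi = softmax(u2)

    # Backward pass (softmax + cross-entropy gives the delta at u2 = pi - Y).
    d2 = (pi - Y) / n
    dW2, db2 = a1.T @ d2, d2.sum(axis=0)
    d1 = (d2 @ W2.T) * a1 * (1.0 - a1)       # chain rule through the sigmoid
    dW1, db1 = X.T @ d1, d1.sum(axis=0)

    # Gradient-descent updates: theta <- theta - lambda * dJ/dtheta.
    W2 -= lam * dW2; b2 -= lam * db2
    W1 -= lam * dW1; b1 -= lam * db1

pi = softmax(sigmoid(X @ W1 + b1) @ W2 + b2)
print("training accuracy:", (pi.argmax(axis=1) == y).mean())
```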