chap2_REV.pptx
CHAPTER 2: SUPERVISED LEARNING

Deriving the Delta Rule
Define the error as the squared residuals summed over all training cases:
$E = \frac{1}{2} \sum_{n \in \text{training}} (t^n - y^n)^2$
Now differentiate to get error derivatives for the weights:
$\frac{\partial E}{\partial w_i} = \frac{1}{2} \sum_n \frac{\partial y^n}{\partial w_i} \frac{dE^n}{dy^n} = -\sum_n x_i^n (t^n - y^n)$
The batch delta rule changes the weights in proportion to their error derivatives summed over all training cases:
$\Delta w_i = -\varepsilon \frac{\partial E}{\partial w_i} = \varepsilon \sum_n x_i^n (t^n - y^n)$
(A code sketch of this rule appears at the end of this section.)

The error surface in extended weight space
The error surface lies in a space with a horizontal axis for each weight and one vertical axis for the error E.
- For a linear neuron with a squared error, it is a quadratic bowl.
- Vertical cross-sections are parabolas; horizontal cross-sections are ellipses.
- For multi-layer, non-linear nets the error surface is much more complicated.
[Figure: a quadratic bowl over weights w1 and w2, with the error E on the vertical axis.]

Online versus batch learning
- The simplest kind of batch learning does steepest descent on the error surface. This travels perpendicular to the contour lines.
- The simplest kind of online learning zig-zags around the direction of steepest descent.
[Figures: contour ellipses in the (w1, w2) plane for batch descent; for online learning, constraint lines from training case 1 and training case 2.]

Why learning can be slow
If the ellipse is very elongated, the direction of steepest descent is almost perpendicular to the direction towards the minimum!
The red gradient vector has a large component along the short axis of the ellipse and a small component along the long axis of the ellipse. This is just the opposite of what we want.
[Figure: an elongated contour ellipse in the (w1, w2) plane with the gradient vector drawn in red.]

Logistic neurons
These give a real-valued output that is a smooth and bounded function of their total input:
$z = b + \sum_i x_i w_i, \qquad y = \frac{1}{1 + e^{-z}}$
They have nice derivatives which make learning easy.
The sigmoid is a smooth approximation of the indicator function.
[Figure: the logistic curve, y rising from 0 through 0.5 towards 1 as a function of z.]

The derivatives of a logistic neuron
The derivatives of the logit, z, with respect to the inputs and the weights are very simple:
$\frac{\partial z}{\partial w_i} = x_i, \qquad \frac{\partial z}{\partial x_i} = w_i$
The derivative of the output with respect to the logit is simple if you express it in terms of the output:
$\frac{dy}{dz} = y(1 - y)$

The derivatives of a logistic neuron (details)
$\frac{dy}{dz} = \frac{e^{-z}}{(1 + e^{-z})^2} = \left(\frac{1}{1 + e^{-z}}\right)\left(\frac{e^{-z}}{1 + e^{-z}}\right) = y(1 - y)$
because $\frac{e^{-z}}{1 + e^{-z}} = \frac{(1 + e^{-z}) - 1}{1 + e^{-z}} = 1 - y$.

Deriving the delta rule (squared loss)
To learn the weights we need the derivative of the output with respect to each weight:
$\frac{\partial y}{\partial w_i} = \frac{\partial z}{\partial w_i} \frac{dy}{dz} = x_i \, y(1 - y)$
$\frac{\partial E}{\partial w_i} = \sum_n \frac{\partial y^n}{\partial w_i} \frac{\partial E}{\partial y^n} = -\sum_n x_i^n \, y^n (1 - y^n)(t^n - y^n)$
Compared with the linear delta rule there is an extra term, $y^n(1 - y^n)$: the slope of the logistic.
The learning rate can be adjusted during training.

Problems with the squared error loss
The squared error measure has some drawbacks:
- If the desired output is 1 and the actual output is 0.00000001, there is almost no gradient for a logistic unit to fix up the error.
- If we are trying to assign probabilities to mutually exclusive class labels, we know that the outputs should sum to 1, but we are depriving the network of this knowledge.
Is there a different loss function that works better? Yes: force the outputs to represent a probability distribution across discrete alternatives. (A numerical comparison of the two gradients appears at the end of this section.)

Deriving the Delta Rule – Cross Entropy Loss (I)
Source: Daniel Jurafsky & James H. Martin. Speech and Language Processing. 2021.

Deriving the Delta Rule – Cross Entropy Loss (II)

Learning with hidden units (again)
- Networks without hidden units are very limited in the input-output mappings they can model.
- Adding a layer of hand-coded features (as in a perceptron) makes them much more powerful, but the hard bit is designing the features.
- We would like to find good features without requiring insights into the task or repeated trial and error where we guess some features and see how well they work.
- We need to automate the loop of designing features for a particular task and seeing how well they work.
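To make the batch delta rule concrete, here is a minimal NumPy sketch for a linear neuron; the function name and the default hyperparameters are illustrative choices, not part of the original slides.

```python
import numpy as np

def batch_delta_rule(X, t, epochs=100, lr=0.01):
    """Batch delta rule for a linear neuron.

    Each epoch applies delta w_i = lr * sum_n x_i^n (t^n - y^n),
    i.e. a step against dE/dw summed over all training cases.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        y = X @ w                  # forward pass on every training case
        w += lr * X.T @ (t - y)    # weight change proportional to -dE/dw
    return w
```

For example, `batch_delta_rule(np.array([[1., 2.], [2., 1.]]), np.array([3., 3.]))` converges toward w ≈ (1, 1), the exact solution for that toy data.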
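The logistic neuron's derivatives translate directly into code. Below is a sketch of a single squared-error gradient step, showing the extra y(1 − y) "slope of logistic" term; again the names and the learning rate are illustrative.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_delta_step(X, t, w, lr=0.1):
    """One squared-error gradient step for a logistic neuron.

    dE/dw_i = -sum_n x_i^n * y^n (1 - y^n) * (t^n - y^n),
    where y(1 - y) is the slope-of-logistic term from the slides.
    """
    y = logistic(X @ w)
    grad = -X.T @ (y * (1 - y) * (t - y))
    return w - lr * grad
```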
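The "almost no gradient" problem can also be seen numerically. This small illustration (the values mirror the slide's example of a saturated logistic unit) compares the per-logit gradients of the two losses:

```python
# Target 1, logistic output nearly 0: compare gradients w.r.t. the logit z.
y, t = 1e-8, 1.0

# Squared error E = (t - y)^2 / 2:  dE/dz = -(t - y) * y * (1 - y)
grad_squared = -(t - y) * y * (1 - y)   # ~ -1e-8: almost no learning signal

# Cross-entropy C = -t*log(y) - (1-t)*log(1-y):  dC/dz = y - t
grad_cross_entropy = y - t              # ~ -1.0: a strong learning signal

print(grad_squared, grad_cross_entropy)
```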
Learning by perturbing weights (this idea occurs to everyone who knows about evolution)
- Randomly perturb one weight and see if it improves performance. If so, save the change. This is a form of reinforcement learning.
- Very inefficient: we need to do multiple forward passes on a representative set of training cases just to change one weight. Backpropagation is much better.
- Towards the end of learning, large weight perturbations will nearly always make things worse, because the weights need to have the right relative values.
[Figure: a network with input units, hidden units, and output units.]

Learning by using perturbations
- We could randomly perturb all the weights in parallel and correlate the performance gain with the weight changes. This is not any better, because we need lots of trials on each training case to "see" the effect of changing one weight through the noise created by all the changes to the other weights.
- A better idea: randomly perturb the activities of the hidden units. Once we know how we want a hidden activity to change on a given training case, we can compute how to change the weights. There are fewer activities than weights, but backpropagation still wins by a factor of the number of neurons.

The idea behind backpropagation
- We don't know what the hidden units ought to do, but we can compute how fast the error changes as we change a hidden activity. Instead of using desired activities to train the hidden units, use error derivatives with respect to hidden activities.
- Each hidden activity can affect many output units and can therefore have many separate effects on the error. These effects must be combined.
- We can compute error derivatives for all the hidden units efficiently at the same time. Once we have the error derivatives for the hidden activities, it's easy to get the error derivatives for the weights going into a hidden unit.

Sketch of the backpropagation algorithm on a single case
- First convert the discrepancy between each output and its target value into an error derivative.
- Then compute error derivatives in each hidden layer from the error derivatives in the layer above.
- Then use the error derivatives with respect to activities to get error derivatives with respect to the incoming weights.
(A runnable sketch appears after this section's slides.)

Backpropagating dE/dy
For logistic units with squared error, $\frac{\partial E}{\partial y_j} = -(t_j - y_j)$ at the outputs, and for each layer:
$\frac{\partial E}{\partial z_j} = \frac{dy_j}{dz_j} \frac{\partial E}{\partial y_j} = y_j (1 - y_j) \frac{\partial E}{\partial y_j}$
$\frac{\partial E}{\partial y_i} = \sum_j \frac{\partial z_j}{\partial y_i} \frac{\partial E}{\partial z_j} = \sum_j w_{ij} \frac{\partial E}{\partial z_j}$
$\frac{\partial E}{\partial w_{ij}} = \frac{\partial z_j}{\partial w_{ij}} \frac{\partial E}{\partial z_j} = y_i \frac{\partial E}{\partial z_j}$

Softmax
The output units in a softmax group use a non-local non-linearity:
$y_i = \frac{e^{z_i}}{\sum_{j \in \text{group}} e^{z_j}}$
where $z_i$ is called the "logit". Its derivative is $\frac{\partial y_i}{\partial z_i} = y_i (1 - y_i)$.

Cross-entropy: the right cost function to use with softmax
The right cost function is the negative log probability of the right answer:
$C = -\sum_j t_j \log y_j$
C has a very big gradient when the target value is 1 and the output is almost zero. (A value of 0.000001 is much better than 0.000000001.)
The steepness of dC/dy exactly balances the flatness of dy/dz:
$\frac{\partial C}{\partial z_i} = \sum_j \frac{\partial C}{\partial y_j} \frac{\partial y_j}{\partial z_i} = y_i - t_i$
So the delta rule remains the same.

Perceptron Decision (a special case)
Outputs are thresholded:
$y_j = \mathrm{sign}\Big(\sum_i w_{ij} x_i\Big)$
For example, y = (y_1, ..., y_5) = (1, 0, 0, 1, 1) is a possible output.
We may have a different function g in the place of sign.

Perceptron Learning = Updating the Weights
We want to change the values of the weights. Aim: minimize the error at the output. If E = t − y, we want E to be 0.
Use the update
$w_i(j+1) = w_i(j) + \eta \,(t - y)\, x_i$
where $\eta$ is the learning rate, i is the input index, j is the training epoch (training case or minibatch index), and (t − y) is the error (loss).

Example: Learning Weights

X_0  X_1  X_2  t
-1   0    0    0
-1   0    1    1
-1   1    0    1
-1   1    1    1

[Figure: a single unit with weights W_0, W_1, W_2 feeding an indicator function.]

Initial values: w_0(0) = -0.05, w_1(0) = -0.02, w_2(0) = 0.02, and η = 0.25.
Take the first row of our training table:
y = sign(-0.05 × -1 + -0.02 × 0 + 0.02 × 0) = sign(0.05) = 1
w_0(1) = -0.05 + 0.25 × (0 − 1) × (−1) = 0.2
w_1(1) = -0.02 + 0.25 × (0 − 1) × 0 = -0.02
w_2(1) = 0.02 + 0.25 × (0 − 1) × 0 = 0.02
We continue with the new weights and the second row, and so on.
We make several passes over the training data. (The sketch immediately below replays these updates in code.)
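Here is the worked example above as a NumPy sketch; with the slide's initial values, the first update reproduces w_0(1) = 0.2. The 0/1 threshold stands in for the slide's indicator function, and the epoch count is an arbitrary choice.

```python
import numpy as np

# Training table from the slide: x0 = -1 is a bias input.
X = np.array([[-1., 0., 0.],
              [-1., 0., 1.],
              [-1., 1., 0.],
              [-1., 1., 1.]])
t = np.array([0., 1., 1., 1.])

w = np.array([-0.05, -0.02, 0.02])   # w_0(0), w_1(0), w_2(0)
eta = 0.25                           # learning rate

for epoch in range(10):              # several passes over the training data
    for x_n, t_n in zip(X, t):
        y = 1.0 if x_n @ w > 0 else 0.0   # indicator output
        w += eta * (t_n - y) * x_n        # w_i <- w_i + eta * (t - y) * x_i
print(w)                             # weights that classify the table correctly
```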
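The three-step backpropagation sketch can be written out for one hidden layer of logistic units with logistic outputs and squared error. This is a minimal illustration; the single-case interface and the variable names are assumptions for the example, not the slides' notation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_single_case(x, target, W1, W2):
    """One forward/backward pass, with E = 1/2 * sum (t - y)^2."""
    # Forward pass.
    h = sigmoid(W1 @ x)              # hidden activities
    y = sigmoid(W2 @ h)              # output activities
    # Convert output discrepancies into error derivatives ...
    dE_dy = -(target - y)
    dE_dz2 = dE_dy * y * (1 - y)     # through the logistic output units
    # ... combine each hidden activity's effects on all output units ...
    dE_dh = W2.T @ dE_dz2
    dE_dz1 = dE_dh * h * (1 - h)     # through the logistic hidden units
    # ... then get derivatives for the incoming weights.
    dE_dW2 = np.outer(dE_dz2, h)
    dE_dW1 = np.outer(dE_dz1, x)
    return dE_dW1, dE_dW2
```

A gradient-descent loop would subtract a learning rate times these derivatives from W1 and W2 after each case (online) or after summing over cases (batch).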
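The softmax/cross-entropy identity dC/dz = y − t, the reason "the delta rule remains the same", can be checked numerically; the logits and target here are arbitrary example values.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())       # shift logits for numerical stability
    return e / e.sum()

z = np.array([2.0, 0.5, -1.0])    # logits for one softmax group
t = np.array([0.0, 1.0, 0.0])     # one-hot target

y = softmax(z)
C = -np.sum(t * np.log(y))        # negative log probability of the right answer
dC_dz = y - t                     # gradient w.r.t. the logits

print(C, dC_dz)
```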
Model Selection & Generalization
- Learning is an ill-posed problem; the data alone are not sufficient to find a unique solution.
- Hence the need for an inductive bias: assumptions about the hypothesis class H.
- Generalization: how well a model performs on new data.
- Overfitting: H is more complex than the class C or function f being learned.
- Underfitting: H is less complex than C or f.

Triple Trade-Off
There is a trade-off between three factors (Dietterich, 2003):
1. the complexity of the hypothesis class, c(H),
2. the training set size, N,
3. the generalization error, E, on new data.
As N increases, E decreases.
As c(H) increases, E first decreases and then increases.

Steps of Supervised Learning – Discriminative vs. Generative
Discriminative learning:
- Model $g(C \mid x)$ directly.
- Choose a loss function and an optimization procedure: $\theta^* = \arg\min_\theta E(\theta \mid X)$.
Generative learning:
- Model $g(C \mid x) \propto g(x \mid C)\, g(C)$ via Bayes' theorem.
- Figure out the components $g(x \mid C)$ and $g(C)$ for each class.
- Use the MAP (maximum a posteriori) rule to predict.

A Fundamental Picture
In general, training errors will always decline. However, test errors will decline at first (as reductions in bias dominate) but will then start to increase again (as increases in variance dominate). We must always keep this picture in mind when choosing a learning method. More flexible/complicated is not always better!

Cross-Validation
To estimate generalization error, we need data unseen during training. We split the data as:
- Training set (60%)
- Validation set (20%)
- Test (publication) set (20%)
K-fold cross-validation rotates which fold is held out. (A code sketch appears at the end of this transcript.)
[Figure: 4-fold cross-validation, rotating the held-out fold.]
Make sure the distribution of the labels is similar across the three sets.
Use resampling when there is little data.

Evaluation Metrics
- Precision (positive predictive value, PPV): $p(y = 1 \mid \hat{y} = 1)$
- Recall (sensitivity): $p(\hat{y} = 1 \mid y = 1)$
- Specificity: $p(\hat{y} = 0 \mid y = 0)$
- Area Under the Curve (AUC). How to calculate? (See the sketch at the end of this transcript.)

Note
The slides are edited and modified by Dongxiao Zhu @ Wayne State University.
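A minimal sketch of k-fold cross-validation as described in the Cross-Validation slide. The `train_and_score` callback is a placeholder for whatever model is being selected, and this plain version does not do the stratified (label-balanced) splitting the slide recommends.

```python
import numpy as np

def kfold_indices(n, k=4, seed=0):
    """Shuffle n example indices and split them into k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    return np.array_split(idx, k)

def cross_validate(X, t, k, train_and_score):
    """Average held-out score over k folds.

    train_and_score(X_tr, t_tr, X_val, t_val) should fit a model on the
    training split and return its score on the validation split.
    """
    folds = kfold_indices(len(X), k)
    scores = []
    for i in range(k):
        val = folds[i]                                            # held-out fold
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(train_and_score(X[tr], t[tr], X[val], t[val]))
    return np.mean(scores)
```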
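The three probability definitions in the Evaluation Metrics slide reduce to ratios of confusion-matrix counts; here is a sketch (function name assumed):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Precision, recall (sensitivity), and specificity from 0/1 labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    precision   = tp / (tp + fp)   # p(y = 1 | y_hat = 1)
    recall      = tp / (tp + fn)   # p(y_hat = 1 | y = 1)
    specificity = tn / (tn + fp)   # p(y_hat = 0 | y = 0)
    return precision, recall, specificity
```

AUC is different in kind: it requires the classifier's ranked scores rather than hard 0/1 predictions, since it sweeps the decision threshold.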