Document Details


Uploaded by ReverentCarnelian4567

Università degli Studi 'G. d'Annunzio' Chieti-Pescara

Tags

representation learning, neural networks, deep learning, machine learning

Summary

This document is a lecture transcript on representation learning with neural networks. It covers autoencoders, convolutional neural networks (CNNs), and long short-term memory (LSTM) networks.

Full Transcript


Representation learning

❏ One of the greatest features of neural networks is their ability to learn representations ψ(x)
❏ Contrast this with linear models, where the features ψ(x) are manually engineered
❏ Representations are useful for several reasons:
  ❏ They can make our models more expressive and more accurate
  ❏ We may want to transfer representations from one task to another
❏ Deeper neural networks learn coarse-to-fine representation layers
  ❏ Bottom-level layers (closer to the inputs) tend to learn low-level representations (corners, edges)
  ❏ Upper-level layers (farther away from the inputs) learn more abstract representations (shapes, forms, objects)

Distributed representations

❏ No single neuron "encodes" everything; groups of neurons (e.g. in the same hidden layer) work together!
❏ We need the hidden units to capture diverse properties of the objects (i.e. we don't want all of them to capture the same property)

Representation learning

❏ How can we learn useful representations of objects from raw inputs only (i.e. with no labels)?
  ❏ Force the network to represent the latent structure of the input distribution
  ❏ Encourage the hidden layers to encode that structure
❏ This can be done with an auto-encoder
  ❏ Key idea: learn the manifold where the input objects live
  ❏ Learn representations that encode points in that manifold well
❏ An auto-encoder is a feed-forward neural network trained to reproduce its input at the output layer
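To make the auto-encoder idea concrete, here is a minimal sketch in Python. It is not taken from the lecture: PyTorch and the layer sizes (784-dimensional inputs, a 32-dimensional code) are assumptions chosen only to show the encoder/decoder structure and the reconstruction objective.

```python
# Minimal auto-encoder sketch (PyTorch assumed; all dimensions are illustrative).
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        # Encoder: maps the raw input x to a learned representation psi(x).
        self.encoder = nn.Sequential(nn.Linear(input_dim, code_dim), nn.ReLU())
        # Decoder: tries to reconstruct x from that representation.
        self.decoder = nn.Linear(code_dim, input_dim)

    def forward(self, x):
        z = self.encoder(x)                # the learned representation
        return self.decoder(z), z          # reconstruction of x, plus the code z

model = AutoEncoder()
x = torch.randn(16, 784)                   # a batch of raw inputs -- no labels needed
x_hat, z = model(x)
loss = nn.MSELoss()(x_hat, x)              # train the output layer to reproduce the input
loss.backward()
```

After training, the decoder can be discarded and the code z used as the learned representation ψ(x), for example transferred as input features to another task, as mentioned above.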
Convolutional Neural Networks

❏ Convolutional Neural Networks (CNNs) are neural networks with a specialized connectivity structure
❏ A convolutional layer preserves the spatial structure of its input
❏ 2D convolutions: the same filter (e.g. 3x3) is applied at each location of the image
  ❏ The filter weights are learned (as tied parameters)
  ❏ Multiple filters are used
❏ The second component of a CNN is pooling
  ❏ Common CNNs alternate convolutional layers and pooling layers
  ❏ Pooling makes the representations smaller and more manageable
  ❏ It operates over each activation map (each channel) independently

Training typically consists of two phases:
1. A forward phase, where the input is passed completely through the network
   a. Each layer caches any data (inputs, intermediate values, etc.) it will need for the backward phase, so any backward phase must be preceded by a corresponding forward phase
2. A backward phase, where gradients are backpropagated (backprop) and the weights are updated
   a. Each layer receives a gradient and also returns one: it receives the gradient of the loss with respect to its outputs and returns the gradient of the loss with respect to its inputs

Long Short-Term Memory

❏ Long Short-Term Memory networks, usually just called LSTMs, are a special kind of RNN capable of learning long-term dependencies
❏ All RNNs have the form of a chain of repeating modules of neural network
  ❏ In standard RNNs, this repeating module has a very simple structure, such as a single tanh layer
❏ LSTMs also have this chain-like structure, but the repeating module is different: instead of a single neural network layer, there are four, interacting in a very special way
❏ The key to LSTMs is the cell state, the horizontal line running through the top of the diagram
❏ The LSTM has the ability to remove or add information to the cell state, carefully regulated by structures called gates
  ❏ A gate is composed of a sigmoid neural net layer and a pointwise multiplication operation
  ❏ The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through
    ❏ A value of zero means "let nothing through"
    ❏ A value of one means "let everything through!"

What information are we going to throw away from the cell state?
❏ This decision is made by a sigmoid layer called the "forget gate layer"
❏ It looks at h_{t-1} and x_t and outputs a number between 0 and 1 for each number in the cell state C_{t-1}
  ❏ A 1 represents "completely keep this"
  ❏ A 0 represents "completely get rid of this"

What new information are we going to store in the cell state? This has two parts:
❏ First, a sigmoid layer called the "input gate layer" decides which values we will update
❏ Next, a tanh layer creates a vector of new candidate values, C̃_t, that could be added to the state

It is now time to update the old cell state C_{t-1} into the new cell state C_t:
❏ We multiply the old state by f_t, forgetting the things we decided to forget earlier
❏ Then we add i_t * C̃_t, the new candidate values scaled by how much we decided to update each state value

Finally, we need to decide what we are going to output. This output will be based on the cell state, but will be a filtered version of it:
❏ First, we run a sigmoid layer which decides what parts of the cell state we are going to output
❏ Then, we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to
(The full sequence of gate computations is written out in the sketch at the end of this section.)

❏ An RNN using LSTM units can be trained with gradient descent, combined with backpropagation through time (BPTT) to compute the gradients needed during optimization, so that each weight of the LSTM network is changed in proportion to the derivative of the error (at the output layer of the LSTM network) with respect to that weight
❏ BPTT begins by unfolding the recurrent neural network in time: the unfolded network contains k inputs and outputs, but every copy of the network shares the same parameters; the backpropagation algorithm is then used to find the gradient of the cost with respect to all the network parameters (see the training sketch below)
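The gate descriptions above translate directly into a few lines of arithmetic. The following NumPy sketch of a single LSTM time step is illustrative, not from the lecture; the weight matrices W_f, W_i, W_c, W_o and the biases are hypothetical names for the parameters of the four interacting layers.

```python
# One LSTM time step, following the forget/input/candidate/output description above.
# NumPy is assumed; weight and bias names and shapes are illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    z = np.concatenate([h_prev, x_t])      # the gates look at h_{t-1} and x_t
    f_t = sigmoid(W_f @ z + b_f)           # forget gate: 0 = "get rid of this", 1 = "keep this"
    i_t = sigmoid(W_i @ z + b_i)           # input gate: which values to update
    C_tilde = np.tanh(W_c @ z + b_c)       # candidate values that could be added to the state
    C_t = f_t * C_prev + i_t * C_tilde     # forget, then add the scaled candidates
    o_t = sigmoid(W_o @ z + b_o)           # output gate: which parts of the state to expose
    h_t = o_t * np.tanh(C_t)               # filtered cell state becomes the new output
    return h_t, C_t
```

Applying lstm_step over a whole input sequence, reusing the same weights at every step, gives exactly the chain of repeating modules described above.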

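To connect the BPTT description to practice, here is a hedged training sketch. It assumes PyTorch, where calling backward() on a loss computed over the whole sequence performs the unfolding and backpropagation through time automatically; the dimensions, the random data, and the mean-squared-error objective are all illustrative.

```python
# Sketch of training an LSTM with backpropagation through time (PyTorch assumed).
# The recurrent network is unfolded over seq_len steps; every step shares the same weights.
import torch
import torch.nn as nn

seq_len, batch, in_dim, hid_dim = 20, 8, 10, 32      # illustrative sizes
lstm = nn.LSTM(in_dim, hid_dim)
readout = nn.Linear(hid_dim, 1)
params = list(lstm.parameters()) + list(readout.parameters())
opt = torch.optim.SGD(params, lr=0.01)

x = torch.randn(seq_len, batch, in_dim)               # dummy input sequence
y = torch.randn(seq_len, batch, 1)                    # dummy per-step targets

out, _ = lstm(x)                                      # forward pass through the unrolled chain
loss = nn.MSELoss()(readout(out), y)                  # error at the output layer
opt.zero_grad()
loss.backward()                                       # BPTT: gradients flow back through all time steps
opt.step()                                            # each weight moves in proportion to dE/dw
```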