Processing Sequences Using RNNs and CNNs (PDF)

Document Details


Uploaded by AdroitAloe3279

Ulsan National Institute of Science and Technology

Tsz-Chiu Au

Tags

recurrent neural networks, RNNs, deep learning, time series analysis

Summary

This document discusses recurrent neural networks (RNNs) and convolutional neural networks (CNNs) for processing sequences. It covers LSTM and GRU cells as well as the convolutional WaveNet architecture, and explains their application to forecasting and time series data. The document provides conceptual explanations and practical examples, which are helpful for those studying machine learning.

Full Transcript


Chapter 15: Processing Sequences Using RNNs and CNNs
Tsz-Chiu Au ([email protected])
Ulsan National Institute of Science and Technology (UNIST), South Korea

Recurrent Neural Networks
Recurrent neural networks (RNNs) are a class of nets that can predict the future.
» They can analyze time series data such as stock prices.
In this chapter, we will study:
» The fundamental concepts underlying RNNs.
» How to train them using backpropagation through time.
» How to use them to forecast a time series.
» How to cope with unstable gradients and a (very) limited short-term memory.
» A CNN architecture called WaveNet that can process time series data as well as RNNs can.

Recurrent Neurons and Layers
A recurrent neural network looks very much like a feedforward neural network, except that it also has connections pointing backward.
A recurrent neuron can be unrolled through time: at each time step t (also called a frame), the recurrent neuron receives the inputs x(t) as well as its own output from the previous time step, y(t–1).
» In the first time step there is no previous output, so it is set to 0.

A Layer of Recurrent Neurons
At each time step t, every neuron receives both the input vector x(t) and the output vector from the previous time step, y(t–1).
Each recurrent neuron has two sets of weights: Wx for the inputs x(t) and Wy for the outputs of the previous time step, y(t–1).

Memory Cells
A recurrent neuron has memory because its output is a function of all the inputs from previous time steps.
A part of a neural network that preserves some state across time steps is called a memory cell (or simply a cell).
A single recurrent neuron, or a layer of recurrent neurons, is a very basic cell, capable of learning only short patterns.
» To learn longer patterns, a more powerful type of cell is needed.
A cell's state at time step t, denoted h(t) (the "h" stands for "hidden"), is a function of some inputs at that time step and its state at the previous time step: h(t) = f(h(t–1), x(t)).
The output at time step t, denoted y(t), is also a function of the previous state and the current inputs.

Input and Output Sequences
Sequence-to-sequence network
» E.g., predicting time series such as stock prices.
Sequence-to-vector network
» E.g., feed the network a sequence of words corresponding to a movie review and output a sentiment score.
Vector-to-sequence network
» E.g., the input could be an image, and the output could be a caption for that image.
Encoder–Decoder
» E.g., translating a sentence from one language to another.
» Feed the network a sentence in one language; the encoder converts this sentence into a single vector representation, and then the decoder decodes this vector into a sentence in another language.

Training RNNs
Backpropagation through time (BPTT):
» First, do a forward pass through the unrolled network.
» Second, the output sequence is evaluated using a cost function C(Y(0), Y(1), ..., Y(T)).
§ The cost function may ignore some of the outputs.
» Third, the gradients of that cost function are propagated backward through the unrolled network.
» Fourth, the model parameters are updated using the gradients computed during BPTT.
§ Since the same parameters W and b are used at each time step, backpropagation will do the right thing and sum over all time steps.
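To make the recurrent-layer computation above concrete, here is a minimal NumPy sketch of a layer of recurrent neurons unrolled through time; the layer sizes and random weights are illustrative assumptions.

import numpy as np

# Illustrative shapes: 10 time steps, batch of 4 instances, 3 input features, 5 neurons
n_steps, batch_size, n_inputs, n_neurons = 10, 4, 3, 5
rng = np.random.default_rng(42)

Wx = rng.normal(size=(n_inputs, n_neurons))   # weights for the inputs x(t)
Wy = rng.normal(size=(n_neurons, n_neurons))  # weights for the previous outputs y(t-1)
b = np.zeros(n_neurons)                       # bias terms

X = rng.normal(size=(n_steps, batch_size, n_inputs))  # one input vector per time step
y = np.zeros((batch_size, n_neurons))                 # initial state h(init) = 0

for t in range(n_steps):
    # y(t) = tanh(x(t) Wx + y(t-1) Wy + b): each neuron sees the current inputs
    # and the layer's own outputs from the previous time step
    y = np.tanh(X[t] @ Wx + y @ Wy + b)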
Forecasting a Time Series
A time series is a sequence of data points, one per time step.
» A univariate time series has one value per time step.
» A multivariate time series has multiple values per time step.
Forecasting is the task of predicting future values.
» E.g., forecast the value at the next time step (represented by the X in the accompanying graphs).
Imputation is the task of predicting missing values from the past.

Generating Time Series for Experiments
In this chapter, instead of using real-world time series data, we will consider a synthetic time series produced by a generator function (see the code sketch after the Deep RNNs slide below).
The function returns a NumPy array of shape [batch size, time steps, 1], where each series is the sum of two sine waves of fixed amplitudes but random frequencies and phases, plus a bit of noise.
» In general, the input features are represented as 3D arrays of shape [batch size, time steps, dimensionality], where dimensionality is 1 for univariate time series and more for multivariate time series.
Let's create a training set, a validation set, and a test set.

Baseline Metrics
We would like to compare our RNN models against some baseline methods to see whether they perform as well as we expect.
Baseline 1: naive forecasting
» Use the last value in a series to predict the next value.
» It gives a mean squared error (MSE) of about 0.020 in our example.
Baseline 2: a linear model (implemented as a fully connected network)
» Use a simple Linear Regression model, so that each prediction is a linear combination of the values in the time series.
» If we compile this model using the MSE loss and the default Adam optimizer, then fit it on the training set for 20 epochs and evaluate it on the validation set, we get an MSE of about 0.004.

Implementing a Simple RNN
Let's build a very simple RNN and compare it with the baselines. It contains just a single layer with a single neuron.
By default, the SimpleRNN layer uses the hyperbolic tangent activation function, and the initial state h(init) is set to 0.
The neuron computes a weighted sum of the first input value and the initial state, applies the hyperbolic tangent activation function to the result, and this gives the first output, y(0). In a simple RNN, this output is also the new state h(0). This new state is passed to the same recurrent neuron along with the next input value, x(1), and the process is repeated until it returns y(49).
By default, recurrent layers in Keras only return the final output. To make them return one output per time step, you must set return_sequences=True.
If you compile, fit, and evaluate this model (just like earlier, training for 20 epochs using Adam), you will find that its MSE reaches only 0.014.
» Better than the naive approach, but it does not beat the simple linear model.
» Reason: this simple RNN has just three parameters, whereas the simple linear model has 51 parameters.

Trend and Seasonality
There are other models for forecasting time series, e.g., weighted moving average models and autoregressive integrated moving average (ARIMA) models.
Some of them require you to first remove the trend and the seasonality, i.e., the known, predictable patterns in the series, before fitting the model. After the model is trained and makes predictions, you have to add the trend and the seasonal pattern back to get the final predictions.
When using RNNs, it is generally not necessary to do all this, but it may improve performance in some cases, since the model will not have to learn the trend or the seasonality.

Deep RNNs
To implement a deep RNN, simply stack multiple layers of cells (see the sketch below).
Note that you must set return_sequences=True for all recurrent layers except the last one.
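A minimal sketch of the series generator and the models described above, assuming tf.keras; the specific amplitudes, frequency ranges, and layer sizes are illustrative assumptions consistent with the descriptions in the slides.

import numpy as np
from tensorflow import keras

def generate_time_series(batch_size, n_steps):
    # Each series is the sum of two sine waves with random frequencies and
    # phases, plus a bit of noise
    freq1, freq2, offsets1, offsets2 = np.random.rand(4, batch_size, 1)
    time = np.linspace(0, 1, n_steps)
    series = 0.5 * np.sin((time - offsets1) * (freq1 * 10 + 10))   # wave 1
    series += 0.2 * np.sin((time - offsets2) * (freq2 * 20 + 20))  # wave 2
    series += 0.1 * (np.random.rand(batch_size, n_steps) - 0.5)    # noise
    return series[..., np.newaxis].astype(np.float32)              # [batch, steps, 1]

# Training, validation, and test sets: predict the value right after each 50-step window
n_steps = 50
series = generate_time_series(10000, n_steps + 1)
X_train, y_train = series[:7000, :n_steps], series[:7000, -1]
X_valid, y_valid = series[7000:9000, :n_steps], series[7000:9000, -1]
X_test, y_test = series[9000:, :n_steps], series[9000:, -1]

# Baseline 2: a linear model (one Dense unit on the flattened window)
linear_model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[50, 1]),
    keras.layers.Dense(1)
])

# A very simple RNN: a single recurrent layer with a single neuron
simple_rnn = keras.models.Sequential([
    keras.layers.SimpleRNN(1, input_shape=[None, 1])
])

# A deep RNN: return_sequences=True for every recurrent layer except the last
deep_rnn = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20, return_sequences=True),
    keras.layers.SimpleRNN(1)
])

# Each model is compiled and trained the same way, e.g.:
# model.compile(loss="mse", optimizer="adam")
# model.fit(X_train, y_train, epochs=20, validation_data=(X_valid, y_valid))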
If you compile, fit, and evaluate this model, you will find that it reaches an MSE of about 0.003 (i.e., better than the linear model).
The last layer is too simple, and we can replace it with a Dense layer.

Forecasting Several Time Steps Ahead
Suppose we want to predict not just the value at the next time step but the next 10 values.
» One simple way is to use the trained model to predict the next value, add that value to the inputs, then use the model again to predict the following value, and so on. With this method, the errors may accumulate over time.
» We get an MSE of about 0.029, which is better than naive forecasting (MSE of about 0.223) but worse than the linear model (MSE of about 0.0188).
Still, if you only want to forecast a few time steps ahead, this approach may work well, even on more complex tasks.

Forecasting Several Time Steps Ahead (cont.)
The second option is to train an RNN to predict all 10 next values at once.
» We can still use a sequence-to-vector model, but it will output 10 values instead of 1.
» We just need the output layer to have 10 units instead of 1.
The MSE for the next 10 time steps is about 0.008, much better than the linear model.
But we can still do better:
» Instead of training the model to forecast the next 10 values only at the very last time step, we can train it to forecast the next 10 values at each and every time step.
§ I.e., turn this sequence-to-vector RNN into a sequence-to-sequence RNN.
» The advantage of this technique is that the loss will contain a term for the output of the RNN at each and every time step, not just the output at the last time step.

Forecasting Several Time Steps Ahead (cont.)
At time step 0 the model will output a vector containing the forecasts for time steps 1 to 10, then at time step 1 the model will forecast time steps 2 to 11, and so on.
To turn the model into a sequence-to-sequence model:
» We must set return_sequences=True in all recurrent layers (even the last one).
» We must apply the output Dense layer at every time step.
Keras offers a TimeDistributed layer, which reshapes the inputs for the wrapped layer and then reshapes the outputs back into sequences (see the sketch below).
We get a validation MSE of about 0.006, which is 25% better than the previous model.
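A minimal sketch of the sequence-to-sequence model just described, assuming tf.keras; the recurrent layer sizes are illustrative assumptions.

from tensorflow import keras

# Sequence-to-sequence model: every recurrent layer returns full sequences,
# and the Dense output layer is applied at every time step via TimeDistributed
model = keras.models.Sequential([
    keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.SimpleRNN(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))  # 10 forecasts per time step
])

# The targets then have shape [batch size, time steps, 10]: at each time step,
# the next 10 values of the series
model.compile(loss="mse", optimizer="adam")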
Unstable Gradients Problem
To deal with the unstable gradients problem, we can reuse the same tricks as for deep nets: good parameter initialization, faster optimizers, dropout, and so on.
However, unlike in deep nets such as CNNs, we should not use nonsaturating activation functions (e.g., ReLU) in RNNs.
» They may actually make the RNN even more unstable during training.
§ A small increase in the outputs will eventually cause the outputs to explode after many time steps.
» Hence, use a saturating activation function such as the hyperbolic tangent.
The gradients themselves can explode too.
» If you notice that training is unstable, you may want to monitor the size of the gradients (e.g., using TensorBoard).
» Use gradient clipping when needed.

Layer Normalization
Batch Normalization (BN) cannot be used as efficiently with RNNs as with deep feedforward nets.
» In fact, you cannot use BN between time steps, only between recurrent layers.
» Using BN was slightly better than nothing when applied between recurrent layers (i.e., vertically in Figure 15-7), but not within recurrent layers (i.e., horizontally).
Layer Normalization: instead of normalizing across the batch dimension, it normalizes across the features dimension.
» Like BN, Layer Normalization learns a scale and an offset parameter for each input.
» In an RNN, it is typically used right after the linear combination of the inputs and the hidden states.
One advantage is that it can compute the required statistics on the fly, at each time step, independently for each instance.
» This also means that it behaves the same way during training and testing (as opposed to BN).
» It does not need to use exponential moving averages to estimate the feature statistics across all instances in the training set.

Implementation of Layer Normalization
Use tf.keras to implement Layer Normalization within a simple memory cell.
To add dropout, all recurrent layers (except keras.layers.RNN) and all cells provided by Keras have a dropout hyperparameter and a recurrent_dropout hyperparameter.
» The former defines the dropout rate to apply to the inputs (at each time step), and the latter defines the dropout rate for the hidden states (also at each time step).

Long Short-Term Memory
In RNNs, some information is lost at each time step.
» After a while, the RNN's state contains virtually no trace of the first inputs.
To tackle this problem, various types of cells with long-term memory have been introduced.
» They have proven so successful that basic cells are not used much anymore.
The most popular long-term memory cell is the Long Short-Term Memory (LSTM) cell.
There are two ways to add LSTM layers to a model in Keras (see the sketch after the LSTM equations below).
The LSTM layer uses an optimized implementation when running on a GPU, so in general it is preferable to use it.

The Architecture of LSTM Cells
The state of an LSTM cell is split into two vectors: h(t) and c(t).
» You can think of h(t) as the short-term state and c(t) as the long-term state.
The key idea is that the network can learn what to store in the long-term state, what to throw away, and what to read from it.
» c(t–1) first goes through a forget gate, dropping some memories, and then some new memories (selected by an input gate) are added via the addition operation; the result is the new long-term state c(t).
» c(t) is also copied and passed through the tanh function, and the result is filtered by the output gate to produce the short-term state h(t), which is equal to the cell's output for this time step, y(t).

Gates in LSTM Cells
The current input vector x(t) and the previous short-term state h(t–1) are fed to four different fully connected layers.
The main layer is the one that outputs g(t), given the current inputs x(t) and the previous (short-term) state h(t–1).
» Depending on the input gate, g(t) may or may not be added to the long-term state c(t).
The three other layers are gate controllers:
» The forget gate (controlled by f(t)) controls which parts of the long-term state should be erased.
» The input gate (controlled by i(t)) controls which parts of g(t) should be added to the long-term state.
» The output gate (controlled by o(t)) controls which parts of the long-term state should be read and output at this time step, both to h(t) and to y(t).
Gate controllers use the logistic activation function, whose output ranges from 0 to 1.
» If they output 0s they close the gate, and if they output 1s they open it.

The Equations of LSTM Cells
An LSTM cell can be implemented by the equations shown below, where:
» Wxi, Wxf, Wxo, and Wxg are the weight matrices of each of the four layers for their connection to the input vector x(t).
» Whi, Whf, Who, and Whg are the weight matrices of each of the four layers for their connection to the previous short-term state h(t–1).
» bi, bf, bo, and bg are the bias terms for each of the four layers. Note that TensorFlow initializes bf to a vector full of 1s instead of 0s; this prevents forgetting everything at the beginning of training.
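The LSTM equations referenced above, in the standard formulation matching the weight matrices and biases just listed (σ is the logistic function, ⊗ denotes element-wise multiplication):

\begin{aligned}
\mathbf{i}_{(t)} &= \sigma\big(\mathbf{W}_{xi}^\top \mathbf{x}_{(t)} + \mathbf{W}_{hi}^\top \mathbf{h}_{(t-1)} + \mathbf{b}_i\big)\\
\mathbf{f}_{(t)} &= \sigma\big(\mathbf{W}_{xf}^\top \mathbf{x}_{(t)} + \mathbf{W}_{hf}^\top \mathbf{h}_{(t-1)} + \mathbf{b}_f\big)\\
\mathbf{o}_{(t)} &= \sigma\big(\mathbf{W}_{xo}^\top \mathbf{x}_{(t)} + \mathbf{W}_{ho}^\top \mathbf{h}_{(t-1)} + \mathbf{b}_o\big)\\
\mathbf{g}_{(t)} &= \tanh\big(\mathbf{W}_{xg}^\top \mathbf{x}_{(t)} + \mathbf{W}_{hg}^\top \mathbf{h}_{(t-1)} + \mathbf{b}_g\big)\\
\mathbf{c}_{(t)} &= \mathbf{f}_{(t)} \otimes \mathbf{c}_{(t-1)} + \mathbf{i}_{(t)} \otimes \mathbf{g}_{(t)}\\
\mathbf{y}_{(t)} &= \mathbf{h}_{(t)} = \mathbf{o}_{(t)} \otimes \tanh\big(\mathbf{c}_{(t)}\big)
\end{aligned}

And a minimal sketch of the two ways to add LSTM layers in Keras mentioned above; the layer sizes are illustrative assumptions.

from tensorflow import keras

# Option 1: use the LSTM layer directly (preferable; it uses an optimized
# implementation when running on a GPU)
model = keras.models.Sequential([
    keras.layers.LSTM(20, return_sequences=True, input_shape=[None, 1]),
    keras.layers.LSTM(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

# Option 2: wrap an LSTMCell in the generic keras.layers.RNN layer
model2 = keras.models.Sequential([
    keras.layers.RNN(keras.layers.LSTMCell(20), return_sequences=True,
                     input_shape=[None, 1]),
    keras.layers.RNN(keras.layers.LSTMCell(20), return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])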
Peephole Connections
In a regular LSTM cell, the gate controllers can look only at the input x(t) and the previous short-term state h(t–1).
An LSTM variant adds extra connections called peephole connections:
» The previous long-term state c(t–1) is added as an input to the controllers of the forget gate and the input gate.
» The current long-term state c(t) is added as an input to the controller of the output gate.
Peephole connections often improve performance, but not always.
Keras offers an experimental implementation of LSTM cells with peephole connections: tf.keras.experimental.PeepholeLSTMCell.
» You can create a keras.layers.RNN layer and pass a PeepholeLSTMCell to its constructor.

Gated Recurrent Unit (GRU) Cell
The GRU cell is a simplified version of the LSTM cell, and it seems to perform just as well.
» Both state vectors are merged into a single vector h(t).
» A single gate controller z(t) controls both the forget gate and the input gate.
§ If the gate controller outputs a 1, the forget gate is open (= 1) and the input gate is closed (1 – 1 = 0).
§ If it outputs a 0, the opposite happens.
» There is no output gate; the full state vector is output at every time step.
§ However, there is a new gate controller r(t) that controls which part of the previous state will be shown to the main layer (g(t)).

Equations for GRU Cells
The GRU cell can be implemented by a set of equations analogous to the LSTM cell's (a standard formulation is sketched at the end of this section).
Keras provides a keras.layers.GRU layer.

Using 1D Convolutional Layers to Process Sequences
LSTM and GRU cells are among the main reasons behind the success of RNNs.
» But they still have a fairly limited short-term memory.
§ They have a hard time learning long-term patterns in sequences of 100 time steps or more, such as audio samples, long time series, or long sentences.
One way to solve this is to shorten the input sequences, for example using 1D convolutional layers.
» Build a neural network composed of a mix of recurrent layers and 1D convolutional layers (or even 1D pooling layers).
» A 1D convolutional layer with a stride of 2 downsamples the input sequence by a factor of 2.
By shortening the sequences, the convolutional layer may help the GRU layers detect longer patterns.

WaveNet
It is possible to use only 1D convolutional layers and drop the recurrent layers entirely.
WaveNet stacks 1D convolutional layers, doubling the dilation rate (how spread apart each neuron's inputs are) at every layer.
» The lower layers learn short-term patterns, while the higher layers learn long-term patterns.
» Thanks to the doubling dilation rate, the network can process extremely long sequences very efficiently.

WaveNet (cont.)
A simplified WaveNet can be implemented in Keras as follows (see the sketch below): the Sequential model starts with an explicit input layer, then continues with a 1D convolutional layer using "causal" padding.
» This ensures that the convolutional layer does not peek into the future when making predictions.
Then we add similar pairs of layers using growing dilation rates: 1, 2, 4, 8, and again 1, 2, 4, 8.
Finally, we add the output layer: a convolutional layer with 10 filters of size 1 and no activation function.
The GRU model with 1D convolutional layers and WaveNet offer the best performance so far in forecasting our time series.
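A standard GRU formulation consistent with the gate description above (σ is the logistic function, ⊗ denotes element-wise multiplication; the weight-matrix names follow the same pattern as for the LSTM cell):

\begin{aligned}
\mathbf{z}_{(t)} &= \sigma\big(\mathbf{W}_{xz}^\top \mathbf{x}_{(t)} + \mathbf{W}_{hz}^\top \mathbf{h}_{(t-1)} + \mathbf{b}_z\big)\\
\mathbf{r}_{(t)} &= \sigma\big(\mathbf{W}_{xr}^\top \mathbf{x}_{(t)} + \mathbf{W}_{hr}^\top \mathbf{h}_{(t-1)} + \mathbf{b}_r\big)\\
\mathbf{g}_{(t)} &= \tanh\big(\mathbf{W}_{xg}^\top \mathbf{x}_{(t)} + \mathbf{W}_{hg}^\top (\mathbf{r}_{(t)} \otimes \mathbf{h}_{(t-1)}) + \mathbf{b}_g\big)\\
\mathbf{h}_{(t)} &= \mathbf{z}_{(t)} \otimes \mathbf{h}_{(t-1)} + (1 - \mathbf{z}_{(t)}) \otimes \mathbf{g}_{(t)}
\end{aligned}

And minimal sketches of the last two models described above, assuming tf.keras; the filter counts, kernel sizes, and GRU layer sizes are illustrative assumptions.

from tensorflow import keras

# GRU layers preceded by a 1D convolutional layer that downsamples the input
# sequence by a factor of 2 (stride of 2), helping the GRUs detect longer patterns
conv_gru_model = keras.models.Sequential([
    keras.layers.Conv1D(filters=20, kernel_size=4, strides=2, padding="valid",
                        input_shape=[None, 1]),
    keras.layers.GRU(20, return_sequences=True),
    keras.layers.GRU(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(10))
])

# Simplified WaveNet: stacked causal 1D convolutions with doubling dilation rates
wavenet_model = keras.models.Sequential()
wavenet_model.add(keras.layers.InputLayer(input_shape=[None, 1]))
for rate in (1, 2, 4, 8) * 2:  # dilation rates 1, 2, 4, 8, and again 1, 2, 4, 8
    wavenet_model.add(keras.layers.Conv1D(filters=20, kernel_size=2,
                                          padding="causal", activation="relu",
                                          dilation_rate=rate))
wavenet_model.add(keras.layers.Conv1D(filters=10, kernel_size=1))  # output layer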
