# Introduction to Recurrent Neural Networks

## Introduction

This chapter covers Recurrent Neural Networks (RNNs). It will cover the following:

* Motivations for RNNs and essential RNN ideas such as hidden states.
* RNN applications in language modelling and text generation.
* The vanishing gradient problem and some solutions.
* Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks.
* Attention mechanisms.
* Bidirectional RNNs.
* Recursive Neural Networks.

## Recurrent Neural Networks

### Markov Models

A Markov model predicts the next state/word based only on the previous word:

$P(w_t \mid w_1, w_2, ..., w_{t-1}) = P(w_t \mid w_{t-1})$

### Limitations of Markov Models

Markov models have the following limitations:

* The conditional probability table is of size $V \times V$, where $V$ is the vocabulary size, so it can become very large (quadratic in the vocabulary size).
* The model can only look one step behind.

### Recurrent Neural Networks to the Rescue

Recurrent Neural Networks address the limitations of Markov models. RNNs have the following properties:

* The hidden state can, in theory, remember information from arbitrarily long context windows.
* The model size does not increase for longer input sequences.

### RNN Cell

* $x_t$ is the input at time $t$. For example, it could be a one-hot vector representation of the $t$-th word in a sentence.
* $s_t$ is the hidden state at time $t$. It serves as the network's memory and is calculated from the previous hidden state $s_{t-1}$ and the current input $x_t$: $s_t = f(Ux_t + Ws_{t-1})$, where $f$ is a non-linear activation function such as $\tanh$.
* $o_t$ is the output at time $t$. For example, if we are trying to predict the next word, $o_t$ would be a probability vector over the vocabulary: $o_t = \mathrm{softmax}(Vs_t)$.

### Unfolded RNN

The image shows an RNN unfolded over three time steps. The unfolded RNN shows the flow of information through time.
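To make the unfolded computation concrete, here is a minimal NumPy sketch (not part of the original slides) of the forward pass defined above. The matrices $U$, $W$, $V$ follow the notation in the equations; the `rnn_forward` name, the $\tanh$ nonlinearity, the toy dimensions, and the random initialization are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D vector.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_forward(xs, U, W, V):
    """Unrolled forward pass: s_t = tanh(U x_t + W s_{t-1}), o_t = softmax(V s_t)."""
    s = np.zeros(U.shape[0])      # s_0: initial hidden state
    states, outputs = [], []
    for x_t in xs:                # one step per input vector (e.g. a one-hot word)
        s = np.tanh(U @ x_t + W @ s)
        o = softmax(V @ s)
        states.append(s)
        outputs.append(o)
    return states, outputs

# Toy dimensions: vocabulary of 5 words, hidden size 4, sequence length 3.
rng = np.random.default_rng(0)
vocab, hidden = 5, 4
U = rng.normal(scale=0.1, size=(hidden, vocab))
W = rng.normal(scale=0.1, size=(hidden, hidden))
V = rng.normal(scale=0.1, size=(vocab, hidden))
xs = [np.eye(vocab)[i] for i in (0, 3, 1)]   # three one-hot "words"
states, outputs = rnn_forward(xs, U, W, V)
print(outputs[-1])   # probability distribution over the next word
```

Note that the same $U$, $W$, and $V$ are reused at every time step, which is why the model size does not grow with the sequence length.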
### RNN Types

The image shows the different types of RNN input/output configurations:

* One to one
* One to many
* Many to one
* Many to many

### RNN Applications

* Language modelling and generating text
* Machine translation
* Speech recognition
* Generating image descriptions
* Video analysis
* Sentiment classification

### The Problem of Long-Range Dependencies

The image shows the problem of long-range dependencies: the backpropagated error diminishes (vanishes) exponentially with time.

### The Vanishing Gradient Problem

The vanishing gradient problem is a major challenge in training RNNs, especially on long sequences:

$\frac{\partial L_t}{\partial s_k} = \frac{\partial L_t}{\partial s_t} \prod_{j=k+1}^{t} \frac{\partial s_j}{\partial s_{j-1}}$

If the singular values of the Jacobians $\frac{\partial s_j}{\partial s_{j-1}}$ are less than 1, the product, and hence the gradient, shrinks exponentially with the distance $t - k$.

### Solutions to Vanishing Gradients

* Careful initialization
* Gradient clipping
* Long Short-Term Memory (LSTM)
* Gated Recurrent Units (GRUs)

### LSTM

The image shows an LSTM cell. The LSTM cell has the following gates:

* Forget gate
* Input gate
* Output gate

### LSTM Equations

* Forget gate: $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$
* Input gate: $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$
* Candidate cell state: $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$
* Cell state update: $C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t$
* Output gate: $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
* Hidden state: $h_t = o_t \cdot \tanh(C_t)$

### GRU

The image shows a GRU cell. GRUs are a simplified version of LSTMs.

### GRU Equations

* Update gate: $z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$
* Reset gate: $r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$
* Candidate hidden state: $\tilde{h}_t = \tanh(W \cdot [r_t \cdot h_{t-1}, x_t])$
* Hidden state: $h_t = (1 - z_t) \cdot h_{t-1} + z_t \cdot \tilde{h}_t$

### Which One to Use?

* **Empirical evidence:** LSTMs and GRUs often perform similarly.
* **GRU advantages:** fewer parameters, trains a bit faster, and may generalize better with less data.
* **LSTM advantages:** more expressive, and might perform better given enough data.

### Attention Mechanism

The image shows the attention mechanism. The attention mechanism allows the network to focus on the most relevant parts of the input sequence.

### Attention Mechanism Equations

* Score: $\mathrm{score}(h_t, \bar{h}_s) = h_t^T \bar{h}_s$
* Attention weights: $\alpha_{ts} = \frac{\exp(\mathrm{score}(h_t, \bar{h}_s))}{\sum_{s'=1}^{S} \exp(\mathrm{score}(h_t, \bar{h}_{s'}))}$
* Context vector: $\mathrm{context}_t = \sum_{s=1}^{S} \alpha_{ts} \bar{h}_s$

A small numerical sketch of these equations is given at the end of the chapter.

### Bidirectional RNNs

The image shows a bidirectional RNN. Bidirectional RNNs process the input sequence in both directions, so the state at each position can use both past and future context.

### Recursive Neural Networks

The image shows a recursive neural network. Recursive Neural Networks are a generalization of RNNs and can be used to process tree-like structures.
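### Attention Mechanism: A Minimal Sketch

Returning to the attention mechanism equations above, the following NumPy sketch (not part of the original slides) computes the dot-product scores $h_t^T \bar{h}_s$, the softmax weights $\alpha_{ts}$, and the context vector. The `attention` function name, the toy dimensions, and the random values are illustrative assumptions.

```python
import numpy as np

def attention(h_t, h_bar):
    """Dot-product attention: scores h_t^T h_bar_s, softmax weights, weighted-sum context."""
    scores = h_bar @ h_t                  # one score per source position, shape (S,)
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()     # alpha_{ts}: distribution over source positions
    context = weights @ h_bar             # context_t = sum_s alpha_{ts} * h_bar_s
    return weights, context

# Toy example: 4 source hidden states and one target hidden state, each of size 3.
rng = np.random.default_rng(0)
h_bar = rng.normal(size=(4, 3))   # encoder (source) hidden states, one row per position
h_t = rng.normal(size=3)          # current decoder (target) hidden state
weights, context = attention(h_t, h_bar)
print(weights, context)
```

The weights sum to 1, so the context vector is a convex combination of the source hidden states, weighted toward the positions most relevant to the current target state.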