Image Captioning and Sentiment Analysis
10 Questions

Created by
@EnthralledCharacterization

Questions and Answers

Which gate of an LSTM determines the part of the cell state that influences the output?

  • Input Gate
  • Forget Gate
  • Reset Gate
  • Output Gate (correct)

What advantage does the Gated Recurrent Unit (GRU) have over LSTM in terms of efficiency?

  • Lower accuracy on large datasets
  • More training parameters
  • Less memory requirement (correct)
  • Slower execution speed

In which scenario is it recommended to use LSTM instead of GRU?

  • When dealing with long sequences (correct)
  • For faster execution
  • When less memory is available
  • For simpler network architectures

    What does the reset gate in a Gated Recurrent Unit (GRU) control?

    Answer: Which information from the previous hidden state to discard

    What unique feature does the Peephole LSTM offer that standard LSTM does not?

    Answer: Allows "peeping" into the memory cell during gate calculations

    What is the primary function of the memory cell in a Long Short-Term Memory (LSTM) network?

    Answer: To accumulate information over time and control gradient flow

    Which of the following best describes the purpose of gradient clipping in neural networks?

    Answer: To prevent gradient values from exceeding a specific range

    What advantage do Gated Recurrent Units (GRUs) provide over traditional RNNs?

    Answer: They allow for better handling of long-term dependencies in sequences.

    What does gradient scaling achieve in the context of neural networks?

    Answer: It normalizes the error gradient vector to a defined magnitude.

    Which gate in an LSTM controls the flow of information into the cell state?

    Answer: Input gate

    Study Notes

    Image Captioning

    • Produces descriptive sentences from images using input feature vectors derived from Convolutional Neural Networks (CNN).
    • Examples of output from image captioning include concise phrases like “The dog is hiding.”

    RNN Input Stages

    • Init-Inject: Image vector serves as the RNN's initial hidden state vector, requiring size alignment with the RNN hidden state.
    • Pre-Inject: Image vector treated as the first word in input sequence; requires size alignment with word vectors.
    • Par-Inject: Feeds the image vector into the RNN alongside the word vectors; the image vector may differ in size from the word vectors and need not be supplied at every time step.
    • Merge: RNN does not access the image vector during processing; image is added to language model post-encoding.
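
    The size-alignment requirements of the injection variants can be sketched in NumPy. A minimal sketch; the sizes, weight names, and projections here are illustrative assumptions, not from the source:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, embed_size, feat_size = 8, 5, 16

image_vec = rng.normal(size=feat_size)            # CNN feature vector for the image

# Init-Inject: project the image vector to the RNN hidden-state size
W_init = rng.normal(size=(hidden_size, feat_size))
h0 = W_init @ image_vec                           # initial hidden state, shape (8,)

# Pre-Inject: project it to the word-vector size and treat it as the first "word"
W_pre = rng.normal(size=(embed_size, feat_size))
first_token = W_pre @ image_vec                   # shape (5,), same as word embeddings

print(h0.shape, first_token.shape)
```

    For Merge, by contrast, no such projection into the RNN is needed at all, since the image vector only joins the language model after the RNN has encoded the caption prefix.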

    Back-Propagation Techniques

    • Back-Propagation Through Time (BPTT): Unfolds RNN into a feed-forward network to compute gradients across the entire sequence.
    • Truncated BPTT: Processes forward and backward in chunks; retains hidden states across sequential batches.
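
    The chunked processing of Truncated BPTT can be sketched with a plain tanh RNN. The truncation itself concerns gradients, which NumPy does not track, so only the carried-over hidden state is shown; all weights and sizes are illustrative:

```python
import numpy as np

def rnn_forward(chunk, h, Wxh, Whh):
    """Run a plain tanh RNN over one chunk, returning the final hidden state.
    The same weights are shared across every time step."""
    for x in chunk:
        h = np.tanh(Wxh @ x + Whh @ h)
    return h

rng = np.random.default_rng(0)
Wxh = rng.normal(size=(4, 3)) * 0.1
Whh = rng.normal(size=(4, 4)) * 0.1
seq = [rng.normal(size=3) for _ in range(12)]

h = np.zeros(4)
for start in range(0, len(seq), 4):               # process in chunks of 4 steps
    h = rnn_forward(seq[start:start + 4], h, Wxh, Whh)
    # in an autograd framework, backprop would stop at this chunk boundary,
    # but the hidden state itself carries over to the next chunk
```

    Because only the backward pass is truncated, the forward pass over chunks produces exactly the same final hidden state as running the whole sequence at once.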

    RNN Architectures

    • Multi-layer RNNs can solve complex sequential problems by stacking hidden layers.
    • Bi-directional RNNs offer forward and reverse processing of sequences, enhancing features for applications like speech recognition.

    Gradients in RNNs

    • Vanishing Gradients: Gradients shrink toward zero as they are repeatedly multiplied through many time steps or layers, so earlier steps stop contributing to learning.
    • Exploding Gradients: Gradients grow uncontrollably, leading to numerical instability during training.

    Gradient Management

    • Gradient Scaling: Normalizes gradient vector to a defined norm, often 1.0.
    • Gradient Clipping: Restricts gradient values to remain within a specified range, improving training stability.
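
    The two techniques can be sketched in NumPy; the function names and thresholds below are illustrative assumptions, not from the source:

```python
import numpy as np

def scale_gradient(grad, max_norm=1.0):
    """Gradient scaling: rescale the vector so its L2 norm is at most max_norm."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad

def clip_gradient(grad, limit=5.0):
    """Gradient clipping: clamp each component into [-limit, limit]."""
    return np.clip(grad, -limit, limit)

g = np.array([3.0, 4.0])                          # L2 norm is 5.0
print(scale_gradient(g))                          # direction kept, norm rescaled to 1.0
print(clip_gradient(np.array([10.0, -7.0, 2.0]))) # components clamped to [-5, 5]
```

    Note the difference: scaling preserves the gradient's direction, while clipping changes it whenever only some components exceed the limit.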

    Gated RNNs

    • Include Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures, effective in managing long sequences by preventing vanishing gradients.
    • LSTM utilizes a memory cell for maintaining information over time and has several gates (input, forget, output) to control data flow.

    LSTM Specifics

    • Employs a structure that allows the model to "forget" information and makes predictions based on retained state.
    • Introduces the idea of constant error flow through the memory cell, which keeps gradients from vanishing over long sequences.
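
    A single LSTM time step with the three gates named above can be sketched in NumPy. This is a minimal sketch with stacked gate weights; the layout and sizes are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step. W maps [x; h_prev] to the four stacked pre-activations."""
    H = h_prev.size
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0:H])          # input gate: what new information enters the cell
    f = sigmoid(z[H:2 * H])      # forget gate: how much of the old cell state to keep
    o = sigmoid(z[2 * H:3 * H])  # output gate: which part of the cell state reaches the output
    g = np.tanh(z[3 * H:4 * H])  # candidate cell update
    c = f * c_prev + i * g       # memory cell accumulates information over time
    h = o * np.tanh(c)           # hidden state / output
    return h, c

rng = np.random.default_rng(0)
H, X = 4, 3
W = rng.normal(size=(4 * H, X + H)) * 0.1
b = np.zeros(4 * H)
h, c = lstm_step(rng.normal(size=X), np.zeros(H), np.zeros(H), W, b)
```

    The additive update of `c` is what allows errors to flow back through time without being repeatedly squashed.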

    GRU Overview

    • Simplifies LSTM by combining the forget and input gates into a single update gate; merges cell and hidden states.
    • Fewer parameters lead to reduced memory usage and potentially faster training, with competitive performance against LSTM.
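
    The same step for a GRU shows the simplification: two gates and a single state vector. A sketch with illustrative weight names and sizes:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, Wz, Wr, Wh, bz, br, bh):
    """One GRU time step: two gates, one state vector, no separate memory cell."""
    xh = np.concatenate([x, h_prev])
    z = sigmoid(Wz @ xh + bz)          # update gate (forget and input gates combined)
    r = sigmoid(Wr @ xh + br)          # reset gate: how much past state to discard
    h_tilde = np.tanh(Wh @ np.concatenate([x, r * h_prev]) + bh)
    return (1 - z) * h_prev + z * h_tilde  # interpolate old state and candidate

rng = np.random.default_rng(0)
H, X = 4, 3
Wz, Wr, Wh = (rng.normal(size=(H, X + H)) * 0.1 for _ in range(3))
bz = br = bh = np.zeros(H)
h = gru_step(rng.normal(size=X), np.zeros(H), Wz, Wr, Wh, bz, br, bh)
```

    Compared with the LSTM step, this needs three weight matrices instead of four and carries one state vector instead of two, which is where the memory saving comes from.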

    Comparison Between LSTM and GRU

    • GRUs are more efficient with memory and processing speed; LSTMs excel on larger datasets and longer sequences.
    • Choice between GRU and LSTM depends on application constraints like memory and data sequence length.

    Recurrent Neural Networks (RNNs)

    • Designed for sequential data processing, RNNs can handle varying lengths and structures.
    • Unlike fully connected feed-forward networks, RNNs share weights across time steps, allowing them to learn sequential relationships.

    Applications of RNNs

    • Versatile use cases include language modeling, machine translation, stock market predictions, speech recognition, image caption generation, video tagging, text summarization, and medical data analysis.

    Basic RNN Task

    • Core functionality involves predicting future values based on past inputs, mapping previous states into fixed-length vectors to inform future predictions.
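
    This core loop, a fixed-length state summarizing the past plus a readout that predicts the next value, can be sketched in NumPy; all weights and the input series here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
Wxh = rng.normal(size=(4, 1)) * 0.5   # input -> hidden
Whh = rng.normal(size=(4, 4)) * 0.5   # hidden -> hidden (shared across steps)
Why = rng.normal(size=(1, 4)) * 0.5   # hidden -> predicted next value

h = np.zeros(4)                       # fixed-length summary of the past
for x in [0.1, 0.4, 0.9, 1.6]:        # past observations, fed one step at a time
    h = np.tanh(Wxh @ np.array([x]) + Whh @ h)

y_next = Why @ h                      # prediction informed by the entire history
print(y_next.shape)
```

    However long the input sequence, the prediction always reads from the same fixed-length vector `h`, which is exactly the mapping described above.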


    Description

    Explore the techniques of image captioning through convolutional neural networks and recurrent neural networks. This quiz covers the core concepts and methodologies used to generate descriptive sentences from visual content. Test your understanding of how sentiment classification intertwines with image processing.
