Web and Text Analytics 2024-25 Week 8 PDF

Document Details


Uploaded by CooperativeIntellect47

University of Macedonia

2024

Evangelos Kalampokis

Tags

web analytics, text analytics, RNNs, natural language processing

Summary

This document contains lecture notes on Web and Text Analytics for 2024-2025. It covers topics like data preprocessing, feature extraction, and different types of recurrent neural networks (RNNs). The document also includes examples of code and theoretical explanations for each type of RNN.

Full Transcript


Web and Text Analytics 2024-25
Week 8
Evangelos Kalampokis
https://kalampokis.github.io
http://islab.uom.gr

Data preparation - Preprocessing examples 1.1
Data preparation - Preprocessing examples 1.2
Data preparation - Preprocessing examples 1.3
Feature extraction
Train and test a classifier, Naïve Bayes (1)
Train and test a classifier, Naïve Bayes (2)
Train and test a classifier, XGBOOST (1)
Train and test a classifier, XGBOOST (2)
Train and test a classifier, XGBOOST (3)
Evaluate specific examples
(The slides above are code walkthroughs whose listings are not included in this transcript; a minimal illustrative pipeline sketch follows the seq2seq use cases below.)

Online Courses
▪ DataCamp: Recurrent Neural Networks (RNNs) for Language Modeling with Keras
– https://campus.datacamp.com/courses/recurrent-neural-networks-rnn-for-language-modeling-with-keras/
▪ Coursera: Natural Language Processing with Sequence Models
– https://www.coursera.org/learn/sequence-models-in-nlp

Sequence to sequence (seq2seq) models
▪ Seq2Seq (Sequence-to-Sequence) models are a type of neural network, a special Recurrent Neural Network architecture, designed to transform one data sequence into another.
▪ They are handy for tasks where the input and output are sequences of varying lengths, which traditional neural networks struggle to handle, such as complex language problems like machine translation, question answering, chatbots, and text summarization.

Use Cases of the Sequence to Sequence Models
▪ Machine Translation: One of the most prominent applications of Seq2Seq models is translating text from one language to another.
▪ Text Summarization: Seq2Seq models can generate concise summaries of longer documents, capturing the essential information while omitting less relevant details.
▪ Speech Recognition: Converting spoken language into written text. Seq2Seq models can be trained to map audio signals (sequences of sound) to their corresponding transcriptions (sequences of words).
▪ Chatbots and Conversational AI: These models can generate human-like responses in a conversation, taking the previous sequence of user inputs and generating appropriate replies.
▪ Image Captioning: Seq2Seq models can describe the content of an image in natural language. The encoder processes the image (often using Convolutional Neural Networks, CNNs) to produce a context vector, which the decoder converts into a descriptive sentence.
▪ Video Captioning: Similar to image captioning but with videos, Seq2Seq models generate descriptive texts for video content, capturing the sequence of actions and scenes.
▪ Time Series Prediction: Predicting the future values of a sequence based on past observations.
▪ Code Generation: Generating code snippets or entire programs from natural language descriptions.
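The preprocessing, feature extraction, and classifier slides listed at the start of this week are code walkthroughs whose notebooks are not reproduced in this transcript. As a rough illustration of the kind of pipeline they describe, the sketch below uses scikit-learn's TfidfVectorizer and MultinomialNB on a made-up toy corpus; the course's own data, parameters, and XGBoost variant are not shown here.

    # Minimal sketch of a bag-of-words text classification pipeline
    # (hypothetical toy data; the course's own data set is not shown here).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    texts = ["great movie, loved it", "terrible plot and bad acting",
             "wonderful performance", "boring and way too long"]
    labels = [1, 0, 1, 0]  # 1 = positive review, 0 = negative review

    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.5, random_state=42, stratify=labels)

    # Feature extraction: turn raw text into TF-IDF vectors
    vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    # Train and test a Naive Bayes classifier
    clf = MultinomialNB()
    clf.fit(X_train_vec, y_train)
    pred = clf.predict(X_test_vec)
    print("accuracy:", accuracy_score(y_test, pred))

An XGBoost classifier could be swapped in for MultinomialNB in the same pipeline, which is what the later classifier slides appear to do.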
Encoder-Decoder Architecture
▪ The most common architecture used to build Seq2Seq models is the encoder-decoder architecture, consisting of two major components:
– an encoder that takes a variable-length sequence as input, and
– a decoder that acts as a conditional language model, taking in the encoded input and the leftwards context of the target sequence and predicting the subsequent token in the target sequence.
▪ Different kinds of neural networks, including RNNs, LSTMs, CNNs, and transformers, can be used within the encoder-decoder architecture.

Encoder-decoder architecture
▪ In this architecture, the input data is first fed through what is called an encoder network.
▪ The encoder network maps the input data into a numerical representation that captures the important information from the input.
▪ This numerical representation of the input data is also called a hidden state.
▪ The numerical representation (hidden state) is then fed into what is called the decoder network.
▪ The decoder network generates the output one element of the output sequence at a time.

Encoder-decoder architecture
▪ The picture on the slide represents the encoder-decoder architecture as explained here. Note that both input and output sequences can be of varying length, as shown in the picture.

Recurrent Neural Networks (RNN)
▪ We have already described NLP applications such as sentiment analysis, multi-class classification, text generation, and neural machine translation.
▪ All these applications are possible with a type of deep learning architecture called Recurrent Neural Networks (RNNs).
▪ The main advantages of using RNNs for text data are that they reduce the number of parameters of the model (by avoiding one-hot encoding) and that they share weights between different positions of the text.

Recurrent Neural Networks
▪ In the example, the model uses information from all the words to predict whether the movie review was good or not.
▪ RNNs model sequence data and can have different lengths of inputs and outputs.

Sequence to Sequence models (Classification)
▪ Many inputs to one output is commonly used for classification tasks, where the final output is a probability distribution.
▪ This is used in sentiment analysis and multi-class classification applications.

Sequence to Sequence models (Text generation)
▪ Many inputs to many outputs for text generation starts the same way as the classification case, but for the outputs it uses the previous prediction as input to the next prediction.

Sequence to Sequence models (Machine Translation)
▪ Many inputs to many outputs for neural machine translation is separated into two blocks: encoder and decoder.
▪ The encoder learns the characteristics of the input language, while the decoder learns those of the output language.
▪ The encoder has no prediction (no arrows going up), and the decoder does not receive inputs (no arrows from below).

Sequence to Sequence models (Language models)
▪ Many inputs to many outputs for language models starts with an artificial zero input, and then for every input word i the model tries to predict the next word i+1.

RNN
▪ Recurrent Neural Network models are themselves language models when trained on text data, because they give the probability of the next token given the previous k tokens.
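To make the last point concrete, the sketch below shows what an RNN language model might look like in Keras, in the spirit of the DataCamp course referenced above. The vocabulary size, sequence length, and layer sizes are made-up values: an Embedding layer turns token ids into vectors, a SimpleRNN returns a hidden state at every position, and a softmax over the vocabulary gives the probability of the next token at each step.

    # Minimal sketch of an RNN language model in Keras (hypothetical sizes).
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

    vocab_size = 5000   # assumed vocabulary size
    seq_len = 20        # assumed number of previous tokens fed to the model

    model = Sequential([
        # Token ids -> dense vectors (the embedding layer discussed next)
        Embedding(input_dim=vocab_size, output_dim=64),
        # return_sequences=True gives one hidden state per position (many-to-many)
        SimpleRNN(128, return_sequences=True),
        # Softmax over the vocabulary: probability of the next token at each step
        Dense(vocab_size, activation="softmax"),
    ])
    # Targets would be the same sequences shifted one position to the left
    model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
    model.build(input_shape=(None, seq_len))
    model.summary()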
RNN - embeddings
▪ An embedding layer can also be used as the first layer to create vector representations of the tokens.

RNN example
▪ Let's look at the following example: "Nour was supposed to study with me. I called her but she did not ______"
▪ An N-gram model (trigram) would only look at "did not" and would try to complete the sentence from there. As a result, the model is not able to see the beginning of the sentence "I called her but she". Probably the most likely word after "did not" is "have".

RNN example
▪ RNNs help us solve this problem by being able to track dependencies that are much further apart from each other. As the RNN makes its way through a text corpus, it picks up information along the way.
▪ Note that as you feed more information into the model, the retention of earlier words gets weaker, but it is still there.

RNN example
▪ Another advantage of RNNs is that a lot of the computation shares parameters.
▪ The magic of the RNN is that the information from every word in the sequence is multiplied by the same weights, Wx.
▪ The information propagated from the beginning to the end is multiplied by Wh. In other words, this block is repeated for every word in the sequence.
▪ The only learnable parameters are the ones in Wx, Wh, and W.

Training RNN models
▪ Forward propagation and backward propagation.
▪ They both follow two directions: vertical (between input and output) and horizontal (going through time).
– Because of this horizontal direction, back propagation is referred to as back propagation through time.

RNN – Forward Propagation
▪ In the forward propagation phase, we compute a hidden state a that carries past information by applying a linear combination over the previous step and the current input.
– The second step combines the result from the first step and receives the second word as input.
– The weight matrix Wa is used on all steps, which means the weights are shared among all the inputs.

RNN – Forward Propagation
▪ The diagram shows the order of computation and how information is propagated within a recurrent unit.
▪ Hidden states are the variables that allow RNNs to propagate information through time, or in other words through different positions within the sequence.
▪ As you saw, at every step recurrent units have two inputs.

RNN – Back Propagation through time
▪ Vanishing gradient problem
– The gradient is the value used to update a neural network's weights.
– The gradient shrinks exponentially as it back-propagates through time.
– Small gradients -> small adjustments.
– Thus earlier layers do not learn, and RNNs have short memory.

Vanishing gradient problem

RNNs and Vanishing Gradients
▪ Advantages of RNNs
– RNNs allow us to capture dependencies within a short range, and they take up less RAM than n-gram models.
▪ Disadvantages of RNNs
– RNNs struggle with longer-term dependencies and are very prone to vanishing or exploding gradients.

RNNs and Vanishing Gradients
▪ Note that the sigmoid and tanh functions are bounded by 0 and 1 and by -1 and 1, respectively.
▪ This eventually leads us to a problem. If you have many numbers whose absolute value is less than 1, then as you go through many layers and take the product of those numbers, you eventually end up with a gradient that is very close to 0.
▪ This introduces the problem of vanishing gradients.
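The forward-propagation and vanishing-gradient discussion above can be made concrete with a small NumPy sketch. The dimensions, random weights, and the "gradient factor" below are illustrative assumptions, not the course's notation: the same weight matrices are reused at every step, and the running product of tanh derivatives (each at most 1) is a crude stand-in for the factors multiplied together during back propagation through time.

    # Forward pass of a simple RNN cell in NumPy (hypothetical sizes, random data).
    import numpy as np

    rng = np.random.default_rng(0)
    n_h, n_x, T = 8, 5, 30          # hidden size, input size, sequence length

    # One shared set of weights, reused at every time step
    W_ah = rng.normal(size=(n_h, n_h))   # hidden-to-hidden weights (Wh)
    W_ax = rng.normal(size=(n_h, n_x))   # input-to-hidden weights (Wx)
    b_a = np.zeros(n_h)

    x = rng.normal(size=(T, n_x))   # a made-up input sequence
    a = np.zeros(n_h)               # initial hidden state

    grad_factor = 1.0               # crude proxy for the BPTT gradient factor
    for t in range(T):
        z = W_ah @ a + W_ax @ x[t] + b_a
        a = np.tanh(z)                       # new hidden state a(t)
        # d tanh / dz = 1 - tanh(z)^2 is bounded by 1 and often well below it
        grad_factor *= np.mean(1.0 - a ** 2)

    print("final hidden state (first values):", a[:3])
    print("product of tanh derivatives after", T, "steps:", grad_factor)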
Simple RNN cell
▪ On every cell, we compute the new memory state based on the previous memory state a(t-1) and the current input word x(t).
▪ In the computations, we have a weight matrix Wa that is shared between all steps.

GRU cells
▪ Gated Recurrent Unit (GRU) cells add one gate to the standard RNN cell.
▪ Before updating the memory state, we first compute a candidate that will carry the present information.
▪ Then we compute the update gate g(u), which determines whether the candidate will be used as the new memory state or whether we keep the past memory state a(t-1).

GRU and LSTM
▪ The GRU is like a long short-term memory (LSTM) with a gating mechanism to input or forget certain features, but it lacks a context vector and an output gate, resulting in fewer parameters than the LSTM.
▪ The GRU's performance on certain tasks of polyphonic music modeling, speech signal modeling, and natural language processing was found to be similar to that of the LSTM.

LSTM
▪ LSTMs are the best-known solution to the vanishing gradient problem.
▪ The LSTM is a special variety of RNN designed to handle entire sequences of data by learning when to remember and when to forget, similar to what is done in the GRU.
▪ An LSTM is essentially composed of a cell state, which you can think of as its memory, and a hidden state, where computations are performed during training to decide what changes to make.
▪ An LSTM has multiple gates that transform the states in the network.

LSTM
▪ The LSTM allows your model to remember and forget certain inputs. It consists of a cell state and a hidden state with three gates. The gates allow the gradients to flow unchanged. You can think of the three gates as follows:
▪ Input gate: tells you how much information to input at any time point.
▪ Forget gate: tells you how much information to forget at any time point.
▪ Output gate: tells you how much information to pass on at any time point.

LSTM
▪ The LSTM adds three gates to the standard RNN cell.
▪ The forget gate g(f) determines whether the previous cell state c(t-1) should be forgotten (meaning its value is set to zero) or not. The update gate g(u) does the same for the candidate state c~(t). The output gate g(o) does the same for the new hidden state c(t).
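In Keras, the SimpleRNN, GRU, and LSTM cells are drop-in replacements for one another. The sketch below, with made-up vocabulary, embedding, and unit sizes, builds the same many-to-one sentiment classifier with each cell and prints the parameter counts, which also illustrates that the GRU has fewer parameters than the LSTM, as noted above.

    # Same many-to-one sentiment classifier built with three recurrent cells
    # (hypothetical sizes; no training data is shown here).
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, SimpleRNN, GRU, LSTM, Dense

    vocab_size, embed_dim, units = 5000, 64, 128   # assumed sizes

    def build_model(cell):
        model = Sequential([
            Embedding(input_dim=vocab_size, output_dim=embed_dim),
            cell(units),                      # only the last hidden state (many-to-one)
            Dense(1, activation="sigmoid"),   # probability that the review is positive
        ])
        model.compile(loss="binary_crossentropy", optimizer="adam",
                      metrics=["accuracy"])
        model.build(input_shape=(None, 100))  # assumed sequence length of 100 tokens
        return model

    for cell in (SimpleRNN, GRU, LSTM):
        print(cell.__name__, "parameters:", build_model(cell).count_params())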
