
NLP unit 2.pdf


Full Transcript


Bag of words

Broadly speaking, a bag-of-words model is a representation of text that is usable by machine learning algorithms. Most machine learning algorithms cannot work directly with non-numerical data, which is why we use encoding methods such as one-hot encoding to convert textual data into numerical matrices that an algorithm can use. Bag-of-Words (BoW) aims to extract features from text, which can then be used in modelling. Let's see how this works.

How a bag-of-words model works

The process by which input data is converted into a vector of numerical feature values is known as feature extraction, and Bag-of-Words is a feature extraction technique for textual data. One important thing to note is that the BoW model does not care about the internal ordering of the words in a sentence, hence the name. For example, with
sent1 = "Hello, how are you?"
sent2 = "How are you?, Hello"
the output vectors of sent1 and sent2 would be identical.

BoW consists of two things:
1. A vocabulary of known words - a list of all the known words the model will consider during feature extraction. This can be thought of as understanding a sentence by referring to its words in a dictionary.
2. A count of the known words that are present - a count of the words in the input sentence that also appear in the vocabulary created above.

Let us see how a simple bag of words can be created.

1. First, we create a vocabulary of the known words. We shall use the famous poem "No man is an island" by John Donne; below are its first four lines:
1. No man is an island, (5 words)
2. Entire of itself, (3 words)
3. Every man is a piece of the continent, (8 words)
4. A part of the main. (5 words)
We shall treat each line as a separate document, so under this assumption we have 4 documents in our example. The vocabulary of all the known words from these documents (ignoring punctuation and case) is:
no, man, is, an, island, entire, of, itself, every, a, piece, the, continent, part, main
Our vocabulary contains 15 words, built from a collection of 21 words.

2. After the vocabulary is created, we create vectors for the different documents. This process is known as word scoring, and the easiest approach is the binary scoring method: since our vocabulary contains 15 words, we create vectors of length 15 and mark 1 for each word that is present and 0 for each word that is absent in a particular document. For Document #1 the scoring looks like this:
no: 1, man: 1, is: 1, an: 1, island: 1, entire: 0, of: 0, itself: 0, every: 0, a: 0, piece: 0, the: 0, continent: 0, part: 0, main: 0
Converted to a vector, this is [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]. A short code sketch of this process is given below; binary scoring is only one of several ways of scoring words.
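The following is a minimal Python sketch of the binary Bag-of-Words construction just described. The names (docs, tokenize, vocabulary, binary_vector) are illustrative, not from the original notes.

import re

docs = [
    "No man is an island,",
    "Entire of itself,",
    "Every man is a piece of the continent,",
    "A part of the main.",
]

def tokenize(text):
    # lowercase the text and drop punctuation, as in the example above
    return re.findall(r"[a-z]+", text.lower())

# vocabulary of known words, in order of first appearance (15 terms)
vocabulary = []
for doc in docs:
    for word in tokenize(doc):
        if word not in vocabulary:
            vocabulary.append(word)

def binary_vector(doc):
    # 1 if the vocabulary term occurs in the document, 0 otherwise
    words = set(tokenize(doc))
    return [1 if term in words else 0 for term in vocabulary]

print(vocabulary)
print(binary_vector(docs[0]))   # [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]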
Methods for Scoring Words in NLP

Natural language processing is defined as the interaction between the computer and human language. As we are all aware, human language is messy, and there are many different ways of saying the same thing. Human beings have many ways of communicating with each other, but what about communication between humans and computers? That is where natural language processing comes in.

While solving problems related to NLP, textual data is converted into numerical data so that the machine can work with it. This conversion is crucial to the results of an NLP model. There are many ways to convert textual data into numerical values (vectors in most cases). The scoring of words is done with respect to a well-defined vocabulary, and it can be done in several ways, namely: binary scoring, count scoring, frequency scoring and TF-IDF scoring. We shall briefly look at each of these scoring methods.

Binary Scoring
This is a very simple way of scoring words in a document: we simply mark 1 when a particular word is present in the document and 0 when it is not. To understand this, let us assume that our vocabulary is:
no, man, is, an, island, entire, of, itself, every, a, piece, the, continent, part, main
For the document "No man is an island", the scoring looks like this:
no: 1, man: 1, is: 1, an: 1, island: 1, entire: 0, of: 0, itself: 0, every: 0, a: 0, piece: 0, the: 0, continent: 0, part: 0, main: 0
Converted to a vector, this is [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]. We mark 1 for the words present in the vocabulary and 0 for the others. The vectors in this scoring method contain only 0s and 1s, hence the name. Note also that the length of the vector is equal to the number of words in the vocabulary.

Count Scoring
This scoring method works on the count of the words in a document: it creates a vector in which each value is the number of times the corresponding word occurs in the document. For the example above (binary scoring), we would get the same vector, since no word is repeated in the document.

Frequency Scoring
This scoring method is often confused with count scoring. The two methods are similar; the only difference is that frequency scoring calculates the frequency of each word in a document, i.e. the number of times the word appears divided by the total number of words in the document. Using the same vocabulary, for the document "No man is an island" each of the five words appears once out of five words, so the vector is:
[0.2, 0.2, 0.2, 0.2, 0.2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

TF-IDF Scoring
This is perhaps the most important scoring method in NLP. Term Frequency - Inverse Document Frequency is a measure of how relevant a word is to a document within a collection of documents. For example, in the document "No man is an island", `is` and `an` might not be as relevant to the document as `man`, `no` and `island`. TF-IDF is calculated by multiplying two metrics:
Term frequency: the number of times a word appears in the document, usually normalised by the document length: TF(t, d) = (count of t in d) / (total number of terms in d).
Inverse document frequency: the inverse of how common the word is across the set of documents: IDF(t) = log(N / number of documents containing t), where N is the total number of documents in the collection.
(In the original notes these two formulas appear as images created using https://www.mathcha.io/.) A short code sketch of these scoring methods is given below.
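Continuing the sketch above (reusing docs, tokenize and vocabulary), here is a minimal illustration of count, frequency and TF-IDF scoring. The helper names and the use of the natural logarithm for IDF are assumptions made for illustration, not details from the original notes.

import math

def count_vector(doc):
    # raw number of occurrences of each vocabulary term in the document
    words = tokenize(doc)
    return [words.count(term) for term in vocabulary]

def frequency_vector(doc):
    # occurrences divided by the total number of words in the document
    words = tokenize(doc)
    return [words.count(term) / len(words) for term in vocabulary]

def tfidf_vector(doc, corpus):
    # term frequency multiplied by inverse document frequency
    words = tokenize(doc)
    n_docs = len(corpus)
    scores = []
    for term in vocabulary:
        tf = words.count(term) / len(words)
        df = sum(1 for d in corpus if term in tokenize(d))
        idf = math.log(n_docs / df) if df else 0.0
        scores.append(tf * idf)
    return scores

print(frequency_vector(docs[0]))    # [0.2, 0.2, 0.2, 0.2, 0.2, 0, ...]
print(tfidf_vector(docs[0], docs))  # with only 4 documents, "is" and "man" score lower than "island" and "no"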
In this blog, we read about the different types of word scoring methods used in NLP and how each of them works.

Word Classes
The closed classes are the ones that differ most among languages
▫ Prepositions: from, to, on, of, with, for, by, at, ...
▫ Determiners: the, a, an (il, la, lo, le, i, gli, un, ...)
▫ Pronouns: he, she, I, who, others, ...
▫ Conjunctions: and, but, or, if, because, when, ...
▫ Auxiliary verbs: be, have, can, must, ...
▫ Numerals: one, two, ..., first, second
▫ Particles: up, down, on, off, in, out, at, by (e.g. turn off)

Prepositions occur before noun phrases
▫ Semantically they express a relationship (spatial, temporal, etc.)
▫ In English, some prepositions assume a different role in predefined contexts and are placed in the special class of particles, e.g. on in verbal phrases such as "go on", where it plays a role similar to an adverb

Determiners are often at the beginning of a noun phrase
▫ They are among the most common terms (e.g. the in English)

Conjunctions are used to connect phrases, clauses or sentences
▫ Coordinating conjunctions join two elements at the same level (the most frequent are for, and, nor, but, or, yet, so): copulative (and, also, ...), disjunctive (or, nor, ...), adversative (but, however, still, yet, ...), illative (for, so, ...), correlative (both...and, either...or, neither...nor, ...)
▫ Subordinating conjunctions are used to express a fact that depends on a main clause (they define a relation between two clauses): condition (unless, provided that, if, even if), reason (because, as, as if), choice (rather than, than, whether), contrast (though, although, even though, but), location (where, wherever), result/effect (in order that, so, so that, that), time (while, once, when, since, whenever, after, before, until, as soon as), concession and comparison (although, as, as though, even though, just as, though, whereas, while)

Pronouns are short elements used to refer to noun phrases, entities or events
▫ Personal pronouns refer to persons or entities (I, you, me, ...)
▫ Possessive pronouns define the possessor or, in general, an abstract relation between a person and an (abstract) object (my, his/her, your, ...)
▫ Relative pronouns relate two sentences by subordinating the sentence they introduce with respect to the sentence containing the referred word (who, whom, ...)
▫ Demonstrative pronouns refer to a person or object through a spatial or temporal relation (this, that, these, those, ...)
▫ Indefinite pronouns refer to a generic object, person or event (none, nobody, everybody, someone, each one, ...)

Auxiliary verbs are used in combination with other verbs to give a particular meaning to the verbal phrase (have, be, do, will)
▫ they are used to build the compound verb tenses (present perfect, future perfect, ...): he has eaten an apple, he will go home
▫ they are used to form a question or the negative form of a verb: I do not (don't) walk, Do you like it?
▫ "be" is used to form the passive voice of verbs (the apple is eaten)
▫ they can express a modality for the action (modal auxiliaries): need/requirement (must, have to, need to), possibility (may), will (will, wish), capacity (can)
A short tagging example illustrating these classes is given below.
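To see these closed classes in practice, here is a small illustrative sketch using the NLTK library and its default Penn Treebank-style tagger (the tagset itself is discussed in the next section). The example sentence and the assumption that the tagger resources can be downloaded are mine, not part of the original notes.

import nltk

nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

sentence = "She must go to the store, but he will stay at home because it is late."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# Closed-class words get their own tags: PRP for pronouns (she, he, it),
# MD for modal auxiliaries (must, will), DT for determiners (the),
# CC for coordinating conjunctions (but), and IN/TO for prepositions
# and subordinating conjunctions (at, because, to).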
Tagsets
Several different tagsets have been proposed for PoS tagging
▫ The tagsets for English have different levels of detail: the Penn Treebank tagset, 45 tags (Marcus et al. 1993); the C5 tagset, 61 tags (CLAWS project by Lancaster UCREL, 1997); the C7 tagset, 146 tags (Leech et al. 1994)
▫ Tags are usually specified at the end of the word, after /
▫ The Penn Treebank tagset does not describe some properties that can be derived from the analysis of the lexical entity or from syntax, e.g. prepositions and subordinating conjunctions are combined into the same tag IN, since they are disambiguated in the syntactic parse tree

Penn Treebank tagset
PoS tagging consists in assigning a tag to each word in a document
▫ The selection of the tagset depends on the language and the specific application
▫ The input is a word sequence together with the chosen tagset, while the output is the association of each word with its "best" tag
▫ There may be more than one possible tag for a given word (ambiguity)
▫ The PoS tagger's task is to resolve these ambiguities by selecting the most appropriate tag given the word's context
▫ The percentage of ambiguous words is not very high, but among them are very frequent words (e.g. can - auxiliary verb, noun, verb; still has 7 compatible tags - adjective, adverb, verb, noun, ...)

Skip-gram
Skip-gram is an unsupervised learning technique used to find the most related words for a given word. Skip-gram is used to predict the context words for a given target word; it is the reverse of the CBOW algorithm. Here, the target word is the input while the context words are the output. Since there is more than one context word to predict, the problem becomes more difficult.

Skip-gram example: given the word sat at position 0, we try to predict the words cat and mat at positions -1 and +3 respectively. We do not predict common words or stop words such as the.

Architecture
w(t) is the target word, given as input. There is one hidden layer, which performs the dot product between the weight matrix and the input vector w(t); no activation function is used in the hidden layer. The result of this dot product is passed to the output layer, which computes the dot product between the output vector of the hidden layer and the weight matrix of the output layer. Then we apply the softmax activation function to compute the probability of each word appearing in the context of w(t) at a given context location. (A short code sketch of this forward pass is given after the list of variables below.)

Variables we'll be using
1. The dictionary of unique words present in our dataset or text. This dictionary is known as the vocabulary and holds the words known to the system; it is denoted by V.
2. N is the number of neurons in the hidden layer.
3. The window size is the maximum context position at which a word is predicted, denoted by c. For example, in the architecture image the window size is 2, so we predict the words at context positions (t-2), (t-1), (t+1) and (t+2).
4. The context window is the number of words to be predicted that can occur in the range of the given word. Its value is double the window size, i.e. 2*c, and it is denoted by k. For the image above, the context window is 4.
5. The dimension of an input vector is |V|; each word is encoded using one-hot encoding.
6. The weight matrix for the hidden layer (W) has dimension [|V|, N], where |V| is the size of the vocabulary.
7. The output vector of the hidden layer is H, of size N.
8. The weight matrix between the hidden and the output layer (W') has dimension [N, |V|].
9. The dot product between W' and H gives the output vector U of size |V|.
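To make the architecture concrete, here is a minimal NumPy sketch of one forward pass using the variables above (V, N, W, W', H, U). The toy vocabulary, the value N = 3 and the random initialisation are illustrative assumptions, not values from the original notes.

import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "cat", "sat", "on", "mat"]    # toy vocabulary, |V| = 5
V, N = len(vocab), 3                          # N = hidden-layer size (assumed)

W = rng.normal(size=(V, N))          # input-to-hidden weights, shape [|V|, N]
W_prime = rng.normal(size=(N, V))    # hidden-to-output weights, shape [N, |V|]

# one-hot encode the target word w(t) = "sat"
x = np.zeros(V)
x[vocab.index("sat")] = 1.0

H = x @ W            # hidden output: simply the row of W for "sat" (no activation)
U = H @ W_prime      # one score per vocabulary word

probs = np.exp(U) / np.exp(U).sum()   # softmax over the vocabulary
print(dict(zip(vocab, probs.round(3))))
# The same probability vector is compared against each of the k context positions;
# training adjusts W and W' so that the observed context words ("cat", "mat")
# receive high probability.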
Working steps
1. The words are converted into vectors using one-hot encoding. The dimension of these vectors is [1, |V|].
2. The word w(t) is passed to the hidden layer through the |V| input neurons.
3. The hidden layer performs the dot product between the weight matrix W[|V|, N] and the input vector w(t). Because w(t) is one-hot, the t-th row of W[|V|, N] is the output H[1, N].
4. Remember that there is no activation function at the hidden layer, so H[1, N] is passed directly to the output layer.
5. The output layer computes the dot product between H[1, N] and W'[N, |V|], which gives us the vector U.
6. To obtain the probability of each word, we apply the softmax function to U for every context position.
7. The word with the highest probability is the prediction; if the predicted word for a given context position is wrong, we use backpropagation to modify the weight matrices W and W'.
These steps are executed for each word w(t) in the vocabulary, and each word w(t) is passed k times, so forward propagation is performed |V|*k times in each epoch.

Probability function
w(c,j) is the j-th word predicted at the c-th context position; w(O,c) is the actual word present at the c-th context position; w(I) is the single input word; and u(c,j) is the j-th value in the vector U when predicting the word for the c-th context position. The softmax probability is
p(w(c,j) = w(O,c) | w(I)) = exp(u(c,j)) / sum over j' = 1..|V| of exp(u(j'))

Loss function
Since we want to maximise the probability of predicting the correct word w(O,c) at each context position c, the loss function L is the negative logarithm of the product of these probabilities over the k context positions:
L = -log( product over c = 1..k of p(w(O,c) | w(I)) )

Advantages
1. It is unsupervised learning, hence it can work on any raw text.
2. It requires less memory compared with other word-to-vector representations.
3. It requires two weight matrices of dimension [|V|, N] and [N, |V|] instead of a single [|V|, |V|] matrix. Usually N is around 300 while |V| is in the millions, so the saving is substantial.

Disadvantages
1. Finding the best values for N and c is difficult.
2. The softmax function is computationally expensive.
3. The time required for training this algorithm is high.

Continuous Bag Of Words
The CBOW model tries to understand the context of the words and takes this as input; it then tries to predict words that are contextually accurate. Let us consider an example. In the sentence 'It is a pleasant day', the word 'pleasant' goes as input to the neural network and we try to predict the word 'day'. We use one-hot encoding for the input words and measure the error rate against the one-hot encoded target word; doing this helps us predict the output based on the word with the least error.

The Model Architecture
The CBOW model architecture is as shown above. The model tries to predict the target word by trying to understand the context of the surrounding words. Consider the same sentence as above, 'It is a pleasant day'. The model converts this sentence into word pairs of the form (context word, target word), and the user has to set the window size. If the window for the context words is 2, the word pairs look like this: ([it, a], is), ([is, pleasant], a), ([a, day], pleasant). With these word pairs, the model tries to predict the target word given the context words. If 4 context words are used to predict one target word, the input layer will be in the form of four 1xW input vectors (W being the vocabulary size here). These input vectors are passed to the hidden layer, where they are multiplied by a WxN weight matrix.
Finally, the 1xN outputs from the hidden layer enter the sum layer, where an element-wise summation is performed on the vectors before a final activation is applied and the output is obtained.

Implementation of the CBOW Model
For the implementation of this model, we will use a sample text about coronavirus; you can use any text data of your choice. To use the same data sample as here, click here to download the data. Now that you have the data ready, let us import the libraries and read our dataset.

import numpy as np
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Embedding, Lambda
from keras.utils import np_utils
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
import gensim

data = open('/content/gdrive/My Drive/covid.txt', 'r')
corona_data = [text for text in data if text.count(' ') >= 2]
vectorize = Tokenizer()
vectorize.fit_on_texts(corona_data)
corona_data = vectorize.texts_to_sequences(corona_data)
total_vocab = sum(len(s) for s in corona_data)
word_count = len(vectorize.word_index) + 1
window_size = 2

In the above code, I have used the built-in Tokenizer to tokenize every word in the dataset and fit our data to it. Once that is done, we calculate the total number of words and the size of the word index for further use. As mentioned in the model architecture, we need to set the window size, and I have set it to 2.

The next step is to write a function that generates pairs of context words and target words. The function below does exactly that: given the window size and the vocabulary size, it yields pairs of padded context words and one-hot encoded target words (the source text breaks off inside this function, so its final lines here are a standard reconstruction of the generator it describes rather than the author's verbatim code). A sketch of how the model itself could then be built on top of these pairs follows the function.

def cbow_model(data, window_size, total_vocab):
    total_length = window_size * 2
    for text in data:
        text_len = len(text)
        for idx, word in enumerate(text):
            context_word = []
            target = []
            begin = idx - window_size
            end = idx + window_size + 1
            # collect the words around position idx, skipping the target word itself
            context_word.append([text[i] for i in range(begin, end)
                                 if 0 <= i < text_len and i != idx])
            target.append(word)
            # reconstructed tail: pad the context to a fixed length, one-hot encode
            # the target, and yield the training pair
            contextual = sequence.pad_sequences(context_word, maxlen=total_length)
            final_target = np_utils.to_categorical(target, total_vocab)
            yield (contextual, final_target)
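For completeness, here is one hedged sketch (not the author's own code) of how the CBOW network could be defined and trained on these (context, target) pairs with the imports above; the embedding size of 100, the 10 training epochs and the use of train_on_batch are illustrative assumptions.

embedding_dim = 100   # assumed embedding size

model = Sequential()
model.add(Embedding(input_dim=total_vocab, output_dim=embedding_dim,
                    input_length=window_size * 2))
# averaging the context-word embeddings plays the role of the sum layer described above
model.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(embedding_dim,)))
model.add(Dense(total_vocab, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

for epoch in range(10):   # assumed number of epochs
    cost = 0
    for contextual, final_target in cbow_model(corona_data, window_size, total_vocab):
        cost += model.train_on_batch(contextual, final_target)
    print('epoch', epoch, 'loss', cost)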
