Web and Text Analytics 2024-2025 Week 7 PDF
University of Macedonia
Evangelos Kalampokis
Summary
These lecture notes for Web and Text Analytics cover vector space models for text analysis, including Euclidean distance, cosine similarity, and word embeddings, and illustrate how the methods apply to real-world examples.
Full Transcript
Web and Text Analytics 2024-25, Week 7
Evangelos Kalampokis
https://kalampokis.github.io
http://islab.uom.gr

Vector space models
▪ Suppose we have two questions:
– Where are you heading?
– Where are you from?
▪ These sentences have identical words except for the last ones, yet they have different meanings.
▪ Two more questions:
– What is your age?
– How old are you?
▪ The words are completely different, but both sentences mean the same thing.

Vector space models
▪ Vector space models will help us identify whether the first pair of questions or the second pair are similar in meaning, even if they do not share the same words.
▪ They can be used to identify similarity for question answering, paraphrasing and summarization.
▪ Vector space models allow us to capture dependencies between words.
▪ Vector space models allow us to represent words and documents as vectors. This representation captures relative meaning.

Vector space models
▪ A famous quote says, "You shall know a word by the company it keeps."
▪ When learning these vectors, you usually make use of the neighboring words to extract meaning and information about the center word.
▪ If you were to cluster these vectors together, you would see that adjectives, nouns, verbs, etc. tend to be near one another.
▪ Another interesting fact is that synonyms and antonyms are also very close to one another. This is because they can easily be interchanged in a sentence and they tend to have similar neighboring words.

Word-by-word design
▪ To get a vector space model using a word-by-word design, we build a co-occurrence matrix and extract vector representations for the words in our corpus.
▪ The co-occurrence of two different words is the number of times they appear together in a corpus within a certain word distance k.
▪ With a word-by-word design, we get a representation with n entries, with n between one and the size of the entire vocabulary.

Word-by-document design
▪ We count the times that words from our vocabulary appear in documents that belong to specific categories.

Vector space
▪ Based on the previous example, we could take a representation for the words data and film from the rows of the table.
▪ Or we could take the representation for every category of documents by looking at the columns. This vector space has two dimensions.
▪ We can represent the entertainment category as a vector v = [500, 7000].

Euclidean distance
▪ Euclidean distance is a similarity metric.
▪ This metric allows us to identify how far apart two points or two vectors are from each other.
▪ You can generalize finding the distance between two points (A, B) to the distance between two n-dimensional vectors as follows (see also the sketch below):
d(A, B) = sqrt( (B1 − A1)² + (B2 − A2)² + … + (Bn − An)² )

The problem of using Euclidean distance
▪ When comparing large documents to smaller ones with Euclidean distance, one can get an inaccurate result.
▪ Suppose that we are in a vector space where the corpora are represented by the occurrences of the words disease and eggs. Here is the representation of a food corpus, an agriculture corpus, and a history corpus.
▪ But the word totals in the corpora differ from one another. In fact, the agriculture and the history corpus have a similar number of words, while the food corpus has a relatively small number.
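As a minimal illustration of the distance formula above, the following sketch computes the Euclidean distance between word-count vectors with NumPy. The vectors and the helper name euclidean_distance are illustrative, not taken from the lecture's table.

```python
import numpy as np

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """d(A, B) = sqrt(sum_i (B_i - A_i)^2) for two n-dimensional vectors."""
    return float(np.linalg.norm(a - b))

# Hypothetical word-count vectors: counts of "disease" and "eggs" per corpus
agriculture = np.array([20.0, 40.0])
food        = np.array([ 5.0, 15.0])
history     = np.array([30.0,  1.0])

print(euclidean_distance(agriculture, food))     # smaller value = more similar under this metric
print(euclidean_distance(agriculture, history))
```

Because raw counts scale with corpus size, a small corpus can look "far" from a large one even when their topics overlap, which motivates the cosine similarity discussed next.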
Cosine similarity
▪ Another common method for determining the similarity between two vectors is computing the cosine of the angle between them: cos(β) = (v · w) / (‖v‖ ‖w‖).
▪ If the angle is small, the cosine is close to one. As the angle approaches 90 degrees, the cosine approaches zero.

Manipulating Words in Vector Spaces (example)
▪ We are in a hypothetical two-dimensional vector space that has different representations for different countries and capital cities.
▪ We know that the capital of the United States is Washington DC, and we do not know the capital of Russia.
▪ First, we need to find the relationship between the Washington DC and USA vector representations. To do that, we take the difference between the two vectors.
▪ That difference tells you how many units on each dimension you should move in order to find a country's capital in this vector space.

Manipulating Words in Vector Spaces (example)
▪ To find the capital city of Russia, we add the difference vector from the previous step to Russia's vector representation (a small sketch of these operations appears after the visualization part below).
▪ At the end, we deduce that the capital of Russia should have the vector representation [10, 4]. However, no city has exactly that representation, so we take the one that is most similar to it by comparing the vectors with Euclidean distance or cosine similarity.
▪ The vector representation closest to [10, 4] is the one for Moscow.

Vector spaces
▪ We have seen the importance of having vector spaces in which the representations of words capture their relative meaning in natural language.
▪ We have seen the clustering of the vectors when plotted on two axes.
▪ We have also seen that the vectors of words that occur in similar places in a sentence are encoded in a similar way.
▪ We can take advantage of this consistent encoding to identify patterns.
▪ For example, if we take the word doctor and find the words closest to it by computing cosine similarity, we might get the words doctors, nurse, cardiologist, surgeon, etc.

Visualization of word vectors
▪ The vector space dimension is usually higher than two.
▪ The words oil and gas, and city and town, are related.
▪ We want to see whether that relationship is captured by the representations of our words.
▪ To visualize our words and see this and other possible relationships, we can use dimensionality reduction (e.g., Principal Component Analysis, PCA).

Visualization of word vectors
▪ Principal Component Analysis is an unsupervised learning algorithm that can be used to reduce the dimension of our data.
▪ If we perform PCA on our data and get a two-dimensional representation, we can then plot our words.
▪ We see that oil and gas are close to one another, and town and city are also close to one another.
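To make cosine similarity and the country-capital analogy concrete, here is a minimal sketch with hypothetical 2-D vectors. The numbers and the helper name cosine_similarity are illustrative; they are only chosen so that the predicted capital lands at [10, 4], as in the example above.

```python
import numpy as np

def cosine_similarity(v: np.ndarray, w: np.ndarray) -> float:
    """cos(beta) = (v . w) / (||v|| * ||w||)."""
    return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))

# Hypothetical 2-D word vectors (illustrative values, not the lecture's table)
vectors = {
    "USA":        np.array([5.0, 6.0]),
    "Washington": np.array([10.0, 5.0]),
    "Russia":     np.array([5.0, 5.0]),
    "Moscow":     np.array([9.0, 4.0]),
    "Ankara":     np.array([9.0, 1.0]),
}

# Relationship between a country and its capital, applied to Russia
difference = vectors["Washington"] - vectors["USA"]   # [5, -1]
prediction = vectors["Russia"] + difference           # [10, 4]

# Search the remaining cities for the vector most similar to the prediction
candidates = [w for w in vectors if w not in ("USA", "Washington", "Russia")]
best = max(candidates, key=lambda c: cosine_similarity(prediction, vectors[c]))
print(best)  # expected: Moscow
```

In a real embedding space the same arithmetic is done in many more dimensions, and the nearest vector is found with cosine similarity or Euclidean distance, as described above.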
Word embeddings
▪ A word embedding is a learned representation for text in which words that have the same meaning have a similar representation.
▪ Word embeddings are in fact a class of techniques where individual words are represented as real-valued vectors in a predefined vector space.
▪ Each word is mapped to one vector, and the vector values are learned in a way that resembles a neural network, hence the technique is often grouped under deep learning.
▪ Key to the approach is the idea of using a dense distributed representation for each word. Each word is represented by a real-valued vector, often of tens or hundreds of dimensions.

Word representations
▪ In previous lectures we have seen basic document representations:
– Bag of words
– TF-IDF
– Word frequencies
– Naïve Bayes
▪ In the same context we can have basic word representations.
– One-hot vectors: to implement one-hot vectors, you initialize a vector of zeros of dimension V and then put a 1 in the index corresponding to the word you are representing.
– This representation does not carry the word's meaning.

One-hot vectors

Word Embeddings (Example)
▪ The first coordinate represents whether a word is positive or negative. The second coordinate tells you whether the word is abstract or concrete.
▪ When encoding a word in 2D, similar words tend to be found next to each other.
▪ The pros:
– Low dimensionality (less than V)
– They allow us to encode meaning

Applications of Word Embeddings
▪ Word embeddings are used in most NLP applications:
– Finding semantic analogies between words
– Combining word embeddings with a classifier to perform sentiment analysis, or to classify customer reviews or comments from user feedback surveys
– Machine translation systems
– Information extraction
– Question answering

Overview of Machine Translation
▪ Calculate word embeddings associated with English and word embeddings associated with French.
▪ Retrieve the English word embedding of a particular English word, such as cat.
▪ Find some way to transform the English word embedding into a word embedding that has the same meaning in the French word vector space.
▪ Take the transformed word vector and search the French word vector space for the word vectors that are most similar to it.

Machine Translation
▪ If our machine does a good job, it may find the word chat, which is the French word for cat.

Transforming Vectors using Matrices
▪ We want to find the matrix R that can do this transformation.
▪ We can start with a randomly selected matrix R and then see how it performs when we translate the English vectors in matrix X and compare the result to the actual French word vectors in matrix Y (see the sketch below).
▪ In order for this to work, we first need a subset of English words and their French equivalents.
▪ We train on a subset of the English–French vocabulary, not on the entire vocabulary.

Finding the translation
▪ The vector after the transformation lies in the French word vector space, but it is not necessarily identical to any of the word vectors there.
▪ We need to search through the actual French word vectors to find a French word that is similar to the one we created with the transformation.
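A minimal sketch of how such a transformation matrix R could be learned, assuming we already have aligned English and French embedding matrices X and Y (random toy data below): we minimize the squared Frobenius norm of XR − Y with gradient descent. This is an illustration of the idea under those assumptions, not the lecture's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy aligned embeddings: m word pairs, d-dimensional vectors (random stand-ins)
m, d = 100, 10
X = rng.normal(size=(m, d))                        # "English" embeddings
true_R = rng.normal(size=(d, d))
Y = X @ true_R + 0.01 * rng.normal(size=(m, d))    # "French" embeddings

# Learn R by minimizing ||XR - Y||^2 / m with gradient descent
R = rng.normal(size=(d, d))                        # random initialization
learning_rate = 0.01
for step in range(500):
    diff = X @ R - Y                               # error of the current transformation
    gradient = (2.0 / m) * X.T @ diff              # gradient of the loss with respect to R
    R -= learning_rate * gradient

print(round(float(((X @ R - Y) ** 2).sum() / m), 6))  # close to zero on this toy data

# Translate one English vector and search for the most similar French vector
query = X[0] @ R
sims = (Y @ query) / (np.linalg.norm(Y, axis=1) * np.linalg.norm(query))
print(int(np.argmax(sims)))  # expected: 0, the aligned French word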
Create Word Embeddings (Corpus)
▪ To create word embeddings we always need:
– a corpus of text, and
– an embedding method.
▪ The corpus contains the words we want to embed, organized in the same way as they would be used in the context of interest.
▪ A simple vocabulary list of common words would not be enough to create embeddings.
▪ The context of a word tells you what type of words tend to occur near that specific word. The context is important, as this is what gives meaning to each word embedding.

Create Word Embeddings (Methods)
▪ There are many possible methods for learning word embeddings.
▪ The machine learning model performs a learning task, and the main by-product of this task is the word embeddings.
– For example, the task could be to learn to predict a word based on the surrounding words in a sentence of the corpus, as in the continuous bag-of-words approach.
▪ Word embeddings can be tuned by a number of hyperparameters, just like any machine learning model.
– E.g., the dimension of the word embedding vectors.

Create Word Embeddings (Transformation)
▪ The contents of the corpus must first be transformed into a suitable mathematical representation, from words into numbers.
– E.g., one-hot vectors.

Basic Word Embedding Methods
▪ Word2vec (Google, 2013) uses a shallow neural network to learn word embeddings. It has two model architectures:
– Continuous bag-of-words (CBOW): the objective of the model is to learn to predict a missing word given the surrounding words.
– Continuous skip-gram / skip-gram with negative sampling (SGNS): the objective of the model is to learn to predict the words surrounding a given input word.
▪ Global Vectors (GloVe) (Stanford, 2014): factorizes the logarithm of the corpus's word co-occurrence matrix, similar to the count matrix we have used before.
▪ fastText (Facebook, 2016) is based on the skip-gram model and takes into account the structure of words by representing each word as an n-gram of characters. It supports out-of-vocabulary (OOV) words.
– fastText word embedding vectors can be averaged together to make vector representations of phrases and sentences.

Advanced Word Embedding Methods
▪ These methods use advanced deep neural network architectures to refine the representation of a word's meaning according to its context.
▪ In the previous models, a given word always has the same embedding. In these more advanced models, words have different embeddings depending on their context.
▪ Deep learning, contextual embeddings:
– BERT (Google, 2018)
– ELMo (Allen Institute for AI, 2018)
– GPT-2 (OpenAI, 2018)
▪ We can find off-the-shelf pretrained models on the Internet.
▪ We can use our own corpus to generate high-quality, domain-specific word embeddings.

Continuous Bag-of-Words Model (CBOW)
▪ The objective of the task is to predict a missing word based on the surrounding words.
▪ If two unique words are both frequently surrounded by similar sets of words when used in various sentences, then those two words tend to be related in their meaning.

Create Training Data
▪ How do we use the corpus to create training data for the prediction task?
▪ To train the model, we need a set of examples, and each example is made of context words and the center word to be predicted.
▪ C is called the half size of the context; it is a hyperparameter of the model.
▪ The window consists of the center word plus the context words, so its size is 2C + 1.

Create Training Data
▪ By sliding the window we create a set of examples, and each example is made of the context words and the center word to be predicted (see the sketch below).

CBOW
▪ As you can see in the model architecture from the original paper, the context words are the inputs and the center words are the outputs.
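A minimal sketch of this sliding-window extraction, assuming a toy tokenized sentence and half size C = 2; the helper name get_training_examples and the example sentence are illustrative.

```python
# Build (context words, center word) training pairs for CBOW with half size C
def get_training_examples(tokens, C=2):
    for i in range(C, len(tokens) - C):
        context = tokens[i - C:i] + tokens[i + 1:i + C + 1]  # the 2C surrounding words
        center = tokens[i]                                   # the word to be predicted
        yield context, center

corpus = "i am happy because i am learning".split()
for context, center in get_training_examples(corpus, C=2):
    print(context, "->", center)
# ['i', 'am', 'because', 'i'] -> happy
# ['am', 'happy', 'i', 'am'] -> because
# ['happy', 'because', 'am', 'learning'] -> i
```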
Transform words into vectors
▪ To transform the context words into one single vector, we can do the following.
▪ We start with one-hot vectors for the context words and transform them into a single vector by taking their average. As a result, we end up with vectors that we can use for training.

Final prepared training set

Architecture of the CBOW model
▪ The continuous bag-of-words model is based on a shallow dense neural network with an input layer, a single hidden layer, and an output layer.
– The input of the model is the vector of context words.
– The output is the vector of the predicted center word.
– The size of these vectors is equal to the size of the vocabulary.
– The size of the hidden layer is equal to the size of the word embedding (the size, or dimension, of the word embeddings is a hyperparameter of the model).

Architecture of the CBOW model
▪ We have an input, x, which is the average of all context vectors.
▪ We multiply it by W1 and add b1.
▪ The result goes through a ReLU function to give us the hidden layer.
▪ That layer is then multiplied by W2, and we add b2.
▪ The result goes through a softmax, which gives us a distribution over the V vocabulary words (see the sketch at the end of these notes).

Word Embeddings
▪ We will derive the word embeddings from the weight matrices of this neural network.

Extracting Word Embedding Vectors
▪ There are several options for extracting word embeddings after training the continuous bag-of-words model. We can use W1.
▪ If we use W1, each column corresponds to the embedding of a specific word.

Extracting Word Embedding Vectors
▪ We can also use W2.
▪ The final option is to take an average of both matrices.

Gensim
▪ Gensim is an open-source library implemented in Python.
– Gensim includes streamed, parallelized implementations of the fastText, word2vec and doc2vec algorithms.
▪ Word2Vec model parameters:
– vector_size: dimensionality of the word vectors.
– window: maximum distance between the current and predicted word within a sentence.
– min_count: used for pruning the internal dictionary; words that appear only once or twice in a billion-word corpus are probably uninteresting typos and garbage.
– sg: training algorithm; 1 for skip-gram, otherwise CBOW.

Gensim
▪ model.wv: this object essentially contains the mapping between words and embeddings. After training, it can be used directly to query those embeddings in various ways (see the usage sketch at the end of these notes).

GloVe
▪ GloVe was developed as an open-source project at Stanford and was launched in 2014.
▪ It was designed as a competitor to word2vec, and the original paper noted multiple improvements of GloVe over word2vec.
▪ As of 2022, both approaches are considered outdated: contextual models such as ELMo and Transformer-based models such as BERT, which add multiple neural-network attention layers on top of a word-embedding model similar to word2vec, have come to be regarded as the state of the art in NLP.
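To tie the CBOW architecture above together, here is a minimal NumPy sketch of the forward pass (average of one-hot context vectors, then W1 and b1, ReLU, W2 and b2, softmax) and of reading an embedding out of a column of W1. The toy vocabulary, sizes, helper names and the random initialization are illustrative, not the lecture's values.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["i", "am", "happy", "because", "learning"]
word_to_index = {w: i for i, w in enumerate(vocab)}
V = len(vocab)          # vocabulary size
N = 3                   # embedding dimension (hyperparameter)

# Parameters of the shallow network (random initialization for illustration)
W1, b1 = rng.normal(size=(N, V)), np.zeros((N, 1))
W2, b2 = rng.normal(size=(V, N)), np.zeros((V, 1))

def one_hot(word):
    v = np.zeros((V, 1))
    v[word_to_index[word]] = 1.0
    return v

def cbow_forward(context_words):
    x = np.mean([one_hot(w) for w in context_words], axis=0)  # average of one-hot vectors
    h = np.maximum(0, W1 @ x + b1)                            # hidden layer with ReLU
    z = W2 @ h + b2
    return np.exp(z) / np.exp(z).sum()                        # softmax over the V words

probs = cbow_forward(["i", "am", "because", "i"])  # context around the missing word
print(vocab[int(np.argmax(probs))])                # the (untrained) model's guess

# After training, each column of W1 would hold the embedding of one word
print(W1[:, word_to_index["happy"]])
```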
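And a minimal Gensim usage sketch, following the Word2Vec parameters listed in the Gensim part above. The toy sentences are illustrative; a real corpus would be much larger.

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences
sentences = [
    ["i", "am", "happy", "because", "i", "am", "learning"],
    ["i", "am", "learning", "text", "analytics"],
]

model = Word2Vec(
    sentences=sentences,
    vector_size=50,   # dimensionality of the word vectors
    window=2,         # maximum distance between the current and predicted word
    min_count=1,      # keep even rare words in this tiny corpus
    sg=0,             # 0 = CBOW, 1 = skip-gram
)

# model.wv maps words to embeddings and supports similarity queries
print(model.wv["learning"].shape)          # (50,)
print(model.wv.most_similar("learning"))   # nearest words by cosine similarity
```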