Lecture 3

Part I: STRUCTURED INPUTS

1 DESIGNING FEATURES

We want to encode an input object x into a numerical value. To do so, we must define the features which distinguish this object from other objects. We will focus on designing features for objects which can have rich representations h(x). Such objects include text documents, images, DNA sequences, molecules, and so on.

1.1 encoding sequences

In this section, we learn how to convert sequences into numbers that computers can work with. This process helps us discover valuable insights and patterns hidden within these complex structures.

1.1.1 Text Sequences

Text documents are an example of sequential data. Computers cannot interpret words the way humans do and need a numerical representation of the text. The following steps have proven effective in converting sequences of strings into a numerical representation:

Step 1: Tokenize the sequence. This means splitting the string into a sequence of tokens/words.

Step 2: Build a vocabulary of all the tokens which appear in the training data. This means putting all the tokens in a set. It is sometimes helpful to order the tokens in the vocabulary by frequency.

Step 3: Encode the tokens in the initial text numerically. We can replace each token in the initial input object x with the index of that token in the vocabulary. This leads to a numerical representation of each text document.

Note 1.1.1 Text sequences have different lengths, but all text input representations must map to the same dimension in feature space.

Encoding text with a bag-of-words approach: each text document is represented as a vector of the length of the vocabulary, where each entry counts the number of times the corresponding vocabulary token appears in the input sequence.

Note 1.1.2 The bag-of-words approach leads to a sparse representation of each input sequence, because most words in the vocabulary do not appear in any given input text.

Encoding text with a zero-one approach: each text document is represented as a vector of the length of the vocabulary, where each entry is 1 if the corresponding vocabulary token appears in the input and 0 otherwise.

Encoding text with a normalized frequencies approach: each text document is represented as a vector of the length of the vocabulary, where each entry is the proportion of appearances of the corresponding vocabulary token in the input sequence.

Note 1.1.3 The disadvantage of the bag-of-words, zero-one and normalized frequencies approaches is that they do not preserve the word order in a sentence, yet a different word order can give a sentence a different meaning.

1.2 preserving sequential order

We introduce the notion of n-grams, which take n consecutive tokens and treat them as a single unit. For example, bigrams consider pairs of consecutive words as a single vocabulary unit, trigrams treat groups of three consecutive tokens as a single vocabulary unit, and so on. This method allows for capturing some local order.
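To make Steps 1-3 and the different vector encodings concrete, here is a minimal Python sketch. The toy corpus, the function names and the frequency-ordered vocabulary layout are illustrative assumptions, not part of the lecture notes:

```python
from collections import Counter

def tokenize(text):
    # Step 1: split a string into lowercase word tokens.
    return text.lower().split()

def build_vocabulary(documents):
    # Step 2: collect every token seen in the training data,
    # ordered by frequency (most frequent first), and map it to an index.
    counts = Counter(tok for doc in documents for tok in tokenize(doc))
    return {tok: idx for idx, (tok, _) in enumerate(counts.most_common())}

def encode(doc, vocab, mode="bow"):
    # Step 3: turn a document into a fixed-length vector over the vocabulary.
    vec = [0.0] * len(vocab)
    tokens = tokenize(doc)
    for tok in tokens:
        if tok in vocab:
            vec[vocab[tok]] += 1.0                       # bag-of-words: raw counts
    if mode == "zero-one":
        vec = [1.0 if v > 0 else 0.0 for v in vec]        # presence/absence
    elif mode == "normalized":
        total = max(len(tokens), 1)
        vec = [v / total for v in vec]                    # relative frequencies
    return vec

def bigrams(doc):
    # n-grams with n = 2: pairs of consecutive tokens treated as single units.
    toks = tokenize(doc)
    return list(zip(toks, toks[1:]))

docs = ["the cat sat on the mat", "the dog sat"]          # toy training corpus
vocab = build_vocabulary(docs)
print(encode("the cat sat", vocab, mode="bow"))
print(encode("the cat sat", vocab, mode="normalized"))
print(bigrams("the cat sat on the mat"))
```

The same encode function could be reused over a vocabulary of bigrams instead of single tokens, which is one way to capture the local order discussed above.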
Note 1.2.1 On top of the features we have defined, we can add many other static features, such as:

length features: the number of words/characters/sentences in the input x.

lexicon count features: the number of times a word from an external lexicon appears in the input x, often used in computational social science.

measures of complexity features: the average number of characters per word and the average number of words per sentence, which often correlate with reading speed.

Always try to include these sorts of hand-crafted features, because they can add more value to your model than you might think.

1.2.1 DNA Sequences

DNA, RNA and proteins are the main structures of living organisms that we attempt to analyze with a model. DNA is a sequence composed of four nucleotides: A, C, G and T. These sequences are often analyzed for their biological significance, such as in gene regulation studies or DNA motif analysis. DNA has a blueprinting type of role. RNA is also a sequence, represented by A, C, G and U; its role is to catalyze reactions and control different processes in the organism. mRNA (messenger RNA) triggers the synthesis of certain proteins, which then go on to do the work instructed by the mRNA. For more details about how the three structures interact with one another and their behaviour, check https://cm.jefferson.edu/learn/dna-and-rna/.

To encode DNA, we can use the same techniques as for text encoding in section 1.1.1, where the letters A, C, G and T are treated as words. However, in computational biology we refer to n-grams as k-mers: a k-mer is a substring of length k that appears in a DNA sequence.

Note 1.2.2 A 6-mer is also called a hexamer.
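As a small illustration of k-mers, the sketch below enumerates and counts all k-mers of a DNA string; the toy sequence and helper names are made up for this example:

```python
from collections import Counter

def kmers(sequence, k):
    # All overlapping substrings of length k in the DNA sequence,
    # the biological analogue of n-grams over the alphabet {A, C, G, T}.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def kmer_counts(sequence, k):
    # Bag-of-words style representation: how often each k-mer appears.
    return Counter(kmers(sequence, k))

dna = "ACGTACGTGA"          # toy sequence, for illustration only
print(kmers(dna, 3))        # ['ACG', 'CGT', 'GTA', ...]
print(kmer_counts(dna, 3))  # e.g. Counter({'ACG': 2, 'CGT': 2, ...})
```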
On the other hand, proteins have different levels of structure, and we cannot always use the same encoding method as in the DNA case. The primary structure of a protein is a sequence, where each element is an amino acid given by a three-letter abbreviation code. For this type of structure we can use the encoding methods seen previously. Higher-order structures involve complicated folding due to interactions between (chunks of) amino acids. We can encode higher-order protein structures with a graph whose edges correspond to interactions within the protein.

1.3 encoding graphs

1.3.1 Molecules as Graphs

Main idea: represent each atom as a node, and a bond between two atoms as an undirected edge. We can map the notion of bigrams to graphs: any two adjacent nodes can be seen as a bigram, so the edges of the graph can be seen as a generalization of bigrams. Similarly, trigrams in graphs overlap many more times. In graphs we might even obtain n-grams which are not sequential but instead have, for example, the shape of a triangle. Generalizing n-grams to graphs amounts to taking all possible combinations of n nodes; the number of possible combinations is higher than in the sequential scenario.

1.4 encoding trees

Trees can be seen as a special case of graphs, but their specific cycle-free structure can lead to some more interesting encodings of data.

1.4.1 Modelling Internet Conversations with Trees

In a conversation, each message is already structured text; however, there is still some structure between messages. The main idea is to encode this between-message structure using prompt and response pairwise features.

In the paper http://richardcolby.net/writ2000/wp-content/uploads/2020/10/tan-et-al-2016-winning-arguments.pdf, the authors aim to understand the mechanism behind persuasion in online discussions, using the Reddit discussion tree /ChangeMyView as a data source. They use features computed from a single message, as well as a pairwise feature construction for comparing two similar arguments that contest the same original opinion but where only one is successful in changing someone's view. Check the paper for more information about how the features are encoded in this case.

Note 1.4.1 Practical tricks for handling discrete features:

- Add a $ as a placeholder feature when computing the in-between features between two pieces of text, so as to obtain unary as well as bigram relations.

- Use a hashing trick to build the vocabulary efficiently: for example a count-min sketch, doing one pass over the data set and pruning the rare features.

- Instead of building a vocabulary and assigning a number to every feature, one can use a hashing function to assign a number to every feature.

1.5 encoding grids

1.5.1 Images as Grids

Images can be seen as 3D tensors: grids of pixels, where each pixel value is a vector with C channels, often referring to the red, green and blue colour intensities. An image has a width W and a height H, so each image is represented in a space of size W × H × C.

In the structured world, images can be seen as a collection of image patches: a block of neighbouring pixels carries more information than a single pixel alone. Another idea is to define filters and view the image through patches, applying a soft matching between the filter and the original image. This is referring to convolution, on which you can find more details here.

Note 1.5.1 Images are usually represented as 3D tensors. Mathematically, we can treat tensors as if they were vectors by flattening them. This gives the Frobenius inner product: for $P, F \in \mathbb{R}^{w \times h \times c}$,

\[ P \cdot F = \sum_{i} \sum_{j} \sum_{k} p_{i,j,k}\, f_{i,j,k} \]

Here F refers to the filter window and P refers to the portion of the original image where the filter is applied; both have the same dimensions. The index i runs over the width dimension, j over the height dimension and k over the channel dimension. This product is also known as element-wise multiplication followed by a sum, and it is used in convolution.

Depending on your application and needs, you will prefer different features for your learning task. For small filters, you would want to define those which can extract edges and corners. Such features do not say much about the objects in the image, but they can still be valuable and add knowledge to your model. You can use dictionary learning to learn optimal fixed-size filters from the most representative edges and corners, or you can define the filters by hand. For example, the Sobel filters extract horizontal and vertical edges:

\[ G_{\text{horizontal}} = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix}, \qquad G_{\text{vertical}} = \begin{pmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{pmatrix} \]

For larger filter patches, you must define a criterion to select only the interesting ones (e.g. using clustering). Deep learning is the method that allows such features to be learned.
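To make the Frobenius inner product and the Sobel filter concrete, here is a minimal pure-Python sketch. It uses a single-channel toy image and hypothetical helper names (none of this code is from the lecture), and it slides the horizontal Sobel filter over every 3×3 patch, i.e. the element-wise multiply-and-sum used in convolution:

```python
# Single-channel (grayscale) sketch: the Frobenius inner product of a 3x3
# image patch P with a 3x3 filter F is sum over i, j of P[i][j] * F[i][j].

G_HORIZONTAL = [[-1, 0, 1],
                [-2, 0, 2],
                [-1, 0, 1]]

def frobenius(P, F):
    # Element-wise multiply and sum over the patch and the filter window.
    return sum(P[i][j] * F[i][j]
               for i in range(len(F)) for j in range(len(F[0])))

def apply_filter(image, F):
    # Slide the filter over every position where it fully fits in the image
    # (a stride-1, "valid" convolution/cross-correlation).
    fh, fw = len(F), len(F[0])
    h, w = len(image), len(image[0])
    out = []
    for top in range(h - fh + 1):
        row = []
        for left in range(w - fw + 1):
            patch = [r[left:left + fw] for r in image[top:top + fh]]
            row.append(frobenius(patch, F))
        out.append(row)
    return out

# Toy 4x4 grayscale image with a vertical edge between columns 1 and 2.
image = [[0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9]]
print(apply_filter(image, G_HORIZONTAL))  # large responses along the edge
```

A real pipeline would typically use a library convolution routine and handle all C channels, but the patch-by-patch inner product is the same underlying operation.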