Lecture 4 PDF - Feature Learning with Convolutions

Summary

Lecture 4 presents notes on feature learning with convolutions: 1-d convolutions, embeddings, sequence convolutions, stacking convolution layers with pooling, 2-d convolutions, and practical considerations such as padding and striding. The notes include worked examples and the relevant shape formulas.

Full Transcript

Lecture 4: Feature Learning with Convolutions

So far, we have seen how to handcraft features. This chapter introduces the capability of deep learning to learn these features on its own for structured data. Refer to section ?? for a quick recap on deep models.

Note 1.0.1 The last layer z_m of the network is a rich representation of the input x.

1.1 1-d convolutions

Recall that a dense linear layer is of the form Wz + b, so the input and output dimensions must be fixed:

shape(W) = (d_out, d_in), shape(z) = (d_in, 1), shape(b) = (d_out, 1)

Convolution refers to learning smaller linear layers which can be slid along an input of variable size.

1.1.1 Example

Slide the filter k over the input x and take the dot product between the filter and the current window in the input. For example:

k = (1/3, 1/3, 1/3), x = (.1, .2, .3, .1, .1, .1, .3, .2, .1)

This filter k slid over the input x computes the average of every three points in the input, acting as a smoothing filter, resulting in:

y = (.2, .2, .17, .1, .17, .2, .2)

Different filters will have different effects, for example peak-detector filters and value-detector filters. The values in the output show how activated the filter was at each position.

Note 1.1.1 The output has a different size than the input because in this case one can only start applying the filter at the second element of the input and must stop at the element before the last one. Therefore, the final output has two elements fewer than the original input. A solution is to pad the input vector x with a number of 0s on the left and right such that the output vector has the same size as the input after applying the filter k.

1.2 Embeddings

Neural networks perform continuous operations. However, we saw cases where the data is discrete (e.g. tokens for a task using sequential data) and we want to use it to perform ML tasks. Therefore, we need to embed the inputs into a continuous form. This means that each discrete input gets a vector representation. The embeddings can be fixed or learned as model parameters by the neural network.

Embedding a sequence means constructing a sequence of token embeddings retrieved from the embedding matrix E ∈ R^(|V|×d). Here |V| is the size of the vocabulary (so every token has an embedding) and d is the size of the embedding (which can be fixed or learned). If the sequence x consists of L tokens, the sequence representation has shape:

shape(x) = (L, d)

We need to combine these separate token representations so that we obtain one vector representation/embedding for the whole sequence. We can use different aggregation techniques for the token embeddings, such as averaging:

z = (1/L)(z_1 + ... + z_L)

It has also been shown that taking the coordinate-wise max of the token embeddings can be a useful final sequence representation:

[z]_j = max([z_1]_j, ..., [z_L]_j)

Note 1.2.1 Taking the max coordinate-wise means that for every column of the matrix x, with shape(x) = (L, d), one takes the max over that column. More specifically, there are d columns and L values to compare when taking the max for each column.

Note 1.2.2 Yet again, the order of the tokens has no effect on the final representation of the sequence.
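To make the examples of sections 1.1.1 and 1.2 concrete, here is a minimal NumPy sketch (not part of the original notes; the array values follow the lecture's example, the vocabulary size, token ids, and variable names are illustrative). It slides the averaging filter k = (1/3, 1/3, 1/3) over x, shows the effect of zero-padding, and builds mean- and max-aggregated sequence embeddings from a toy embedding matrix E.

```python
import numpy as np

# --- 1.1.1: sliding an averaging filter over a 1-d input ---
k = np.array([1/3, 1/3, 1/3])                       # smoothing filter
x = np.array([.1, .2, .3, .1, .1, .1, .3, .2, .1])  # input signal

# One dot product per window of size len(k), so the output has
# len(x) - len(k) + 1 = 7 entries.
y = np.array([k @ x[i:i + len(k)] for i in range(len(x) - len(k) + 1)])
print(np.round(y, 2))    # [0.2 0.2 0.17 0.1 0.17 0.2 0.2], matching the notes

# Zero-padding one element on each side keeps the output length equal to len(x).
x_pad = np.pad(x, 1)
y_same = np.array([k @ x_pad[i:i + len(k)] for i in range(len(x))])
print(y_same.shape)      # (9,)

# --- 1.2: embedding a discrete sequence and aggregating it ---
V, d = 10, 4                      # toy vocabulary size |V| and embedding size d
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))       # embedding matrix E in R^(|V| x d)

tokens = np.array([3, 1, 4, 1, 5])    # a sequence of L = 5 token ids
Z = E[tokens]                         # shape (L, d): one embedding per token

z_mean = Z.mean(axis=0)           # z = (1/L)(z_1 + ... + z_L)
z_max = Z.max(axis=0)             # [z]_j = max_i [z_i]_j, coordinate-wise max
print(z_mean.shape, z_max.shape)  # (4,) (4,)
```

Both aggregations give the same result for any permutation of the tokens, which is exactly the point of Note 1.2.2.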
1.2.1 Sequence convolutions

Let's define a filter K of shape:

shape(K) = (d, k)

Applying such a filter to a sequence embedding means taking the dot product, vertically, of the filter K with k consecutive token embeddings at each position in the input. The input sequence has L embedded tokens of dimension d.

Note 1.2.3 The dot product here means summing the element-wise products of the filter with the (d, k)-shaped window of the input.

To obtain a richer representation, one can apply m filters of shape (d, k) in parallel. These filters are learned parameters of our network. This can be seen as a convolution layer which maps the sequence represented with shape (L, d) to a sequence of shape (L, m), i.e. it maps the initial dimension d to a hidden dimension m. You can think of each filter as an n-gram to be matched at various positions in the sequence.

1.2.2 Stacking convolution layers

One step further is to alternate convolutions with max pooling over small windows. Hidden representations go from finer in the initial layers to coarser in the later layers. Figure 1.1 shows how the different layers interact with one another and how the dimensions change between steps.

Figure 1.1: Visual explanation of dimensions when stacking convolution layers, with max pooling at the end.

Note 1.2.4 Observe how in figure 1.1 the final output layer after max pooling has dimension (L/2, d), because a max pooling of two was used in this case.

1.3 2-d convolutions

This is a special convolution for grids, where the filter is slid left-to-right, top-to-bottom. It is common practice to alternate convolution layers with pooling layers. Compared to the sequential filters, the filters here are square windows applied to grids (images). As we add more convolution layers, the representations become more and more abstract and global.

1.4 Practical considerations of convolutions

Convolutions work great when the phenomenon to detect is fairly local. Capturing more global phenomena requires stacking deeper convolution layers.

Striding makes the computation a bit more efficient and helps avoid some redundancies in the input.

To compute representations of every position or component in a structure, there are two options:

– use no pooling and no striding, which keeps the representations quite local;
– down-sample by applying pooling and striding down to a certain lower dimension and then up-sample again using transpose convolutions (where there are more outputs than inputs) back to a higher dimension. This strategy is used for semantic image segmentation and is known as U-Net.

Note 1.4.1 Striding means skipping a few points in the input when applying a filter.
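The following NumPy sketch (illustrative code, not from the notes; seq_conv, max_pool, and the toy sizes are my own names and choices) applies m filters of shape (d, k) to an embedded sequence and then max-pools with a window of two, reproducing the shape bookkeeping of sections 1.2.1 and 1.2.2 and the striding idea of section 1.4.

```python
import numpy as np

def seq_conv(Z, filters, stride=1):
    """Slide m filters of shape (d, k) over a sequence Z of shape (L, d).

    Each output value is the 'vertical' dot product of one filter with a
    window of k consecutive token embeddings; stride > 1 skips positions
    (section 1.4). Returns an array of shape (num_positions, m).
    """
    L, d = Z.shape
    m, d_f, k = filters.shape
    assert d == d_f, "filter depth must match the embedding dimension"
    positions = range(0, L - k + 1, stride)
    # For each window, sum of element-wise products with every filter.
    return np.array([[np.sum(f * Z[i:i + k].T) for f in filters] for i in positions])

def max_pool(H, window=2):
    """Coordinate-wise max over non-overlapping windows along the sequence axis."""
    L = (H.shape[0] // window) * window
    return H[:L].reshape(-1, window, H.shape[1]).max(axis=1)

rng = np.random.default_rng(0)
L, d, k, m = 12, 4, 3, 8                 # sequence length, emb. dim, filter width, number of filters
Z = rng.normal(size=(L, d))              # embedded sequence, shape (L, d)
filters = rng.normal(size=(m, d, k))     # m filters of shape (d, k); learned in a real network

H = seq_conv(Z, filters)                 # (L - k + 1, m) = (10, 8); with padding it would be (L, m)
P = max_pool(H, window=2)                # (5, 8): roughly (L/2, m), as in figure 1.1
S = seq_conv(Z, filters, stride=2)       # (5, 8): striding halves the number of positions instead
print(H.shape, P.shape, S.shape)
```

Stacking more (convolution, pooling) pairs on top of P would continue to halve the sequence dimension, which is the finer-to-coarser progression described in section 1.2.2.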
