
24-Neural-Network-Basics.pdf

Full Transcript

Basic Multi-Layer Perceptron (MLP)
Quick recap of neural networks.

Tasks & Requirements
Deep Learning uses MLPs (and variations thereof) with large numbers of layers. These approaches achieve state-of-the-art performance in:
▶ Machine translation
▶ Language modelling
▶ Question answering
▶ Dialogue state tracking & response generation
▶ Classification
▶ Sentiment analysis
▶ POS-tagging
To solve these tasks, neural networks (NN) must extract useful features, receive/produce variable-length inputs/outputs, and balance expressiveness (#trainable parameters) against data size.

Recurrent Neural Networks (RNN)
A recurrent neural network uses hidden states to pass information from previous inputs into the current calculation. RNNs are generally used in conjunction with MLPs (e.g., as encoders).
▶ input: sequential data
▶ output: fixed-size vector summarizing the data
▶ allows abandoning the Markov assumption
▶ hard to train (exploding/vanishing gradients) [BeSiFr94; Hoch91]

RNN Architectures
Depending on our task, we may align inputs and outputs differently, e.g., many-to-many ("encoder-decoder"). Example: machine translation, sentence to sentence.

Generating Text from Recurrent Networks
The network produces "softmax" vectors over possible words, but we can only output one word at a time. ⇒ We get more consistent results if we feed the chosen word back into the decoder. Use a special start token to begin generation and a stop token to end it. Note: if the network had generated "a" instead of "an", the example would have been bad.

Long Short-Term Memory (LSTM) [HoSc97]
LSTM stabilizes the gradient flow by introducing different gates in the hidden units.

Gated Recurrent Unit (GRU) [CGCB14; CMBB14]
GRUs have gates that control what information to update and what to forget. (Figure: GRU cell; its output is the hidden state.)

Sequence-to-Sequence [SuViLe14]
Developed for neural machine translation.
Problem: "Die Wandleuchtenhalterung war rostig." (4 words) ⇒ "The wall light fixture was rusty." (6 words)
While deep NNs are very expressive, they usually require fixed-size inputs/outputs. This is not suitable for tasks where these sizes are not known in advance (e.g., machine translation, speech recognition).
Solution: encoder-decoder architecture, e.g., Sequence-to-Sequence.

Sequence-to-Sequence [SuViLe14] /2
Seq2Seq uses a straightforward application of LSTM, i.e., no changes to the basic architecture of the LSTM cells. The main contributions are:
▶ Use of 2 LSTMs for encoding/decoding (more parameters ⇒ more expressiveness)
▶ Deep LSTMs (4 layers)
▶ Reversed order of input tokens, to shorten the paths between the occurrence of a word in the input and in the output (⇒ helps gradient flow)

Embeddings from Language Models [PNIG18]
Since 2013, distributional word embeddings (word2vec [MCCD13], GloVe [PeSoMa14]) have been widely used in NLP. They learn vector representations of words based on corpus statistics.
Problem:
"Let's stick to AI." (to continue to adhere to sth.) ⇒ embedding(stick) = [2.3, 5.4, 7.7]
"The dog ran to get the stick." (a thin piece of wood) ⇒ embedding(stick) = [2.3, 5.4, 7.7]
Encoding each word into one vector conflates the different meanings of a word.
Solution: learn contextualized word representations, e.g., ELMo [PNIG18]:
▶ Bidirectional LSTM trained with a LM objective
▶ Deep representations: linear combination of hidden-layer outputs
▶ Higher-level LSTM: context-dependent information
▶ Lower-level LSTM: syntax

Embeddings from Language Models [PNIG18] /3
(Figure only. Image credit: Alammar, Jay (2018). The Illustrated Transformer, CC BY-NC-SA 4.0)
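
To make the RNN/LSTM slides more concrete, here is a minimal PyTorch sketch (not from the slides; all module names and sizes are illustrative assumptions) of an LSTM encoder that reads a variable-length token sequence and returns a fixed-size summary vector, as described on the RNN slide.

```python
import torch
import torch.nn as nn

class LSTMEncoder(nn.Module):
    """Reads a variable-length token sequence, returns a fixed-size summary vector."""
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len); seq_len may differ between calls
        x = self.embed(token_ids)
        _, (h_n, _) = self.lstm(x)   # h_n: (num_layers, batch, hidden_dim)
        return h_n[-1]               # last hidden state = fixed-size summary

encoder = LSTMEncoder()
summary = encoder(torch.randint(0, 1000, (2, 7)))  # two sequences of length 7
print(summary.shape)                                # torch.Size([2, 128])
```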
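The "Generating Text from Recurrent Networks" slide describes feeding the generated word back into the decoder between a start and a stop token. The sketch below shows that greedy decoding loop; the decoder structure, the BOS/EOS token ids, and all sizes are assumptions, not part of the slides.

```python
import torch
import torch.nn as nn

BOS, EOS, VOCAB, HID = 1, 2, 1000, 128   # hypothetical token ids and sizes

embed = nn.Embedding(VOCAB, HID)
cell = nn.LSTMCell(HID, HID)
to_vocab = nn.Linear(HID, VOCAB)

def greedy_decode(summary, max_len=20):
    """Generate one token at a time, feeding each prediction back as the next input."""
    h, c = summary, torch.zeros_like(summary)   # initialise from the encoder summary
    token = torch.tensor([BOS])                 # start token begins generation
    out = []
    for _ in range(max_len):
        h, c = cell(embed(token), (h, c))
        token = to_vocab(h).argmax(dim=-1)      # pick the most probable word from the "softmax" vector
        if token.item() == EOS:                 # stop token ends generation
            break
        out.append(token.item())
    return out

print(greedy_decode(torch.zeros(1, HID)))       # untrained weights -> arbitrary token ids
```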
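The GRU slide only names the gates; the standard update equations from [CGCB14] look like the following NumPy sketch (weights are random toy values, biases omitted for brevity).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    """One GRU step: update gate z decides what to keep, reset gate r what to forget."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h_prev)               # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)               # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))   # candidate state
    return z * h_prev + (1.0 - z) * h_tilde         # new hidden state = output

# Toy dimensions (assumptions, not from the slides): input size 3, hidden size 4.
rng = np.random.default_rng(0)
params = [rng.standard_normal(s) for s in [(4, 3), (4, 4), (4, 3), (4, 4), (4, 3), (4, 4)]]
h = np.zeros(4)
for x in rng.standard_normal((5, 3)):               # a length-5 input sequence
    h = gru_step(x, h, params)
print(h)
```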
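Finally, the ELMo slide describes deep, contextualized representations as a linear combination of hidden-layer outputs. The sketch below illustrates that idea only; it is not the original ELMo model, and the two-layer setup, sizes, and names are assumptions.

```python
import torch
import torch.nn as nn

class ContextualEmbedder(nn.Module):
    """Each word gets a different vector depending on its sentence context."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # Two stacked bidirectional LSTM layers, kept separate so their outputs can be mixed.
        self.layer1 = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        self.layer2 = nn.LSTM(2 * dim, dim, bidirectional=True, batch_first=True)
        self.mix = nn.Parameter(torch.zeros(2))      # learned per-layer mixing weights

    def forward(self, token_ids):
        h1, _ = self.layer1(self.embed(token_ids))   # lower layer: more syntactic
        h2, _ = self.layer2(h1)                      # higher layer: more context-dependent
        w = torch.softmax(self.mix, dim=0)
        return w[0] * h1 + w[1] * h2                 # linear combination of layer outputs

emb = ContextualEmbedder()
sentence = torch.randint(0, 1000, (1, 6))            # a 6-token sentence
print(emb(sentence).shape)                            # torch.Size([1, 6, 128])
```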
