transcript_BERT.txt
Full Transcript
00:25 Um, hey, can you just confirm if everything is working as expected, if you can see me and hear me, and also see the screen? We will get started in a minute or so; we'll give a couple of minutes for everyone to join in. Quick check: can everyone hear me, see me, and see the screen?
01:30 Okay, cool. So we'll get started at 9:02; let's give a little time for everyone to join. It's already 9:02, so probably we should give a minute more for everyone to join in; at 9:03 we will get started. You can see the agenda for today. We have today's class and also the class on Wednesday, so we have two classes, and since we have understood self-attention and spent a good amount of time on attention itself, understanding Transformers and BERT will be fairly straightforward. That's what we'll try and achieve
02:23 today and in the next class. In the next class we also wanted to walk you through some additional real-world problems and how to tackle them, so between the two classes we'll try and wrap up most of NLP. Since it's 9:03, let's dive into the discussion. Today's plan of action is to continue where we left off. We'll start with Transformers: we learned the Transformer encoder module, but we have not yet discussed the decoder module, which is slightly more
03:02 intricate than the encoder module. Once we understand the decoder module, we'll do end-to-end training; we'll see how end-to-end training of a Transformer works with some very simple diagrams. Once that is clear, I'll give you some pointers to Transformer code written by the TensorFlow team themselves; it's a very nice implementation and I would strongly recommend everyone to go through it. I'll share the pointer with you; there is a Colab notebook, etc. Basically the Transformer code is
03:34 exactly what we've discussed: whatever we have discussed, the same thing is implemented in code. I'll also walk you through solving a simple real-world problem using a pre-trained Transformer. So today I was thinking that first let's go through the architectures, because we have already seen tens of thousands of lines of code and the code is mostly straightforward. So let's understand Transformers in detail, then I'll point you to the Transformer code, and then we'll also understand the BERT
04:03 architecture. BERT is a special type of Transformer which is probably one of the most widely used architectures in the real world today. Of course, BERT as an architecture can be used for various NLP tasks, and we'll see how to modify the generic BERT architecture for various NLP tasks; these are very, very important. In case we have time, we will cover the code for BERT to solve a medical problem; in case we run out of time, we'll cover this in the next class. Does the agenda make sense? I would ideally want to cover the
04:37 Transformer architecture and decoder, the BERT architecture, BERT for various use cases, and then cover as much code as possible. In case we run out of time for code, we anyway have the next class, which will be fairly code-heavy in the sense that we will actually solve real-world problems using Transformers, and we also have additional problems that we wanted to bring to you. Does the agenda make sense for today and the next class? Good, good. So let's get started with Transformers.
Again, I'm using Jay Alammar's blog, one of the best blogs 05:08 that I know of, very visual. Of course, there are some details that are missing there; I'll also share the detailed notes that our team has created, with a lot of code and other material. Okay, so let's get going. Let me recap what we covered in the last class. This is all attention: we know attention, we know multi-head attention, which we covered in the class before last, and positional encoding we have also covered. Basically, at the end of the last class we were here: this is one encoder, literally.
05:40 What do I have? Just to recap: I have my words here, two words. Given the two words, I have representations of the words, typically word2vec or simple embeddings. Then I'm adding positional encoding, using the sinusoidal waves. So I'm adding the positional encoding, and then I get X1, X2, which I pass through a self-attention layer. The outputs of the self-attention layer, let's call them Z1 corresponding to X1 and Z2 corresponding to X2. Now I have a skip connection here. This skip connection,
06:12 what is it doing? I'm concatenating Z1 and Z2 into a matrix like this, and I have my original X, which is nothing but X1 and X2 stacked. I'm adding both of them and normalizing; this is called an add-and-normalize layer. After I add and normalize, I take the first row, which I'm calling Z1, and the second row, which I'm calling Z2. This I pass through a feed-forward dense connection, and then I add and normalize again, generating two outputs. So if you think of one encoder, literally what it's doing is taking
06:43 words and giving some d-dimensional representation for those words by encoding them. Again, there are two pieces to the puzzle; I got some very nice questions in the last class, so everybody understood the purpose of self-attention. If you break this down: we have self-attention; we have a skip connection, because of which we have the add part; the normalize is there to make sure that the inputs to the next layer are well normalized. Then we have a feed-forward layer on top of self-attention, because if you think
07:13 about it, self-attention is basically a matrix multiplication with a softmax, so if we want more non-linearity added before we pass the output on, we can have a feed-forward layer. And once we have a feed-forward layer, there is also this skip connection — remember this skip connection — so it's exactly add and normalize again, and we take it to the next stage. So if you break this down, the key components are: self-attention, number one; the second is the skip connections, this one and this one;
07:42 the third are the add-and-normalize layers after self-attention and after the feed-forward; and the fourth part is the feed-forwards, where the feed-forward processes Z1 and Z2 separately, position by position. Is the architecture of an encoder very clear? Because once this is clear, decoders become very easy. Is this crystal clear? Again, all of the components we already know.
08:17 Can you just confirm if you have understood the key pieces and why each of these pieces is there? We discussed this in the last class; I'm just quickly recapping it for you.
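To make the recap concrete, here is a minimal Keras-style sketch of one encoder layer, assuming hypothetical hyperparameters (d_model, num_heads, d_ff); it is not the lecturer's exact code, just the same four pieces: self-attention, skip connections, add-and-normalize, and the position-wise feed-forward.

```python
import tensorflow as tf

class EncoderLayer(tf.keras.layers.Layer):
    """One Transformer encoder layer: self-attention + add&norm + feed-forward + add&norm."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=d_model // num_heads)
        self.norm1 = tf.keras.layers.LayerNormalization()
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(d_ff, activation="relu"),  # adds non-linearity
            tf.keras.layers.Dense(d_model),
        ])
        self.norm2 = tf.keras.layers.LayerNormalization()

    def call(self, x):                                # x: (batch, seq_len, d_model)
        z = self.self_attn(query=x, value=x, key=x)   # self-attention: Q = K = V = x
        x = self.norm1(x + z)                         # skip connection + add-and-normalize
        f = self.ffn(x)                               # position-wise feed-forward
        return self.norm2(x + f)                      # second skip connection + add-and-normalize
```

Stacking several of these layers (for example 6, 12, or 20) gives the encoder stack described next.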
Okay, so with that recap, here is a typical architecture — if you see this, this is your whole Transformer. A Transformer consists of: this is my input, followed by positional encoding, then this is my first encoder. Taking the output of the encoder, I can keep stacking encoders, just like the way I can stack MLPs or I can stack 08:50 LSTMs. I can stack one encoder on top of another to add more complexity, more non-linearity, etc. This is called the encoder stack of a Transformer. Similarly, there is something called a decoder stack; we'll come to the decoder stack in a second. Here we see two encoders, but I could have 20 encoders stacked one on top of the other. The final output — remember I get an output from here — this whole output
09:22 goes into each of the decoders. This is my decoder one — again, I'll explain the decoder architecture — and this is decoder two; they did not draw decoder two in depth. So the output of the encoder block, which is sometimes referred to as the context, or called the context vectors or the context matrix, this whole context goes to each of the decoders. In this case we have two decoders: decoder one and decoder two. I will explain the inputs to the decoders in just a few minutes,
09:56 but I could have a stack of decoders. Decoders have a slightly more complicated structure than an encoder, which we will discuss. But imagine I have two decoders: at the end of it, I can just have a linear layer, which is basically matrix multiplication, and if I have a multi-class classification setup, I can just have a softmax. Again, we have not yet discussed how a decoder works internally, but is the broader architecture clear? Basically what we have is a stack of encoders; the final encoder
10:25 output is passed to each of the decoders, and then I can also stack up decoders; the final output from the decoders goes to my linear and softmax layers. Now we'll see how this part of the architecture works — this is something that is not yet completely explained, but I'll explain it step by step. Again, I'm following some of the brilliant diagrams from Jay Alammar's blog, which my team also collated; we'll share the notes with you. Now here is a very interesting idea — I really
10:52 love this diagram — so let's understand the decoding. Imagine I have a sequence-to-sequence problem: for example, I have an input sequence and I want to generate an output sequence. So let's assume my input is "je suis étudiant" — I think this is French — and let's assume we want to output the English translation of it. What happens is this input goes through the
11:25 encoders — look at the diagram — this whole input goes to the encoders, the encoders generate an output, this whole output is copied into key and value, and these keys and values are input to my decoder. Now, if you observe, this is the encoder stack and this is the decoder stack. Just to quickly recap: the output of the encoder stack goes in as keys and values to each of the decoders; first it goes to the first decoder, then to the next decoder,
11:57 then there is a linear layer and a softmax. And then there is a decoding time step, because a softmax only generates one output per time step, if you think about it.
So for the first time step, the softmax layer will generate the output — suppose I want to translate French to English, it will generate one output per time step. That's the whole idea behind it. Is this clear, how we can do French-to-English translation with decoding time steps, one at every unit of time? Because the softmax
12:32 can only generate one output at any point, which means you can generate one of the many words that we have in the vocabulary. So this is what happens at decoding time step one. Now comes the fun part. Remember, there is no input to the decoder right now, so at the very first time step I am taking all this input and just generating the first output. Now, what happens at the second, third, fourth, fifth, sixth decoding time steps? A slightly more complicated process
13:03 happens. This is a very nice diagram, so I'll just walk you through this simple animation. Suppose we are at the second time step — the first time step is over — let me just open this GIF here and walk you through it, because the animation is very nice. At the second time step, my whole encoder is not
13:35 running, because my encoder input is the same. In the first time step, to generate the output at the first time step, all this input passed through the encoder, I got keys and values which got passed to each of the decoders, all these decoders got executed, there was no other input to the decoder except the keys and values from the last encoder, and the decoder stack finally gave me one word at the first time step. At the second time step, this is where things get very interesting. Let me just refresh
14:08 this. At the second time step, look at this: my encoder is not running again. Whatever output I had from the encoders, the same output from the encoders is passed to each of my decoders — I could have two, three, four, five, as many decoders in the stack as I want. The encoder is not re-running; the encoder is run only once. That is a very critical thing that people often get confused by. Now, what happens because I'm at time step 2?
14:40 This means my previous output matters: at the first time step, when the first time step ran, the output — in this case the first output — is "I". So in the second time step, what happens is we input the previous output to this decoder. We input whatever previous outputs are there to these decoders, along with the keys and values, or the context; these keys and values are nothing but the context from the encoder stack. So the output of the encoder is given to each of the
15:18 decoders, along with all the previous words. What happens is, at the first time step it generated "I", so in the second time step "I" is given as an input and then we generate the next output, which is "am". In the third time step, in the fourth time step, everything that has been generated till now is given as an input to generate the next one. So if you think about the decoder itself — again, there are multiple decoders — each decoder has two inputs. One input is the context, which is
15:53 from the encoder: the last encoder gives you the key and value, and this is referred to as the encoder-decoder keys and values, which go from the encoder to the decoders; that is one input to each of the decoders. The other input is whatever the output is up to that time step — imagine we are generating a sequence of words, or tokens, here — so all of the previous outputs go in as input. Just to reiterate: the encoder is run only once; the decoder, if the output is K words, is run K times.
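As a rough sketch of this generation loop (not the lecturer's code; the encoder/decoder callables, start/end token ids, and max_len are hypothetical placeholders), greedy decoding looks something like this: the encoder runs once, and the decoder stack is re-run once per output token, each time consuming everything generated so far.

```python
import numpy as np

def greedy_decode(encoder, decoder, src_tokens, start_id, end_id, max_len=50):
    """Run the encoder once, then run the decoder once per generated token."""
    context = encoder(src_tokens)             # keys/values for encoder-decoder attention
    output = [start_id]                       # previously generated tokens
    for _ in range(max_len):
        logits = decoder(output, context)     # re-run the whole decoder stack
        next_id = int(np.argmax(logits[-1]))  # argmax over the vocabulary (greedy choice)
        output.append(next_id)
        if next_id == end_id:                 # stop at end-of-sentence
            break
    return output
```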
inputs to the decoder at time step one right imagine I’m generating my first output there is no there is nothing there is no input here right there is no input right or you 18:47 could create a special input called start right you could create sometimes this is also done in implementation you have a special input called start and you can give both again this symbol plus basically means position Limited right you say hey I only have one word which is which is a start token and you give this now if You observe this this architecture is very similar to what we have seen here you have a self-attention layer add and normalize this is exactly like what you have in an encoder right so again remember that in this 19:18 case whatever input you are giving at any time Step At first time step you give only one at second time step along with start you give the output of the first time step at time step three right because you’re generating one word at a time at a time step three you give start and whatever was the output whatever was the output at time step one output at time step two and so on so forth right all of them will be fed into self-attention followed by added normalized along with the skip connection so this architecture is 19:51 exactly same as this architecture this is standard but in addition to that there is a new addition in the encoder in the decoder architecture because the decoder has two inputs one which is the previous outputs it has two inputs right which is the previous time step outputs previous time step outputs is one input the second is the keys and values from the encoder right so this is key and value from the end these are called encoder decoder keys and values right so this V input so there is now an encoder decoder attention now what does the 20:27 encoder decode retention do it is exactly like attention but what what what happens in self-attention and self-attention key equals to query equals to value that’s what that’s what is self-attention right we saw this in the previous classes in an encoder decoder retention my query comes from the previous stages so this this is my query so let me write it clearly okay so okay in an encoder decoder attention my query is the output of the previous layer these these layer outputs this is q1 this is first first row of Q 21:02 This is second row of Q so my queries come from my previous layer my keys and values come from the encoder stack so it is not exactly a self attention but the key and value come from encoders right the query comes from the previous layer so it’s it’s not exactly self-attention but if you think about it carefully the purpose we have this the you might argue that hey why do I have two attentions here why do I need self-attention here right and why do I need encode or decode retention and the reason is very simple 21:40 the self-attention basically says given these previous outputs what should I generate the encoder decoder architecture says hey we have we have come to some very interesting way or interesting we have encoded our thinking machines the two words that we have we have encoded these in a very nice format can I use the input words here the for the input words I say hey whatever is the input words that I will pass it through these key and values here and whatever is the self-attention that I’ve done on previous time steps that 22:15 I’ll pass a square so both of these have a very important role to play once these two are done then I do simple add and normalize feed forward add and normalize output is a 
Is the decoder clear? Again, unlike an encoder, a decoder has two attentions: one is self-attention, the other is encoder-decoder attention, and both of them solve a very important task. Other than that, after every attention you have add-and-normalize. Apart from this encoder-decoder attention block, your encoder is the same as your decoder —
22:54 your encoder and decoder are exactly the same except for the encoder-decoder attention part, which is taking inputs from the encoder block, taking inputs from the previous outputs, and trying to come up with the next word based on all of these. Does the intuition on why we need two attentions, and the architecture of a decoder, make sense now? Again, if you have any questions, feel free to ask me. So keys and values come from here, the query comes from here, but remember that the key and value
23:31 that we get from here go to every decoder. So decoder 2 also has an input from the previous decoder, and it also takes the context vectors that you have from the input. If you have understood what a decoder is doing — any questions, please tell me, I'm happy to address them. Okay, let me ask precise questions then. Does everybody understand why we need two attentions — there is one attention here, and a second attention here? Does everybody understand why we
24:20 need two attentions in a decoder: one a self-attention, the other an encoder-decoder attention? Okay, so I'm assuming everybody understood it; if anybody has not understood, please tell me, I'm happy to explain. Now, has everybody understood how a decoder generates one output per time step? Is this understood? This is a very simple diagram. Has everybody understood how the decoder stack generates one word at a time, until of course you get the end-of-sentence?
25:03 Again, you can see this animation; it is very well done. Okay, so I'm assuming everybody has understood the architecture. This is literally what a Transformer is. Now, imagine I have to train this end to end — how do you think this works? Suppose I have some translation job, where I have an English sentence and I'm generating an output which is, let's say, French — I'm doing sequence to sequence. In
25:53 such a case, how do you think the whole end-to-end training works? Forward prop, I think everybody understood; now tell me, how does the model get trained? Any ideas? Please put your ideas in the chat window, because we are almost at the end of deep learning, so I'm assuming you would have understood how backprop works and how some of these ideas work. Somebody says "I am not able to use the chat"; I can't see the user name either. Hey, I'm not sure why you're not able to see the chat;
26:38 one second — I'm not sure why you're not able to use the chat, can you just raise your hand? It just says "user", it doesn't even tell me the name. "Hey, Srikanth, this is Jay Kumar." Yeah, again, I'm not even getting your name in the chat window here; I'm not sure what happened, but if you have a question, I'm happy to answer. "Not a question, I was trying to respond in your channel." Okay, do one thing: can you just close the window and join back in?
Does anyone else have this problem, that you're not able to interact in the chat? Because, Jay, I'm not able to see you here either; it just says "user" for me. "Okay, it could be me, I will try rejoining." Yeah, can you just rejoin, please? "Okay, sure." Okay, cool. So, can somebody help me with how the whole backprop works, how the whole training works? Imagine I have a bunch of English sentences and the corresponding French sentences — how do you think the whole translation model
28:00 gets trained? Within self-attention there are a bunch of weights: there is W_Q, there is W_K, there is W_V; again in the feed-forwards there are a bunch of weights; in the decoder's self-attention there are again W_Q, W_K, W_V; and the encoder-decoder attention also has corresponding weights. There are tons and tons of weights here. So how does the whole model get trained? That's the most interesting part here. Any ideas? Again, you can always raise your hand and we can discuss; I'm happy to do that.
28:41 Okay, so let's do one thing. The whole idea behind it is as follows — it's a very simple idea if you think about it, it's fairly basic. I have one output: there is a softmax which at every time step is generating one output. Suppose I have some word one which is generated in the first time step, word two, word three, and so on. For every word — look at this, I have an encoder stack and I have a decoder stack — this was generated at time step one, this was generated at decoding
29:15 step two, decoding step three, and so on. Corresponding to each of the outputs, I'll match it with the actual words and I can generate a loss — a loss for each time step. So there will be loss 1 at time step one, loss 2 at time step two, and so on, and I can collate all those losses. See, I'm running my encoder stack once, taking its output, and running my decoder stack as many times as needed. Imagine I have to run this till K, because the output has
29:48 K tokens or K words: until I get K words and the end-of-sentence, I'll keep running it, which means there will be a loss associated with each of these output words — because initially, when the model weights are not trained, the outputs will simply be random. And that loss I can backpropagate: through the softmax, through the linear layer, through the add-and-normalize. All these layers are backpropagatable, and because I have skip connections —
30:20 look at this, the decoder has three skip connections, the encoder has two skip connections — because I have skip connections, I can have like 20 encoders and 20 decoders and all of that will work perfectly well. And whatever loss we get, this loss will also go back this way, because in the forward propagation we are getting the keys and values this way, which means in backpropagation the gradient will go back along the same path. So literally every operation that
30:55 we have is differentiable, and because we have skip connections, we can train a very, very complex model with billions of parameters. Does that make sense, how the whole end-to-end training of a Transformer works? Any questions about the training piece itself? Again, it's very simple: if I have K words, I'll compute K losses and I'll just backpropagate, because everything is backpropagatable.
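As a rough sketch of that training objective (placeholder shapes, not the lecturer's code): one cross-entropy loss per output position, collated into a single loss, so that one backward pass updates every weight in both stacks.

```python
import tensorflow as tf

# Hypothetical shapes: batch of 1, K = 4 target tokens, vocabulary of 8000 words.
target_ids = tf.constant([[12, 845, 7, 2]])              # ground-truth words, one per time step
logits = tf.random.normal((1, 4, 8000))                  # decoder outputs before softmax

# One loss per time step (loss 1, loss 2, ..., loss K), then collate them.
per_step = tf.keras.losses.sparse_categorical_crossentropy(
    target_ids, logits, from_logits=True)                # shape (1, 4)
total_loss = tf.reduce_mean(per_step)

# total_loss is differentiable w.r.t. every weight (W_Q, W_K, W_V, feed-forward, ...),
# so a single optimizer step backpropagates through both the decoder and encoder stacks.
```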
Of course, there will be backpropagation from here also — from every decoder there will be backpropagation — and there are
31:32 tons of skip connections. Again, the whole point of skip connections is that even the weights inside this self-attention can easily be trained, because my gradient does not vanish even though I might have 10 decoders and 20 encoders. That's the reason why skip connections are so fundamental to both encoders and decoders. Okay, cool. Do you guys see a drawback of the encoder-decoder Transformer? Again, feel free to raise your hand or type in the chat. Do you see a fundamental drawback,
32:13 or multiple drawbacks? Come on, make an educated guess; it's okay to be wrong, because then I will know where you're lacking and I can help you correct it. What are the drawbacks? We have seen so many architectures till now — what do you see as the obvious drawbacks? I'll give you a minute or two, think about it. "It's computationally expensive." Sai, why is it computationally expensive? You're right about it, but why? "Too many parameters." Good, that's
33:24 true — too many parameters to train, I agree with you, and it's a very valid point. It has too many parameters because self-attention has a bunch of weights; add-and-normalize doesn't, but there are feed-forward networks, and imagine if I have 20 encoders and 20 decoders — it's going to be a crazy number of parameters. It's very easy to create a billion-parameter network with this; that's one of the problems, very valid. What else do you think is a problem? Good,
33:59 very good point. What else comes to your mind? Okay, if you see, the decoding happens per time step — there are decoding time steps. Do you see a problem with that? I told you how decoding works: decoding works through time steps, I showed you that animation. Do you see a problem with that, any gut feel? So let me explain this, if you don't have a strong idea here. See, my encoder stack is running only once, which is great, because there are a bunch of
35:02 parameters here, so running it once is perfectly okay. But my decoder stack, if you think about it: if I have K words, until the end-of-sentence comes, I'm literally running my decoder K times. So the decoder time steps could become expensive; there could be many of them. Remember, one of the problems with LSTMs was this whole unrolling over
35:45 time — you remember that one of the biggest problems with LSTMs was that we were unrolling things over time. But if you think about it, the decoder is also doing something like unrolling over time, because for every output that I need to generate, I'm running the whole decoder once. Isn't it more like your LSTM then? Of course, I agree that encoders and decoders are using attention, I'm not denying that — they're looking at the whole context and
36:17 trying to say which word they should pay attention to, and that way they are more powerful, because we saw earlier that just a bidirectional LSTM with attention is more powerful than a simple bidirectional LSTM. So then people said: hey, why do I need the LSTM at all? I might just build a whole model with only attention — and that's how Transformers came into existence.
But do you understand that the decoding per time step is behaving like an LSTM? Not exactly, but there is still this unrolling over time.
36:48 Does it make sense that this is a disadvantage of the decoder, because the decoder has to run K times? Does this make sense to you, that this is one of the disadvantages? So let me tell you one more problem — can you think of one more problem here? This is probably the biggest problem of a Transformer. If you think about a Transformer: my input could be three words, ten words, or a whole document with hundreds of words. Now, the whole self-attention that I'm
37:33 computing here: for every word, if I'm looking at this word, what does self-attention say? Using your Q, K, V matrices, for every word you have to say how much attention should I pay to this word, how much attention should I pay to that word. So if you do that for every word, and if I have n words here, the time complexity is
38:07 of the order of n-squared attention weights, because everything has to pay attention to everything else. The attention weights I have to compute are of the order of n squared. Does it make sense that that could be very large? Imagine instead of three words I want to give a whole Wikipedia article, and the article has a thousand words — then I literally have a million attention weights that I
38:41 have to manage, because the attention is order n squared, which means for every word I have to say how much attention I should pay to every other word. And this is for a single head: if I have K heads, it becomes order of n-squared times K. Does it make sense that this can be a computationally expensive headache — the process of self-attention itself — because every word attends to every other word, and that's a pain computationally? Okay, cool, these are some of the problems
39:14 that Transformers have. Again, I'm very happy that some of you thought it through: it's computationally expensive because of the number of attentions that you have, and the decoding per time step is also a headache. Because of that, people said: hey, Transformers are great, the performance of a Transformer is great, but can we optimize it, how do we build on top of it? Like in our notes: this is the original diagram, the whole diagram from the original research paper that our team has
39:44 redrawn. The architecture is simple: you have an encoder stack, you have a decoder stack, all of that. We've also put in a very nice summary of sorts: in an attention module, if the length of the sequence is n, and if the dimensionality of each word (if you're using word2vec) is d, then the time complexity per layer is of the order of n-squared times d, because there are order of n-squared
40:17 attention weights that you have to worry about and each data point is d-dimensional. Again, this is only for one self-attention, not for K self-attentions; if there are K self-attentions, it is multiplied by K.
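A tiny sketch of why the n-squared shows up (NumPy, with made-up sizes): the attention weight matrix has one row and one column per token, so its size grows quadratically with the sequence length.

```python
import numpy as np

n, d = 1000, 64                      # 1000 tokens (e.g. a long article), d-dimensional queries/keys
Q = np.random.randn(n, d)
K = np.random.randn(n, d)

scores = Q @ K.T / np.sqrt(d)        # (n, n): one score for every (word, word) pair
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax

print(weights.shape)                 # (1000, 1000) -> a million attention weights, per head
```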
But the best part is there are no sequential operations in self-attention: all operations can be done simultaneously, so all of these matrix multiplications I can do on a GPU. On the other hand, for your typical RNN or LSTM the problem is there are order-of-n sequential operations, because I'm only taking one
40:50 word at a time as an input. Again, this is for generic attention, not necessarily for the Transformer model, because in the Transformer model there is a decoding step which can be cumbersome. But if the input is of length n — look at this, in the standard Transformer that we have seen — this n is only input once; for the outputs, if there are K outputs, then there are K time steps. So we have collated some of that in the notes. There is also this very nice comparison
41:23 of various architectures versus Transformers — this is English-to-French and English-to-German — and there is a base Transformer model and a very big model that people have constructed, and they outright beat everything else that people have tried: ensembles of bidirectional LSTMs plus attention, lots of things. Okay, so we'll come to the code in a while, but since we are discussing architectures, I wanted to point
41:55 you to this. This is probably one of the best implementations that I have seen. When I first read the Transformer paper, probably in early 2018 if I recall, I had some gaps in my understanding, and the best way I learned about the Transformer was when I tried to implement it, or read others' implementations, from scratch. So let me post this here in the chat window — I would strongly recommend everyone to go through this Google Colab notebook; I'll also put it in a post-read. This Google
42:29 Colab notebook is the implementation of the Transformer in Keras and TensorFlow from scratch — the actual research paper implemented by the Google AI team — and their explanation is also very good, but the best part about it is the code. They first take a simple dataset — I think this is French to English itself — this is the original architecture that they have, and then they
43:01 implement each of these blocks. What I really love about it is this: first they say, let me implement the positional encoding block — these are the sinusoidal functions that we've seen — they first implement it as a function, then they visualize what the positional vector for the first position, the second position, and so on looks like, and then they define the whole positional embedding class. Then they implement add-and-normalize. Then they implement the
43:29 attention — the base attention class. Then, and this is the best part, they implement the encoder-decoder attention, which is also called cross attention because it's attention that crosses from the inputs to the outputs. Again, the only key thing here is that the key and value are the context that you're getting from the encoder, and the query is what you get as an input X from here. All this code is very simple to follow: the key and value come from the encoder output, and the query comes from the decoder side.
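For reference, here is a minimal sketch of the sinusoidal positional encoding function mentioned above (in the spirit of that notebook, not copied from it; the sizes are placeholders):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings: sin on even dimensions, cos on odd dimensions."""
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model)[None, :]                   # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (i // 2)) / d_model)
    angles = pos * angle_rates                        # (max_len, d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])             # sine on even indices
    pe[:, 1::2] = np.cos(angles[:, 1::2])             # cosine on odd indices
    return pe

pe = positional_encoding(max_len=50, d_model=128)
print(pe.shape)   # (50, 128): one positional vector added to each word embedding
```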
44:00 And then the global self-attention, which is the standard self-attention that you have, which is here — very simple, very nice. Again, the code is not overwhelming at all: if you have understood the Transformer so far, this is extremely simple to follow, because they implement each of these chunks very carefully. Then this attention, which is called the causal self-attention — because you're taking the outputs
44:30 from the decoder layer and giving them as an input here — even that part they implement very carefully, and then everything is put together. Then they implement the feed-forward layers, which are very simple: some dense connections and some dropout, nothing more than that. Then they do the whole encoder layer: they implement one encoder as basically a bunch of self-attentions, and then replicate it as many times as you want. Then they have a decoder; the way they've written the code
44:59 is, for each of these blocks they write a function or a class and then put it together. So there is a whole decoder layer that they've written clearly, then they repeat the decoder layer — the whole decoder consists of multiple decoder layers, as many as you want — then the whole system. And the best part is, once you get the final output here, they have a linear layer and a softmax. So while this code might look a little lengthy, it is probably some of the best code I have seen for implementing a
45:29 Transformer from scratch for academic and educational purposes. Of course I've seen other Transformer code — by PyTorch, Hugging Face, etc.; Hugging Face is probably one of the most widely used libraries and it is very Transformer-heavy; of course their code is much more optimized. But what I like about this code, which I'll share via the Google Colab as a post-read, is the clarity with which they've implemented it. So please go through it, it's very helpful. Whatever we have seen till now, the same thing is there,
46:00 but written in a very crisp, clean way by some of the best developers and researchers at Google. I'll share this as a post-read; of course, just to be clear, this is not written by us, it's written by the team at TensorFlow. Now, we've seen a few disadvantages of our Transformer; one of them was the decoding per time step. So people said: hey, why should I have the whole encoder and decoder? Why can't I do
46:36 encoder-only or decoder-only models? The idea is very ingenious: people said, why can't I just have an encoder-only architecture or a decoder-only architecture? The encoder-only architectures were researched mostly at Google, and they came up with this architecture called BERT and variations of BERT. Then people at OpenAI did a lot of research on decoder-only architectures and came up with the GPT models. Both of them are powerful; BERT is
47:10 in many cases more widely used because computationally it's easier to tackle. I'm not saying GPT is not used — GPT is also used — but we'll focus mostly on BERT because it's more popular from a real-world implementation standpoint, and it's also easier, because the encoder architecture is much simpler than a decoder architecture.
Now the question is this: suppose I have an encoder block here. Of course, I have my word one and word two, which are represented along with
47:39 positional encoding — I'm giving this as an input, I have my positional encoding which I add — and this generates two outputs. This is my encoder one. On top of it I can draw one more encoder; this becomes my encoder two, and it again generates two outputs. Again, each encoder has self-attention, skip connections, add-and-normalize, a dense fully connected layer, and add-and-normalize — simple. Now imagine I have three encoders. The idea is very ingenious: one thing that you can realize is, hey, I have
48:12 two inputs here and here also I have two outputs — simple. Now comes the interesting thing. If I want to build an encoder-only architecture, think about the typical tasks in NLP. One task is classification: I have a sentence and I want to classify it; then what do I do? If you break down NLP tasks, you either want to classify a sentence —
48:48 based on polarity or whatever you want, often a multi-class classification — or you want to do sequence-to-sequence. These are broadly the two categories that you have. So the challenge the designers of BERT faced was: hey, if I only have an encoder stack, how can I achieve both of these? It's a non-trivial task. And they said: the reason I don't want a decoder stack is because I don't want decoding per time step, because that was what was taking the
49:18 most time in terms of time complexity — there's a per-time-step loop and I run my whole decoder every time — what if I just completely get rid of it? Of course, this is not solving all the problems: the number of parameters in a BERT can still be fairly large. See, your BERT is basically this sort of architecture: it's an encoder-only stack. Now, we'll see how to train a BERT end to end — what language masking, or a masked language model, is, we'll come to all that in just a few minutes — but is the context of BERT, why people even
49:49 imagined designing BERT, clear? Is the motivation for BERT clear? If it's clear, it's very simple for us to go through the rest. Again, I'm using Jay Alammar's blog — a very nice blog where he talks about multiple models — and we'll also see how transfer learning works here, all of that step by step. Our team also created some terrific notes, with libraries etc., that we'll come to in a while. So let's go step by step, top down. Suppose I have a
50:27 sentence like this — suppose I get some email like this — and I want to do, let's say, sentence classification. I pass it through a BERT — again, we don't know what BERT is yet, we'll come there in just a few minutes; this is the broad task that I'm talking about. Imagine I get an email like this and I want to determine whether it's spam or not. So here is my BERT model, and it generates some output. If you think about an encoder-only stack, then for each of the encoders we
50:56 already know the architecture: if I give K words as input, it gives me K vectors at the output. Now I can pass all of that to a simple feed-forward neural network with a softmax, and I can get spam or not spam. I can do this, but it's still not fully specified yet.
Alright, so BERT itself has two architectures that people initially trained: there is something called the BERT Base model and the BERT Large model — I really love these diagrams, they're fun and interesting — and Large is a larger architecture. For
51:25 example, BERT Base is a stack of 12 encoders and BERT Large is a stack of 24 encoders. Obviously, if you are stacking more, it has more computational cost and of course more parameters to train. Okay, so the way we give input to any BERT model is this. At the end of the day my input can be very large, but imagine I want to cap it. The first input that I give is a special token called CLS.
52:11 CLS basically just stands for classification, because my task is a classification task; it's a special symbol, like the start of a sentence. Then I give each of the words as input, along with that CLS token. Now of course, imagine my email only has 200 words — then what do we do? You guys already know this; we studied this when we studied LSTMs: padding, exactly. So the rest of the positions will simply be
52:51 padded with empty strings. But there is a problem even in this: I am only allowing sentences of length 512, beyond that I am not allowing anything. If you have a situation in the real world where you need to process sentences which are much larger, then you have to train your own BERT with a much larger number of inputs, but of course that's going to be computationally more expensive. The typical BERT — Google has trained a bunch of BERT models, and others have trained them too, not just Google — the typical input size is
53:24 512 tokens or words, and the first one is often CLS. Okay, so we already know how the encoders work; we know the encoder architecture very well: positional encoding, self-attention, layer norm, feed-forward, add-and-normalize — very simple. The same architecture is also followed here. Finally, I get these outputs: one output corresponding to each of my
54:01 inputs. Now, if I want to build a classifier, what I'll do is this — this is how the architecture is designed: if you give me all these inputs, then on the first output — look at this, on the first output that I have here — I will build a feed-forward neural network with a softmax classifier, which will generate this output. All the other outputs I'm going to ignore — because my setting is a classification setting, all these other outputs I am going to
54:33 completely ignore. Again, this is my BERT; let's assume there are 12 layers of encoders here. The way this sort of model learns the right thing is this: look at this, in my whole sequence this CLS was a special symbol for me, and because I'm ignoring all of these other outputs, there is no backpropagation happening through them; the only backpropagation that's happening is through this first output.
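A minimal Keras-style sketch of that classification head (placeholder sizes; not the actual BERT code): the encoder stack returns one vector per token, we keep only the vector at the CLS position, and a small feed-forward network with softmax makes the spam / not-spam decision.

```python
import tensorflow as tf

seq_len, d_model, num_classes = 512, 768, 2          # e.g. spam vs. not spam

encoder_outputs = tf.keras.Input(shape=(seq_len, d_model))   # one vector per input token
cls_vector = encoder_outputs[:, 0, :]                 # output at the [CLS] position only
hidden = tf.keras.layers.Dense(256, activation="relu")(cls_vector)
probs = tf.keras.layers.Dense(num_classes, activation="softmax")(hidden)

classifier_head = tf.keras.Model(encoder_outputs, probs)
# All other token outputs are ignored by the head, but self-attention inside the
# encoder stack still mixes information from every word into the [CLS] vector.
```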
But when I'm processing any word, I'm also paying attention to all the other words, which means my attention for this word — and my attentions for the other words, look at this — are also going to get impacted here. Because if you think about the architecture, if you think of a stack of encoders: after self-attention there is a feed-forward here, and then add-and-normalize, and that goes in as input to the next layer.
55:42 Because I'm adding and normalizing, I'm paying self-attention, I'm adding feed-forward networks, there is a lot of information exchange happening between the words, or between the inputs of my sequence. It's not that, because I have two outputs here, this output corresponds only to this word and that output corresponds only to that word — it's not like that, because self-attention is going to pay attention to other words, and I have feed-forward networks whose outputs I'm
56:10 going to normalize and combine together before I pass them on. So what's happening is that the output corresponding to this word also has some information from the other word. That's the most important reason why, even though I'm building my feed-forward neural network classifier only at the first output, the bunch of weights that I have within this whole BERT, in the 12 layers that I have, will still work: because there is attention, which is
56:42 doing a lot of exchange of information from these other words to this one, and because there is no backprop from the other outputs, only backprop from this output, all the weights will get learned in such a way that they detect spam or not spam based on all the other words, especially when I give the input CLS. It's a very simple idea, but does this make sense, how something like an encoder-only stack can do classification? Again, BERT is nothing but an encoder-only stack; we
57:20 already know what encoders are, we already know self-attention, so BERT is a very simple idea, and this is how we can train a BERT-based classifier. Any questions about BERT-based classification? We'll see sentence-to-sentence tasks and all that in a little while, but is it clear how to build a multi-class classifier using BERT? If you have any questions related to this, feel free to raise your hand, I'm happy to address them. Okay, cool, no problem. So if you look at it, this is very similar to
58:06 your convolution setup: you give an image, you have a bunch of convolutions, and then you do some classification using fully connected layers and you give an output. It's very similar to that: I have some input, I have a stack of encoders, and finally I build a classifier and give an output. So intuitively they're very similar. Of course there are other architectures — you could have also done this using LSTMs; we've seen how to do it with LSTMs: suppose I have three words as inputs, I get their
58:32 embeddings, this is basically my LSTM one, unrolling over time, then I pass these outputs to LSTM two, again unrolling over time, and then I build a feed-forward neural network and softmax at the last time step. You can also do this with LSTMs, but of course it's much better with Transformers. OpenAI basically did pure Transformer-based, decoder-only architectures, just like the encoder-only architectures we have seen. The decoder-only architecture is also very simple: it basically takes the previously generated tokens as inputs, and the future tokens are masked.
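A small sketch of that masking of future tokens (NumPy, toy sizes): positions a token is not allowed to attend to get a score of minus infinity before the softmax, so the decoder side can only look at previous outputs.

```python
import numpy as np

n, d = 4, 8                                   # 4 generated tokens so far, toy dimension
Q = np.random.randn(n, d)
K = np.random.randn(n, d)

scores = Q @ K.T / np.sqrt(d)                 # (4, 4) raw attention scores
causal_mask = np.triu(np.ones((n, n)), k=1)   # 1s above the diagonal = future positions
scores = np.where(causal_mask == 1, -1e9, scores)   # block attention to the future

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))                   # upper triangle is ~0: no peeking ahead
```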
In a decoder-only architecture — again, this is the core behind the GPT architectures — just like the way we had an encoder-only stack, we have a decoder-only stack. What is the difference between an encoder and a decoder? We had the encoder-decoder attention, which is no longer there here; the decoder looks very similar to the encoder, except that there are two attentions instead of one. Look at this: if I don't have the
59:33 encoder-decoder attention here, if I don't have this connection — if my keys and values are not coming from an encoder — then in a Transformer I would just replace this with self-attention. So your encoder and decoder are very similar, except that in one there are two attentions followed by a feed-forward, and in the other there is one self-attention followed by a feed-forward. That's why the GPT-based architectures and BERT-based architectures are very similar in this context: all I
01:00:00 have is the input — this is what OpenAI designed — I have a bunch of decoders, the outputs of the decoders I give to a feed-forward neural network plus a softmax, and then I give the output. Intuitively it's very similar to what we have seen; of course they use start and end tokens and things like that. Let's go back to our BERT. So this is how a simple BERT works, but let me give you an interesting challenge now: one of the questions here is, how do I train a BERT-based model?
01:00:33 Or rather, before I come to the training part, there is another task that I want to show you. This is the actual diagram from the research paper; let me just zoom in a little and explain these tasks. Here, imagine I have a sentence pair, so I have two sentences. Let me see if I
01:01:09 can open a new tab and zoom this in. So let's go one by one and try to understand each of these — these are variations of BERT for various tasks. Imagine I have sentence pair classification: I can give sentence one and sentence two. I start with CLS, I give token one to token n of the first sentence, then I create a special symbol called the separator, SEP, and then I give sentence two. Now imagine — because there are many two-sentence input setups — imagine my class label is: hey, is
01:01:47 sentence two logically coming after sentence one? Then label one, or else zero. This is called next sentence prediction. Is the task clear? I have two sentences that I want to input: my CLS is there, I have the n tokens or n words of sentence one, then I'm creating a special separator symbol, then I'm giving the m tokens of sentence two, and now I want to predict a binary output: whether sentence two comes after sentence one logically, or not. Does it make sense? These are called
01:02:25 classification tasks with two sentences. Does this logically make sense? And now it's very simple for me: I have my encodings, I have my BERT-based architecture, and whatever the output corresponding to CLS is — because my task has all these inputs — I will just train a model on that output and backpropagate. So using the same BERT architecture, I can classify one sentence or I can classify a pair of sentences.
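A tiny sketch of how that sentence-pair input is laid out (plain Python, with made-up token lists): [CLS], the tokens of sentence one, [SEP], the tokens of sentence two, plus a segment id per token telling the model which sentence each token belongs to.

```python
# Hypothetical word-level tokens, just to show the layout.
sentence_a = ["the", "man", "went", "to", "the", "store"]
sentence_b = ["he", "bought", "a", "gallon", "of", "milk"]

tokens = ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]
segment_ids = [0] * (len(sentence_a) + 2) + [1] * (len(sentence_b) + 1)

print(tokens)
print(segment_ids)
# The [CLS] output is then fed to a small feed-forward + softmax head that predicts
# 1 if sentence B logically follows sentence A (next sentence prediction), else 0.
```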
sentences and of course I can do again this is this is a simple single sentence classification 01:03:03 we’ve seen this already again this diagram is from the original research paper this is a single sentence classification right I start with CLS I give the N tokens of my single sentence and it generates all these outputs the output corresponding to the CLS the output corresponding to the CLS I will just have a model of a feed forward neural network and I’ll do back propagation this is very simple this we have already seen but let’s see more complex tasks okay so imagine imagine this this is actually more interesting 01:03:36 there are more interesting tasks so imagine I have a question answering system suppose I have a question and I I want to get an answer for that this is very interesting okay so what do I do I give CLS as the input these are my n tokens of the question that I have I might ask a question like for example what is the capital of India all right and then I give a separator then I can give a paragraph or this could be the Wikipedia page there are M tokens let’s assume I’m taking the Wikipedia page from Wikipedia page for India 01:04:18 somewhere it says New Delhi is the capital of India right so I give the paragraph as an input and I give the separator now you get a bunch of outputs now what I want is this whatever wherever the answer to this question lies in these M tokens right look at this there are M tokens here which means there are going to be M outputs there is one output here second output so on so forth MTH output there are M outputs here in this paragraph let’s assume somewhere in this paragraph it says New Delhi is the capital of 01:04:50 India then wherever new and Delhi are there I should get one rest every other place I should get zero so New Delhi again New Delhi could be it interpreted as one word or two words here I should get is once test everywhere in this paragraph I should get zeros that’s how I train my model so imagine if you give me a data set there is a data set from Stanford called Squad wherein there are questions there is a paragraph in which the answer to the question exists and you have to find which words correspond to the answer 01:05:25 so you can do a question answer pair uh I mean you can solve question answered problems using Word does this make sense again here you have two inputs you have a question you have paragraph and your output is in the paragraph which words matter or which words are my answer make sense to everyone this is a very very popular task there is one more task if you remember named entity recognition is a sequence to sequence model right wherein look at this I have CLS I have a single sentence that I am passing now for each word in this sentence for 01:06:07 each word in this sentence I want a class associated with it for this word of course first for CLS There Is No Label this is other this is let’s say person name so on so forth I’m giving a I can do this for parts of speech I can do it for any ER I can do it for anything because I have n inputs here I have n outputs for each output I will have a like for example there’s an output here right I will train a feed forward neural network which will generate the output for me and so on so forth for everyone 01:06:42 so if I want a sequence to sequence I can do this with word these are the four types of tasks that are very easily doable in a bird-based architecture sentence pair classification single sentence classification question answering system as well as single 
Does it make sense how I can use the same architecture to solve a lot of these tasks? It's a slight twist in the architecture, but the core system stays the same. Does this make sense, that an encoder-only stack of an encoder