
5. Issues and Techniques in Deep Learning 2 - 28012024 Complete transcript.pdf


Document Details


Uploaded by PalatialRelativity

Tags

deep learning, artificial intelligence, machine learning

Full Transcript


02:17 - 02:25 Speaker 2: Interesting and engaging. Otherwise the subject is kind of going over the head most of the time. Thank you. 02:25 - 02:56 Dr Anand Jayaraman: Thank you. Thanks for the kind words, but you know, this is like everything, right? When you let it sit in for some time, the concepts get easier. So that's 1 big advantage: I'm not starting with a fresh set of people. You guys have already seen many of these concepts before. 02:56 - 03:10 Speaker 2: Not really; some of them are experts in this group, like Sunila, Sachin and all that. We are all newbies, so you have to spoon-feed us most of the time. So please don't be under that assumption. That's what I sometimes feel when you ask for some answers. 03:10 - 03:30 Speaker 3: So please don't fall for these expert statements. We've implemented some things in our career; that's the only differentiation, I would say. You have to learn everything out here and love the subject more. In fact, when you teach, I see very interesting analogies. It's always a learning. 03:32 - 03:39 Dr Anand Jayaraman: Somebody has coined this word "deep" in the right way, because every day, every class we go, you're going deeper and deeper, 03:43 - 03:46 Speaker 4: like scuba diving, I would say. 03:46 - 03:49 Speaker 2: Find out the reason why it is called that. You have to go deep anyway. 03:51 - 03:53 Dr Anand Jayaraman: But now we are more than neck deep. That is the problem. 03:56 - 04:20 Speaker 5: And yesterday, Professor, you had opened a nice topic: God and evolution. Creation versus evolution. I in fact spoke to my family; they're all in the discussion now. It's very interesting, because it looks like we are watching a movie and we are waiting after the interval for what will happen; that is the kind of feeling we were having.
04:23 - 05:13 Dr Anand Jayaraman: I find it fascinating, these arguments. I mean, clearly, having grown up in India, whether you want it or not you have a religious upbringing, right? And I'm a science guy; I have my PhD in physics. So there are questions over there. When you're learning science, you don't necessarily have to assume the existence or non-existence of God and so on. But for me, any of these studies on the brain are just fascinating. Suddenly you start questioning: is it possible that all 05:13 - 08:05 Dr Anand Jayaraman: of this is by chance? Neuroscience talks about that, but when you're doing what we are doing, AI, we are in a sense playing God. You're trying to give intelligence to the machine, to the algorithm, right? And the question is, could this have happened through evolution? That's a fascinating question. So, let me just move to a side and we can start fresh here. Good. Now, let's remind ourselves of some of the things 08:05 - 08:16 Dr Anand Jayaraman: that we did yesterday. Okay, so let's start. Can someone help summarize the different things we talked about yesterday? 08:17 - 08:20 Speaker 5: Shallow versus deep knowledge. 08:20 - 08:20 Dr Anand Jayaraman: Very nice. 08:20 - 08:22 Speaker 5: So we started 08:22 - 08:27 Dr Anand Jayaraman: so, deep versus shallow knowledge. And we 08:27 - 08:32 Speaker 5: go for why we go for deep learning, like why we need deep learning. 08:32 - 09:09 Dr Anand Jayaraman: Good, yeah. So this is the shallow versus deep. This allowed us to define what we expect from shallow knowledge versus deep knowledge. And then we defined what deep learning is in the context of neural networks, and that's essentially a number of hidden layers greater than 2.
That's what we define as deep learning, right? And then we started talking about why deep learning. Why do we need deep learning? What are the big advantages? Generalization. That's right, better generalization. 09:09 - 09:10 Speaker 5: Compact representation. 09:11 - 09:44 Dr Anand Jayaraman: Yeah, exactly. Better generalization is what we are hoping for. And the other 1 was compact representation. This was the main reason why we wanted to go for deep learning. And then we talked about issues in deep learning. What are the different issues? Vanishing gradients. That's right. 09:45 - 09:47 Speaker 5: Optimization difficulties and overfitting. 09:48 - 09:55 Dr Anand Jayaraman: Vanishing gradients, and separately the optimization difficulty, and then 09:55 - 09:56 Speaker 5: overfitting. 09:56 - 10:10 Dr Anand Jayaraman: So these are all issues that we talked about. And then yesterday we sort of finished with fixing the first one, the vanishing gradients. How did we fix it? 10:10 - 10:12 Speaker 5: Better activations, improved training. 10:12 - 11:13 Dr Anand Jayaraman: Yeah, so what we did is better activation functions. Instead of sigmoid; we realized the problem was the sigmoid activation function, and this really, really sped up the training. So these are sort of the topics that we touched upon. Now, we still need to talk about optimization difficulties and overfitting. We will address those things today. Okay, that's primarily what is on the agenda for today: discussing the different kinds of innovations that have come out which make deep learning possible. But 11:13 - 11:26 Dr Anand Jayaraman: before we do that, before we go there, there was a request, I forgot who had asked this question, on feature engineering. 11:28 - 11:28 Speaker 4: Me, Sachin.
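The recap above says the vanishing-gradient problem was fixed by moving from sigmoid to better activation functions. A minimal pure-Python sketch of why (the layer count and inputs below are illustrative, not from the lecture): the sigmoid's derivative never exceeds 0.25, and backpropagation multiplies one such factor per layer, so the gradient shrinks geometrically with depth, while ReLU's derivative is exactly 1 for positive inputs.

```python
import math

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)), at most 0.25."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

# Backprop multiplies one derivative factor per layer; even in the
# sigmoid's best case (x = 0) the product collapses with depth.
layers = 10
print(sigmoid_grad(0.0) ** layers)   # shrinks geometrically
print(relu_grad(1.0) ** layers)      # stays 1 on the active path
```

This is why, with 10 sigmoid layers, the earliest weights receive almost no gradient signal and training stalls.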
11:32 - 12:37 Dr Anand Jayaraman: So, 1 of the interesting things about machine learning is this. What we find is sometimes you have problems which are not linear. Let me get a new Excel sheet. So there are problems sometimes that are not linear in nature which we are still able to handle. The thing is, it's easiest to understand all of this using linear regression, so let me start with simple examples on linear regression. Let's say I have a set of x values and y values, and let me increase the font so that all of this is clearly visible. So put 12:37 - 15:34 Dr Anand Jayaraman: some numbers here. And then, let's say, the y values are these. Human intelligence can help build a better model than what we can otherwise get just using a simple machine learning model like linear regression. Now, I look at it, and I realize that this has got a nice parabolic kind of shape. So I'm wondering, maybe a parabola would be a better fit for this data? So what I'm going to do is artificially create a new feature, and I'm going to call this feature x squared. What this feature is going 15:34 - 16:32 Dr Anand Jayaraman: to do is take this value of x and just square it. So 0, 1, 4, 9, 16, 25, and 36. I've created a new feature from the original feature. Right? Now let's try regression, linear regression. I'm again doing linear regression, nothing else. Now what I am going to do: the y values are still the same, but instead of the x values just being that 1 column, I am going to make the x values be these 2 columns, as if 16:32 - 17:19 Dr Anand Jayaraman: there are 2 independent features. I am telling linear regression: there are 2 independent features. This is x1 and this is x2. Now, linear regression will go ahead and do regression with whatever number of features I give it. So here I am going to go ahead and do that.
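The two-column trick being set up here can be sketched in plain Python. The y values below are invented for illustration (the transcript does not preserve the actual spreadsheet numbers), and the solver is ordinary least squares via the normal equations, written out by hand rather than with any particular library:

```python
# Hypothetical data with a roughly parabolic shape; the numbers are
# made up to stand in for the spreadsheet values from the lecture.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [1.0, 5.2, 12.9, 25.1, 41.0, 61.2, 85.1]   # roughly 1 + 2x + 2x^2

# Feature engineering: treat x and x**2 as 2 "independent" columns.
X = [[1.0, x, x * x] for x in xs]   # intercept, x1 = x, x2 = x^2

def solve_normal_equations(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y,
    solved with plain Gaussian elimination; fine for 3 unknowns."""
    n = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    Xty = [sum(row[i] * t for row, t in zip(X, y)) for i in range(n)]
    A = [XtX[i][:] + [Xty[i]] for i in range(n)]   # augmented matrix
    for col in range(n):                           # forward elimination
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):                 # back substitution
        beta[r] = (A[r][n] - sum(A[r][c] * beta[c] for c in range(r + 1, n))) / A[r][r]
    return beta

b0, b1, b2 = solve_normal_equations(X, ys)
print(b0, b1, b2)   # linear regression happily fits beta0 + beta1*x1 + beta2*x2
```

Linear regression never knows the second column is x squared; it just fits coefficients to whatever columns it is given, which is exactly the point of the example.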
So essentially what I am doing here is asking linear regression: here are my y values, and here are my original x and my x squared. I know it's x squared, but I'm going to call them just x1 and x2. 17:19 - 20:04 Dr Anand Jayaraman: And I'm asking it to fit linear regression. Linear regression will happily go ahead and do this. It will fit beta 0 plus beta 1 times x1 plus beta 2 times x2. This is what linear regression will do; it will fit with these coefficients. But what happened is that I was able to manually create some features, new features, out of existing data. Right? And that is called feature engineering. Okay. This is typically done because you know something about the domain and you're making it easier 20:04 - 20:40 Dr Anand Jayaraman: for the machine learning algorithm to learn the pattern. And in fact, the real secret, when I'm doing my consulting work, the real secret to getting better results in any of the real-world consulting problems is doing the right type of feature engineering. And how do I know the right type of feature engineering? You talk to the domain experts. You talk to them, and they will be able to tell you what kinds of features there are, and then you look at the data, you stare at the data, you think maybe I should 20:40 - 20:53 Dr Anand Jayaraman: transform this or not. And based on that you are able to do feature engineering. Right? And then let me give you another example of feature engineering, the kind of feature engineering you might do. 20:56 - 21:04 Speaker 5: Professor, Sachin here. My question was: feature engineering we usually do with machine learning, but do we really need feature engineering with deep learning? 21:04 - 21:13 Dr Anand Jayaraman: We will talk about that. I am getting there. I know I have a long way of getting there. But 21:13 - 21:24 Speaker 4: I do have 1 question. Can I ask?
Since we are creating a new feature, X2, based on X1, there will also be a correlation between these 2, right? Is that a right thing to do? 21:24 - 22:09 Dr Anand Jayaraman: Wonderful question, Kalpana. So the question is that the second feature that I created is related to the first. The thing is, it is related to it. But when you're talking about the word correlation, that term strictly refers to linear correlation. Those 2 columns are not linearly correlated. That's the only thing that matters. All that matters is that they should not be linearly correlated. If it is related, that's fine; it can be nonlinearly related. That's OK. 22:10 - 22:13 Speaker 6: So this problem is called multicollinearity, right? 22:13 - 22:53 Dr Anand Jayaraman: Multicollinearity is absolutely a related problem. I don't want to talk about the difference between high correlation and multicollinearity; that'll take us too far, too much into linear regression again, and I don't want to do that. I just want to introduce to you the idea of feature engineering. Let me show you another example. So here is, for example, a data set that's given to us, 1 of the standard problems. Let me write this. This 22:53 - 23:43 Dr Anand Jayaraman: is a data set from Washington DC for the number of bikes from the bike rental companies. You have these automatic kiosks, right, where you can rent a bike. So this is the hourly data of how many bikes were rented at different hours. So this, for example, is January 2nd at 8 p.m.: 22 bikes were rented. Whereas on January 2nd at 11 a.m., 70 bikes were rented. This is that particular data set.
And 1 of the things that we often want to do is understand how many bikes, what 23:43 - 24:21 Dr Anand Jayaraman: the demand for the bikes is. This is a demand prediction problem. Okay. So in order to do that, I want to be able to predict, given a particular time, how many bikes are likely to be rented in that particular hour. That's the ask. So this can be done through linear regression. This is the value, the count of bikes, that needs to be predicted. And they're giving us a bunch of data. What season is it? Is this a holiday or a working day? What is the type of weather, whether it's raining, sunny or whatever? And 24:21 - 24:54 Dr Anand Jayaraman: what is the temperature? What is the humidity? What is the wind speed? They're giving all of this data, which is potentially useful for telling whether someone is likely to rent a bike and ride it outside. These are all important: season, holiday, temperature, wind speed, all of that. Clearly, when the weather is really bad, people are unlikely to be customers for bike rentals. And when the weather is great, there are going to be a lot more people renting bikes. So all of these are features that are 24:54 - 25:46 Dr Anand Jayaraman: given to us. Now you can actually try to predict the demand using all of these features. But you know what? You won't get great results with just that. However, if you recognize that there is this hour, this date and time stamp also given, then from the time stamp you can potentially create new features. You can just extract the hour of the day from it. And then you can put it into different blocks: between, let's say, 2 a.m. and 6 a.m., it's unlikely any office goers are out; even late night party goers won't 25:46 - 26:26 Dr Anand Jayaraman: be there. That block we can call the sleeping block. Then morning 6 a.m. to 10 a.m. is office time; people are likely to go to the office.
So we call that the office-time block, then lunchtime, and then returning from office. Then late in the evening, which is more casual riders. And then late in the night is party goers returning from parties. We can label those time blocks and create a new feature out of that. Creating that new feature increases the accuracy of the algorithm. By understanding something about the problem, we can 26:26 - 27:14 Dr Anand Jayaraman: create new features. And this process is called feature engineering. As I mentioned before, feature engineering is really the secret to success in many of the consulting projects that we have. Professor, what is the difference between feature engineering and annotation in this case? Feature engineering and? Annotation. So annotation is a form of feature engineering. But feature engineering can be of different types. For example, I can perhaps create a new feature which is a product of temperature times humidity times 27:15 - 27:45 Dr Anand Jayaraman: wind speed. I might think that maybe that product actually means something. I can create a new feature like that. That's more than annotation. Or I can take yet another feature, which is log of n plus log of wind speed. And depending on my domain knowledge, I might believe that this is 1 important predictor in making the predictions. This is feature engineering. 27:46 - 27:50 Speaker 4: And again, Professor, when we standardize the data, can we call that also feature engineering? 27:51 - 28:07 Dr Anand Jayaraman: Standardization of the data does not involve mixing of 2 separate features. Standardization of the data will not change the results in any way. The R squared will remain exactly the same. 28:07 - 28:11 Speaker 4: Right, but like we are doing log etc. So that is, that is the thing.
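The bike-rental example above, deriving a categorical time-block feature from the raw timestamp, can be sketched like this. The block names and hour cut-offs below are illustrative guesses in the spirit of the lecture, not the ones used in the actual dataset:

```python
from datetime import datetime

def time_block(ts):
    """Turn a raw 'YYYY-MM-DD HH:MM' timestamp into a time-block label.

    The cut-offs are hypothetical; in practice a domain expert would
    choose them, which is exactly the professor's point.
    """
    hour = datetime.strptime(ts, "%Y-%m-%d %H:%M").hour
    if 2 <= hour < 6:
        return "sleeping"          # almost nobody rents a bike
    if 6 <= hour < 10:
        return "office_commute"    # morning commute demand spike
    if 10 <= hour < 16:
        return "daytime"           # lunchtime and casual riders
    if 16 <= hour < 20:
        return "evening_commute"   # returning from office
    return "late_night"            # party goers heading home

# The 2 rows mentioned in the lecture: Jan 2nd at 8 p.m. and 11 a.m.
for ts, count in [("2011-01-02 20:00", 22), ("2011-01-02 11:00", 70)]:
    print(ts, count, time_block(ts))
```

The new column is categorical, so in a regression it would typically be one-hot encoded; the gain comes from the label being a nonlinear function of the raw hour.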
28:11 - 28:14 Dr Anand Jayaraman: Right, when you take the log, that is a nonlinear transformation. 28:14 - 28:16 Speaker 4: Nonlinear transformation, right. 28:16 - 28:18 Dr Anand Jayaraman: Standardization is a linear transformation. 28:18 - 28:19 Speaker 4: Right, right. 28:20 - 28:25 Dr Anand Jayaraman: Got it. So linear transformations do not do anything. It needs to be a nonlinear transformation. 28:27 - 28:39 Speaker 7: Professor, when you use the kernel trick, where at higher dimensions you make things more linear and you're able to separate them, which you use in support vector machines, is that similar to this? Because you're adding a dimension. 28:39 - 29:21 Dr Anand Jayaraman: The kernel trick is, in a sense, a form of feature engineering, yes, except that you're not doing it manually out of domain understanding; you're having the machine do it. Here is the point that I wanted to make, just 1 second. We did this particular problem yesterday, where we were trying to capture the complex patterns that are there, and we used a bunch of these neurons to do it. You know what I'm gonna do? Let 29:21 - 30:05 Dr Anand Jayaraman: me copy this and create a new page to put that in. So this is running. I did put ReLU, so that's fine. And over there it did this nice learning and captured the pattern, so let me just stop it; okay, I don't want my laptop to heat up any more than needed. So it has actually captured that particular pattern. Now, the thing is, if I did not know about this deep neural network, 1 of the things I would have done is to think: you know, this is a complex pattern, x1, x2. I 30:05 - 30:28 Dr Anand Jayaraman: mean, there are spiral patterns. Now I will try to do some kind of feature engineering to model this.
What I would try to do is: see, this is a spiral pattern, and actually I don't know the analytical form of that spiral, but I know that it sort of looks like circles. Do you know the equation of a circle? 30:32 - 30:33 Speaker 4: Pi r squared. 30:34 - 30:46 Dr Anand Jayaraman: Pi r squared is the area, right? But what is the equation of a circle? Let's say I have an x1 axis and an x2 axis. Do we remember the equation of a circle? 30:46 - 30:50 Speaker 5: Is it x1 squared plus x2 squared, that 1, are you talking about? 30:50 - 30:51 Dr Anand Jayaraman: That's right. 30:52 - 30:54 Speaker 5: So, x1 squared 30:54 - 31:42 Dr Anand Jayaraman: plus x2 squared equals a squared is a circle around the origin of radius a. Right? We have done this in school. Thank you, Sunil. It is the equation of a circle. So, what I am going to do, because I know that this has that circular shape, is give it those features. And here I have x1 squared, x2 squared. I'm going to add those features because I want it to be able to internally combine x1 squared plus x2 squared and try to create a circle. So essentially, these are quadratic 31:43 - 32:36 Dr Anand Jayaraman: features I've created. And let me also create x1 times x2, because that is also a quadratic feature, so all 3 of them together form the quadratic features. Now, another thing: the same spiral can be represented in polar coordinates as r equal to theta, if you remember polar coordinates from math; if you don't remember, don't worry about it, it's fine.
These polar coordinates, well, the kind of coordinates we have been using are called Cartesian coordinates. In Cartesian coordinates, every single point you can represent as 32:36 - 33:20 Dr Anand Jayaraman: x1, x2. Every point is represented as x1, x2. In polar coordinates, every single point is represented by its distance from the origin and the angle at which that point sits. So polar coordinates represent every point as r, theta. That's polar coordinates. If you don't remember, it's perfectly fine; you don't need it. The only thing you do need to remember is that this angle and this x1 are related through the trigonometric functions, sine and cosine. That much I'm sure you remember: this angle, this distance and this distance are all related to trigonometric 33:20 - 34:01 Dr Anand Jayaraman: functions. So what I'm going to do is add trigonometric functions also as a set of features. Now I'm going to go ahead and cut down the layers. I'm going to put only 1 hidden layer. I manually added a bunch of features and put only 1 layer. Now I'm going to start training. And you know what? It is actually able to capture that particular pattern. Right? Now, the thing is, all of this domain understanding I was able to apply, and because of the domain understanding, 34:01 - 34:47 Dr Anand Jayaraman: I was able to use a very simple network to model this. My domain understanding allowed us to make a very simple network. Now, what is the meaning of it? The meaning, therefore, is that when I did not use any of the domain understanding in my old network, what those other layers were doing is automatic feature engineering. Those other layers, their entire role is figuring out what kind of features to send to the last layer so that the last layer can correctly classify.
So all of these 5 hidden layers that 34:47 - 37:32 Dr Anand Jayaraman: were there before are effectively playing the role of this additional feature engineering that I did. We talked about this on, I think, day 1 of the class; I wanted an example of this. It's doing automatic feature engineering. This is what you would have done as an expert: you would have engineered these features. And here, the neural network is automatically engineering features and figuring out all of this from data. Except, in order to do that, you now need a lot more data and a lot more computational power. There are some examples where you are not able 37:32 - 37:45 Dr Anand Jayaraman: to do this manual feature engineering, and instead you rely on raw machine power and the large availability of data to go ahead and do it. Is the idea clear? 37:46 - 38:19 Speaker 5: Yes, Professor. 1 quick question, Professor. Yeah. The example which you showed, the Excel spreadsheet: the bike data has a derived column, where you collate a certain time range and say this is a prime hour, this is a non-prime hour, or a lunch hour. So how does that, not technically, from a business standpoint, how does it improve the prediction result? Because that's what we are trying to understand. Because of feature engineering, the prediction becomes very accurate, right? That's the claim. So now 38:19 - 38:24 Speaker 5: I'm trying to understand how that really happens. 38:26 - 38:50 Dr Anand Jayaraman: Essentially, there is some nonlinear transformation happening. That's what it is. I mean, this is that example, right? It required so many layers before, and now it's able to accomplish it, essentially, with these nonlinear transformations. So I'm just showing you the proof of the pudding: I'm showing you I'm able to do this using a single neural network layer.
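The engineered inputs described above, quadratic terms for the circular shape plus trigonometric terms for the spiral, can be sketched as follows. The spiral generator is a rough stand-in for the TensorFlow Playground data, not the exact dataset, and the feature list mirrors the lecture's x1², x2², x1·x2, sine idea:

```python
import math

def make_spirals(n=100):
    """2 interleaved spirals (r grows with theta), 1 per class label."""
    points, labels = [], []
    for i in range(n):
        theta = 3.0 * math.pi * i / n
        r = theta / (3.0 * math.pi)          # radius grows with the angle
        for label, phase in ((0, 0.0), (1, math.pi)):
            x1 = r * math.cos(theta + phase)
            x2 = r * math.sin(theta + phase)
            points.append((x1, x2))
            labels.append(label)
    return points, labels

def engineer(x1, x2):
    """Raw inputs plus the hand-engineered quadratic and trig features."""
    return [x1, x2,
            x1 * x1, x2 * x2, x1 * x2,       # quadratic (circle-like) terms
            math.sin(x1), math.sin(x2)]      # trig terms for the spiral

points, labels = make_spirals()
features = [engineer(x1, x2) for x1, x2 in points]
print(len(features), len(features[0]))       # 200 samples, 7 features each
```

With these 7 inputs, a single hidden layer has a much easier separation job, which is the "simple network plus domain knowledge" trade-off the lecture demonstrates.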
38:52 - 38:57 Speaker 5: Got that, Professor, because the visualization really makes it appealing. 38:57 - 39:09 Dr Anand Jayaraman: But that's the thing, right? We are talking about a black box model. Visualization? We are light years away from explainable models. 39:11 - 39:13 Speaker 4: Professor, 1 question on that. 39:13 - 39:26 Dr Anand Jayaraman: The train left a long while ago, right? So we have given up on the explanation part. All I can do is wave my hands around and show you some examples and say that, look, it's reasonable. That's all. 39:27 - 39:31 Speaker 5: Because the reason why I asked this question is about explainability, right? 39:32 - 39:48 Dr Anand Jayaraman: Yeah, yeah. That train, you know, is long gone. We are completely in the realm of black box models. We just sit back and admire that it actually happened. 39:49 - 40:06 Speaker 4: So, Professor, just to understand this thing: here you are saying that we didn't need a lot of hidden layers because we looked at the concentric circles, and that's the reason we were able to predict what kind of feature engineering we needed to do. So this is an example, right? And not the only option. 40:06 - 40:17 Dr Anand Jayaraman: Not the only option, absolutely not. This is an example. The thing is, the kind of features that it would engineer on its own would be different. 40:18 - 40:22 Speaker 4: Yeah, then we need more hidden layers so that it identifies them by itself. 40:22 - 40:27 Dr Anand Jayaraman: Yeah, and depending on what set of random weights it started with, it might engineer different features. 40:33 - 40:51 Speaker 6: Professor, essentially it would also understand the previous bike dataset that you showed, the time of day or the date-time stamp, right? That column would also be understood automatically by the different layers of the neural network. 40:51 - 41:24 Dr Anand Jayaraman: Absolutely right, Gopal. That's what it means: you feed in these raw things, and internally it does that feature engineering. It might not do the same kind of split, and again, you're giving up on interpretability, because ultimately you're saying: I want a model which will give me accurate results, so go and engineer whatever features you want. And that's what it does. 41:26 - 41:34 Speaker 8: 1 question. Yeah. If you go back to the TensorFlow example. 41:35 - 41:37 Dr Anand Jayaraman: Yeah. 41:37 - 41:56 Speaker 8: In the first 1, where we used deep learning, you are saying that whatever features the deep learning solution learns, we have no control over that, and no explainability on that? 41:56 - 42:28 Dr Anand Jayaraman: Correct, we have no control over that. And this is exactly the point I was making: we didn't drive it in any way, and still it automatically learned patterns. The first layer was learning simple patterns, the next 1 combination patterns, and so on and so forth. Very similar to how our visual cortex learns. Which was amazing. But we didn't force it; absolutely, we didn't force it. 42:28 - 42:37 Speaker 8: Yeah, what I'm going to ask is a loaded question. If this is not the desired result that the business wanted, then what are we going to do? 42:38 - 43:17 Dr Anand Jayaraman: What do you mean? The only thing the business told me was: find the model which will give me this pattern. And it accomplished a low error; look, the training loss was less than 1 percent, it's 0.1%, and the test loss is about 3%. It's accomplishing what the business asked me to do. The business needs to tell me what it is they want. The only thing the business tells me is: predict this correctly. That's all. They don't tell me that I need to be able to understand it.
And so 43:17 - 43:22 Dr Anand Jayaraman: we built a model that's able to predict it correctly. So I'm satisfying the business goals. 43:27 - 43:51 Speaker 8: What goes through my mind right now is the stock prediction. In the beginning of yesterday's class, you were mentioning that initially all the prediction algorithms were correct, and they went haywire after some time. So in that case also, all the models were meeting the business requirement, and then after a certain time... 43:52 - 44:40 Dr Anand Jayaraman: Yes, yes, that's the thing, right? The reason why stock markets are a completely different creature than anything else that is physical: say you are trying to understand how to predict the daily throughput of an oil well. Or you want weather prediction; that's another 1. Or you want to understand a particular plant where machines are failing; you want to understand how the failure happens. In any of those things, there 44:40 - 45:25 Dr Anand Jayaraman: are some fundamental physics, I'm saying physics in a loose way, some fundamental forces that determine the final behavior. Right? The temperature into the future is being determined by the current temperature, the current air, the wind speed, humidity and so on, plus your neighboring states' temperatures and so on. Once you know all of these features, you'll be able to correctly predict it. There exists some physics that's guiding it. All neural networks are doing is 45:26 - 46:01 Dr Anand Jayaraman: trying to understand what that function is. It knows it's a nonlinear function. It's trying to mimic or model that function. That's all neural networks are trying to do.
I know these variables are important, but I don't know what that function is. You're asking the neural network to go ahead and figure out the function, and deep learning does a very good job of figuring out the function. The question is, in stock markets it's not even clear that there exists a function, because the kind of factors that determine stock prices today might not necessarily work 2 years later. 46:01 - 46:22 Dr Anand Jayaraman: There might be a different set of forces driving it. Right? Pre-demonetization, post-demonetization, the markets completely changed. So there are different factors. So the assumption that the function will remain the same is not true in stock markets. In that sense, it's very different. 46:27 - 47:08 Speaker 3: I've got 1 quick question. Yeah. Just a 20 second 1. I think you've explained this very, very well, so it's very clear in terms of what we expect. 1 of the projects I did about 4 years back was identifying the correlation between features, or deriving the features as part of feature engineering, using knowledge graphs. So I was not using the automation of neural networks at that time. What I did was I just observed some, let's say, 6 months, 10 months of data. And I looked at the attributes it 47:08 - 47:46 Speaker 3: had, picked up the features and established a knowledge graph. And I understood what the third, fourth level of attributes are that can impact you. Let's say, your example: temperature today is a function of temperature yesterday and velocity of the wind. But velocity of the wind could have other parameters outside it, right? Let's say velocity could be determined by hurricane conditions or whatever it is. So identifying those dependent attributes from there to bring in, and then finding a correlation.
So in this case, I wasn't using any neural networks per se, 47:46 - 48:04 Speaker 3: not even graph neural networks. Just taking a knowledge graph, getting dependencies and working from that. So would you classify this approach as 1 of the ways of teaching the neural network? Is there any takeaway you would suggest from this thought of knowledge graphs? 48:04 - 48:37 Dr Anand Jayaraman: Yeah. Sorry for interrupting. We'll have to actually look at it in detail. However, generally, if you're finding that your traditional way of feature engineering, applying only these knowledge graphs, is not giving the desired results, then this would be a natural next step. But we'll have to talk details on that. It's a natural thing. 48:37 - 48:44 Speaker 3: More in terms of complementary; there can be complementary ways of looking at it. Right. Makes sense. 48:45 - 49:33 Dr Anand Jayaraman: Good. Let's go ahead and get to today's lecture. Okay, I put up just a summary of this. We talked about vanishing gradients, difficulty in optimization, and overfitting; those are the problems. And I mentioned that there's a whole bunch of approaches that have come about to make deep learning possible, and 1 of them is coming up with better activation functions. So I mentioned that ReLU is 1 of those activation functions that has shown to be 49:34 - 50:31 Dr Anand Jayaraman: very good; it works well in a really large set of situations. But then there have been a whole bunch of other innovations beyond ReLU. ReLU has this structure where, when the input to the neuron is positive, that same value comes out, but when the input is negative, the neuron basically shuts down. That is basically the feature of ReLU.
1 gentle modification people have done is, instead of the neuron completely shutting down, let's have it go down at a much slower rate, right? So instead of the output being 0, it just 50:31 - 51:19 Dr Anand Jayaraman: gives a much lower slope, like 0.1. This is called Leaky ReLU; people have tried this. And another 1 is called ELU. What this 1 does: because these 2 have 1 point where there's non-differentiability, right? This portion is differentiable, this portion is differentiable, but the joining part is not differentiable. So people felt that maybe networks behave better if you have a continuous function which gently goes down. That is the ELU activation function, where it smoothly transitions from one straight line to the other; it has that transition region 51:20 - 52:07 Dr Anand Jayaraman: that's fitted. These kinds of modifications have been tried out, and in some domains they work well, in some they don't look great. But generally speaking, sticking with ReLU is fine. And it works in the majority of situations, right? From our perspective, we are going to make sure that for pretty much all the hidden layers, we are going to use ReLU. For the output layer, we will use sigmoid or linear: sigmoid we'll use for binary classification, softmax we'll use for multiclass classification, and linear we'll use for regression. Those are the rules of thumb that we will 52:07 - 53:02 Dr Anand Jayaraman: follow from now on. All clear? Whatever we have done till now? Yes, Professor. Now let us go ahead and look back at what our other big problem was. The other big problem was what? We talked about vanishing gradients; that we handled. Now the other 1 is difficulties in optimization. So let us go ahead and try to address that. The way optimization is being done right now is through gradient descent.
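As a quick aside, the three activation functions just discussed can be sketched in plain Python (illustrative only, not the lecture's own code; the slope and alpha values are just common defaults):

```python
import math

def relu(x):
    # Positive inputs pass through unchanged; negative inputs shut the neuron down to 0.
    return x if x > 0 else 0.0

def leaky_relu(x, slope=0.1):
    # Leaky ReLU: instead of shutting down completely, negative inputs
    # come through with a much lower slope.
    return x if x > 0 else slope * x

def elu(x, alpha=1.0):
    # ELU: differentiable everywhere; smoothly transitions from the line
    # y = x to a curve that flattens out at -alpha for very negative inputs.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)
```

In Keras these correspond to the built-in `relu`, `LeakyReLU` and `elu` activations.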
So, what we are going to do is we are going to try and 53:02 - 53:52 Dr Anand Jayaraman: improve upon gradient descent. That is what we are planning on doing. Again, this gradient descent method stretched the mathematical understanding of some people for whom maths was not their major, right. So you might have found it hard, but hopefully you understood the main idea behind gradient descent. We're going to go back now and talk about how to improve gradient descent. So again, it's going to be a little bit mathematically heavy, not terribly heavy, but a little bit mathematically heavy. But don't give up hope. Just 53:52 - 54:38 Dr Anand Jayaraman: like what has happened till now, everything after a while, when it settles down, when you allow your brain some time to sit with that piece of information, that piece of knowledge, it will slowly start to feel comfortable and you will have a better understanding after some time. Or perhaps the second time you look at the lecture notes, you will feel better about it. So let us talk about this. There are a couple of problems with learning with gradient descent, right. 1 of the problems with gradient descent is that 54:38 - 55:33 Dr Anand Jayaraman: it is extremely sensitive to the starting point, which is fine, we knew that. But it also struggles near saddle points. And once you are in a really high dimensional space, where you are trying to determine, you know, thousands of weights, gradient descent actually starts to struggle really a lot. So we are going to try and find a way to improve upon gradient descent. We're going to start with a quick refresher on what gradient descent is. I should have done this last time itself. I mentioned to you what the gradient does.
I talked about 55:33 - 56:17 Dr Anand Jayaraman: the mathematical definition of the gradient of a function of n different variables. In this case we actually have n different weights, right? The gradient is defined as this, where you're trying to understand the derivative of that function. How sensitive is the function to a change in x1? How sensitive is the function to a change in x2? How sensitive is the function to a change in xn? That's exactly what the gradient is measuring, right? Or directional derivative. What the gradient actually does: at any point, it gives you a vector. In this case 56:17 - 57:06 Dr Anand Jayaraman: there's a two-dimensional graph x, y, and this particular function is plotted as the surface. This is the F of X comma Y. That's that surface. The gradient is a vector. And that vector lies on the XY plane. What does the gradient do? The gradient points in the direction of steepest ascent. If you walk in that direction, you will see the steepest increase in that surface. That is what a gradient means. The gradient is a vector lying in your base plane. And it's pointing in the direction you should walk to see the steepest 57:06 - 57:41 Dr Anand Jayaraman: increase in that function. Hopefully that's clear. If not, here is this really nice applet, which shows you the same thing here. For example, you see the surface there, right? This particular surface. And what you're seeing here on this side is the contour map of the surface. We all know what a contour map is? 57:44 - 57:48 Speaker 4: Hello? Yeah, in geography we can see the landscape using contours. 57:48 - 58:08 Dr Anand Jayaraman: It's a landscape, the contour lines represent lines of constant height. Lines of constant height. So you're seeing the surface. These are all lines of constant height. So I'm seeing the same picture from the top. I'm viewing it from the top and I'm seeing this.
58:08 - 58:10 Speaker 4: It's like a bird's eye view, right? 58:10 - 59:08 Dr Anand Jayaraman: It's a top view. Yeah, you're just viewing it from the top. Now, this gradient is defined at a point. At every point there is a gradient, right? And what does the gradient do? The gradient points in the direction of steepest ascent. I see, I see what it's doing, the red arrow. The red arrow is pointing, it's saying, I'm at this point and this point is mapped here, right? If I walk in this direction, I will see the highest increase in my surface, okay? You see that? That if I walk 59:08 - 59:44 Dr Anand Jayaraman: along that direction, I will see the steepest increase in that surface. Okay, that is the intuition that the gradient gives you. Okay, that's what the gradient does. And the reason why gradient descent works is this: your goal is to find the minimum of a surface. So what you are saying is, at every single point, I will see where the gradient is pointing and I will walk 1 step in the opposite direction. And after I walk that 1 step, I re-evaluate the gradient again. Now it'll again point in a 59:44 - 59:59 Dr Anand Jayaraman: new direction I might have to walk, but each time I am walking in the direction that is opposite to that of my gradient. Right? So I walk in the direction of the negative of my gradient and I take a step of size alpha in 01:00:00 - 01:00:02 Dr Anand Jayaraman: that direction, and that is gradient descent. 01:00:03 - 01:00:05 Speaker 2: Professor just a query 01:00:06 - 01:00:06 Dr Anand Jayaraman: sorry 01:00:07 - 01:00:26 Speaker 2: professor, a query, and I don't know, probably it could be very basic or maybe wrong, but as part of the gradient, you know what the direction of steepest ascent is. Correct. But the opposite of that might not necessarily be the steepest descent.
01:00:26 - 01:00:41 Dr Anand Jayaraman: It is absolutely the direction of steepest descent when your step is infinitesimally small. Right? When it's infinitesimally small, it is absolutely true. Steepest ascent, the opposite is steepest descent. 01:00:42 - 01:00:47 Speaker 2: Okay, yeah. So when the step is infinitesimally small, yes. Okay, I got it. Yeah. When 01:00:47 - 01:01:26 Dr Anand Jayaraman: you are, so now, that step size is determined by this alpha, or the learning rate. And here is a criterion for the learning rate. If you use a very small learning rate, you take a long time to reach there. If you use a very large learning rate, you might actually miss the bottom. So finding the right learning rate is a tricky thing, and you've got to be very careful. And PhD theses have been written on how you choose an appropriate learning rate. Now, here this is a 2 dimensional surface, 01:01:26 - 01:02:05 Dr Anand Jayaraman: so we are able to see it. Right. But when you have to determine 79 weights, then you are actually talking about a 79 dimensional space and the surface in that space. There's no way you can visualize it. So you don't actually know what's happening, how far away from a pit you are, how far away from a peak you are, none of that. So you're blindly following the gradient and you're walking, adjusting in that particular direction. So what you do is, each time, you're looking at what the current value of 01:02:05 - 01:02:50 Dr Anand Jayaraman: the function is, the loss function. The loss function is what we are trying to minimize. What you find is that when you are taking a very high learning rate, quickly the loss function can actually explode. Right, very high learning rate. So you know that your learning rate is too high. Yeah. The step size you're taking is too much. Right.
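The procedure being described can be sketched in a few lines of plain Python (an illustration, not the lecture's code): evaluate the gradient at the current point, then take a step of size alpha in the opposite direction.

```python
def gradient_descent(grad, start, alpha=0.1, steps=100):
    # At each point, compute the gradient (direction of steepest ascent)
    # and walk one step of size alpha in the opposite direction.
    x, y = start
    for _ in range(steps):
        gx, gy = grad(x, y)
        x -= alpha * gx
        y -= alpha * gy
    return x, y

# Example surface: f(x, y) = x**2 + y**2, a parabolic bowl with its minimum at (0, 0).
grad_bowl = lambda x, y: (2 * x, 2 * y)
x_min, y_min = gradient_descent(grad_bowl, start=(3.0, 4.0))
```

Starting from (3, 4), the iterates shrink toward (0, 0); with an alpha that is too large (say 1.1 here), the same loop overshoots and the loss explodes, exactly as described above.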
When you adjust the learning rate lower, then you will actually reach a situation where the loss function is decreasing, and quickly you might reach 1 particular minimum. If you however choose a very low learning 01:02:50 - 01:03:34 Dr Anand Jayaraman: rate, you might often end up at a much lower minimum. Right. So, there's really a trick to adjusting the learning rate, and the obvious trick is: we are always looking for a learning rate which does not require us to take too many steps, nor should it be too big, because if it's too big you will end up at a suboptimal minimum as well. And so this is a big problem. So what people do in practice is this: they use a method called learning 01:03:34 - 01:04:22 Dr Anand Jayaraman: rate decay. What is learning rate decay? What they do in learning rate decay is, they initially start off with a high learning rate. Because initially, you're really far off from the minimum. So let's take some quick steps. After some time, when you find the loss trying to settle down, then you cut the learning rate. Now you start walking much more slowly, right? And what happens? Again, your loss function decreases. So you keep cutting the learning rate to take smaller and smaller steps as you start 01:04:22 - 01:05:04 Dr Anand Jayaraman: approaching your final destination point, so that you don't miss it. This idea is called learning rate decay, where for a few epochs you use a large learning rate, then after a few epochs you cut the learning rate, and again, after a few more epochs, you cut it even more, and so on and so forth. This speeds up the overall speed of learning. Is the broad idea intuitive? When you're far away from your destination, you take really large steps. As you get closer and closer to the destination, you'd be a little bit more careful.
This is exactly what you 01:05:04 - 01:05:34 Dr Anand Jayaraman: do when you're riding on the highway. When you know that you still have 100 kilometers to go before a turn comes, you don't care. But once you start coming closer, then you start watching more closely. You don't want to be in the wrong lane so that you miss the exit. And once you are very close to the exit, you clearly slow down, and you do much more careful navigation. And that's exactly what we are doing here. Is the idea clear? 01:05:36 - 01:05:41 Speaker 3: Yeah, professor, the idea is clear. But how to find out? When? 01:05:42 - 01:06:29 Dr Anand Jayaraman: Very good question. How to find out? These are all tricky things. However, the default version of TensorFlow that is there right now, when you use the default settings, works reasonably for most problems. But these defaults were arrived at after a lot of experimentation. And there are also specific problems where you will have to actually experiment as well. Now, this is 1 method that has been used. There are other methods that are there to improve your optimization. You understand, right? This is a method that will potentially improve the speed 01:06:29 - 01:06:36 Dr Anand Jayaraman: of optimization. Correct? You agree? Let's now see if 01:06:36 - 01:06:56 Speaker 4: the... Professor, as I'm looking at the graph that compares different learning rates, I am able to understand everything except the low learning rate that is kind of flatly descending, right? So I couldn't understand that. 01:06:56 - 01:07:29 Dr Anand Jayaraman: So what happens with a very low learning rate is that your error function decreases very, very, very, very slowly, right? And you are doing many, many, many, many epochs; after, you know, I don't know, maybe 5,000, 6,000 epochs, it's still decreasing very slowly, right? A good learning rate is where it quickly reaches more or less the same minimum compared to this very low learning rate.
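The "cut the learning rate every few epochs" schedule just described can be sketched like this (illustrative numbers; in Keras the same idea is typically wired in via a learning rate scheduler callback):

```python
def step_decay(epoch, initial_lr=0.1, drop=0.5, epochs_per_drop=10):
    # Start with a high learning rate, then cut it (here, halve it)
    # after every block of epochs_per_drop epochs: quick steps while
    # far from the minimum, careful small steps as the loss settles.
    return initial_lr * (drop ** (epoch // epochs_per_drop))

rates = [step_decay(e) for e in (0, 9, 10, 20, 30)]
```

So the rate stays at 0.1 for the first 10 epochs, drops to 0.05, then 0.025, and so on.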
01:07:33 - 01:07:44 Speaker 4: Okay, so then if you see the high learning rate, that reaches the maxima, I mean the minima, quite sooner in number of epochs. 01:07:44 - 01:08:00 Dr Anand Jayaraman: It reached there much sooner, except the minimum it reaches is a higher point than where these others are reaching. Okay, okay, thank you. Right, it reached a suboptimal level and it flattened out at a suboptimal level. 01:08:00 - 01:08:02 Speaker 5: It's reaching the saddle point. 01:08:02 - 01:08:58 Dr Anand Jayaraman: So it actually reaches a different minimum. What happens is, as someone pointed out, right? The statement that the opposite of the direction of steepest ascent is the direction of steepest descent is true only when your step size is infinitesimally small. If it is not infinitesimally small, then you might potentially reach a slightly different place. And that is what you are seeing here. You reach a slightly different minimum. Thanks professor. Now, we already showed this graph of the contour plot. This is the contour plot of some 01:08:58 - 01:10:01 Dr Anand Jayaraman: particular surface. The surface actually looks like this, a parabola. Imagine a surface that looks like a parabola. When you're viewing the surface from here, this is your eye, you're viewing the surface from here, then this surface looks like this in terms of a contour: more or less circular, more or less circular contours. This parabolic bowl is represented by this contour. Everyone okay with that? Now, when I start at 1 point, let us say this is the function axis, this is the function value f of X comma Y, and this is the X 01:10:01 - 01:10:33 Dr Anand Jayaraman: axis, this is the Y axis. And here you are representing the same thing: this is the x-axis and this is the y-axis. That's the same thing. So when you start off from 1 point here, some combination of X comma Y, some point here, here you are computing the gradient.
The gradient will point in this direction. And you take a step in the opposite direction. So, from this point, you walk to this point. And from that point, you walk to that point. From that point, you walk to that point. You will eventually reach the bottom of the 01:10:33 - 01:11:13 Dr Anand Jayaraman: hill. This is basically what gradient descent does. Now, the same picture, when you look at the contour map, looks like this. So, this point is probably here. This is the gradient line, the line of steepest ascent of the surface. The way you can arrive at what the gradient line is, in the contour picture, is like this. In the contour picture, this is your contour. Here is the line that is tangent to the contour; this line is a tangent to the contour at that point, right? The gradient is the direction 01:11:13 - 01:11:49 Dr Anand Jayaraman: that's perpendicular to the tangent. So that is the direction of the gradient. Right? The same direction is what I'm representing here. And we are going to take a step in the opposite direction. We're going to take a step here. And at this point, there is another contour, right? At this point, there is another contour. Then, at that point, what do you do? You again draw the tangent line. The gradient is in this direction. So you walk 1 step in that direction. And you keep doing this, and eventually you reach 01:11:49 - 01:12:36 Dr Anand Jayaraman: the bottom. The same gradient descent, in this picture and this picture: I'm showing you 2 different views of the same process that's happening. It's the same process that's happening. Now, imagine if I had a surface that was not nicely parabolic like this. Instead, let us say the picture looks like a crushed parabola. I am applying a force from this direction and this direction and I am crushing this parabola. So, when you crush it, what will happen?
The circles, the contour map will no longer look like 01:12:36 - 01:13:21 Dr Anand Jayaraman: circles; it will look like this. Agree? Is this clear, folks? Yes. I'm looking at a crushed surface. And I'm gonna ask the question, what happens to gradient descent on a crushed surface? It's possible that in some situations, the loss function is such that the gradient is steep in 1 direction and not so steep in another direction. That's what a crushed surface does. So here, what will happen? Let's say you start off at this point, the gradient line will be, how do you determine the gradient line? This is the tangent line. The gradient is perpendicular 01:13:21 - 01:13:22 Dr Anand Jayaraman: to 01:13:22 - 01:13:24 Speaker 3: the- Perpendicular, yeah. 01:13:24 - 01:14:05 Dr Anand Jayaraman: Yeah, and so you walk in the opposite direction, you reach there. There, you again do the gradient and walk in the opposite direction, you reach here. And what ends up happening? You keep doing this process. So the process of you trying to reach the middle goes haywire like this. You're taking a scenic route down there instead of taking a direct route. The direct route happened with the previous nice curves: you directly went from this point to there. But there are times when, in this higher dimensional space, you have complicated 01:14:05 - 01:14:32 Dr Anand Jayaraman: surfaces where you end up having to do this, which means it takes a really long time for you to end up finding the minimum. Is the problem clear? This is 1 of those kinds of problems that happen when you are dealing with a large dimensional surface. Okay, clear idea? Now, how do you solve it? What do you do? 01:14:33 - 01:14:37 Speaker 5: Professor, Sachin here, what happens if it is not contour, right? 01:14:40 - 01:14:59 Dr Anand Jayaraman: Whether it's a contour or not, it's the same thing that's happening. Okay.
I'm just saying the surface is not a nice parabolic bowl; instead the surface is a crushed bowl. Imagine you're going to the Grand Canyon, right? In the Grand Canyon, there are crevices that are there. The surface is not even, right? It won't be a nice bowl. 01:14:59 - 01:15:02 Speaker 5: I used the wrong word, but I was asking what if it's not a parabola. 01:15:04 - 01:15:48 Dr Anand Jayaraman: Yeah, I'm talking about a surface which, let's say, looks like a parabola from 1 side, but when I look at it from this side, it looks very narrow. A narrow thing, man, I can't draw. This is, I give up. Imagine that parabola, which is just crushed, that's all. Right? A parabolic bowl that's just crushed. That's what this thing is. So the main problem is, on 1 side, the gradients are smooth. On the other side, the gradients are very steep. In 1 direction, the gradients are shallow; in the other direction, the gradients are 01:15:48 - 01:15:59 Dr Anand Jayaraman: steep. That is the problem with the surface. And these kinds of structures exist in higher dimensional space. It is there in 2 dimensional space; it is of course there in higher dimensional space. 01:15:59 - 01:16:04 Speaker 3: So Professor, so we are talking about crushing from the top and crushing from the sides. 01:16:04 - 01:16:06 Dr Anand Jayaraman: Only crushing from the sides. 01:16:07 - 01:16:11 Speaker 3: Yes, that's how we find out how deep is 01:16:13 - 01:17:09 Dr Anand Jayaraman: the parabola. I'm saying there are situations where the surface looks like a crushed parabola. And there, gradient descent takes a really long time, struggles to reach the bottom. That's what I'm saying. So what people have done, the way they figured out how to solve it, is this, right? They try to solve it using an idea called momentum, okay? What is momentum? Now, imagine I am coming in a car, right?
I took a step in this direction and I am asked to take a step in that direction; it's very hard for me to immediately change. 01:17:11 - 01:17:42 Dr Anand Jayaraman: Right? So let me first tell you the core idea of what we have done, and then I'll tell you why this is actually momentum. What I'm gonna do is I'm gonna look at: what is the direction in which I walked just now? And what is the new direction that I'm being asked to walk? This is my old direction. This is the new direction in which I'm being asked to walk. What I'm gonna do is 01:17:42 - 01:18:29 Dr Anand Jayaraman: I'm not gonna listen to this advice of walking in this direction. Okay, instead, I'm going to walk in the direction of the average of the old direction plus the new direction. I'm going to walk in the direction of the average of these 2. What does it mean? What does the average even mean? Okay. So this is a vector, let me call it vector O. This is the new vector, right? I'll call it vector N. What is the direction that O plus N points to? In between. In between. Remember vector addition: when I have 01:18:29 - 01:19:09 Dr Anand Jayaraman: a vector A like this and I have a next vector, arranged from tail to head, like that, the resultant vector is in this direction. You remember this from vector algebra from a long time ago? Yeah. So this was the direction I was supposed to walk. This is the new advice that is given. If I walk in the average direction, I will be walking towards where the minimum is. Right? And that is basically equivalent to having momentum. I was originally going in 1 direction. You are suddenly asking me to make a turn to that side. My turn won't 01:19:09 - 01:19:14 Dr Anand Jayaraman: be immediately to that side. I'll be sort of moving in an average direction. Right?
01:19:15 - 01:19:16 Speaker 6: That- But in this case, Professor, 01:19:16 - 01:19:24 Dr Anand Jayaraman: having momentum allows you to take the blue path rather than 01:19:25 - 01:19:33 Speaker 6: this. So Professor, if you're moving in the average direction from the second point, we will actually be heading in the wrong direction. It's only. 01:19:35 - 01:20:17 Dr Anand Jayaraman: So it is a question of implementation, right? Actually, I'm talking only about direction. I'm not talking about position. What direction am I going to walk? The direction that I'm going to walk each time is going to be based on my current direction, in which I just walked, and the new direction which is being pointed. I will not change my direction dramatically to whatever you're pointing; I'll change my direction to the average. That is the principle. And again, this exact statement I made: this was the simple gradient descent. Now my modified gradient 01:20:17 - 01:20:41 Dr Anand Jayaraman: descent is, I'm not going to walk in the direction given here. This is my old weight. I'm adjusting that weight based on the negative gradient. But now, here, I'm going to adjust the weight in a different direction, which is based on the gradient and whatever direction I previously walked. That's what this formula is trying to tell you. Don't worry too much if that formula actually looks too complicated to you. 01:20:41 - 01:20:51 Speaker 3: So basically, professor, we need to calculate the first direction, then predict the second direction, and the start point would be 0, the start point. 01:20:51 - 01:21:34 Dr Anand Jayaraman: Yeah, the first direction, you listen to the gradient. Next time, it'll again tell you a gradient, but this time don't listen to it fully. Take the average of where you were going before versus where you're being pointed. Each time, let's take the average. Now how do I implement it? This looks like very complicated math.
In terms of implementation, how does this get implemented? The way it gets implemented is this. You remember, we called an optimization function when I showed you the code, right? This was the neural network model: a sequential model, a dense layer of 10 neurons with 5 dimensions, 01:21:34 - 01:22:25 Dr Anand Jayaraman: 5 input dimensions, and so on. Right, we did this. Now, just before fitting, I call this SGD. What does SGD stand for? Stochastic gradient descent. Stochastic gradient descent, brilliant. Stochastic because we are doing it in batches, right? We don't do the full data, we do it in batches, right? Now, to that, you just add an argument: allow it some momentum. You allow it some momentum, and this automatically speeds up the training. How much momentum to allow is a parameter, a hyperparameter, right here, and this number is telling you how much of my old direction I should listen to compared to my new direction; what combination do I use, how much weightage to give the old versus the new. That is what momentum is talking about. 01:22:38 - 01:22:48 Speaker 4: So, does that mean, if the momentum is higher, then the time it takes to reach the optimum will be longer? 01:22:49 - 01:23:24 Dr Anand Jayaraman: It depends on the surface. Unfortunately, for any of the questions you ask, the only correct answer I can give is: it depends. It depends on the problem we are trying to solve, because we don't know what the surface is, and we are all like blind men who are trying to navigate our way to the bottom of the hill, not on a two-dimensional surface. We are in, you know, some 79 dimensional space, so we can't even see the surface. We are just blindly moving from point to point. And it just turns out that you need to try these different 01:23:24 - 01:23:44 Dr Anand Jayaraman: values to see which 1 works optimally. Which is a painful thing, right? This is why deep learning requires a lot of computation.
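A plain-Python sketch of the momentum update being described (illustrative; the lecture's actual change is just passing a momentum argument such as `momentum=0.9` to Keras's SGD optimizer). The velocity vector plays the role of the "old direction", and each step blends it with the freshly computed gradient:

```python
def sgd_momentum(grad, start, lr=0.05, beta=0.9, steps=300):
    # beta controls how much weight the old direction gets versus
    # the newly computed gradient (beta=0.9 keeps 90% of the old).
    w = list(start)
    v = [0.0] * len(w)          # velocity: the running "old direction"
    for _ in range(steps):
        g = grad(w)
        for i in range(len(w)):
            v[i] = beta * v[i] - lr * g[i]   # blend old direction with new gradient
            w[i] += v[i]
    return w

# "Crushed bowl" from the lecture: f(w) = w0**2 + 10 * w1**2,
# shallow in one direction and steep in the other.
grad_crushed = lambda w: (2 * w[0], 20 * w[1])
w_final = sgd_momentum(grad_crushed, start=(3.0, 2.0))
```

Plain gradient descent zig-zags across the steep direction of this surface; carrying the velocity damps the zig-zag and the iterates settle near the minimum at (0, 0).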
People basically play around with it, try out different parameters to see which 1 results in the most optimal solution. Right? I will also pre-empt 1 question. 01:23:45 - 01:23:56 Speaker 4: This is the same as the learning rate we have, or is it? I mean, I know it's different, but what is the intuition? I'm not able to understand momentum versus learning rate. 01:23:56 - 01:24:29 Dr Anand Jayaraman: Learning rate is the size of the step. Momentum is talking about which direction I'm going to walk in. Actually, learning rate is another parameter that's available to add here. But right now I'm not specifying it, so it takes the default. Now, learning rate is about the step size; momentum is about the direction in which you're walking. Okay, okay. Thanks. 01:24:30 - 01:24:40 Speaker 2: Yeah, this code, right, what you've written, is it something which you can do using Azure Studio and that kind of thing, where you don't? I mean, it's 01:24:40 - 01:25:16 Dr Anand Jayaraman: more efficient. Yeah, I'm confident these settings are available in Azure Studio. Right? I don't know that tool in detail. All I am trying to tell you is, I'm not asking you to memorize the code. I'm saying the theory might be complicated, but people have written the code. And all that most of us end up doing is changing some numbers in pre-written code. So implementation has become easy, which is why all of us who have more of a domain understanding, rather than the technical details, will also be able to code and play and 01:25:16 - 01:25:42 Dr Anand Jayaraman: make predictions. Previously, only the programmers were able to do all of that. But now it's actually gotten to a level where, if you have a broad conceptual understanding, you know that there are these different parameters you can tune to speed up your learning. You can do that. That's my point. Right?
You don't need to be aware of that 79 dimensional space at all. 01:25:44 - 01:25:55 Speaker 6: Professor, coming to the momentum question here. Momentum, we are not actually adding the vectors, we are adding the vectors in proportion, correct? This 0.9 means 0.9 of the current vector, because if we just add the vectors 01:25:55 - 01:25:56 Speaker 3: we will 01:25:56 - 01:26:00 Speaker 6: probably diverge, right? I mean, at least for the example you gave. 01:26:00 - 01:26:05 Dr Anand Jayaraman: So we are taking an average. This is actually a weighted average. 01:26:05 - 01:26:07 Speaker 6: Weighted, yes. Weighted average. 01:26:08 - 01:26:20 Dr Anand Jayaraman: Right. A normal average would be 0.5 of A plus 0.5 of B. Now we are doing a weighted average: say, a weight of 0.9 of A plus 0.1 of B. 01:26:21 - 01:26:31 Speaker 6: And this weight says 0.9 of the currently recommended vector plus 0.1 of the... yes, okay, we still give more weight to the currently recommended direction. 01:26:31 - 01:27:13 Dr Anand Jayaraman: That is something that you have the ability to play with. Absolutely. You have the ability to play with that. Thank you, Professor. Yeah. Now, the complication hasn't ended yet. I will just make 1 more change. I'm sorry, I'll take the questions in 5 minutes, right? We're approaching break time, right? Just 5 minutes. This is 1 way of handling the problem. There is another way of handling the problem. This was a method that was suggested by this gentleman called Geoff Hinton. Wait, where have I heard this name? In every single lecture 01:27:13 - 01:27:57 Dr Anand Jayaraman: that we have had, we have heard this name, right? This guy also made a contribution. Sorry. Yes, there is a reason for it. In every single lecture you will hear of some contribution that he has made. He is based out of Toronto. The contributions that he has made are amazing. And he came up with another 1.
He said, why should the step size be constant? The step size also we will adjust, based on whether the current gradient is steep or shallow. We'll automatically adjust the step size based on the current 01:27:57 - 01:28:39 Dr Anand Jayaraman: size of the gradient. And there's what looks like a very complicated formula; don't worry about it. It is genuinely not all that complicated, even though it looks very complicated. The idea is that we use different learning rates, depending on whether 1 direction is more steep or less steep. Now, this modification is called RMSProp, or root mean square propagation. You see the squares there and the root; there's a root mean square value in there. So this method is called RMSProp. This also improves the performance. Let me show 01:28:39 - 01:29:18 Dr Anand Jayaraman: you how it improves the performance. Right, you are starting from this point. You want to reach the bottom of the surface. And this surface has a saddle point. You're around the saddle point right now. Now, here are these different algorithms. 1 is stochastic gradient descent. This was our original gradient descent algorithm. And then this 1, the green 1, is momentum. And there's a bunch of different methods; people have been trying a whole bunch of these other methods. RMSProp is what Geoffrey Hinton suggested. Now, I want you to pay attention now. Focus only on the red dot. 01:29:18 - 01:29:51 Dr Anand Jayaraman: Only on the red dot. You see what the red dot is doing? It's slowly walking down, trying to climb down the surface. Right, this is the original gradient descent. Are you noticing how slowly it's walking down? And then momentum. Momentum is what we discussed first. Right? Momentum is a guy who's coming on a motorcycle, right? Or a gal who's coming on a motorcycle, right? What happened to the momentum one, the green line? You went there, right? Zoom!
It went past and then it turned back and then came here, because you have momentum, right? The motorcycle 01:29:51 - 01:30:29 Dr Anand Jayaraman: just sped past and then turned around and is coming down. That's what happened with momentum. RMSProp is this trickier one where you adjust the learning rate based on whether you are near a steep gradient or a shallow gradient. RMSProp, what does it do? It beautifully turns around and very quickly comes to the bottom. Are you noticing it? Don't worry about complicated formulas. Don't worry about complicated formulas. All I'm saying is people have thought about it and made modifications to this gradient descent method. The original gradient descent method is this red line, which still 01:30:29 - 01:30:51 Dr Anand Jayaraman: works, it just takes a long time. But there are these other modifications that allow the training to move faster. How exactly do you put in this RMSProp? You know what? Again, it's very simple. Instead of calling stochastic gradient descent for optimization, you call for RMSProp. And you go ahead and do everything else the same; that's the only change that is made. 01:30:52 - 01:31:06 Speaker 2: Sorry Professor, I'm slightly confused. Can you show the previous slide? Yeah, right. The momentum, even though it comes in a different direction, goes faster than the RMSProp. 01:31:06 - 01:31:31 Dr Anand Jayaraman: Correct. Because you have momentum, you're continuing. The gradient descent is telling you to turn this side, but you're giving a lot more value to the direction in which you came before. So you're continuing to move forward in that direction. Right? We don't listen only to the gradient descent. We are saying, I'm going to pay attention to the direction in which I came as well. 01:31:32 - 01:31:35 Speaker 2: Agreed. So my question is, is momentum better than RMSProp?
01:31:36 - 01:31:38 Dr Anand Jayaraman: No, but Bhaskar, if you see, 01:31:38 - 01:31:44 Speaker 4: the green dot is actually moving slower to the minima, right? I mean, when it reaches the minima, it is actually slower. 01:31:45 - 01:31:59 Dr Anand Jayaraman: But then again, it turns around. So very good question. Is momentum better than RMSProp? You know what? For all these complicated questions, what is the correct answer? 01:32:00 - 01:32:01 Speaker 3: Depending on the 01:32:02 - 01:32:42 Dr Anand Jayaraman: It depends. So here is another innovation. People said, why don't we use both methods? When both methods are used together, they call it adaptive momentum, or it's shortened as Adam. Okay, what it does is: this is RMSProp, this is just your gradient descent; with momentum and RMSProp together, the errors decrease much faster than with either one of them separately. When both of these algorithms are combined together, it is called Adam. And the nice thing about Adam, again, very easy implementation: for the optimization, you just call it 01:32:42 - 01:33:35 Dr Anand Jayaraman: Adam. Don't call stochastic gradient descent; instead you call Adam, which basically uses momentum and also RMSProp together, and this dramatically speeds up the optimization. Okay, so let me pause there and we'll take our break. For me, this went much faster than I expected, but it might have been very, very, very heavy for you. I apologize. But when we take a break and come back, I'll just quickly summarize this, and hopefully it will help. So, it is actually root mean squares of the gradients in the different directions. It looks at what is the 01:33:35 - 01:34:14 Dr Anand Jayaraman: gradient in X, what is the gradient in Y. It takes the squared values of those 2 gradients and adjusts the learning rate based on the values.
Whichever direction has a steeper gradient gets a lower learning rate; whichever has a shallower gradient gets a greater learning rate. The intuition is: when you're coming down a steep slope, you take small steps. When you're going on flat level ground, you take large steps. So the step size will depend on the gradient; that is the innovation. Dr. Jain: A very fundamental question, will the speed at which you descend matter? It should 01:34:14 - 01:34:52 Dr Anand Jayaraman: matter, right? That is what is covered in momentum. So the speed at which you ultimately reach the answer is your overall speed of learning. It's measured in number of epochs: how many steps, how many epochs you have trained, right? What we are- That's not what I mean. Sorry? The speed at which you finally reach the answer depends on what things? First, it depends on the surface, right? What type of mountain are you climbing down? Correct. What does it even mean? First, which domain problem are you solving? How many hidden 01:34:52 - 01:35:34 Dr Anand Jayaraman: layers you have, how many neurons you have. All of that determines the surface. It also depends on the step size that you're planning on taking. It also depends on what strategy you're using. Are you taking momentum, essentially acting as if there is some inertia: I'm going to continue moving a little bit in my past direction before I turn. Right, that's what momentum does. Momentum was introduced because we thought it allows you to speed to the end point quickly. Which it does, and it helps in 01:35:34 - 01:36:18 Dr Anand Jayaraman: many situations, but around the saddle points, it gets stuck. So RMSProp will fix that for you. RMSProp handles saddle points very well. So this combination of momentum and RMSProp together is called Adam. The optimization algorithm is called Adam.
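[Editor's aside: a rough sketch of the two update rules being described, on an assumed toy quadratic loss whose two directions differ sharply in steepness. These are illustrative NumPy versions, not the actual library implementations, and all hyperparameter values here are assumptions, not from the lecture.]

```python
import numpy as np

# Assumed toy surface: f(w) = w1^2 + 100 * w2^2, so the w2 direction is
# much steeper than the w1 direction. Minimum at (0, 0).

def grad(w):
    return np.array([2.0 * w[0], 200.0 * w[1]])

def rmsprop(w, steps=2000, lr=0.01, rho=0.9, eps=1e-8):
    s = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        s = rho * s + (1 - rho) * g ** 2      # running mean of squared gradients
        w = w - lr * g / (np.sqrt(s) + eps)   # steep direction gets a smaller step
    return w

def adam(w, steps=2000, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g             # momentum piece (weighted average)
        v = b2 * v + (1 - b2) * g ** 2        # RMSProp piece (squared gradients)
        m_hat = m / (1 - b1 ** t)             # bias corrections for the
        v_hat = v / (1 - b2 ** t)             # zero-initialized averages
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w0 = np.array([3.0, 1.0])
print(rmsprop(w0))  # ends up near the minimum at (0, 0)
print(adam(w0))     # likewise near (0, 0)
```

In Keras, as the lecture says, the only change is swapping the optimizer name passed to `model.compile` (e.g. `optimizer='rmsprop'` or `optimizer='adam'` instead of `optimizer='sgd'`).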
It's not named after the discoverer or inventor of this method. Adam refers to adaptive momentum. And that is the better optimization algorithm that is followed. It's 8.05 by my watch. Let's take a break; at 8.15 we'll come back and talk. I want people to have time for this to settle down. I'll come 01:36:18 - 01:36:21 Dr Anand Jayaraman: back, do a quick refresher of it, and then we'll move on to 01:36:21 - 01:36:31 Speaker 2: the next. Professor, which method did you use for the first half of learning? Adam? Which one did you use for the first half of learning? 01:36:33 - 01:37:11 Dr Anand Jayaraman: So this method is actually, in a sense, an adaptive method. You remember I talked about learning rate decay? Yeah. Right? With learning rate decay, we were manually controlling the learning rate. Now, what these methods are doing is automatically controlling the learning rate based on whether the surface is steeper or shallower. Based on that, it's automatically controlling all of that. So we have innovated and are doing something better than before. It's adaptively learning from the surface. 01:37:13 - 01:37:19 Speaker 2: So more often than not, people use Adam. Exactly, more often than not, people use Adam. 01:37:20 - 01:37:30 Dr Anand Jayaraman: So I'll give you some of these rules of thumb towards the end. Use ReLU. Use Adam. And then some of these other pieces of advice that are coming in. 01:37:31 - 01:37:53 Speaker 5: Yeah. Thanks. Professor, I just have a question. Yeah, so I was asking, you've given an example of an open bowl, right? If it is a closed bowl, then the minimum might be at the starting point itself, right? Or maybe at the end; it's not always 01:37:53 - 01:38:00 Dr Anand Jayaraman: in the center. So, ultimately, you want to find the bottom of the bowl. So, if it 01:38:00 - 01:38:09 Speaker 5: is a closed bowl, then the center would be the highest point, because it's not an open bowl, right? For an open bowl the center would be.
01:38:10 - 01:38:14 Dr Anand Jayaraman: When you say closed bowl, are you referring to an upside down bowl? 01:38:14 - 01:38:15 Speaker 5: Yes, yes. 01:38:15 - 01:38:42 Dr Anand Jayaraman: Yeah, yeah. So there are going to be surfaces that have some portion upside down and some portion as a normal one. Right. But this algorithm will always try to go and find wherever there is a minimum. There will be some minimum. There will be some minimum. Okay. You won't necessarily be able to find the global minimum. That's all I'm saying. Okay. Yeah. Okay. Thank 01:38:42 - 01:38:44 Speaker 3: you, Professor. We will see you. 00:05 - 00:46 Dr Anand Jayaraman: But hopefully, seeing the final code helped you realize that even if you don't understand the technical details, it's okay. The only thing I'm conveying is that there were a bunch of different decisions that had to be made, a bunch of different innovations that were done in order to make deep learning possible, right? That's what we're talking about. In this whole process of improving the optimization method, a huge amount of innovation had happened in the back end. And this Adam is one of them. And I don't mean to imply that the innovations have stopped, right? It's happening; people are inventing 00:46 - 00:58 Dr Anand Jayaraman: better optimization methods, and people are also trying different activation functions and a bunch of other things as well, as we will see. Questions, please. 01:00 - 01:16 Speaker 2: So Professor, one thing, maybe you'll cover it later. We spend so many computational cycles on these incremental steps. Can we not start at better than random places in the search space? Is that something that we'll cover later? 01:16 - 01:26 Dr Anand Jayaraman: Yeah, very nice, very nice. Right now, that's one thing. That's exactly it. Did you have a sneak peek at my lecture notes? Is that the reason you're asking this? 01:26 - 01:27 Speaker 3: No, sir. 01:27 - 02:29 Dr Anand Jayaraman: So that's where we are going.
Absolutely. I love this intuition. I love this intuition. And this is exactly it; you're right. So till now I've been talking about starting at a completely random place. And we will try to provide some nuance around that. So in each part you try to make improvements. So for me, this whole idea of deep learning and the progress that is being made in deep learning sort of reminds me of a certain kind of innovation. I'm a cricket fan, okay? I'm a cricket fan. And a lot of these cricket analogies come quite naturally 02:29 - 03:20 Dr Anand Jayaraman: for me. I apologize to you folks if, you know, the analogy does not stick well or you're unfamiliar with cricket. If you are unfamiliar with cricket and you don't really enjoy cricket quite as much, I'm really sorry, you're missing out on a lot of fun in life. But you know what, not everyone can be lucky. So let me go ahead, for the others who understand cricket, and talk about one particular time in history. The South African cricket team was banned from playing international cricket because of the country's apartheid policies and so on. So pretty much all countries were not 03:20 - 04:04 Dr Anand Jayaraman: dealing with them at that time. This is what I'm talking about, this is in the 80s. Towards the end of that, in the early 90s time period, when they changed the policies and so on, South Africa was being welcomed back into global affairs. And as a part of this welcoming back into the mainstream, different countries were inviting their sports teams to come and play. And India had invited their cricket team. Now, what had happened with the South Africans until then: they were not playing any international cricket. They were 04:04 - 04:43 Dr Anand Jayaraman: just playing internally. And so they didn't actually have a way of checking, measuring their standards versus the rest of the world. Countries were not playing with them.
So when they came in initially, they realized that there were a bunch of places where they had to improve, right? There were a bunch of different skill sets that they had to improve. Now, one of the strategies came from the South African coach, this gentleman called Bob Woolmer, right? At least the name is hopefully familiar to some of you folks. So he talked 04:43 - 05:30 Dr Anand Jayaraman: about this. He was one of the early guys to look at statistical analysis of these cricket games. And he realized that most cricket games, which were one-day games (the T20 hadn't started yet), were being won or lost by a margin of around 15 runs. Not a lot. For people who are unfamiliar with cricket, it's not a lot of runs. The difference between the winning team and the losing team was often around just 15 runs. And so what he realized was that if I am somehow 05:30 - 06:16 Dr Anand Jayaraman: able to sneak an extra average player into my team (a team is supposed to have 11 players), a new player who on average is able to get me 15 runs, then I can win most games. Right, obviously you can't sneak in a new player unknown to everybody. So given that, what can you do? You have 11 players; you find a way to make sure that each one of those guys finds an extra contribution of 2 runs, right? So if each player can contribute an extra 2 runs, I have made up 06:16 - 06:54 Dr Anand Jayaraman: my 15 run shortfall, right? So this is the main reason why the increased speed of running between the wickets started. This was the main reason why diving fielding, where people are just diving and stopping the ball, started. All of that started with the intention that in a one-day game, 50 overs, if each player can contribute 2 runs, then that's more than enough to win most of the games.
That was the calculation that he did, which is a very different way of looking at it than most of the other countries. In India, we had Sachin Tendulkar; 06:54 - 07:38 Dr Anand Jayaraman: you look at a star player making a big contribution. Instead of looking at it like that, this guy was looking at every player making some small contribution. And that takes your team forward much farther than what would have been possible. This is also beautifully captured in this movie called Moneyball. You've got to watch that as well, right? Again, analytics plays a big role in Moneyball. Now, why am I saying all of this? I'm saying all of this because this is what made deep learning work. It is not one single innovation. Multiple innovations, each one adding 07:38 - 08:17 Dr Anand Jayaraman: just a little bit to your efficiency. Each one adding a little bit to the efficiency; together, all of this combination really helped move deep learning forward. Right, that is the really cool thing about all of this, right? It's amazing, each one making small, small contributions. You might say that an idea is a good idea, but does it always work? No, it depends on the situation. And there is another idea, and then there is another idea; putting all of this together finally makes all of this magic. 08:19 - 08:20 Dr Anand Jayaraman: Yeah. Sorry for 08:20 - 08:23 Speaker 4: Got nostalgic about Jonty Rhodes, Professor. 08:23 - 08:51 Dr Anand Jayaraman: I know, right? Amazing. I mean, yeah, Jonty Rhodes, absolutely. It was like, the rest of the world was thinking that we are all starting at the same point in a 100 meter race, and this gentleman was already at 90 meters, right? There was like no competition at all at that time for the way he was fielding. Amazing. Okay. 08:51 - 09:25 Speaker 5: Again, on a lighter note, right, being non-mathematical, the gradient descent reminded me of Sabarimala.
From Sabarimala to Pampa. Walking down from Sabarimala hill to Pampa, right? How do you walk down the cliff step by step? So it reminds me of gradient descent. While you were teaching the mathematics, I was thinking, okay, how did I walk down from the top of the cliff all the way to the bottom? And what's the speed, right? The learning rate, as you said; the speed is exactly the same idea. How do I take my steps where it is very steep? 09:25 - 09:26 Speaker 5: There I would walk slowly. 09:27 - 10:15 Dr Anand Jayaraman: Exactly. That's exactly the intuition. That's exactly the intuition, right? Just to be clear, right? When you actually take a formal deep learning course in the US, they make you do the mathematics and actually make you prove that this idea converges and so on. These are all deeply mathematical ideas, and there are solid mathematical proofs behind all of this. But for us, I mean, I don't say we don't care about it, but since we are not actually bothered about recreating that, we're just giving an overview of the different ideas that went in. 10:15 - 11:03 Dr Anand Jayaraman: It's always nice to see those ideas that go in. Because once in a while, when a problem misbehaves, it'll allow you to think back and say that there are other parameters I can play with as well, and see if I should change some of the parameters and whether that would result in success, right? So even for a business person, knowing that there are these other control knobs helps. So hopefully that's the big picture that you're getting. Okay, let's move forward. Let's get to weight initialization. Good. So we have been 11:03 - 11:58 Dr Anand Jayaraman: talking about, we are behind. It's fine. We'll catch up. So we are talking about different ways in which we can improve. One of the places we still haven't touched is our initialization, right?
We still have this complicated surface, with multiple minima and so on. And we are still just starting off at some random point. And from that random point, we are trying to walk down with gradient descent. So this random point on the surface is here, and you're walking on this gradient all the way to there. 12:00 - 12:45 Dr Anand Jayaraman: Is there something we can do over there? There are a bunch of different things we can do. We're going to start off with the simplest thing we can do; then, a couple of episodes later, we'll come back and talk about much better ways of doing it. So this weight initialization is really an active area of research. One initialization that you would normally think of is to just start off with a random Gaussian. For example, I need to determine 1373 weights. So here is my vector with W1 12:46 - 13:29 Dr Anand Jayaraman: all the way to W1373. One way to start out: each weight I will randomly pick from a Gaussian distribution. Each one I will randomly pick from a Gaussian distribution. This one from a Gaussian distribution, this one from a Gaussian distribution, and so on and so forth. This is purely random initialization. This is fine as a starting point, but it turns out that when you have a large number of neurons, a large number of weights to determine, this takes a really, really long time to start to converge. So there are 13:29 - 14:35 Dr Anand Jayaraman: these other ways of initialization. What are these other ways of initialization? The other way of initialization is this. Imagine, here are my input nodes. Input nodes. Okay, I have 6 nodes here; let's say I have 4 here and 3 here. Some nodes like that. So when I'm talking about weights w1, w2, w3, and so on and so forth, each node is there, right? These are all the weights that I'm talking about. These are all densely connected neurons.
So each of these neurons has connections to all the neurons from the previous layer. Let me just 14:35 - 15:16 Dr Anand Jayaraman: make more neurons here, okay? Let's say in this layer there are 10 neurons, in this layer there are 4 neurons, and in this layer there are 3 neurons. Now, initially I said all of these weights will be chosen from a Gaussian distribution. And I said it works, but there are better ways. What do we mean by better ways? For now, let us go back to imagining the sigmoid activation function. The argument remains the same whether we're talking about sigmoid or ReLU or whatever. For now, let's 15:16 - 16:01 Dr Anand Jayaraman: think about sigmoid. It's easier to explain initially using sigmoid. Now, I want you to think about currents flowing in from these neurons to this particular neuron. And similarly, currents flowing in from all of these neurons into this neuron. The weights that are here, W73, W74, whatever, some set of weights that are there; the weights are essentially amplifying the currents or suppressing the currents. That's what it is. When I'm choosing these weights, I'm talking about the level of amplification or suppression of the current that is going into this neuron. Now, when 16:01 - 16:51 Dr Anand Jayaraman: you are choosing at random, we are effectively saying they are equally likely to be high or equally likely to be low. That is what we say. But if that is the case, imagine this neuron. This neuron is getting currents from 10 of them. 10 of those connections are supplying their currents to this one. Whereas this neuron, only 4 of them are supplying currents to it. Which neuron has the greater likelihood of saturating faster? Is this neuron likely to saturate faster, or this one? This one is likely to saturate faster. Because the larger the amount of current, the more quickly it 16:51 - 17:40 Dr Anand Jayaraman: saturates, right?
This neuron is likely to saturate faster. But we don't want the neurons to get saturated fast. We want the weights to modulate in such a way that all of them don't saturate unnecessarily. If that is my intention, then what that should mean is that the weights flowing in over here should be smaller in size, because currents from 10 neurons are coming in. So these weights should be small; this set of weights should be small compared to this set of weights. Agree? Is that intuitive? 17:40 - 17:42 Speaker 6: Agree. Yes. 17:42 - 18:38 Dr Anand Jayaraman: So what we do is we say that the weights we will pick are not random numbers from a standard Gaussian distribution. See, the standard Gaussian distribution has a standard deviation of 1 and a center of 0. Instead of picking weights from that, we want smaller weights for here and relatively larger weights for here, so we will adjust the width of the Gaussian distribution to be narrower or wider based on how many neurons are sending in currents. If a larger number 18:38 - 19:21 Dr Anand Jayaraman: of neurons are sending in their current, then I want the values that are likely to be picked to be in a smaller range. But if the number of neurons that are sending in current is small, I want the potential values of the weights to be larger. Another way of saying it is that the standard deviation of this Gaussian distribution, I am going to make inversely proportional to the number of neurons connecting from the previous layer. The larger the number of neurons from which current is coming, the lower I want 19:21 - 19:26 Dr Anand Jayaraman: the standard deviation to be, so that the weight sizes picked are small. 19:27 - 19:28 Speaker 7: Professor. The smaller 19:28 - 19:50 Dr Anand Jayaraman: the number of neurons, the larger I want the weight size to be.
So the standard deviation, which is basically the width of the Gaussian distribution, should be inversely proportional to the number of neurons. But actually what you're saying is the variance should be inversely proportional to the number of neurons. That is what we said. Professor, would 19:50 - 19:59 Speaker 7: you be able to explain once again? Sorry, I missed it, I didn't understand that part. How is it inversely proportional and why is it inversely proportional? Could you explain again, please? 20:00 - 20:48 Dr Anand Jayaraman: When a larger number of neurons are sending in current, this neuron is likely to saturate quickly. So we want the weights to be smaller, to dampen the current down, modulate it down. Right? Because so many of them are sending in currents, right? So I'm saying the weights that I'm going to pick should be smaller numbers. But we're picking the weights randomly, right? So the range of values I'm going to allow will be small, all close to 0. Or, in terms of the Gaussian distribution, right? The variance of the Gaussian distribution should be inversely 20:48 - 21:40 Dr Anand Jayaraman: proportional to the number of neurons connected. That is the intuition. This method of initializing is called Xavier-Glorot initialization. Xavier-Glorot initialization. And how do I make that statement in my code? This was my code, right? I had 10 dense neurons, 5 input dimensions, activation is ReLU. Now there's one more argument there: kernel_initializer is glorot_normal. I'm saying choose the weights to be inversely proportional to how many neurons are connected, input from how many neurons; you go ahead and set the weights. That's what it's saying.
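[Editor's aside: the kernel_initializer='glorot_normal' idea just described can be imitated directly in NumPy. The rule below is the standard Glorot-normal formula, whose standard deviation shrinks with the number of connected neurons; the layer sizes 10, 4, and 3 follow the example on the board, everything else is an assumption.]

```python
import numpy as np

# Glorot/Xavier-style initialization: the spread of the random weights
# shrinks as the number of incoming connections (fan_in) grows, so a
# neuron fed by many neurons gets smaller incoming weights on average.

rng = np.random.default_rng(0)

def glorot_normal(fan_in, fan_out):
    # Standard Glorot-normal rule: variance = 2 / (fan_in + fan_out),
    # i.e. inversely proportional to the number of connected neurons.
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W1 = glorot_normal(10, 4)  # weights into the layer fed by 10 neurons
W2 = glorot_normal(4, 3)   # weights into the layer fed by only 4 neurons

# The theoretical spreads: W1's distribution is narrower than W2's.
print(np.sqrt(2.0 / 14), np.sqrt(2.0 / 7))
```

In Keras this corresponds to something like `Dense(10, input_dim=5, activation='relu', kernel_initializer='glorot_normal')`, matching the line of code described in the lecture.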
21:41 - 21:59 Speaker 7: Quick question, Professor, that's what I was asking anyway. Irrespective of the weights, right, the system, based on the gradient descent and the type of updates we accept, automatically course corrects. So what is the purpose of initializing the weights? 21:59 - 22:36 Dr Anand Jayaraman: Very well, because what happens is, this starting point matters. This is the surface. You see the surface extends all the way from plus infinity to minus infinity. The minimum is somewhere here. There is no point in starting in this far corner. Right? You want to find a way to avoid starting from this corner, because if we started in this corner, you're going to be taking small steps forever and ever and ever before you reach here. So we are setting the right magnitude of the weights, not the right values, just the magnitude. We are saying 22:36 - 22:48 Dr Anand Jayaraman: these are likely to be small, these are likely to be large. And we are using that as a way of initialization. That is all. So by the initial magnitude. By 22:49 - 22:56 Speaker 7: the weights, and manipulating the weights as smaller in the initial layers versus the later ones. 24:55 - 25:41 Dr Anand Jayaraman: So, see, every layer has weights, right? Normally, if you want to choose randomly, you would say 0.1, 0.2, minus 0.5 or something; pretty much all of them are of a similar magnitude. Instead, I am saying: some of these weights are on the initial layers, some of these weights are on the later layers, right? Each of these weights belongs to a different layer. Be cognizant of whether the weight has 4 neurons connecting or whether the weight has only 3 neurons connecting. Based on that, we'll 25:41 - 25:43 Dr Anand Jayaraman: choose the magnitude. That is all I'm saying. 25:43 - 25:45 Speaker 2: Thank you. Thank you.
As I said, every fielder that saves a couple of runs helps. And this is one of those fielders. 25:55 - 26:00 Speaker 6: So basically, we choose the starting weights, the weights that we need to start with, right? 26:02 - 26:17 Dr Anand Jayaraman: We are still starting at random. The size of it is all we are controlling. The magnitude of it is all we are controlling. Yeah, ultimately the probability distribution, right, Professor? So 26:17 - 26:25 Speaker 3: if you take a narrow one, it means the probability of getting a higher weight is higher than with the flatter one. 26:28 - 27:13 Dr Anand Jayaraman: I'm sure you understand it right, but you said it wrong. Okay. Okay. So here is a narrower distribution. Yes. The range is maybe from minus 0.5 to 0.5. Here is a wider distribution. The range is from minus 2 to plus 2. When you are randomly picking numbers, you are picking numbers from the x axis. The y axis is the probability of that number. So here there is a greater probability of picking smaller numbers. 27:14 - 27:16 Speaker 3: Okay, which is for the later neurons? 27:17 - 27:37 Dr Anand Jayaraman: Here there's a greater probability of picking larger magnitude values compared to here. The chance of you picking minus 1.5 is almost none for this one; for the other one, picking 1.5 is quite likely. 27:38 - 27:51 Speaker 3: Yeah, why not the other way? So you're saying neurons which are connected to more neurons will be the second one, the wider one, and a neuron which is connected to fewer neurons would be the narrow one? No, so this one 27:52 - 28:07 Dr Anand Jayaraman: will be... if this neuron is getting connections from 3 neurons, I will use this one. However, if there's a neuron which is getting connections from many of them, I will 28:07 - 28:21 Speaker 3: use that. The smaller numbers. So that the total that is ultimately coming in, you know. Exactly. So we can visualize it like the current of a bulb, right?
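[Editor's aside: the narrow-versus-wide distribution point above is easy to check numerically. The standard deviations 0.25 and 1.0 below are assumed stand-ins for the "minus 0.5 to 0.5" and "minus 2 to plus 2" ranges sketched on the board.]

```python
import numpy as np

# With a narrow Gaussian you almost never draw a weight as large as 1.5;
# with a wider one, such draws are common. This is why the narrow
# distribution suits neurons fed by many incoming connections.

rng = np.random.default_rng(42)
narrow = rng.normal(0.0, 0.25, 100_000)  # roughly spans -0.5 .. 0.5
wide = rng.normal(0.0, 1.0, 100_000)     # roughly spans -2 .. 2

print((np.abs(narrow) > 1.5).mean())  # essentially zero (a 6-sigma event)
print((np.abs(wide) > 1.5).mean())    # roughly 0.13 of the draws
```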
28:21 - 28:38 Dr Anand Jayaraman: Exactly. Exactly. I will use different analogies. Sometimes I use chemicals flowing, sometimes I use currents, sometimes I use cricket, whatever works. Okay, so 28:38 - 28:41 Speaker 2: is there an analogy with football, 28:53 - 28:55 Dr Anand Jayaraman: Professor? 28:55 - 28:55 Speaker 6: It doesn't come quite as naturally for me, sorry. 28:55 - 30:02 Dr Anand Jayaraman: Now we are going to talk about regularization. So here is regularization. One of the things is, have you discussed regularization? Has regularization been discussed? Okay, I see silence. Does that mean no? Not yet? No, Professor. Okay. Okay. We'll get to it. So, I'm sorry, right? I mean, we're still figuring out the lay of the land regarding which module should appropriately cover which set of topics. And that's why I'm having to ask. Each one of us has the freedom to pick what we want. We do broadly want to cover the full set. But who's going to cover each 30:02 - 30:51 Dr Anand Jayaraman: particular subtopic hasn't been fully articulated, isn't fully clear. That's why I'm asking you whether some particular topic has been covered or not. And of course, each professor does it at a different level of detail based on their own subjective opinion of what they think is important. And that's why this issue is slightly there. But I don't think it's such a big deal. Let's go ahead and cover it. Now, we talked yesterday about the possibility of overfitting. Neural networks can overfit easily, because you are allowing more and more neurons, and so you can fit more and more complicated, potentially 30:51 - 31:42 Dr Anand Jayaraman: more and more complicated functions. And when you have more and more complicated functions, you're going to get the train error to be lower. But the test error might not be low. And I want us to think about this particular problem clearly. And here is the way we want to think about this.
I have a particular set of y values, and here is my matrix of x values: x1, x2, x3, x4. And let's for now talk about linear regression. So y, the model that I'm trying to fit, is this: beta 1 31:42 - 32:25 Dr Anand Jayaraman: x1 plus beta 2 x2 plus beta 3 x3 plus beta 4 x4. This is the model we are trying to fit. And let's say we are getting some large errors. The train error is not small, the R squared numbers are not satisfactory. So we are not getting a good fit. What do you do? What you start doing is feature engineering, right? You start creating new features out of this existing data. So maybe you can add x1 squared, then x2 squared, then x3 squared, and so on. And then 32:25 - 33:12 Dr Anand Jayaraman: after that, x1 times x2, then x2 times x3, all kinds of combinations. You see that, right? The number of columns will quickly explode. Why stop with squares? Why not cubes? You can add those variables too and then see. But you know what will happen? Eventually, when you add more and more features, the quality of your train fit will get better and better and better. But at some point, the quality of your test fit will start dropping, because you have overfitted. So here is the kind of graph that you would 33:12 - 34:05 Dr Anand Jayaraman: have seen before. Model complexity on the x-axis, and on the y-axis is error. What we find is that as the model complexity increases, the error on my training data continues to drop. As you make a more and more complex model, the error on the training data keeps dropping. However, if you look at the test data or validation data, what you'll find is that the error on the validation data will always be higher than the error on the train data. Initially, as you increase the model complexity, the validation error or test error will drop. Beyond some point, it 34:05 - 34:25 Dr Anand Jayaraman: will start going up.
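[Editor's aside: the train-versus-test behaviour just described can be reproduced on a toy problem. Everything here, the sine function, the noise level, and the polynomial degrees, is an assumed illustration, not data from the lecture.]

```python
import numpy as np

# Fit polynomials of increasing degree (ever more feature engineering) to
# noisy data: training error keeps falling as model complexity grows,
# while error on held-out points eventually starts to rise.

rng = np.random.default_rng(1)
x_train = np.linspace(0.0, 1.0, 15)
x_test = np.linspace(0.03, 0.97, 15)  # held-out points in between
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 15)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 15)

results = {}
for degree in (1, 3, 12):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
    print(degree, train_mse, test_mse)
```

Because each lower-degree model is a special case of the higher-degree one, the train error can only go down with degree; the degree-12 fit chases the noise in the 15 training points, which is exactly the overfitting the graph on the slide depicts.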
So this is the test error or validation error. It will start going up. And this one was the train error. I'm sure you've seen this graph before. Overfitting was discussed, or bias and variance was discussed. 34:27 - 34:28 Speaker 6: Right. 34:30 - 35:23 Dr Anand Jayaraman: Is this clear? Is it understandable that this is likely to happen as the model gets more and more complex? Your train error gets smaller and smaller, but the test error goes back up. Hopefully that much is clear. Yes? Yes. So here is a conundrum that we face. When we were building models, how did we build linear regression? When you're building linear regression, for now, just because it's easier for me to draw, I'm going to consider only one variable: y and x. So, I have a bunch 35:23 - 36:08 Dr Anand Jayaraman: of data points here. And what does linear regression do? Linear regression is trying to fit a line through these data points. And the line's equation is y equal to beta naught plus beta 1 x. This is the equation that I am trying to fit. When you are doing linear regression, the question that we were asking was: how do I determine beta naught and beta 1? How do I determine that? The way I determine that is by saying, see, the values in
