
Deep learning session 3 transcript.pdf


Full Transcript


00:00 - 03:15 Speaker 1: [inaudible] 04:06 - 04:12 Speaker 2: Good morning, good afternoon, wherever you are. How are you all doing? 04:17 - 04:18 Speaker 3: Great. Very well, thank you. 04:20 - 04:21 Speaker 1: We – 04:22 - 04:33 Speaker 2: While we wait for others, are there any questions, anything that I can address at all from whatever has been happening, whatever we have been covering? 04:35 - 04:47 Speaker 4: Yesterday when you were showing the upcoming part, you mentioned and confirmed the GAN part, and there are also these variational autoencoder (VAE) models. Will those be covered too? 04:49 - 04:54 Speaker 2: Not variational autoencoders, not in this course. We will touch upon other corners. 04:55 - 04:57 Speaker 4: Okay, okay, okay. 04:57 - 05:29 Speaker 2: I'm also hoping you are seeing the level at which we are covering this. It's going to be at a very high level. I want you to understand the essence of it, but we won't be covering the details of it, right? And as much as possible, I'll be doing my best to simplify without making it overly simplistic or wrong. 05:29 - 05:42 Speaker 4: Okay. I'm in the healthcare domain, and in healthcare some DBN and DNN type architectures, apart from feedforward, are also used. So will those at least be briefly touched upon? 05:46 - 05:53 Speaker 2: All the network architectures will be covered, yes. 05:53 - 06:36 Speaker 5: First, I have a question. I was thinking through the activation function, particularly the sigmoid function. I tried to plot it in Excel to see the graph, and I found that when x is equal to 0, my y is 0.5. My question is: in the neural network, when we are using sigmoid as the activation function, my x is actually the summation of w_i x_i plus my beta, right? I'm ignoring the beta for now. Now, this means that the summation of w_i x_i needs to be 0 at a certain point in time to activate or deactivate my neuron, right?
06:37 - 06:53 Speaker 2: Ah, no, the weights are playing an important role, right? So let me see if I can find a quick picture of a sigmoid that I can paste. 06:59 - 07:04 Speaker 5: So my fundamental question was how it becomes 0. That's where I got stuck, but anyway. 07:04 - 07:29 Speaker 2: Sure, sure. Let me share. Hey, this is interesting; one learns new stuff as one searches. I was searching on Google right now for a sigmoid function. Is my screen visible? 07:31 - 07:32 Speaker 5: Yeah, professor. 07:33 - 08:29 Speaker 2: Okay, so there is our sigmoid function. Exactly. Right? So I'm going to write it as sigmoid of some number, and this is the axis. Okay, in a perceptron, what happens is: let's say there are 3 inputs and 1 output, and we are talking about a particular neuron, a particular problem where there are 3 different types of inputs being given, and you have a classification problem y, so multiple data points are there, and for each data point the output is either 1 or 0. It's this problem that you choose to solve with a perceptron. 08:29 - 09:54 Speaker 2: It's a classification problem and there are 3 inputs. So x1 is fed here, x2 is fed here, x3 is fed here, and this is my output: the predicted value of y is coming out, which is given by sigma of beta 1 x1 plus beta 2 x2 plus beta 3 x3 plus beta naught. Right, this is the full function. So essentially, your z is beta 1 x1 plus beta 2 x2 plus beta 3 x3 plus beta naught; this is your z. The thing is, what happens during training is that these weights, beta naught, beta 1, beta 2, beta 3, right, these 4 numbers, are determined in such a way that you have the smallest error between y and y hat. So the absolute value of the difference between y and y hat is as small as possible.
That is exactly what the training algorithm ensures. 09:55 - 10:05 Speaker 2: Right now, I don't know about the signs of beta 1, beta 2, beta 3, right? Beta 1 can be positive, or beta 2 can be negative. The signs can be anything. 10:06 - 10:07 Speaker 5: Okay, got it. 10:07 - 10:40 Speaker 2: When this combination is 0, when the entire combination is equal to 0, then that represents a probability of 0.5. Okay. We showed this yesterday. Let me see if I can find my lecture deck from yesterday. Remember, this equal to 0. 10:41 - 10:44 Speaker 2: It represents 0.5 probability, right? 10:44 - 10:44 Speaker 1: That's what 10:44 - 10:54 Speaker 2: we're talking about. That represents 0.5 probability. And therefore, it represents the decision boundary. 10:55 - 11:00 Speaker 5: Got it, got it. Yeah, so z equal to 0 is where my line is created. 11:00 - 11:05 Speaker 2: Yeah, that is the decision boundary, which represents a probability of 0.5. 11:07 - 11:09 Speaker 5: Understood, Professor. Thank you. That clears it up. 11:12 - 11:32 Speaker 2: Any other questions? Any other questions before we get started? Okay. That's 28 participants. Let's get started. 11:32 - 11:54 Speaker 2: I see a little bit of a drop in the number of participants from yesterday. I hope yesterday's class wasn't too heavy. But we'll continue. We're going to provide more details about neural networks. OK, so what are the different things that we have covered so far? 11:55 - 12:15 Speaker 2: We so far have introduced what a perceptron is. And we said that a perceptron is actually very similar to a logistic regression, right, particularly when you have a sigmoid activation in the perceptron. And what does logistic regression do? 12:15 - 12:21 Speaker 2: What kind of problems does it solve? Anyone? 12:22 - 12:23 Speaker 6: Single class and multi-class. 12:24 - 12:33 Speaker 2: Logistic regression? No. Binary. Binary classification.
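The sigmoid arithmetic in this exchange is easy to check numerically. Below is a minimal sketch in Python; the weight values are made up for illustration and are not from the lecture:

```python
import math

def sigmoid(z):
    # Sigmoid squashes any real z into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# z = beta0 + beta1*x1 + beta2*x2 + beta3*x3, with hypothetical weights.
beta = [0.5, 1.2, -0.7, 0.3]   # beta0, beta1, beta2, beta3 (made up)

def predict(x1, x2, x3):
    z = beta[0] + beta[1] * x1 + beta[2] * x2 + beta[3] * x3
    return sigmoid(z)

# When the linear combination z is exactly 0, the probability is 0.5:
print(sigmoid(0.0))             # 0.5 -- the decision boundary
# Points with z > 0 map above 0.5, points with z < 0 map below 0.5:
print(predict(1.0, 1.0, 1.0))   # z = 1.3 here, so probability above 0.5
```

The signs of the betas can indeed be anything; only the sign of the full combination z decides which side of the 0.5 boundary a point falls on.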
Classification of multi-class and binary class. 12:34 - 12:41 Speaker 2: So actually, logistic regression can do only binary classification. And what kind of decision boundaries does it have? 12:43 - 12:44 Speaker 6: Linear decision boundaries. 12:45 - 13:06 Speaker 2: Very nice. Linear decision boundaries. That's the main thing with logistic regression. And we are able to go to nonlinear boundaries by moving to the multi-layer perceptron, or what are called neural networks, where we have multiple neurons connected. 13:07 - 13:48 Speaker 2: And this one can have very complicated nonlinear boundaries. And we talked about some of the main design constraints: if I'm solving a problem with P features, right, with P features, then what kind of neural network should I have? What can we tell about our neural network? Input layer. Input layer. 13:49 - 13:50 Speaker 2: Should have P nodes. 13:50 - 13:51 Speaker 6: P nodes, yes. Yeah, input 13:51 - 13:54 Speaker 2: layer should have P nodes. The input layer should have P nodes. 13:54 - 13:55 Speaker 6: Output layer, 1 node. 13:56 - 14:08 Speaker 2: So for the output layer, we don't know yet. The output layer depends on the target. So it depends on the target. If the target is numerical. Linear. 14:08 - 14:31 Speaker 2: Exactly, a regression. Then we are doing a regression problem, so the output layer needs to have a linear activation. Right? And instead, if we have classification, there are 2 kinds of classification problems we can do. One is binary and the other is multi-class. 14:31 - 14:42 Speaker 2: Multi-class, yes. Right? So for binary classification, what should be the activation function? Sigmoid. Sigmoid, very nice. 14:42 - 15:05 Speaker 2: Sigmoid activation function. For multi-class, it is softmax. Softmax activation, with as many neurons as the number of classes.
Right, if you're solving a 3-class problem, the last layer, the output layer, should have 3 neurons. Right. And then we talked about the network architecture. So these are all questions on network architecture. 15:07 - 15:40 Speaker 2: And we said that the number of hidden layers is a hyperparameter. It's a design choice you get to make. And then in each hidden layer, the number of neurons, that is also a hyperparameter. Right? In fact, which activation function to put in is also a hyperparameter, but generally we'll just stick to sigmoid for now. 15:41 - 16:05 Speaker 2: Right, all the internal hidden layer activation functions we'll just keep as sigmoid for now. We won't worry about using a completely different activation function for now. But it's potentially possible to design a neural network with some other, complicated function as well. Any other questions? Anything else that we touched upon? 16:05 - 16:21 Speaker 2: Anything that we touched on not clear? All good so far? Yes, perfect. Now, finally, finally, we are starting to get to: how do we train a neural network? 16:22 - 16:45 Speaker 2: OK. How are the weights obtained? This is going to require us to talk about some details, and we will talk about details. Okay, now the weights are going to be obtained through an iterative process, right? 16:45 - 17:06 Speaker 2: This is not going to be a one-step thing. The weights are going to be determined through an iterative process. And the way we are going to do it is this. We are going to start with a random set of weights everywhere in the network. OK. 17:07 - 17:42 Speaker 2: And if you happen to choose those weights in such a way that the output exactly matches the desired values, then you don't do anything, right? You've randomly chosen the perfect set of weights, right?
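The output-layer rules recapped here (linear activation for regression, sigmoid for binary classification, softmax with one neuron per class for multi-class) can be sketched as plain functions. This is an illustrative sketch, not the course's code:

```python
import math

def linear(z):
    # Regression output: identity (linear) activation.
    return z

def sigmoid(z):
    # Binary classification output: one neuron, probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    # Multi-class output: one neuron per class; probabilities sum to 1.
    m = max(zs)                           # subtract max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# A 3-class problem needs 3 output neurons; softmax turns their raw
# scores into class probabilities (made-up scores for illustration):
probs = softmax([2.0, 1.0, 0.1])
print(probs)   # three probabilities summing to 1, largest for the 2.0 score
```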
Then the first thing you do is you drop out of this program and go buy lottery tickets, because you seem to be amazing at picking the right set of weights randomly. Right. But for the rest of us, for whom the random set of weights does not work: 17:43 - 18:15 Speaker 2: what we then do is we look at the difference between the actual value of y versus the predicted value of y. And then what we are going to do is decrease or increase the weights in such a way that the error decreases. So we keep adjusting the weights, one after another, to try the problem again and see how it works. And again readjust the weights, and so on and so forth. So this is going to be an iterative process. 18:17 - 18:51 Speaker 2: Fundamentally, this is what is going to be happening. Now, in this iterative process there's going to be a question of how we adjust the weights: how much do I increase the weights, or how much do I decrease the weights? That is going to be a tricky question, and there is an algorithm which will help us address that. The algorithm is called gradient descent. Gradient descent. 18:53 - 19:28 Speaker 2: And this is the algorithm that's going to allow us to iteratively change the weights until we get to the correct answer. Okay, I believe the gradient descent algorithm was discussed in class. Right? The act of gradient descent is sort of equivalent to trying to climb down a mountain. Let me try to provide some context over here by talking about a simple problem. 19:29 - 19:29 Speaker 1: Okay, 19:29 - 20:07 Speaker 2: now, what exactly are we trying to do here? It's best we talk about this problem not in the context of neural networks, but in something much simpler than neural networks. Let's talk about linear regression. And then I'll connect it back to neural networks in a second. So, linear regression.
20:07 - 20:58 Speaker 2: So let's say the problem that we are trying to solve is this mileage-per-gallon problem, where we saw this data set. I have the mileage per gallon of a variety of cars. And I have, let's say, only 1 predictor, the weight of those cars. And I want to find the relationship between weight and mileage per gallon. So essentially, I'm trying to write the mileage per gallon as beta 0 plus beta 1 times weight, where beta 0 and beta 1 are coefficients to be determined in such a way that the difference between the predicted value of mileage per gallon and the actual value of mileage per gallon is as small as possible. 21:03 - 21:30 Speaker 2: And so what I'm going to do is this. I want you to imagine the process that we are going to be doing. The intent here, this is what linear regression does, right? I'm trying to find the set of weights, these coefficients. Y is equal to mx plus c. Exactly. This is y equal to mx plus c. Except, of course, if I write m and c, I won't appear as smart as if I had written beta 0 and beta 1. 21:30 - 22:04 Speaker 2: So I'm going to choose beta naught and beta 1 instead. Choosing Greek letters always makes you appear smarter to your peers. Now, the way I'm going to determine the correct values of beta naught and beta 1 is this. I'm going to start writing random values of beta naught and beta 1. Okay, I'm going to start writing different combinations. 22:04 - 22:42 Speaker 2: So let's start with 0 and 0, then 0.1 and 0, and 0 and 0.1, 0.1 and 0.1, 0.2 and 0.1, and a variety of these numbers; I'll try different ones. Okay, now let me try the first one. Okay, I substitute this beta naught and beta 1 over here, and what do I find? When beta naught is 0 and beta 1 is 0, no matter what the weight is, the predicted values of mileage per gallon will all be zeros. Everyone agrees? Predicted value.
22:42 - 23:12 Speaker 2: So for all of these cars, the predicted values of mileage per gallon will all be zeros. Then what I do is I try to find the difference between the actual value and the predicted value. That's my error. Actual value and predicted value; the difference between those is my error. Now, do I compute the error for 1 car, or do I compute it for all the cars? 23:12 - 23:27 Speaker 2: Well, it's possible that for some cars, you might coincidentally be right. So let's compute the error for all of these cars, for all of these cars from 1 to 32. There are 32 cars in my data set. So for all 32 cars, I'm going to compute this. 23:28 - 23:52 Speaker 2: But then, see, this number is a signed number, right? Mileage per gallon minus the predicted value of mileage per gallon. Really, I want this number to be as small as possible. And it's considered an error regardless of whether it's on the positive or the negative side, right? If I over-predict the mileage per gallon or under-predict the mileage per gallon, it's an error, right? 23:52 - 24:26 Speaker 2: So what I'm going to do, to account for the fact that both over-prediction and under-prediction are wrong, is start measuring the squared errors. When you square a number, the number automatically becomes positive. This is what we call the sum of squared errors, or SSE. So what I'm going to do is, for this value of beta naught and beta 1, I'm going to plot here the sum of squared errors. 24:28 - 24:52 Speaker 2: And the sum of squared errors for this value of beta naught and beta 1 might be some number; let's say it's 33.1. That's the value of the error for this value of beta naught and beta 1. Then I'm going to try this value of beta naught and beta 1, 0.1 and 0, right? 0.1 and 0. And I'm going to compute the predicted value of mileage per gallon again.
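The sum of squared errors described here can be written out directly. The mileage numbers below are made up for illustration (the lecture's data set has 32 cars; only 4 are used here), and the predictions are the all-zero ones you get from beta naught = 0, beta 1 = 0:

```python
# Sum of squared errors: squaring makes over-prediction and
# under-prediction both contribute positively to the total.
actual    = [21.0, 22.8, 18.7, 14.3]   # hypothetical mileage-per-gallon values
predicted = [0.0, 0.0, 0.0, 0.0]       # beta0 = 0, beta1 = 0 predicts all zeros

sse = sum((y - yhat) ** 2 for y, yhat in zip(actual, predicted))
print(sse)   # one nonnegative number summarizing the error over all cars
```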
From the predicted value of mileage per gallon, I'm going to compute the sum of squared errors. 24:52 - 25:12 Speaker 2: And this will be some other number, let's say 37. And this might be some other number, let's say 32. And I'm going to write, for each one of these combinations, the value of the sum of squared errors. Okay. Now, what I can do is actually plot this. 25:12 - 25:43 Speaker 2: This is beta naught, this is beta 1, and that axis is the sum of squared errors. When I plot it, what will end up happening is I will get a surface, a 3-dimensional surface, that will look like this. The place where the surface has the lowest error gives the correct values of beta naught and beta 1. Is that clear? This is the value of beta naught and beta 1. 25:43 - 26:04 Speaker 2: This is the correct one, which minimizes the sum of squared errors. This is how we determine the values of the parameters. Is that clear? Right, I haven't told you yet how we are going to do this. I'm saying: suppose we try out all of these combinations and make a plot; from the plot I can determine this. 26:06 - 26:16 Speaker 2: Is that clear? A thumbs up would help. Yeah. There's no thumbs up in this. It's clear. 26:16 - 26:37 Speaker 2: Yes. Okay. Now, this is the idea of trying to find weights for any problem. For any problem, we are trying to find the values of the parameters that will minimize the error between the predicted value and the actual value. It is the sum of squared errors we are trying to minimize. 26:38 - 26:54 Speaker 2: Right? This is what we are doing. Now, the thing is, you know that this is going to be a painful task. Right? Trying to list all the values of beta naught and beta 1 and trying to draw the surface is an extremely painful task. 26:55 - 27:08 Speaker 2: Can we do something smarter? Right? Tabulating every combination of beta naught and beta 1 is going to be extremely painful. Can we do something smarter? And yes, we can do something smarter.
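The tabulate-every-combination procedure described above can be sketched as a brute-force grid search over the SSE surface. The data and the grid below are made up for illustration (the four points lie exactly on mpg = 51 − 10 × weight, so the grid minimum lands there):

```python
# Brute-force tabulation of the SSE surface, as described in the lecture.
# Tiny made-up data: car weight (in 1000 lbs) vs mileage per gallon.
weights = [2.5, 3.0, 3.5, 4.0]
mpg     = [26.0, 21.0, 16.0, 11.0]   # hypothetical: exactly mpg = 51 - 10*w

def sse(b0, b1):
    # Sum of squared errors for one (beta0, beta1) combination.
    return sum((y - (b0 + b1 * w)) ** 2 for w, y in zip(weights, mpg))

# Try a grid of (beta0, beta1) combinations and keep the lowest SSE.
best = None
for b0 in [i * 0.5 for i in range(0, 120)]:       # beta0 in 0.0 .. 59.5
    for b1 in [i * 0.5 for i in range(-30, 1)]:   # beta1 in -15.0 .. 0.0
        e = sse(b0, b1)
        if best is None or e < best[0]:
            best = (e, b0, b1)

print(best)   # lowest SSE and the (beta0, beta1) that achieved it
```

Even this toy grid needs 3720 SSE evaluations for two parameters, which is exactly the "extremely painful" scaling the lecture is pointing at.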
27:09 - 27:32 Speaker 2: And that's where the method of gradient descent comes in. Right? Ultimately, this error surface, the sum-of-squared-errors surface, is going to be some complicated-looking surface. Now, I have drawn it in only 2 dimensions right now, right? 27:32 - 28:04 Speaker 2: And it will be 2 dimensions when there is only 1 predictor. Let's say there are 2 predictors; then I would have one more term like this, beta 1 weight plus beta 2 horsepower, which means now you're searching for the best possible combination in a 3-dimensional space. If I have 2 predictors, I'm searching in a 3-dimensional space for the best combination, because the intercept is also a value, right? You have 2 slope coefficients and an intercept: 3 values you need to find. 28:05 - 28:22 Speaker 2: If I have 10 predictors, how many parameters am I searching for? In what dimension of space am I searching? 11. An 11-dimensional space. And can you imagine the number of combinations I'll have to go through? 28:22 - 28:48 Speaker 2: Very complicated. Very complicated and extremely computationally intensive. And that is why the method of gradient descent is employed. What the method of gradient descent does, philosophically speaking, is treat this problem as: I have some complicated surface and I'm trying to find the bottom of the surface. 28:50 - 28:50 Speaker 1: I have some complicated surface and I'm trying to find the bottom of the surface. 28:50 - 29:16 Speaker 2: Right. I have some complicated surface and I'm trying to find the bottom point of that surface. And that problem is essentially equivalent to: you are standing on the top of a hill and you're trying to find the quickest way to come to the bottom of the hill. Right? You're standing on the top of a hill and you want to find the quickest way to approach the bottom of the hill, which might be somewhere over there.
29:16 - 29:59 Speaker 2: Now, the quickest way to get there is: no matter which point you're standing on on the hill, you find out which path leads you down the steepest descent over the surface, and you take 1 step in that direction. Then, over there, you again find out, from this point, which direction gives the steepest descent downwards, and you take 1 step in that direction, and you keep doing that. That way, you might reach that particular point fast. Right: keep walking down this hill in the direction of steepest descent, and that is the quickest way to get to the bottom. That is our claim. 30:00 - 30:40 Speaker 2: Okay, it might not be the safest way, of course; when you're actually trying to climb down a hill, you don't necessarily want to walk down the steepest side. But of course, we are talking about a numerical algorithm trying to find the best way to reach the bottom of the surface. So essentially what I'm saying is, our strategy is going to be: you randomly pick any point on the surface as your starting point. From that point, you identify the steepest path down the surface, and you keep walking, 1 step at a time, down the steepest part of the surface, and you will eventually reach the bottom. That is the hypothesis behind this method. 30:40 - 31:00 Speaker 2: This method is called the method of gradient descent. Right? This has been covered for you. I'm hoping you all are reasonably okay with that level of understanding right now, in order for you to follow this method. 31:01 - 31:07 Speaker 7: I mean, for people attending the first class, Professor, we didn't cover this before. 31:10 - 31:28 Speaker 3: I think we just have the concept, sir, but we don't know the calculation of how we take the steepest descent. We just know the concept: that we have to take the steep way down and then try to find a local minima and then a global minima.
I think that is what we have understood, but not the actual calculations. 31:29 - 31:38 Speaker 2: Not the actual calculations. OK, which is fine. Yeah, Srikant, go ahead. 31:39 - 31:45 Speaker 7: And also, this is the first class for a few people, so we didn't hear this before now. 31:46 - 32:01 Speaker 2: Okay. Okay, let's see how many folks are there right now. 29 people are there. Okay. Shall we take a quick vote? 32:01 - 32:08 Speaker 2: Can you raise your hand if you'd like some more details on the method of gradient descent? 32:09 - 32:23 Speaker 1: Doctor, may I suggest maybe going with the 2-dimensional one, with the cost function? Yeah, the cost function; maybe it will help because it's simple. And that's new. 32:23 - 32:55 Speaker 2: Professor, I think there are some newbies who have never attended, so for their benefit we can cover it, and everyone can benefit. Okay, so Tarek, we will touch upon the cost function; I will talk about that a little bit. Okay, I do want you to have a sense for what gradient descent does mathematically and why the gradient is there. That, I do want everyone to have a sense for. 32:56 - 33:19 Speaker 2: And so, let's see. Okay, I see 3 people raising their hands. Can we all, whoever wants to, can we raise? If I get at least 10 folks, I will go ahead and cover it now. Otherwise, we'll save it for some other time, right? I have, okay, 14. That's almost 50% of the class. 33:19 - 33:45 Speaker 2: Okay, wonderful. Let's cover this. Okay, so this is going to be a very, very, very quick exposure to the idea of gradient descent. Okay, so let's start talking about that. Okay, you can put your hands down, unless of course you have a question, in which case please go ahead and unmute yourself and ask. 33:46 - 34:17 Speaker 2: Okay, good. So here is the idea. Let's start with a basic idea of calculus that was taught to us in high school. I have a function, some particular complicated function, right?
And I want to find out: what is the minima of the function? 34:19 - 34:42 Speaker 2: Okay, now, do we remember the difference between these 2 terms, minimum and minima? Do we remember the difference? No, sir. Okay, so let's consider this function: y equal to x squared plus 1. 34:42 - 35:17 Speaker 2: Okay, this is a particular function that looks like this. Okay, and at x equal to 0; x equal to 0, 1, 2, 3, minus 1, minus 2, minus 3. So at x equal to 0, the function value is 1. Everyone agrees this graph represents that particular function. Right? So, if I ask what is the minimum value of the function, the minimum value of the function is y equal to 1. Okay. 35:17 - 35:33 Speaker 2: When I ask what is the minima of the function, I'm asking what is the x value at which the function is minimal. So x equal to 0 is the answer. When somebody asks you, what is the minimum? You would say y equal to 1. 35:33 - 35:51 Speaker 2: Is that clear? Minimum versus minima. Now, what is it that we are looking for here? Over here, I'm trying to find the beta 0 and beta 1 that minimize the sum of squared errors. 35:52 - 35:56 Speaker 2: So am I seeking the minimum, or am I seeking the minima? 35:57 - 35:58 Speaker 5: Minima. 35:58 - 36:35 Speaker 2: I'm looking for the minima. So I'm looking for the values which result in the lowest sum of squared errors. Right, this is the problem: we are trying to find the minima of a function. Now, for any function, one way to find the minima is to sort of list out, compute the value at all of these points, find out the minimum value of the function, and then read the x value from that. That's one way of doing it. But it's an extremely painful way of doing it, right? There's a better way of doing it that calculus has taught us. 36:37 - 37:18 Speaker 2: What is that method? The thing is, we noticed that; sorry, go ahead, someone is saying something.
Right, so what we did is we noticed that the algorithm that we were taught was this: if you want to find the minima of any function f of x, what you should do is find the derivative of the function, df by dx, set that equal to 0, and solve for its root. And then the x value which results in df by dx equal to 0 is your minima. That is what they taught us in high school calculus, right? 37:18 - 37:59 Speaker 2: The reason why this method works is the understanding that the derivative of a function gives you the slope of the graph. So if I find the derivative of the function, it tells me the slope at this particular point. Now, I know that at the minimum point, my slope is 0. So if I look for places where the slope is 0, then I have minimized my search space: instead of looking at every single point, I can hunt only on those points where the slope is 0. 37:59 - 38:24 Speaker 2: For example, in the graph that I drew, here is one place where the slope is 0. Here is another place where the slope is 0. Here is another place where the slope is 0, here and here. Now, all I need to do is look at these 5 places where the slope is 0, find out whichever is the lowest one, and I've found the minima of the function. So using calculus allows us to reduce our searching time. 38:24 - 39:04 Speaker 2: We don't need to look at every single point. We can quickly find the places which are likely to give the minimum. Even this method that was taught to us in high school, however, is in reality not very easily implementable. It's not very easily implementable because: imagine this particular function, f of x equal to x to the power 4 plus 3 x squared plus 2 x plus 7. Okay, here is this function. 39:04 - 39:07 Speaker 2: Now, what is the derivative of this function?
39:08 - 39:10 Speaker 3: 4x cubed plus 6x plus 2. 39:10 - 39:12 Speaker 2: 4x cubed plus 6x 39:12 - 39:13 Speaker 3: plus 2. 39:13 - 39:15 Speaker 2: Where is this 0? 39:18 - 39:24 Speaker 5: 2. What is the question? Sorry, come again. 39:24 - 39:42 Speaker 2: Where will this function be equal to 0? What are the values of x at which this will be 0? That is a cubic equation. This is a cubic equation. So we need to find out where this is 0. 39:42 - 40:00 Speaker 2: This is a cubic equation. We don't know the formula for a cubic equation. Even though theoretically I know that I need to find the place where this function is going to be 0, I don't know the formula for doing it. Now, the shame of it is this. 40:00 - 40:17 Speaker 2: We all thought in high school that finding zeros of a function is easy. We remember the quadratic formula, right? Most of us, you know, if you wake up in the middle of the night and are asked the quadratic formula, you will be able to rattle it out. Minus b plus or minus square root of b squared minus 4ac, by 2a, right? That's what we all know. 40:17 - 40:27 Speaker 2: But you know what? Quadratic, that's all we can do. Cubic, we don't know how to do. Forget about fourth-order polynomials or fifth-order polynomials or any of them. We don't know how to find zeros. 40:28 - 41:05 Speaker 2: So this is why this method that was taught to us in calculus is not really good enough yet. So what I'm going to do is try to find an alternate method. This alternate method is this. Now, this derivative that we learned in high school calculus, and that we were forced to fall in love with, is extremely useful. Except that when you start dealing with functions of 2 variables, there is a new complication that it introduces, right? 41:05 - 41:30 Speaker 2: Yeah, the generalization of the derivative, they call it the gradient, right? And they write it with this complicated formula.
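The professor's point is that we have no memorized formula for the roots of this cubic, but that does not stop a numerical method. A sketch using simple bisection on the derivative 4x³ + 6x + 2 (bracket and iteration count are arbitrary illustrative choices):

```python
# f'(x) = 4x^3 + 6x + 2 for f(x) = x^4 + 3x^2 + 2x + 7.
# No memorized root formula for the cubic -- but bisection finds the
# real root numerically, because f' changes sign across it.
def fprime(x):
    return 4 * x**3 + 6 * x + 2

lo, hi = -1.0, 0.0        # fprime(-1) = -8 < 0 and fprime(0) = 2 > 0
for _ in range(60):       # repeatedly halve the bracket around the sign change
    mid = (lo + hi) / 2
    if fprime(mid) < 0:
        lo = mid
    else:
        hi = mid

root = (lo + hi) / 2
print(root)               # the x where the slope of f is zero
```

This is the same story in miniature: when closed-form formulas run out, iterative numerical search takes over, which is exactly where gradient descent is headed.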
What does this creature actually do? This is the partial derivative of f with respect to x, and the partial derivative of f with respect to y; anything else? 41:36 - 41:38 Speaker 7: Okay. Partial, by the 41:38 - 41:54 Speaker 2: i hat, plus j hat. This gradient is actually a vector. It is not a numerical value. Right? When we talked about derivatives, the derivative was a number, a numerical value. 41:54 - 42:13 Speaker 2: But when you start to talk about 2 dimensions, suddenly you have a vector. Right? Now, what additional complication is this? Why is the gradient a vector, and what in the world does it mean? To understand that, let's go ahead and generalize this particular function I talked about. 42:14 - 42:34 Speaker 2: I talked about f, a function of 2 variables, f of x, y. Let's go ahead and pick the function that we already discussed: x squared plus 1. You might ask me, what happened to the y? I said, there is no y dependence. Okay, so what does this graph look like? 42:34 - 43:03 Speaker 2: So here is the function value. On this axis is x, and on this axis is y. So essentially I'm saying that as x changes, this function looks like a parabola, right? But for every value of y, it is the same function, right? So essentially what you have is like a folded paper in the shape of a parabola. 43:03 - 43:19 Speaker 2: I know my drawing is terrible. Like this. Is that clear? A folded paper, right? For all values of y it is the same parabola. 43:20 - 43:45 Speaker 2: Okay, now let's consider this particular function and think about what the gradient means. Okay, what is the gradient? I wrote the formula up here, right? The gradient is the partial derivative with respect to x, i hat, plus the partial derivative with respect to y, j hat, right? And so the gradient of this particular function is equal to what? 43:45 - 44:02 Speaker 2: Can someone tell me? First differentiate with respect to x. When I differentiate with respect to x, what do I get?
2x. 2x i hat. And then there is no derivative with respect to y, right? 44:02 - 44:05 Speaker 2: So the gradient of this function is just 2x i hat. 44:06 - 44:06 Speaker 1: Yes. 44:06 - 44:14 Speaker 2: Everyone okay with me? Yeah. Yeah. Anyone having nightmares about math and calculus yet? 44:16 - 44:17 Speaker 3: No. Yeah, 44:17 - 44:19 Speaker 2: most of us. Yeah, most of us. 44:19 - 44:21 Speaker 6: We are already inside the nightmare, Professor. 44:22 - 44:35 Speaker 2: We are already dumbfounded. Yeah. But I assure you, clarity will come in a second. Right. This complicated picture, since I can't draw it, 44:35 - 44:46 Speaker 2: I'm going to talk about it as if I'm viewing it from here. Okay, these are my eyes. I'm viewing it in this direction. So essentially I'm drawing the projection of that. Okay, this is my x axis. 44:46 - 44:50 Speaker 2: This is my function. Is that fine? Everyone okay with the diagram? 44:51 - 44:51 Speaker 3: Yes. 44:51 - 45:08 Speaker 2: Yeah, so now look at this, right? For this function, let's just start writing the values. 0, 1, 2, 3, and on this side minus 1, minus 2, minus 3, okay? Now, what is the value of the function? Let's start with 0. 45:08 - 45:17 Speaker 2: What is the value of this function? This is the function we are talking about: x squared plus 1. What is the value at 0? 1. The value at 1? 45:18 - 45:19 Speaker 7: 2. 45:19 - 45:26 Speaker 2: 2, right? This is 2. The value at 2? 4. 4 plus 1, 5. 45:26 - 45:34 Speaker 2: 5, yeah. Okay? And similarly, minus 1. The value at minus 1? It's also 2. 45:34 - 45:45 Speaker 2: It's 2. Minus 1 squared plus 1. And then at minus 2 it is also 5. It's symmetric. Okay. Now, let's write the gradient. 45:46 - 45:50 Speaker 2: What is the gradient? The gradient was 2x i hat. 45:50 - 45:50 Speaker 5: Yeah. 45:51 - 46:01 Speaker 2: Okay, good. Now let's think about this. Let's compute the value of the gradient at x equal to 1.
What is the value of the gradient 46:01 - 46:02 Speaker 1: at X equal 46:04 - 46:05 Speaker 5: to 1? 2 I hat. 46:06 - 46:09 Speaker 2: 2 I hat. Exactly. What is I hat? What does that mean? 46:10 - 46:12 Speaker 5: It is along the X. 46:12 - 46:16 Speaker 2: It's a vector that is pointing in 46:16 - 46:17 Speaker 1: this direction. 46:17 - 46:20 Speaker 2: Now, what is the value at minus 46:24 - 46:25 Speaker 5: 1? Minus 46:26 - 46:42 Speaker 2: 2. So, it's a vector that's pointing in this direction. Now, what does that mean? Basically, this gradient is a vector that is lying on this surface, on the bottom surface. It's lying on this bottom surface. 46:42 - 47:05 Speaker 2: It's saying that if you walk in the direction of the gradient, you will see the fastest increase in the function. You see which direction the gradient is pointing. See, it's a surface, right? I'm here, I can walk along this direction, this direction, this direction, this direction. I can walk in any of these directions. 47:06 - 47:24 Speaker 2: The gradient is telling me: if you walk in the direction pointed by the gradient, you will see the maximum increase in the function. Is that clear? Yes, Professor. Now, what is our goal? Our goal was to come to the bottom. 47:25 - 47:26 Speaker 2: So, what should I do 47:26 - 47:28 Speaker 7: to go to the other 47:28 - 47:31 Speaker 2: direction? Ah, I should walk in the negative direction of the 47:32 - 47:32 Speaker 1: gradient. Yes. 47:34 - 48:05 Speaker 2: And that is the gradient descent algorithm. The gradient descent algorithm is basically: you are walking in the negative direction of the gradient. So here is what you do. You start off. Let's say there are 5 weights that we need to determine: beta naught, beta 1, beta 2, beta 3, and beta 4. 48:05 - 48:21 Speaker 2: A set of 5 weights that you need to determine. Right?
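The worked example above can be checked numerically. A minimal sketch (the function name `gradient` and the step `h` are illustrative choices, not from the session): central finite differences should reproduce the analytic answer, 2X along I hat and nothing along J hat.

```python
# Check numerically that the gradient of f(x, y) = x^2 + 1 is (2x, 0).
# Central finite differences with a small step h approximate each
# partial derivative; h = 1e-6 is an illustrative choice.

def f(x, y):
    return x ** 2 + 1  # no y dependence, as discussed

def gradient(x, y, h=1e-6):
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)  # partial w.r.t. x
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)  # partial w.r.t. y
    return dfdx, dfdy

print(gradient(1.0, 3.0))   # close to (2, 0): points along +x
print(gradient(-1.0, 3.0))  # close to (-2, 0): points along -x
```

At x = 1 the vector points toward +x, at x = -1 toward -x: in both cases toward increasing function value, exactly as drawn on the folded-paper surface.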
What they're saying is, it's essentially a 5-dimensional surface. Now I want to find the bottom of this 5-dimensional surface. How in the world am I going to find that? 48:21 - 48:42 Speaker 2: I can't visualize it, right? And here is the beauty of the gradient descent algorithm. The gradient descent algorithm tells you that you start off with some random values: minus 0.1, 0.2, square root of 3, minus 77 and 35. Just pick 5 values randomly. Okay? 48:42 - 49:05 Speaker 2: And for those values, find out what is the value of my sum of squared errors. This is where the value of my surface is, right? This is my complicated surface. Of course, I'm drawing a 2-dimensional surface, but it's actually a 5-dimensional surface. I'm finding out what is the value of that surface at this particular point. 49:07 - 49:25 Speaker 2: And what I need to do is compute the gradient at that point. The gradient at that point will tell me in what direction I should start walking in order to see the maximum increase in the function. But I want the opposite, right? I want to walk in the other direction. So I'm going to put a minus sign. 49:26 - 49:47 Speaker 2: So that I'm walking in the opposite direction of the gradient. And I take a small step in that direction. And this number alpha that you see, this is my step size. How big a step I'm going to take in the direction opposite to the gradient. So this is my random set of 5 values that I chose. 49:48 - 50:16 Speaker 2: And I take that and correct it by a small step of size alpha in the direction of the negative of the gradient. And this will give me the next position. And I keep repeating it over again. Each time I compute the gradient and walk in the opposite direction of the gradient with a small step size alpha. This number alpha, the step size that I'm taking, they call it the learning rate. 50:19 - 50:32 Speaker 2: Right? What happens if I, let's go back to this picture.
What happens if I take a large step size? I don't have a picture for it. What happens? 50:32 - 50:36 Speaker 2: What is the disadvantage of taking a small step size versus a large step size? 50:37 - 50:39 Speaker 1: Large step size, you have less computation. 50:42 - 51:00 Speaker 2: Too small a step size, what will happen is you're taking way too long to reach the bottom. But with a larger step size, there is another problem. You might overshoot. You might miss the bottom. So trying to find the learning rate is actually a skill. 51:01 - 51:11 Speaker 2: Or in other words, it's another hyperparameter that you can control to adjust the learning. This is the method of gradient 51:15 - 51:21 Speaker 5: descent. And delta E by delta W, which is basically the gradient, is mathematically calculated, right? 51:22 - 51:40 Speaker 2: This one is mathematically calculated because we know how to compute the derivative. All of you took calculus a decade ago at least, and still all of you remember what the derivative of x squared or x cubed is. 51:41 - 51:43 Speaker 5: And this derivative is easy. 51:44 - 51:50 Speaker 2: Finding the root of the derivative is not all that easy. So we don't use that method. Instead we use the method of gradient descent. 51:50 - 51:55 Speaker 5: And this will be a vector across all the betas, right? Like 51:56 - 52:13 Speaker 2: this one will compute for all the betas. It'll keep correcting all the values of the betas. You started off with random values for all of them, right? Beta naught, beta 1, beta 2, all of them. It will adjust each one by the right amount. 52:13 - 52:17 Speaker 2: Some of them it'll increase, some of them it'll decrease, depending on what the gradient is. 52:17 - 52:21 Speaker 6: Yeah. So basically it will have I hat, J hat, K hat, et cetera. 52:21 - 52:38 Speaker 2: Right? Correct. Correct.
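The update rule just described, walking opposite the gradient by a step of size alpha, can be sketched in a few lines on the same x squared plus 1 function. The starting point and the two alpha values below are illustrative choices, picked to show the step-size tradeoff.

```python
# Gradient descent on f(x) = x^2 + 1, whose gradient is 2x.
# Each step walks in the direction opposite to the gradient,
# scaled by the learning rate alpha.

def gradient_descent(x0, alpha, steps):
    x = x0
    for _ in range(steps):
        grad = 2 * x          # gradient of x^2 + 1 at the current point
        x = x - alpha * grad  # step against the gradient
    return x

# A small alpha creeps toward the minimum at x = 0, taking many steps.
print(gradient_descent(5.0, 0.1, 100))   # very close to 0
# Too large an alpha overshoots the bottom on every step and diverges.
print(gradient_descent(5.0, 1.1, 20))    # bounces further and further away
```

The same loop, with the scalar x replaced by the vector of betas, is what runs underneath when there are 5 (or 784) weights instead of one.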
Now, I hat, J hat, K hat is something that we used to use in our undergraduate days, when we were imagining only about 3 dimensions, right? But in a 5-dimensional space, we can't use that notation. 52:38 - 53:07 Speaker 2: So we just call it a vector in a 5-dimensional space, right? So essentially, this is what I'm saying: this gradient, I wrote it as partial derivative with respect to X, I hat, plus partial derivative with respect to Y, J hat. This is the way I wrote it, right? Now it is written in this vectorial notation, like this. This is a vectorial notation, right? 53:07 - 53:26 Speaker 2: This way I can scale it. Beyond K hat, you don't know what symbol to use, right? But this way we can write it for any dimension you want. So that's the notation I'm using here. Did I confuse you more than explain? 53:27 - 53:33 Speaker 5: It's clear, but mathematically, how would the algorithm compute delta F by delta X? 53:33 - 54:34 Speaker 2: Right, so that is a very good question. It's easy to compute derivatives mathematically. However, how to do it computationally, in an algorithm, in a way that's reproducible, that was done by this gentleman called Geoffrey Hinton. He proposed an algorithm called backpropagation. Now, this name, Geoffrey Hinton, you will keep hearing over and over and over again, all through this course, right? This gentleman is often considered the father of deep learning. The number of innovations he has done, the amount of contributions he has made to the field of neural networks, is just unbelievable. 54:36 - 55:18 Speaker 2: I'm a physicist by degree, and we consider Einstein as, like, the God. And Geoffrey Hinton has that same place in the area of neural networks. I'm sure you'll hear about Geoffrey Hinton more when we start talking about LLMs as well, and his objections and so on, even though he was one of the big contributors to the system. Not objections to LLMs, but his fear about general AI, AGI. Professor?
55:18 - 55:18 Speaker 2: Yes. 55:18 - 55:28 Speaker 6: I have to interrupt. Being honest here, this is a complete nightmare. If this is going to be on an exam, right, I'm really, really afraid going forward. If there are any business problems you can apply it to. 55:29 - 55:34 Speaker 2: Don't worry, the exam is just going to have second order differential equations, which you have to solve by hand. That's all. 55:34 - 55:46 Speaker 6: So, I mean, yeah, so the reason, one is the exam, another part is understanding, right? I need to apply this in my business. At the end of the day, I want to apply it in business. So if you can give any business example, which is- We are coming. 55:46 - 56:02 Speaker 2: We are coming there. All I'm saying is you don't need this formula at all. Somebody has coded it up. Somebody has coded it up. All you are going to do, remember we saw the set of steps yesterday, Sequential. 56:04 - 56:16 Speaker 2: And then add to the model. So model.add, Dense, and so on and so forth. We define a neural network. You remember that from yesterday? Yes, professor. 56:16 - 56:26 Speaker 2: This process will be inside a function. The function will be called fit, and internally it will call this and find out all of these values. 56:26 - 56:40 Speaker 6: Yeah, I understand, professor, what you are teaching for the last 1 hour, right. It is really, really useful. I mean, you are telling the crux of the problem. I could understand the crux of the problem. I mean, I'm also being selfish here. 56:40 - 56:52 Speaker 6: I also want to know the crux of the problem, because I saw my data scientist friends are able to catch it very quickly. If there is any English kind of statement where I can understand it very simply, maybe I can 56:52 - 57:20 Speaker 2: absolutely, absolutely. So I refuse to categorize people like that, even though this is supposedly a no-code program, right? I believe knowing the code will help you, right?
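A sketch of what a fit call does underneath, in the spirit described above: loop over the data, compute the gradient signal of the error, and nudge each weight against it. The toy dataset (the AND function, which a single sigmoid neuron can learn) and all names here are illustrative, not Keras internals.

```python
import math

# Toy data: the AND function. One sigmoid neuron can learn this.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def fit(epochs=3000, alpha=0.5):
    w1, w2, b = 0.1, -0.1, 0.0           # starting weights
    for _ in range(epochs):
        for (x1, x2), y in data:
            p = sigmoid(w1 * x1 + w2 * x2 + b)
            err = p - y                  # gradient signal at the output
            w1 -= alpha * err * x1       # walk against the gradient
            w2 -= alpha * err * x2
            b -= alpha * err
    return w1, w2, b

w1, w2, b = fit()
preds = [round(sigmoid(w1 * x1 + w2 * x2 + b)) for (x1, x2), _ in data]
print(preds)  # [0, 0, 0, 1] -- the AND function has been learned
```

In a real framework, backpropagation computes the `err`-style gradient signal for every layer at once; this single-neuron loop is only the essence of the idea.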
So which is why I'm actually walking through code, and I'm gonna show you more code as well, right? Same thing with math, right? Right now, yes, I completely understand it looks bizarre. 57:20 - 57:34 Speaker 2: It looks complicated and so on. Absolutely agree. But you know what? The second time you see it, you will feel much better. And I'm going to send you the resources that will help you. 57:35 - 57:45 Speaker 2: This is a video. Professor, Shetty is talking about the exam. Is that right, Shetty? 57:46 - 57:56 Speaker 6: No, both. See, one is the exam, second is that I could see a lot of our friends are able to get it on the spot. Right. I'm really like, man, how will I understand? Right. 57:56 - 57:56 Speaker 1: So no, 57:56 - 58:06 Speaker 2: no. So yeah. I completely understand what you're saying. There are a lot of people in a state of mind very similar to Shakti's. 58:06 - 58:27 Speaker 2: I completely agree on that, Professor. For non-mathematical people like me, I'm quite struggling for the last 1 hour. And it's quite late for me here. At 1 in the morning, it's quite difficult to comprehend these mathematical nuances. So if you could add a little bit more business context on where this would help, please. 58:27 - 58:43 Speaker 2: We are getting to the business in just 3 slides. Just 3 slides. We are getting there. But I'm going to recommend to you this particular channel. You see this: 3Blue1Brown. 58:45 - 59:06 Speaker 2: Can you add it in the chat window here? The link here. Yeah, sure. This is one of the best math teachers I wish I had. Let me just find out. 59:06 - 59:36 Speaker 2: How do I get to the chat window? Public, there. This one, essentially the same neural network that we've been talking about in the past 2 and a half classes now, he walks through that in very, very basic steps. Now you will understand this because now you're going to be seeing it the second time. Right.
59:38 - 59:59 Speaker 2: For this entire thing, you will find it to be very, very helpful. And he also talks about gradient descent. The first one is, what is a neural network? And the second video is gradient descent. I would strongly encourage you to watch both of these videos, because it has animations, you will see. 01:00:00 - 01:00:02 Speaker 1: Definitely understand it better. 01:00:02 - 01:00:03 Speaker 2: Thanks, Professor. 01:00:04 - 01:00:05 Speaker 1: Yeah. Thank 01:00:05 - 01:00:18 Speaker 3: you. Yeah. Yeah. Another thing is, if you look at the video of today's class, when the professor explained the equation, right, you can understand the English wording of that equation. 01:00:18 - 01:00:26 Speaker 3: So if you just go through that portion again, you will easily understand it once again, right? Yeah. 01:00:27 - 01:00:29 Speaker 1: That's helpful. Yeah. Thank you. That was very helpful. Yes. 01:00:30 - 01:00:56 Speaker 1: But I'm saying, do not worry about the exam point of view. Right now, most of us have this block about mathematics. I believe all of us are capable of understanding, and really, the level that's needed to grasp neural networks is really not that complicated. When you go back and see it the second time, you definitely will understand it better. Yeah. 01:00:56 - 01:01:15 Speaker 1: Yes. I've been teaching this for a while now, you know, 8 years. And every time, this is very similar to the reaction that I've had. But I know that they all come through this perfectly fine. So will you. 01:01:15 - 01:01:19 Speaker 1: Then maybe you can separate it. One is for the interview, one is for the exam. 01:01:19 - 01:01:21 Speaker 4: Exam, please don't put any of these. 01:01:22 - 01:01:29 Speaker 1: Yeah. Yeah. No, it's not like this is the level for the exams.
01:01:29 - 01:01:55 Speaker 2: So, professor, I mean, the problem statement is not just the exam. While exams are there, we still want to understand the crux, right? So I also want to get to know how the gradient works, and that visualization is not coming. See, yesterday when you spoke about the biological neuron and you translated it to the mathematical and electrical one, we were able to, lightning fast, we were able to understand, remember the code, functions, everything. This visualization itself, it's a nightmare. How do I visualize this? 01:01:55 - 01:01:59 Speaker 2: And it's really, I'll wait for it. Once you start, that will be helpful. 01:02:01 - 01:02:25 Speaker 1: You will understand the visualization the moment you start watching the video. All it's doing is this: you have some surface. I want to try to reach the bottom of that surface. Here are my coordinates, X and Y coordinates, and this is the Z. I want to find the place where my surface has the lowest value. 01:02:26 - 01:02:47 Speaker 1: These are the points, beta naught, beta 1. This is the place where my surface has the lowest set of values, beta naught, beta 1, right? This point I want to identify, right? The question is, how do I identify that point, right, when I can't see the surface, when I'm not able to plot the surface? 2D surfaces you can plot, but if I have a 5D surface, how do I plot it? 01:02:47 - 01:03:09 Speaker 1: So we arrive at a method which does not require visualization. What does the method do? The method suggests that you find an arbitrary point on the surface. Here is this arbitrary starting point, right? So the starting point might be, this is 1 and this might be 7. 01:03:09 - 01:03:32 Speaker 1: So 1 comma 7 is the starting point on my surface. What am I going to do at this point? The surface has some value here. I am going to evaluate the gradient at that point. What does the gradient do?
The gradient gives me a vector direction in which to walk on the surface to see the maximum increase in the surface value. 01:03:33 - 01:03:52 Speaker 1: That's what a gradient does. The gradient points in the direction in which I need to walk on this flat space to see the maximum increase in the surface. The gradient descent method says that you walk in the opposite direction for a distance of alpha. Now I have reached a new point. This new point, what is it? This was 1. 01:03:52 - 01:04:18 Speaker 1: So this new point is maybe 1.5, and we have walked on this side, so maybe from 7 it's lower, 6.5. So 1.5 comma 6.5 is the new point. Then again I evaluate the gradient. Again the gradient will be pointing in the direction of increasing surface. So I walk in the opposite direction. And each time I continue walking, 1 step at a time on the surface, and I'll eventually reach here. 01:04:18 - 01:04:29 Speaker 1: That is the core idea of gradient descent. I'm always walking down on this flat surface. I'm trying to understand which direction I should walk to reach the minimum point. 01:04:31 - 01:04:51 Speaker 5: Sir, if I may share, we actually apply this algorithm as human beings. Anytime we are walking on a hill, a two-dimensional space, if we want to go to the bottom of the hill, we'll see which way is the steepest, and we'll go the opposite way. So this is the algorithm we apply as human beings any time we have to come to the bottom of a 01:04:51 - 01:05:06 Speaker 1: hill. Absolutely, absolutely, right. Yeah. But as humans, we will also look at the safety part. That's the only subtle point: safety is what we look at, but this is a mathematical space. 01:05:06 - 01:05:25 Speaker 1: So the quickest way to reach the bottom will always be the direction of steepest descent. That's the idea of this algorithm. Okay. This is called the gradient descent algorithm.
This is the way all learning happens in machine learning methods. 01:05:26 - 01:05:44 Speaker 1: All of them. You design a network, or when you do linear regression, you say train. When you hit train, right, this is the algorithm that gets executed underneath. Right? Gradient descent is the key for all of these machine learning algorithms. 01:05:48 - 01:06:13 Speaker 1: Questions? Okay, let's come back to this in a bit, okay? Let's do something a little bit lighter. Is this picture clear in front of you? Do we want to take a break and freshen up and then come back? 01:06:15 - 01:06:29 Speaker 1: It's fine, it's local data. It's fine, yeah. Okay, so now here is a classification problem that we are going to try and do. 01:06:29 - 01:06:50 Speaker 1: Okay, here is the set of data points that I have. So here is this set of data points I have. This is my X1 axis. This is my X2 axis. And so the data is given like this. 01:06:50 - 01:07:06 Speaker 1: This is the way you're seeing X1 and X2. This is the way you're normally used to seeing data. And you have 0 or 1. This is the Y value, the classification problem that you're trying to do. You're trying to solve this problem. 01:07:06 - 01:07:22 Speaker 1: You're trying to solve this problem using neural networks. Now, this X1 and X2, I'm plotting it. I'm plotting the 2 axes over here. What is this kind of plot called? Scatter plot. 01:07:23 - 01:07:36 Speaker 1: Scatter plot. Scatter plot is the plotting command. Right. However, functionally, we call this a feature space plot. 01:07:36 - 01:07:49 Speaker 1: This is a 2-dimensional feature space. So you are able to nicely visualize it. Okay. Now I want to design a neural network for it. What are the things that you know about this neural network? 01:07:52 - 01:07:56 Speaker 1: I see 2 input features. So what does that tell you? 01:07:56 - 01:07:57 Speaker 2: 2 inputs.
01:07:57 - 01:08:13 Speaker 1: I need to have a neural network with 2 inputs. And this is a classification problem. So the output neuron should have? Sigmoid function. Sigmoid function. 01:08:14 - 01:08:29 Speaker 1: Everyone with me so far? Yes. Okay, the output neuron should have a sigmoid function. Good. Now, we are going to try and design a neural network to try and classify this. 01:08:30 - 01:08:47 Speaker 1: Okay, for example, here, let's go ahead and choose a sigmoid activation function for everything. Learning rate. Have you heard this term before? What is the learning rate? It's alpha in the equation. 01:08:47 - 01:09:01 Speaker 1: That's right, alpha. The step size: when it's trying to learn, what should be the size of the step it's taking? That's called the learning rate. For now, let's just not touch this learning rate. Let's keep it as it is. 01:09:02 - 01:09:13 Speaker 1: And in this case, what are we choosing? We are choosing 2 hidden layers. The first layer has 4 neurons. The second layer has 2 neurons. And then the last one is the output neuron. 01:09:15 - 01:09:32 Speaker 1: And I'm asking this neural network to go ahead and draw the classification boundary by clicking this button. I click it and it trains. You see the steps. It's running; these are the different steps it's taking to try and find it. 01:09:32 - 01:09:43 Speaker 1: But you know what, it's already found the weights. It's nicely found the separation boundary between the blue points and orange points. Everyone with me so far? Yes. 01:09:43 - 01:09:48 Speaker 2: Yes, professor. Yes, professor. Professor, for the input data, it automatically picks it up? 01:09:49 - 01:09:57 Speaker 1: It's picking from the matrix, and it keeps trying, keeps changing the weights until it reaches the best set of weights. 01:09:59 - 01:10:04 Speaker 4: Then can we do it again, just to play it again? 01:10:06 - 01:10:23 Speaker 1: Okay, let's do that one more time.
Let me go ahead and come back. Yeah, now we have reset it. So let me go ahead and put only 1 hidden layer. 1 hidden layer with 4 neurons in the hidden layer. 01:10:24 - 01:10:45 Speaker 1: This is the input layer, right? These are the input pins, where X1, X2 are being sent. And this is the output. There is 1 output neuron, which they are not showing. Okay, now I go ahead. Sorry, what is an epoch? An epoch, for now, is the number of steps. 01:10:45 - 01:10:46 Speaker 4: Okay, okay. 01:10:47 - 01:10:51 Speaker 1: I'll explain more, we'll come there again. For now, think of it as the number of steps. 01:10:52 - 01:10:54 Speaker 4: And probably it's just starting with random weights? 01:10:54 - 01:10:57 Speaker 1: It's initially starting with a random set of weights. 01:10:57 - 01:11:00 Speaker 4: We are not defining that, it's just picking it up by itself? 01:11:00 - 01:11:29 Speaker 1: It's picking a random set of weights by itself, right? And then I'm going to hit train, and it's going to go ahead and classify. Here is the error; train loss and test loss, it's plotting the error. Now, here is my question for you. You all are seeing the network, right? 01:11:29 - 01:11:30 Speaker 6: Yes. 01:11:31 - 01:11:46 Speaker 1: Now, can you tell me, what is the smallest number of hidden layers I can use to classify this problem? 1 hidden layer. 01:11:46 - 01:11:48 Speaker 4: 0 01:11:49 - 01:11:51 Speaker 1: hidden layers. Why? 01:11:53 - 01:11:57 Speaker 4: Because there are only 2, X1 and X2, right? 01:11:57 - 01:11:58 Speaker 6: So, yesterday, we 01:11:58 - 01:12:00 Speaker 4: used the perceptron to 01:12:00 - 01:12:15 Speaker 1: do this. A perceptron, because what does a perceptron do? A perceptron draws a linear separating boundary. It's like a regression, right? So I should be able to solve this problem with 0 hidden layers. 01:12:17 - 01:12:28 Speaker 1: With 0 hidden layers, I go ahead and do it, and it nicely classifies with just the output node alone. One perceptron is enough. Yeah.
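The playground networks being clicked together here can be written with the Sequential / model.add style from yesterday. A hedged sketch, assuming TensorFlow/Keras is installed; the layer sizes mirror the earlier demo (2 inputs, hidden layers of 4 and 2, one sigmoid output), and everything else is a plain default choice, not something from the session.

```python
# Playground-style network: 2 inputs -> 4 -> 2 -> 1 (sigmoid output).
# Sketch only; sigmoid activations mirror the demo's settings.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input

model = Sequential()
model.add(Input(shape=(2,)))               # two features: x1, x2
model.add(Dense(4, activation="sigmoid"))  # first hidden layer
model.add(Dense(2, activation="sigmoid"))  # second hidden layer
model.add(Dense(1, activation="sigmoid"))  # output neuron for 0/1
model.compile(optimizer="sgd", loss="binary_crossentropy")
# model.fit(X, y, epochs=...) would then run gradient descent underneath
```

Deleting the two hidden `Dense` lines leaves exactly the 0-hidden-layer perceptron discussed here: one sigmoid output neuron drawing a single line.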
01:12:29 - 01:12:32 Speaker 2: But professor, you also created 1 challenging question, right? 0, 0. 01:12:34 - 01:12:37 Speaker 4: And it is also linearly separable. 01:12:38 - 01:12:47 Speaker 1: It is linearly separable. So a perceptron will do that. Your output node is anyway a perceptron. So your output node is the only thing you need. You don't need a hidden layer at all. 01:12:48 - 01:13:00 Speaker 3: Yeah, but Professor, if we know what Y is, then we can decide whether it is linear or nonlinear, right? In this case, it's probably linear, so a perceptron is okay. 01:13:00 - 01:13:05 Speaker 1: Yes, it is. You're seeing the picture, right? You are able to draw a separating line. 01:13:06 - 01:13:09 Speaker 3: Yeah, but before we try, we do not know that, right? 01:13:10 - 01:13:20 Speaker 1: In this case, we are able to plot it. So the plot is actually clearly there. I'm able to see the plot and say that I can draw a line separating that. Got it. Let's go to the next one. 01:13:20 - 01:13:24 Speaker 1: Then it becomes clear. Now, what about this problem? 01:13:28 - 01:13:32 Speaker 6: I think it takes more than 1 line. 01:13:32 - 01:13:35 Speaker 1: It's going to take more than 1 line to solve it? 01:13:36 - 01:13:36 Speaker 6: I think yes. 01:13:37 - 01:13:43 Speaker 1: Yes. Let us, you know, what do we use, right? Let's try with a single perceptron. What 01:13:43 - 01:13:46 Speaker 4: happens? This is equal to the XOR problem from yesterday. 01:13:46 - 01:13:57 Speaker 1: Exactly. Exactly. This is the XOR problem, right? The exclusive OR problem. You see that it's running, but it was not able to classify. 01:13:57 - 01:13:58 Speaker 6: Yes. 01:13:58 - 01:14:11 Speaker 1: It's not able to classify. So this is a nonlinear problem, and I need a hidden layer. So, I'm going to introduce 1 hidden layer. So, how many neurons do I need? What is the concept of the hidden layer? 01:14:12 - 01:14:16 Speaker 1: Like, is this a 4, no?
No, what is the concept of the hidden layer? 01:14:16 - 01:14:20 Speaker 2: 3, 3. Okay, looks like 3, or 2; the colors are 01:14:21 - 01:14:26 Speaker 4: all the same. We need to do 2 01:14:27 - 01:14:28 Speaker 7: lines. 2 lines, 01:14:29 - 01:14:52 Speaker 1: and that's the right thing to do. We just try with the simplest one, so let's try with 2 and run it. It's learning. This one drew 1 line, the other neuron drew the other line, and it did 2 lines, but unfortunately, it's not great. Yeah. So, what do we 01:14:52 - 01:14:53 Speaker 4: do? Increase 01:14:57 - 01:14:59 Speaker 7: Maybe increase the neurons. 01:15:00 - 01:15:13 Speaker 1: Neurons. There is a mathematical theorem, the universal approximation theorem, which says that with a single hidden layer, you can solve any such problem. Yeah. So, you don't need to increase the hidden layers. Just increase the number of neurons. 01:15:14 - 01:15:24 Speaker 1: Yes. So, no professor. What is the meaning of the hidden layer, first of all? Hello? You think it'll do it? 01:15:24 - 01:15:27 Speaker 6: No. I think not. No. 01:15:28 - 01:15:29 Speaker 4: It's probably going to be difficult. 01:15:31 - 01:15:46 Speaker 1: Let's wait and see. Let's give it some time. What do you think? Better. It's better than the previous one. 01:15:46 - 01:15:58 Speaker 1: It's getting better. It's learning. The steps are increasing. As the steps are increasing, it's starting to get better. Let me just regenerate it and have it start again. 01:15:59 - 01:16:13 Speaker 1: Start again. 01:16:13 - 01:16:15 Speaker 4: Is this tool really available, like TensorFlow? 01:16:16 - 01:16:26 Speaker 1: Yeah, we'll talk about that in a second. Just 1 second. Look at that training. It's training. Definitely better than before? 01:16:27 - 01:16:28 Speaker 6: It's a lot better. 01:16:29 - 01:17:19 Speaker 1: It's a lot better. And what is it doing?
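The XOR experiment can be made concrete with hand-picked weights: two hidden neurons each draw one line, and the output neuron combines them. The step activation and the specific weights below are illustrative choices, not what the playground actually learned.

```python
# XOR with one hidden layer of 2 neurons: one neuron draws an OR-like
# line, the other a NAND-like line, and the output neuron ANDs them.

def step(z):                      # hard threshold, for clarity
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)      # line 1: fires when x1 + x2 > 0.5 (OR)
    h2 = step(1.5 - x1 - x2)      # line 2: fires when x1 + x2 < 1.5 (NAND)
    return step(h1 + h2 - 1.5)    # output fires only if both lines fire

print([xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# [0, 1, 1, 0] -- XOR, which no single line (perceptron) can produce
```

Training finds weights playing the same roles as these hand-picked ones; the point is that the two hidden neurons each contribute one separating line.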
Let me quickly... I want to be able to scribble on this, but unfortunately I can't, so I'm going to have to paste it again here. Right, look what it's done. So, what was our hope? By the way, the site where all of this is available is playground.tensorflow.org. So this is the dataset that I have. 01:17:23 - 01:17:32 Speaker 1: And I was really hoping that when it's going to do the classification, it's going to classify like this. I bet that's what all of you were hoping. 01:17:33 - 01:17:33 Speaker 6: Yeah. 01:17:34 - 01:17:41 Speaker 1: Yeah. But what did it do? It did something else, crazy. It's doing like this, right? But you know what this is? 01:17:41 - 01:17:52 Speaker 1: This one actually is a combination of 3 lines. This is 1 line. This is a second line. And this is a third line. Are you seeing that? 01:17:53 - 01:18:07 Speaker 1: And that is what these 3 neurons are doing. This neuron is 1 line, this is the second line, this is the third line. Oh, wait, let me just train it. Sorry, I clicked on the wrong one, this one. This one is 1 line. 01:18:07 - 01:18:11 Speaker 1: This is the second line. This is the third line. Are you seeing it? 01:18:12 - 01:18:12 Speaker 6: Yes. 01:18:12 - 01:18:32 Speaker 1: That is what 3 neurons do. Each neuron is trying to draw a line to classify this. And the way I determine whether a neural network is enough or not is by this: by trying different ones. And this is why the design matters. This is a hyperparameter. 01:18:35 - 01:18:52 Speaker 1: So this process, this thing that's running, is basically the steps that are running. Each time it's trying to adjust the weights. Right? Look, these are the weight values. The first neuron is getting input from X1 and X2. 01:18:52 - 01:19:08 Speaker 1: Are you seeing these 2 inputs coming in? Right? So, the first weight is 1.3. So, 1.3 times X1 is going in, and minus 0.85 times X2 is going into this neuron. This neuron is learning to classify this line.
01:19:08 - 01:19:19 Speaker 1: This line. Are you seeing my mouse move there? And you're seeing the colors there, right? What is the responsibility of each neuron? You're able to see that there. 01:19:19 - 01:19:33 Speaker 1: Yes. Yes. Right. And this is what is happening: essentially, it's 3 neurons. All of them are equally capable, but they decide to specialize automatically. 01:19:34 - 01:20:00 Speaker 1: One neuron specializes in this, another one specializes in this, another one specializes in this. It's exactly like, let's say, the 5 of you go to interview students for admission to a new college, right? You will automatically decide the most optimal solution is to allocate responsibility among the 5 of you. One person evaluates the candidate using one set of skills. The other person evaluates differently. 01:20:00 - 01:20:09 Speaker 1: Each person evaluates differently. Finally, together you see whether the person is the right candidate or not. That is the specialization that's happening during this learning. 01:20:11 - 01:20:13 Speaker 4: So explainability is easy then, right? 01:20:16 - 01:20:32 Speaker 1: Right now, explainability looks like an option. However, not really. It's going to get complicated. Now, let's get to the next data set. What about this? 01:20:32 - 01:20:47 Speaker 1: Is this a more complicated data set or a simpler data set? It's more complicated. More complicated. Do you expect to need more neurons? 01:20:48 - 01:20:50 Speaker 4: Yeah, 1 more. 1 more. 1 more. 01:20:51 - 01:20:54 Speaker 1: 1 more, okay. But you know what? Let's... 01:20:54 - 01:20:57 Speaker 6: I think with the 3, with the 3 it could work. 01:20:57 - 01:20:57 Speaker 7: We could try, yeah. 01:20:57 - 01:21:02 Speaker 1: Let's live life riskily and try with 3 neurons and see what happens. 01:21:02 - 01:21:08 Speaker 6: Because we can kind of do a circle around the blue with the 3 neurons, right? 01:21:09 - 01:21:11 Speaker 1: Look at that. Perfect.
01:21:12 - 01:21:13 Speaker 6: It's a triangle. 01:21:14 - 01:21:15 Speaker 1: Yeah, it drew a triangle. Yeah. 01:21:15 - 01:21:16 Speaker 6: Yes. 01:21:16 - 01:21:40 Speaker 1: It's what you expect with 3 neurons, right? With 3 neurons you expect a triangle, and that's exactly what it did, and it's beautifully clustered. This is the thing, right? You cannot guess how many neurons are needed for a particular problem. And this is a 2-dimensional problem where I'm able to visualize it, where I can see the feature space. In general, it's going to be a 5-dimensional problem or a 6-dimensional problem, right? 01:21:40 - 01:21:56 Speaker 1: And you can't visualize it. The problem of digit recognition, that was 784 dimensions. How in the world are you going to visualize it, right? You won't be able to visualize it. And the only way to figure this out is to keep trying different numbers of neurons. 01:21:58 - 01:22:07 Speaker 6: Yeah? How is the decision made between the neurons, who takes this line or 01:22:07 - 01:22:19 Speaker 1: another? This is why it's random. So, the way you're always going to start your gradient descent is, you start off at some random point. 01:22:20 - 01:22:20 Speaker 6: Yes. 01:22:20 - 01:22:41 Speaker 1: And when you start at one point, you're essentially giving one random weight to one neuron and a different random weight to another neuron. This random set of choices allows that particular neuron to specialize in whichever one is closer to it, and the other neuron takes up some of the responsibility. 01:22:42 - 01:22:46 Speaker 6: So, for each neuron, we start from a different point. Let's say 01:22:46 - 01:22:59 Speaker 1: exactly, that random choice breaks the tie. Okay. And remember, you never start off with 0, 0, 0, 0, 0. Yes. 01:22:59 - 01:23:18 Speaker 1: Because then all neurons have the same set of weights, starting from the same point, right?
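The point about never starting from all zeros can be seen in a tiny 1-input, 2-hidden-neuron, 1-output network: with identical starting weights, both hidden neurons receive identical gradient updates forever and remain clones, while a random start lets them drift apart and specialize. The single training example, targets, and numbers below are all illustrative.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def train_hidden_weights(w, v, steps=200, alpha=0.5):
    # One input x with target t; output y = v1*h1 + v2*h2, squared error.
    x, t = 1.0, 1.0
    for _ in range(steps):
        h = [sigmoid(wi * x) for wi in w]          # hidden activations
        y = v[0] * h[0] + v[1] * h[1]              # network output
        e = 2 * (y - t)                            # d(loss)/dy
        v = [vi - alpha * e * hi for vi, hi in zip(v, h)]
        w = [wi - alpha * e * vi * hi * (1 - hi) * x
             for wi, vi, hi in zip(w, v, h)]
    return w

print(train_hidden_weights([0.0, 0.0], [0.0, 0.0]))  # the 2 stay identical
print(train_hidden_weights([0.3, -0.2], [0.1, 0.4])) # the 2 stay different
```

With the symmetric all-zero start, every quantity computed for neuron 1 equals the one for neuron 2 at every step, so the two weights can never separate; that is the tie the random initialization breaks.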
But allowing the random choice, you're allowing the neuron to specialize. You can't predict which neuron will take responsibility of what. Okay. But it does automatically evolve to take a different responsibility. 01:23:19 - 01:23:21 Speaker 5: Can others please mute? 01:23:24 - 01:23:38 Speaker 1: Okay. Now, I'm hoping all the nightmares due to calculus are slowly disappearing. Nobody wants to give any encouraging words yet. 01:23:38 - 01:23:53 Speaker 6: Yeah, since the program is taking care of the calculations. Exactly. So the explanation before was of where it's coming from, but now we don't need to worry about it. 01:23:53 - 01:24:14 Speaker 1: Exactly. You don't need to know, but you can tell that that's what is happening. You saw that learning, right? You saw it learning. How do we understand the mystery of which neuron takes which responsibility? The thing is, this is not an explainable ML algorithm, right? 01:24:15 - 01:24:29 Speaker 1: So I'm telling you, it's actually quite a hopeless task to think about that explainability factor, right? This is why we will- So is there a test for that? Is there a test for that? Business decisions are being made. Yeah, but there will be tests and exams and all. 01:24:29 - 01:24:47 Speaker 1: You need to really think about what is it that's important for me. Is accuracy important for me, or is explainability important for me? If explainability is important, you're pretty much restricted in what set of algorithms you can use. Questions? 01:24:47 - 01:24:51 Speaker 1: Neural networks you won't choose. There are some situations when you need them. 01:24:52 - 01:24:59 Speaker 4: So, in the business context, professor, we have always seen explainability is something which the business wants to know, right? 01:24:59 - 01:25:18 Speaker 1: So the question is, what level of explainability? Exactly. Right? What level of explainability? 
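The point above about never starting off with 0, 0, 0, 0, 0 can be sketched numerically. This is a minimal illustration, not code from the session: a tiny 2-input, 3-hidden-neuron network trained on made-up data, showing that identically initialized hidden neurons receive identical gradient updates and can never specialize, while random initialization breaks the tie.

```python
import numpy as np

def train_hidden_weights(w1_init, steps=300, lr=0.1):
    """Train a tiny 2-3-1 network and return the input->hidden weights."""
    rng = np.random.default_rng(0)
    X = rng.normal(size=(64, 2))
    y = (X[:, 0] * X[:, 1] > 0).astype(float)    # made-up target
    W1 = w1_init.copy()                          # (2, 3): one column per hidden neuron
    W2 = np.full((3, 1), 0.1)                    # hidden -> output weights
    for _ in range(steps):
        h = np.tanh(X @ W1)                      # hidden activations
        err = (h @ W2)[:, 0] - y                 # output error
        dW2 = h.T @ err[:, None] / len(X)
        dW1 = X.T @ (err[:, None] @ W2.T * (1 - h ** 2)) / len(X)
        W1 -= lr * dW1                           # all weights adjusted together
        W2 -= lr * dW2
    return W1

# identical start: every hidden neuron stays an exact clone of the others
tied = train_hidden_weights(np.full((2, 3), 0.4))
# random start: the tie is broken and the neurons specialize differently
free = train_hidden_weights(np.random.default_rng(1).normal(scale=0.4, size=(2, 3)))
```

With the identical start, the 3 columns of `tied` (1 per hidden neuron) remain exactly equal no matter how long you train; the columns of `free` diverge, which is the specialization seen in the demo.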
And this 1, random forest, for example, can give you the set of important variables and say that this variable is much more important than this 1. 01:25:18 - 01:25:54 Speaker 1: It may be able to give you that level of explainability. Are you happy with that? Or are you really expecting an answer to, why did my algorithm say that this particular stock was a buy today, whereas last month it said no? If you're expecting that level of explainability, then unfortunately neural networks are not for you. On the other hand, if all you want to know is, generally this set of algorithms places high importance on earnings growth, but not so much importance on how much dividend companies are giving. 01:25:55 - 01:26:01 Speaker 1: If all you want is only that level of explainability, then neural networks are fine or random forest is fine. 01:26:02 - 01:26:07 Speaker 4: Right? Can you repeat this? I missed it in the first part, what you said. 01:26:07 - 01:26:34 Speaker 1: Okay. So, the way you pick stocks, right? What will end up happening is, do I have that slide? It's okay, I don't want to complicate the stocks data here. So what Random Forest will tell you is, let's say you have a data set that looks like this. 01:26:34 - 01:27:16 Speaker 1: I have the growth rate, earnings growth rate of a company. Then dividend yield of a company. How much dividend is it giving? And the third 1 is the PE ratio. For different companies you have stock 1, stock 2, stock 3 and so on. For all of these stocks you have this data, and you look at, you know, when this was the earnings growth rate, and this was the dividend yield, and this was the PE ratio of that particular company, in the next quarter it did great. So this was a buy. 01:27:16 - 01:27:32 Speaker 1: This was another company with different values of the numbers, and this was a sell, right? So, you have data like this, right? For all companies. Now, you can use a neural network to learn this. Let's think about that, right? 
01:27:32 - 01:27:34 Speaker 1: What will be the architecture of that neural 01:27:36 - 01:27:39 Speaker 7: network? 3 input layers. 01:27:41 - 01:27:52 Speaker 1: 3 nodes, input nodes. Okay. And an output layer. Right? And some hidden layer. 01:27:52 - 01:28:08 Speaker 1: I don't know how many, but for the sake of argument, let's put a hidden layer with 2 neurons. Right? So I will have this kind of connections. Right? And this will get trained. 01:28:09 - 01:28:28 Speaker 1: We will train this neural network algorithm on that. Right? Now, what I'm saying is this. From this, it is possible to extract 1 level of explainability. It's too small. 01:28:29 - 01:28:52 Speaker 1: I can say that, right, among these 3 variables, it will tell you that earnings growth rate has this much importance, dividend yield is not so much important, but PE ratio is this much important. You can get it out. 01:28:54 - 01:28:55 Speaker 7: Yeah, this is the way it's done, right? 01:28:55 - 01:28:56 Speaker 4: This is the best 01:28:57 - 01:29:07 Speaker 1: of the ways, yeah. There is a way to extract this. Okay, it is possible to extract it. If you're happy with this level of explainability, well and good. 01:29:09 - 01:29:29 Speaker 1: However, what people normally expect is something more. They are not asking about this general rule. They're asking, why did stock 2, why was that a sell? They're expecting an answer like, because its dividend yield 01:29:30 - 01:29:45 Speaker 1: was low, was lower than, you know, some 0.7, and that is why it's a sell. Okay. They're expecting that level of explainability, and that level of explainability neural networks will not be giving. Okay. 01:29:45 - 01:29:52 Speaker 1: And again, this is a very important challenge for 01:29:54 - 01:29:58 Speaker 4: business. Why would you go for neural network in this case? 01:30:00 - 01:30:05 Speaker 1: You won't. I don't. Okay. 
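The variable-importance level of explainability discussed here can be sketched in a few lines, assuming scikit-learn is available. The column names and the rule generating the up/down labels are invented for illustration: in the made-up data below only earnings growth actually drives the label, and the random forest's importance scores recover that.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.normal(size=n),    # earnings growth rate
    rng.normal(size=n),    # PE ratio
    rng.normal(size=n),    # dividend yield
])
y = (X[:, 0] > 0).astype(int)   # label driven only by earnings growth

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
names = ["earnings_growth", "pe_ratio", "dividend_yield"]
for name, imp in zip(names, clf.feature_importances_):
    print(f"{name}: {imp:.2f}")
```

`feature_importances_` gives exactly the "this variable matters much more than that one" level of explainability, not a per-stock reason like "sold because the dividend yield was below 0.7".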
I absolutely don't. I go with Random Forest. 01:30:06 - 01:30:24 Speaker 1: I am happy with this level of explainability. I am very happy with this level of explainability. And so, I, as I mentioned yesterday, for me, random forest is the queen of all algorithms. That, by default, I choose that. Yeah, we've got communes. 01:30:25 - 01:30:29 Speaker 1: Right? Wonderful. It works in a large part of it. 01:30:29 - 01:30:34 Speaker 3: But even with the Random forest, the second thing is not explainable, right? The second statement that you made. 01:30:35 - 01:31:03 Speaker 1: This is not possible with random forest, but in my business, all I care about and all my clients care about is, am I making the money or not? Not why I made that decision, Except people do ask sometimes why I made the decision after a lossy month. But you know what, generally I show this and I'm able to tell a story around it. And that's exactly what you all are required to do as well. Did Dr. 01:31:03 - 01:31:14 Speaker 1: Murthy mention the word translators? No. Okay. Yeah. So I suck at languages. 01:31:14 - 01:31:37 Speaker 1: Okay. I was born in Tamil Nadu. My mother tongue, I guess my native language, I consider that as Tamil. However, my mother is actually Malayali and unfortunately speaking Malayalam is extremely hard for me. I'm married to a Telugu person, and Telugu, again, is very hard for me. 01:31:38 - 01:31:49 Speaker 1: Since I'm married for 20 plus years, I recognize all the curse words. I mean, clearly, a lot of arguments happen. So curse words, I know. But speaking is very hard. I suck at languages. 01:31:49 - 01:32:15 Speaker 1: The language that I'm very good at is math. I understand math. Now, the thing is, what all of us need to be good at is 1 particular language. We need to be good at the business language, the language that the business folks speak. And then there is this ML engineers who talk about there is this neural network algorithm. 
01:32:15 - 01:32:37 Speaker 1: There is this learning rate. I need to use a learning rate of 0.07 and a stochastic gradient descent method and so on and so forth. They completely talk this very complicated language. The thing is you are dealing with other decision makers and other people who are interested 01:32:37 - 01:32:37 Speaker 2: in business. 01:32:37 - 01:32:56 Speaker 1: So they are telling you, explaining you to the business problem in 1 language. Your role is the role of a translator. You need to take that and figure out, tell to the machine learning engineer, what algorithm to choose. The machine learning engineer will code it up. You don't need to worry about code. 01:32:57 - 01:33:19 Speaker 1: They will make decisions regarding all of that complicated thing. But they do not understand this business use case. And so you are going to tell that, you know what, this level of explainability is good enough for this particular problem. So we all, you all are going to be training to be translators. This is a translator course. 01:33:21 - 01:33:32 Speaker 1: And translator role is 1 of the highest paid roles, because you are actually able to talk, communicate between this 1 and that 1. 01:33:32 - 01:33:34 Speaker 2: Yes. It's kind of a ubiquitous classic. 01:33:36 - 01:33:37 Speaker 1: Exactly. There's 1 question 01:33:37 - 01:33:57 Speaker 6: on the network that you're showing. So if the reason that each neuron took a plan to make 1 boundary, right? Would you be able to replicate the same neuron taking the same boundary or if you change the initial weights, initialization, that neurons would have switched and some neurons would have taken 1? 01:33:57 - 01:34:02 Speaker 1: Exactly. If you change the initial set of weights, the different neurons will take different decisions. 01:34:03 - 01:34:12 Speaker 6: Right? So then, 1 question before you just answer. So then, doesn't that have to do with gradient descent within each neuron? 
01:34:18 - 01:34:35 Speaker 1: No. It's not each of them training separately. Basically, the training is happening so that the overall problem is correctly classified. When the weights are adjusted, all weights are adjusted simultaneously. Every step, right? 01:34:35 - 01:34:59 Speaker 1: When you move from this point to the next point, right? Remember, in 1 step, both beta 0 and beta 1 got adjusted. Every single weight will get adjusted in each step. And all the weights are adjusted in such a way that the overall error becomes lower and lower. Next will be... 01:34:59 - 01:35:00 Speaker 1: Yes, Rekha. 01:35:02 - 01:35:04 Speaker 5: So then, if the weights are 01:35:04 - 01:35:05 Speaker 1: adjusting based 01:35:05 - 01:35:08 Speaker 5: on each other. Sorry, please go ahead. 01:35:09 - 01:35:11 Speaker 1: Yes, you can go ahead. You had raised your hand, so 01:35:11 - 01:35:12 Speaker 8: explain. 01:35:12 - 01:35:13 Speaker 1: Yeah, sure. 01:35:14 - 01:35:14 Speaker 8: Am I audible? 01:35:15 - 01:35:18 Speaker 1: Yeah. Hello? Yes, you are. 01:35:18 - 01:35:33 Speaker 8: How do we explain the reasoning why 1 of the variables is more important? The question that you had raised before, how do we explain that to business, beyond the graph that you showed? 01:35:34 - 01:35:50 Speaker 1: So that is at this point beyond the scope of this particular module. It will be handled later. But it's beyond the scope of the current 1. What I'm pointing out is, those are decisions you will be making. 01:35:52 - 01:36:04 Speaker 1: You'll be learning about business use cases. In fact, that's going to be your main project, right? You're going to do a project where you're going to take a problem and you're going to think through the steps. 01:36:05 - 01:36:21 Speaker 1: And so you'll be guided through that process. Absolutely. 
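The earlier point that both beta 0 and beta 1 get adjusted in the same step can be sketched with the simplest possible case: the 2 parameters of a straight-line fit, both moved together in every gradient-descent step. The data, learning rate, and step count below are made up for illustration.

```python
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * X + 1.0             # made-up data on the line: slope 2, intercept 1

b0, b1, lr = 0.0, 0.0, 0.05   # start both weights at a guess
for _ in range(2000):
    err = (b0 + b1 * X) - y
    g0 = 2 * err.mean()        # d(MSE)/d(b0)
    g1 = 2 * (err * X).mean()  # d(MSE)/d(b1)
    # both weights move in the same step, driven by the same overall error
    b0, b1 = b0 - lr * g0, b1 - lr * g1
```

In a real network the same rule applies to all 11 (or 11 million) weights at once: each step nudges every one of them so the overall error becomes lower and lower.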
Yes, Venky. You said the translators is from business language to the machine language, correct? So, what about machine language back to business language? 01:36:21 - 01:36:39 Speaker 1: You have to do it. You are the translator back and forth because what will happen is when the results come back, you need to tell the business folks why you're believing that, why they should be believing. So we should be able to explain that situation. This is exactly what I do. This is my role. 01:36:39 - 01:37:01 Speaker 1: I don't do coding. I cannot do coding. I've got to say, I play a good translator, I play the role to the T. In large number of situations, I'm able to communicate well with the business folks. And when the results come back and challenges that come back, I'm able to communicate back to them as well. 01:37:01 - 01:37:04 Speaker 1: You will love this. This is exactly what all of this training is for. 01:37:04 - 01:37:07 Speaker 4: Professor on the Python node you are teaching us all the curse words. 01:37:12 - 01:37:36 Speaker 1: So let me quickly, we are at the break time, So let me quickly tell you what we are going to do. I am going to actually get to that Python code. I want you to see the code. And for this I'm going to talk about a business problem. I'm going to talk about the business problem and I'm going to talk about the Python code and take this business problem and exactly how do you translate that into a business code, we will walk through that. 01:37:37 - 01:37:51 Speaker 1: So we will get there. I don't think we'll get to the slide of what is deep learning yet. I would have hoped to get to that in the last half an hour by now, but it's fine. We're about half an hour behind, 40 minutes behind. It's okay, we'll catch it up tomorrow. 01:37:52 - 01:38:01 Speaker 4: Professor, are you willing to also talk about the backward, back propagation of finding the delta, the gradient? 01:38:02 - 01:38:22 Speaker 1: The gradient, we, not today. 
It's too late to talk about it today, but we will talk about it probably in tomorrow's class or day after tomorrow's class. When we talk about challenges in deep learning, we need to understand why backpropagation works, and we'll come back. Shreya, please. 01:38:25 - 01:38:49 Speaker 7: So 1 question. Where we saw the example, right, on that website, where we were showing the calculations of using 1 hidden layer or no hidden layer. We tried to fit the classification problem using 3 lines, using 3 neurons, and it tried to not draw a line but rather a curve. So, 01:38:50 - 01:38:50 Speaker 1: I had a 01:38:50 - 01:39:04 Speaker 7: question there, whether, you know, how do we define that? Could this be an overfitting or underfitting there? Because it actually fit the data well, right? It went and actually drew that beautiful curve structure there. 01:39:06 - 01:39:36 Speaker 1: So underfit and overfit are terms that come from what error you're getting on the train data and what error you're getting on the test data. If those 2 errors differ by a large amount, we start talking about overfitting or underfitting. And here, that's why you see this portion of the graph. You see those 2 numbers, right? 01:39:36 - 01:39:57 Speaker 1: The test loss and the train loss will be different. And generally speaking, test loss will be higher than the train loss. It's just that when overfit happens, they are dramatically apart. Now look, in this case, it actually hasn't reached the optimal 1 yet. Because it depends on where you start. 01:39:57 - 01:40:18 Speaker 1: If you start at a good point, you quickly reach the best 1. Otherwise, sometimes it takes a long time before you reach the minimum point. But eventually it does, it will reach the minimum. Yeah. This is where the role of randomness plays an important element here. 01:40:19 - 01:40:27 Speaker 1: Now the triangle is faced this way. 
Last time the triangle was faced this way. Because of different starting points each time. Can you hear me, sir? 01:40:27 - 01:40:32 Speaker 7: So for the trade-off, then, we will look at the output of test and train loss and then decide. 01:40:32 - 01:40:50 Speaker 1: Yeah, exactly. In this case, both train loss and test loss are about the same. You see the curves are converging, so there is no overfit. Can you hear me? The overfit is all determined by how big the difference is between the test loss and the train loss. 01:40:51 - 01:40:56 Speaker 7: But this can be an underfit also, right? If they are too close and they are equal, like as we can see. 01:40:56 - 01:41:16 Speaker 1: So it would be an underfit if the errors are large. Underfit. So here, let's talk about this, right? So I have a set of points. This set of points. 01:41:17 - 01:41:32 Speaker 1: Underfit is this. Essentially, the complexity of the line is not enough to capture the complexities of the data. That is an underfit, okay. 01:41:32 - 01:41:33 Speaker 7: Overfit. 01:41:33 - 01:41:36 Speaker 1: Yeah, yeah. Overfit would be this, 01:41:39 - 01:41:40 Speaker 7: all the points and 01:41:40 - 01:42:00 Speaker 1: a curve that necessarily connects all the points, right. That would be an overfit. The right level fit would be something like this. Sorry. But man, my, my curves are really bad. Something like this. That would be a right level fit. 01:42:00 - 01:42:05 Speaker 4: Where the, where the loss function is, or cost function is minimum. 01:42:06 - 01:42:42 Speaker 1: Where the cost function is minimum, but that is not it. See, in all of these cases the cost function is minimized. It's just a question of, are you trying to fit with a linear equation or quadratic or cubic, right? In all of them the parameters are chosen to minimize the cost. But are you using too many parameters, is the question. 
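The linear-versus-quadratic-versus-too-many-parameters picture can be sketched with polynomial fits of increasing degree on made-up noisy data. The sine curve, noise level, and degrees below are arbitrary choices for illustration, not from the session.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 40)
y = np.sin(3 * x) + rng.normal(scale=0.1, size=x.size)   # noisy made-up data
x_tr, y_tr = x[::2], y[::2]       # train half
x_te, y_te = x[1::2], y[1::2]     # test half

def losses(degree):
    """Fit a degree-d polynomial on train, return (train MSE, test MSE)."""
    coef = np.polyfit(x_tr, y_tr, degree)
    tr = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)
    te = np.mean((np.polyval(coef, x_te) - y_te) ** 2)
    return tr, te

under = losses(1)    # a straight line: not complex enough for the data
right = losses(5)    # enough complexity to capture the curve
over = losses(12)    # many parameters: hugs the training noise
```

The underfit case has large errors on both splits; adding parameters always drives the train loss down, and the gap that opens between test loss and train loss is what signals overfitting.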
Overfit and underfit is really talking about, are you using too many parameters to quantify that particular problem, to solve that particular problem. Yeah. A few more, 3 more questions, 3 more people raised their hands. 01:42:42 - 01:42:56 Speaker 1: Let's address that and then we can take a break. We are really running behind, but otherwise we won't be able to take a break. I am fine, but I think you guys need a break. Sorry. Bharat, is it? 01:42:57 - 01:43:15 Speaker 5: Yes, yes, sir. So just building on the question where you said that all the weights and all the neurons are getting trained simultaneously. I'm still trying to draw an analogy. I agree it's not like random forest. Isn't it like boosting, where the weights are iteratively adjusting round by round, right? 01:43:15 - 01:43:20 Speaker 5: Based on trying to get to an optimization? Isn't it like an ensemble with boosting? 01:43:21 - 01:43:52 Speaker 1: It is not an ensemble, in that in an ensemble, every learner is trained on the same target output. But actually, for this particular neuron, let me go back to this graph. This is the only neuron which knows what should be the desired value of output. This neuron, it does not know what value its output should target at all. This neuron also does not know what value it should target. 01:43:52 - 01:43:53 Speaker 5: Understood. 01:43:54 - 01:43:59 Speaker 1: Whereas in ensemble systems each 1 of them is exposed to the target. 01:43:59 - 01:44:00 Speaker 5: Very clear sir. 01:44:00 - 01:44:13 Speaker 1: This is trained with y and this is trained with y separately, whereas here it doesn't know what value it should target in order for this output to reach that y. It doesn't know. Which is why all of the weights are adjusted together. 01:44:13 - 01:44:15 Speaker 5: Lovely, so thank you, very clear now. 01:44:16 - 01:44:28 Speaker 1: Good, thanks. Srikanth, and then we can take a break. Last question. I think you can raise your hand. 
OK. 01:44:30 - 01:44:35 Speaker 8: Srikant, I'm sorry. What about local minima and global minima? 01:44:35 - 01:45:00 Speaker 1: Very, very good. So there is no assurance that you're going to get to a global minimum. With gradient descent, the only thing is, depending on the starting point that you choose, you might end up with this minimum or that minimum. There is no controlling which minimum you'll end up in. But the thing is, most of these local minima do pretty decently, so it doesn't matter. 01:45:00 - 01:45:40 Speaker 1: You remember I told you that you will be doing an assignment and you will be evaluated based on the output predictions you make. The thing is, 2 of you might come up with the exact same architecture, but might get slightly different answers. Because your instance might start off with 1 set of random weights and your friend's instance might start off with a different set of random weights, and you might end up with slightly different network weights. But in terms of the business use case, the conclusions will be similar. There might be some candidates you might disagree on, but on most of the candidates, you both will agree. 01:45:44 - 01:45:59 Speaker 1: Yeah. Okay, sure. Yeah, so let's take a break. It's 8.12 in my watch. Let's come back at 8.20, please. 01:46:00 - 01:53:30 Speaker 1: So 8 minute break. Please do look at your watch, then adjust for that. 8 minute break. 01:54:06 - 01:54:13 Speaker 4: Professor, if you're there, can we request you, like your whiteboard, can this be converted to PDF and sent across as well? 01:54:14 - 01:54:41 Speaker 1: Oh, it's hopeless. As you can see the level of scribblings that's there, I don't think it would help. I wish there was an AI algorithm that could figure out my handwriting and put it in the right 1. 
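The local-versus-global-minima point can be sketched in 1 dimension: a made-up cost with 2 valleys, where gradient descent settles in whichever valley the starting point belongs to. The cost function, learning rate, and starting points below are invented for illustration.

```python
def descend(x, steps=500, lr=0.01):
    """Gradient descent on J(x) = x**4 - 3*x**2 + x, which has 2 minima."""
    for _ in range(steps):
        grad = 4 * x ** 3 - 6 * x + 1    # dJ/dx
        x -= lr * grad
    return x

left = descend(-2.0)    # starts in the left basin, settles near x = -1.3
right = descend(2.0)    # starts in the right basin, settles near x = 1.1
```

Same rule, same cost, different starting points, different final weights: exactly why 2 students with identical architectures can end up with slightly different networks yet similar business conclusions.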
But there's still some hope, someday someone will invent it. Sorry. 01:54:41 - 01:54:56 Speaker 1: No, but 1 word of appreciation, Professor: we hit the global maximum of interaction in this class. We never hit it before. Wonderful, wonderful. I'm certainly enjoying all the interactions. We are going a little bit slow, but that's OK. 01:54:56 - 01:55:21 Speaker 1: As I said, 15 episodes is a long time. So we'll get there. In this particular set of classes, this 1 is the third class. I would say up to the next 2 classes are extremely important in getting the big picture. And I think once we understand this big picture, everything else is easy. 01:55:22 - 01:55:28 Speaker 3: Professor, I would say that you have given the first taste of the AI medicine 01:55:30 - 01:55:46 Speaker 1: in the TDA program. I don't know whether it was a sweet medicine or not, but thank you. You gave it in the spirit form. So, Professor, 1 quick question, maybe it will not be relevant. You said the random forest is the queen of all algorithms. 01:55:46 - 01:56:07 Speaker 1: Who's the king? Now, suddenly, you know, you're asking me to rank like that. I didn't mean it as a second 1. I genuinely meant it as the number 1. So let me offer some context to it. 01:56:07 - 01:56:37 Speaker 1: So I know this is not part of neural networks, but still, we are business folks and it's important for us to know this. This is genuinely the algorithm that's my favorite, right? My favorite is random forest. Pretty much any algorithm, any time, any problem I get, the first 1 that I use to try is random forest. Right out of the box, it does well. 01:56:38 - 01:56:53 Speaker 1: It minimizes overfitting. It doesn't do too much of overfitting. And you get pretty decent results. But it's very rare. It's not often that you see random forests actually get into production. 01:56:54 - 01:57:18 Speaker 1: OK. 
Because what happens is that random forest very quickly gets you pretty good results. But if you're thinking about winning competitions, where every single half a percentage point of accuracy really matters, right? And this is typically competitions and so on, right? Kaggle competitions and so on. 01:57:18 - 01:58:01 Speaker 1: There, XGBoost or any of the other boosting algorithms, LightGBM, light gradient boosting, and there are a whole bunch of other gradient boosting algorithms. Adaboost? Not quite Adaboost, but XGBoost and LightGBM, which is a Microsoft 1, both are excellent. You see them, they're capable of handling a large amount of data, can be multi-threaded, and so on. So accuracy-wise, performance-wise, these 2 are great. 01:58:01 - 01:58:36 Speaker 1: Essentially, the boosting class of algorithms are fantastic. Except in order to get that accuracy, you need to tune the algorithms very, very carefully. You need to put in a lot of work for it to actually get to that accuracy, but it is possible to get to that accuracy with these algorithms. Random forest, very quickly, will get you to, you know, pretty good accuracy levels, right? And so if you want to get quick results and quick turnaround and so on, Random Forest is the place to go. 01:58:37 - 01:59:19 Speaker 1: With these, there are a lot more hyperparameters, and tuning takes a lot more time. That's more from a real world practitioner's view. Just to give you some context, for example, booking.com, you know, the hotel aggregator company, right? For their production, they do use 1 of the boosting algorithms. And for them to figure out the right set of hyperparameters and tune with the kind of data set that they have, it takes about 1 week's run time to figure it out. 01:59:20 - 01:59:36 Speaker 1: So a massive amount of tuning is needed. 
But once the tuning is done, it performs fantastic. So that's the thing with the boosting algorithms. But for quick testing of ideas, Random Forest works well. 01:59:38 - 01:59:50 Speaker 1: OK, let's move on. Let's get to the last piece, okay? Everyone had a cup of coffee? More than that, yeah. Okay. 01:59:52 - 02:00:00 Speaker 1: Okay. Okay, for the last stretch, what we are gonna do is we are gonna go back and walk through the, 02:00:00 - 02:00:33 Dr Anand Jeyaraman: Through the math that I actually talked about, okay, but a little bit more carefully, okay. I promise you it won't be calculus, but I want to explicitly talk through the steps. And then I'll show you the code of how it actually comes together. Okay. Now let's think about this data set. We go back to the data set that I talked about earlier on. See the stock data set, right? 02:00:33 - 02:00:49 Dr Anand Jeyaraman: I have earnings growth, earnings growth rate of a company, PE ratio of the company, and then dividend yield of the company. These are the 3 features that I have for all of these different stocks. 02:00:50 - 02:01:00 Speaker 2: And an outcome of whether it went up or it went down in the next quarter. This is my data set that I have. 02:01:04 - 02:01:07 Dr Anand Jeyaraman: And I chose to design this network 02:01:09 - 02:01:15 Speaker 2: with 2 neurons in the hidden layer and 1 output neuron. Everyone remembers this? 02:01:17 - 02:01:34 Dr Anand Jeyaraman: Right? And this is the architecture. I don't know if there's any other color. Let's just do this. So this is the architecture that is there. 02:01:34 - 02:02:05 Dr Anand Jeyaraman: Right, so what will happen, and I want to talk about this explicitly, right? The way I'm going to get the training done. So this is my y value. The way the network operates is this: I'm going to take these guys, this data of the first stock. And I'm going to take this vector and line it up like this. 
02:02:08 - 02:02:25 Dr Anand Jeyaraman: I'm going to send the earnings into this node for the first stock. The PE ratio into this node, dividend into that node. Those are all 3 numbers, right? Those numbers get transported to this neuron. These numbers also get transported to this neuron. 02:02:26 - 02:02:42 Dr Anand Jeyaraman: This neuron is going to give some output. This neuron is going to give some output. That gets collated over here. And finally, some output comes, which will get interpreted as either 0 or 1. And so this is my predicted value. 02:02:45 - 02:02:57 Dr Anand Jeyaraman: These are all the actual values. This is the predicted value from the neurons. Everyone with me so far? Hopefully, I'm setting it up slowly so that everyone can be with me. Yes, yes. 02:02:58 - 02:03:03 Dr Anand Jeyaraman: OK, now let's start talking a little bit of details 02:03:04 - 02:03:07 Speaker 2: okay this particular neuron 02:03:08 - 02:03:12 Dr Anand Jeyaraman: how many inputs does it have? 3 inputs 02:03:12 - 02:03:13 Speaker 3: 3 02:03:14 - 02:03:17 Speaker 2: 3 So how many 02:03:42 - 02:03:50 Dr Anand Jeyaraman: weights need to be determined for this particular problem. Agreed? All with me? Yes. Okay. 02:03:50 - 02:04:05 Dr Anand Jeyaraman: So let's call this 11 weights. OK, so it's going to be something like this, right? So W0, W1, whatever, till W11. This is the set of 11 numbers I need to determine. Till W10, right? 02:04:05 - 02:04:17 Dr Anand Jeyaraman: Till W10, absolutely, you're right. W10. These are the 11 weights that I need to determine. Now the way I am going to determine them is this. 02:04:17 - 02:04:19 Dr Anand Jeyaraman: Now let's say I have the data 02:04:22 - 02:04:37 Speaker 2: of all the stocks, okay, for the past 50 years, right? From all geographies, right? So let's say I have 1,000,000 02:04:37 - 02:04:37 Speaker 4: data points, okay? From all geographies. 
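The count of 11 weights (W0 through W10) follows from the 3-2-1 architecture with 1 bias per neuron: 3 × 2 + 2 = 8 for the hidden layer, plus 2 × 1 + 1 = 3 for the output neuron. A small sketch of that arithmetic, with a made-up helper name:

```python
def n_params(layer_sizes):
    """Weights plus 1 bias per neuron for a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(n_params([3, 2, 1]))   # the 3-input, 2-hidden, 1-output network -> 11
```

The same formula scales to bigger networks, which is why parameter counts grow so fast with extra layers and neurons.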
So let's say I 02:04:37 - 02:05:02 Dr Anand Jeyaraman: have 1 million data points. Now let's do this 1 step at a time. So what I'm going to do is, in order to determine these 11 weights, what did I say I will do? I will start off with some random set of weights, some random guess. I'll start off. 02:05:04 - 02:05:14 Dr Anand Jeyaraman: I don't want to use the same symbol over again. So let me make this W, not beta. I really apologize. I should have thought about it. Hopefully that doesn't throw you off too much. 02:05:14 - 02:05:36 Dr Anand Jeyaraman: OK. Now the initial guess of weights, I'm going to call it W0. And there are 11 numbers I need to guess. Minus 0.7, 0.8, all the way to some number, some 11 random numbers I'm going to start off with. So in my 11 dimensional space, I need to find the right set of numbers. 02:05:37 - 02:05:49 Dr Anand Jeyaraman: That's the meaning of it. So then what I'm going to do is take all of these numbers and apply them over here. So this is minus 0.7. This number is 0.8. This number is dot dot dot. Whatever I get, all the numbers, I'm applying them out to these lines. 02:05:49 - 02:06:05 Dr Anand Jeyaraman: And then I send this stock, this 1 stock I'm going to send it in, right? I send it here, it'll go ahead and get me a number. I put this number here. And I say that this 1 resulted in an up. 02:06:05 - 02:06:14 Dr Anand Jeyaraman: Then I send this stock. And again, send it through. And I get this and say that this 1 results in a down. This 1 resulted in an up. This 1 in a down. 02:06:14 - 02:06:30 Dr Anand Jeyaraman: And so on and so forth. For all the million data points, I compute this. After I compute it, I compute the error. Let's call it the sum of the errors, this error for all of these data points. 02:06:30 - 02:06:39 Dr Anand Jeyaraman: y minus y hat, which I'm going to square. I'm going to compute this. This is the error that I'm computing. Yes, go ahead. 
02:06:43 - 02:06:48 Speaker 5: So for numerical things, I can understand, but what if I have a classification where the output is 0 or 1? 02:06:48 - 02:06:50 Dr Anand Jeyaraman: It is not. It 02:06:50 - 02:06:51 Speaker 2: is not 02:06:51 - 02:07:17 Dr Anand Jeyaraman: sum of squared errors, but some error function I'm going to compute. It's a much more complicated error function. And I was trying to pull 1 over all of you guys, but you are smart enough and you quickly recognized that's not the case. But some error function I have. Okay. My goal is to reduce the value of this error function. I want it to be a number as low as possible. 02:07:18 - 02:07:47 Dr Anand Jeyaraman: This error function, generally it's not called sum of squared errors. They call it an error function or, more commonly, a cost function. This is the term that's often used, cost function. We are trying to get the value of the cost function to be as low as possible by adjusting these weights. The set of weights which gives the lowest value of the cost function is the set of weights that I'm ultimately seeking. 02:07:48 - 02:08:04 Dr Anand Jeyaraman: That's the 1. Once we have reached that value of weights, then I will say that the network is fully trained. All I'm doing is repeating something I've said before. That's all. I haven't talked about details yet. 02:08:05 - 02:08:38 Dr Anand Jeyaraman: What I do is I go ahead and put in all of these weights and compute the predicted value and compute the value of the cost function. And then the weights get adjusted through my gradient descent function. The gradient descent function gives you a rule: the n plus 1 version of the weights can be obtained by the last weights minus alpha times the gradient of the cost function. This is the gradient descent method. 02:08:39 - 02:08:54 Dr Anand Jeyaraman: Okay, don't worry if you don't understand the messiness of the formula. All I'm saying is, given 1 set of weights, how do I adjust it is what this formula is telling. 
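The rule being read out, W(n+1) = W(n) minus alpha times the gradient of the cost at W(n), is a 1-liner in code. The toy cost below (sum of squares of the weights, whose gradient is 2W) and the starting values are made up purely so the rule has something to act on; it is not the network's real cost function.

```python
import numpy as np

def gradient_descent_step(w, grad, alpha=0.1):
    """w_next = w - alpha * (gradient of the cost at w)."""
    return w - alpha * grad(w)

# toy cost J(w) = sum(w**2); its gradient is 2*w, minimum at w = 0
w = np.array([-0.7, 0.8, 0.3])    # the random initial guess
for _ in range(100):
    w = gradient_descent_step(w, lambda v: 2 * v)
```

Every entry of `w` steps forward or backward according to its own component of the gradient, and all entries move at once in each step.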
It says that you take the weights and take 1 step forward or backward depending on the gradient. 
