102106094 (1)-compressed-501-600.pdf

Full Transcript

Suppose, I have an image which we can represent as a matrix I x y and I want to calculate its gradient because gradient of image represents edge. And so, why we are calculating this grad I, please refer to the paper and the lecture? So, grad I can be calculated like this, using central scheme. And...

Suppose, I have an image which we can represent as a matrix I x y and I want to calculate its gradient because gradient of image represents edge. And so, why we are calculating this grad I, please refer to the paper and the lecture? So, grad I can be calculated like this, using central scheme. And in our formulation, external energy is equal to minus of magnitude of this grad I, although it can have other terms also like curvature terms and then density terms but I am considering just this one. So, given an image i, I can calculate its gradient and then I can calculate its external energy. Now, the term here that we have, the fourth term, it is gradient of E external. So, if I again take the gradient of E external, what will happen? 500 (Refer Slide Time: 13:20) It will, because E external is calculated at grid points, if I calculate a gradient of external, E external then it will also be calculated at the grid points. However, in the formulation, the fourth sum needs to be calculated at snake locations. (Refer Slide Time: 13:35) 501 So, for example, this is my grid. So, these black circles represent grid points. So, if there is, it is not necessary that our snake points always lie on the grid points. For example, in this figure you can see, these red circles represent snake points and they are not lying on the grid points. So, what we have to do? To calculate this grad E external, we have to interpolate the quantity that grid locations. So, we will calculate. So, first given image i, I will calculate t external and then I will again take the gradient. This gradient will give me value of grad E at these grid locations. And then finally what I will do? 502 (Refer Slide Time: 14:23) I will interpolate, use some linear interpolation or bilinear interpolation. Basically, linear interpolation. And I will interpolate these values at snake locations. So, just to show what I mean, I will show you something. (Refer Slide Time: 14:42) For example, I am just showing you. Suppose, I have an image like this, a black circle. Here, I have made it blur, so that its gradient, its boundary gets little bit diffused and this is done by, this is done in the paper as well. 503 (Refer Slide Time: 15:23) So, the effect of this blurring is this. If you will see here, it is like a well. So, if someone comes here, he will be slipped into this well. So, if we would not have used this blurring, this will be very small. But because of blurring it becomes this well becomes little wide and actually snake should come and fit into this well. This is what we want. So, this is the like plot for, this is the like surface for potential energy, E external. Now, what happens? Now, the third thing. 504 (Refer Slide Time: 16:00) So, because of this. So, now, we have some external energy and if we take its derivative, we will find the field forces, just like we have a gravitational field. But in gravitational field, all the like if I make a gravitational field, so, at all points, we face like feel the same force and downwards. So, it will be just arrows pointing downwards of all of same length, 9.8 meter per second square. But here you will see. So, like little bit magnify it. You see your image field like the field that you have, the force field that you have is not a constant but it has different directions also. So, for example, look here, maybe you see here. The forces are in rightward direction. It is although they are small but you can still see this is pointing towards. 505 Here, they are pointing towards left. So, if a particle is lying here, it will feel the force in the direction. So, it will try to come here and if it is here, it will try to come here. And the well the potential well, it is somewhere in between. For example if you can appreciate if I make it little, sorry. (Refer Slide Time: 18:22) So, you see, the original thing is here and if you are here, the force will take you towards the edge and if you are here, it will again take you towards this edge. So, the minimum energy is at this point. So, this is called quiver plot. This is not very good for this image but you at least get the idea. I will show a cleaner picture now. So, our objective is to like minimize energy and we have a force field that forces, these snake points towards this potential well. So, because here, it was not very clear, I will show you something else. 506 (Refer Slide Time: 19:20) 507 So, for example, I have created a field. This one it will show completely then I can. (Refer Slide Time: 19:47) 508 So, for example, suppose, I have some image and I calculate its gradient and find the energy due to edge and it looks like this. So, clearly you can see here along the diagonal 1, 1, along this diagonal, the energy is minimum. If you go little left, it is increasing you go little light, it is increasing. 509 (Refer Slide Time: 20:16) So, its minimize at this diagonal and you can also see this from the contour plot that the value is minimum 0.5 here and as you go here it is increasing and if you go towards this side, it is also increasing. So, the minimum energy is at in the diagonal. So, actually, you can think of like this diagonal represents the edge. 510 (Refer Slide Time: 20:54) So, if you have an energy function like this then the forces will be like this, the force field. Though at that time I was not able to show the skewers properly. So, you can see, because the edge is along 1 1. So, if a particle is here. So, it will feel a force here. It will be forced towards this edge and as it becomes, as it comes closer to the edge, this force becomes less and less. Similarly, here also, if particle is here, it feels, suppose, it comes here and it comes. And as it comes closer to this edge, it becomes, the force becomes less and less. And once it come, like settles in this edge, there is no force. Or this is the point of minimal energy where snake should actually be settled. 511 (Refer Slide Time: 21:52) Now, as I was saying earlier like our snake points need not to be coinciding with the grid point. So, for example, in this case I have this grid, these green lines. But our snake points, these are red points, they are some of them are not falling on the grid points. So, what we have to do? We have to interpolate. So, we calculate like gradients, basically because gradient of energy gives us the force. So, we calculate the forces at grid points and then we interpolate them in the snake locations. So, here you can see, we have calculated forces. 512 (Refer Slide Time: 22:38) So, here you can see, we have this energy, we took the gradient of it. So, this is the force field. (Refer Slide Time: 22:51) 513 Then to find the forces on snake coordinates, we have to use interpolation. So, the default one is a bilinear interpolation in this in MATLAB. You can have other methods also. But the idea is we have to use some sort of interpolation. So, here, I am just showing that I am showing the forces, you can see like at this point the force is towards this direction. Here the force is towards this direction. Here the force is little bit stronger than this one but it is in the same direction. So, because the edge is along 1 1. So, when the particle or the point is in this direction, it will be pulled towards the edge. And if it is on the opposite side, the left hand side of this edge, it will be pushed towards right side. You can see, this arrow and you can see because at this point is closer to the diagonal, the force is less. This is a little far. So, it is more. So, this is the interpolation idea. 514 (Refer Slide Time: 24:11) So, if I just throw some points, initial contour, if I throw some points, these black ones and this is our initial contour and I throw these points in this image field. So, where will it go? So, because of the forces that we have seen, it will try to align itself along the edge where we have it. So, you can see here, slowly due to the forces, this circle becomes an ellipse and it tries to align itself with the edge. So, I am just trying it once more. So, for example, you see here. (Refer Slide Time: 25:08) 515 516 We started with a circle, slowly, it tries to align itself with the edge. Now, here, we are completely neglecting the effect of like the influence of internal forces. So, this motion, the motion of these particles is entirely due to external force and this is an explicit computation, just like we do normally the explicit one. But when we like also take into account the property like material properties of snakes then those alpha, beta terms also come. And then things become little tricky. Because those derivatives, alpha beta derivatives are calculated at the present time t. (Refer Slide Time: 26:04) 517 So, just remember, these are terms are calculated at t. But this term is calculated at t minus 1. I have written this also, second and third terms are t but fourth term at t minus 1 that is the combination of the 2 and it is called semi-implicit. (Refer Slide Time: 26:33) So, when you solve, when you substitute these values, you will get something like this. Here you can see, b a plus 4b, this is a standard pentadiagonal matrix structure. So, in matrix form it looks like this x star t multiplied by M is equal to x star t minus 1 plus delta t times this. Computation of this we have already seen, how we will calculate it. But here you can see, we have, if it were purely implicit, this M would be like gone there will be no M. 518 But because we have this semi implicit thing and we are calculating the alpha beta related terms at time t, so, we have M here. So, this M is the cyclic pentadiagonal matrix. You see, the coefficients here are a b alpha delta t by delta s square etcetera. But this is the structure. So, if I have to find x star t, given x t minus 1, should be x star actually, x star at t minus 1. If x star at t minus 1 is given and I want to find x star at t, what I have to do? I have to take its inverse. (Refer Slide Time: 27:44) So, this is the evolution equation of snake in numerical form. So, this is the analytical form. Numerically, the same thing. ∂𝐶/∂ a = − δ𝐽 will look like, it will reduce to this. So, we have to find this M inverse and then solve this. 519 (Refer Slide Time: 28:18) So, this is what sir has also taught in the lecture. Here a plus lambda i represents M. There will be some differences in plus minus because of the science feature sometimes we can say. So, do not worry about that but the idea is this only. So, when you like implement this thing. So, the things right now that, till now that I have shown is, they do not consider the behaviour of material, behaviour of snake but you can include that also. So, in the code that we will provide to you. (Refer Slide Time: 28:53) So, this is not written by me. This is available online. It is taken from MATLAB central. So, this code includes not just the forces that I have shown here or the sir has taught there, it 520 includes some other forces also like balloon forces and gradient vector fields. Sir has discussed these forces at the last, the last slide. So, you will find those forces also there. So, it is an advanced version of snake that is being taught to you at the class. But anyway, the structure will still be there. For example, the external forces you will find something like this. (Refer Slide Time: 29:36) And you will also find the pentadiagonal structure there. You have to look at the code. And then you will have some evolution equation. So, when you will solve, so, if you run the code. So, with this is a full-fledged code that you will be given. 521 (Refer Slide Time: 29:56) So, when you will run it, you would be able to... so, the code is written by this Dirk Jan Kroon. It is available in MATLAB central. So, you can download it. So, when you will run it, you will see the full implementation of snake with all the forces and you see here. (Refer Slide Time: 30:14) This is your initial contour and this is the object that you want to segment. And you can see, this is your external energy. Why it is looking blurred? Because this image is convolved with a Gaussian filter. So, that well becomes, the potential well becomes wide enough, so that snake can fall there. And these are the external force field, just like the quiver plot I have shown you earlier. This is the real one. 522 And here you can see, it is evolving over time and slowly it is like capturing the boundaries of the object. These external forces, if you want to magnify it you can zoom it also. Sorry, I do not know what happened. This also came here. But yes. So, that quiver kind of field is available here also. And you will see that if whenever you are close to well, you will find forces from both the directions throwing the particle towards this minimum potential well. (Refer Slide Time: 31:56) So, this is what you will get. I have just taken a snapshot of it. So, your initial contour will be like this. This will be your external energy. This is a force field that is and then finally snake will evolve in such a way that it captures the boundaries. (Refer Slide Time: 32:18) 523 Another thing is you have 2 parameters alpha beta there, which determine the property of snake. So, how would you can play with these parameters? So, because you are given the code, you can play with the initialization, you can play with parameters like alpha beta and other things, you can play with the weights also. So, although there are some guidelines for all these weights but I am just giving you some guidelines for playing with alpha beta. So, if your image has deceptive image gradient, if your image is deceptive gradient, you can use high value of alpha and otherwise you can use low value. Similarly, if your object that you want to detect has some sharp features, sharp edges then the choice of beta should be low. And similarly, if it is smooth, you can choose high value of beta. Now, high and low with respect to what? With respect to like if alpha is 1, low means beta is very less. And there are some other papers also; you can just take a look. So, this completes our first tutorial on snake. We have shown you a basic implementation of snake starting with in which we have not considered any internal force and we can see that how the snakes can evolve just with the help of external force. And later we have seen the effect of internal forces also. The code that has been given to you is available in MATLAB central and it contains all the forces, external forces including the image force, gradient force, phototropic curvature and this balloon force and this gradient vector field force. So, it is an advanced version of the code but you can still like go through it and see that the structure that has been taught to you is sitting there. You can also play with alpha beta and initialization to see how well or how bad this method is. So, thank you for your time. That is all for you for the snake problem. Thanks. 524 Medical Image Analysis Professor Ganapathy Krishnamurthi Biomedical Engineering Design Indian Institute of Technology, Madras Lecture 35 Level Set Method (Refer Slide Time: 00:15) Welcome to the second tutorial of this Active Control Model. Here, we will cover Geodesic Active Control Model, which is based on level set formulation. It is different from what we have just covered, the Snake model. So, how is it different? So, in the Snake model, if we start with the initial snake, initial, an initial curve, for example it is C, the curve was C, it was parameterized by S, and then we had an explicit evolution equation of C, like del C by del t is equal to something. But in level set formulation, we do not have an, we do not have a such explicit equation for C. Instead, we have something else. 525 (Refer Slide Time: 01:12) So, what is this? So, you can refer to this page. So, the idea of… so basically these are implicit contours, so this is phi. So, what is this phi? So, instead of evolving C we evolve a surface which is represented by phi. And this, and with the help of phi, we can calculate, we can estimate the location of C in an implicit way. How? It is, as you can read here, the idea of evolving a surface instead of front, C, was proposed. And the front is defined to be all points. So, the front is defined to be all points where the surface has no height. So, phi, you can also think of as a height function, that we are going to evolve. So, this is the main difference between the 2. For more details please refer to the lecture videos, my explanation is not very good but you can refer to a lecture video. I have added the time stamps here. 526 (Refer Slide Time: 02:06) 527 528 529 So, although there is little difference in what we are going to evolve, whether it is a contour or the height function, the main theme of all these models is this. You start with a functional, and functional contains some energy terms, with that you calculate some forces also. And then use all this Euler-Lagrange equation and find some partial differential equation that you solve numerically. And then you find the… and then you do the segmentation. So in this method, an edge indicator function is first defined because it is used in the energy functional. So here, the edge indicator function is 1 2 1+|∇(𝐺σ*𝐼)| So, you would have, you have already seen this thing earlier. Given an image, we can, we use this Gaussian blurring to widen the well, widen the potential energy well. So, that is one thing. And second, and then, this indicator function as you can see, when gradient is high, gradient is high, its value becomes low. So, it becomes less. So, this is the purpose of it. So, this G is used in the energy functional here. Then we have some direct delta function also that we have to smooth because we want our function to be differentiable at all times. So we use some smooth dirac delta function. 530 And finally, we have this equation. So we are not going to the details of how we got here and even what these terms mean, but you can refer to this time stamp of lecture video to understand what this equation means. So, in the MATLAB code that we will share with you, the idea is same but the functional that has been used has some additional terms also. So, you need not to worry about it. Even though the code is little bit advanced, but you will find that the ideas that we have seen in the lecture videos are sitting there. This is the main purpose of this tutorial. So that Snake one was a more detailed tutorial but here, we are just trying to show you that whatever you have learnt in this lecture video, it is sitting there even in the advanced codes. So, for example, in this code, you will see we start with some initial phi, for example this, a binary step function. Then we have the same indicator function that we were, that you, 1 that sir taught in the class, 2. So this is how you implement it in 1+|∇(𝐺σ*𝐼)| MATLAB. So I am just attaching some snapshots of that code. So here you have initial phi then you have edge indicator function, then this is smoothing function also. So, here, they have used some other function. You can try, you can also try the one that sir taught in the class, it is up to you. So, but the idea is smoothing function is used there. And this level set evolution equation you see, here you have some additional terms also. And their weights are mu, lambda and alpha. So, if you want to like shut down some of the terms, you can just set alpha equal to 0 or mu equal to 0 and maybe, so, you understood. So, then you can like, and this is the evolution, it is the solution numerical solution for this. So the idea is, whatever you have been taught is already present in this code also. You can play with it, you can add your own terms, remove some of them, and understand the code more. So, for example, I will show some demo here. So, hopefully, you would have seen this. 531 (Refer Slide Time: 06:38) 532 So, you start with some initial, initial function, level set function. In this case, it is binary step. And after some time, it evolves. So after some 200 iterations, this binary function is evolved to this shape. And now, we are able to capture these 2 also. These 2 objects are very nicely segmented by our level set formulation. So, that is first demo. You can, there is another demo that is available. So initially you have a level set function of this type and after some time it evolves and it captures the boundary of the object. And this is the final shape of the level set function. So the main idea is, you can see, we are 533 evolving our level set function, and therefore, we are able to see the evolution of contour as well, in an implicit way. And the main thing is whatever thing, whatever ideas that sir has conveyed during the lecture video, that are present in this code as well, and some additional ideas also. So, you can take a look at this code, play with it, and understand it more. So, that is all for this tutorial. Thanks for your time. 534 Medical Image Analysis Professor Ganapathy Krishnamurthi Biomedical Engineering Design Indian Institute of Technology, Madras Lecture 36 Chan Vese Segmentation (Refer Slide Time: 00:14) 535 In this tutorial, we will see the MATLAB implementation of Chan-Vese Segmentation Model. So, this model is different from a Snake and the Geodesic Active Contour Model because of its, because of its functional. So, the overall theme, both in the explicit method or the Snake kind of method as well as the level set method that we used for Geodesic Active Contour is this. You start with some functional, then Euler–Lagrange equation, then we find a PDE and then we try to solve this PDE with some numerical method. So, the functional used for Can-Vese segmentation is different from the other 2. So, that is why we have, we are treating it separately. So, the functional is called Mumford-Shah functional. Its expression is shown on your screen. It has three terms, 𝐹𝐴, 𝐹 5, 𝐹 6. 𝐹𝐴, and here some notations are there. C is same as the evolving curve, that we are using all the time. Here, small F represents image, u represent its piecewise smooth approximation, Ω represents the complete computational domain, and Ω/ 6 represents the domain within C. So, with this notation, you can understand the functional here. The first term penalizes for large length for a, large length of the contour. So, its objective is to keep C smooth and compact. 𝐹 6, the third term, it penalizes for the presence of gradient, gradients within, within the region bounded by the C. So, basically it, its main objective is to avoid too many edges in the region bounded by the C, by the curves. 536 The second term, 𝐹 5 term is the main term, but it looks like a mean square error. Mean square error between what, the original image and its piecewise smooth approximation. So, sir has explained this in this timestamp, almost 3 minutes in the lecture video, you can refer to the video. I am taking a simple example to just explain what he taught there. So, for example if you have an image like this, so, we have an object here, this black, this, which is shown in this black color and we want to segment it from its surrounding, which is in white color. So, this white line is the curve C. So, within C, we have Ω1, and outside it, we have Ω2. So, we consider all the possible cases for the position of curve. So, and also we define our 𝐹 5 = 𝐹 (𝑐) + 𝐹2(𝑐) because here it is over all the Ω, not just inside. So, 𝐹1(𝑐) is 1 mean, calculate the mean square error within Ω1and 𝐹2(𝑐) calculates mean square, mean square error within Ω2. Here, u in this function is the area integral, basically, average of image intensity in the region, so, this is µ1, µ2. So, here you can see, very intuitively, in the first figure, in Ω2, all values have same, all the pixels have same value. So, the value is same as the average. That is why 𝐹2(𝑐) is almost equal to 0. But here, in Ω1, this will not be, this will not be the case because the average will be different from the individual pixels. In the second case, Ω1, Ω1 1 is like, it contains only black pixels. So, that is why 𝐹1(𝑐) is close to 0 but 𝐹2(𝑐) will not be 0. So, you can see it is, the fitting term is minimized only in the case when the curve is on the boundary of the object. So, that is the purpose of this, this functional term. So, that is the main, main hallmark of this Chan-Vese segmentation model, this Mumford-Shah functional. 537 (Refer Slide Time: 05:27) So, this is also a level set formulation method. Although the function is different, but still it is, it has a level set formulation. So, what happens in level set formulation is given constraints on C, we define an energy function in terms of phi. In case of Chan-Vese model, this functional F is this, with the representation which we have already seen. This F is actual image, mu 1 is what, the average and all. Here, capital H is the heavi-side function. And because it is not differentiable, it is a smooth function, a smooth version is used. 538 (Refer Slide Time: 06:06) 539 And when so, now we have a functional in terms of phi, not in terms of C, please note that. So, when we use our Euler–Lagrange equation, we find this, this PDE, del phi by del t is equal to this. Here, delta is the smoothened direct delta function. So, this is the PD that we solve. And we are not going to explain you the discretization and how, how we are going to do it. You have already seen how we calculate gradient and all, and how we take square. This, you can also do. We are providing you the code directly and the code that we are giving to you contains some other terms also, I think, as usual. So, we just want to tell you, I just want to show you that whatever is taught to you in the class is sitting in the code as well. For example, in this snippet you can see, first is identifying Ω1 and Ω2. What is Ω1, what is Ω2, what is µ1, µ2. So, here you can see, we dealt with phi. First, they found the points within Ω1, these are the interior points. Then, the exterior points, points in Ω2. Then we find the value of means. So, this is µ1 this is µ2. And then we calculate this force I minus u squared, I minus v squared. So, this is one penalty. 540 And then also, we have something for curvature that was not taught in the class that they have used. So, you can study the code and identify it, learn it and then we are solving some equation, d phi by d t. (Refer Slide Time: 08:10) So, an example implementation of Chan-Vese segmentation is here. So, you start with an input image of this aircraft, and then this is your initialization. After some 200 iterations, 250 iterations, it will completely segment the airplane. So, this is one demo. You can download other codes also, you can understand them, play with them, understand them. 541 Yeah so, that is all for Chan-Vese segmentation. We hope that all the 3 methods, you were able to understand the similarities and differences between the 3 methods with these implementations. And also because you have the codes for all the 3, you can play with it, you can implement your own terms. So, you can like upgrade the code by adding more terms, or you can also simplify the code by putting some of the weights equal to zero. And we hope that with the code that will be provided to you and the inside, and the basic idea of the code you will be able to understand these active contour models better. So, thanks for your time, hope this tutorial helps. 542 Medical Image Analysis Professor Ganapathy Krishnamurthi Biomedical Engineering Design Indian Institute of Technology, Madras Lecture 37 Neural Networks Introduction Hello, and welcome back. From this week on, we will start to look at machine learning methods for medical image analysis. We will start off with a linear regression. It is one of the similar techniques for machine learning, in machine learning. And from there, we will try to understand how, how learning methods work, and specifically, we will move on to artificial neural networks and then into a convolutional neural network, which is specifically meant for analyzing image signals. So, we will conclude with that. So, this comes to the last portion of the course. So from this week, like we said, we will look at linear regression and gradient descent. Thank you. 543 Medical Image Analysis Professor Ganapathy Krishnamurthi Department of Engineering Design Indian Institute of Technology, Madras Lecture 38 Linear Regression Hello, and welcome back. So, we start off with linear regression. We will eventually get to artificial neural networks and subsequently convolutional neural networks. (Refer Slide Time: 00:27) We start off by looking at this breast cancer data set. It was taken from Kaggle and the appropriate citation is given here, part of a scientific study. The idea here is to classify a certain breast cancer lesion as either it is benign or malignant. So, which is, basically the categories here. If you look, MMM is malignant, and of course, other category is benign, which is B, not shown here in the snapshot of the data set. So, the each of the columns that you see here would correspond to the so-called feature of the breast mass and each row corresponds to that of a patient. So, how are these so-called features computed? They are computed from a digitized image of a breast mass biopsy. It is called the fine needle aspirate. And what they describe are the characteristics of a cell nuclei present in an image. 544 So, you have a picture of the cell nuclei and you measure these properties of the cell nuclei from the image, and you list them against every patient. Now, the idea is, using these so-called measurements or features, you classify the breast cancer as either benign or as malignant. What is interesting of course, is that these features are hand calculated by, from the images of the biopsy tissue or biopsy breast cells. So, this is one, what you call, category of data that you can use in linear regression. Even though this actually, we will see later, it is probably not a good candidate for linear regression. So, here there, are the diagnosis is either M or B. So, your target y, as you call, Y is your target would be either 0 or 1 or you can even say -1 or 1. So, -1 being benign or malignant, 1 being benign. So, either or, other way. So, if you test positive then, you are malignant, if you test negative, you are benign. So, -1 would benign, 1 would be malignant. Same thing for 0,1. You assign benign and malignant accordingly. So, that is your target, that is what you want to estimate, whether a particular breast leash mass that was found may be based on a test is either benign or malignant. So, that is, this is one task that you can accomplish with this data. And this data we, we assume that all these features or some of these features have a direct bearing on this particular diagnosis of either benign or malignant. (Refer Slide Time: 03:17) 545 Another data set that is often used to illustrate linear regression is the housing price statistics. Again, this is taken from Kaggle challenge. So, you can go online, search for this data set, and there are quite a few versions of this, it seems. So, the idea here is to predict the sale price of a house given some of its features. So, these features could be continuous values, like for instance lot area, or it could be categorical variables like features like LotConfig. So, it is a mixture of categorical and continuous features and these features, you use to predict the sale price of a house and this is a classical linear regression problem. (Refer Slide Time: 04:03) 546 So, let us just come back to get back to what you mean by linear regression. So, the idea is we want, given some feature values, we should be able to predict the price of a, for instance in the case of what we saw earlier, for the housing price challenge, given some features, we should be able to estimate the price of the house. So, the price of the house is what we call the target here, that you want to estimate, Y. So, and X are the features. So, what is this f(X) that you have written there that gives you this Y. So, this function f is the function of features, X. So, if X is a vector of features like we saw, every house has a row corresponding to the table I showed you. So, the X is a vector of features and we give that as input to this function X, and its output would be 1 scalar variable which is the price of the house. So, that is the kind of problems we will typically look at in linear regression wherein the input is a vector of features and we call it X, and output is a scalar value Y. So, then again, you might say, what is this f(X) that we are trying to estimate? So, for instance in this case, I have drawn, plotted the first floor square footage on the X axis, first floor square footage on the X axis, and the Y axis is basically the price of the house, the sale price of the house. So, what we have is this very nice scatter plot. And, so it seems that, there seems to be like a linear correlation, almost, between the first floor square footage and the price. 547 So, as the first floor square footage increases, the sales price also seem to increase. Now, if we draw a line through this like this. Now, this kind of captures the trend in this plot. So, for a particular value of the square footage, you can kind of read off the value of the price of the house for this thing, you can read off right here. So, this red line here, this is what I call f(X). So, in this case, I am considering only one variable, and X in this case is also a scalar, and output Y is also one scalar. And it passes through the origin, so Y= mX, where ‘m’ is some constant, would be the answer to this, would be the solution for this line. So, so the idea is now, I hope it is clear, is to estimate this line, the equation of this line. This is what we refer to by f(X) here. Now, in 2D, it is like this, on the other hand, if you have multiple dimensions, you are just trying to estimate a hyper plane. In multiple dimensions, you are going to estimate a hyper plane. So, if you have, let us say, you are using, thinking of two features and you are trying to estimate the price, then you are just doing a plane. But if you have multiple features, here you have tens of features, we saw for housing price prediction, and, so you are trying to do this linear regression which basically estimates a hyper plane. So, if you just have one feature so you are just trying to estimate a straight line, you have multiple features, then you are trying to estimate a hyper plane. So, what is the simplest linear model that can relate Y to X, in the sense, how do we estimate Y as a function of X and you want to find that f(X). And we need to model that f(X). So, we need a model for f (X) because we currently make a best case, we do not have access to information that will tell you exactly what f(X) is. Only these correlated features and the target. So, the simplest model that can relate Y to X is given by a model which is linear in X, it 𝑇 is given by this expression Y = 𝑤 𝑋. So, you might be put off by this notation but you have seen some weird version of this in your school when you are trying to fit something to a straight line, so then you might have used some equations like y = ax + b. So, this a and b are nothing but components of w. So, this X is just one, this is in one variable, X is one variable, Y is some scalar output. 548 So, this is the equation of straight line. So, this is more general form wherein you have more than multiple dimensions, and then you just have a weight matrix. So, this w is, you can think of them as coefficients or generally you can call them weights of your model. In this case, it is a linear model, linear index, and you are trying to, and the linear regression problem is basically all about figuring out this w which will best fit this data. So, we will see what that means in the subsequent slides. (Refer Slide Time: 08:53) So, before we go any further, so we also want to look at what have, in the case of classification, what is this, what kind of regression can we do? So, let us just plot one. So, one of the features in this breast cancer data set, it would be the mean radius of the cell. It is not visible, might not be visible to you, it does not matter. So, mean radius of this, of the mass or the cells that we have measured under the microscope. So, I am just going to write that down here, going to call this mean radius. And then the Y axis is basically benign or malignant. So, we are going to say 0 if it is benign and it is 1 if it is malignant. So, once again so 0 if it is benign and 1 if it is malignant. And let us just do this simple rule that may be based on whatever little I see here, 13 millimeters, let us say here, is, these are different colors, maybe black so 13 millimeters is where anything about 13 millimeters is malignant, anything below, 549 anything, sorry, anything above 13 millimeters is malignant, anything below 13 millimeters is benign. So, you have to draw a bunch of points here just to illustrate this. Try to use some colors which you have not seen before. So, let us just use some yellow here. So, for instance these are a bunch of points which are all less than 13, which is benign, and then there is a bunch of points here, maybe, again, here, bunch of points here, which are malignant. So, this is M for malignant, this is B for benign because it is below a certain thing. And we are going to use only one variable because we can always type more but easier to illustrate. So, what is this, how would you do this classification? Ideally, the function we want has to look like this. Anything less than 13, it should always consistently give 0, I will put this as 1, and anything over 13, should immediately jump and give you 1 as this. But this is your Heavyside function, Heavyside step function. So, we are looking for a step function. So, this is your f(X). This is what we want. And this is a classification problem. So, for classification problems we want some functions like this, approximate functions like this. Here too, you can do similar analysis that I am going to describe for a linear regression, but then there could be some errors because of outliers. And that is where, that is why people generally do not use regression, the regression that I am going to describe for solving these kind of problems. These are called classification problems. And the f(X) that we want is this shape. Whenever your X, which hits 13 millimeter, you want its output to jump to 1, whenever X is less than 13 millimeter, you want its output to be at 0. So, this is the function that we want to have, f(X). So, that gives you the outcome that is the desired outcome. (Refer Slide Time: 12:16) 550 So, just to reiterate, the idea is that linear regression is a system that takes a vector x. So, I am going to say vector X because we usually have lots of features, not one feature, and it is usually outputs a scalar Y. So, most of the problems you will run into, there will be one, one variable you will be predicting, one scalar you will be predicting, that is your Y, that is your target as you call it, and your input can be a vector, it will have multiple features in it. Once again, the way to do that, so let us see, if you have, if you have a bunch of features, 𝑇 [ the model that you would I have is of this form, 𝑤 𝑋, where 𝑤 ϵ 𝑅 , basically it has n components and each for one feature value. So, the elements of X which I have been repeating is called features, and w is always referred to as the weight matrix or the weight vector. And what does it signify? The par of the weights are also known as the parameters of the model, and what they do is determine the relative importance of each of the features. So, now, you have an “explainable model”, quote unquote because if that, if a w for a corresponding X is very high, then you know for a fact that that particular feature is contributing a lot to a certain, the output. So, this is why I have shown here in this example features for both the regression problem, housing price as well as the breast cancer problem, where we have a bunch of, 551 we have a bunch of numbers, which we call them vectors. And for housing price, actually it is mixed because it has both categorical as well as or continuous values in your X. (Refer Slide Time: 14:02) So, linear regression proceeds as follows. All you have to do is predict y given X as input 𝑇 by calculating this 𝑤 𝑋. Now, we can do this provided we know w. But actually, we will not. That is what we want to estimate. We want to estimate the weights. So, we need what is known as the training data or the design matrix. So, these are the examples x vectors x in the form of a matrix. Just like that table, if you look at this in excel table, it look like a matrix. So, every row of that matrix corresponds to a data point, every column of that matrix would correspond to a particular feature. So, and with this, if you have a bunch of such examples, using that, we can actually estimate this w because once we have w, then a 𝑇 new X comes in, we always calculate 𝑤 𝑋 and predict y. So, then how do we estimate w? So, that is the question, that is the core focus of the linear regression algorithm. 552 (Refer Slide Time: 15:11) So, the way we go about estimating the weights is that we estimate the weight so that the prediction of the model is close to the ground truth. So, what I mean by the ground truth is the training data. So, training data is usually provided in pairs. So, you have a bunch of, if you have a X, I am going to use subscripts, but this is just to indicate, each subscript indicates a data point, not the component of the vector X. So, ( e𝑖 , f𝑖) you will have this pairs, let us say you have m such pairs. In all of these formulas, x is implicit and you do not mentioned it. So, we want to estimate, how do we estimate w, we want to, we take w so that the output of the model so 𝑇 f is the model output, which is, you calculate as 𝑤 𝑋, is close to your ground truth or what is, what comes with your training data as the target y. So, this particular notation is basically is called the mean square error. So, you subtract for every case in your training data or examples, you predict your f , subtract it from the real, or you subtract the real y from it, square it and you do that for every case in your or every data point in your training data and you add them and then basically you estimate the average because 1 over m will give you the average error, that is a mean square error. 553 So, once you estimate this error, you can use this somehow to adjust w, that we will see later in the form of gradient design. But the idea is how do we, when we estimate w, what is the criterion that we have to consider. w has to be estimated so that this error is minimized or very low. So, this is just using the training data that is provided to us. (Refer Slide Time: 17:10) 2 So, how do we estimate w. So, we had the expression ( f − f ) , the sum of squared, mean square. So, we want to minimize w. Now, like I said, implicit but I did not really 2 indicate is, if you just look at the expression, it is actually ( f( e, 𝑤) − f ) is what we are looking at. So, let me write this and use a different so for instance we are looking at 1 2 𝑚 Σ( f( e, 𝑤) − f ) , the summation over all the m examples 1 over m. So, the f is implicit dependence on x and w. Now, we are trying to estimate w, and how do we estimate w? So, we know that we have to minimize this expression, minimize this error, minimize this error for a choice of w. So, the easiest way to do that is to take the derivative of this loss function with respect to w and set it to 0 to obtain the minimum. When you do that, you get into a closed form expression like your analytical expression for estimate w by estimating w which you can use. 554 So, this is one way of solving the linear regression problem. We also see in some cases that we cannot, this is probably not the most efficient and then we will then start using the so-called gradient decision algorithm, but this is the general idea. So, we want to 𝑇 minimize the errors between what we predict using the w’s with the, our model 𝑤 𝑋 and the correct output, y. So, we do the sum of squares, average sum of squares or mean square error, we want to reduce that. And we want to reduce that with respect to, by twiddling with the parameters w or the weights w. And the way to do that, of course, in mathematics, is to set the last, this is called the last function or the cost function, and we take the derivative of this with respect to w, set it to 0 or the gradient with respect to w and set it to 0, and then when we do that, we can actually obtain an analytical expression for w. (Refer Slide Time: 19:40) So, if we do the math then we end up with this particular expression for w and these are known as the normal equations. And of course, I remember that ( e, 𝑤) in all matrices or vectors, but there is also, in some cases, a biased term making it an affine function. So, 𝑇 f = 𝑤 𝑋 + 𝑏. In case you are wondering where that came from, so for instance, you, if you consider usual regression, you can, for instance think about a bunch of points, you fit a straight 555 line through it so you can fit a line which goes through the origin. Here, this, for this line, b=0, but you can also fit a line for instance, which is a more appropriate, maybe something like this, here b is some value b is number. So, this is the intercept. So, X equal to 0, that is what this is. So, this is the most general model. In one variable, it is easy to illustrate. So, using this form, you can actually estimate w given your training data set. But this is probably not the most efficient way when you have very large data sets or when you are trying to do real time estimation. (Refer Slide Time: 21:10) So, we are trying to minimize this error and what is this error, this is called the training error because this is done on, this is capital X that are typically using, this is done on the training data, the example data that was given to us, we used to do this. So, this is called the training error. So, when an unseen set of X applies, so X test comes in, where this is data you have not used for this process, it is given as input, you would still like the error to be low. So, the idea is the following, we need load training error and we also want the unseen data, that is, the testing error should be comparable to the training error. So, this is where two important things arise, one is called overfitting, other one is called underfitting. See 556 overfitting is when we have extremely low training error, and you will see at some point in next lecture, next two lectures where that comes from. So, extremely low training error and then, but then when the new data set comes in, your, the testing and training error are not close, they are so far apart, testing error is much, much higher than the training error. The underfitting is when the training error is larger. Anyways, large because your model is not good enough, maybe a linear model is not good enough to fit the data. So, then your training error would be large and that is called under fitting. So, typically in, if you go to the point where you talk about deeptool networks, overfitting will be more common. (Refer Slide Time: 22:55) Now, so, we were talking about training error and test error and how it would be great if they are very close to each other. So, typically, test error will always be slightly higher than the training error depending on the model because not all models quote unquote “generalized”. Generalization is when you have a new data comes in, your model performs as well as it did on the training data. Now, one of the steps that you can take to decrease the gap between your training and test error is to inject some prior knowledge about the model. You better take some prior 557 knowledge into the model. So, one of the, there are many ways of doing that. So, one, one way is the most naive way is you should use more complicated models. So, for instance, we have we are talking about linear regression where our model is linear, by itself is linear in the variables, in the features but we can have polynomial models, polynomial models, which are more complicated, in that case also, their fitting will be very good. So, your training error will be very small, but then overfitting can happen. However, we can provide some prior knowledge about what are our expectations. So, one of the things we can do is for instance, add this kind of a term to the loss function. Here 𝑇 2 𝑤 𝑤 is sometimes that you will see in this form, 𝑤. The sum of squared values of your parameters of your model or in this case, weights of your model. So, what this term does, since we are reducing the mean square error in general, we will try to minimize the sum. And but, but what this particular thing does is since you have w squared, it is, there is incentive for w to become small for a for, a lot of the w’s to become small. So, you want not only, do you want this linear model, but the forces the weights to low values, especially by tweaking the value of λ. So, for instance, you think that some weights values are too high and so it is not stable, then you just make λ large enough when you try to do the, try to solve for it, and in that way you will force the w’s to be very small. So, basically what you are trying to do is to select a particular type of solution. By regularization, you are enforcing, you are forcing the algorithm to select a particular type of solution. So, for instance, there are different sets of w’s that might satisfy this equation. Ideally, of course, because, you typically have lots of problems with noise etc. So, we will not go into that in much detail, but if you if you look at this particular expression, by increasing λ is, what you are trying to do is making sure that this term is weighed very high in the loss function or J(w) is the loss function or the error, or the error term. And when, and by doing that you are forcing w to become very small, in fact, sometimes you can drive it to 0 depending on the form of this regularization. 558 So, what it is forcing to do is to choose a solution for w wherein all the ws are very low values. So, lot of, for a lot of problems, this is typically used, especially when you actually do not have too much data. The number of weights will be very large compared to the number of data points. And at that time, you have a, the choice is to have this kind of regularizer, think of it as adding more, more equations when there is generally less equations than there are unknowns. So, this is one way of so-called regularization of the model which helps to decrease the gap between the training and the test error. Now once again, you have the 2 𝑇 𝐽(𝑤) = ( f − f ) + λ𝑤 𝑤, we can still take the derivative with, of the left hand of this expression with respect to w, set it to 0 and arrive at a closed form expression for x. So, analytical expression still possible, is still possible by even with adding this regularization term. There are different regularization terms but each of them enforces some expectation on your solution. (Refer Slide Time: 27:23) So, now we come to this so-called hyper parameters, which are also an essential part of all these models. So, we saw in the previous slides this parameter λ that we used by, to 𝑇 weight the 𝑤 𝑤 term. By tweaking λ, you can enforce either small, small values of w. So, 559 this λ is referred to as a hyper parameter. Almost all machine learning models have some hyper parameter that you have to tune. Now how do you tune this hyper parameter, how will you make it work? Now you will be tempted to say I will put different values of λ, train my network and then of course, test it on test data. So, that is typically not how it works. So, what you will do is take a separate partition of the data called validation data, and you would tune your hyper parameter using that validation data. So, after training with different λ, you will test it on some validation data, which is different from the test data in order to tune your hyper parameters λ. Once again, hyper parameters are there for pretty much a lot of the machine learning algorithms, even neural networks have this. And this is just an indication of how many of the concepts are common across these models. So, one, this hyper parameter in this case is only one parameter, λ, but in some, many networks, for instance, many machine learning methods, there could be more than one parameter. So, you need to have a systematic way of evaluating a correct hyper parameter and for that, you would need the so-called validation data set. So, you would not tune your hyper parameters on the test set, rather, you turn it on the validation data set. (Refer Slide Time: 29:06) 560 Now linear regression can also be solved with the normal equations. This can be done. However, when the data set is very large, millions of data points, or when the learning has to be done in real time, like you are getting one point at a time and you want to estimate, keep estimating your weight matrix or weight vector, then this, our learning algorithm, so I would say that this, solving this normal equations amounts to some learning algorithm. But, so, then we have to change a different, to a different algorithm when we are trying to do only a few bunch of points at the time or even one point at a time. This is where we use gradient descent. So, we are going to take a brief look at the gradient descent algorithm and how it can be used for learning w, the weights of your model, in this case, especially for your linear model. We will see that in the next video. Thank you. 561 Medical Image Analysis Professor Ganapathy Krishnamurthi Department of Engineering Design Indian Institute of Technology, Madras Lecture 39 Gradient Descent Formulation (Refer Slide Time: 00:15) Welcome back. So, in this video, we are going to look at the gradient based optimization, specifically the steepest descent or the gradient descent optimization algorithm. So, now we saw that in the context of linear regression, we were looking at this error function or the mean squared error, which we write down and use different color, is some summation 1 2 over all the data points, 𝑚 Σ( 𝑦(𝑥, 𝑤) − 𝑦 ) , this is your ground truth or the training example y, again, y and 𝑦 are scalars, squared error. So, this, we want to find the, minimize this function, I am going to minimize this with respect to w, w or the weights or the parameters of the model. So, all machine learning algorithms will involve some optimization like this. This is called an optimization problem. And the idea is now we have to keep adjusting w. So, adjust w, till we reach minimum of MSE. So, there are many systematic ways of doing this and they fall under this category called optimization. 562 And there is, of course, there are some, there is a specific topic of optimism called context optimization, but in general we will not talk to talk about that, we will just talk about how one would go about estimating w using this gradient descent formulation, especially in the context of linear regression. (Refer Slide Time: 02:15) So, if we let us say, so let us, let us, let us consider some function f(x). So, this is a single variable, f(x) is our function, and we are trying to minimize or maximize this function, and we want to find out the x, the x that minimizes or maximizes this function. We will just say minimizes, at this point, not to get confused, x that minimizes f(x). This is what we want to find out, this is what we want to estimate. So, given this, how do you go about doing that? So, that is the question. So, for every function, like say, we are we are going to write this as 𝑦 = 𝑓(𝑥), for every function where, in this case x and y are real numbers, there is a derivative of x. So, f(x), let us say, smooth function, it does not discontinue this, we assume that it is derivative x. So, the 𝑑𝑦 derivative of this function is denoted as 𝑓'(𝑥), which is our 𝑑𝑥 , these are two ways of denoting this. Now, we know that for, in, if x is, we are looking at single variables and we know that the 𝑑𝑦 derivative 𝑑𝑥 or 𝑓'(𝑥) gives the slope, gives the slope of the function f(x) at x. That is 563 the slope of the function, that is, that is what we, and you are all familiar with this concept of the slope of the function. So, what is that, what is the slope of function tell you? It tells you how much f changes if you change x by a small amount. So, that is the idea. Or in other words, the model that you are trying to, that you are formulating is 𝑓(𝑥 + ϵ)~ 𝑓(𝑥) + ϵ𝑓'(𝑥), where the epsilon is a very small number. So, it is basically the rate of change with respect to x. So, locally, how fast it changes when you change x? So, for small changes in ϵ, now this is an approximation to this is how you calculate 𝑓(𝑥 + ϵ). So, which means that it is useful for minimizing the function because what this derivative tells you is that how to change x in order to change 𝑓(𝑥). (Refer Slide Time: 04:48) Now let us just look at a very in a simple function that we can draw and we can 2 𝑥 appreciate how this works. So, let us say we can draw, we can we can look at 𝑓(𝑥) = 2. And so, 𝑓'(𝑥) = 𝑥 so now if we draw this, one second, so just the best approximation, we can draw something of that sort. So, basically at x = 0, for this function, 𝑓'(𝑥) = 0. If you put 𝑓'(𝑥) = 𝑥 and it is at 0. So, when the, when the gradient or the derivative for this function is set to 0, it is, it is at a critical point or at an optimal point. In this case, it is 564 the global minimum. So, for negative values of x, this is negative values of x and this is positive values of x, this is x and this is 𝑓(𝑥) of course. For negative values of x, we can calculate 𝑓'(𝑥) and it will turn out that 𝑓'(𝑥) < 0 , for x negative, and 𝑓'(𝑥) > 0 for x positive. So, we can just, because you just have to plug in value here. So, 𝑓'(𝑥) = 𝑥. So, for negative values of x, our negative gradient, values, gradient value is negative, for positive value of x, gradient is positive. So, what that tells you is that even 𝑓'(𝑥) < 0, which means that when on this side when 𝑓'(𝑥) < 0, on this side 𝑓'(𝑥) > 0, means that we can move, we can, by moving to the right, we can move, by moving to the right, we can decrease the value of 𝑓(𝑥). And since on the hand side, when x is positive 𝑓'(𝑥) > 0 , we can decrease 𝑓(𝑥) by moving leftward. In a sense, we move in the direction opposite to that indicated by the gradient. So, because the gradient is, in this case, is negative. So, as we go in this, further in this direction, gradient gets more and more negative. So, we move in the opposite direction, which to, in order to minimize 𝑓(𝑥). So, this is the gradient descent technique. So, basically, the way to reduce the value of the function with respect to or optimize the function with respect to its, say, variables, independent variable, x, is to move in a direction opposite, which is opposite in sign of the directive, of the derivative. So, and we can take very small steps so that we do not overshoot the optimum. So, this is the principle behind the gradient descent algorithm. Now we are just going to see how it works for functions which are, which have multiple variables as input. So, in our case our loss function, we saw, was, was basically a function of weights, and there are multiple weights, because we have lots of features, each feature is multiplied by weight in linear regression and there is also a bias term, so the number of features plus 1, basically, and in that scenario how do we formulate this problem. So, in the one-dimensional case, all I have to do is move in, find out the sign of the gradient and move in the direction which is opposite to the, move in sense, make small steps in x with the opposite sign of the derivative. So, that is how we reduce 𝑓(𝑥) or 565 minimize 𝑓(𝑥). So, this is for one dimension. So, what if we have multiple variables, which have multiple inputs. (Refer Slide Time: 09:00) So, for instance our 𝐽(𝑤) which we had, 𝐽(𝑤) is our loss function. So, we have 1 2 𝐽(𝑤) = 𝑚 Σ( 𝑦(𝑥, 𝑤) − 𝑦 ). So, this is our 𝐽(𝑤), that is the function, this is our last function. So, the inputs this is like if you have n features of (𝑛 + 1) weights, multiple weights. So, in this case you have to be familiar with the concept of a partial derivative. So, for instance, if we will use x but it would be confusing so I will just move to w’s. So, for instance, if you calculate in this case 𝐽(𝑤), instead of 𝑓(𝑥), we just use 𝐽(𝑤). So, δ𝐽 a δ𝑤𝑖 where i is the subscript for the each of the weights, basically this tells you how J changes only with respect to this particular variable, 𝑤𝑖. So, then we can, we can do this δ𝐽 δ𝐽 δ𝐽 for every w. So, δ𝑤 , δ𝑤2 , all the way to δ𝑤𝑛+1 , which you have a bias term. 1 So, this forms a vector, so this vector. So, this is a gradient vector that we have. So, for 𝑛 functions with multiple inputs, so, basically our function 𝐽(𝑤): 𝑅 ~𝑅. So, it takes a bunch of features as input and produces a scalar output, that is what linear regression does. So, that, in that case what we want to do is take the gradient rather than the derivative, the 566 gradient is of course the vector of the partial derivatives with rest, of the, with respect to each of the input variables or the variables involved in the function. So, in this scenario, how do we, how does gradient descend look like. So, then in what direction should we go? So, for instance in one dimension, it is very simple. You find out the sign of the derivative and move in the, take steps in the direction opposite the, opposite side. That is, that is not 1D in 1D, but this is in multiple directions. So, because there now there are infinite possibilities, how, in what direction do you take the step? When I say take the step, in how much, for instance, this x is fixed, so in this case we are trying to estimate w, so we have to change all these w’s. We have (𝑛 + 1) w’s, so how do we change w’s? So, we want to estimate this ∆𝑤 so that we can do ∆𝑤 + 𝑤 so that J is minimized. So, how do we estimate that, delta w, and what direction do we take the step. (Refer Slide Time: 12:00) So, for functions like the one, like our last function or the error function, we had talked about the MSE, mean square error where 𝐽(𝑤), this has multiple inputs, w’s, basically. Remember that x is fixed, so even though I have written it as part of the function whenever I write 𝐽(𝑤), x is fixed, the only thing that the variable is w. So, then, the idea is now, how do we know the direction of the gradient. 567 𝑑𝑦 𝑑𝑦 So, for instance in one dimensional, we just did that 𝑑𝑥 , so when we 𝑑𝑥 , what we did in which is we called it as frame of x, we just defined that to be this way is 𝑓(𝑥 + ϵ)~ 𝑓(𝑥) + ϵ𝑓'(𝑥). This is the way to do it. So, now, if we, if you are going to do it this way, so for multiple inputs, so in this case, x, I am just switching back and forth between x and w, so please put up with me. In this case when I use the small x, it is just some variable in one dimensions. So, now how do we take, when we have multiple inputs, w is a vector, w is a vector. Then in this case, how do we take this derivative? So, the idea is now, we want to take something called the directional derivative. How do we take directional derivative, and we want to take it in a particular direction u. So, u is a vector, it is a unit vector, which means its magnitude is 1. u is a unit vector, its magnitude is 1, and we want to estimate the slope of the function J, here J, this function J along u. So, how would you do that? So, the mean vector notation, so the gradient of, let → → me erase and rewrite. So, (𝑢.∇𝑤𝐽(𝑤)) this is a mean square error. So, we want the gradient, we saw that this was a vector, this is a vector with so many components as the size of w. And we want to take, so we can, we will just put a vector arrow here, and we want to take a component of this vector along the unit vector u. So, you just have to do u dot, just a dot product. So, if you want a unit vector, along some unit vector if you want your estimated slope, especially in, if it has a multi-dimensional input, then this is the way to do it. 𝑇 And in matrix notation this reduces to 𝑢 ∇𝑤𝐽(𝑤). So, it is a dot product within two vectors and you can write it this way. So, now in order to minimize our J, we would like to find the direction in which J decreases the fastest. So, the idea is to find direction in which J decreases the fastest. 568 (Refer Slide Time: 15:50) So, how do we estimate this direction? So, let us see, we just, it is not too hard. So, we want to minimize this particular expression. So, what do we want to minimize? We want 𝑇 to minimize, 𝑢 ∇𝑤𝐽(𝑤) and we want to minimize this expression with respect to u, u is 𝑇 the direction, we want to estimate, of course, we know that 𝑢 𝑢 = 1 because its magnitude is 1. And the dot product, we can always write. So, if your dot product of two →→ →→ vectors, if you have two vectors 𝑎.𝑏, we know it is just 𝑎.𝑏 = |𝑎| |𝑏| 𝑐𝑜𝑠θ it is a scalar quantity. So, in this case also, we can write in this particular notation, we are using, so minimize 𝑇 with respect to u, 𝑢 𝑢 = 1. We can write this as ||𝑢||2||∇𝑤𝐽(𝑤)||2𝑐𝑜𝑠θ. So, when is this maximum? Because we want to find the direction in which f would decrease the fastest. So, in this case θ is the angle between u and the ∇𝑤𝐽(𝑤). And then of course, we know this is a unit vector, so we can say this to be 1. And since this does not depend on u, it is independent of u, this particular expression here, I am just going to, yeah, this expression is independent of u. So, now we just have to find out, independent of u. So, now we just are left with pretty much 𝑐𝑜𝑠θ. So, which means that, so we are pretty much writing something like minimum with respect to u 569 𝑐𝑜𝑠θ. This is minimized when the angle between u and ∇𝑤𝐽(𝑤) is 180 degrees. So, that is basically when u points in the direction opposite to that of the gradient. So, so in which case, so which is, which is meaning that the gradient points in the direction in which the function is increasing and the negative of the gradient points directly in a direction where the function is decreasing. So, then we can, so which means, so it is minimizing, it is when 𝑐𝑜𝑠θ =− 1 this side, and θ is 180 degrees, which tells you that u should be the direction opposite to the gradient, which means that we can decrease J by moving in the direction of the negative gradient. This is known as the method of steepest descent. So, however, this is, which implies θ or which make 𝑐𝑜𝑠θ =− 1 which means that that u and gradient of J with respect to w are pointing in opposite directions. And then, this gradient of J, and then for the way to decrease the J with respect to w because, by stepping in a direction of the negative gradient, so decrease J by stepping in the direction of negative gradient. (Refer Slide Time: 19:52) So, this is called the method of steepest descent or gradient descent, and the formula is the following. So, 𝑤' = 𝑤 − ϵ∇𝑤𝐽(𝑤) where this ϵ is called the learning rate. Again, 570 this notation might be different, some people might use just, we have already used lambda, so I am refraining from using λ. So, we can use β or some other alphabet. So, this basically a positive scalar. Epsilon is a positive scalar, this one is a positive scalar, some number, which basically constrains your step size. It constrains how big a step we will take. But this is also a hyper parameter. So, we talked about hyper parameters. This is a hyper parameter. And it is one of those things that you would optimize in most machine learning algorithms. So, we calculate this gradient numerically, and in many cases and then we keep moving by, and then of course, remember that say 𝑤' = 𝑤 − ϵ∇𝑤𝐽(𝑤). The second step, this would be the new w and then you would continue doing this iteratively, and then you stop once, yes you do not change or w does not change, appreciated, but our w does not change in a in a very, by a large amount. So, you have to fix that threshold also. So, therefore, many tricks associated with it, we will not go into detail because it is a very well established literature, a lot of variations of this algorithm is available in many software packages and you just have to know how to use them, but this is the underlying principle. Generally, you calculate the gradient of the function and then you would point, you, it would it, would point uphill basically in the way function is increasing, whereas if you want to decrease the function, you just move in the opposite direction. Now, this might look slightly difficult to comprehend why this happens, but if you, one 2 way of understanding this is using the 1D example. So, if just plot some 𝑥 , calculate the value of, or some function in 1D, and then you calculate the values of the derivative and see in which direction and if you see, look at its sign and based on that you can understand which way, in which way you have to move in order to arrive at the opt or the minimum of that function Now, this is all assuming that of course, I am just making an assumption that this is all has a global minima. There also functions with multiple and optimum. Some of them could be a maxima. Even at the maximum gradient, goes to 0, but it is kind of unstable, but still you might go, get stuck there, you might, you might have to climb that, and then 571 there are, there, there are these, in the case of neural networks, etc. they might have multiple minima. So, usually, you have to start off with some w, with some guess of w, and from there you have to refine your w. And this again, is, you can treat this like a hyper parameter, you can start off from different values of w. But then there are all these algorithms which try to compensate for that is like you start off with any arbitrary w, will get you there to a local optimum. So, in many machine learning algorithms, especially neural networks, w generally converges to a global optimum. It is only for so-called convex functions, the one, there is only one true optima and that happens to be the global optimum or the global minimum. So, this is a gradient descent. Now, what we want to do for real-time learning is we want to do something called stochastic gradient descent. We will just briefly look at what stochastic gradient descent does, but then before we go any further, this is ϵ, when I, when we do this, this, this step, this gradient of ∇𝑤𝐽(𝑤), if you think about this, this is evaluated over all the training data. So, this is something, because your last function J(w), I am going to rewrite this, sorry, let me, let me just 2 rewrite this. So, 𝐽(𝑤) = Σ( 𝑦(𝑤) − 𝑦 ). So, now what we are trying to, so what we are, what you will get here is some average gradient over all these data points. That is what we will get when you do this expression. In fact, in this case, you can actually directly, directly calculate the gradient, if you, if you did, if you take the derivative with respect to w, you will be able to, be able to calculate it. We are not going to get into that here, but the idea is for gradient descent, you will consider all the data at one time. And then, we are actually using the average gradient here, kind of, to take the step. So, if you have a very large data set, you have to use all those points to calculate this one step every time. So, that is why, we, even this form of gradient descent is probably not suitable when you have real-time data or when you have very large data sets. So, we will look at a variation of this, called stochastic gradient descent, and I think a mini batch 572 gradient descent. Now, and sometimes I think this method is also referred to as batch gradient descent. So, we will be looking at this mini batch gradient descent and stochastic gradient descent. (Refer Slide Time: 26:00) So, we were talking about the previous slide, we said if you look at the gradient 1 𝑖 𝑖 ∇𝑤𝐽(𝑤) = 𝑚 Σ ∇𝑤𝐽(𝑥 , 𝑦 , 𝑤) it is actually the average, you just compute this as an average, 1 to m. Now, this m becomes very large, you have to do this for every data point. So, as a trading set becomes big, in many cases, this trading set can run into hundreds of millions or maybe billions of examples, then this becomes computationally much harder to do. So, the stochastic gradient descent is the preferred, is the go-to algorithm here. Instead of doing this for the entire data set, what is done is to actually consider a so-called mini batch of examples. So, this is like, the understanding this is the following, so here I said, now you are kind of doing an average or ideally, I mean you should call this the expectation, but we will just work with averages. So, you are calculating empirically, you are you are approximating the expectation with an average for this particular case. Now, we have very large number of samples. 573 So, what this so-called mini batch stochastic gradient does is, so for every calculating step size, we do not have to estimate this average on the entire data, but rather a sample of the data. So, we do an average of the sample and use that to update the weights of your linear model. So, the mini batch size, which is m’ as we call it, which is a subset of m, is chosen to be a generally small number. We can be anywhere, we want usually powers of 2 sometimes, 1, 32, 64, 100, something like that but you will have tens of thousands of examples, but you might just use a mini batch of 50 to 100 to, for every step. So, that, this is, you, you choose these 100 randomly. And you can use that, and it, and in fact, the cost to compute, do a step given that you fix that m’, so the mini batch size is m’, which is a very small number. It remains the same regardless of how large your data set is, so then, but you can continue this step, take steps at the same cost, no matter how large your data set. So, then the ~ 1 𝑖 𝑖 gradient estimate, which we call, which is basically ∇𝑤𝐽(𝑤) = 𝑚' ∇𝑤Σ 𝐽(𝑥 , 𝑦 , 𝑤) this is your last function, J of, in, for individual data points. So, once you have this gradient estimate, mini batch gradient estimate then you can use your usual w goes to, and I am going to just, I know, I do not know, put a giant tilde here ~ to, this, and 𝑤 → 𝑤 − ϵ∇𝑤𝐽(𝑤) your mini batch gradient estimate. So, that is your stochastic gradient descent. So, surprisingly works very well for a wide variety of problems, optimization problems. In fact, a lot of the deep learning techniques that we see in the next, over the next couple of weeks, they all use this algorithm for estimating the weights of their models or the parameters of the model. And not only for linear regression but for a wide variety of algorithms. And it is a, it is a workhorse of modern machine learning techniques. So, again, we are not going to look at, consider how, whether it will converge, what are they, what are the samples, theoretical analysis, etc. Just to understand the, just the basis on which we, we use these algorithms, this is good enough. 574 Medical Image Analysis Professor Ganapathy Krishnamurthi Department of Engineering Design Indian Institute of Technology, Madras Lecture 40 Linear Regression Demo (Refer Slide Time: 00:15) 575 576 577 578 Hi, everyone, I am Krishna Khalori, PA of the Ganapathy Krishnamurty Sir. In all artificial intelligence related problems, in all artificial intelligence related problems, be it either ML problem or a DL problem, like machine learning problem or deep learning problems, to solve any kind of problem, fundamental thing that is necessary is data. Once we have data, using this particular data, we are going to train a particular model, train a model, using this particular data we are going to train a model. Once the, that model is trained, we will you use some new data to feed as input to the trained model. So, it will predict some output. The main goal is we should train this 579 particular model in such a way that whenever new data comes, it should predict in a desired way. This is the whole idea of this entire training model. What sort of data that is available in digital domain? First thing is, images you can expect, second thing is videos, you can expect, third thing is like CSV or Excel files, mostly the numerical values or tabular datas will be stored in CSV or Excel files, videos, similarly, we can have text data. Let us assume that these are the four sorts of data. One more thing is more or less like speech kind of data, speech data or we can say one dimensional data. These are the several ways in which datas are available in digital domain. So, whenever you want to feed the machine learning model or deep learning model data, data should be sent in a particular format to both machine learning and deep learning problems or deep learning models irrespective of the data, irrespective of the data, like images, video, speech data, one-dimensional text data or CSV, Excel files, all these data should be converted into NumPy Array, NumPy Array. This is the fundamental thing one should be noted while dealing with machine learning and deep learning problems. Summary is, irrespective of the data, we should convert that kind of a data into NumPy Array format. Suppose if you take as ML problem, machine learning problem, NumPy Array can be directly fed into the machine learning models, whereas into the deep learning models, we should convert this again, convert this NumPy Arrays into tensors before feeding the data into deep learning models. This is the usual way. So, one should be familiar with NumPy Array, NumPy Array like, there is in Python. Now, coming to second thing, in order that, to load files like CSV files or Excel files, we need pandas. Pandas library, where, in this pandas library we are going to feed the CSV files as data frames, data frames. Once you convert this CSV files or Excel files into data frames, further it will be easy for us to handle the data to do our data analysis, analysis. So, to work, just I am explaining whatever are libraries that are very much important to start off working with either machine learning problems or deep learning problems. Fundamental thing that is necessary is NumPy Array, second thing is pandas library, third thing is Matplotlib, Matplotlib. While dealing with the data, you always try to visualize 580 the data, like suppose if I have Excel sheet data I just want to see a column of data as a graph. I can use that particular column of data and I can use this Matplotlib for the visualization of the particular column however that data has been distributed. To get some visual picture, we go for Matplotlib library. One more library that is also, mostly people use, that is like Seaborn, s-e-a-b-o-r-n. So, one should be familiar with both, Matplotlib and Seaborn is an added advantage. So, if you know Matplotlib alone, that is also fine. Next thing is machine learning models, machine learning models. To get familiarized with the machine learning models, or how to use this particular machine learning models, one should be familiar with scikit-learn library, one should be familiar with scikit-learn library. Irrespective of the data, whatever you have data, you should convert that particular data into NumPy Arrays. After converting into NumPy Array, we should use the data directly into the scikit-learn library. Now, coming back to deep learning models, and after that to train your, to train the models of deep learning models, we just have to be familiarized with the either PyTorch library provided by Facebook or TensorFlow library, TensorFlow library. So, make sure that get familiarized with either PyTorch library or TensorFlow library. Before feeding data into this particular PyTorch or TensorFlow libraries, our PyTorch or TensorFlow models, first, we convert the data into NumPy Arrays, NumPy Arrays. After converting into NumPy Array, we should convert that into basic tensor format. These deep learning models expect data in terms of tensor format. This is the whole idea. The summary is to start off your work in machine learning problems or with the deep learning problems you should be familiar with these kind of libraries. First and foremost, first and foremost is for NumPy library, second thing is pandas library, third thing is Matplotlib library, fourth thing is machine learning models related, that is scikit-learn level library, next thing is deep learning models related models that is PyTorch library or TensorFlow library. TensorFlow library is provided by Google, PyTorch library provided by Facebook, that is it. 581 So, to get your hands on using Python to solve any kind of AI related problems, you should be aware of this stuff. Once we obtained the data, the usual procedure to solve any ML problem is to please split this data into training data. Second thing is validation data, third thing is test data. Suppose if you have some 1,000 data points, you can divide either it like 70 percent is for train, 20 percent is for validation, and third one is test data. But why, that is the question? Suppose, I have used this, instead of using this validation data in between, what I can do is directly, I can use 70 percent is for training and the whole 30 percent is for testing. Mostly, ML problems suffer from overfitting problems, problems, that is, the matrix, let us assume that the matrix is loss value, or error value. That particular loss value is less for trained data. After training, we are going to get very less error or loss value for the trained data, whereas if you test the, that, if you use the test data on this trained model, you will get loss value as high, loss value as high for the test data. This is nothing but overfitting problem. So, our goal is to always try to maintain the balance of both trained data and test data, loss values of both trained data and tested data should be more or less same. That is how we should design the models or train the models. So, what happens, suppose this, on this particular test data, model is not giving less value just like for the training, just as training error. That means, again, you have to train the model. So, suppose second time also I have done and I have used the test data, same thing happened. Again, third time again, I have trained. Now, I used my test data, again happened that we are not reaching the error value or loss value more or less same to trained data. So, while training the model itself, instead of after training and doing up, instead of doing the validation after the entire compression of training, we, we came up with validation part. We came up with validation part while the model is training. So, as the, instead of, for, from those 30 percentage, we are converting into 20 percentage into 10 percentage. This 20 indicates validation, this one indicates test. So, when the model is training, validation loss is not improving or I mean validation loss is not decreasing. Again, the model will train in such a way that validation loss also 582 should be low. So, the advantage is here. While the, while training the model itself, we are just testing whether that particular model is performing well or not. So, once this entire training is done, only once we can use the test data whether this model is performing well or not. That is the basic idea behind using validation data set. The summary is, validation data will help you to better, to train the model in a better way while that, while doing the training itself. So, the usual procedure is either you can split the entire data into a 70 percentage, 20 percentage or 10 percentage or 80, 10 percentage or 10 percentage. This is how you should use this one. Next thing is cross validation, one should be familiar with. Cross validation is nothing but, since I am using validation, assume that I am training my model like this. Suppose this is my trained data, enter trained data. One iteration is done. In the second iteration, I will use this part for validation and remaining part for training. Here, I am using this part for, suppose this entire data is like, whole data which we are going to train. While training I am assuming that in the first iteration, all these points, I am taking as trained, and these points for validation. Let us assume that I am doing this cross validation like 5 times. In the second iteration, what I do, instead of doing, taking these same points as validation I am taking some other points and remaining as again, they act as training points only. Similarly, in the third iteration, we can choose these points as validation part, remaining as training points. Likewise, I will do for all the five iterations. This is nothing but cross validation concept. To do all these tasks, to automate all these tasks, whatever we have, something like cross validation, and all the implementing of this model, everything, we should be familiar with the libraries. That is it. There is nothing else. 583 (Refer Slide Time: 15:10) 584 585 586 587 588 589 590 Now coming to ML problem settings. Most of the times, you can see that either classification problems are regression problems mostly. Suppose if you observe this breast cancer data set which Sir has discussed in the video also. Here, this is how the data set will be given. Like we have some x1, x2, x3, like that some features we will have, and this one is called targets. These are called features, these are nothing but targets. This is how the data has been provided. Suppose, in terms of classification, the output values, like target variables are given as depending on the number of classes. Suppose I have some n number of classes. So, I will give the digits from 0 to (n-1), that is the usual criteria. Even you can give some distinct numbers also, that is up to you, but mostly this is how the people will give in problem setting with respect to classification. Now, coming to the regression problems, we will given, we will be given same like, that CSV file where we have x features. These are nothing but the features and these are nothing but the targets. Regression means continuous. Here, if you observe the outputs here, you can see they are all continuous values, like 2.5, 3.5 or so many other kind of any real number. So, in summary, in, see, most of the times we are dealing with the regression problems, we will be given data in terms of either CSV files or Excel sheets where you will be 591 given the, suppose we have n number of data points, for all the n number of data points, we have some m features, and correspondingly, we will have target variable also. This is how the data has been given. Just here I have discussed like this, where features and targets are given. Our task is to, using this particular data like suppose you have some m data points, m data points you have, I have to prepare a particular model or train a model by using this kind of data set. That is my task. Here, especially the linear regression problems, where we will be given data like this, and we have to generate a function like this. ^ Like f = β0 + β1𝑥 where suppose, assume that for all the data points, we have only two dimensionals, only, the data is having only one dimension, like, suppose I have only x, well x has only one column. My target anyway, it has only one column. Suppose, you have only one feature, one feature or in terms of, you can say unique variable or unique variable, unique variable you have. At the point of time, we will go for univariate linear regression, where you will try to ^ generate this function f = β0 + β1𝑥 using this particular loss function, we are going to estimate the β0 and β1. This will be taken care by the library, what I have mentioned as a scikit-learn library, it will take care of to estimating this stuff. After fitting the model, we, in theoretical part, we say that we will estimate β0 and β1. Once we know the β0 and β1 values, for any new kind of data point comes, suppose x ^ value, new value comes, like XNEW, now, I will substitute here. Lout = β0 + β1𝑋NEW I will give, β0, I already know from the model, β1, I know already from the model, and I will substitute this new value and I will predict this particular Yout. This is how the criteria goes. In summary, suppose you have only one variable in terms of independent variables or in terms of features, we can say, where y is nothing but the target, at the time, we call it as univariate linear regression. There, we are assuming just this particular function, β0 + β1𝑥. It will take care of. So, we are saying that the training is going on. But while the training is going on, we should be in a position to evaluate, while the training itself is going on. 592 Like, we need some metric, while the training is going on, we have to see some metric. So, if you are able to find that, that metric is good enough, that means, the training is going well. Like that, in linear regression problems, we have three matrix. Most of the times people will use these three matrix only. Like R2 value, it is given by here, this kind of a formula, where yi is nothing but the original target, targets or the data that is given. ^ Like do not put data when it comes to fi. This is nothing but the predicted value. This is nothing but the original target value, this is nothing but the predicted value. Next, f‾i is nothing but mean of all the target values, mean of all target values. This is one sort of a matrix. We expect this value to be 0 to 1, 0 to 1 in general, and I need a value of 1. Suppose, after fitting the model, if you are getting 1 value, that means we have developed a very good number, that is very good. Next thing is mean square value. This is the matrix, similarly, you can always, you can also say that it is mean absolute error. These are the three metrics, we can see, in designing a linear regression problem. Till now, I have discussed only matrix related to linear regression, at the same time I have discussed only with respect to univariate linear regression only. That means independent variables are features, we have only one, targets we have. Now, coming to multivariate regression. It is just another extension of this univariate regression. Instead of only one β0 and β1, we are extending these coefficients. Now, suppose in case I have x1, x2, x3, x4, likewise I have xn, like n features I have, for every data point I have n features, n data points, n features, we just have to build this particular line. Usually, since it is not a two dimensional plane, that is why we call this one as a hyper plane, not a line, hyper plane we call. Here we, instead of estimating just β0 and β1, we are extending our task and we are estimating remaining things also. I have already told you that is the basic difference between multivariate linear regression and univariate regression. That is, univariate regressi

Use Quizgecko on...
Browser
Browser