Machine Learning Lecture Notes
These lecture notes cover polynomial regression and linear regression with multiple features, including gradient descent for multiple variables, feature scaling, choosing the learning rate, and the normal equation. The notes include definitions, applications, and examples of these machine learning concepts, along with the equations used to implement them.
Polynomial Regression

Polynomial regression is a form of linear regression in which the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial. Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y | x).

How does polynomial regression work?
- To get from linear regression to polynomial regression, all we do is add higher-order terms of the existing features to the feature space (see the sketch below).
- This is sometimes loosely called feature engineering, although it is not quite the same thing.

Applications of polynomial regression
- Polynomial regression is used so widely because almost all real-world data is non-linear, so fitting a non-linear model (a curvilinear regression line) usually gives far better results than standard linear regression.
- Some use cases:
  - the growth rate of tissues
  - the progression of disease epidemics
  - the distribution of carbon isotopes in lake sediments
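To make the idea concrete, here is a minimal Octave sketch of fitting a cubic by adding higher-order terms of a single feature x. The data is invented, and the least-squares backslash operator stands in for the gradient-descent and normal-equation machinery developed later in these notes.

    % Hypothetical data: a noisy cubic relationship between x and y.
    x = (0:0.5:10)';
    y = 2 + 0.5*x - 0.3*x.^2 + 0.05*x.^3 + 0.5*randn(size(x));

    % Add higher-order terms of x as extra features, then fit a linear model.
    X = [ones(size(x)) x x.^2 x.^3];
    theta = X \ y;                 % least-squares fit; theta is 4 x 1

    % The model is still linear in the parameters theta, even though the
    % fitted curve is a cubic in x.
    y_hat = X * theta;
    plot(x, y, 'o', x, y_hat, '-');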
Linear regression with multiple features

- In the previous lecture we talked about linear regression with one variable. This is the new version of linear regression, with multiple features.
- Multiple variables = multiple features.
- In the original version we had
  - x = house size, used to predict
  - y = house price
- In the new scheme we have more variables (such as the number of bedrooms, number of floors, and age of the home), so x1, x2, x3, x4 are the four features:
  - x1 - size (feet squared)
  - x2 - number of bedrooms
  - x3 - number of floors
  - x4 - age of home (years)
  - y is the output variable (price)
- More notation:
  - n = number of features (here n = 4)
  - m = number of examples (i.e. the number of rows in the table)
  - x^(i) = the input vector for the ith example (a vector of the four feature values for the ith training example); i is an index into the training set, so x^(i) is an n-dimensional feature vector. x^(3), for example, is the 3rd house and contains the four features associated with that house.
  - xj^(i) = the value of feature j in the ith training example. So x2^(3), for example, is the number of bedrooms in the third house.

Now that we have multiple features, what is the form of our hypothesis?
- Previously our hypothesis took the form
  - hθ(x) = θ0 + θ1x
  - Here we have two parameters (θ0 and θ1), determined by our cost function, and one variable x.
- Now that we have multiple features:
  - hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4
- For example
  - hθ(x) = 80 + 0.1x1 + 0.01x2 + 3x3 - 2x4
  - An example of a hypothesis trying to predict the price of a house. The parameters are still determined through a cost function.
- For convenience of notation, define x0 = 1
  - For every example i you have an additional 0th feature.
  - So now your feature vector is an (n + 1)-dimensional column vector indexed from 0, called x. Each example has such a column vector associated with it; say we have a particular example and call it X.
  - The parameters are also kept in a 0-indexed, (n + 1)-dimensional column vector, called θ. This vector is the same for every example.
- Considering this, the hypothesis can be written
  - hθ(x) = θ0x0 + θ1x1 + θ2x2 + θ3x3 + θ4x4
- Or equivalently, hθ(x) = θ^T X
  - θ^T is a [1 x (n+1)] matrix. In other words, because θ is a column vector, the transposition operation transforms it into a row vector: before, θ was an [(n+1) x 1] matrix, and now θ^T is a [1 x (n+1)] matrix.
  - That means the inner dimensions of θ^T and X match, so they can be multiplied together: [1 x (n+1)] * [(n+1) x 1] = [1 x 1].
  - So the transpose of our parameter vector times an input example X gives a predicted hypothesis of dimension [1 x 1], i.e. a single value. The x0 = 1 convention is what lets us write it like this.
- This is an example of multivariate linear regression.

Gradient descent for multiple variables

- Fitting the parameters of the hypothesis with gradient descent.
- The parameters are θ0 to θn. Instead of thinking about them as n + 1 separate values, think of the parameters as a single (n + 1)-dimensional vector θ.
- Our cost function is
  - J(θ) = (1/2m) * Σ from i = 1 to m of (hθ(x^(i)) - y^(i))²
  - Similarly, instead of thinking of J as a function of n + 1 separate numbers, J is just a function of the parameter vector: J(θ).
- Gradient descent: once again, the update is
  - θj := θj - α * (∂/∂θj) J(θ)
  - i.e. θj becomes θj minus the learning rate (α) times the partial derivative of J(θ) with respect to θj.
  - We do this through a simultaneous update of every θj value.
- Implementing this algorithm
  - When n = 1 we had slightly different update rules for θ0 and θ1. Actually they're the same, except that the θ0 rule ends with the previously undefined x0^(i), which equals 1 and so wasn't shown.
  - We now have an almost identical rule for multivariate gradient descent:
    - θj := θj - α * (1/m) * Σ from i = 1 to m of (hθ(x^(i)) - y^(i)) * xj^(i)
  - What's going on here?
    - We do this for each j (0 up to n) as a simultaneous update (just as when n = 1).
    - So we reset θj to θj minus the learning rate (α) times the partial derivative of the cost of the θ vector with respect to θj.
    - In non-calculus words, this means: the learning rate, times 1/m (which makes the maths easier), times the sum over EACH example of (the hypothesis applied to that example's feature vector, minus the actual value, times the j-th value of that feature vector).
  - It's important to remember that these algorithms are highly similar; a vectorised sketch of the update follows below.
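A minimal Octave sketch of this simultaneous, vectorised update. The function name gradient_descent is my own label; X is assumed to be the [m x (n+1)] matrix of training examples with the x0 = 1 column already added, and y the [m x 1] vector of target values.

    function [theta, J_history] = gradient_descent(X, y, theta, alpha, num_iters)
      % Batch gradient descent for multivariate linear regression.
      m = length(y);
      J_history = zeros(num_iters, 1);
      for iter = 1:num_iters
        errors = X * theta - y;                               % h_theta(x^(i)) - y^(i) for every example
        J_history(iter) = (1 / (2 * m)) * sum(errors .^ 2);   % record J(theta) for this iteration's theta
        theta = theta - (alpha / m) * (X' * errors);          % simultaneous update of every theta_j
      end
    end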
Gradient descent in practice 1: Feature scaling

Having covered the theory, we now move on to some practical tricks.

Feature scaling
- If you have a problem with multiple features, you should make sure those features have a similar scale. This means gradient descent will converge more quickly.
- e.g.
  - x1 = size (0 - 2000 feet squared)
  - x2 = number of bedrooms (1 - 5)
- The huge range difference means the contours we get if we plot θ1 vs. θ2 have a very tall, thin shape. Running gradient descent on this kind of cost function can take a long time to find the global minimum; it's pathological input to gradient descent.
- So we need to rescale the input to make gradient descent more effective.
- If you divide each value of x1 and x2 by the maximum for that feature, the contours become more like circles (as everything is scaled between 0 and 1).
- You may want to get everything into roughly the -1 to +1 range. Avoid large ranges, small ranges, or very different ranges from one another.
- Rule of thumb regarding acceptable ranges:
  - -3 to +3 is generally fine; anything bigger is bad.
  - -1/3 to +1/3 is OK; anything smaller is bad.
- Can also do mean normalization:
  - Take a feature xi and replace it by (xi - mean) / max, so your values all have an average of about 0.
  - Instead of the max you can also use the standard deviation.

Learning rate α

Focus on the learning rate (α). Topics:
- The update rule
- Debugging
- How to choose α

Make sure gradient descent is working
- Plot min J(θ) vs. the number of iterations (i.e. plot J(θ) over the course of gradient descent).
- If gradient descent is working, then J(θ) should decrease after every iteration.
- The plot can also show whether you're no longer making big gains after a certain number of iterations.
  - You can apply heuristics to reduce the number of iterations if need be.
  - If, for example, after 1000 iterations the parameters change by nearly nothing, you could choose to run only 1000 iterations in the future.
  - Make sure you don't accidentally hard-code thresholds like this and then forget why they're there, though!
- The number of iterations needed varies a lot: 30 iterations, 3,000 iterations, 3,000,000 iterations. It's very hard to tell in advance how many iterations will be needed, but you can often make a guess based on a plot like this after the first 100 or so iterations.
- Automatic convergence tests
  - Check whether J(θ) changes by some small threshold or less.
  - Choosing this threshold is hard, so it's often easier to check for the curve flattening into a straight line. Why? Because we're seeing the flattening in the context of the whole run.
  - Could you design an automatic checker which calculates a threshold based on the system's preceding progress?
- Checking it's working
  - If you plot J(θ) vs. iterations and see the value increasing, you probably need a smaller α. The cause is that you're overshooting the minimum of the function you're minimizing; reduce the learning rate so you actually reach the minimum. So, use a smaller α.
  - Another problem might be J(θ) looking like a series of waves; here again, you need a smaller α.
  - However, if α is small enough, J(θ) will decrease on every iteration. BUT if α is too small, the rate of convergence is too slow: a less steep decline indicates slow convergence, because we're decreasing J by less on each iteration than a steeper slope would.
- Typically
  - Try a range of α values and plot J(θ) vs. the number of iterations for each version of α.
  - Go for roughly threefold increases: 0.001, 0.003, 0.01, 0.03, 0.1, 0.3 (a sketch of this comparison follows below).
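As a minimal sketch of both tricks together, the snippet below mean-normalizes every feature (using the standard deviation rather than the max) and then compares several learning rates by plotting J(θ) against the iteration number, reusing the gradient_descent function sketched above. The 400-iteration count is an arbitrary choice for illustration.

    % Mean normalization of every feature except the x0 = 1 column.
    mu = mean(X(:, 2:end));
    sigma = std(X(:, 2:end));
    X(:, 2:end) = (X(:, 2:end) - mu) ./ sigma;

    % Try a range of alphas (roughly threefold increases) and plot J(theta)
    % against the iteration number for each one.
    alphas = [0.001 0.003 0.01 0.03 0.1 0.3];
    hold on;
    for k = 1:length(alphas)
      [~, J_history] = gradient_descent(X, y, zeros(size(X, 2), 1), alphas(k), 400);
      plot(1:400, J_history);
    end
    hold off;

A curve that rises or oscillates points to an α that is too large; a curve that falls but only very slowly points to one that is too small.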
Features and polynomial regression

The choice of features, and how you can get different learning algorithms by choosing appropriate features; polynomial regression for fitting non-linear functions.

Example: house price prediction
- Two features:
  - Frontage - width of the plot of land along the road (x1)
  - Depth - depth of the plot away from the road (x2)
- You don't have to use just these two features; you can create new features.
- You might decide that an important feature is the land area, so create a new feature x3 = frontage * depth and use h(x) = θ0 + θ1x3. Area is a better indicator.
- Often, by defining new features you may get a better model.

Polynomial regression
- A polynomial may fit the data better, e.g. θ0 + θ1x + θ2x², a quadratic function.
- For housing data you could use a quadratic function, but it may not fit the data so well: the inflection point means housing prices would decrease when the size gets really big. So instead you might use a cubic function.
- How do we fit the model to this data? To map our old linear hypothesis and cost functions to these polynomial descriptions, the easy thing to do is set
  - x1 = x
  - x2 = x²
  - x3 = x³
- By selecting the features like this and applying the linear regression algorithms, you can do polynomial regression.
- Remember, feature scaling becomes even more important here.
- Instead of a conventional polynomial you could also use variable^(1/something), i.e. square root, cube root, etc.
- With lots of candidate features, we'll later look at developing an algorithm to choose the best ones. A sketch of this kind of feature construction follows below.
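Here is a minimal sketch of such feature construction for the housing example. The column vectors frontage, depth and price are hypothetical training data, and each engineered feature is divided by its maximum because, as noted above, scaling matters even more once powers are added.

    % New feature: land area = frontage * depth.
    area = frontage .* depth;

    % Cubic model of price vs. area, with each power divided by its maximum
    % so the features stay on comparable scales.
    x1 = area      / max(area);
    x2 = area .^ 2 / max(area .^ 2);
    x3 = area .^ 3 / max(area .^ 3);
    X  = [ones(size(area)) x1 x2 x3];

    theta = X \ price;   % the same linear regression machinery now fits a cubic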
Normal equation

For some linear regression problems the normal equation provides a better solution. So far we've been using gradient descent, an iterative algorithm which takes steps to converge. The normal equation solves for θ analytically: it solves for the optimum value of θ directly. It has some advantages and disadvantages.

How does it work?
- Simplified cost function: J(θ) = aθ² + bθ + c, where θ is just a real number, not a vector. The cost function is then a quadratic function.
- How do you minimize this? Take the derivative of J(θ) with respect to θ and set that derivative equal to 0. This allows you to solve for the value of θ which minimizes J(θ).
- In our more complex problems, θ is an (n + 1)-dimensional vector of real numbers and the cost function is a function of that vector. How do we minimize this function? Take the partial derivative of J(θ) with respect to each θj, set it to 0 for every j, and solve for θ0 to θn. This gives the values of θ which minimize J(θ).
- If you work through the calculus and the solution, the derivation is pretty complex, and we're not going through it here. Instead, here is what you need to know to implement the process.

Example of the normal equation
- Here m = 4 and n = 4.
- To implement the normal equation:
  - Take the examples and add an extra column (the x0 feature).
  - Construct a matrix X (the design matrix) which contains all the training-data features in an [m x (n+1)] matrix.
  - Do something similar for y: construct a column vector y of size [m x 1].
  - Then use the equation θ = (X^T X)^-1 X^T y, i.e. (X transpose times X) inverse, times X transpose, times y.
  - If you compute this, you get the value of θ which minimizes the cost function.

General case
- You have m training examples and n features.
- The design matrix X: each training example is an (n + 1)-dimensional feature column vector. X is constructed by taking each training example, taking its transpose (i.e. column -> row) and using it as a row of the design matrix. This creates an [m x (n+1)] matrix.
- The vector y is built by stacking all the y values into a column vector.
- What is (X^T X)^-1? It's the inverse of the matrix X^T X, i.e. if A = X^T X then A^-1 = (X^T X)^-1.
- In Octave and MATLAB you could do

      pinv(X' * X) * X' * y

  where X' is the notation for X transpose and pinv is a function for the (pseudo-)inverse of a matrix.
- In a previous lecture we discussed feature scaling: if you're using the normal equation then there is no need for feature scaling.

When should you use gradient descent and when should you use the normal equation?
- Gradient descent
  - You need to choose a learning rate.
  - It needs many iterations, which could make it slower.
  - But it works well even when n is massive (millions), so it's better suited to big data.
  - What is a big n, though? 100 or even 1000 is still (relatively) small; if n is 10,000 or more, look at using gradient descent.
- Normal equation
  - No need to choose a learning rate; no need to iterate, check for convergence, etc.
  - But the normal equation needs to compute (X^T X)^-1, the inverse of an (n+1) x (n+1) matrix. With most implementations, computing a matrix inverse grows as O(n³), so that's not great: it's slow if n is large, and can be much slower.

Normal equation and non-invertibility
- An advanced concept: often asked about, but quite advanced and perhaps optional material. A phenomenon worth understanding, but probably not necessary.
- When computing θ = (X^T X)^-1 X^T y, what if X^T X is non-invertible (singular/degenerate)?
  - Only some matrices are invertible, and this should be quite a rare problem.
  - Octave can invert matrices using pinv (pseudo-inverse), which gets the right value even if X^T X is non-invertible, as well as inv (the ordinary inverse).
- What does it mean for X^T X to be non-invertible? Normally there are two common causes:
  - Redundant features in the learning model, e.g. x1 = size in feet squared and x2 = size in meters squared (one is just a rescaling of the other).
  - Too many features, e.g. when m ≤ n (more features than training examples).
A small worked sketch of the normal equation, including the redundant-feature case, follows below.
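Finally, a minimal sketch of the normal equation on a tiny invented training set, and of what happens when X^T X becomes singular because of a redundant feature. The numbers are made up for illustration only.

    % Design matrix: x0 = 1, size (feet squared), number of bedrooms; m = 4.
    X = [1 2104 5;
         1 1416 3;
         1 1534 3;
         1  852 2];
    y = [460; 232; 315; 178];          % prices

    theta = pinv(X' * X) * X' * y;     % normal equation

    % Redundant feature: size in meters squared is just a rescaling of the
    % size column, so X' * X becomes singular.  inv() would warn that the
    % matrix is singular, but pinv() still returns a usable theta.
    X_red = [X, X(:, 2) * 0.0929];
    theta_red = pinv(X_red' * X_red) * X_red' * y;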