Machine Learning Lecture Notes
These lecture notes cover polynomial regression and linear regression with multiple features, including gradient descent for multiple variables, feature scaling, choosing the learning rate, and the normal equation. The notes include definitions, applications, and examples of these machine learning concepts, along with the equations used to implement them.
Polynomial Regression

Polynomial regression is a form of linear regression in which the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial. Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y | x).

How does polynomial regression work?
- To get from linear regression to polynomial regression, all we do is add higher-order terms of the existing features to the feature space (see the sketch below).
- This is sometimes loosely called feature engineering, although it is not quite the same thing.

Applications of polynomial regression
- Polynomial regression is used so widely because almost all real-world data is non-linear, so fitting a non-linear model (a curvilinear regression line) usually gives far better results than standard linear regression.
- Some use cases:
  - the growth rate of tissues
  - the progression of disease epidemics
  - the distribution of carbon isotopes in lake sediments
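To make the idea concrete, here is a minimal Octave sketch of fitting a cubic by adding higher-order terms of a single feature x. The data is invented, and the least-squares backslash operator stands in for the gradient-descent and normal-equation machinery developed later in these notes.

    % Hypothetical data: a noisy cubic relationship between x and y.
    x = (0:0.5:10)';
    y = 2 + 0.5*x - 0.3*x.^2 + 0.05*x.^3 + 0.5*randn(size(x));

    % Add higher-order terms of x as extra features, then fit a linear model.
    X = [ones(size(x)) x x.^2 x.^3];
    theta = X \ y;                 % least-squares fit; theta is 4 x 1

    % The model is still linear in the parameters theta, even though the
    % fitted curve is a cubic in x.
    y_hat = X * theta;
    plot(x, y, 'o', x, y_hat, '-');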
Linear regression with multiple features

- In the previous lecture we talked about linear regression with one variable. This is the new version of linear regression, with multiple features.
- Multiple variables = multiple features.
- In the original version we had
  - x = house size, used to predict
  - y = house price
- In the new scheme we have more variables (such as the number of bedrooms, number of floors, and age of the home), so x1, x2, x3, x4 are the four features:
  - x1 - size (feet squared)
  - x2 - number of bedrooms
  - x3 - number of floors
  - x4 - age of home (years)
  - y is the output variable (price)
- More notation:
  - n = number of features (here n = 4)
  - m = number of examples (i.e. the number of rows in the table)
  - x^(i) = the input vector for the ith example (a vector of the four feature values for the ith training example); i is an index into the training set, so x^(i) is an n-dimensional feature vector. x^(3), for example, is the 3rd house and contains the four features associated with that house.
  - xj^(i) = the value of feature j in the ith training example. So x2^(3), for example, is the number of bedrooms in the third house.

Now that we have multiple features, what is the form of our hypothesis?
- Previously our hypothesis took the form
  - hθ(x) = θ0 + θ1x
  - Here we have two parameters (θ0 and θ1), determined by our cost function, and one variable x.
- Now that we have multiple features:
  - hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4
- For example
  - hθ(x) = 80 + 0.1x1 + 0.01x2 + 3x3 - 2x4
  - An example of a hypothesis trying to predict the price of a house. The parameters are still determined through a cost function.
- For convenience of notation, define x0 = 1
  - For every example i you have an additional 0th feature.
  - So now your feature vector is an (n + 1)-dimensional column vector indexed from 0, called x. Each example has such a column vector associated with it; say we have a particular example and call it X.
  - The parameters are also kept in a 0-indexed, (n + 1)-dimensional column vector, called θ. This vector is the same for every example.
- Considering this, the hypothesis can be written
  - hθ(x) = θ0x0 + θ1x1 + θ2x2 + θ3x3 + θ4x4
- Or equivalently, hθ(x) = θ^T X
  - θ^T is a [1 x (n+1)] matrix. In other words, because θ is a column vector, the transposition operation transforms it into a row vector: before, θ was an [(n+1) x 1] matrix, and now θ^T is a [1 x (n+1)] matrix.
  - That means the inner dimensions of θ^T and X match, so they can be multiplied together: [1 x (n+1)] * [(n+1) x 1] = [1 x 1].
  - So the transpose of our parameter vector times an input example X gives a predicted hypothesis of dimension [1 x 1], i.e. a single value. The x0 = 1 convention is what lets us write it like this.
- This is an example of multivariate linear regression.

Gradient descent for multiple variables

- Fitting the parameters of the hypothesis with gradient descent.
- The parameters are θ0 to θn. Instead of thinking about them as n + 1 separate values, think of the parameters as a single (n + 1)-dimensional vector θ.
- Our cost function is
  - J(θ) = (1/2m) * Σ from i = 1 to m of (hθ(x^(i)) - y^(i))²
  - Similarly, instead of thinking of J as a function of n + 1 separate numbers, J is just a function of the parameter vector: J(θ).
- Gradient descent: once again, the update is
  - θj := θj - α * (∂/∂θj) J(θ)
  - i.e. θj becomes θj minus the learning rate (α) times the partial derivative of J(θ) with respect to θj.
  - We do this through a simultaneous update of every θj value.
- Implementing this algorithm
  - When n = 1 we had slightly different update rules for θ0 and θ1. Actually they're the same, except that the θ0 rule ends with the previously undefined x0^(i), which equals 1 and so wasn't shown.
  - We now have an almost identical rule for multivariate gradient descent:
    - θj := θj - α * (1/m) * Σ from i = 1 to m of (hθ(x^(i)) - y^(i)) * xj^(i)
  - What's going on here?
    - We do this for each j (0 up to n) as a simultaneous update (just as when n = 1).
    - So we reset θj to θj minus the learning rate (α) times the partial derivative of the cost of the θ vector with respect to θj.
    - In non-calculus words, this means: the learning rate, times 1/m (which makes the maths easier), times the sum over EACH example of (the hypothesis applied to that example's feature vector, minus the actual value, times the j-th value of that feature vector).
  - It's important to remember that these algorithms are highly similar; a vectorised sketch of the update follows below.
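A minimal Octave sketch of this simultaneous, vectorised update. The function name gradient_descent is my own label; X is assumed to be the [m x (n+1)] matrix of training examples with the x0 = 1 column already added, and y the [m x 1] vector of target values.

    function [theta, J_history] = gradient_descent(X, y, theta, alpha, num_iters)
      % Batch gradient descent for multivariate linear regression.
      m = length(y);
      J_history = zeros(num_iters, 1);
      for iter = 1:num_iters
        errors = X * theta - y;                               % h_theta(x^(i)) - y^(i) for every example
        J_history(iter) = (1 / (2 * m)) * sum(errors .^ 2);   % record J(theta) for this iteration's theta
        theta = theta - (alpha / m) * (X' * errors);          % simultaneous update of every theta_j
      end
    end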
Gradient descent in practice 1: Feature scaling

Having covered the theory, we now move on to some practical tricks.

Feature scaling
- If you have a problem with multiple features, you should make sure those features have a similar scale. This means gradient descent will converge more quickly.
- e.g.
  - x1 = size (0 - 2000 feet squared)
  - x2 = number of bedrooms (1 - 5)
- The huge range difference means the contours we get if we plot θ1 vs. θ2 have a very tall, thin shape. Running gradient descent on this kind of cost function can take a long time to find the global minimum; it's pathological input to gradient descent.
- So we need to rescale the input to make gradient descent more effective.
- If you divide each value of x1 and x2 by the maximum for that feature, the contours become more like circles (as everything is scaled between 0 and 1).
- You may want to get everything into roughly the -1 to +1 range. Avoid large ranges, small ranges, or very different ranges from one another.
- Rule of thumb regarding acceptable ranges:
  - -3 to +3 is generally fine; anything bigger is bad.
  - -1/3 to +1/3 is OK; anything smaller is bad.
- Can also do mean normalization:
  - Take a feature xi and replace it by (xi - mean) / max, so your values all have an average of about 0.
  - Instead of the max you can also use the standard deviation.

Learning rate α

Focus on the learning rate (α). Topics:
- The update rule
- Debugging
- How to choose α

Make sure gradient descent is working
- Plot min J(θ) vs. the number of iterations (i.e. plot J(θ) over the course of gradient descent).
- If gradient descent is working, then J(θ) should decrease after every iteration.
- The plot can also show whether you're no longer making big gains after a certain number of iterations.
  - You can apply heuristics to reduce the number of iterations if need be.
  - If, for example, after 1000 iterations the parameters change by nearly nothing, you could choose to run only 1000 iterations in the future.
  - Make sure you don't accidentally hard-code thresholds like this and then forget why they're there, though!
- The number of iterations needed varies a lot: 30 iterations, 3,000 iterations, 3,000,000 iterations. It's very hard to tell in advance how many iterations will be needed, but you can often make a guess based on a plot like this after the first 100 or so iterations.
- Automatic convergence tests
  - Check whether J(θ) changes by some small threshold or less.
  - Choosing this threshold is hard, so it's often easier to check for the curve flattening into a straight line. Why? Because we're seeing the flattening in the context of the whole run.
  - Could you design an automatic checker which calculates a threshold based on the system's preceding progress?
- Checking it's working
  - If you plot J(θ) vs. iterations and see the value increasing, you probably need a smaller α. The cause is that you're overshooting the minimum of the function you're minimizing; reduce the learning rate so you actually reach the minimum. So, use a smaller α.
  - Another problem might be J(θ) looking like a series of waves; here again, you need a smaller α.
  - However, if α is small enough, J(θ) will decrease on every iteration. BUT if α is too small, the rate of convergence is too slow: a less steep decline indicates slow convergence, because we're decreasing J by less on each iteration than a steeper slope would.
- Typically
  - Try a range of α values and plot J(θ) vs. the number of iterations for each version of α.
  - Go for roughly threefold increases: 0.001, 0.003, 0.01, 0.03, 0.1, 0.3 (a sketch of this comparison follows below).
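As a minimal sketch of both tricks together, the snippet below mean-normalizes every feature (using the standard deviation rather than the max) and then compares several learning rates by plotting J(θ) against the iteration number, reusing the gradient_descent function sketched above. The 400-iteration count is an arbitrary choice for illustration.

    % Mean normalization of every feature except the x0 = 1 column.
    mu = mean(X(:, 2:end));
    sigma = std(X(:, 2:end));
    X(:, 2:end) = (X(:, 2:end) - mu) ./ sigma;

    % Try a range of alphas (roughly threefold increases) and plot J(theta)
    % against the iteration number for each one.
    alphas = [0.001 0.003 0.01 0.03 0.1 0.3];
    hold on;
    for k = 1:length(alphas)
      [~, J_history] = gradient_descent(X, y, zeros(size(X, 2), 1), alphas(k), 400);
      plot(1:400, J_history);
    end
    hold off;

A curve that rises or oscillates points to an α that is too large; a curve that falls but only very slowly points to one that is too small.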
Features and polynomial regression

The choice of features, and how you can get different learning algorithms by choosing appropriate features; polynomial regression for fitting non-linear functions.

Example: house price prediction
- Two features:
  - Frontage - width of the plot of land along the road (x1)
  - Depth - depth of the plot away from the road (x2)
- You don't have to use just these two features; you can create new features.
- You might decide that an important feature is the land area, so create a new feature x3 = frontage * depth and use h(x) = θ0 + θ1x3. Area is a better indicator.
- Often, by defining new features you may get a better model.

Polynomial regression
- A polynomial may fit the data better, e.g. θ0 + θ1x + θ2x², a quadratic function.
- For housing data you could use a quadratic function, but it may not fit the data so well: the inflection point means housing prices would decrease when the size gets really big. So instead you might use a cubic function.
- How do we fit the model to this data? To map our old linear hypothesis and cost functions to these polynomial descriptions, the easy thing to do is set
  - x1 = x
  - x2 = x²
  - x3 = x³
- By selecting the features like this and applying the linear regression algorithms, you can do polynomial regression.
- Remember, feature scaling becomes even more important here.
- Instead of a conventional polynomial you could also use variable^(1/something), i.e. square root, cube root, etc.
- With lots of candidate features, we'll later look at developing an algorithm to choose the best ones. A sketch of this kind of feature construction follows below.
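Here is a minimal sketch of such feature construction for the housing example. The column vectors frontage, depth and price are hypothetical training data, and each engineered feature is divided by its maximum because, as noted above, scaling matters even more once powers are added.

    % New feature: land area = frontage * depth.
    area = frontage .* depth;

    % Cubic model of price vs. area, with each power divided by its maximum
    % so the features stay on comparable scales.
    x1 = area      / max(area);
    x2 = area .^ 2 / max(area .^ 2);
    x3 = area .^ 3 / max(area .^ 3);
    X  = [ones(size(area)) x1 x2 x3];

    theta = X \ price;   % the same linear regression machinery now fits a cubic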
Normal equation

For some linear regression problems the normal equation provides a better solution. So far we've been using gradient descent, an iterative algorithm which takes steps to converge. The normal equation solves for θ analytically: it solves for the optimum value of θ directly. It has some advantages and disadvantages.

How does it work?
- Simplified cost function: J(θ) = aθ² + bθ + c, where θ is just a real number, not a vector. The cost function is then a quadratic function.
- How do you minimize this? Take the derivative of J(θ) with respect to θ and set that derivative equal to 0. This allows you to solve for the value of θ which minimizes J(θ).
- In our more complex problems, θ is an (n + 1)-dimensional vector of real numbers and the cost function is a function of that vector. How do we minimize this function? Take the partial derivative of J(θ) with respect to each θj, set it to 0 for every j, and solve for θ0 to θn. This gives the values of θ which minimize J(θ).
- If you work through the calculus and the solution, the derivation is pretty complex, and we're not going through it here. Instead, here is what you need to know to implement the process.

Example of the normal equation
- Here m = 4 and n = 4.
- To implement the normal equation:
  - Take the examples and add an extra column (the x0 feature).
  - Construct a matrix X (the design matrix) which contains all the training-data features in an [m x (n+1)] matrix.
  - Do something similar for y: construct a column vector y of size [m x 1].
  - Then use the equation θ = (X^T X)^-1 X^T y, i.e. (X transpose times X) inverse, times X transpose, times y.
  - If you compute this, you get the value of θ which minimizes the cost function.

General case
- You have m training examples and n features.
- The design matrix X: each training example is an (n + 1)-dimensional feature column vector. X is constructed by taking each training example, taking its transpose (i.e. column -> row) and using it as a row of the design matrix. This creates an [m x (n+1)] matrix.
- The vector y is built by stacking all the y values into a column vector.
- What is (X^T X)^-1? It's the inverse of the matrix X^T X, i.e. if A = X^T X then A^-1 = (X^T X)^-1.
- In Octave and MATLAB you could do

      pinv(X' * X) * X' * y

  where X' is the notation for X transpose and pinv is a function for the (pseudo-)inverse of a matrix.
- In a previous lecture we discussed feature scaling: if you're using the normal equation then there is no need for feature scaling.

When should you use gradient descent and when should you use the normal equation?
- Gradient descent
  - You need to choose a learning rate.
  - It needs many iterations, which could make it slower.
  - But it works well even when n is massive (millions), so it's better suited to big data.
  - What is a big n, though? 100 or even 1000 is still (relatively) small; if n is 10,000 or more, look at using gradient descent.
- Normal equation
  - No need to choose a learning rate; no need to iterate, check for convergence, etc.
  - But the normal equation needs to compute (X^T X)^-1, the inverse of an (n+1) x (n+1) matrix. With most implementations, computing a matrix inverse grows as O(n³), so that's not great: it's slow if n is large, and can be much slower.

Normal equation and non-invertibility
- An advanced concept: often asked about, but quite advanced and perhaps optional material. A phenomenon worth understanding, but probably not necessary.
- When computing θ = (X^T X)^-1 X^T y, what if X^T X is non-invertible (singular/degenerate)?
  - Only some matrices are invertible, and this should be quite a rare problem.
  - Octave can invert matrices using pinv (pseudo-inverse), which gets the right value even if X^T X is non-invertible, as well as inv (the ordinary inverse).
- What does it mean for X^T X to be non-invertible? Normally there are two common causes:
  - Redundant features in the learning model, e.g. x1 = size in feet squared and x2 = size in meters squared (one is just a rescaling of the other).
  - Too many features, e.g. when m ≤ n (more features than training examples).
A small worked sketch of the normal equation, including the redundant-feature case, follows below.
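Finally, a minimal sketch of the normal equation on a tiny invented training set, and of what happens when X^T X becomes singular because of a redundant feature. The numbers are made up for illustration only.

    % Design matrix: x0 = 1, size (feet squared), number of bedrooms; m = 4.
    X = [1 2104 5;
         1 1416 3;
         1 1534 3;
         1  852 2];
    y = [460; 232; 315; 178];          % prices

    theta = pinv(X' * X) * X' * y;     % normal equation

    % Redundant feature: size in meters squared is just a rescaling of the
    % size column, so X' * X becomes singular.  inv() would warn that the
    % matrix is singular, but pinv() still returns a usable theta.
    X_red = [X, X(:, 2) * 0.0929];
    theta_red = pinv(X_red' * X_red) * X_red' * y;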