Foundations of Machine Learning PDF
Document Details
Uploaded by Deleted User
IIT Kharagpur
Sudeshna Sarkar
Tags
Summary
This document is a lecture or presentation on Foundations of Machine Learning, focusing on the module of Linear Regression. It covers basic concepts, examples, and types of linear regression models, including simple and multiple linear regression. The document also touches upon the concept of fitting a polynomial and evaluating different fits.
Full Transcript
Foundations of Machine Learning Module 2: Linear Regression and Decision Tree Part A: Linear Regression Sudeshna Sarkar IIT Kharagpur Regression In regression the output is continuous Many models could be used – Simplest is linear...
Foundations of Machine Learning Module 2: Linear Regression and Decision Tree Part A: Linear Regression Sudeshna Sarkar IIT Kharagpur Regression In regression the output is continuous Many models could be used – Simplest is linear regression – Fit data with the best hyper-plane which "goes through" the points y dependent variable (output) x – independent variable (input) A Simple Example: Fitting a Polynomial The green curve is the true function (which is not a polynomial) We may use a loss function that measures the squared error in the prediction of y(x) from x. from Bishop’s book on Machine Learning 3 Some fits to the data: which is best? from Bishop 4 Types of Regression Models Regression 1 feature Models 2+ features Simple Multiple Non- Non- Linear Linear Linear Linear Linear regression Given an input x compute an output y For example: Y - Predict height from age - Predict house price from house area - Predict distance from wall from sensors X Simple Linear Regression Equation E(y) Regression line Intercept Slope β1 b0 x Linear Regression Model Relationship Between Variables Is a Linear Function Population Population Random Y-Intercept Slope Error Y=𝛽0 + 𝛽1 𝑥1 + 𝜖 House Number Y: Actual Selling X: House Size (100s Price ft2) 1 89.5 20.0 2 79.9 14.8 3 83.1 20.5 Sample 15 4 56.9 12.5 houses 5 66.6 18.0 from the 6 82.5 14.3 region. 7 126.3 27.5 8 79.3 16.5 9 119.9 24.3 10 87.6 20.2 11 112.6 22.0 12 120.8.019 13 78.5 12.3 14 74.3 14.0 15 74.8 16.7 Averages 88.84 18.17 House price vs size Linear Regression – Multiple Variables Yi = b0 + b1X1 + b2 X2 + + bp Xp +e b0 is the intercept (i.e. the average value for Y if all the X’s are zero), bj is the slope for the jth variable Xj 11 Regression Model Our model assumes that E(Y | X = x) = b0 + b1x (the “population line”) Population Yi = b0 + b1X1 + b2 X2 + + bp Xp +e line Yˆi = b̂0 + b̂1 X1 + b̂2 X2 + Least Squares line + b̂ p X p We use 𝛽 0 through 𝛽 𝑝 as guesses for b0 through bp 𝑖 as a guess for Yi. The guesses will not be perfect. and 𝑌 Assumption The data may not form a perfect line. When we actually take a measurement (i.e., observe the data), we observe: Yi = b0 + b1Xi + i, where i is the random error associated with the ith observation. Assumptions about the Error E(i ) = 0 for i = 1, 2,…,n. (i ) = where is unknown. The errors are independent. The i are normally distributed (with mean 0 and standard deviation ). The regression line The least-squares regression line is the unique line such that the sum of the squared vertical (y) distances between the data points and the line is the smallest possible. Criterion for choosing what line to draw: method of least squares 0 The method of least squares chooses the line ( 𝛽 and 𝛽1 ) that makes the sum of squares of the residuals σ ℇ𝑖 2 as small as possible Minimizes n i 0 1i [ y i 1 (b b x )]2 for the given observations ( xi , yi ) How do we "learn" parameters For the 2-d problem Y = b0 + b1X To find the values for the coefficients which minimize the objective function we take the partial derivates of the objective function (SSE) with respect to the coefficients. Set these to 0, and solve. n å xy - å x å y å y - b åx b1 = b0 = 1 nå x 2 - (å x ) 2 n 17 Multiple Linear Regression Y b 0 b1 X 1 b 2 X 2 b n X n 𝑛 ℎ 𝑥 = 𝛽𝑖 𝑥𝑖 𝑖=0 There is a closed form which requires matrix inversion, etc. There are iterative techniques to find weights – delta rule (also called LMS method) which will update towards the objective of minimizing the SSE. 18 Linear Regression 𝑛 ℎ 𝑥 = 𝛽𝑖 𝑥𝑖 𝑖=0 To learn the parameters θ (𝛽𝑖 ) ? Make h(x) close to y, for the available training examples. Define a cost function 𝐽 𝜃 1 𝑚 (𝑖) (𝑖) 2 J(θ) = σ (ℎ 𝑥 − 𝑦 ) 2 𝑖=1 Find 𝜃 that minimizes J(𝜃). LMS Algorithm Start a search algorithm (e.g. gradient descent algorithm,) with initial guess of 𝜃. Repeatedly update 𝜃 to make J(𝜃) smaller, until it converges to minima. 𝜕 βj = βj − 𝛼 𝐽 𝜃 𝜕βj J is a convex quadratic function, so has a single global minima. gradient descent eventually converges at the global minima. At each iteration this algorithm takes a step in the direction of steepest descent(-ve direction of gradient). LMS Update Rule If you have only one training example (𝑥, 𝑦) 𝜕 𝜕 1 J(𝜃) = (ℎ 𝑥 − 𝑦)2 𝜕𝜃 𝜕𝜃𝑗 2 1 𝜕 = 2. (ℎ 𝑥 − 𝑦) (ℎ 𝑥 − 𝑦) 2 𝜕𝜃𝑗 𝑛 𝜕 = ℎ 𝑥 −𝑦. ( 𝜃𝑖 𝑥𝑖 − 𝑦) 𝜕𝜃𝑗 𝑖=0 = (ℎ 𝑥 − 𝑦)𝑥𝑗 For a single training example, this gives the update rule: 𝛽𝑗 = 𝛽𝑗 + 𝛼(𝑦 𝑖 − ℎ 𝑥 𝑖 )𝑥𝑗 (𝑖) m training examples Repeat until convergence { 𝜃𝑗 ≔ 𝜃𝑗 + 𝛼 σ𝑚 𝑖=1(𝑦 𝑖 −ℎ 𝑥 𝑖 ) 𝑥𝑗 (𝑖) } Batch Gradient Descent: looks at every example on each step. Stochastic gradient descent Repeatedly run through the training set. Whenever a training point is encountered, update the parameters according to the gradient of the error with respect to that training example only. Repeat { for I = 1 to m do 𝜃𝑗 ≔ 𝜃𝑗 + 𝛼(𝑦 𝑖 −ℎ 𝑥 𝑖 )𝑥𝑗 (𝑖) (for every j) end for } until convergence Thank You Delta Rule for Classification 1 z 0 x 1 z 0 x What would happen in this adjusted case for perceptron and delta rule and where would the decision point (i.e..5 crossing) be? CS 478 - Regression 25 Delta Rule for Classification 1 z 0 x 1 z 0 x Leads to misclassifications even though the data is linearly separable For Delta rule the objective function is to minimize the regression line SSE, not maximize classification CS 478 - Regression 26 Delta Rule for Classification 1 z 0 x 1 z 0 x 1 z 0 x What would happen if we were doing a regression fit with a sigmoid/logistic curve rather than a line? CS 478 - Regression 27 Delta Rule for Classification 1 z 0 x 1 z 0 x 1 z 0 x Sigmoid fits many decision cases quite well! This is basically what logistic regression does. 28 Foundations of Machine Learning Module 2: Linear Regression and Decision Tree Part B: Introduction to Decision Tree Sudeshna Sarkar IIT Kharagpur Definition A decision tree is a classifier in the form of a tree structure with two types of nodes: – Decision node: Specifies a choice or test of some attribute, with one branch for each outcome – Leaf node: Indicates classification of an example Decision Tree Example 1 Whether to approve a loan Employed? No Yes Credit Income? Score? High Low High Low Approve Reject Approve Reject Decision Tree Example 3 Issues Given some training examples, what decision tree should be generated? One proposal: prefer the smallest tree that is consistent with the data (Bias) – the tree with the least depth? – the tree with the fewest nodes? Possible method: – search the space of decision trees for the smallest decision tree that fits the data Example Data Training Examples: Action Author Thread Length Where e1 skips known new long Home e2 reads unknown new short Work e3 skips unknown old long Work e4 skips known old long home e5 reads known new short home e6 skips known old long work New Examples: e7 ??? known new short work e8 ??? unknown new short work Possible splits skips 9 length reads 9 long short skips 9 thread reads 9 Skips 2 Skips 7 Reads 9 Reads 0 new old Skips 6 Skips 3 Reads 2 Reads 7 Two Example DTs Decision Tree for PlayTennis Attributes and their values: – Outlook: Sunny, Overcast, Rain – Humidity: High, Normal – Wind: Strong, Weak – Temperature: Hot, Mild, Cool Target concept - Play Tennis: Yes, No Decision Tree for PlayTennis Outlook Sunny Overcast Rain Humidity Yes Wind High Normal Strong Weak No Yes No Yes Decision Tree for PlayTennis Outlook Sunny Overcast Rain Humidity Each internal node tests an attribute High Normal Each branch corresponds to an attribute value node No Yes Each leaf node assigns a classification Decision Tree for PlayTennis Outlook Temperature Humidity Wind PlayTennis Sunny Hot High Weak ? No Outlook Sunny Overcast Rain Humidity Yes Wind High Normal Strong Weak No Yes No Yes Decision Tree decision trees represent disjunctions of conjunctions Outlook Sunny Overcast Rain Humidity Yes Wind High Normal Strong Weak No Yes No Yes (Outlook=Sunny Humidity=Normal) (Outlook=Overcast) (Outlook=Rain Wind=Weak) Searching for a good tree How should you go about building a decision tree? The space of decision trees is too big for systematic search. Stop and – return the a value for the target feature or – a distribution over target feature values Choose a test (e.g. an input feature) to split on. – For each value of the test, build a subtree for those examples with this value for the test. Top-Down Induction of Decision Trees ID3 1. Which node to proceed with? 1. A the “best” decision attribute for next node 2. Assign A as decision attribute for node 3. For each value of A create new descendant 4. Sort training examples to leaf node according to the attribute value of the branch 5. If all training examples are perfectly classified (same value of target attribute) stop, else iterate over new leaf nodes. 2. When to stop? Choices When to stop – no more input features – all examples are classified the same – too few examples to make an informative split Which test to split on – split gives smallest error. – With multi-valued features split on all values or split values into half. Foundations of Machine Learning Module 2: Linear Regression and Decision Tree Part C: Learning Decision Tree Sudeshna Sarkar IIT Kharagpur Top-Down Induction of Decision Trees ID3 1. Which node to proceed with? 1. A the “best” decision attribute for next node 2. Assign A as decision attribute for node 3. For each value of A create new descendant 4. Sort training examples to leaf node according to the attribute value of the branch 5. If all training examples are perfectly classified (same value of target attribute) stop, else iterate over new leaf nodes. 2. When to stop? Choices When to stop – no more input features – all examples are classified the same – too few examples to make an informative split Which test to split on – split gives smallest error. – With multi-valued features split on all values or split values into half. Which Attribute is ”best”? [29+,35-] A1=? A2=? [29+,35-] True False True False [21+, 5-] [8+, 30-] [18+, 33-] [11+, 2-] ICS320 4 Principled Criterion Selection of an attribute to test at each node - choosing the most useful attribute for classifying examples. information gain – measures how well a given attribute separates the training examples according to their target classification – This measure is used to select among the candidate attributes at each step while growing the tree – Gain is measure of how much we can reduce uncertainty (Value lies between 0,1) Entropy A measure for – uncertainty – purity – information content Information theory: optimal length code assigns (- log2p) bits to message having probability p S is a sample of training examples – p+ is the proportion of positive examples in S – p- is the proportion of negative examples in S Entropy of S: average optimal number of bits to encode information about certainty/uncertainty about S Entropy(S) = p+(-log2p+) + p-(-log2p-) = -p+log2p+- p-log2p- Entropy The entropy is 0 if the outcome is ``certain”. The entropy is maximum if we have no knowledge of the system (or any outcome is equally possible). S is a sample of training examples p+ is the proportion of positive examples p- is the proportion of negative examples Entropy measures the impurity of S Entropy(S) = -p+log2p+- p-log2 p- Information Gain Gain(S,A): expected reduction in entropy due to partitioning S on attribute A Gain(S,A)=Entropy(S) − vvalues(A) |Sv|/|S| Entropy(Sv) Entropy([29+,35-]) = -29/64 log2 29/64 – 35/64 log2 35/64 = 0.99 [29+,35-] A1=? A2=? [29+,35-] True False True False [21+, 5-] [8+, 30-] [18+, 33-] [11+, 2-] Information Gain Entropy([21+,5-]) = 0.71 Entropy([18+,33-]) = 0.94 Entropy([8+,30-]) = 0.74 Entropy([8+,30-]) = 0.62 Gain(S,A1)=Entropy(S) Gain(S,A2)=Entropy(S) -26/64*Entropy([21+,5-]) -51/64*Entropy([18+,33-]) -38/64*Entropy([8+,30-]) -13/64*Entropy([11+,2-]) =0.27 =0.12 [29+,35-] A1=? A2=? [29+,35-] True False True False [21+, 5-] [8+, 30-] [18+, 33-] [11+, 2-] ICS320 9 Training Examples Day Outlook Temp Humidity Wind Tennis? D1 Sunny Hot High Weak No D2 Sunny Hot High Strong No D3 Overcast Hot High Weak Yes D4 Rain Mild High Weak Yes D5 Rain Cool Normal Weak Yes D6 Rain Cool Normal Strong No D7 Overcast Cool Normal Strong Yes D8 Sunny Mild High Weak No D9 Sunny Cool Normal Weak Yes D10 Rain Mild Normal Weak Yes D11 Sunny Mild Normal Strong Yes D12 Overcast Mild High Strong Yes D13 Overcast Hot Normal Weak Yes D14 Rain Mild High Strong No Selecting the Next Attribute S=[9+,5-] S=[9+,5-] E=0.940 E=0.940 Humidity Wind High Normal Weak Strong [3+, 4-] [6+, 1-] [6+, 2-] [3+, 3-] E=0.985 E=0.592 E=0.811 E=1.0 Gain(S,Humidity) Gain(S,Wind) =0.940-(7/14)*0.985 =0.940-(8/14)*0.811 – (7/14)*0.592 – (6/14)*1.0 =0.151 =0.048 Humidity provides greater info. gain than Wind, w.r.t target classification. Selecting the Next Attribute S=[9+,5-] E=0.940 Outlook Sunny Overcast Rain [2+, 3-] [4+, 0] [3+, 2-] E=0.971 E=0.0 E=0.971 Gain(S,Outlook) =0.940-(5/14)*0.971 -(4/14)*0.0 – (5/14)*0.0971 =0.247 Selecting the Next Attribute The information gain values for the 4 attributes are: Gain(S,Outlook) =0.247 Gain(S,Humidity) =0.151 Gain(S,Wind) =0.048 Gain(S,Temperature) =0.029 where S denotes the collection of training examples Note: 0Log20 =0 ID3 Algorithm [D1,D2,…,D14] Outlook [9+,5-] Sunny Overcast Rain Ssunny=[D1,D2,D8,D9,D11] [D3,D7,D12,D13] [D4,D5,D6,D10,D14] [2+,3-] [4+,0-] [3+,2-] ? Yes ? Test for this node Gain(Ssunny , Humidity)=0.970-(3/5)0.0 – 2/5(0.0) = 0.970 Gain(Ssunny , Temp.)=0.970-(2/5)0.0 –2/5(1.0)-(1/5)0.0 = 0.570 Gain(Ssunny , Wind)=0.970= -(2/5)1.0 – 3/5(0.918) = 0.019 ID3 Algorithm Outlook Sunny Overcast Rain Humidity Yes Wind [D3,D7,D12,D13] High Normal Strong Weak No Yes No Yes [D1,D2] [D8,D9,D11] [D6,D14] [D4,D5,D10] Splitting Rule: GINI Index GINI Index – Measure of node impurity GINInode(Node) = 1- [ p(c)] 2 c classes Sv GINIsplit (A) = S GINI(N v ) v Value s(A) Splitting Based on Continuous Attributes Taxable Taxable Income Income? > 80K? < 10K > 80K Yes No [10K,25K) [25K,50K) [50K,80K) (i) Binary split (ii) Multi-way split Continuous Attribute – Binary Split For continuous attribute – Partition the continuous value of attribute A into a discrete set of intervals – Create a new boolean attribute Ac , looking for a threshold c, true if Ac c Ac = false otherwise How to choose c ? consider all possible splits and finds the best cut Practical Issues of Classification Underfitting and Overfitting Missing Values Costs of Classification Hypothesis Space Search in Decision Trees Conduct a search of the space of decision trees which can represent all possible discrete functions. Goal: to find the best decision tree Finding a minimal decision tree consistent with a set of data is NP-hard. Perform a greedy heuristic search: hill climbing without backtracking Statistics-based decisions using all data 20 Bias and Occam’s Razor Prefer short hypotheses. Argument in favor: – Fewer short hypotheses than long hypotheses – A short hypothesis that fits the data is unlikely to be a coincidence – A long hypothesis that fits the data might be a coincidence ICS320 21 Foundations of Machine Learning Module 2: Linear Regression and Decision Tree Part D: Overfitting Sudeshna Sarkar IIT Kharagpur Overfitting Learning a tree that classifies the training data perfectly may not lead to the tree with the best generalization performance. – There may be noise in the training data – May be based on insufficient data A hypothesis h is said to overfit the training data if there is another hypothesis, h’, such that h has smaller error than h’ on the training data but h has larger error on the test data than h’. Overfitting Learning a tree that classifies the training data perfectly may not lead to the tree with the best generalization performance. – There may be noise in the training data – May be based on insufficient data A hypothesis h is said to overfit the training data if there is another hypothesis, h’, such that h has smaller error than h’ on the training data but h has larger error on the test data than h’. On training accuracy On testing Complexity of tree Underfitting and Overfitting (Example) 500 circular and 500 triangular data points. Circular points: 0.5 sqrt(x12+x22) 1 Triangular points: sqrt(x12+x22) > 0.5 or sqrt(x12+x22) < 1 Underfitting and Overfitting Overfitting Underfitting: when model is too simple, both training and test errors are large Overfitting due to Noise Decision boundary is distorted by noise point Overfitting due to Insufficient Examples Lack of data points makes it difficult to predict correctly the class labels of that region Notes on Overfitting Overfitting results in decision trees that are more complex than necessary Training error no longer provides a good estimate of how well the tree will perform on previously unseen records Avoid Overfitting How can we avoid overfitting a decision tree? – Prepruning: Stop growing when data split not statistically significant – Postpruning: Grow full tree then remove nodes Methods for evaluating subtrees to prune: – Minimum description length (MDL): Minimize: size(tree) + size(misclassifications(tree)) – Cross-validation ICS320 10 Pre-Pruning (Early Stopping) Evaluate splits before installing them: – Don’t install splits that don’t look worthwhile – when no worthwhile splits to install, done Pre-Pruning (Early Stopping) Typical stopping conditions for a node: – Stop if all instances belong to the same class – Stop if all the attribute values are the same More restrictive conditions: – Stop if number of instances is less than some user-specified threshold – Stop if class distribution of instances are independent of the available features (e.g., using 2 test) – Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain). Reduced-error Pruning A post-pruning, cross validation approach - Partition training data into “grow” set and “validation” set. - Build a complete tree for the “grow” data - Until accuracy on validation set decreases, do: For each non-leaf node in the tree Temporarily prune the tree below; replace it by majority vote Test the accuracy of the hypothesis on the validation set Permanently prune the node with the greatest increase in accuracy on the validation test. Problem: Uses less data to construct the tree Sometimes done at the rules level General Strategy: Overfit and Simplify Reduced Error Pruning Model Selection & Generalization Learning is an ill-posed problem; data is not sufficient to find a unique solution The need for inductive bias, assumptions about H Generalization: How well a model performs on new data Overfitting: H more complex than C or f Underfitting: H less complex than C or f 15 Triple Trade-Off There is a trade-off between three factors: – Complexity of H, c (H), – Training set size, N, – Generalization error, E on new data overfitting As N increases, E decreases As c (H) increases, first E decreases and then E increases As c (H) increases, the training error decreases for some time and then stays constant (frequently at 0) 16 Notes on Overfitting overfitting happens when a model is capturing idiosyncrasies of the data rather than generalities. – Often caused by too many parameters relative to the amount of training data. – E.g. an order-N polynomial can intersect any N+1 data points Dealing with Overfitting Use more data Use a tuning set Regularization Be a Bayesian 18 Regularization In a linear regression model overfitting is characterized by large weights. 19 Penalize large weights in Linear Regression Introduce a penalty term in the loss function. Regularized Regression 1. (L2-Regularization or Ridge Regression) 1. L1-Regularization 20