Machine Learning Lecture 2 PDF
Andrew Ng
Summary
This document is a lecture on machine learning. It reviews the main types of learning (supervised, unsupervised, semi-supervised, and reinforcement learning), introduces linear classifiers, and presents linear regression and gradient descent as methods for function approximation. It then covers decision tree learning, including entropy, continuous-valued attributes, overfitting, and pruning.
Full Transcript
24CSAI03H MACHINE LEARNING
Lecture 2

Last Time: Machine Learning
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

LAST TIME
Supervised (Predictive) Learning. Given: training data + desired outputs (labels).
Unsupervised (Descriptive) Learning. Given: training data (without desired outputs).
Semi-supervised Learning. Given: training data + a few desired outputs.
Reinforcement Learning. Rewards from a sequence of actions.

Supervised Learning
Abu_Mostafa

CLASSIFIERS: LINEAR
Find a linear function to separate the classes: $f(\mathbf{x}) = \operatorname{sgn}(\mathbf{w} \cdot \mathbf{x} + b)$
Slide credit: L. Lazebnik

LINEAR CLASSIFIERS
Let’s simplify life by assuming:
Every instance is a vector of real numbers, $\mathbf{x} = (x_1, \dots, x_n)$. (Notation: boldface $\mathbf{x}$ is a vector.)
There are only two classes, $y = +1$ and $y = -1$.
A linear classifier is a vector $\mathbf{w}$ of the same dimension as $\mathbf{x}$ that is used to make this prediction:
$\hat{y} = \operatorname{sign}(w_1 x_1 + w_2 x_2 + \dots + w_n x_n) = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x})$, where $\operatorname{sign}(z) = +1$ if $z \ge 0$ and $\operatorname{sign}(z) = -1$ if $z < 0$.

REGRESSION
For classification the output is discrete; in regression the output is continuous (function approximation).
Many models could be used; the simplest is linear regression.
Fit the data with the best hyperplane that "goes through" the points.
y: dependent variable (output). x: independent variable (input).

LINEAR REGRESSION
For now, assume just one independent (input) variable x and one dependent (output) variable y. Multiple linear regression assumes an input vector x.
We "fit" the points with a line (in general, a hyperplane).
Which line should we use? Choose an objective function.
For simple linear regression we choose the sum of squared errors (SSE): $\sum_i (\text{predicted}_i - \text{actual}_i)^2 = \sum_i (\text{residual}_i)^2$
Thus, find the line which minimizes the sum of the squared residuals (i.e.
least squares).

LINEAR REGRESSION
Let's say we decide to approximate y as a linear function of x, written $h(x)$.
The $\theta_i$’s are the parameters (also called weights) parameterizing the space of linear functions mapping from X to Y.
To simplify our notation, we also introduce the convention of letting $x_0 = 1$ (this is the intercept term), so that $h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x$, where on the right-hand side we are viewing $\theta$ and $x$ both as vectors.
How do we learn the parameters $\theta$?
We will define a function that measures, for each value of the $\theta$’s, how close the $h(x^{(i)})$’s are to the corresponding $y^{(i)}$’s. We define the cost function $J(\theta)$.
[Figure panels: $\theta_1 = 1$, $\theta_1 = 0.5$, $\theta_1 = 0$]
Andrew Ng

GRADIENT DESCENT ALGORITHM
[Figure labels: $J(\theta)$, $\theta$]

Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)

LEARNING DECISION TREES
A decision tree is a tree-structured model with a set of attributes to test in order to predict the output.
To decide which attribute should be tested first, find the one that most decreases data impurity (the attribute with the highest information gain).
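The attribute-selection rule above can be sketched in code. This is only an illustration under stated assumptions, not the lecture's own implementation: it assumes entropy as the impurity measure and categorical attributes, and the helper names (`entropy`, `information_gain`, `best_attribute`) as well as the toy records are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Impurity of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """Attribute to test first: the one with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Hypothetical toy data (not from the lecture).
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rainy", "windy": False},
    {"outlook": "rainy", "windy": True},
]
labels = ["yes", "yes", "no", "no"]

chosen = best_attribute(rows, labels, ["outlook", "windy"])
print(chosen)  # prints outlook
```

In this toy set, splitting on `outlook` yields two pure groups (gain of 1 bit), while splitting on `windy` leaves both groups as mixed as the parent (gain of 0), so `outlook` would be tested first.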
Then recurse…
The goal is to have the resulting decision tree as small as possible (Occam’s Razor).

ENTROPY
S is a sample of training examples.
$p_+$ is the proportion of positive examples in S.
$p_-$ is the proportion of negative examples in S.
Entropy measures the impurity of S:
$\text{Entropy}(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$

TREE USES NODES, AND LEAVES
[Figure: example tree built from 14 examples (C1 = 9, C2 = 5); other labels on the figure: 7, 7, 5, 2]
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

WHEN TO CONSIDER DECISION TREES
If a comprehensible explanation of the classification is needed.
Target function is discrete valued.
Possibly noisy training data.
Missing attribute values.
Examples: medical diagnosis, credit risk analysis.

CONTINUOUS VALUED ATTRIBUTES
Create a discrete attribute to test a continuous one:
Temperature = 24.5°C
(Temperature > 20.0°C) = {true, false}
Where to set the threshold?
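One common way to answer this question, assumed here rather than taken from the slide, is to try a cut halfway between each pair of adjacent sorted values and keep the cut with the highest information gain. A minimal sketch using the slide's six Temperature/PlayTennis examples; the function names `entropy` and `best_threshold` are hypothetical:

```python
import math
from collections import Counter


def entropy(labels):
    """Impurity of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the binary test `value > threshold`."""
    pairs = sorted(zip(values, labels))
    parent = entropy(labels)
    best = (None, 0.0)
    # Candidate cuts: midpoints between adjacent distinct sorted values.
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        cut = (v1 + v2) / 2
        left = [lab for v, lab in pairs if v <= cut]
        right = [lab for v, lab in pairs if v > cut]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = parent - weighted
        if gain > best[1]:
            best = (cut, gain)
    return best


temperature = [15, 18, 19, 22, 24, 27]  # degrees Celsius
play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]

threshold, gain = best_threshold(temperature, play_tennis)
print(threshold, round(gain, 3))  # prints 18.5 0.459
```

Under these assumptions the best single cut is Temperature > 18.5, which separates the two leading "No" examples from the rest. A single threshold cannot also isolate the "No" at 27°C; that would need a further split lower in the tree.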
Temperature   15°C   18°C   19°C   22°C   24°C   27°C
PlayTennis    No     No     Yes    Yes    Yes    No

SPLITTING CONTINUOUS ATTRIBUTES
Different ways of handling:
Discretization to form an ordinal categorical attribute.
Binary decision: (A < v) or (A ≥ v). Consider all possible splits and find the best cut v; this can be more compute-intensive.

OCCAM’S RAZOR
“If two theories explain the facts equally well, then the simpler theory is to be preferred.”
Arguments in favor:
Prefer short hypotheses to long hypotheses.
A short hypothesis that fits the data is unlikely to be a coincidence.
A long hypothesis that fits the data might be a coincidence.

OVERFITTING
One of the biggest problems with decision trees is overfitting.

AVOID OVERFITTING
Stop growing when a split is not statistically significant, or grow the full tree and then post-prune.
Select the “best” tree: measure performance over the training data, or measure performance over a separate validation data set.

STOPPING CRITERIA FOR TREE INDUCTION
1. Grow the entire tree.
   Stop expanding a node when all the records belong to the same class.
   Stop expanding a node when all the records have the same attribute values.
2. Pre-pruning (do not grow the complete tree).
   Stop when only x examples are left.
   … other pre-pruning strategies

HOW TO ADDRESS OVERFITTING IN DECISION TREES
The most popular approach: post-pruning.
Grow the decision tree to its entirety.
Trim the nodes of the decision tree in a bottom-up fashion.
If generalization error improves after trimming, replace the sub-tree by a leaf node.
The class label of the leaf node is determined from the majority class of instances in the sub-tree.

EFFECT OF REDUCED ERROR PRUNING

ADVANTAGES: DECISION TREE BASED CLASSIFICATION
Inexpensive to construct.
Extremely fast at classifying unknown records.
Easy to interpret for small-sized trees.
Can handle both continuous and symbolic attributes.
Accuracy is comparable to other classification techniques for many simple data sets.
Kind of a standard: if you want to show that your “new” classification technique really “improves the world”, compare its performance against decision trees (e.g. C5.0) using 10-fold cross-validation.
More recently, forests (ensembles of decision trees) have gained some popularity.

DISADVANTAGES: DECISION TREE BASED CLASSIFICATION
Relies on a rectangular approximation that might not be good for some datasets.
Selecting good learning algorithm parameters (e.g. degree of pruning) is non-trivial.

QUESTIONS?