9 Learning from Examples
Introduction to Machine Learning (ML)

Machine learning is about creating new facts that cannot be inferred logically from existing knowledge and experience. Machine learning can be useful in a number of tasks that require knowledge:

- Detection: discovering implicitly present interference from the outside world
- Classification: grouping items into categories, groups, or classes according to certain discriminating characteristics
- Recognition: establishing the class of an item based on common attributes
- Identification: unambiguously recognizing an item based on unique attributes
- Prediction: predicting the appearance of a particular object, class, or pattern

There are three types of feedback that can accompany the inputs, and they determine the three main types of learning:

- Supervised learning: the agent observes input-output pairs and learns a function that maps from input to output *** output prediction capability ***
- Unsupervised learning: the agent processes the input data and learns patterns in the input without any explicit feedback *** input classification capability ***
- Reinforcement (utility-based) learning: the agent learns from a series of reinforcements, rewards and punishments *** improvement over time ***

Supervised Learning

- Training set: N examples of an input-output mapping (x1, y1), (x2, y2), ..., (xN, yN), where each output was generated by an unknown function y = f(x).
- A model h is a hypothesis about the world that approximates the true function f; it is drawn from a hypothesis space H of possible functions (a model of the data, drawn from a model class H).
- Consistency hypothesis: h must be such that h(xi) = yi for each xi in the training set. In reality we learn only a best-fit function, for which h(xi) ≈ yi.
- The true measure of the quality of a hypothesis is how well it handles inputs it has not seen during training. We evaluate it by applying the function to inputs from a test set (xj, yj) and comparing each predicted output h(xj) to the actual yj.

Supervised Learning Quality

- Bias: the tendency of a predictive hypothesis to deviate from the expected value when averaged over different training sets.
- Underfitting: the hypothesis fails to find a pattern in the data, possibly due to insufficient training.
- Overfitting: the hypothesis fits the particular training data too closely and therefore performs poorly on unseen test data.
- Variance: the amount of change in the hypothesis due to fluctuations in the training data.
- Bias-variance tradeoff: a choice between more complex, low-bias hypotheses that fit the training data well and simpler, low-variance hypotheses that may generalize better (see the sketch below).
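As a minimal sketch of this tradeoff (assuming numpy; the sine target f and the noise level are invented for illustration), fitting polynomials of increasing degree to the same noisy sample shows training error falling steadily while test error rises again once the hypothesis starts to overfit:

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x):
        # Hypothetical true function; any smooth nonlinear target works here.
        return np.sin(2 * np.pi * x)

    n = 30
    x_train = rng.uniform(0, 1, n)
    y_train = f(x_train) + rng.normal(0, 0.2, n)   # noisy training set
    x_test = rng.uniform(0, 1, n)
    y_test = f(x_test) + rng.normal(0, 0.2, n)     # held-out test set

    def mse(h, x, y):
        # Empirical loss: mean squared error of hypothesis h on (x, y).
        return np.mean((h(x) - y) ** 2)

    for degree in (1, 3, 12):
        # H = polynomials of the given degree; h = best fit within H.
        h = np.poly1d(np.polyfit(x_train, y_train, degree))
        print(f"degree {degree:2d}: train {mse(h, x_train, y_train):.3f}, "
              f"test {mse(h, x_test, y_test):.3f}")

Degree 1 typically underfits (high error on both sets), while degree 12 fits the training sample almost perfectly but generalizes worse than degree 3.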
Supervised Learning - Decision Trees

- A decision tree is a representation of a function that maps a vector of attribute values to a single output value, a "decision."
- The tree reaches its decision by performing a sequence of tests, starting at the root and following the appropriate branch until a leaf is reached.
- Each internal node in the tree corresponds to a test of the value of one of the input attributes.
- The branches from the node are labeled with the possible values of the attribute.
- The leaf nodes specify what value is to be returned by the function.

Aim: find a small tree consistent with the training examples.
Idea: (recursively) choose the "most significant" attribute, in the sense of the one closest to determining the decision, as the root of every subtree.

Decision Trees – Applicability

Decision trees can be made more widely useful by handling the following complications:
- Missing data
- Continuous and multivalued input attributes
- Continuous-valued output attributes

Decision trees are also unstable, in that adding just one new example can change the test at the root, which changes the entire tree.

Model Selection and Optimization

The task of finding a good hypothesis can be split into two subtasks:
- Model selection: choosing a good hypothesis space.
- Optimization (training): finding the best hypothesis within that space.

The data can be split into a training set, used to create the hypothesis, and a test set, used to evaluate it.

Error rate: the proportion of times that h(x) ≠ y for a sample (x, y).

When comparing different models, three data sets are needed:
- A training set to train candidate models.
- A validation set, also known as a development set or dev set, to evaluate the candidate models and choose the best one.
- A test set to do a final unbiased evaluation of the best model.

When there is not enough data to create three sets, use k-fold cross-validation (see the sketch below):
- Split the data into k equal subsets; popular values for k are 5 and 10.
- Perform k rounds of learning; on each round, 1/k of the data is held out as a validation set and the remaining examples are used as the training set.

The criterion for selection is to minimize a loss function rather than maximize a utility function. The loss function L(x, y, ŷ) is defined as the amount of utility lost by predicting h(x) = ŷ when the correct answer is f(x) = y. A simplified version, independent of x, is L(y, ŷ). The learning agent maximizes its expected utility by choosing the hypothesis that minimizes expected loss over all input-output pairs it will see.
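A minimal sketch of k-fold cross-validation used for model selection (assuming numpy; the polynomial model class and the data are invented for illustration):

    import numpy as np

    def k_fold_cv(x, y, fit, predict, k=5):
        # Average validation error over k rounds; each round holds out 1/k
        # of the data as a validation set and trains on the rest.
        idx = np.random.default_rng(1).permutation(len(x))
        folds = np.array_split(idx, k)
        errors = []
        for i in range(k):
            val = folds[i]
            train = np.concatenate(folds[:i] + folds[i + 1:])
            model = fit(x[train], y[train])
            errors.append(np.mean((predict(model, x[val]) - y[val]) ** 2))
        return np.mean(errors)

    # Model selection: pick the polynomial degree with the lowest CV error.
    rng = np.random.default_rng(2)
    x = np.linspace(0, 1, 50)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 50)
    for degree in (1, 3, 12):
        err = k_fold_cv(x, y,
                        fit=lambda a, b, d=degree: np.polyfit(a, b, d),
                        predict=lambda m, a: np.poly1d(m)(a))
        print(f"degree {degree:2d}: CV error {err:.3f}")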
Parametric Models

A parametric model is a learning model that summarizes the data with a set of parameters of fixed size (independent of the number of training examples).

Nonparametric Models

A nonparametric model is a model that cannot be characterized by a bounded set of parameters.

Example: the simplest nonparametric learning method is table lookup: take all the training examples, put them in a lookup table, and then, when asked for h(x), see if x is in the table; if it is, return the corresponding y.

Support Vector Machines (SVM)

SVMs retain three attractive properties over deep learning networks and random forests, which are much more complex methods:
- SVMs construct a maximum margin separator: a decision boundary with the largest possible distance to the example points.
- SVMs create a linear separating hyperplane (though the kernel trick allows the data to be embedded in a higher-dimensional space first).
- SVMs are nonparametric: the separating hyperplane is defined by a set of examples (the support vectors).

Instead of minimizing expected empirical loss on the training data, SVMs attempt to minimize expected generalization loss.

Other Non-parametric Methods

- Locality-sensitive hashing (LSH)
- Nonparametric regression
- The kernel trick

Developing Machine Learning Systems

- Problem formulation: define the problem, the input, the output, and the loss function; choose the metrics that should be tracked.
- Data collection, assessment, and management: when data are limited, data augmentation can help; for unbalanced classes, undersample the majority and oversample the minority; feature engineering; exploratory data analysis (EDA).
- Model selection and training: receiver operating characteristic (ROC) curve; confusion matrix (a sketch follows this list).
- Trust, interpretability, and explainability: source control; testing; review; monitoring; accountability; inspect the actual model and understand why it gave a particular answer for an input.
- Operation, monitoring, and maintenance: monitor performance on live data; nonstationarity, i.e., the world changes over time.
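As a minimal sketch of one of the evaluation tools above, a confusion matrix can be computed directly from true and predicted labels (assuming numpy; the labels are invented for illustration):

    import numpy as np

    def confusion_matrix(y_true, y_pred, n_classes):
        # cm[i, j] counts examples whose true class is i and predicted class is j.
        cm = np.zeros((n_classes, n_classes), dtype=int)
        for t, p in zip(y_true, y_pred):
            cm[t, p] += 1
        return cm

    # Hypothetical binary classification results.
    y_true = [0, 0, 1, 1, 1, 0, 1, 0]
    y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
    cm = confusion_matrix(y_true, y_pred, n_classes=2)
    print(cm)   # diagonal = correct predictions, off-diagonal = errors

    # In the binary case, precision and recall follow directly from the matrix.
    tp, fp, fn = cm[1, 1], cm[0, 1], cm[1, 0]
    print("precision:", tp / (tp + fp), "recall:", tp / (tp + fn))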