Introduction to Machine Learning PDF

Introduction to Machine Learning Module 1: Introduction Part A: Introduction Sudeshna Sarkar IIT Kharagpur Overview of Course 1. Introduction 2. Linear Regression and Decision Trees 3. Instance based learning Feature Selection 4. Probability and Bayes Learning 5. Support Vector Machines 6. Neural Network 7. Introduction to Computational Learning Theory 8. Clustering Module 1 1. Introduction a) Introduction b) Different types of learning c) Hypothesis space, Inductive Bias d) Evaluation, Training and test set, cross-validation 2. Linear Regression and Decision Trees 3. Instance based learning Feature Selection 4. Probability and Bayes Learning 5. Support Vector Machines 6. Neural Network 7. Introduction to Computational Learning Theory 8. Clustering Machine Learning History 1950s: – Samuel's checker-playing program 1960s: – Neural network: Rosenblatt's perceptron – Minsky & Papert prove limitations of Perceptron 1970s: – Symbolic concept induction – Expert systems and knowledge acquisition bottleneck – Qui la ’s ID3 – Natural language processing (symbolic) Machine Learning History 1980s: – Advanced decision tree and rule learning – Learning and planning and problem solving – Resurgence of neural network – Valia t’s PAC learning theory – Focus on experimental methodology 90's ML and Statistics – Data Mining – Adaptive agents and web applications – Text learning 1994: Self-driving car – Reinforcement learning road test – Ensembles 1997: Deep Blue – Bayes Net learning beats Gary Kasparov Machine Learning History Popularity of this field in recent time and the reasons 2009: Google builds self driving car behind that 2011: Watson wins – New software/ algorithms Jeopardy Neural networks 2014: Human vision Deep learning surpassed by ML systems – New hardware GPU’s – Cloud Enabled – Availability of Big Data Programs vs learning algorithms Algorithmic solution Machine learning solution Data Program Data Output Computer Computer Output Program Machine Learning : Definition Learning is the ability to improve one's behaviour based on experience. Build computer systems that automatically improve with experience What are the fundamental laws that govern all learning processes? Machine Learning explores algorithms that can – learn from data / build a model from data – use the model for prediction, decision making or solving some tasks Machine Learning : Definition A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. [Mitchell] Components of a learning problem Task: The behaviour or task being improved. – For example: classification, acting in an environment Data: The experiences that are being used to improve performance in the task. Measure of improvement : – For example: increasing accuracy in prediction, acquiring new, improved speed and efficiency Black-box Learner Experiences Problem/ Data Task Background knowledge/ Answer/ Bias Performance Learner Experiences Problem/ Data Task Models Learner Reasoner Background knowledge/ Answer/ Bias Performance Many domains and applications Medicine: Diagnose a disease – Input: symptoms, lab measurements, test results, DNA tests, – Output: o e of set of possi le diseases, or o e of the a o e Data: historical medical records Learn: which future patients will respond best to which treatments Many domains and applications Vision: say what objects appear in an image convert hand-written digits to characters 0..9 detect where objects appear in an image Many domains and applications Robot control: Design autonomous mobile robots that learn from experience to – Play soccer – Navigate from their own experience Many domains and applications NLP: detect where entities are mentioned in NL detect what facts are expressed in NL detect if a product/movie review is positive, negative, or neutral Speech recognition Machine translation Many domains and applications Financial: predict if a stock will rise or fall predict if a user will click on an ad or not Application in Business Intelligence Forecasting product sales quantities taking seasonality and trend into account. Identifying cross selling promotional opportunities for consumer goods. … Some other applications Fraud detection : Credit card Providers determine whether or not someone will default on a home mortgage. Understand consumer sentiment based off of unstructured text data. Fore asti g o e ’s o i tio rates ased off external macroeconomic factors. Learner Experiences Problem/ Data Task Models Learner Reasoner Background knowledge/ Answer/ Bias Performance Design a Learner Experiences Problem/ 1. Choose the training Data Task experience Models 2. Choose the target function (that is to be Learn Reaso learned) er ner 3. Choose how to represent the target function Background 4. Choose a learning Answer/ knowledge/ Performance algorithm to infer the Bias target function Choosing a Model Representation The richer the representation, the more useful it is for subsequent problem solving. The richer the representation, the more difficult it is to learn. Components of Representation – Features – Function class / hypothesis language Foundations of Machine Learning Module 1: Introduction Part B: Different types of learning Sudeshna Sarkar IIT Kharagpur Module 1 1. Introduction a) Introduction b) Different types of learning c) Hypothesis space, Inductive Bias d) Evaluation, Training and test set, cross-validation 2. Linear Regression and Decision Trees 3. Instance based learning Feature Selection 4. Probability and Bayes Learning 5. Neural Network 6. Support Vector Machines 7. Introduction to Computational Learning Theory 8. Clustering Broad types of machine learning Supervised Learning – X,y (pre-classified training examples) – Given an observation x, what is the best label for y? Unsupervised learning –X – Give a set of ’s, cluster or su arize the Semi-supervised Learning Reinforcement Learning – Determine what to do based on rewards and punishments. Supervised Learning X y Input1 Output1 New Input x Input2 Output2 Input3 Output3 Learning Algorithm Model Input-n Output-n Output y Unsupervised Learning X Clusters Input1 Input2 Input3 Learning Algorithm Input-n Semi-supervised learning Reinforcement Learning Action at State st St+1 Agent Environment Reward rt rt+1 Reinforcement Learning Action at State st St+1 RLearner Environment Reward rt rt+1 Q-values update State, Policy Action at action state Best State st St+1 User Environment Reward rt rt+1 Supervised Learning Given: – a set of input features 1, … , 𝑛 – A target feature – a set of training examples where the values for the input features and the target features are given for each example – a new example, where only the values for the input features are given Predict the values for the target features for the new example. – classification when Y is discrete – regression when Y is continuous Classification Example: Credit scoring Differentiating between low-risk and high-risk customers from their income and savings Regression Example: Price of a used car x : car attributes y = wx+w0 y : price y: price y = g (x, θ ) g ( ) model, 𝜃 parameters x: mileage Features Often, the individual observations are analyzed into a set of quantifiable properties which are called features. May be – categorical (e.g. "A", "B", "AB" or "O", for blood type) – ordinal (e.g. "large", "medium" or "small") – integer-valued (e.g. the number of words in a text) – real-valued (e.g. height) Example Data Training Examples: Action Author Thread Length Where e1 skips known new long Home e2 reads unknown new short Work e3 skips unknown old long Work e4 skips known old long home e5 reads known new short home e6 skips known old long work New Examples: e7 ??? known new short work e8 ??? unknown new short work Training Set Training Learning Algorithm Testing Hypothesis Predicted X y Training phase Label machine learning feature algorithm extractor features Input Testing Phase feature classifier Label extractor model features Input Classification learning Task T: – input: – output: Performance metric P: Experience E: Classification learning Task T: – input: a set of instances d1,…,dn an instance has a set of features we can represent an instance as a vector d= – output: a set of predictions ŷ1,..., ŷn one of a fixed set of constant values: – {+1,-1} or {cancer, healthy}, or {rose, hi is us, jas i e, …}, or … Performance metric P: Experience E: Classification Learning Task Instance Labels medical patient record: {-1,+1} = low, high risk diagnosis blood pressure diastolic, blood of heart disease pressure systolic, age, sex (0 or 1), BMI, cholesterol finding entity a word in context: capitalized {first,later,outside} = names in text (0,1), word-after-this-equals- first word in name, Inc, bigram-before-this-equals- second or later word acquired-by, … in name, not in a name image image: {0,1} = no house, recognition 1920*1080 pixels, each with a house code for color Classification learning we care about performance on the Task T: distribution, not the training data – input: a set of instances d1,…,dn – output: a set of predictions ŷ1,..., ŷn Performance metric P: – Prob (wrong prediction) on examples from D Experience E: – a set of labeled examples (x,y) where y is the true label for x – ideally, examples should be sampled from some fixed distribution D Classification Learning Task Instance Labels Getting data medical patient record: risk of heart wait and look diagnosis lab readings disease for heart disease finding entity a word in context: {first,later,outside} text with names in text capitalized, manually nearby words,... annotated entities image image: no house, house hand-labeled recognition pixels images Representations Weekend 1. Decision Tree Yes EatOut No Late Yes No EatOut Home 2. Linear function Representations 3. Multivariate linear function 4. Single layer perceptron Representations 5. Multi-layer neural network Hypothesis Space One way to think about a supervised learning machine is as a device that explores a h pothesis space. – Each setting of the parameters in the machine is a different hypothesis about the function that maps input vectors to output vectors. Terminology Features: The number of features or distinct traits that can be used to describe each item in a quantitative manner. Feature vector: n-dimensional vector of numerical features that represent some object Instance Space X: Set of all possible objects describable by features. Example (x,y): Instance x with label y=f(x). Terminology Concept c: Subset of objects from X (c is unknown). Target Function f: Maps each instance x ∈ X to target label y ∈ Y Example (x,y): Instance x with label y=f(x). Training Data S: Collection of examples observed by learning algorithm. Used to discover potentially predictive relationships Foundations of Machine Learning Module 1: Introduction Part c: Hypothesis Space and Inductive Bias Sudeshna Sarkar IIT Kharagpur Inductive Learning or Prediction Given examples of a function (X, F(X)) – Predict function F(X) for new examples X Classification F(X) = Discrete Regression F(X) = Continuous Probability estimation F(X) = Probability(X): Features Features: Properties that describe each instance in a quantitative manner. Feature vector: n-dimensional vector of features that represent some object Feature Space Example: + 3.0 + + + - + + - - - + - 2.0 - + - + + - - - - 1.0 + + + - - 0.0 0.0 1.0 2.0 3.0 4.0 5.0 6.0 Slide by Jesse Davis: University of Washington Terminology Hypothesis: Function for labeling examples Label: + + Label: - 3.0 + ? + + - + + - - - + - 2.0 - ? ? + - + + - - - - 1.0 + + + - ? - 0.0 0.0 1.0 2.0 3.0 4.0 5.0 6.0 Slide by Jesse Davis: University of Washington Terminology Hypothesis Space: Set of legal hypotheses + 3.0 + + + - + + - - - + - 2.0 - + - + + - - - - 1.0 + + + - - 0.0 0.0 1.0 2.0 3.0 4.0 5.0 6.0 Slide by Jesse Davis: University of Washington Representations Weekend 1. Decision Tree Yes EatOut No Late Yes No EatOut Home 2. Linear function Representations 3. Multivariate linear function 4. Single layer perceptron 5. Multi-layer neural networks Hypothesis Space The space of all hypotheses that can, in principle, be output by a learning algorithm. We can think about a supervised learning machine as a device that explores a h pothesis space. – Each setting of the parameters in the machine is a different hypothesis about the function that maps input vectors to output vectors. Terminology Example (x,y): Instance x with label y. Training Data S: Collection of examples observed by learning algorithm. Instance Space X: Set of all possible objects describable by features. Concept c: Subset of objects from X (c is unknown). Target Function f: Maps each instance x ∈ X to target label y ∈ Y Classifier Hypothesis h: Function that approximates f. Hypothesis Space ℋ : Set of functions we allow for approximating f. The set of hypotheses that can be produced, can be restricted further by specifying a language bias. Input: Training set 𝒮 ⊆ Output: A hypothesis ℎ ∈ ℋ Hypothesis Spaces If there are 4 (N) input features, there are 6 𝑁 2 2 possible Boolean functions. We cannot figure out which one is correct unless we see every possible input-output pair 24 (2𝑁 ) Example Hypothesis language 1. may contain representations of all polynomial functions from X to Y if X = ℛ 𝑛 and Y = ℛ, 2. may be able to represent all conjunctive concepts over X when = 𝐵𝑛 and = 𝐵 (with B the set of booleans). Hypothesis language reflects an inductive bias that the learner has Inductive Bias Need to make assumptions – E perie ce alo e does ’t allow us to ake conclusions about unseen data instances Two types of bias: – Restriction: Limit the hypothesis space (e.g., look at rules) – Preference: Impose ordering on hypothesis space (e.g., more general, consistent with data) Inductive learning Inductive learning: Inducing a general function from training examples – Construct hypothesis h to agree with c on the training examples. – A hypothesis is consistent if it agrees with all training examples. – A hypothesis said to generalize well if it correctly predicts the value of y for novel example. Inductive Learning is an Ill Posed Problem: Unless we see all possible examples the data is not sufficient for an inductive learning algorithm to find a unique solution. Inductive Learning Hypothesis Any hypothesis h found to approximate the target function c well over a sufficiently large set of training examples D will also approximate the target function well over other unobserved examples. Learning as Refining the Hypothesis Space Concept learning is a task of searching an hypotheses space of possible representations looking for the representation(s) that best fits the data, given the bias. The tendency to prefer one hypothesis over another is called a bias. Given a representation, data, and a bias, the problem of learning can be reduced to one of search. Occam's Razor ⁻ A classical example of Inductive Bias the simplest consistent hypothesis about the target function is actually the best Some more Types of Inductive Bias Minimum description length: when forming a hypothesis, attempt to minimize the length of the description of the hypothesis. Maximum margin: when drawing a boundary between two classes, attempt to maximize the width of the boundary (SVM) Important issues in Machine Learning What are good hypothesis spaces? Algorithms that work with the hypothesis spaces How to optimize accuracy over future data points (overfitting) How can we have confidence in the result? (How much training data – statistical qs) Are some learning problems computationally intractable? Generalization Components of generalization error – Bias: how much the average model over all training sets differ from the true model? Error due to inaccurate assumptions/simplifications made by the model – Variance: how much models estimated from different training sets differ from each other Underfitting and Overfitting Underfitting: odel is too si ple to represe t all the relevant class characteristics – High bias and low variance – High training error and high test error Overfitting: odel is too co ple a d fits irrelevant characteristics (noise) in the data – Low bias and high variance – Low training error and high test error Foundations of Machine Learning Module 1: Introduction Part D: Evaluation and Cross validation Sudeshna Sarkar IIT Kharagpur Experimental Evaluation of Learning Algorithms Evaluating the performance of learning systems is important because: – Learning systems are usually designed to predict the lass of future unla eled data points. Typical choices for Performance Evaluation: – Error – Accuracy – Precision/Recall Typical choices for Sampling Methods: – Train/Test Sets – K-Fold Cross-validation Evaluating predictions Suppose we want to make a prediction of a value for a target feature on example x: – y is the observed value of target feature on example x. – is the predicted value of target feature on example x. – How is the error measured? Measures of error Absolute error: |𝑓 − | 𝑛 𝑛 Sum of squares error: 𝑖= 𝑓 − 𝑛 𝑛 Number of misclassifications: 𝑖= 𝛿 𝑓 , 𝑛 𝛿 𝑓 , is 1 if f(x)  y, and 0, otherwise. Confusion Matrix True class  Hypothesized Pos Neg Accuracy = class (TP+TN)/(P+N) Yes TP FP Precision = No FN TN TP/(TP+FP) P=TP+FN N=FP+TN Recall/TP rate = TP/P FP Rate = FP/N Sample Error and True Error The sample error of hypothesis f with respect to target function c and data sample S is: errors(f)= 1/n xS(f(x),c(x)) The true error (denoted errorD(f)) of hypothesis f with respect to target function c and distribution D, is the probability that h will misclassify an instance drawn at random according to D. errorD(f)= PrxD[f(x)  c(x)] Why Errors Errors in learning are caused by: – Limited representation (representation bias) – Limited search (search bias) – Limited data (variance) – Limited features (noise) Difficulties in evaluating hypotheses with limited data Bias in the estimate: The sample error is a poor estimator of true error – ==> test the hypothesis on an independent test set We divide the examples into: – Training examples that are used to train the learner – Test examples that are used to evaluate the learner Variance in the estimate: The smaller the test set, the greater the expected variance. Validation set Validation fails to use all the available data k-fold cross-validation 1. Split the data into k equal subsets 2. Perform k rounds of learning; on each round – 1/k of the data is held out as a test set and – the remaining examples are used as training data. 3. Compute the average test set score of the k rounds K-fold cross validation Trade-off In machine learning, there is always a trade- off between – complex hypotheses that fit the training data well – simpler hypotheses that may generalise better. As the amount of training data increases, the generalization error decreases.

Introduction to Machine Learning PDF

Document Details

Tags

Related

Summary

Full Transcript

Upgrade to continue