Week 3 Review of Machine Learning PDF
Dr. Samer Arafat
Summary
This document is a lecture reviewing machine learning. It covers probability, Bayes' rule, prior probability, conditional probability, and inference by enumeration, using examples and notation, and then reviews a full machine learning system, feature extraction, linear regression with one variable, and gradient descent.
Full Transcript
Week 3 Review of Machine Learning
Dr. Samer Arafat

Week 3 Summary
- Recall and finish probability from last time
- Full machine learning system
- Real-life historic data set collection example
- Significance of feature extraction
- Start the review of machine learning

Probability and Bayes Rule

Probability Basics
Prior, conditional and joint probability:
- Prior probability: P(X)
- Conditional probability: P(X1|X2), P(X2|X1)
- Joint probability: X = (X1, X2), P(X) = P(X1, X2)
- Relationship: P(X1, X2) = P(X2|X1) P(X1) = P(X1|X2) P(X2)
- Independence: P(X2|X1) = P(X2), P(X1|X2) = P(X1), P(X1, X2) = P(X1) P(X2)

Bayesian Rule
  P(C|X) = P(X|C) P(C) / P(X)

Prior Probability
Prior or unconditional probabilities of propositions, e.g., P(Infection = true) = 0.2 and P(Weather = sunny) = 0.72, correspond to belief prior to the arrival of any (new) evidence.
A probability distribution gives values for all possible assignments: P(Weather) is normalized, i.e., it sums up to 1.
A joint probability distribution for a set of random variables gives the probability of every atomic event on those random variables. P(Weather, Infection) is a 2 × 4 matrix of values:

                      Weather = sunny   rainy   cloudy   snow
  Infection = true         0.25          0.05    0.05    0.00
  Infection = false        0.55          0.10    0.05    0.05

Every question about a domain can be answered by the joint probability distribution.

Conditional Probability
Conditional or posterior probabilities, e.g., P(Infection | fever) = 0.8, i.e., given that fever evidence is all I know.
(Notation for conditional distributions: P(Infection | Fever) is a 2-element vector of 2-element vectors.)
If we know more, e.g., infection is also given, then we have P(infection | fever, infection) = 1.
New evidence (an observation) may be irrelevant, allowing simplification, e.g., P(Infection | fever, sunny) = P(Infection | fever) = 0.8.
This kind of inference, sanctioned by domain knowledge, is crucial.

Definition of conditional probability:
  P(a | b) = P(a, b) / P(b), if P(b) > 0
The product rule gives an alternative formulation:
  P(a, b) = P(a | b) P(b) = P(b | a) P(a)
A general version holds for whole distributions, e.g.,
  P(Weather, Infection) = P(Weather | Infection) P(Infection)
The chain rule is derived by successive application of the product rule:
  P(X1, …, Xn) = P(X1, …, Xn-1) P(Xn | X1, …, Xn-1)
               = P(X1, …, Xn-2) P(Xn-1 | X1, …, Xn-2) P(Xn | X1, …, Xn-1)
               = …
               = Π_{i=1..n} P(Xi | X1, …, Xi-1)

Inference by Enumeration
Start with the joint probability distribution over Infection, Fever, and Blood test:

                      fever                          ¬fever
                blood test   ¬blood test       blood test   ¬blood test
  infection       0.108        0.012             0.072        0.008
  ¬infection      0.016        0.064             0.144        0.576

For any proposition φ, sum up the atomic events where it is true:
  P(φ) = Σ_{ω: ω ⊨ φ} P(ω)
Example: P(fever) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2
We can also compute conditional probabilities:
  P(¬infection | fever) = P(¬infection, fever) / P(fever)
                        = (0.016 + 0.064) / (0.108 + 0.012 + 0.016 + 0.064)
                        = 0.4
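To make inference by enumeration concrete, here is a minimal Python sketch (not from the slides) that stores the fever / blood test / infection joint distribution above in a dictionary and recovers P(fever) and P(¬infection | fever) by summing atomic events. The helper names joint, prob, and conditional are illustrative, not part of the lecture.

```python
# Joint distribution over (infection, fever, blood_test) from the table above,
# keyed by truth-value triples; the eight probabilities sum to 1.
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

def prob(event):
    """P(phi): sum the atomic events (entries of the joint) where phi holds.
    `event` maps a variable index to its required truth value."""
    return sum(p for world, p in joint.items()
               if all(world[i] == v for i, v in event.items()))

def conditional(event, given):
    """P(event | given) = P(event, given) / P(given)."""
    return prob({**event, **given}) / prob(given)

# Variable indices: 0 = infection, 1 = fever, 2 = blood test
print(round(prob({1: True}), 3))                     # P(fever) = 0.2
print(round(conditional({0: False}, {1: True}), 3))  # P(not infection | fever) = 0.4
```

The same two helpers answer any query over these three variables, which is the point of "every question about a domain can be answered by the joint probability distribution."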
Independence
A and B are independent iff P(A|B) = P(A), or P(B|A) = P(B), or P(A, B) = P(A) P(B).
Example: P(Fever, Blood test, Infection, Weather) = P(Fever, Blood test, Infection) P(Weather)

Bayes' Rule
The product rule P(a, b) = P(a | b) P(b) = P(b | a) P(a) gives Bayes' rule:
  P(a | b) = P(b | a) P(a) / P(b)
or, in distribution form,
  P(Y | X) = P(X | Y) P(Y) / P(X)
This is useful for assessing a diagnostic probability from a causal probability:
  P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect)
E.g., let H be headache and S be stiff neck, with P(H | S) = 0.25, P(S) = 0.1, and P(H) = 0.05:
  P(S | H) = P(H | S) P(S) / P(H) = (0.25 × 0.1) / 0.05 = 0.5

Machine Learning Review

A Machine Learning System
Data Collection → Raw Data → Preprocessing → Clean Data → Feature Extraction → Feature Vectors → Training Examples / Testing Examples → Machine Learning (Classifier Training) → Classifier → Results

Data Collection with Manual Feature Extraction: Historic Example (Iris Data Set)
The Iris flower data set, or Fisher's Iris data set, is a multivariate data set used and made famous by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis. It contains 150 flowers in total.

Iris Data Class (Species) and Feature Names
The classes are 3 flower species, with 50 samples each. The feature names are sepal length, sepal width, petal length, and petal width.

Evaluation
We evaluate good vs. not-so-good features by plotting a pair-wise classification of the feature data (patterns), as in the sketch below:
- Good features do not show much overlapping of the classes.
- Not-so-good features show a lot of overlap.
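As a quick illustration of this pair-wise evaluation (my own sketch, not code from the lecture), the snippet below loads scikit-learn's bundled copy of the Iris data and plots a strongly separating pair (petal length vs. petal width) next to a weaker pair (sepal length vs. sepal width). scikit-learn and matplotlib are assumptions here; the lecture does not prescribe any particular tool.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target  # 150 samples, 4 features, 3 species

# Feature indices: 0 = sepal length, 1 = sepal width, 2 = petal length, 3 = petal width
pairs = [(2, 3, "good pair: petal length vs. petal width"),
         (0, 1, "weaker pair: sepal length vs. sepal width")]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (i, j, title) in zip(axes, pairs):
    for cls, name in enumerate(iris.target_names):
        mask = (y == cls)
        ax.scatter(X[mask, i], X[mask, j], label=name, alpha=0.7)
    ax.set_xlabel(iris.feature_names[i])
    ax.set_ylabel(iris.feature_names[j])
    ax.set_title(title)
axes[0].legend()
plt.tight_layout()
plt.show()
```

The left panel shows three nearly separated clusters, while the right panel shows considerable class overlap, which is the visual criterion described above.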
A Machine Learning System: Next Step
(Back to the pipeline above: the next stage is Feature Extraction.)

Feature Extraction: Brief Introduction
How do we extract features from raw data? There are many types of features, such as:
- Entropy-based
- Statistical
- Wavelet transform
- Fourier transform
- Convolutions (as in a CNN)
- Texture, e.g., for images
- Shape, color, orientation (using gradients), etc.

Example of Good vs. Bad Features
[Figure: scatter plots contrasting a well-separating feature pair with a heavily overlapping one; compare the Iris sketch above.]

A Machine Learning System: Next Step
(The next stage is Machine Learning: training a classifier on the extracted feature vectors.)

Machine Learning: A Review

ML Algorithms Review
- Supervised vs. unsupervised learning
- kNN (instance-based learning) - skipped
- Linear regression and regularization
- Logistic regression
- BP (backpropagation)
- SVM
- K-means
- PCA
- XGBoost
- Other learning types: self-supervised and semi-supervised
- Reinforcement learning (if time allows)

Supervised vs. Unsupervised Learning
In supervised learning, the training set comes with labels (the "right answers"). In unsupervised learning, the training set has no labels, and we use a clustering algorithm to find "natural" groupings.

Some Applications of Clustering
- Social network analysis: find coherent groups of people
- Market segmentation
- Organization of computing clusters: find out which computers work together, so that you can lay out the network and design the data and communication centers
- Astronomical data analysis, e.g., organizing SkyCat project resources
(Image credit: NASA/JPL-Caltech/E. Churchwell (Univ. of Wisconsin, Madison))

Some Supervised Learning Applications
- Automatic classification and indexing of information
- Science and astronomy
- Search engines
- Service robots
- Industry
- Military
- Medical robots
- Diagnosis
- ... and more

Linear Regression with One Variable

Housing Prices Data Set (Portland, OR)
Training set of housing prices:

  Size in feet² (x)    Price ($) in 1000's (y)
  2104                 460
  1416                 232
  1534                 315
  852                  178
  …                    …

[Plot: price (in 1000s of dollars) vs. size (feet²) for the Portland, OR housing prices.]

Supervised Learning: Regression Problem
We are given the "right answer" for each example in the data, and we predict a real-valued output.

Notation:
  m   = number of training examples
  x's = "input" variable / features
  y's = "output" variable / "target" variable

How do we represent the hypothesis, h?
  Training Set → Learning Algorithm → h
  Size of house → h → Estimated house price
This is linear regression with one variable; the hypothesis is the predictor line.

Given a training set (the table above), use the hypothesis:
  hθ(x) = θ0 + θ1·x
The θ's are called parameters.

Q: How do we pick the best predictor (regression line) for our data?
To get a feel for regression lines, let's first try a few demonstrating examples.
[Three example plots of candidate regression lines for different parameter values; see also the sketch below.]
How do we choose the θ's?
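Before choosing the parameters automatically, it may help to see (in a sketch that is not from the slides) what the hypothesis hθ(x) = θ0 + θ1·x actually computes on the housing table above. The two parameter settings below are arbitrary and only show that different θ's give different predictor lines.

```python
import numpy as np

# Training data from the housing-prices table (size in square feet, price in $1000s)
x = np.array([2104, 1416, 1534, 852], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)

def h(theta0, theta1, x):
    """One-variable hypothesis: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# Two arbitrary parameter choices; each defines a different predictor line.
for theta0, theta1 in [(0.0, 0.20), (50.0, 0.10)]:
    preds = h(theta0, theta1, x)
    print(f"theta0={theta0}, theta1={theta1} -> predictions in $1000s: {np.round(preds, 1)}")
```

The cost function introduced next quantifies which of such lines fits the training data better.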
Cost Function (Objective Function)
Idea: choose θ0, θ1 so that hθ(x) is close to y for our training examples (x, y).
Strategy: pick the parameter values that minimize a carefully selected objective (cost) function.

Summary
- A squared-error function works well for regression problems, mainly because taking the derivative requires a smooth and differentiable function, like a quadratic.
- The constant 2 is added to the denominator in order to help compute the derivative (later on).
- To summarize these ideas, so far we get:
  Hypothesis:    hθ(x) = θ0 + θ1·x
  Parameters:    θ0, θ1
  Cost function: J(θ0, θ1) = (1 / 2m) Σ_{i=1..m} (hθ(x^(i)) − y^(i))²
  Goal:          minimize J(θ0, θ1) over θ0, θ1

Some Intuition on the Cost and Hypothesis Functions
Simplified hypothesis: hθ(x) = θ1·x (with θ0 = 0), a single parameter θ1 and cost function J(θ1).
We will compute and compare the cost function to the hypothesis function, using different parameter values on a simple 3-point data set, and plot both the hypothesis and the corresponding cost values.
[Slide plots: for each fixed θ1, the left panel shows hθ(x) as a function of x over the 3-point data set, and the right panel shows the corresponding point of J(θ1) as a function of the parameter θ1; the slides repeat this for several values of θ1, adding a new cost point each time.]

Back to the original formulation with 2 parameters:
  Hypothesis:    hθ(x) = θ0 + θ1·x
  Parameters:    θ0, θ1
  Cost function: J(θ0, θ1) = (1 / 2m) Σ_{i=1..m} (hθ(x^(i)) − y^(i))²
  Goal:          minimize J(θ0, θ1)
[Slide plots: the hypothesis line over the housing data (price in $1000's vs. size in feet²) on the left, and J(θ0, θ1) as a contour plot over the parameters on the right.]
Pick a few specific points on the cost function's contour lines, find the corresponding parameter values, and then construct and plot the corresponding hypotheses. Observe the effect of changing the parameter values on the location and quality of the hypothesis lines.
So, we need a learning method that automatically adapts and adjusts the parameter values so that the error arrives at its minimum.

Gradient Descent Learning
Have some function J(θ0, θ1); we want to minimize it over θ0, θ1.
Outline:
- Start with some initial θ0, θ1.
- Keep changing θ0, θ1 to reduce J(θ0, θ1) until we hopefully end up at a minimum.

Sensitivity to Starting Points
[Animation: surface plots of J(θ0, θ1); depending on the current value of θ, gradient descent can end up at different local optima. Notice that the global error (optimum) is NOT zero in this plot!]

The Gradient Descent Algorithm
Repeat until convergence, updating θ0 and θ1 simultaneously:
  θj := θj − α · ∂J(θ0, θ1)/∂θj,  for j = 0, 1
where α is the learning rate.
- If α is too small, gradient descent can be slow.
- If α is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge.
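To tie the cost function and the update rule together, here is a short, self-contained Python sketch (my illustration, not lecture code) that runs batch gradient descent on a tiny one-variable problem. The 3-point data set with y = x, the learning rate α = 0.1, and the iteration count are assumptions chosen for the demo.

```python
import numpy as np

# Tiny data set with y = x, so J(theta0, theta1) is minimized at theta0 = 0, theta1 = 1.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(x)

def cost(theta0, theta1):
    """Squared-error cost: J = (1 / 2m) * sum((h(x) - y)^2)."""
    residuals = theta0 + theta1 * x - y
    return (residuals ** 2).sum() / (2 * m)

theta0, theta1 = 0.0, 0.0  # arbitrary starting point
alpha = 0.1                # fixed learning rate (assumed for this demo)

for it in range(1000):
    residuals = theta0 + theta1 * x - y
    # Partial derivatives of J with respect to theta0 and theta1
    grad0 = residuals.sum() / m
    grad1 = (residuals * x).sum() / m
    # Simultaneous update of both parameters
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    if it % 250 == 0:
        print(f"iter {it:4d}: theta0={theta0:.3f}, theta1={theta1:.3f}, J={cost(theta0, theta1):.6f}")

print(f"final:     theta0={theta0:.3f}, theta1={theta1:.3f}, J={cost(theta0, theta1):.6f}")
```

Even with α held fixed, the printed updates shrink on their own as the gradient flattens near the minimum, which is the behavior discussed on the next slide.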
Let's Plot the Gradient's Behavior with θ1
Recall from freshman-year Calculus 1 that the gradient at one specific point is, geometrically, the slope at that point:
  slope = (y2 − y1) / (x2 − x1)
Example: a 45-degree line has a positive slope; a 135-degree line has a negative slope.
Try two scenarios (a starting point where the slope is positive, and one where it is negative): how will gradient descent behave in each case?

Gradient descent can converge to a local minimum even with the learning rate α fixed. As we approach a local minimum, gradient descent automatically takes smaller steps, so there is no need to decrease α over time, unless you start with a large value of α.

Acknowledgement
These slides were prepared by Dr. Samer Arafat using adaptation and modifications of the following sources:
- Andrew Ng's notes
- Samer Arafat's notes
- Other internet sources and slide images