Elements of Machine Learning & Data Science Lecture Notes 2024/25 PDF
Document Details
RWTH Aachen University
2024
Bastian Leibe
Summary
These are lecture notes from a course on Elements of Machine Learning & Data Science, covering Bayes Decision Theory and an introduction to probability density estimation (parametric and nonparametric methods). The notes are from the Winter semester 2024/25 at RWTH Aachen University.
Full Transcript
Elements of Machine Learning & Data Science, Winter semester 2024/25
Lecture 3 – Bayes Decision Theory, 17.10.2024, Prof. Bastian Leibe

Announcement: Small-Group Exercises
- Weekly small-group exercises, spread over Monday to Friday (slots: 4x 10:30-12:00h, 4x 12:30-14:00h, 4x 12:30-14:00h, 8x 14:30-16:00h).
- We have set up a poll to collect your preferences for the exercise slots. Please enter your time slot preferences by Thursday evening (17.10.).
- Based on the poll results, we will assign you to exercise slots.

Machine Learning Topics
1. Introduction to ML (Machine Learning Concepts, Forms of Machine Learning, Bayes Decision Theory, Bayes Optimal Classification)
2. Probability Density Estimation
3. Linear Discriminants
4. Linear Regression
5. Logistic Regression
6. Support Vector Machines
7. AdaBoost
8. Neural Network Basics

Introduction
1. Motivation
2. Forms of learning
3. Terms, Concepts, and Notation
4. Bayes Decision Theory

Bayes Decision Theory (Reminder)
- Goal: predict an output class $C_k$ from measurements $\mathbf{x}$ (e.g., the pixel values in handwritten character recognition) while minimizing the probability of misclassification.
- How can we make such decisions optimally? Bayes Decision Theory gives us the tools for this.
- It is based on Bayes' Theorem; in the following, we will introduce its basic concepts.

Core Concept: Posterior (Reminder)
- What is the probability of class $C_k$ if we made a measurement $\mathbf{x}$? This a-posteriori probability can be computed via Bayes' Theorem after we have observed $\mathbf{x}$:
  $p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\, p(C_k)}{p(\mathbf{x})}$, with $p(\mathbf{x}) = \sum_j p(\mathbf{x} \mid C_j)\, p(C_j)$.
- Interpretation: posterior = (likelihood x prior) / normalization factor. This posterior is usually what we are interested in.

Making Optimal Decisions (Reminder)
- Our goal is to minimize the probability of a misclassification. The optimal decision rule for two classes is: decide for $C_1$ iff $p(C_1 \mid \mathbf{x}) > p(C_2 \mid \mathbf{x})$.
- For multiple classes: decide for $C_k$ iff $p(C_k \mid \mathbf{x}) > p(C_j \mid \mathbf{x})$ for all $j \neq k$.
- Once we can estimate posterior probabilities, we can use this rule to build classifiers.

Next Lectures…
- Ways to estimate the probability densities: parametric methods (Gaussian distribution, Mixtures of Gaussians) and non-parametric methods (Histograms, k-Nearest Neighbors, Kernel Density Estimation).
- Ways to directly model the posteriors: linear discriminants, logistic regression, SVMs, neural networks, …

Machine Learning Topics (overview repeated): the next chapter, 2. Probability Density Estimation, covers Parametric Methods & the ML-Algorithm, Nonparametric Methods, Mixtures of Gaussians & the EM-Algorithm, and Bayes Classifiers.

Probability Density Estimation
1. Probability Distributions
2. Parametric Methods
3. Nonparametric Methods
4. Mixture Models
5. Bayes Classifier
6. K-NN Classifier

Probability Distributions
- Up to now: Bayes optimal classification based on $p(\mathbf{x} \mid C_k)$ and $p(C_k)$. How can we estimate (= learn) those probability densities?
- Supervised training case: data and class labels are known. We estimate the probability density $p(\mathbf{x} \mid C_k)$ for each class $C_k$ separately, given the training samples of that class. (For simplicity of notation, we will drop the class label $C_k$ in the following.)
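Before turning to concrete density models, here is a minimal sketch of the decision rule above (Python; the function name and the toy likelihood and prior values are made up for illustration and are not from the lecture). It turns class-conditional likelihoods and priors into posteriors via Bayes' Theorem and decides for the class with the highest posterior.

```python
import numpy as np

def bayes_decision(likelihoods, priors):
    """Decide for the class with the highest posterior p(C_k | x).

    likelihoods: array of p(x | C_k) for one measurement x, one entry per class
    priors:      array of class priors p(C_k)
    """
    joint = likelihoods * priors      # p(x | C_k) * p(C_k)
    evidence = joint.sum()            # p(x) = sum_j p(x | C_j) p(C_j)
    posteriors = joint / evidence     # Bayes' Theorem
    return np.argmax(posteriors), posteriors

# Toy two-class example: p(x|C_1) = 0.2, p(x|C_2) = 0.05, equal priors
k, post = bayes_decision(np.array([0.2, 0.05]), np.array([0.5, 0.5]))
print(k, post)   # decides for class index 0 (C_1), posteriors [0.8, 0.2]
```

The sketch takes the likelihoods as given; the rest of the chapter is about how to estimate them from data.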
First, we look at the Gaussian distribution in more detail.

Probability Densities: The Gaussian (or Normal) Distribution
- One-dimensional (univariate) case, with mean $\mu$ and variance $\sigma^2$:
  $\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
- Multi-dimensional (multivariate) case, with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$:
  $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{\mathsf T} \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$

Gaussian Distribution: Shape
- Full covariance matrix: general ellipsoid shape.
- Diagonal covariance matrix: axis-aligned ellipsoid.
- Uniform variance ($\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}$): hypersphere.

Gaussian Distribution: Motivation – Central Limit Theorem
- The distribution of the sum of $N$ i.i.d. (independent and identically distributed) random variables becomes increasingly Gaussian as $N$ grows.
- In practice, the convergence to a Gaussian can be very rapid. This makes the Gaussian interesting for many applications.
- Example: the sum of $N$ uniform $[0,1]$ random variables.

Parametric Methods
- In parametric methods, we assume that we know the parametric form of the underlying data distribution, i.e., the equation of the pdf with parameters $\theta$ (e.g., a Gaussian with $\theta = (\mu, \sigma)$).
- Goal: estimate $\theta$ from the training data $X = \{x_1, \dots, x_N\}$.
- Likelihood of $\theta$: the probability that the data $X$ was indeed generated by a distribution with parameters $\theta$, $L(\theta) = p(X \mid \theta)$.

Parametric Methods: Maximum Likelihood Approach
- Idea: find the optimal parameters $\hat{\theta}$ by maximizing $L(\theta)$.
- Computation of the likelihood: a single data point contributes $p(x_n \mid \theta)$ (e.g., a Gaussian); under the assumption that all data points are independent,
  $L(\theta) = p(X \mid \theta) = \prod_{n=1}^{N} p(x_n \mid \theta)$.
- Negative log-likelihood ("energy"): $E(\theta) = -\ln L(\theta) = -\sum_{n=1}^{N} \ln p(x_n \mid \theta)$. Maximizing the likelihood is equivalent to minimizing the negative log-likelihood.
- To minimize the negative log-likelihood, take the derivative and set it to zero. For the Normal distribution (1D case),
  $E(\mu, \sigma) = -\sum_{n=1}^{N} \ln \mathcal{N}(x_n \mid \mu, \sigma^2) = \frac{N}{2} \ln(2\pi\sigma^2) + \frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2$.
- By minimizing the negative log-likelihood, we found the sample mean $\hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} x_n$; similarly, we can derive the sample variance $\hat{\sigma}^2 = \frac{1}{N} \sum_{n=1}^{N} (x_n - \hat{\mu})^2$.
- $\hat{\theta} = (\hat{\mu}, \hat{\sigma}^2)$ is the Maximum Likelihood estimate for the parameters of a Gaussian distribution. This is a very important result. Unfortunately, it is wrong…
- To be precise, the result is not wrong, but biased. Assume the samples come from a true Gaussian distribution with mean $\mu$ and variance $\sigma^2$. It can be shown that the expected estimates are
  $\mathbb{E}[\hat{\mu}] = \mu$ and $\mathbb{E}[\hat{\sigma}^2] = \frac{N-1}{N}\sigma^2$,
  so the ML estimate will underestimate the true variance. We can correct for this bias with the unbiased estimate
  $\tilde{\sigma}^2 = \frac{N}{N-1}\hat{\sigma}^2 = \frac{1}{N-1} \sum_{n=1}^{N} (x_n - \hat{\mu})^2$.
- Maximum Likelihood has several significant limitations: it systematically underestimates the variance of the distribution. E.g., the estimate from a single sample ($N = 1$) gives $\hat{\sigma}^2 = 0$. We say that ML overfits to the observed data. We will still often use Maximum Likelihood, but it is important to know about this effect. (A small numerical sketch follows after the histogram overview below.)

Nonparametric Methods: a) Histograms, b) Kernel Methods & k-Nearest Neighbors

Nonparametric Methods: Histograms
- Partition the data space into distinct bins with widths $\Delta_i$ and count the number $n_i$ of observations in each bin. Then
  $p_i = \frac{n_i}{N \Delta_i}$.
- Often the same width $\Delta$ is used for all bins.
- This can be done, in principle, for any dimensionality $D$… but the required number of bins grows exponentially with $D$!
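The following is the small numerical sketch promised above (Python with NumPy; the sample size, true parameters, and bin width are made-up illustration values, not settings from the lecture). It fits a 1D Gaussian by Maximum Likelihood, compares the biased and bias-corrected variance estimates, and builds a histogram density estimate $p_i = n_i / (N\Delta)$ from the same samples.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=50)   # samples from a true Gaussian (mu=2, sigma=1.5)
N = len(x)

# --- Parametric: Maximum Likelihood estimates for a Gaussian ---
mu_hat = x.mean()                                    # sample mean (ML estimate)
var_ml = ((x - mu_hat) ** 2).mean()                  # ML variance, divides by N (biased)
var_unbiased = ((x - mu_hat) ** 2).sum() / (N - 1)   # bias-corrected, divides by N-1

print(f"ML:        mu = {mu_hat:.3f}, sigma^2 = {var_ml:.3f}")
print(f"Corrected: sigma^2 = {var_unbiased:.3f}")    # larger than the ML value by N/(N-1)

# --- Nonparametric: histogram density estimate p_i = n_i / (N * Delta) ---
delta = 0.5                                          # bin width (smoothing parameter)
edges = np.arange(x.min(), x.max() + delta, delta)   # equally wide bins covering the data
counts, edges = np.histogram(x, bins=edges)
p_hist = counts / (N * delta)                        # density value assigned to each bin

print("histogram density integrates to ~1:", (p_hist * delta).sum())
```

Changing `delta` in this sketch directly shows the smoothing effect discussed next.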
Nonparametric Methods | Histograms
- The bin width $\Delta$ acts as a smoothing factor.
- [Figure: histogram estimates for three bin widths – not smooth enough, about ok, too smooth.]
- Advantages:
  - Very general method: in the limit ($N \to \infty$), every probability density can be represented.
  - No need to store the data points once the histogram is computed.
- Limitations:
  - Rather brute-force.
  - Discontinuities at the bin edges.
  - Choosing the right bin size is hard.
  - Unsuitable for high-dimensional feature spaces.

Nonparametric Methods: Kernel Methods and k-Nearest Neighbors
- A data point $\mathbf{x}$ comes from the pdf $p(\mathbf{x})$. The probability that $\mathbf{x}$ falls into a small region $\mathcal{R}$ is
  $P = \int_{\mathcal{R}} p(\mathbf{y})\, d\mathbf{y}$.
- Estimate $P$ from $N$ samples: let $K$ be the number of samples that fall into $\mathcal{R}$. If the number of samples is sufficiently large, $P \approx K/N$.
- For a sufficiently small $\mathcal{R}$, $p(\mathbf{x})$ is roughly constant over $\mathcal{R}$, so $P \approx p(\mathbf{x})\, V$, where $V$ is the volume of $\mathcal{R}$.
- Combining both, we can estimate the density as $p(\mathbf{x}) \approx \frac{K}{N V}$.
- Two ways to use this result: keep $V$ fixed and determine $K$ (Kernel Methods), or keep $K$ fixed and determine $V$ (k-Nearest Neighbors). Example for the first case: determine the number $K$ of data points inside a fixed hypercube.

Nonparametric Methods: Kernel Methods
- Hypercube of dimension $D$ with edge length $h$: define the kernel $k(\mathbf{u}) = 1$ if $|u_i| \le 1/2$ for all $i = 1, \dots, D$, and $0$ otherwise. Then
  $K = \sum_{n=1}^{N} k\left(\frac{\mathbf{x} - \mathbf{x}_n}{h}\right)$, $V = h^D$,
  and the probability density estimate is
  $p(\mathbf{x}) \approx \frac{1}{N h^D} \sum_{n=1}^{N} k\left(\frac{\mathbf{x} - \mathbf{x}_n}{h}\right)$.
- This method is known as Parzen window estimation.
- In general, we can use any kernel with $k(\mathbf{u}) \ge 0$ and $\int k(\mathbf{u})\, d\mathbf{u} = 1$, e.g., a Gaussian kernel for smoother boundaries. The resulting estimate has the same form as above and is known as Kernel Density Estimation.

Nonparametric Methods: k-Nearest Neighbors
- Here the roles are reversed: $K$ is fixed, and we increase the volume $V$ until it contains the next $K$ data points.
- Fix $K$ and estimate $V$ from the data: consider a hypersphere centered on $\mathbf{x}$ and let it grow to a volume $V^\star$ that includes $K$ of the given $N$ data points. Then
  $p(\mathbf{x}) \approx \frac{K}{N V^\star}$.
- Side note: strictly speaking, the model produced by k-NN is not a true density model, because the integral over all space diverges. E.g., consider $K = 1$ and evaluate the estimate exactly on a data point.

Kernel Methods and k-Nearest Neighbors: Advantages and Limitations
- Advantages:
  - Very general: in the limit ($N \to \infty$), every probability density can be represented.
  - No computation during the training phase; we just need to store the training set.
- Limitations:
  - Requires storing and computing with the entire dataset; the computational cost is linear in the number of data points. This can be improved through efficient storage structures (at the cost of some computation during training).
  - Choosing the kernel size $h$ / the number of neighbors $K$ is a hyperparameter optimization problem.

Nonparametric Methods: Bias-Variance Tradeoff
- [Figure: density estimates that are not smooth enough (too much variance), about ok, and too smooth (too much bias).]
- The smoothing parameter is the bin width $\Delta$ for histograms, the kernel size $h$ for KDE, and the number of neighbors $K$ for k-NN. (A sketch of the kernel and k-NN estimators follows after the references.)

References and Further Reading
More information can be found in Bishop's book:
- Gaussian distribution and ML: Ch. 1.2.4 and 2.3.1-2.3.4
- Nonparametric methods: Ch. 2.5
Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
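Finally, the sketch referenced in the bias-variance discussion above (Python with NumPy; the sample size, kernel width $h$, and $K$ are made-up illustration values, and the function names are hypothetical). It implements 1D kernel density estimation with a Gaussian kernel and the k-NN density estimate $p(x) \approx K/(N V)$.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=200)   # 1D training samples from a standard normal
N = len(samples)

def kde_gaussian(x, data, h):
    """Kernel Density Estimation with a Gaussian kernel of width h (1D)."""
    u = (x - data) / h                                 # scaled distances to all samples
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)   # Gaussian kernel values k(u)
    return k.sum() / (len(data) * h)                   # p(x) ~ (1/(N h)) * sum_n k((x - x_n)/h)

def knn_density(x, data, K):
    """k-NN density estimate: grow an interval around x until it holds K points (1D)."""
    dist = np.sort(np.abs(x - data))   # distances to all samples, ascending
    radius = dist[K - 1]               # radius that just includes the K nearest points
    V = 2.0 * radius                   # "volume" of the interval [x - r, x + r]
    return K / (len(data) * V)         # p(x) ~ K / (N V)

x0 = 0.5
print("KDE  estimate at x = 0.5:", kde_gaussian(x0, samples, h=0.3))
print("k-NN estimate at x = 0.5:", knn_density(x0, samples, K=10))
# Both should lie near the true density N(0.5 | 0, 1), roughly 0.35, for these settings.
```

Varying `h` and `K` reproduces the bias-variance tradeoff from the slides: small values give noisy, spiky estimates, large values oversmooth.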