Bayesian Neural Networks & Uncertainty PDF Lecture Notes
Document Details
Uploaded by HappierIris
Summary
This document provides a comprehensive overview of Bayesian neural networks and uncertainty, including their application in active learning. Topics covered include the significance of uncertainty in machine learning models, Bayesian inference, and the different types of uncertainty. It also discusses information theory and the connection between Bayesian methods and deep learning.
Full Transcript
12/12/24, 11:57 AM (Bayesian) Active Learning, Information Theory, and Uncertainty – Bayesian Neural Networks & Uncertainty Bayesian Neural Networks & Uncertainty Comprehensive Overview https://blackhc.github.io/balitu/lectures/lecture_3_bnns.html#...
https://blackhc.github.io/balitu/lectures/lecture_3_bnns.html#/reduction-in-uncertainty

Why Uncertainty?

The need for uncertainty in ML models stems from several critical factors.

Death of Elaine Herzberg

Song et al. (2021)

Kendall and Gal (2017)

Uncertainty for Active Learning

Uncertainty is a natural measure of informativeness in active learning:

1. High uncertainty → model is unsure → learning opportunity
2. Low uncertainty → model is confident → less to learn

Key Insight: Uncertainty helps us find the decision boundary: where the model is most uncertain is where we need labels most.
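The idea above can be sketched as a simple acquisition function. This is a minimal sketch, not the lecture's implementation: it assumes we already have predicted class probabilities for an unlabeled pool, and the array `pool_probs` and the helpers `predictive_entropy` and `select_most_uncertain` are hypothetical names introduced for illustration. The entropy of each prediction serves as the uncertainty score.

```python
import numpy as np


def predictive_entropy(probs: np.ndarray) -> np.ndarray:
    """Entropy (in bits) of each row of an (N, C) array of class probabilities."""
    # Clip before the log so that p = 0 contributes 0 instead of producing NaN.
    return -np.sum(probs * np.log2(np.clip(probs, 1e-12, None)), axis=-1)


def select_most_uncertain(probs: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k pool points with the highest predictive entropy."""
    scores = predictive_entropy(probs)
    # argsort is ascending, so reverse it to put the most uncertain points first.
    return np.argsort(scores)[::-1][:k]


# Hypothetical pool of four points in a binary problem.
pool_probs = np.array([
    [0.99, 0.01],  # far from the decision boundary: confident
    [0.55, 0.45],  # near the decision boundary: uncertain
    [0.90, 0.10],
    [0.50, 0.50],  # on the decision boundary: maximally uncertain
])
query_indices = select_most_uncertain(pool_probs, k=2)
```

A single model's predictive entropy is high near its decision boundary but not necessarily in unexplored regions, which is one reason the rest of these notes look at distributions over models rather than a single model.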
Consider a binary classification problem:

- Points far from the decision boundary: model is confident → low uncertainty
- Points near the decision boundary: model is uncertain → high uncertainty
- Points in unexplored regions: model should be uncertain → high uncertainty

This motivates uncertainty sampling and its variants as acquisition functions.

Uncertainty about Uncertainty

Consider multiple valid hypotheses that explain our training data:

Key Insight: Even when a model expresses uncertainty through probabilities, there can be uncertainty about those probabilities, too. Different valid hypotheses (decision boundaries) that explain our training data can assign very different confidence values to the same point.

This highlights that:

1. Any single decision boundary can be wrong and will likely be overconfident on some points.
2. We need methods to quantify uncertainty about uncertainty.
3. We need ensembles or Bayesian averaging to capture hypothesis uncertainty.

From Vision to Goals

What do we want from our deep learning models?

1. Reliable uncertainty estimates (unknown unknowns)
2. Calibrated confidence (known unknowns)
3. Detection of out-of-distribution inputs (anomaly detection)
4. Principled handling of uncertainty (theory)
5. Detection of informative points (active learning)

Steps

1. Foundations: Bayesian Statistics & Models
2. Aleatoric vs Epistemic Uncertainty
3. Information-theoretic Uncertainty
4. Density-based Uncertainty
5. Bayesian Deep Learning
6. Approximation Methods
7. Shallow Dive: Variational Inference

Last Years

Methods for BNNs:

- Mean Field VI (Blundell et al. 2015)
- MC Dropout (Gal and Ghahramani 2016)
- Deep Ensembles (Lakshminarayanan, Pritzel, and Blundell 2016)
- Laplace Approximation (Immer, Korzepa, and Bauer 2021)

What You Learn

- Foundations
- Connections between methods
- Trade-offs
- How to apply uncertainty methods

Bayesian Statistics

Three key principles that distinguish the Bayesian approach:

| Principle | Realization |
|---|---|
| Probability as degrees of belief | Everything is a random variable |
| Parameters as random variables | Parameter distribution: $p(\theta)$ |
| Training data as evidence | Data likelihood: $p(D \mid \theta)$ |
| Learning is Bayesian inference | Update beliefs: $p(\theta \mid D)$ |
| Prediction via marginalization | Average over parameters: $\mathbb{E}_{p(\theta)}[p(y \mid x, \theta)]$ |

The Bayesian Model

Both the model parameters $\theta$ and the outputs $y$ for a given input $x$ are random variables. The joint distribution is defined as:

$$p(y, \theta \mid x) = p(y \mid x, \theta)\, p(\theta)$$

- Parameter distribution $p(\theta)$: captures our beliefs about the parameters.
- Data likelihood $p(y \mid x, \theta)$: relates the inputs $x$ to the outputs $y$ given parameters $\theta$.

Probability as Degrees of Belief

- Probabilities represent subjective beliefs
- Can assign probabilities to non-repeatable events
- Updates beliefs as new evidence arrives
- Prior knowledge can be formally incorporated

What it's NOT: Frequentist

- Probability as long-run frequency of events
- Requires repeatable experiments
- No formal way to include prior knowledge
- Only considers sampling distributions

Learning as Bayesian Inference

Bayesian inference ≙ computing the posterior $p(\theta \mid D)$

- Combines prior knowledge with data
- Full distribution over parameters
- Automatic Occam's razor effect

What it's NOT: Point estimation (MLE/MAP)

- Single "best" parameter value
- No uncertainty quantification
- Can overfit without regularization
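To make the contrast between point estimation and Bayesian inference concrete, here is a minimal sketch using a coin-flip model. It is not part of the lecture: the Beta prior, the counts, and the names `BetaPosterior` and `beta_bernoulli_posterior` are illustrative assumptions. With a Beta(a, b) prior on the heads probability θ and k heads in n flips, the posterior is Beta(a + k, b + n − k) in closed form.

```python
from dataclasses import dataclass


@dataclass
class BetaPosterior:
    """A Beta(a, b) distribution over the heads probability theta."""
    a: float
    b: float

    @property
    def mean(self) -> float:
        return self.a / (self.a + self.b)

    @property
    def variance(self) -> float:
        total = self.a + self.b
        return self.a * self.b / (total**2 * (total + 1))


def beta_bernoulli_posterior(heads, flips, prior_a=1.0, prior_b=1.0) -> BetaPosterior:
    """Conjugate update: add observed heads and tails to the prior pseudo-counts."""
    return BetaPosterior(prior_a + heads, prior_b + flips - heads)


# Three flips, all heads.
heads, flips = 3, 3
mle = heads / flips                                   # point estimate: theta = 1.0
posterior = beta_bernoulli_posterior(heads, flips)    # Beta(4, 1) under a uniform prior

print(f"MLE: {mle:.2f}")
print(f"Posterior mean: {posterior.mean:.2f}, variance: {posterior.variance:.4f}")
```

The maximum likelihood estimate claims that tails is impossible after three flips, while the posterior keeps a full distribution over θ whose spread shrinks as more flips are observed.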
Bayesian Inference

$$\underbrace{p(\theta \mid D)}_{\text{posterior}} = \frac{\overbrace{p(D \mid \theta)}^{\text{likelihood}}\;\overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(D)}_{\text{marginal likelihood (``evidence'')}}}$$

Prediction through Marginalization

Average over all possible parameters:

$$p(y \mid x, D) = \mathbb{E}_{p(\theta \mid D)}[p(y \mid x, \theta)] = \int p(y \mid x, \theta)\, p(\theta \mid D)\, d\theta$$

- Naturally handles uncertainty
- Model averaging reduces overfitting

What it's NOT: Plug-in prediction

- Using a single parameter estimate
- Ignores parameter uncertainty
- Can be overconfident

What Bayesian Stats is NOT

Frequentist Statistics

- Probability = long-run frequency
- Fixed parameters, random data
- Confidence intervals
- p-values and hypothesis tests
- Maximum likelihood estimation

Just Using Bayes' Rule

- Bayes' rule is just one tool
- Full Bayesian inference is more
- Not just about conditional probability

Merely Adding Priors

- Not just regularization
- Full uncertainty quantification
- Posterior predictive checks
- Model comparison

Only for Simple Models

- Scales to deep learning
- Practical approximations exist
- Active research area
- Modern computational tools

Summary

1. Parameter uncertainty: instead of $\theta^*$, use $p(\theta)$
2. Start with prior beliefs $p(\theta)$
3. Instead of optimizing the likelihood, update beliefs using the data likelihood $p(D \mid \theta)$: $\theta^* = \arg\max_\theta p(D \mid \theta)$ vs. $p(\theta \mid D)$
4. Instead of point predictions, predict via marginalization: $p(y \mid x, D) = \mathbb{E}_{p(\theta \mid D)}[p(y \mid x, \theta)]$

Informativeness

1. Informativeness measures how much information a given sample contains about the model parameters
2. Active learning selects the most informative samples for labeling

How can we measure informativeness when we have a parameter distribution that captures our beliefs?

Reduction in Uncertainty

≡ mutual information between model parameters and data:

$$I[\Theta; Y \mid x] = H[\Theta] - H[\Theta \mid Y, x].$$

This is known as the Expected Information Gain (EIG).

I-Diagram for the EIG

Expected Information Gain

Definition 1: The expected information gain is the mutual information between the model parameters $\Theta$ and the prediction $Y$ given a new data point $x$ and already labeled data $D$:

$$I[\Theta; Y \mid x, D] = H[\Theta \mid D] - \mathbb{E}_{p(y \mid x, D)}[H[\Theta \mid D, y, x]].$$

Information Diagram: The overlapping region shows the mutual information between model parameters $\Theta$ and predictions $Y$ given input $x$ and dataset $D$. This represents how much uncertainty about the predictions would be reduced if we knew the true parameters.

Historical Context: Expected Information Gain

The Expected Information Gain (EIG) has deep roots in experimental design:

- Lindley (1956): Introduced for optimal experimental design
- Box and Hill (1967): Applied to chemical kinetics experiments
- MacKay (1992): Connected to neural networks and active learning

Details

Historical Impact: These works established that selecting experiments to maximize information gain provides a principled approach to scientific discovery and machine learning.

1. Lindley (1956): "On a Measure of Information Provided by an Experiment"
   - First formal treatment of information gain in Bayesian experiments
   - Showed that Shannon entropy could quantify experimental value
   - Introduced $I[\Theta; Y \mid x]$ as design criterion
2. Box and Hill (1967): "Discrimination Between Mechanistic Models"
   - Applied EIG to discriminate between competing chemical models
   - Introduced sequential design of experiments
   - Showed practical value in scientific discovery
3. MacKay (1992): "Information-Based Objective Functions for Active Learning"
   - Connected EIG to neural network active learning
   - Showed equivalence to maximizing expected cross-entropy
   - Laid groundwork for modern Bayesian active learning

BALD: A Practical Alternative

Bayesian Active Learning by Disagreement (BALD) provides a tractable reformulation that uses the symmetry of mutual information:

$$I[\Theta; Y \mid x, D] = I[Y; \Theta \mid x, D] = H[Y \mid x, D] - \mathbb{E}_{p(\theta \mid D)}[H[Y \mid x, \theta]].$$

Key Advantage: BALD only needs entropies of predictive distributions, averaged over the parameter distribution, rather than entropies over the parameter space itself, making it more computationally feasible.

Aleatoric vs Epistemic Uncertainty

Two fundamental types of uncertainty:

1. Aleatoric Uncertainty
   - Inherent randomness in data
   - Irreducible noise
   - Present even with infinite data
2. Epistemic Uncertainty
   - Model's lack of knowledge
   - Reducible with more data
   - High in regions without training data

Classification Example

Binary classification with overlapping regions and missing data.

Key Regions in the Plots

- Unambiguous Data Region: confident predictions, no aleatoric uncertainty, no epistemic uncertainty
- No Data Region ($[-2, 2]$): unknown predictions, high epistemic uncertainty, no aleatoric uncertainty
- Ambiguous Data Region ($[3.5, 4.5]$): uncertain predictions, high aleatoric uncertainty, no epistemic uncertainty

Uncertainty vs Generalization

Epistemic uncertainty and generalization are both shaped by our model's inductive biases.

Key Insights

- Both manifest outside training data
- Inductive biases guide both uncertainty growth and generalization
- We can be correct while being uncertain (or not)

Key Differences

Aleatoric Uncertainty

- Data noise/randomness
- Cannot be reduced
- Can be homoscedastic or heteroscedastic
- Present even with infinite data
- Example: Measurement noise in sensors

Epistemic Uncertainty

- Model's knowledge gaps
- Reducible with data
- High in sparse data regions
- Approaches zero with infinite data
- Example: Predictions far from training data

Key Insight: Epistemic uncertainty decreases as we add more training data, while the true aleatoric uncertainty is independent of the model.
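The key insight above can be checked numerically. The sketch below is an illustration under stated assumptions, not the lecture's code: it uses a coin-flip model with a uniform Beta(1, 1) prior, fixes the observed fraction of heads at 0.7, and estimates the two BALD terms from posterior samples. The function names, the sample count, and the seed are hypothetical choices.

```python
import numpy as np


def binary_entropy(p):
    """Entropy in bits of Bernoulli(p), elementwise."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))


def uncertainty_terms(heads, flips, num_samples=20_000, seed=0):
    """Monte Carlo estimate of (total, epistemic, aleatoric) for the next flip."""
    rng = np.random.default_rng(seed)
    # Samples of theta from the Beta(1 + heads, 1 + tails) posterior.
    thetas = rng.beta(1 + heads, 1 + flips - heads, size=num_samples)
    total = binary_entropy(thetas.mean())       # H[Y | D]: entropy of the averaged prediction
    aleatoric = binary_entropy(thetas).mean()   # E_{p(theta | D)} H[Y | theta]
    epistemic = total - aleatoric               # I[Y; Theta | D]
    return total, epistemic, aleatoric


for flips in (10, 100, 1_000, 10_000):
    heads = round(0.7 * flips)
    total, epistemic, aleatoric = uncertainty_terms(heads, flips)
    print(f"n={flips:>6}  total={total:.4f}  epistemic={epistemic:.5f}  aleatoric={aleatoric:.4f}")
```

In this toy model the epistemic term shrinks toward zero as the number of flips grows, while the aleatoric term approaches the entropy of a 0.7 coin (about 0.88 bits), which no amount of data removes.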
Information-Theoretic Uncertainty

Total Uncertainty Decomposition

Total uncertainty can be decomposed using information theory:

$$\underbrace{H[Y \mid x]}_{\text{total uncertainty}} = \underbrace{I[Y; \Theta \mid x]}_{\text{epistemic}} + \underbrace{H[Y \mid x, \Theta]}_{\text{aleatoric}}$$

Key Insight: This decomposition separates uncertainty into:

1. What we could know (epistemic): reducible with more data
2. What we can't know (aleatoric): irreducible noise

Aleatoric Uncertainty Estimate

BALD: Bayesian Active Learning by Disagreement

The mutual information term $I[Y; \Theta \mid x]$ measures:

1. Disagreement between models (for example, between ensemble members)
2. Reduction in predictive uncertainty if we knew the true parameters
3. Uncertainty that could be reduced with more data

Classification Example

Key Regions

- Unambiguous points ($[-6, -4]$): no uncertainty, confident predictions
- No data region ($[-2, 2]$): high predictive entropy (total uncertainty), high mutual information (epistemic uncertainty), low softmax entropy (aleatoric uncertainty)
- Ambiguous points ($[3.5, 4.5]$): high softmax entropy (aleatoric uncertainty), high predictive entropy (total uncertainty), low mutual information (epistemic uncertainty)

Approximation Effect

Summary

$$\underbrace{H[Y \mid x]}_{\text{total}} = \underbrace{I[Y; \Theta \mid x]}_{\text{epistemic}} + \underbrace{\mathbb{E}_{p(\theta \mid D)}[H[Y \mid x, \theta]]}_{\text{aleatoric}}$$

Total Uncertainty, $H[Y \mid x]$:

- Predictive entropy of averaged predictions
- What we observe in practice

Epistemic Uncertainty, $I[Y; \Theta \mid x]$:

- Mutual information between predictions and parameters
- Reducible through data collection
- High in regions without training data

Aleatoric Uncertainty, $\mathbb{E}_{p(\theta \mid D)}[H[Y \mid x, \theta]]$:

- Expected entropy under the parameter distribution
- Irreducible noise in the data
- Present even with infinite data

Why Bayesian Model Average?

TL;DR: It's the best we can do.
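A small numerical check can make the "best we can do" claim concrete. This is a sketch under assumptions: the parameter distribution is represented by a hypothetical set of five equally weighted sampled models, each giving a categorical prediction for one input, and `entropy` and `expected_kl` are illustrative helper names. It compares the expected KL divergence from the sampled models' predictions to several candidate predictions q.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical predictions p(y | x, theta_k) of K = 5 sampled models over C = 3 classes.
logits = rng.normal(size=(5, 3))
member_probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)


def entropy(p):
    """Entropy in nats along the last axis (assumes strictly positive probabilities)."""
    return -np.sum(p * np.log(p), axis=-1)


def expected_kl(member_probs, q):
    """Average over the sampled models of KL(p(y | x, theta_k) || q)."""
    return np.mean(np.sum(member_probs * (np.log(member_probs) - np.log(q)), axis=-1))


bma = member_probs.mean(axis=0)  # Bayesian model average over the sampled models
mutual_information = entropy(bma) - entropy(member_probs).mean()

kl_bma = expected_kl(member_probs, bma)
kl_single = expected_kl(member_probs, member_probs[0])     # plug-in: a single model
kl_uniform = expected_kl(member_probs, np.full(3, 1 / 3))  # ignore the models entirely

print(f"BMA: {kl_bma:.4f}  single model: {kl_single:.4f}  uniform: {kl_uniform:.4f}")
print(f"Mutual information: {mutual_information:.4f}")
```

Under these assumptions `kl_bma` matches the mutual information up to floating-point error, and any other candidate q is worse by exactly the KL divergence from the Bayesian model average to q.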
Details

Assume:

1. The model is well-specified: the true model is in our model class, so there is a $\theta_{\text{true}}$ that generated the data.
2. Our beliefs $p(\theta)$ are rational and reflect the best we can do: otherwise, we would pick a different parameter distribution.

Then

$$\mathbb{E}_{p(\theta)}[D_{\mathrm{KL}}(p(Y \mid x, \theta) \,\|\, q(Y \mid x))]$$

captures how much worse any $q(y \mid x)$ is than the true (but unknown) model $p(y \mid x, \theta_{\text{true}})$ in expectation according to our beliefs $p(\theta)$.

We have:

$$\begin{aligned}
\mathbb{E}_{p(\theta)}[D_{\mathrm{KL}}(p(Y \mid x, \theta) \,\|\, q(Y \mid x))]
&= \mathbb{E}_{p(\theta)}[H(p(Y \mid x, \theta) \,\|\, q(Y \mid x))] - \mathbb{E}_{p(\theta)}[H(p(Y \mid x, \theta))] \\
&= H(\mathbb{E}_{p(\theta)}[p(Y \mid x, \theta)] \,\|\, q(Y \mid x)) - H(p(Y \mid x, \Theta)) \\
&= H(p(Y \mid x) \,\|\, q(Y \mid x)) - H(p(Y \mid x, \Theta)) \\
&\ge H(p(Y \mid x)) - H(p(Y \mid x, \Theta)) \\
&= I[Y; \Theta \mid x].
\end{aligned}$$

The second equality uses that the cross-entropy $H(p \,\|\, q)$ is linear in its first argument. The inequality holds because the cross-entropy is never smaller than the entropy, with equality exactly when $q(Y \mid x) = p(Y \mid x)$, the Bayesian model average.

This shows that the Bayesian model average is the best we can do: it minimizes the expected prediction error according to our beliefs. We can't do better than this.

We also see: the expected information gain (epistemic uncertainty) is a lower bound on the expected KL divergence between the true model's predictions and ours.

BMA: Summary

- Best we can do: minimizes expected prediction error
- Lower bound on expected loss divergence
- Epistemic uncertainty is the (expected) gap to the (expected) truth

Probability Simplex

…or how to visualize predictions as points in the probability simplex.
Point Predictions vs Probability Vectors

A neural network's softmax output may look like a probability distribution:

    # Prediction for a single image
    logits = model(x)
    probs = softmax(logits)
    print(probs)  # e.g. [0.05, 0.92, 0.03], which sums to 1

But this is still just a point estimate: a single point in the probability simplex

$$\Delta^{n-1} = \Big\{ p \in \mathbb{R}^n : \sum_{i=1}^{n} p_i = 1,\; p_i \ge 0 \Big\}.$$

Understanding the Probability Simplex

MC Samples in the Simplex

Visualizing Epistemic Uncertainty

Visualizing Entropy Samples

Interactive Plot: Entropy Surface over Probability Simplex (color scale: entropy in bits)
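With several sampled parameter settings, each input gets a cloud of points in the simplex rather than a single one. The following is a minimal sketch of how the total, epistemic, and aleatoric terms from the earlier decomposition can be estimated from such samples. The arrays are made-up examples and `decompose_uncertainty` is a hypothetical helper, not code from the lecture.

```python
import numpy as np


def entropy_bits(p):
    """Entropy in bits along the last axis; zero probabilities contribute zero."""
    return -np.sum(p * np.log2(np.clip(p, 1e-12, None)), axis=-1)


def decompose_uncertainty(samples):
    """samples: (K, C) array; row k is p(y | x, theta_k), a point in the simplex."""
    total = entropy_bits(samples.mean(axis=0))   # entropy of the averaged prediction
    aleatoric = entropy_bits(samples).mean()     # average entropy of the individual samples
    epistemic = total - aleatoric                # mutual information (the BALD score)
    return total, epistemic, aleatoric


# Samples agree on a confident prediction: little uncertainty of any kind.
confident = np.array([[0.98, 0.01, 0.01]] * 3)
# Samples agree that the input is ambiguous: aleatoric but not epistemic.
ambiguous = np.array([[1 / 3, 1 / 3, 1 / 3]] * 3)
# Samples are individually confident but disagree: epistemic but not aleatoric.
disagreeing = np.eye(3)

for name, samples in [("confident", confident), ("ambiguous", ambiguous), ("disagreeing", disagreeing)]:
    total, epistemic, aleatoric = decompose_uncertainty(samples)
    print(f"{name:>12}: total={total:.3f}  epistemic={epistemic:.3f}  aleatoric={aleatoric:.3f}")
```

The ambiguous and disagreeing cases have the same averaged prediction, at the center of the simplex, and therefore the same total uncertainty of log2(3) bits. Only the spread of the individual samples tells them apart.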
Asides

A few visualizations to better understand the probability simplex, the concavity of the entropy, and the margin scores…

Aside: Entropy Visualization

Interactive Plot: Entropy Surface over Probability Simplex (color scale: entropy in bits)

Aside: Margin Visualization

Interactive Plot: Margin Surface over Probability Simplex (color scale: margin)

Comparison: Margin vs Entropy

Interactive Plot: Margin > Entropy Preference Surface over Probability Simplex (color scale: preference in %)
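The two scores can rank the same predictions differently. As a small sketch, assuming the common definition of the margin as the gap between the two largest class probabilities (the slides do not spell out their definition, and the function names and example vectors here are illustrative):

```python
import numpy as np


def entropy_bits(p):
    """Entropy in bits along the last axis; zero probabilities contribute zero."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log2(np.clip(p, 1e-12, None)), axis=-1)


def margin(p):
    """Gap between the largest and second-largest probability (small = uncertain)."""
    top_two = np.sort(np.asarray(p, dtype=float), axis=-1)[..., -2:]
    return top_two[..., 1] - top_two[..., 0]


two_way_tie = [0.5, 0.5, 0.0]  # two classes tied, the third ruled out
spread_out = [0.4, 0.3, 0.3]   # one weak favourite among three classes

for name, p in [("two_way_tie", two_way_tie), ("spread_out", spread_out)]:
    print(f"{name:>12}: margin={margin(p):.2f}  entropy={entropy_bits(p):.3f} bits")
```

Under this definition, margin treats `two_way_tie` as the more uncertain prediction (margin 0 versus 0.1), while entropy ranks `spread_out` higher (about 1.57 bits versus 1 bit).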
Density-based Uncertainty

Epistemic uncertainty ≙ reducible with more data

$\overset{?}{\Longrightarrow}$ use the sample density $p(x)$ to estimate uncertainty

But what about generalization to new inputs?

$$\underbrace{X}_{\text{inputs}} \xrightarrow[\text{encoder}]{f(\cdot;\theta)} \underbrace{Z}_{\text{latents}} \xrightarrow[\text{classifier}]{\text{linear+softmax}} \underbrace{Y}_{\text{outputs}}$$

$\overset{?}{\Longrightarrow}$ use the feature-space density $p(f(x))$ to estimate uncertainty?

Key Idea

Density-based uncertainty leverages feature space density to estimate uncertainty:

- High density → likely in-distribution
- Low density → likely out-of-distribution

No parameter distribution or sampling required!

Feature Space Density

Advantages

Benefits:

- Single forward pass
- No sampling required
- Computationally efficient
- Natural OOD detection

Limitations:

- Requires density estimation
- Curse of dimensionality
- May miss class-specific uncertainty

Issues

1. How do we estimate the density?
2. How do we take the decision boundaries of the classifier into account?

Deep Deterministic Uncertainty (DDU)

Key components:
1. Feature extractor with spectral normalization
2. Softmax classifier ⟹ Predictions + Aleatoric Uncertainty
3. Class-conditional Gaussian density estimation ⟹ Epistemic Uncertainty

DDU Pseudocode

    class DDU(nn.Module):
        def __init__(self):
            super().__init__()
            self.model = SpectralNormNet(classifier = SoftmaxClassifi…
            self.density_estimator = ClassConditionalGMM()
            self.density_ecdf = None

        def fit(self, X, y):
            # Train the network, then fit the density estimator on its latents
            self.model.fit(X, y)
            latents = self.model.encode(X)
            densities = self.density_estimator.fit(latents, y).score_…
            self.density_ecdf = scipy.stats.ecdf(densities).cdf.evalu…

        def predict(self, X):
            latents = self.model.encode(X)
            density = self.density_estimator.score_samples(latents)
            # Quantile of the density relative to the training densities
            density_quantile = self.density_ecdf(density)
            return self.model.classify(latents), density_quantile

ClassConditionalGMM

    import numpy as np
    import scipy.special

    from dataclasses import dataclass, field
    from sklearn.mixture import GaussianMixture


    @dataclass
    class ClassConditionalGMM:
        gmms: list[GaussianMixture] = field(default_factory=list)
        class_priors: list[float] = field(default_factory=list)
        classes: list[int] = field(default_factory=list)
        counts: list[int] = field(default_factory=list)

        def fit(self, X, y, seed=42):
            """Fit the class-conditional GMMs to the data."""
            self.classes, self.counts = np.unique(y, return_counts=Tr…
            total_samples = len(y)
            …

ClassConditionalGMM: Example

SimpleDDU (no features)

    import matplotlib.pyplot as plt
    import scipy.stats
    from sklearn.metrics import roc_curve, auc
    from sklearn.linear_model import LogisticRegression

    class SimpleDDU:
        def __init__(self):
            self.classifier = LogisticRegression()
            self.density_estimator = ClassConditionalGMM()
            self.density_ecdf = None

        def fit(self, X, y):
            self.classifier.fit(X, y)
            # Densities of the training points under the class-conditional GMM
            densities = self.density_estimator.fit(X, y).score_samples(X)
            # Empirical CDF of the training densities, used to turn a density into a quantile
            self.density_ecdf = scipy.stats.ecdf(densities).cdf.evaluate
            return self

        def predict(self, X):
            class_probs = self.classifier.predict_proba(X)
            density = self.density_estimator.score_samples(X)
            density_quantile = self.density_ecdf(density)
            return class_probs, density_quantile

SimpleDDU: Example

SimpleDDU: Results

DDU Paper Results

Dirty-MNIST

Density ≙ Epistemic Uncertainty

Density for Active Learning

Summary

Density-based uncertainty:
1. Uses feature space density
2. Single forward pass
3. Natural OoD detection
4. Computationally efficient

Key Insight: Combining density estimation with deep learning provides efficient uncertainty estimation without sampling!

From Kirsch (2024).
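The density-quantile step used by the `predict` methods above can be tried out without any deep-learning machinery. The following is a minimal sketch, assuming 1-D inputs and a single Gaussian per class instead of the `GaussianMixture` components used in the lecture code. The helper names (`fit_class_gaussians`, `log_density`, `make_ecdf`, `density_quantile`) and the toy data are hypothetical and only illustrate the mechanics.

```python
import math
from bisect import bisect_right


def fit_class_gaussians(xs, ys):
    """Fit one 1-D Gaussian per class; returns {label: (prior, mean, var)}."""
    params = {}
    for label in sorted(set(ys)):
        members = [x for x, y in zip(xs, ys) if y == label]
        mean = sum(members) / len(members)
        var = sum((x - mean) ** 2 for x in members) / len(members)
        params[label] = (len(members) / len(xs), mean, var)
    return params


def log_density(x, params):
    """log p(x) = log sum_c p(c) N(x; mean_c, var_c), computed via log-sum-exp."""
    terms = [
        math.log(prior) - 0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        for prior, mean, var in params.values()
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def make_ecdf(values):
    """Empirical CDF: fraction of the given values that are <= v."""
    ordered = sorted(values)
    return lambda v: bisect_right(ordered, v) / len(ordered)


# Hypothetical toy data: two well-separated classes.
train_x = [-1.0, -0.5, 0.0, 0.5, 1.0, 4.0, 4.5, 5.0, 5.5, 6.0]
train_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

params = fit_class_gaussians(train_x, train_y)
density_ecdf = make_ecdf([log_density(x, params) for x in train_x])


def density_quantile(x):
    """Quantile of x's log density among the training log densities."""
    return density_ecdf(log_density(x, params))


for query in (0.0, 1.0, 20.0):
    print(query, density_quantile(query))
```

A low quantile means the input lies in a region of lower density than most training points, which is what DDU reads as high epistemic uncertainty. In this toy setup the far-away query `20.0` gets quantile 0, while a query at a class mean gets a quantile close to 1.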
OoD Detection vs AL

Key Distinctions

OoD detection and active learning approach uncertainty from complementary perspectives:

OoD Detection
- Prioritizes identifying samples that deviate from the training distribution $p(x)$
- Primary goal: Flag anomalous inputs
- Less concerned with the model's epistemic uncertainty
- Focuses on a binary decision: in or out of distribution?

Active Learning
- Prioritizes epistemic uncertainty
- Primary goal: Find informative samples to improve the model
- Seeks regions where the model is uncertain but learnable
- Often, the most valuable samples lie at distribution boundaries

Key insight: While OoD detection asks “Is this sample valid?”, active learning asks “What can we learn from this sample?”

Important: DDU works for AL because DDU's experimental setup ensures that OoD detection and epistemic uncertainty are well-aligned (e.g., no far OoD data!).

Bayesian Deep Learning

The Ideal

The Bayesian ideal is to maintain a full posterior distribution over model parameters:

$$p(\theta \mid \mathcal{D})$$

This would allow us to:
1. Make predictions with uncertainty quantification
2. Automatically handle overfitting through marginalization
3. Get principled uncertainty estimates

BUT…

Exact Bayesian inference is intractable for modern neural networks:
- High dimensionality (millions of parameters)
- Non-conjugate likelihoods
- Computational cost

Details

1. We cannot easily compute the posterior density:
$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}$$
requires the evidence
$$p(\mathcal{D}) = \mathbb{E}_{p(\theta)}[p(\mathcal{D} \mid \theta)],$$
which is intractable.
2. We cannot easily draw samples from the posterior either: $\theta \sim p(\theta \mid \mathcal{D})$. Doing so requires MCMC or similar methods, which are computationally expensive and slow.

This leads us to approximations…

The Approximations
1. Variational Inference
2. Mean Field Approximation
3. Monte Carlo Dropout
4. Laplace Approximation

Variational Inference

“The one to rule them all.” (Andreas Kirsch, 2024/12/04)

Goal
1. We have a complex posterior distribution $p(\theta \mid \mathcal{D})$
2. We want to approximate it with a variational distribution $q(\theta)$

How can we do that?
Optimization Problem

$$\min_{q(\theta)} D_{\mathrm{KL}}(q(\Theta) \,\|\, p(\Theta \mid \mathcal{D})) \searrow 0.$$

Variational Bayes

Variational Inference (VI) is a method for approximating the posterior distribution $p(\theta \mid \mathcal{D})$ with a variational distribution $q(\theta)$. When we use VI for the Bayesian posterior, we call it Variational Bayes (VB).

$$q^*(\theta) \in \arg\min_{q(\theta)} D_{\mathrm{KL}}(q(\Theta) \,\|\, p(\Theta \mid \mathcal{D})).$$

How does this help us?
1. We can rewrite the KL divergence so that the intractable evidence $p(\mathcal{D})$ only enters as an additive constant; dropping it leaves a tractable objective.
2. We can optimize this objective using gradient descent to find a good $q(\theta)$.
3. We usually parameterize $q(\theta)$ with parameters $\psi$, that is, we have $q_\psi(\theta)$ and optimize over $\psi$.

Discovering the Evidence Bound

Also known as the Evidence Lower Bound (ELBO) when using probabilities.

$$D_{\mathrm{KL}}(q(\Theta) \,\|\, p(\Theta \mid \mathcal{D})) = \mathrm{H}(q(\Theta) \,\|\, p(\Theta \mid \mathcal{D})) - \mathrm{H}(q(\Theta))$$

Tip: We have
$$\mathrm{H}(q(\Theta) \,\|\, p(\Theta \mid \mathcal{D})) = \mathrm{H}\!\left(q(\Theta) \,\Big\|\, \frac{p(\mathcal{D} \mid \Theta)\, p(\Theta)}{p(\mathcal{D})}\right).$$

So we get:
$$\begin{aligned}
D_{\mathrm{KL}}(q(\Theta) \,\|\, p(\Theta \mid \mathcal{D}))
&= \mathrm{H}\!\left(q(\Theta) \,\Big\|\, \frac{p(\mathcal{D} \mid \Theta)\, p(\Theta)}{p(\mathcal{D})}\right) - \mathrm{H}(q(\Theta)) \\
&= \mathrm{H}(q(\Theta) \,\|\, p(\mathcal{D} \mid \Theta)) - \mathrm{H}(q(\Theta) \,\|\, p(\mathcal{D})) + \mathrm{H}(q(\Theta) \,\|\, p(\Theta)) - \mathrm{H}(q(\Theta)).
\end{aligned}$$

Tip: We have
$$\begin{aligned}
\mathrm{H}(q(\Theta) \,\|\, p(\mathcal{D} \mid \Theta)) &= \mathbb{E}_{q(\theta)}[\mathrm{H}[\mathcal{D} \mid \theta]], \\
\mathrm{H}(q(\Theta) \,\|\, p(\mathcal{D})) &= \mathrm{H}(p(\mathcal{D})) = -\ln p(\mathcal{D}), \\
\mathrm{H}(q(\Theta) \,\|\, p(\Theta)) - \mathrm{H}(q(\Theta)) &= D_{\mathrm{KL}}(q(\Theta) \,\|\, p(\Theta)).
\end{aligned}$$

Hence:
$$D_{\mathrm{KL}}(q(\Theta) \,\|\, p(\Theta \mid \mathcal{D})) = \mathbb{E}_{q(\theta)}[\mathrm{H}[\mathcal{D} \mid \theta]] + D_{\mathrm{KL}}(q(\Theta) \,\|\, p(\Theta)) - \mathrm{H}(p(\mathcal{D})) \ge 0.$$

So, we have:
$$\mathbb{E}_{q(\theta)}[\mathrm{H}[\mathcal{D} \mid \theta]] + D_{\mathrm{KL}}(q(\Theta) \,\|\, p(\Theta)) \ge \mathrm{H}(p(\mathcal{D})).$$

This is the Evidence Bound!
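To make the bound concrete, here is a small numerical check on a toy model where the evidence is tractable. It is a sketch under assumptions that are not part of the lecture: the parameter takes one of three hypothetical coin biases, the data are a handful of made-up coin flips, and `evidence_bound` is an illustrative helper name. The code evaluates $\mathbb{E}_{q(\theta)}[\mathrm{H}[\mathcal{D} \mid \theta]] + D_{\mathrm{KL}}(q(\Theta) \,\|\, p(\Theta))$ and compares it with $\mathrm{H}(p(\mathcal{D})) = -\ln p(\mathcal{D})$.

```python
import math

thetas = [0.2, 0.5, 0.8]        # hypothetical coin biases (discrete parameter space)
prior = [1 / 3, 1 / 3, 1 / 3]   # p(theta)
data = [1, 1, 0, 1, 1]          # made-up flips (1 = heads)


def neg_log_lik(theta):
    """H[D | theta] = -ln p(D | theta) for i.i.d. Bernoulli flips."""
    return -sum(math.log(theta if flip else 1.0 - theta) for flip in data)


def kl(q, p):
    """D_KL(q || p) for discrete distributions (0 ln 0 is treated as 0)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def evidence_bound(q):
    """E_q[H[D | theta]] + D_KL(q || prior)."""
    expected_nll = sum(qi * neg_log_lik(t) for qi, t in zip(q, thetas))
    return expected_nll + kl(q, prior)


# Exact evidence and posterior; tractable here only because theta is discrete.
joint = [p * math.exp(-neg_log_lik(t)) for p, t in zip(prior, thetas)]
evidence = sum(joint)
posterior = [j / evidence for j in joint]
info_evidence = -math.log(evidence)  # H(p(D))

candidates = ([1 / 3, 1 / 3, 1 / 3], [0.0, 0.1, 0.9], posterior)
for q in candidates:
    gap = evidence_bound(q) - info_evidence  # equals D_KL(q || posterior)
    print([round(v, 3) for v in q], f"gap={gap:.6f}")
```

The gap between the bound and $\mathrm{H}(p(\mathcal{D}))$ is exactly the KL divergence to the posterior, so it is non-negative and vanishes when `q` is the posterior.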
Note: $\mathrm{H}(p(\mathcal{D}))$ is independent of $q(\theta)$, so we can equivalently optimize:

$$\min_{q(\theta)} D_{\mathrm{KL}}(q(\Theta) \,\|\, p(\Theta \mid \mathcal{D})) \ge 0
\iff
\min_{q(\theta)} \mathbb{E}_{q(\theta)}[\mathrm{H}[\mathcal{D} \mid \theta]] + D_{\mathrm{KL}}(q(\Theta) \,\|\, p(\Theta)) \ge \mathrm{H}(p(\mathcal{D})).$$

Evidence Bound

Theorem 1 (Evidence Bound). For any distributions $q(\theta)$ and $p(\theta \mid \mathcal{D})$, we have:

$$D_{\mathrm{KL}}(q(\Theta) \,\|\, p(\Theta \mid \mathcal{D})) \ge 0
\iff
\mathbb{E}_{q(\theta)}[\mathrm{H}[\mathcal{D} \mid \theta]] + D_{\mathrm{KL}}(q(\Theta) \,\|\, p(\Theta)) \ge \mathrm{H}(p(\mathcal{D})),$$

where $\mathrm{H}(p(\mathcal{D}))$ is independent of $q(\theta)$, with equality when $q(\theta) = p(\theta \mid \mathcal{D})$.

Crucial: We can minimize the Evidence Bound without having to compute the intractable evidence.

Evidence Lower Bound

Equivalently, but more confusingly, following Kingma (2013):

Corollary 1 (Evidence Lower Bound). Using probabilities instead of information (i.e., writing $-\ln p(\cdot)$ in place of $\mathrm{H}(\cdot)$), we get:

$$\begin{aligned}
\ln p(\mathcal{D}) &= \mathbb{E}_{q(\theta)}[\ln p(\mathcal{D} \mid \theta)] - D_{\mathrm{KL}}(q(\Theta) \,\|\, p(\Theta)) + D_{\mathrm{KL}}(q(\Theta) \,\|\, p(\Theta \mid \mathcal{D})) \\
&\ge \mathbb{E}_{q(\theta)}[\ln p(\mathcal{D} \mid \theta)] - D_{\mathrm{KL}}(q(\Theta) \,\|\, p(\Theta)).
\end{aligned}$$

We can thus equally maximize:

$$\max_{q(\theta)} \mathbb{E}_{q(\theta)}[\ln p(\mathcal{D} \mid \theta)] - D_{\mathrm{KL}}(q(\Theta) \,\|\, p(\Theta)).$$

Proof of Equivalence

Proof. Starting with the Evidence Bound:

$$\mathbb{E}_{q(\theta)}[\mathrm{H}[\mathcal{D} \mid \theta]] + D_{\mathrm{KL}}(q(\Theta) \,\|\, p(\Theta)) - \mathrm{H}(p(\mathcal{D})) \ge 0$$

1. Replace information with negative log probabilities:
$$\mathrm{H}[\mathcal{D} \mid \theta] = -\ln p(\mathcal{D} \mid \theta), \qquad \mathrm{H}(p(\mathcal{D})) = -\ln p(\mathcal{D}).$$
2. Substitute:
$$-\mathbb{E}_{q(\theta)}[\ln p(\mathcal{D} \mid \theta)] + D_{\mathrm{KL}}(q(\Theta) \,\|\, p(\Theta)) + \ln p(\mathcal{D}) \ge 0.$$
3. Multiply by -1 (flips the inequality) and rearrange:
$$\mathbb{E}_{q(\theta)}[\ln p(\mathcal{D} \mid \theta)] - D_{\mathrm{KL}}(q(\Theta) \,\|\, p(\Theta)) \le \ln p(\mathcal{D}).$$

This gives us the ELBO formulation that we maximize instead of minimize. □

Evidence Bound vs ELBO

The Evidence Bound and ELBO are equivalent objectives: one phrased in terms of information theory (which we minimize) and one phrased in terms of probabilities (which we maximize).

Summary

Let's quickly recap the Evidence Bound derivation:

$$\begin{aligned}
D_{\mathrm{KL}}(q(\Theta) \,\|\, p(\Theta \mid \mathcal{D}))
&= \mathrm{H}(q(\Theta) \,\|\, p(\Theta \mid \mathcal{D})) - \mathrm{H}(q(\Theta)) \\
&= \mathrm{H}\!\left(q(\Theta) \,\Big\|\, \frac{p(\mathcal{D} \mid \Theta)\, p(\Theta)}{p(\mathcal{D})}\right) - \mathrm{H}(q(\Theta)) \\
&= \mathbb{E}_{q(\theta)}[\mathrm{H}[\mathcal{D} \mid \theta]] - \mathrm{H}(p(\mathcal{D})) + D_{\mathrm{KL}}(q(\Theta) \,\|\, p(\Theta)) \ge 0.
\end{aligned}$$

Since the KL divergence is non-negative:

$$\mathbb{E}_{q(\theta)}[\mathrm{H}[\mathcal{D} \mid \theta]] + D_{\mathrm{KL}}(q(\Theta) \,\|\, p(\Theta)) \ge \mathrm{H}(p(\mathcal{D})).$$

Mean Field VI

Mean Field Variational Inference (MFVI):
1. approximates the posterior distribution $p(\theta \mid \mathcal{D})$
2. with a factorized variational distribution $q(\theta) = \prod_i q(\theta_i)$,
3. where each $\Theta_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$.

The Core Idea

Mean Field VI approximates $p(\theta \mid \mathcal{D})$ with a factorized Gaussian:

$$q(\theta) = \prod_i \mathcal{N}(\theta_i; \mu_i, \sigma_i^2).$$

Evidence Bound

The objective decomposes nicely under MFVI:

$$\mathcal{L} = \mathbb{E}_{q(\theta)}[\mathrm{H}(p(\mathcal{D} \mid \theta))] + D_{\mathrm{KL}}(q(\Theta) \,\|\, p(\Theta)),$$

where the KL term has a closed form for Gaussians. For $p(\theta_i) = \mathcal{N}(\theta_i; \mu_0, \sigma_0^2)$:

$$D_{\mathrm{KL}}(q(\Theta) \,\|\, p(\Theta)) = \frac{1}{2} \sum_i \left( \frac{\sigma_i^2}{\sigma_0^2} + \frac{(\mu_i - \mu_0)^2}{\sigma_0^2} - 1 - \log \frac{\sigma_i^2}{\sigma_0^2} \right).$$

The Reparameterization Trick

To get unbiased gradients, we reparameterize (Kingma 2013; Blundell et al. 2015):

$$\theta_i = \mu_i + \sigma_i \cdot \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, 1).$$

We can check that we then still have $\theta_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$.

This lets us backprop through the sampling process in the Evidence Bound.

Practical Tips
- Initialize $\log \sigma_i$ around -3 to -4
- Use a softplus transform to keep $\sigma_i$ positive
- The KL term often needs annealing during training:
  During mini-batch updates, we can reweight the KL term, such that overall, its weight is 1, but at the beginning of training, it is higher, and at the end, it is lower. (Blundell et al. 2015, 5)

Monte Carlo Dropout

Based on Gal and Ghahramani (2016).
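Before moving on, the mean-field ingredients from the preceding slides (the closed-form Gaussian KL term, reparameterized sampling, and a softplus transform for $\sigma_i$) can be sketched in a few lines. This is a minimal standard-library sketch, not the lecture's implementation: the function names (`softplus`, `kl_to_prior`, `sample_theta`), the unconstrained parameter `rho`, and the example values are all hypothetical, and a real implementation would use an autodiff framework to backpropagate through the sample.

```python
import math
import random


def softplus(rho):
    """Map an unconstrained rho to a positive sigma = ln(1 + exp(rho))."""
    return math.log1p(math.exp(rho))


def kl_to_prior(mus, sigmas, mu0=0.0, sigma0=1.0):
    """Closed-form D_KL(q || p) for factorized Gaussians, summed over parameters."""
    total = 0.0
    for mu, sigma in zip(mus, sigmas):
        ratio = sigma ** 2 / sigma0 ** 2
        total += 0.5 * (ratio + (mu - mu0) ** 2 / sigma0 ** 2 - 1.0 - math.log(ratio))
    return total


def sample_theta(mus, rhos, rng):
    """Reparameterized draw: theta_i = mu_i + sigma_i * eps_i with eps_i ~ N(0, 1)."""
    return [mu + softplus(rho) * rng.gauss(0.0, 1.0) for mu, rho in zip(mus, rhos)]


rng = random.Random(0)
mus = [0.3, -1.2]     # hypothetical variational means
rhos = [-3.0, -3.0]   # softplus(-3) ~= 0.049, so log(sigma) is close to -3

sigmas = [softplus(r) for r in rhos]
print("KL to N(0, 1) prior:", kl_to_prior(mus, sigmas))
print("one reparameterized sample:", sample_theta(mus, rhos, rng))
```

Because the noise `eps_i` does not depend on `mu_i` or `rho_i`, the sample is a deterministic, differentiable function of the variational parameters, which is what allows gradients to flow through the sampling step.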
Dropout as Variational Inference

Dropout can be reinterpreted as performing VI with a specific variational distribution (with a single sample per training example):

$$\Theta_i \sim \mathrm{Bernoulli}(p) \cdot \frac{\mu_i}{p},$$

where $p$ is the keep probability and $\mu_i$ is the parameter we learn.

The Magic of Dropout

Key insight: Dropout at test time approximates model averaging!

Regression

$$\hat{y} \approx \frac{1}{T} \sum_{t=1}^{T} f(x; \hat{\theta}_t), \qquad \hat{\theta}_t \sim q(\theta)$$

Classification

$$p(y \mid x) \approx \frac{1}{T} \sum_{t=1}^{T} \underbrace{\mathrm{softmax}(f(x; \hat{\theta}_t))}_{p(y \mid x, \hat{\theta}_t)}, \qquad \hat{\theta}_t \sim q(\theta).$$

Monte Carlo Dropout

Hence the name Monte Carlo Dropout: we compute a Monte Carlo approximation of the model averaging expectation:

$$p(y \mid x, \mathcal{D}) \approx \mathbb{E}_{q(\theta)}[p(y \mid x, \theta)] \approx \frac{1}{T} \sum_{t=1}^{T} \mathrm{softmax}(f(x; \hat{\theta}_t)), \qquad \hat{\theta}_t \sim q(\theta).$$

Uncertainty Estimates

Get predictive uncertainty for regression without noise as follows:
1. Keep dropout active at test time
2. Make $T$ forward passes with different dropout masks
3. Compute statistics of predictions:
   - Mean: $\mathbb{E}[Y] \approx \frac{1}{T} \sum_{t=1}^{T} \hat{y}_t$
   - Variance: $\mathrm{Var}[Y] \approx \frac{1}{T} \sum_{t=1}^{T} (\hat{y}_t - \mathbb{E}[Y])^2$

Classification

Get predictive uncertainty for classification as follows:
1. Keep dropout active at test time
2. Make $T$ forward passes with different dropout masks
3. Compute statistics of predictions:

Predictive distribution:
$$p(y \mid x, \mathcal{D}) \approx \frac{1}{T} \sum_{t=1}^{T} \underbrace{\mathrm{softmax}(f(x; \hat{\theta}_t))}_{p(y \mid x, \hat{\theta}_t)}, \qquad \hat{\theta}_t \sim q(\theta)$$

Predictive entropy:
$$\mathrm{H}[Y \mid x, \mathcal{D}] \approx \mathrm{H}\!\left(\frac{1}{T} \sum_{t=1}^{T} p(y \mid x, \hat{\theta}_t)\right), \qquad \hat{\theta}_t \sim q(\theta)$$

Aleatoric uncertainty:
$$\mathrm{H}[Y \mid x, \Theta, \mathcal{D}] \approx \frac{1}{T} \sum_{t=1}^{T} \mathrm{H}(p(y \mid x, \hat{\theta}_t)), \qquad \hat{\theta}_t \sim q(\theta)$$

Epistemic uncertainty:
$$\begin{aligned}
\mathrm{I}[Y; \Theta \mid x, \mathcal{D}] &= \mathrm{H}[Y \mid x, \mathcal{D}] - \mathrm{H}[Y \mid x, \Theta, \mathcal{D}] \\
&\approx \mathrm{H}\!\left(\frac{1}{T} \sum_{t=1}^{T} p(y \mid x, \hat{\theta}_t)\right) - \frac{1}{T} \sum_{t=1}^{T} \mathrm{H}(p(y \mid x, \hat{\theta}_t)), \qquad \hat{\theta}_t \sim q(\theta)
\end{aligned}$$

Advantages
- Almost free! Just reuse existing dropout layers
- No additional parameters to learn
- Works with any architecture that uses dropout
- Computationally efficient compared to full VI

Limitations
- Fixed form of uncertainty (Bernoulli)
- Dropout rate affects uncertainty estimates
- May underestimate uncertainty
- Not all architectures benefit from (or work with!) dropout

Best Practices
- Start with a dropout rate around 0.5 for uncertainty and then lower it
- Increase $T$ for more precise estimates
- Consider model architecture carefully
- Validate uncertainty estimates empirically

Code Example

    import torch
    import torch.nn.functional as F
    from typing import Tuple


    @torch.inference_mode()
    def sample_predictions(model, x: torch.Tensor, n_samples: int = 30) -> torch.Tensor:
        """Get samples from the predictive distribution.

        Args:
            model: Neural network with dropout/sampling capability
            x: Input tensor of shape (batch_size, ...)
            n_samples: Number of Monte Carlo samples

        Returns:
            Tensor of shape (n_samples, batch_size, n_classes) containing log_softmax outputs
        """
        model.train()  # Enable dropout (this also switches layers such as batch norm to training mode)
        samples = torch.stack([
            model(x)  # Assuming model outputs log_softmax
            for _ in range(n_samples)
        ])
        model.eval()  # Back to evaluation mode
        return samples

Probability Simplex

    Case       Predicted Class  Confidence  Total Uncertainty  Aleatoric Uncertainty  Epistemic Uncertainty
    Confident  0                1.000       0.001              0.001                  0.000
    Aleatoric  1                0.335       1.099              1.082                  0.016
    Epistemic  1                0.345       1.098              0.003                  1.095

Regression Example

    import torch
    import torch.nn as nn
    import torch.optim as optim
    import numpy as np
    import matplotlib.pyplot as plt

    # Define the Residual Block
    class ResidualBlock(nn.Module):
        def __init__(self, in_features):
            super().__init__()
            self.linear1 = nn.Linear(in_features, in_features)
            self.relu = nn.ReLU()
            self.linear2 = nn.Linear(in_features, in_features)
            self.dropout = nn.Dropout(p=0.5)

        def forward(self, x):
            residual = x
            out = self.linear1(x)
            …

Laplace Approximation

The Core Idea
- Approximate the posterior $p(\theta \mid \mathcal{D})$ with a Gaussian distribution
- Uses a second-order Taylor expansion of the information content around the MAP estimate $\theta^*$
- Results in $q(\theta) = \mathcal{N}(\theta^*, \mathrm{H}''[\theta^*]^{-1})$

What is $\mathrm{H}''[\theta^*]$ though?

Notation

We write $\mathrm{H}'[\cdot]$ and $\mathrm{H}''[\cdot]$ for the Jacobian and Hessian of the negative log probability of $\cdot$:

$$\begin{aligned}
\mathrm{H}'[\cdot] &:= \nabla_\theta [\mathrm{H}[\cdot]] = -\nabla_\theta \log p(\cdot), \\
\mathrm{H}''[\cdot] &:= \nabla_\theta^2 [\mathrm{H}[\cdot]] = -\nabla_\theta^2 \log p(\cdot).
\end{aligned}$$

Second-Order Taylor Expansion - Take 1

Expanding the information content around $\theta^*$:

$$\begin{aligned}
-\log p(\theta \mid \mathcal{D}) \approx{} & -\log p(\theta^* \mid \mathcal{D}) \\
& + (\theta - \theta^*)^T \nabla_\theta [-\log p(\theta^* \mid \mathcal{D})] \\
& + \frac{1}{2} (\theta - \theta^*)^T \nabla_\theta^2 [-\log p(\theta^* \mid \mathcal{D})] (\theta - \theta^*) \\
& + O(\|\theta - \theta^*\|^3).
\end{aligned}$$

Completing the Square

For a quadratic form, we can complete the square:

$$\begin{aligned}
\frac{1}{2} x^T A x + b^T x + c &= \frac{1}{2} (x + A^{-1} b)^T A (x + A^{-1} b) - \frac{1}{2} b^T A^{-1} b + c \\
&= \frac{1}{2} (x - \mu)^T A (x - \mu) + \text{const},
\end{aligned}$$

with $\mu := -A^{-1} b$.
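The completing-the-square identity is easy to verify numerically. The sketch below does so for a 2×2 case in plain Python. The matrix `A`, vector `b`, scalar `c`, and the test point are arbitrary hypothetical values, and the small linear-algebra helpers are written out by hand so that nothing beyond the standard library is assumed. The identity requires `A` to be symmetric and invertible.

```python
def matvec(M, v):
    """Matrix-vector product for nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def inverse_2x2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


A = [[2.0, 0.5], [0.5, 1.0]]   # symmetric, positive definite
b = [1.0, -2.0]
c = 0.7

A_inv = inverse_2x2(A)
mu = [-v for v in matvec(A_inv, b)]           # mu := -A^{-1} b
const = c - 0.5 * dot(b, matvec(A_inv, b))    # c - 1/2 b^T A^{-1} b


def quadratic(x):
    """Left-hand side: 1/2 x^T A x + b^T x + c."""
    return 0.5 * dot(x, matvec(A, x)) + dot(b, x) + c


def completed(x):
    """Right-hand side: 1/2 (x - mu)^T A (x - mu) + const."""
    d = [xi - mi for xi, mi in zip(x, mu)]
    return 0.5 * dot(d, matvec(A, d)) + const


x = [0.3, -1.1]
print(quadratic(x), completed(x))
```

Read as a negative log density, the completed form corresponds to a Gaussian with mean `mu` and precision matrix `A`, which matches the role the Hessian plays in the Laplace approximation above.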