Submodularity & Batch Acquisition (PDF)
Summary
Lecture notes providing an overview of submodularity and batch acquisition methods, particularly BatchBALD and stochastic batch acquisition in active learning. It highlights the shortcomings of naive top-k selection and emphasizes the importance of considering redundancy and information theory in efficient batch selection.
Full Transcript
Submodularity & Batch Acquisition (BatchBALD & Stochastic Batch Acquisition)
Source: https://blackhc.github.io/balitu/lectures/lecture_4_batchbald.html
Why Batch Acquisition?

Real-World Impact
- Labeling data is expensive ($200+/hour for medical experts)
- Need to request labels in batches for efficiency
- Naive batch selection often wastes resources

What You'll Learn Today

After this lecture, you'll be able to:
- Understand why naive batch acquisition fails
- Apply information theory to batch selection
- Use submodularity for efficient algorithms
- Implement BatchBALD in practice

Common Misconceptions

Warning: batch acquisition is NOT:
- Simply selecting top-k points independently
- Only about computational efficiency
- Limited to specific acquisition functions

Batch Acquisition

Top-K Algorithm

Score All Points → Sort by Score → Take Top K → Label All → Train Model → Repeat

Pseudocode

```python
def naive_batch_acquisition(pool_set, batch_size, model):
    # Score all points independently
    scores = []
    for x in pool_set:
        score = acquisition_function(x, model)
        scores.append((x, score))

    # Sort by score and take top-k
    scores.sort(key=lambda pair: pair[1], reverse=True)
    selected = [x for x, _ in scores[:batch_size]]

    return selected
```

The Problem with Top-K
- Selects points independently
- Ignores redundancy between points
- Can waste labeling budget
- Gets worse with larger batch sizes

Redundancy Issue

[Figure: batches selected by BALD vs. BatchBALD]

Redundancy Issue

Consider MNIST digits:
- Top-K selects many "8"s
- All provide similar information overall vs. other digits
- Wastes labeling budget
- Better to select diverse examples
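To make the redundancy failure concrete, here is a minimal NumPy sketch (my own illustration, not the lecture's code): the predictive probabilities, the tiny pool, and the duplicated points are all hypothetical, and BALD is computed directly from Monte Carlo parameter samples. Naive top-k spends the whole budget on three identical copies of the same ambiguous point.

```python
import numpy as np

def entropy(p, axis=-1):
    """Shannon entropy in nats along the given axis."""
    return -np.sum(p * np.log(np.clip(p, 1e-12, 1.0)), axis=axis)

def bald_scores(probs):
    """BALD = H[E_theta p(y|x,theta)] - E_theta H[p(y|x,theta)].

    probs: array of shape [n_theta_samples, n_points, n_classes].
    """
    mean_p = probs.mean(axis=0)  # marginal predictive distribution per point
    return entropy(mean_p) - entropy(probs).mean(axis=0)

# Hypothetical 3-class predictions under 4 posterior samples.
# Point A is ambiguous and the samples disagree (high BALD);
# points B and C are exact copies of A (redundant); D and E are confident.
point_a = np.array([[0.80, 0.10, 0.10],
                    [0.10, 0.80, 0.10],
                    [0.10, 0.10, 0.80],
                    [0.34, 0.33, 0.33]])
point_d = np.tile([0.96, 0.02, 0.02], (4, 1))
point_e = np.tile([0.02, 0.96, 0.02], (4, 1))

probs = np.stack([point_a, point_a.copy(), point_a.copy(), point_d, point_e], axis=1)
scores = bald_scores(probs)
print("BALD scores:", np.round(scores, 3))

# Naive top-3 grabs the three identical copies of A: a wasted budget,
# since labeling one of them already provides almost all of their information.
print("Top-3 selection:", np.argsort(-scores)[:3])
```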
Simple Baseline

1. Bucket by predicted class.
2. Select examples by highest acquisition function score in each bucket, round-robin.

Batch Size Effect

Small Batch Size
- Easier to contain only similar points
- More vulnerable to local redundancy
- May need more acquisition rounds

Large Batch Size
- Unlikely to contain only similar points
- Natural diversity through size
- More expensive labelling

Promise of Good Batch Acquisition

[Figure]

Saving Time and Labelling Cost

[Figure]

Information-Theoretic Perspective

BALD (≙ EIG)

For random variables Y (labels) and Θ (model parameters):

I[Y; Θ | x, D] = H[Y | x, D] − E_{p(θ|D)}[H[Y | x, θ, D]]

Note: This measures how much knowing the model parameters reduces our uncertainty about the labels.

Top-K Sum Equivalence

Let A_k be a batch of k points. Taking the top-k points by score is equivalent to:

arg max_{A_k ⊆ P, |A_k| = k} ∑_{x ∈ A_k} I[Y; Θ | x, D]

Key Insight: This shows why Top-K fails: it assumes information from each point adds independently!

Top-K Intuition

∑_{x ∈ A_k} I[Y; Θ | x, D]

Batch Intuition

How can we avoid double-counting information? We want:

μ*[Θ ∩ (∪_{x ∈ A_k} Y|x)]

I-Diagram ⇒ Information Quantity

μ*[Θ ∩ (∪_{x ∈ A_k} Y|x)]  ≡  I[Y_1, …, Y_k; Θ | x_1, …, x_k, D].
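To see the double-counting explicitly, here is a short worked decomposition for a batch of two points (my own illustration, using only the chain rule and the model's conditional independence of labels given Θ; the conditioning on x_1, x_2, D is suppressed for brevity):

```latex
\begin{aligned}
I[Y_1, Y_2; \Theta]
  &= I[Y_1; \Theta] + I[Y_2; \Theta \mid Y_1] && \text{(chain rule)} \\
  &= I[Y_1; \Theta] + I[Y_2; \Theta] - I[Y_1; Y_2; \Theta] \\
  &= I[Y_1; \Theta] + I[Y_2; \Theta] - \underbrace{I[Y_1; Y_2]}_{\text{overlap}},
\end{aligned}
```

since I[Y_1; Y_2; Θ] = I[Y_1; Y_2] − I[Y_1; Y_2 | Θ] and I[Y_1; Y_2 | Θ] = 0 under this model. The naive sum over-counts exactly the mutual information I[Y_1; Y_2] between the two predictions, which is the overlap that scoring the batch jointly removes.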
Joint Mutual Information

Definition 1 (BatchBALD): BatchBALD is the mutual information between the model parameters and the joint predictions of a batch of points:

I[Y_1, …, Y_k; Θ | x_1, …, x_k, D]

Comparison

BALD
- Scores independently
- Double counts
- ∑_i I[Y_i; Θ | x_i, D]

BatchBALD
- Scores jointly
- Accounts for redundancy
- I[Y_1, …, Y_k; Θ | x_1, …, x_k, D]

Acquisition Function

arg max_{A_k ⊆ P, |A_k| = k} I[Y_1, …, Y_k; Θ | x_1, …, x_k, D]

Problem: combinatorial search over all subsets! 🙀😱 What's a cheap way to find a batch?

Greedy Optimization

```python
def greedy_batch_acquisition(pool_set, k, acquisition_fn):
    batch = []
    remaining = set(pool_set)

    for _ in range(k):
        # Find point that maximizes marginal gain
        best_x = max(remaining,
                     key=lambda x: acquisition_fn(batch + [x]))

        batch.append(best_x)
        remaining.remove(best_x)

    return batch
```

Key Properties
- Simple to implement
- Works well in practice

Guarantees

It actually works well in practice and achieves approximately a

1 − 1/e ≈ 0.63

approximation ratio of the optimal batch!

Submodularity

Greedy optimization of submodular functions yields a 1 − 1/e ≈ 0.63 approximation of the optimum!

What is Submodularity?

Submodularity captures "diminishing returns": adding elements to larger sets gives less benefit.

Submodularity Visualization

[Figure. Left: total value increases sublinearly. Right: marginal gain decreases with set size.]

Formal Definition

Nemhauser, Wolsey, and Fisher (1978):

Definition 2: A set function f is submodular when ∀A, B:

f(A ∪ B) + f(A ∩ B) ≤ f(A) + f(B).
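As a quick numerical illustration of Definition 2 (my own sketch, not from the lecture), a set-cover style function, here the number of distinct items covered by a set of hypothetical "sensors", is a classic submodular function, and we can brute-force check the inequality on a small ground set:

```python
from itertools import chain, combinations

# Hypothetical coverage example: each "sensor" covers some items;
# f(A) = number of distinct items covered by the sensors in A.
coverage = {
    "s1": {1, 2, 3},
    "s2": {3, 4},
    "s3": {4, 5, 6},
    "s4": {1, 6},
}

def f(A):
    return len(set().union(*(coverage[s] for s in A))) if A else 0

def subsets(ground):
    return chain.from_iterable(combinations(ground, r) for r in range(len(ground) + 1))

ground = list(coverage)
for A in map(frozenset, subsets(ground)):
    for B in map(frozenset, subsets(ground)):
        # Definition 2: f(A ∪ B) + f(A ∩ B) <= f(A) + f(B)
        assert f(A | B) + f(A & B) <= f(A) + f(B)

print("Coverage satisfies f(A ∪ B) + f(A ∩ B) ≤ f(A) + f(B) for all A, B.")
```

The same check fails for non-submodular functions such as f(A) = |A|², which is one way to build intuition for which objectives the greedy guarantee below applies to.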
More Intuitive Definition?

Intuitively: adding elements to larger sets gives fewer benefits.

For X and Y_1 ⊆ Y_2, such that X and Y_2 are disjoint (X ∩ Y_2 = ∅):

"f(X | Y_2) ≤ f(X | Y_1)."

Formally:

f(X ∪ Y_2) − f(Y_2) ≤ f(X ∪ Y_1) − f(Y_1).

Formal More Intuitive Definition

Definition 3: For X and Y_1 ⊆ Y_2, such that X and Y_2 are disjoint (X ∩ Y_2 = ∅), f is submodular iff:

f(X ∪ Y_2) − f(Y_2) ≤ f(X ∪ Y_1) − f(Y_1).

If we define the marginal gain of adding X to Y:

Δ(X | Y) := f(X ∪ Y) − f(Y),

then this is equivalent to:

Δ(X | Y_2) ≤ Δ(X | Y_1).

Proof of Equivalence

Proof (Formal Definition ⟹ Intuitive Definition).
1. Assume f is submodular as per the formal definition: f(A ∪ B) + f(A ∩ B) ≤ f(A) + f(B).
2. For X and Y_1 ⊆ Y_2, such that X and Y_2 are disjoint (X ∩ Y_2 = ∅), let:
   A := X ∪ Y_1
   B := Y_2
3. Given that Y_1 ⊆ Y_2 and X ∩ Y_2 = ∅, we have:
   A ∪ B = X ∪ Y_1 ∪ Y_2 = X ∪ Y_2
   A ∩ B = (X ∪ Y_1) ∩ Y_2 = (X ∩ Y_2) ∪ (Y_1 ∩ Y_2) = Y_1
4. Starting from the formal submodular inequality:
   f(A ∪ B) + f(A ∩ B) ≤ f(A) + f(B)
   ⟺ f(X ∪ Y_2) + f(Y_1) ≤ f(X ∪ Y_1) + f(Y_2)
   ⟺ f(X ∪ Y_2) − f(Y_2) ≤ f(X ∪ Y_1) − f(Y_1).
This matches the intuitive definition. □

Proof (Intuitive Definition ⟹ Formal Definition).
1. Assume f satisfies the intuitive definition: f(X ∪ Y_2) − f(Y_2) ≤ f(X ∪ Y_1) − f(Y_1), for all X and Y_1 ⊆ Y_2 such that X and Y_2 are disjoint (X ∩ Y_2 = ∅).
2. Consider arbitrary sets A and B. Define:
   X := A ∖ B
   Y_1 := A ∩ B
   Y_2 := B
3. Now compute:
   X ∪ Y_1 = (A ∖ B) ∪ (A ∩ B) = A
   X ∪ Y_2 = (A ∖ B) ∪ B = A ∪ B
4. Observe that:
   X ∩ Y_2 = (A ∖ B) ∩ B = ∅
   Y_1 ⊆ Y_2
5. Thus, the intuitive definition implies:
   f(X ∪ Y_2) − f(Y_2) ≤ f(X ∪ Y_1) − f(Y_1)
   ⟺ f(A ∪ B) − f(B) ≤ f(A) − f(A ∩ B)
   ⟺ f(A ∪ B) + f(A ∩ B) ≤ f(A) + f(B).
This matches the formal definition. □

Simpler Definition

But wait, there is more: we can simplify this further.

Theorem 1: For any Y_1 ⊆ Y_2 and e ∉ Y_2, f is submodular iff:

f(Y_2 ∪ {e}) − f(Y_2) ≤ f(Y_1 ∪ {e}) − f(Y_1).

That is: Δ(e | Y_2) ≤ Δ(e | Y_1).

Proof

1. ⟹ follows immediately from the previous definition.
2. ⟸:
   1. We assume that for any Y_1 ⊆ Y_2 and e ∉ Y_2:
      Δ(e | Y_2) ≤ Δ(e | Y_1).
   2. For X and Y_1 ⊆ Y_2, such that X and Y_2 are disjoint (X ∩ Y_2 = ∅), let:
      S := Y_2 ∖ Y_1,  S = {s_1, …, s_n},  S_k := {s_1, …, s_k} ⊆ S.
   3. For all k, we have:
      Δ(s_k | X ∪ Y_1 ∪ S_{k−1}) ≤ Δ(s_k | Y_1 ∪ S_{k−1}),
      as Y_1 ∪ S_{k−1} ⊆ X ∪ Y_1 ∪ S_{k−1}.

      Telescoping: a technique where terms in a sequence cancel out except for the first and last terms. The name comes from how a telescope's segments collapse into each other.

   4. Using telescoping, we get:
      ∑_{k=1}^{n} Δ(s_k | X ∪ Y_1 ∪ S_{k−1}) ≤ ∑_{k=1}^{n} Δ(s_k | Y_1 ∪ S_{k−1})
      ⟺ ∑_{k=1}^{n} [f(X ∪ Y_1 ∪ S_{k−1} ∪ {s_k}) − f(X ∪ Y_1 ∪ S_{k−1})] ≤ ∑_{k=1}^{n} [f(Y_1 ∪ S_{k−1} ∪ {s_k}) − f(Y_1 ∪ S_{k−1})]
      ⟺ ∑_{k=1}^{n} [f(X ∪ Y_1 ∪ S_k) − f(X ∪ Y_1 ∪ S_{k−1})] ≤ ∑_{k=1}^{n} [f(Y_1 ∪ S_k) − f(Y_1 ∪ S_{k−1})]
      ⟺ f(X ∪ Y_1 ∪ S) − f(X ∪ Y_1) ≤ f(Y_1 ∪ S) − f(Y_1)
      ⟺ f(X ∪ Y_2) − f(X ∪ Y_1) ≤ f(Y_2) − f(Y_1),
      where we have used that Y_2 = Y_1 ∪ S ⟷ S = Y_2 ∖ Y_1.
   5. Rearranging, we get:
      f(X ∪ Y_2) − f(Y_2) ≤ f(X ∪ Y_1) − f(Y_1). □

Summary

This characterizes submodularity in terms of marginal gains. f is submodular
⟺ f(A ∪ B) + f(A ∩ B) ≤ f(A) + f(B)
⟺ Δ(X | Y_2) ≤ Δ(X | Y_1)
⟺ Δ(e | Y_2) ≤ Δ(e | Y_1),
for Y_1 ⊆ Y_2, e ∉ Y_2, and X ∩ Y_2 = ∅.

1 − 1/e Optimality

Theorem 2: For a monotone submodular function f with f(∅) = 0, the greedy algorithm achieves at least:

f(A_k^greedy) ≥ (1 − 1/e) f(A_k^*)

where A_k^* is the optimal set of size k.

Key Insights
- Greedy selection achieves ≈63% of optimal value
- Holds even for very large search spaces

Monotone Submodular Functions

Definition 4: A set function f: 2^V → ℝ is monotone if for all A ⊆ B ⊆ V:

f(A) ≤ f(B)

Intuition
- Adding elements never hurts ↔ non-negative gains!
- Function value can only increase
- Different from diminishing returns!

Common Confusion
- Submodularity: diminishing returns
- Monotonicity: always increasing
- Both needed for greedy guarantees!

Questions

1. What does a non-monotone submodular function look like? (See the example below and the figures that follow.)
2. What does a non-monotone non-submodular function look like?
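One standard answer to the first question (my own example, not from the slides):

```latex
f(A) = |A|\,\bigl(|V| - |A|\bigr),
\qquad
\Delta(e \mid A) = f(A \cup \{e\}) - f(A) = |V| - 2|A| - 1 \quad (e \notin A).
```

The marginal gain decreases as |A| grows, so f is submodular, but it becomes negative once |A| > (|V| − 1)/2, so f is not monotone. (The cut function of a graph is another classic non-monotone submodular function.)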
Submodular Non-/Monotone

[Figure]

Non-Submodular Non-Monotone

[Figure]

Optimality Proof

Statement

Let f: 2^V → ℝ_{≥0} be a monotone (i.e., f(A) ≤ f(B) whenever A ⊆ B) and submodular function with f(∅) = 0, defined over a finite ground set V. Our goal is to select a subset S ⊆ V of size k that maximizes f(S):

S* = arg max_{|S| ≤ k} f(S).

Greedy Algorithm

The greedy algorithm constructs a set S_greedy by starting with S_0 = ∅ and iteratively adding elements:

S_i = S_{i−1} ∪ {e_i},

where e_i is chosen to maximize the marginal gain:

e_i := arg max_{e ∈ V ∖ S_{i−1}} [f(S_{i−1} ∪ {e}) − f(S_{i−1})] = arg max_{e ∈ V ∖ S_{i−1}} Δ(e | S_{i−1}).

Goal

We will show that:

f(S_greedy) ≥ (1 − 1/e) f(S*).

Key Concepts

Marginal Gain: the benefit of adding an element e to a set A:

Δ(e | A) := f(A ∪ {e}) − f(A).

Submodularity: the diminishing-returns property, i.e., for A ⊆ B, we have:

Δ(e | A) ≥ Δ(e | B).

Proof Sketch

1. Bounding the Marginal Gain. At each iteration i:
   Δ(e_i | S_{i−1}) ≥ (1/k)[f(S*) − f(S_{i−1})].
2. Recursive Inequality. Update the function value:
   f(S_i) ≥ f(S_{i−1}) + (1/k)[f(S*) − f(S_{i−1})].
3. Unfolding the Recursion. After k iterations:
   f(S_greedy) ≥ (1 − (1 − 1/k)^k) f(S*).
4. Limit Evaluation. As k increases:
   (1 − 1/k)^k ≤ 1/e,
   so:
   f(S_greedy) ≥ (1 − 1/e) f(S*).

Proof of (1 − 1/e)-Optimality

Step 1: Bounding the Marginal Gain

1. At iteration i, consider the optimal set S* and the current greedy set S_{i−1}.
2. Define R := S* ∖ S_{i−1}.
3. Since |S*| = k, we have |R| ≤ k.
4. By submodularity and the fact that f is monotone:
   f(S*) − f(S_{i−1}) ≤ f(S_{i−1} ∪ R) − f(S_{i−1}) ≤ ∑_{e ∈ R} Δ(e | S_{i−1}).   (1)

   Explanation: the first inequality follows from monotonicity. The second follows from submodularity after telescoping. Writing R = {r_1, …, r_{|R|}}:
   f(S_{i−1} ∪ R) − f(S_{i−1})
   = ∑_{j=1}^{|R|} [f(S_{i−1} ∪ {r_1, …, r_j}) − f(S_{i−1} ∪ {r_1, …, r_{j−1}})]
   = ∑_{j=1}^{|R|} Δ(r_j | S_{i−1} ∪ {r_1, …, r_{j−1}})
   ≤ ∑_{j=1}^{|R|} Δ(r_j | S_{i−1})
   = ∑_{e ∈ R} Δ(e | S_{i−1}).
5. Since the greedy choice e_i maximizes Δ(e | S_{i−1}):
   Δ(e_i | S_{i−1}) ≥ (1/|R|) ∑_{e ∈ R} Δ(e | S_{i−1}) ≥ (1/k)[f(S*) − f(S_{i−1})].

   Explanation: the first inequality follows from maximality, Δ(e_i | S_{i−1}) ≥ Δ(e | S_{i−1}), by summing over all e ∈ R:
   |R| Δ(e_i | S_{i−1}) ≥ ∑_{e ∈ R} Δ(e | S_{i−1}).
   The second inequality follows from |R| ≤ k and Equation 1 above.

Step 2: Recursive Inequality

From the above bound:
   f(S_i) = f(S_{i−1}) + Δ(e_i | S_{i−1})
          ≥ f(S_{i−1}) + (1/k)[f(S*) − f(S_{i−1})]
          = (1 − 1/k) f(S_{i−1}) + (1/k) f(S*).

Step 3: Unfolding the Recursion

1. Define γ_i := f(S*) − f(S_i).
2. Then:
   γ_i ≤ (1 − 1/k) γ_{i−1}.

   Explanation: we subtract the previous inequality
   f(S_i) ≥ (1 − 1/k) f(S_{i−1}) + (1/k) f(S*)
   from f(S*), which yields:
   f(S*) − f(S_i) ≤ f(S*) − [(1 − 1/k) f(S_{i−1}) + (1/k) f(S*)] = (1 − 1/k)[f(S*) − f(S_{i−1})],
   which is just the definition of γ_i on the LHS and γ_{i−1} on the RHS:
   γ_i ≤ (1 − 1/k) γ_{i−1}.
3. By iterating this inequality:
   γ_k ≤ (1 − 1/k)^k f(S*).

   Explanation: we start with γ_0 = f(S*) − f(S_0) = f(S*), as f(S_0) = f(∅) = 0. Then we apply the inequality γ_i ≤ (1 − 1/k) γ_{i−1} iteratively to obtain:
   γ_k ≤ (1 − 1/k)(1 − 1/k) ⋯ (1 − 1/k) γ_0 = (1 − 1/k)^k γ_0.
4. Thus, the greedy solution satisfies:
   f(S_greedy) = f(S_k) = f(S*) − γ_k ≥ f(S*)(1 − (1 − 1/k)^k).

   Explanation: we start with γ_k = f(S*) − f(S_k). Then we apply the iterated inequality to obtain:
   f(S*) − γ_k ≥ f(S*) − (1 − 1/k)^k f(S*) = f(S*)(1 − (1 − 1/k)^k).

Step 4: Limit Evaluation

1. Remembering that:
   lim_{k→∞} (1 − 1/k)^k = 1/e,
2. and that:
   (1 − 1/k)^k < 1/e,
3. we conclude:
   f(S_greedy) ≥ (1 − (1 − 1/k)^k) f(S*) ≥ (1 − 1/e) f(S*). □
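As a sanity check of Theorem 2 (a sketch of my own, using a small weighted-coverage function as the hypothetical objective), we can compare the greedy solution against the brute-force optimum and confirm the (1 − 1/e) bound:

```python
import math
from itertools import combinations

# Hypothetical monotone submodular objective: weighted coverage.
weights = {1: 3.0, 2: 1.0, 3: 2.0, 4: 1.5, 5: 0.5, 6: 2.5}
coverage = {
    "a": {1, 2},
    "b": {2, 3, 4},
    "c": {4, 5},
    "d": {1, 6},
    "e": {3, 6},
}

def f(subset):
    covered = set().union(*(coverage[s] for s in subset)) if subset else set()
    return sum(weights[i] for i in covered)

def greedy(ground, k):
    selected = []
    for _ in range(k):
        # Pick the element with the largest marginal gain Δ(e | selected).
        best = max((e for e in ground if e not in selected),
                   key=lambda e: f(selected + [e]) - f(selected))
        selected.append(best)
    return selected

k = 2
ground = list(coverage)
greedy_set = greedy(ground, k)
optimum = max(f(list(S)) for S in combinations(ground, k))

print(f"greedy : f = {f(greedy_set):.2f}  {greedy_set}")
print(f"optimal: f = {optimum:.2f}")
print("(1 - 1/e) bound holds:", f(greedy_set) >= (1 - 1 / math.e) * optimum)
```

In this tiny example the greedy pick happens to be optimal; the theorem only promises the roughly-63% worst-case bound.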
Summary

The greedy algorithm provides a (1 − 1/e)-approximation to the optimal solution when maximizing a non-negative, monotone, submodular function under a cardinality constraint.

Why?

This result is significant because it guarantees that the greedy approach to batch acquisition in settings like active learning will achieve near-optimal utility, ensuring efficient use of resources.

Submodular BatchBALD

Set f(x_1, …, x_k) := I[Y_1, …, Y_k; Θ | x_1, …, x_k].

Show:
1. f(∅) = 0
2. Monotone
3. Submodular

Marginal Gain

For this definition of f, the marginal gain is:

Δ(x_e | x_1, …, x_k) = f({x_1, …, x_k, x_e}) − f({x_1, …, x_k})
= I[Y_e, Y_1, …, Y_k; Θ | x_e, x_1, …, x_k] − I[Y_1, …, Y_k; Θ | x_1, …, x_k]
= I[Y_e; Θ | x_e, Y_1, x_1, …, Y_k, x_k]

Proof (Last Step)

1. Convince yourself that:
   I[A, B; Θ | C] = I[A; Θ | C] + I[B; Θ | A, C].
2. And then:
   I[B; Θ | A, C] = I[A, B; Θ | C] − I[A; Θ | C]. □

Proof: f(∅) = 0

f(∅) = I[∅; Θ] = 0. □

Proof: Monotone

Show that for A ⊆ B: f(A) ≤ f(B).

Proof.
1. For x_1, …, x_n ∈ P and 1 ≤ k ≤ n, k ∈ ℕ:
   f({x_1, …, x_k}) ≤ f({x_1, …, x_n})
   ⟺ I[Y_1, …, Y_k; Θ | x_1, …, x_k] ≤ I[Y_1, …, Y_n; Θ | x_1, …, x_n]
   ⟺ H[Θ] − H[Θ | Y_1, …, Y_k, x_1, …, x_k] ≤ H[Θ] − H[Θ | Y_1, …, Y_n, x_1, …, x_n]
   ⟺ H[Θ | Y_1, …, Y_k, x_1, …, x_k] ≥ H[Θ | Y_1, …, Y_n, x_1, …, x_n].
2. Conditioning reduces entropy.
3. Thus, f is monotone. □

Alternative: We notice that

f({x_1, …, x_k}) ≤ f({x_1, …, x_n})
⟺ 0 ≤ Δ(x_{k+1}, …, x_n | x_1, …, x_k)
⟺ 0 ≤ I[Y_{k+1}, …, Y_n; Θ | x_{k+1}, …, x_n, Y_1, …, Y_k, x_1, …, x_k],

which is true as pairwise mutual information is non-negative. □
Proof: Submodular

1. For X and Y_1 ⊆ Y_2 with X ∩ Y_2 = ∅, we need to show:
   Δ(X | Y_2) ≤ Δ(X | Y_1).
2. Notation: we rename these to A and B_1 ⊆ B_2 to avoid confusing them with the random variables X and Y. We drop the conditioning on x and only write the Y to save space.
3. We need to show:
   I[Y_A; Θ | Y_{B_2}] ≤ I[Y_A; Θ | Y_{B_1}].
4. Let: S := B_2 ∖ B_1.
5. We can rewrite this as:
   I[Y_A; Θ | Y_{B_1}] − I[Y_A; Θ | Y_{B_2}] ≥ 0
   ⟺ I[Y_A; Θ | Y_{B_1}] − I[Y_A; Θ | Y_{B_1}, Y_S] ≥ 0
   ⟺ I[Y_A; Θ; Y_S | Y_{B_1}] ≥ 0.
6. Using the symmetry of the mutual information:
   I[Y_A; Θ; Y_S | Y_{B_1}] = I[Y_A; Y_S; Θ | Y_{B_1}]
   = I[Y_A; Y_S | Y_{B_1}] − I[Y_A; Y_S | Θ, Y_{B_1}]
   = I[Y_A; Y_S | Y_{B_1}] ≥ 0,
   because I[Y_A; Y_S | Θ, Y_{B_1}] = 0 and the pairwise mutual information is non-negative.
7. I[Y_A; Y_S | Θ, Y_{B_1}] = 0 because:
   1. Recall that our probabilistic model is
      p(y_1, …, y_n, θ | x_1, …, x_n) = p(θ) ∏_{i=1}^{n} p(y_i | θ, x_i).
   2. Thus, for all i ≠ j: Y_i ⫫ Y_j | Θ, x_i, x_j.
8. Thus, f is submodular. □

Summary

f is submodular and monotone, that is

f_BatchBALD(x_1, …, x_k) := I[Y_1, …, Y_k; Θ | x_1, …, x_k].

Note: BatchBALD is submodular, allowing efficient computation despite the combinatorial nature of batch selection.

Revisiting the Greedy Algorithm

1. Given x_1, …, x_{i−1}, at each step we find the point x_i that maximizes:
   f_BatchBALD(x_1, …, x_{i−1}, x_i).
2. The first x_1, …, x_{i−1} are already selected and thus fixed.
3. We can just as well maximize the marginal gain:
   Δ(x | x_1, …, x_{i−1}) = I[Y; Θ | x, Y_1, x_1, …, Y_{i−1}, x_{i−1}].
4. At each step, we thus maximize the EIG conditioned on the other selected points in expectation (we don't know the actual labels, so we use the predictive distribution and take expectations w.r.t. the labels!).

BatchBALD in Practice

Key Components (a sketch combining them follows below)
1. Joint mutual information computation
2. Efficient caching strategies
3. MC dropout for uncertainty
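Putting these pieces together, here is a minimal NumPy sketch of greedy BatchBALD selection (my own illustration, not the official implementation). It assumes we already have consistent per-sample predictions `log_probs[s, i, c] = log p(y_i = c | x_i, θ_s)` for k parameter samples, computes the joint entropy exactly by enumerating class configurations, and is therefore only practical for small batch sizes.

```python
import itertools
import numpy as np

def joint_entropy(log_probs_batch):
    """Exact H[Y_1, ..., Y_b | x_1, ..., x_b] from per-sample log-probs.

    log_probs_batch: [n_theta, b, n_classes]. Enumerates all n_classes**b
    label configurations, so this is only feasible for small b.
    """
    n_theta, b, n_classes = log_probs_batch.shape
    entropy = 0.0
    for ys in itertools.product(range(n_classes), repeat=b):
        # log p(y_1..y_b | theta_s) = sum_j log p(y_j | x_j, theta_s)
        log_p_given_theta = sum(log_probs_batch[:, j, y] for j, y in enumerate(ys))
        # p(y_1..y_b) = E_theta[p(y_1..y_b | theta)], estimated by the sample mean
        p = np.exp(log_p_given_theta).mean()
        if p > 0:
            entropy -= p * np.log(p)
    return entropy

def conditional_entropy(log_probs):
    """E_theta H[Y_i | x_i, theta] for each pool point: shape [pool_size]."""
    return -(np.exp(log_probs) * log_probs).sum(axis=-1).mean(axis=0)

def greedy_batchbald(log_probs, batch_size):
    """Greedy BatchBALD over a pool with log_probs: [n_theta, pool_size, n_classes]."""
    cond_ent = conditional_entropy(log_probs)  # precomputed once and reused (caching)
    batch = []
    for _ in range(batch_size):
        best_idx, best_score = None, -np.inf
        for i in range(log_probs.shape[1]):
            if i in batch:
                continue
            candidate = batch + [i]
            # I[Y_batch; Theta] = H[Y_batch] - sum_i E_theta H[Y_i | x_i, theta]
            score = joint_entropy(log_probs[:, candidate, :]) - cond_ent[candidate].sum()
            if score > best_score:
                best_idx, best_score = i, score
        batch.append(best_idx)
    return batch

# Tiny usage example with random (hypothetical) consistent MC predictions.
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 6, 3))  # [n_theta, pool_size, n_classes]
log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
print("selected batch:", greedy_batchbald(log_probs, batch_size=3))
```

A practical implementation additionally caches the per-θ joint probabilities from earlier greedy steps and, for larger batches, samples label configurations instead of enumerating them, as the following slides discuss.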
Results

Performance Gains
- More diverse batch selection
- Better data efficiency
- State-of-the-art results on standard benchmarks

Computing BatchBALD

Given some samples θ_1, …, θ_k, we need to compute:

I[Y_1, …, Y_n; Θ | x_1, …, x_n].

Split

Because predictions are independent given the weights, we have:

I[Y_1, …, Y_n; Θ | x_1, …, x_n]
= H[Y_1, …, Y_n | x_1, …, x_n] − H[Y_1, …, Y_n | Θ, x_1, …, x_n]
= H[Y_1, …, Y_n | x_1, …, x_n] − ∑_{i=1}^{n} H[Y_i | x_i, Θ].

We can precompute H[Y_i | x_i, Θ] for samples in the pool set and reuse these values throughout the greedy selection process.

Computing the Joint

H[Y_1, …, Y_n | x_1, …, x_n] is more challenging:

p(y_1, …, y_n | x_1, …, x_n) = E_{p(θ)}[p(y_1, …, y_n | x_1, …, x_n, θ)]
≈ (1/k) ∑_{i=1}^{k} p(y_1, …, y_n | x_1, …, x_n, θ_i)
= (1/k) ∑_{i=1}^{k} ∏_{j=1}^{n} p(y_j | x_j, θ_i).

And then:

H[Y_1, …, Y_n | x_1, …, x_n] = ∑_{y_1, …, y_n} −p(y_1, …, y_n | x_1, …, x_n) ln p(y_1, …, y_n | x_1, …, x_n).

What is the challenge with this?

MC Sampling of y_1, …, y_n

To reduce the complexity of the outer sum, we can sample y_1, …, y_n ∼ p(y_1, …, y_n | x_1, …, x_n) to compute an estimate. 😢 But this does not work well in practice. Thus, n ≤ 7 is a common choice for BatchBALD (⟹ stochastic sampling instead of deterministic!).

Inconsistent MC Dropout

How do we sample p(y_1, …, y_n | x_1, …, x_n, θ_i)?

1. Dropout masks change on every call to the model ⚡️
2. We cannot compute the joint distributions with different masks (parameter samples) for different points:
3. Inconsistent parameter samples cannot capture correlations between points!

Easy Consistent Samples per Batch?

Cheap & easy? We could sample in larger test batches. We would lose correlations between batches, but we will at least capture correlations within a batch of points.
⏭️ BUT: ⏭️

[Figure] This shows the BatchBALD scores for 1000 randomly selected points from the pool set while selecting the 10th point in a batch, for an MNIST model that has already reached 90% accuracy. The scores for a single set of 100 model parameters (randomly chosen) are shown in blue. The BatchBALD estimates show strong banding, with the score differences between different sets of sampled parameters being larger than the differences between different data points within a specific set of model parameters. Without consistent sampling, the arg max would essentially be randomly sampled and not be informative.

Consistent MC Dropout

Implement custom MC dropout that samples fixed dropout masks once when we enter model.eval() and then reuses them for all points. We always sample multiple masks in batch to speed up sampling:

TensorType[n_samples, n_inputs, n_classes]

Pseudo-Code

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConsistentMCDropout(nn.Module):
    """Consistent MC Dropout layer that reuses masks during evaluation."""
    def __init__(self, p: float = 0.5):
        super().__init__()
        self.p = p
        self.mask = None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            # Standard training behavior
            return F.dropout(x, self.p, self.training)

        if self.mask is None:
            # Generate (and cache) the masks for all MC samples at once
            # Shape: [n_samples, *x.shape[1:]] (x carries a leading n_samples dimension)
            self.mask = (torch.rand(x.shape, device=x.device) > self.p).float() / (1 - self.p)

        return x * self.mask


class ConsistentMCModel(nn.Module):
    """Wrapper for models to enable consistent MC dropout sampling."""
    def __init__(self, base_model: nn.Module):
        super().__init__()
        self.base_model = base_model
        self.n_samples = 1
        # (remainder of the wrapper is truncated in the source)
```
Efficient(?) Joint Entropy

```python
import torch

def compute_batch_joint_entropy_matmul(
    fixed_log_probs: torch.Tensor,  # [n_samples, B, n_classes]
    pool_log_probs: torch.Tensor,   # [n_samples, pool_size, n_classes]
    n_classes: int = 10,
) -> torch.Tensor:
    """Compute joint entropy using matrix multiplication.

    Args:
        fixed_log_probs: Predictions for fixed batch [n_samples, B, n_classes]
        pool_log_probs: Predictions for pool points [n_samples, pool_size, n_classes]
        n_classes: Number of classes

    Returns:
        Joint entropies [pool_size]
    """
    n_samples = fixed_log_probs.shape[0]
    batch_size = fixed_log_probs.shape[1]
    pool_size = pool_log_probs.shape[1]

    assert fixed_log_probs.shape[0] == pool_log_probs.shape[0]
    assert fixed_log_probs.shape[2] == pool_log_probs.shape[2]
    assert fixed_log_probs.device == pool_log_probs.device

    # Convert pool log-probs to probabilities
    pool_probs = pool_log_probs.exp()  # [n_samples, pool_size, n_classes]

    # (remainder of the function is truncated in the source)
```

Acquisition Size Ablation

[Figure 1: Performance on MNIST for increasing acquisition sizes (BALD vs. BatchBALD). BALD's performance drops drastically as the acquisition size increases. BatchBALD maintains strong performance even with increasing acquisition size.]

EMNIST

EMNIST is a dataset of 28x28 grayscale images of handwritten digits and letters.

EMNIST Results

BatchBALD consistently outperforms both random acquisition and BALD, while BALD is unable to beat random acquisition.

Acquired Label "Entropy"

BatchBALD steadily acquires a more diverse set of data points.

Summary

1. Naive batch acquisition can select redundant points
2. BatchBALD uses joint mutual information
3. Submodularity enables efficient subset selection
4. Practical implementation considerations matter
Stochastic Batch Acquisition

The Problem

Traditional batch acquisition methods can be computationally expensive (runtime in seconds):

| K   | Top-K     | BADGE       | BatchBALD        |
|-----|-----------|-------------|------------------|
| 10  | 0.2 ± 0.0 | 9.2 ± 0.3   | 566.0 ± 17.4     |
| 100 | 0.2 ± 0.0 | 82.1 ± 2.5  | 5,363.6 ± 95.4   |
| 500 | 0.2 ± 0.0 | 409.3 ± 3.7 | 29,984.1 ± 598.7 |

- BatchBALD scales poorly with batch size (limited to 5-10 points)
- Top-k selection ignores interactions between points
- Need a more efficient yet simple approach that maintains diversity

Key Insight

Acquisition scores change as new points are added to the training set.

1. For BatchBALD, the difference is:
   I[Y; Θ | x] − I[Y; Θ | x, Y_train, x_train]
   = I[Y; Y_train; Θ | x, x_train]
   = I[Y; Y_train | x, x_train] − I[Y; Y_train | Θ, x, x_train]
   = I[Y; Y_train | x, x_train] ≥ 0.
2. But what if we don't want to compute this?
   - Single-point scores act as noisy proxies for future acquisition value
   - Instead of deterministic top-k selection, use stochastic sampling
   - Sample according to a score-based probability distribution

Noise Addition vs Sampling

We can add noise to the scores and take the top-K. OR: we can sample from the score-based distribution directly. This "duality" is similar to the Gumbel-Softmax trick.

Gumbel-Top-k Theorem

Theorem 3: For scores s_i, i ∈ {1, …, n}, batch size k ≤ n, temperature parameter β > 0, and independent Gumbel noise ϵ_i ∼ Gumbel(0, β^{−1}):

arg top-k {s_i + ϵ_i}_i

is equivalent to sampling k items without replacement from:

Categorical(exp(β s_i) / ∑_j exp(β s_j), i ∈ {1, …, n}).

See also Kool, Hoof, and Welling (2019); Maddison, Tarlow, and Minka (2014); Gumbel (1954).

Key Implications

1. Provides theoretical justification for stochastic batch acquisition
2. Temperature β controls the exploration-exploitation trade-off:
   - β → ∞: deterministic top-k
   - β → 0: uniform random sampling

Tip (Origin): The Gumbel-Softmax trick allows us to backpropagate through sampling operations, making it useful for both inference and training.
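A minimal sketch of Theorem 3 in code (my own, assuming the scores s_i are already on a log scale; the `beta` values and score vector are illustrative): perturbing each score with Gumbel noise and taking the top-k is the same as sampling k indices without replacement from the softmax distribution over β·s.

```python
import numpy as np

def gumbel_top_k(scores, k, beta=1.0, rng=None):
    """Sample a batch of k indices via the Gumbel-top-k trick.

    Equivalent to sampling k items without replacement from
    Categorical(exp(beta * s_i) / sum_j exp(beta * s_j)).
    """
    if rng is None:
        rng = np.random.default_rng()
    scores = np.asarray(scores, dtype=float)
    # Gumbel(0, 1/beta) noise: -log(-log(U)) / beta
    gumbel_noise = -np.log(-np.log(rng.uniform(size=scores.shape))) / beta
    return np.argsort(-(scores + gumbel_noise))[:k]

rng = np.random.default_rng(0)
scores = np.array([2.0, 1.9, 1.8, 0.5, 0.1])  # hypothetical acquisition scores

print("beta=4 (closer to top-k):   ", gumbel_top_k(scores, k=3, beta=4.0, rng=rng))
print("beta=0.1 (closer to uniform):", gumbel_top_k(scores, k=3, beta=0.1, rng=rng))
```

With large β the noise is negligible and this reduces to deterministic top-k; with small β the scores are drowned out and the selection approaches uniform sampling, matching the implications listed above.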
The Gumbel Distribution

The Gumbel distribution models the maximum of many random variables:

F(x; μ, β) = exp(−exp(−(x − μ)/β))

- μ is the location parameter
- β is the scale parameter

Why Add Noise to Batch Selection?

The Problem with Top-K Selection

1. Assumption: Top-K assumes point scores are independent
2. Reality: adding one point changes the scores of all others
3. Key Insight: the most informative points cause the biggest changes

Think About It…

What We Do:
- Score all points
- Pick top K at once

What Really Happens:
- Score points
- Add one point
- Model updates
- Scores change
- Repeat…

Score Correlation Over Time

Note:
- Scores become less correlated as we add points
- Effect strongest for highest-scoring points
- Larger batches = more uncertainty

Early acquisition scores are only a loose proxy for later scores. Specifically, the Spearman rank-correlation between acquisition scores on the first and n'th time-step falls with n. While top-acquisition incorrectly implicitly assumes the rank-correlation remains 1, stochastic acquisitions do not. (BNN trained on MNIST with 20 initial points and 73% initial accuracy; score ranks over the test set.)

Details

[Figure: score rank correlation, early vs. late in training]

Predetermined Acquisitions

Figure 2: Rank correlations for BALD scores on MNIST between the initial scores and future scores of the top- or bottom-scoring 1%, 10% and 100% of test points (smoothed with a size-10 Parzen window). The rank order decorrelates faster for the most informative samples and in the early stages of training: the top-1% scorers' ranks decorrelate after roughly 40 (100) acquisitions, unlike the bottom-1%. Later in training, the acquisition scores stay more strongly correlated. This suggests the acquisition size could be increased later in training.

Top-acquisition hurts less later in training (BALD on MNIST). At t ∈ {20, 100}, we keep acquiring samples using the BALD scores from those two steps. At t = 20 (orange), the model performs well for ≈ 20 acquisitions; at t = 120 (green), for ≈ 50.
Solution: Model the Uncertainty

1. Add Noise: acknowledge score uncertainty
2. Sample Stochastically: instead of deterministic top-K
3. Benefits:
   - Natural diversity
   - Matches reality better
   - Simple & efficient

How It Works

```python
import numpy as np
from scipy.special import softmax

def stochastic_batch_acquisition(pool_set, batch_size, model, temperature=1.0):
    # Score all points
    scores = []
    for x in pool_set:
        score = acquisition_function(x, model)
        scores.append(score)

    # Convert to probabilities (a power distribution over the scores)
    probs = softmax(np.log(scores) / temperature)

    # Sample without replacement
    indices = np.random.choice(len(pool_set), size=batch_size, replace=False, p=probs)

    return indices
```

Benefits

- Simple to implement (O(|P|(1 + log K)) complexity for Gumbel noise with top-K)
- Naturally introduces diversity through stochastic sampling
- Performs as well as or better than sophisticated methods
- No hyperparameters to tune beyond temperature
- Compatible with any acquisition scoring function

Results

Matches or outperforms BatchBALD and BADGE:

Repeated MNIST

[Figure 3: (a) Performance, (b) Temperature Ablation.]
(a) Performance on Repeated-MNIST with 4 repetitions (5 trials). Up and to the right is better (↗). PowerBALD outperforms (top-k) BALD and BADGE and is on par with BatchBALD, despite being orders of magnitude faster. Acquisition sizes: BatchBALD 5, BADGE 20, others 10. (b) Repeated-MNISTx4 (5 trials): PowerBALD outperforms BatchBALD for β = 8.

EMNIST & MIO-TCD

[Figure: EMNIST-Balanced and MIO-TCD results]

Orders of magnitude faster computation (runtime in seconds):

| K   | Top-K     | Stochastic | BADGE       | BatchBALD        |
|-----|-----------|------------|-------------|------------------|
| 10  | 0.2 ± 0.0 | 0.2 ± 0.0  | 9.2 ± 0.3   | 566.0 ± 17.4     |
| 100 | 0.2 ± 0.0 | 0.2 ± 0.0  | 82.1 ± 2.5  | 5,363.6 ± 95.4   |
| 500 | 0.2 ± 0.0 | 0.2 ± 0.0  | 409.3 ± 3.7 | 29,984.1 ± 598.7 |

- Maintains performance with larger batch sizes
- Avoids redundant selections common in top-k

Research Questions?
The strong performance of simple stochastic approaches raises serious questions about current complex batch acquisition methods:

1. Do they truly model point interactions effectively?
2. Are the underlying acquisition scores themselves reliable?
3. Is the additional computational complexity justified given similar performance?

⟹ Future work must develop more efficient methods that better capture batch dynamics while being tractable.

Appendix