Lecture 2: Information Theory
Andreas Kirsch
Summary
This lecture provides an introduction to information theory, emphasizing its applications in machine learning and data compression. Key concepts like entropy, coding techniques, and the relationship between probability and information are examined.
Full Transcript
(Bayesian) Active Learning, Information Theory, and Uncertainty
Information Theory Introduction
Andreas Kirsch
https://blackhc.github.io/balitu/lectures/lecture_2_info_theory.html#/title-slide

Why Information Theory?
Real-World Impact: Information theory powers modern compression algorithms like ZIP files and image compression (JPEG), and is fundamental to machine learning. Active learning is a prime example of how information theory can be applied to improve learning systems.

What You'll Learn Today
After this lecture, you'll be able to:
- Design optimal codes for data compression
- Calculate the theoretical limits of compression
- Understand why certain codes are better than others
- Apply information theory concepts to real-world problems
Based on Cover and Thomas (2006), Yeung (1991), and Shannon (1948), following my thesis (Kirsch 2024).

Lecture Roadmap
1. Motivation: real-world compression examples; why we need information theory
2. Core Concepts: information content; entropy basics; simple coding examples
3. Coding Techniques: fixed vs variable length; Huffman coding; practical examples
4. Theory & Proofs: Kraft's inequality; source coding theorems
5. Information Quantities: joint entropy; mutual information; conditional entropy; chain rule
6. Information Diagrams: practice; theory

Introduction
Information theory helps us:
- Quantify the information content of random variables and events
- Provide valuable insights and intuitions
- Offer a principled framework for reasoning about uncertainty

Information theory is widely used in machine learning for/with:
- Learning via compression
- Information bottlenecks in supervised/unsupervised learning
- Bayesian experimental design
- Active learning
- Submodularity

Common Misconceptions
Warning: Information theory is NOT:
- Just about compression
- Limited to computer science
- Only about binary codes
- The same as classical probability theory

Common Challenges
Common challenges when learning information theory:
1. Ambiguity in complex expressions
2. Deviation between fields (statistics, CS, information engineering)
3. Confusion between cross-entropy and joint entropy
4. Unclear expectations over random variables

Example of Ambiguity: H(X, Y) could mean:
- the joint entropy of X and Y, or
- the cross-entropy between X and Y.
Probability & Information Theory
- Information theory builds on probability theory to quantify uncertainty and information.
- Probability distributions determine how much information we gain from observing outcomes.
- Less probable events carry more information (seeing a rare event tells us more than a common one).
- The entropy of a distribution measures its inherent uncertainty.
- Mutual information quantifies how much uncertainty in one variable is reduced by knowing another.
Key Connection: Information theory provides a mathematical framework for measuring the "surprise value" of probabilistic events. The more unlikely an event, the more informative it is when it occurs.

Review: Probability
Information theory builds on probability theory to quantify uncertainty and information.
Important: Understanding these basics is crucial for information theory!

Probability Distributions
Let's review some basic probability distributions and concepts.
1. For a discrete random variable X, p(X = x) is the probability of outcome x.
2. For continuous variables, p(X = x) is the probability density function, and e.g. p(X ∈ [a, b]) is the probability mass of the interval [a, b].
Notation: We write p(x) as shorthand for p(X = x).

Key Concepts
For any random variables A and B:

Conditional Probability
$p(b \mid a) = \frac{p(a, b)}{p(a)}$

Independence
$p(a, b) = p(a)\, p(b) \quad \forall a, b$
We write A ⊥⊥ B if p(a, b) = p(a) p(b).

Law of Total Probability
$p(b) = \sum_a p(b \mid a)\, p(a)$

Bayes' Theorem
$p(A \mid B) = \frac{p(B \mid A)\, p(A)}{p(B)}$

Example: Email Spam Filter
For a spam email (S) containing the word "money" (M):
p(S) = 0.3 (30% of all emails are spam)
p(M | S) = 0.6 (60% of spam emails contain "money")
p(M | ¬S) = 0.05 (5% of non-spam emails contain "money")
p(S | M) = ? (what fraction of emails with "money" are spam?)

We can find p(M) using the law of total probability:
$p(M) = p(M \mid S)\, p(S) + p(M \mid \neg S)\, p(\neg S) = 0.6 \cdot 0.3 + 0.05 \cdot 0.7 = 0.215$
Now we can apply Bayes' theorem:
$p(S \mid M) = \frac{p(M \mid S)\, p(S)}{p(M)} = \frac{0.6 \cdot 0.3}{0.215} \approx 0.84$
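As a quick sanity check of the numbers above, here is a minimal Python sketch (the variable names are mine, purely illustrative):

# Numeric check of the spam-filter example.
p_S = 0.3               # prior: p(S)
p_M_given_S = 0.6       # likelihood: p(M | S)
p_M_given_not_S = 0.05  # p(M | not S)

# Law of total probability: p(M)
p_M = p_M_given_S * p_S + p_M_given_not_S * (1 - p_S)
# Bayes' theorem: p(S | M)
p_S_given_M = p_M_given_S * p_S / p_M

print(f"p(M) = {p_M:.3f}")             # 0.215
print(f"p(S | M) = {p_S_given_M:.2f}") # ~0.84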
Bayesian Inference
Bayes' theorem also shows how to update probabilities when new evidence arrives. It connects:
$p(A \mid B) = \frac{p(B \mid A)\, p(A)}{p(B)}$
- Prior belief: p(A)
- Likelihood of evidence: p(B | A)
- Posterior belief: p(A | B)

Example: Medical Testing
For a disease (D) and a positive test (T):
- Prior p(D): disease prevalence
- Likelihood p(T | D): test sensitivity
- Posterior p(D | T): diagnosis confidence

Expected Value & Variance
$\mathbb{E}[X] = \sum_x x\, p(x) \quad \text{or} \quad \int x\, p(x)\, dx$
$\operatorname{Var}[X] = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$

Notation for Expectations
1. $\mathbb{E}[X]$ denotes the expectation of a random variable X over the "corresponding" distribution p(x) (implicitly).
2. $\mathbb{E}_{p(x)}[x]$ denotes the expectation over the distribution p(x) explicitly. We use a lower-case x to indicate that x is bound to x ∼ p(x). It is essentially a shorthand for ∫ or ∑.
3. Same for $\operatorname{Var}[X]$ (and $\operatorname{Var}_{p(x)}[x]$, but I don't think we'll see that).
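To make the explicit-expectation notation concrete, here is a small NumPy sketch (the distribution is a made-up example) computing $\mathbb{E}_{p(x)}[x]$ and $\operatorname{Var}[x]$ for a discrete distribution:

import numpy as np

# A made-up discrete distribution p(x) over outcomes x.
x = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.2, 0.3, 0.4])
assert np.isclose(p.sum(), 1.0)

mean = np.sum(x * p)                 # E_p(x)[x] = sum_x x p(x)
var = np.sum(x**2 * p) - mean**2     # Var[x] = E[x^2] - E[x]^2
print(mean, var)                     # 2.0 and 1.0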
Discrete Probability Distributions

Bernoulli: models a single trial with success/failure probability p.
- Probability mass function: p(X = 1) = p, p(X = 0) = 1 − p
- Mean: $\mathbb{E}[X] = p$
- Variance: $\operatorname{Var}[X] = p(1 - p)$

from scipy.stats import bernoulli

p, size = 0.3, 1000  # example parameters
# Probability mass function
bernoulli_probs = bernoulli.pmf([0, 1], p)
# Drawing samples
bernoulli_samples = bernoulli.rvs(p, size=size)

Binomial: counts successes in n independent Bernoulli trials with success probability p.
- Probability mass function: $p(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$
- Mean: $\mathbb{E}[X] = np$
- Variance: $\operatorname{Var}[X] = np(1 - p)$

import numpy as np
from scipy.stats import binom

n, p, size = 10, 0.3, 1000  # example parameters
# Probability mass function
binomial_probs = binom.pmf(np.arange(0, n + 1), n, p)
# Drawing samples
binomial_samples = binom.rvs(n, p, size=size)

Continuous Probability Distributions

Uniform: models uniform probability over the interval [a, b].
- Probability density function: $p(X = x) = \frac{1}{b - a}$ for $x \in [a, b]$
- Mean: $\mathbb{E}[X] = \frac{a + b}{2}$
- Variance: $\operatorname{Var}[X] = \frac{(b - a)^2}{12}$

import numpy as np
from scipy.stats import uniform

a, b = 0.0, 2.0  # example interval
# PDF evaluation
x = np.linspace(a - 1, b + 1, 100)
uniform_pdf = uniform.pdf(x, loc=a, scale=b - a)
# Generate samples
uniform_samples = uniform.rvs(loc=a, scale=b - a, size=1000)

Gaussian: bell-shaped curve characterized by mean μ and variance σ².
- Probability density function: $p(X = x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
- Mean: $\mathbb{E}[X] = \mu$
- Variance: $\operatorname{Var}[X] = \sigma^2$

import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 1.0  # example parameters
# Generate samples
gaussian_samples = norm.rvs(loc=mu, scale=sigma, size=1000)
# PDF evaluation
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 100)
gaussian_pdf = norm.pdf(x, loc=mu, scale=sigma)

Interactive Demo
Link to Interactive Normal Distribution

Information Content
How do we measure the information content of an event? Shannon's information content for an event x:
$H(p(x)) = -\ln(p(x)).$
(Or simpler: $h(\rho) = -\ln(\rho)$.)
Wait, but why?
Desirable Properties of h(p(x))
The information content of an event is always relative to a set of events (e.g. outcomes of a discrete R.V.). We want the information content to be:
1. a continuous function of the probability of the event,
2. monotonically decreasing (the less probable an event, the more information it carries): $p(x_1) < p(x_2) \implies h(p(x_1)) > h(p(x_2))$,
3. additive for independent events: $x \perp\!\!\!\perp y \implies h(p(x, y)) = h(p(x)) + h(p(y))$.

Let's Solve This (assuming differentiability)
If x ⊥⊥ y, then p(x, y) = p(x) p(y), so we need to have:
$h(p(x)\, p(y)) = h(p(x)) + h(p(y)).$
W.l.o.g., let a := p(x) and b := p(y). Then we need:
$h(ab) = h(a) + h(b)$
$\implies \frac{d}{db} h(ab) = \frac{d}{db} h(a) + \frac{d}{db} h(b)$
$\implies a\, h'(ab) = h'(b) \implies h'(ab) = \frac{h'(b)}{a}.$

Let's fix b := 1 and define h'(1) := c. Then we have:
$h'(a) = \frac{c}{a}$
$\implies h(a) - h(1) = \int_1^a \frac{c}{x}\, dx = c \ln a.$
So we have: $h(a) = c \ln a + h(1)$.

What is h(1)?
$h(ab) = h(a) + h(b) = c \ln a + c \ln b + 2 h(1),$
but also:
$h(ab) = c \ln(ab) + h(1) = c \ln a + c \ln b + h(1).$
$\implies h(1) = 0.$
Thus, we have: $h(a) = c \ln a$.

How do we choose c?
To be monotonically decreasing, we need c < 0 because ln(a) is monotonically increasing.

Final Functional Form
We can choose c := −1 to get the simplest expression:
Important: $h(\rho) = -\ln \rho.$

Nats vs Bits
The information content is dimensionless, but we will use a pseudo-unit for it called nats. If we picked $h(\rho) = -\log_2 \rho$, we would use bits as the unit. The difference is a factor of ln 2:
$1\ \text{bit} = \ln 2\ \text{nat} \approx 0.6931\ \text{nat}.$

Why Bits?
Given a uniform distribution, what's the information content of a specific 4-bit string?
4-bit strings: $2^4 = 16$ possible outcomes.
$\implies p(x) = \frac{1}{2^4} = \frac{1}{16}$
$\implies h(p(x)) = -\ln\left(\frac{1}{2^4}\right) = 4 \ln 2\ \text{nats} = 4\ \text{bits}.$

Important: A 4-bit string contains exactly 4 bits of information!
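A tiny numeric check of this calculation (illustrative only):

import math

p = 1 / 2**4                    # probability of one specific 4-bit string
h_nats = -math.log(p)           # information content in nats
h_bits = h_nats / math.log(2)   # convert nats to bits
print(h_nats, h_bits)           # ~2.773 nats, 4.0 bits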
Why Bits?
Note: A fair coin-flip contains ? bit of information.

Why Bits?
Note: A fair coin-flip contains 1 bit of information.

Shannon's Information Content
For an event with probability ρ, the information content is:
$h(\rho) = -\ln \rho.$
Key properties:
1. Decreases monotonically with probability
2. Approaches infinity as ρ → 0
3. Equals zero when ρ = 1
4. Additive for independent events: $h(\rho_1 \rho_2) = h(\rho_1) + h(\rho_2)$

(Towards) Compression
We have seen that for a uniform distribution, the information content can be related to the number of bits needed to encode a specific outcome.
Is this the best we can do? And what if the distribution is not uniform?

Information Content as a Lower Bound
Important: On average, the information content of a specific outcome is a lower bound on the number of bits needed to encode it.
Note: We could cheat and try to encode one outcome with fewer bits, but then we'd pay for it with other outcomes, and on average we'd be worse off.

Codes
A code maps source symbols to sequences of symbols from a code alphabet. The goal is typically to:
1. Represent information efficiently (minimize average sequence length)
2. Enable reliable transmission (protect against errors)

Key Terminology
- Symbol: a basic unit of information (e.g., a letter, digit, or voltage level)
- Alphabet: a finite set of symbols. Example 1: the binary alphabet {0, 1}. Example 2: the English alphabet {A, B, …, Z}
- String: a finite sequence of symbols from an alphabet. Example 1: "101001" (binary string). Example 2: "HELLO" (text string)
- Code: a system that converts symbols from one alphabet into strings of symbols from another alphabet
(Diagram: a code mapping source symbols to binary strings, e.g. A → 00, B → 01, C → 10, …)
Example 1: ASCII Code
- Source symbols: text characters
- Code alphabet: binary digits (0 and 1)
- Mapping: each character is converted to an 8-digit binary string
- Example: 'A' is converted to '01000001'

Example 2: Morse Code
- Source symbols: letters and numbers
- Code alphabet: {dot (.), dash (-), space}
- Mapping: each character is converted to a sequence of dots and dashes
- Examples: 'E' → '.' (most common English letter: shortest code); 'A' → '.-'; 'B' → '-...'; 'SOS' → '... --- ...' (famous distress signal)
- (What is ' ' → ?)

Formal Definitions
Let's formalize these concepts:
- Alphabet: a finite set of symbols, denoted $\mathcal{A}$
- Strings: finite sequences of symbols from $\mathcal{A}$. The set of all finite strings over $\mathcal{A}$ is denoted by $\mathcal{A}^*$, the length of a string x by |x|, and the empty string by ϵ. Example: if $\mathcal{A} = \{0, 1\}$, then $\mathcal{A}^* = \{\epsilon, 0, 1, 00, 01, 10, 11, 000, \ldots\}$
- Code: a mapping $C : \mathcal{X} \to \mathcal{D}^*$ from a source alphabet $\mathcal{X}$ to strings over a code alphabet $\mathcal{D}$, with code length $\ell_C(x) := |C(x)|$. A code is non-singular if the mapping is one-to-one (different inputs ↔ different outputs).
- Code Extension: the extension $C^*$ maps sequences of source symbols to sequences of code strings. For an input sequence $x_1 x_2 \ldots x_n$: $C^*(x_1 x_2 \ldots x_n) = C(x_1) C(x_2) \ldots C(x_n)$.
- A code is uniquely decodable if its extension is non-singular.
(Diagrams: the code C maps source symbols x₁, x₂, x₃ to code strings C(x₁), C(x₂), C(x₃); the code extension C* maps source sequences x₁x₂…xₙ to code sequences C(x₁)C(x₂)…C(xₙ).)

Example 1: ASCII Code (Formal)
- Source alphabet: $\mathcal{X} = \{\text{ASCII characters}\}$
- Code alphabet: $\mathcal{D} = \{0, 1\}$
- Code: $C : \mathcal{X} \to \mathcal{D}^8$ (fixed-length mapping to 8-bit strings)
- Example: C('A') = 01000001
- Extension: C*('CAT') = C('C') C('A') C('T')

Example 2: Morse Code (Formal)
- Source alphabet: $\mathcal{X} = \{A, B, \ldots, Z, 0, \ldots, 9\}$
- Code alphabet: $\mathcal{D} = \{., \text{-}, \text{space}\}$
- Code: $C : \mathcal{X} \to \mathcal{D}^*$ (variable-length mapping)
- Examples: C('E') = '.', C('A') = '.-'
- Extension: C*('SOS') = C('S') C('O') C('S') = ... --- ...
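A minimal Python sketch of these definitions (my own toy helpers, not part of the lecture code): a code as a dictionary and its extension as concatenation.

# A code C: X -> D* as a dict, and its extension C* as concatenation.
def extend(code, message):
    """Apply the code extension C* to a sequence of source symbols."""
    return "".join(code[symbol] for symbol in message)

# Fixed-length 8-bit "ASCII" code for a few characters.
ascii_code = {c: format(ord(c), "08b") for c in "ACT"}
print(ascii_code["A"])            # 01000001
print(extend(ascii_code, "CAT"))  # C('C')C('A')C('T')

# Variable-length Morse-style code (subset).
morse_code = {"S": "...", "O": "---"}
print(extend(morse_code, "SOS"))  # ...---...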
Types of Codes
Codes can be broadly classified based on their properties:
- Fixed-length codes: every codeword has the same length. Simple, but not always efficient. Example: ASCII characters. We have 256 ASCII characters and use 8 bits to encode each one.
- Variable-length codes: codewords can have different lengths. Can be more efficient but require careful design. Example: Huffman coding, which is optimal for a given distribution and used in ZIP files, for example.

Prefix Codes
A prefix code is a type of variable-length code where no codeword is a prefix of any other codeword. This property allows prefix codes to be instantaneously decodable.
Example:
A → 0
B → 10
C → 110
D → 111
Decoding 110010111 is unambiguous: 110|0|10|111 → CABD

Prefix Codes as Trees
A prefix code can be represented as a tree, where each codeword corresponds to a path from the root to a leaf (terminal node). The code alphabet determines the degree of the tree (e.g., binary code → binary tree).
(Diagram: binary code tree with leaves A = 0, B = 10, C = 110, D = 111.)

Uniquely Decodable Codes
A uniquely decodable code is a variable-length code where any encoded string has only one valid decoding.
- Prefix codes are a subset of uniquely decodable codes.
- Not all uniquely decodable codes are prefix codes.
Example (uniquely decodable, not prefix):
A → 0
B → 01
C → 011
D → 111
Decoding 0111111: after reading the leading 0, it could so far be the start of A, B, or C; the correct parse only becomes clear after reading ahead. Uniquely decodable codes are not necessarily instantaneously decodable.

Uniquely Decodable Codes as Trees
Uniquely decodable codes can also be represented as trees, but symbols might be assigned to internal nodes as well as leaves. This can lead to ambiguity in instantaneous decoding, as an internal node might be part of multiple codewords.
(Diagram: code tree for the example above; A and B sit at internal nodes because they are prefixes of other codewords.)
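Instantaneous decoding of a prefix code is easy to sketch in a few lines of Python (a toy implementation of mine, using the A/B/C/D prefix code from above):

def decode_prefix(code, encoded):
    """Instantaneously decode a prefix code: emit a symbol as soon as a codeword matches."""
    inverse = {codeword: symbol for symbol, codeword in code.items()}
    decoded, buffer = [], ""
    for bit in encoded:
        buffer += bit
        if buffer in inverse:                # no codeword is a prefix of another,
            decoded.append(inverse[buffer])  # so we can emit immediately
            buffer = ""
    assert buffer == "", "leftover bits: input was not a valid codeword sequence"
    return "".join(decoded)

prefix_code = {"A": "0", "B": "10", "C": "110", "D": "111"}
print(decode_prefix(prefix_code, "110010111"))  # CABD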
Constructing Codes
Let's take a look at two different coding algorithms that are almost optimal:
1. Shannon-Fano coding
2. Huffman coding
Later we will see that the Source Coding Theorem (Shannon 1948) establishes fundamental limits on data compression.

Overview
A concrete example inspired by Cover and Thomas (2006): consider transmitting English text where letters have the following probabilities.

Letter | Probability | Information h(⋅) (bits) | ASCII (bits) | Near-optimal (bits)
e | 0.127 | 2.98 | 8 | 3
t | 0.091 | 3.46 | 8 | 3
a | 0.082 | 3.61 | 8 | 4
space | 0.192 | 2.38 | 8 | 2
… | … | … | … | …

Key observations
1. ASCII uses a fixed 8 bits regardless of probability.
2. Optimal code lengths ≈ −log2(probability).
3. Near-optimal coding approaches these optimal lengths.
4. Average bits per symbol for English text: ASCII 8.00 bits; Huffman ≈ 4.21 bits; theoretical minimum (entropy) 4.17 bits.

Shannon-Fano Coding
Shannon-Fano coding (1948) was one of the first systematic methods for constructing prefix codes based on probabilities.
Algorithm:
1. Sort symbols by probability (descending).
2. Split the list into two parts with roughly equal total probability.
3. Assign 0 to the first group, 1 to the second.
4. Recurse on each group until single symbols remain.

Example
Consider the source {A, B, C, D, E} with probabilities and Shannon-Fano codewords:
A: 0.35 → 00
B: 0.30 → 01
C: 0.15 → 10
D: 0.10 → 110
E: 0.10 → 111
(Tree: ABCDE (1.0) splits into AB (0.65) and CDE (0.35); AB splits into A (0.35) and B (0.30); CDE splits into C (0.15) and DE (0.20); DE splits into D (0.10) and E (0.10).)

Properties
- Always produces a prefix code
- Near-optimal but not always optimal
- Average length: $\mathbb{E}_{p(x)}[\ell(x)] = 2.20$ bits
- Entropy: H(p(X)) = 2.13 bits
- Inefficiency: 0.07 bits/symbol above entropy

Huffman Coding
Huffman coding (Huffman 1952) improves on Shannon-Fano by building the code tree bottom-up.
Algorithm:
1. Create leaf nodes for each symbol with their probabilities.
2. Repeatedly combine the two lowest-probability nodes.
3. Assign 0 and 1 to each branch.
4. Continue until a single root remains.

Example
Same source as before, with Huffman codewords:
A: 0.35 → 0
B: 0.30 → 10
C: 0.15 → 110
D: 0.10 → 1110
E: 0.10 → 1111
(Tree: D (0.10) and E (0.10) merge into 0.20; with C (0.15) into 0.35; with B (0.30) into 0.65; with A (0.35) into the root 1.0.)

Properties
- Produces optimal prefix codes
- Average length: $\mathbb{E}_{p(x)}[\ell(x)] = 2.20$ bits
- Entropy: H(p(X)) = 2.13 bits
- Inefficiency: 0.07 bits/symbol above entropy
- Maximum inefficiency < 1 bit/symbol

Comparing Shannon-Fano and Huffman
Without proof: Huffman's bottom-up approach ensures optimality by always merging the lowest-probability nodes, while Shannon-Fano's top-down splits can sometimes be suboptimal.
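As an illustration, here is a compact Huffman construction in Python (a from-scratch sketch, not the lecture's code), applied to the A-E source above. Ties may be broken differently than in the slide's tree, but any Huffman code achieves the same optimal average length.

import heapq

def huffman_code(probs):
    """Build a Huffman code by repeatedly merging the two least probable nodes."""
    # Heap entries: (probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)  # lowest probability
        p1, _, code1 = heapq.heappop(heap)  # second lowest
        merged = {sym: "0" + cw for sym, cw in code0.items()}
        merged.update({sym: "1" + cw for sym, cw in code1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"A": 0.35, "B": 0.30, "C": 0.15, "D": 0.10, "E": 0.10}
code = huffman_code(probs)
avg_length = sum(probs[s] * len(code[s]) for s in probs)
print(code)        # one optimal prefix code (codewords may differ from the slide's)
print(avg_length)  # 2.20 bits/symbol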
Additional Resources
1. https://190n.github.io/huffman-visualization/
2. https://youtu.be/17gUjSTMekI?si=DHZPKpNWEdO8MkYN
3. https://en.wikipedia.org/wiki/Huffman_coding

Kraft's Inequality
Kraft's inequality gives a necessary and sufficient condition for the existence of a prefix code (or a uniquely decodable code) for a given set of codeword lengths.

Kraft's Inequality for Prefix Codes
Theorem: For a prefix code over an alphabet of size D with codeword lengths $\ell_1, \ell_2, \ldots, \ell_m$, the following inequality holds:
$\sum_{i=1}^{m} D^{-\ell_i} \le 1.$
Conversely, if the inequality holds, a prefix code with these lengths exists.

Proof for ⟹
If we have a prefix code, the inequality must hold:
1. Consider a full D-ary tree of depth $L = \max_i \ell_i$.
2. Each codeword corresponds to a node in this tree.
3. A codeword of length $\ell_i$ has $D^{L - \ell_i}$ descendants at depth L.
4. Since the code is prefix-free, these descendants are disjoint for different codewords.
5. The total number of descendants of all codewords is $\sum_{i=1}^{m} D^{L - \ell_i}$.
6. Since there are $D^L$ possible nodes at depth L, we must have:
$\sum_{i=1}^{m} D^{L - \ell_i} \le D^L \implies \sum_{i=1}^{m} D^{-\ell_i} \le 1.$ □

Proof for ⟸
Conversely, if the inequality holds, we can construct a prefix code:
1. Sort the lengths in increasing order.
2. Assign a node at level $\ell_1$.
3. Remove its descendants.
4. Since $D^{-\ell_1} < 1$ (the first codeword blocks only a $D^{-\ell_1}$ fraction of the tree), we can assign a node at level $\ell_2$ among those that remain.
5. Repeat, removing descendants after each assignment.
6. This process can always complete because the inequality ensures there will always be a node available at each required depth. □

Kraft's Inequality for Uniquely Decodable Codes
Kraft's inequality also holds for uniquely decodable codes (although satisfying the inequality does not by itself guarantee that a given code is uniquely decodable).
Theorem: For a uniquely decodable code over an alphabet of size D with codeword lengths $\ell_1, \ell_2, \ldots, \ell_m$, the following inequality holds:
$\sum_{i=1}^{m} D^{-\ell_i} \le 1.$
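The converse direction is easy to play with in code. Here is a small sketch of mine (illustrative only) that checks Kraft's inequality for binary codeword lengths and, when it holds, constructs a prefix code greedily in the spirit of the proof:

def kraft_sum(lengths, D=2):
    """Left-hand side of Kraft's inequality."""
    return sum(D ** -l for l in lengths)

def prefix_code_from_lengths(lengths):
    """Greedily build a binary prefix code with the given codeword lengths
    (possible whenever Kraft's inequality holds)."""
    assert kraft_sum(lengths) <= 1, "Kraft's inequality violated: no prefix code exists"
    sorted_lengths = sorted(lengths)
    codewords, next_code = [], 0
    for i, length in enumerate(sorted_lengths):
        codewords.append(format(next_code, f"0{length}b"))
        if i + 1 < len(sorted_lengths):
            # Move to the next free node, descending to the next required depth.
            next_code = (next_code + 1) << (sorted_lengths[i + 1] - length)
    return codewords

print(kraft_sum([1, 2, 3, 3]))                 # 1.0 -> a prefix code exists
print(prefix_code_from_lengths([1, 2, 3, 3]))  # ['0', '10', '110', '111']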
Proof
1. Let $S = \sum_{i=1}^{m} D^{-\ell_i}$.
2. If a code is uniquely decodable, then its extension is non-singular.
3. Consider the nth power of S:
$S^n = \left(\sum_{i=1}^{m} D^{-\ell_i}\right)^n = \sum_{i_1=1}^{m} \cdots \sum_{i_n=1}^{m} D^{-(\ell_{i_1} + \cdots + \ell_{i_n})}.$
4. Each term in this sum corresponds to a sequence of n codewords with total length $\ell_{i_1} + \cdots + \ell_{i_n}$.
5. Let $T_j$ be the number of such combinations with total length j. Then, we can write (with $L = \max_i \ell_i$):
$S^n = \sum_{j=1}^{nL} T_j D^{-j}.$
6. The code extension $C^n$ is itself a code from $\mathcal{X}^n$ to $\mathcal{D}^*$, and $T_j$ is the number of source strings that map to a code string of length j.
7. Since $C^n$ is non-singular, $T_j \le D^j$. Hence:
$S^n = \sum_{j=1}^{nL} T_j D^{-j} \le \sum_{j=1}^{nL} D^j D^{-j} = nL.$
8. Then, we have: $S \le (nL)^{1/n}$.
9. Since $\lim_{n \to \infty} (nL)^{1/n} = 1$, we obtain $S \le 1$. □

Review: Convexity & Jensen's Inequality

Convexity
Convexity & Concavity: A function f is convex if for all x, y and λ ∈ [0, 1]:
$\lambda f(x) + (1 - \lambda) f(y) \ge f(\lambda x + (1 - \lambda) y).$
Or, when f is twice differentiable: $f''(x) \ge 0\ \forall x$.
f is strictly convex if we have strict inequality: > instead of ≥.
f is (strictly) concave if −f is (strictly) convex.

Jensen's Inequality
If f is convex, then for any random variable X:
$f(\mathbb{E}[X]) \le \mathbb{E}[f(X)],$
with equality if and only if one of the following holds:
1. X is almost surely a constant.
2. f is affine on the convex hull of the support of X.
If f is strictly convex, then the inequality is strict unless X is (almost surely) a constant.
If f is concave, then the inequality is reversed.

Proof of Jensen's Inequality
We start by showing:
Important: For f convex and $\lambda_1, \lambda_2, \ldots, \lambda_n \ge 0$ with $\sum_i \lambda_i = 1$:
$f\left(\sum_i \lambda_i x_i\right) \le \sum_i \lambda_i f(x_i).$
1. We start with the definition of convexity: $f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$.
2. Notice that this is the same as having $\lambda_1, \lambda_2 \ge 0$ with $\lambda_1 + \lambda_2 = 1$, such that: $f(\lambda_1 x_1 + \lambda_2 x_2) \le \lambda_1 f(x_1) + \lambda_2 f(x_2)$.
3. We extend this to $\lambda_1, \ldots, \lambda_n \ge 0$ with $\sum_i \lambda_i = 1$ by iterating the convexity inequality:
$f\left(\sum_i \lambda_i x_i\right) = f\left((1 - \lambda_n) \sum_{i=1}^{n-1} \tfrac{\lambda_i}{1 - \lambda_n} x_i + \lambda_n x_n\right)$
$\le (1 - \lambda_n)\, f\left(\sum_{i=1}^{n-1} \tfrac{\lambda_i}{1 - \lambda_n} x_i\right) + \lambda_n f(x_n)$
$= (1 - \lambda_n)\, f\left(\sum_{i=1}^{n-1} \lambda_i' x_i\right) + \lambda_n f(x_n)$
$\le (1 - \lambda_n)(\cdots) + \lambda_n f(x_n)$
$\le \sum_{i=1}^{n} \lambda_i f(x_i),$
with $\lambda_i' := \frac{\lambda_i}{1 - \lambda_n} \ge 0$ for $i = 1, \ldots, n-1$ and $\sum_{i=1}^{n-1} \lambda_i' = 1$.
4. Note that we have to drop any $\lambda_i = 0$ to avoid division by zero, but we can obviously do that first w.l.o.g.
5. To make this formal, we can use induction, starting from n = 2 (already shown):
  1. n = 2: $f(\lambda_1 x_1 + \lambda_2 x_2) \le \lambda_1 f(x_1) + \lambda_2 f(x_2)$.
  2. n → n + 1: $f\left(\sum_{i=1}^{n+1} \lambda_i x_i\right) \le \sum_{i=1}^{n+1} \lambda_i f(x_i)$ … [Left as exercise.] □
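A quick numerical illustration of Jensen's inequality (my own sketch) for the convex function f(x) = −ln x, which is exactly the information content we use next:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 1.0, size=100_000)  # some positive random variable

def f(t):
    return -np.log(t)   # convex function (the information content)

lhs = f(x.mean())       # f(E[X])
rhs = f(x).mean()       # E[f(X)]
print(lhs <= rhs, lhs, rhs)  # True: f(E[X]) <= E[f(X)]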
Proof Sketch of Jensen's Inequality
Now, we show the actual theorem:
1. Once we have the finite form, we note that for $x_i \sim p(x)$:
$\frac{1}{n} \sum_i f(x_i) \xrightarrow{\,n \to \infty\,} \mathbb{E}_{p(x)}[f(x)],$
by the law of large numbers.
2. Same for $\frac{1}{n} \sum_i x_i \xrightarrow{\,n \to \infty\,} \mathbb{E}_{p(x)}[x]$.
3. Then, by continuity of f (is it?), we have for $\lambda_i = \frac{1}{n}$, $x_i \sim p(x)$:
$f\left(\sum_{i=1}^{n} \lambda_i x_i\right) \le \sum_{i=1}^{n} \lambda_i f(x_i) \quad \xrightarrow{\,n \to \infty\,} \quad f(\mathbb{E}_{p(x)}[x]) \le \mathbb{E}_{p(x)}[f(x)].$ □

f convex ⟹ f continuous
… See https://math.stackexchange.com/questions/258511/prove-that-every-convex-function-is-continuous

More on Convexity
More details can be found in the beautiful book by Boyd and Vandenberghe: https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf

Convexity of h(⋅)
$\forall \rho > 0: \frac{d^2}{d\rho^2} h(\rho) = \frac{1}{\rho^2} > 0 \implies h(\rho)$ is strictly convex.

Entropy and Optimal Codes

Entropy
The entropy of a distribution is the average information content of a random variable:
$H(p(X)) := \mathbb{E}_{p(x)}[h(p(x))] = \mathbb{E}_{p(x)}[-\ln p(x)].$

Example: Fair Coin
For a fair coin X, we have p("Head") = p("Tail") = 0.5 and:
$H(p(X)) = -0.5 \ln 0.5 - 0.5 \ln 0.5 = \ln 2\ \text{nats} = 1\ \text{bit}.$

Example: Fully Biased Coin
For a fully biased coin X with p("Head") = 1 and p("Tail") = 0, we have:
$H(p(X)) = -1 \ln 1 - 0 \ln 0 = 0.$

Deterministic Variable
For any deterministic variable X with p(x) = 1 for some x and 0 otherwise, we have:
$H(p(X)) = -\ln 1 = 0.$

Key Properties
1. Conceptual properties: measures uncertainty/randomness; higher entropy = greater uncertainty.
2. Mathematical properties:
  1. Non-negative: $H(p(X)) \ge 0$
  2. Upper bounded: $H(p(X)) \le \ln |\mathcal{X}|$ for discrete X
  3. Concave in p(x)
  4. Additive for independent variables: $H(p(X, Y)) = H(p(X)) + H(p(Y))$ if X ⊥⊥ Y
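A small helper of mine (illustrative sketch) that computes the entropy of a discrete distribution in a chosen unit, checked against the examples above:

import numpy as np

def entropy(probs, base=np.e):
    """Entropy of a discrete distribution; terms with p = 0 contribute 0 (0 ln 0 := 0)."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]  # drop zero-probability outcomes
    return float(-np.sum(p * np.log(p)) / np.log(base))

print(entropy([0.5, 0.5], base=2))   # fair coin: 1.0 bit
print(entropy([1.0, 0.0], base=2))   # fully biased coin: 0.0 bits
print(entropy([0.35, 0.30, 0.15, 0.10, 0.10], base=2))  # the A-E source: ~2.13 bits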
New Notation for h(⋅)
As notation for the information content, we will use: H(p(x)) := h(p(x)). When p(⋅) is clear from context, we will simply write H[⋅]. So,
$H(p(X)) = \mathbb{E}_{p(x)}[H(p(x))]$ and $H[X] = \mathbb{E}_{p(x)}[H[x]].$
Caution: The difference between H[X] and H[x] is whether we take an expectation or not.

Non-Negativity of Discrete Entropy
The entropy of a discrete distribution is always non-negative.
Proof.
1. Individual term analysis: for each outcome x, p(x) ∈ (0, 1].
2. Therefore, −ln(p(x)) ≥ 0 because the natural logarithm of a probability satisfies ln(p(x)) ≤ 0.
3. Summing non-negative terms: since each p(x)(−ln p(x)) ≥ 0, their sum is also non-negative:
$H(p(X)) = \sum_x p(x)(-\ln p(x)) \ge 0.$
4. Equality condition: H(p(X)) = 0 if and only if p(x) = 1 for some x (i.e., the distribution is deterministic). □

Source Coding Theorem
For any uniquely decodable code C using an alphabet of size D with codeword lengths $\ell_1, \ell_2, \ldots, \ell_m$ for $x_1, x_2, \ldots, x_m$:
$H(p(X)) \le \left(\sum_{i=1}^{m} p(x_i)\, \ell_i\right)\ \text{D-ary digits},$
where ln D nats are 1 D-it (D-ary digit), like ln 2 nats = 1 bit.

Proof
1. Define a pseudo-probability $q(x_i)$ for each codeword:
$q(x_i) := \frac{\exp(-\ell_i \ln D)}{\sum_{i=1}^{m} \exp(-\ell_i \ln D)} = \frac{D^{-\ell_i}}{\sum_{i=1}^{m} D^{-\ell_i}}.$
Note: This is also known as the softmax function with temperature 1/ln D (or more correctly, the softargmax function).
2. Then, we look at the following ratio on average and use Jensen's inequality:
$\mathbb{E}_{p(x)}\left[h\!\left(\tfrac{q(x)}{p(x)}\right)\right] = \mathbb{E}_{p(x)}\left[-\ln \tfrac{q(x)}{p(x)}\right] \ge -\ln \mathbb{E}_{p(x)}\left[\tfrac{q(x)}{p(x)}\right] = -\ln \sum_{i=1}^{m} p(x_i)\, \frac{q(x_i)}{p(x_i)} = -\ln \sum_{i=1}^{m} q(x_i) = -\ln 1 = 0.$
3. On the other hand, we have:
$\mathbb{E}_{p(x)}\left[h\!\left(\tfrac{q(x)}{p(x)}\right)\right] = \mathbb{E}_{p(x)}\left[-\ln \tfrac{q(x)}{p(x)}\right] = \mathbb{E}_{p(x)}[-\ln q(x)] - \mathbb{E}_{p(x)}[-\ln p(x)] = \mathbb{E}_{p(x)}[-\ln q(x)] - H(p(X)).$
4. Thus, overall: $H(p(X)) \le \mathbb{E}_{p(x)}[-\ln q(x)]$.
5. Now, we substitute q(x) back:
$H(p(X)) \le \mathbb{E}_{p(x)}\left[-\ln \frac{D^{-\ell(x)}}{\sum_{i=1}^{m} D^{-\ell_i}}\right] = \mathbb{E}_{p(x)}[\ell(x) \ln D] + \ln \sum_{i=1}^{m} D^{-\ell_i}.$
6. From Kraft's inequality, $\sum_{i=1}^{m} D^{-\ell_i} \le 1$, so the second term is non-positive because for $x \le 1$, $\ln x \le 0$.
7. Thus, overall: $H(p(X)) \le \ln D\; \mathbb{E}_{p(x)}[\ell(x)].$ □
Note: For D = 2 with 1 bit = ln 2 nats, we have: $H(p(X)) \le \mathbb{E}_{p(x)}[\ell(x)]$ bits.
Tip: This shows that the entropy of a distribution is a lower bound on the expected codeword length of any prefix-free or uniquely decodable code.
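For a concrete check, reusing the A-E source and the Huffman codeword lengths from the earlier example (a sketch of mine, not from the slides):

import numpy as np

p = np.array([0.35, 0.30, 0.15, 0.10, 0.10])  # the A-E source
lengths = np.array([1, 2, 3, 4, 4])           # Huffman codeword lengths from the example

entropy_bits = -np.sum(p * np.log2(p))        # H(p(X)) in bits
expected_length = np.sum(p * lengths)         # E_p[l(x)] in bits
print(entropy_bits, expected_length)          # ~2.13 <= 2.20, as the theorem requires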
Near-Optimal Code
Given p(x), we can construct a prefix-free code C such that:
$H(p(X)) \le \ln D\; \mathbb{E}_{p(x)}[\ell(x)] \le H(p(X)) + \ln D.$
Note: Remember that ln D nats are 1 D-it (D-ary digit), like ln 2 nats = 1 bit. Thus:
$H(p(X)) \le \mathbb{E}_{p(x)}[\ell(x)]\ \text{D-its} \le H(p(X)) + 1\ \text{D-it}.$
Given that we cannot send fractional messages easily (more on that later, maybe), this is essentially as good as it gets.

Proof
1. Recall that by Kraft's inequality, if we have a sequence of code lengths $\{\ell_i\}_{i=1}^{m}$ satisfying $\sum_{i=1}^{m} D^{-\ell_i} \le 1$, then there exists a prefix-free code with these lengths.
2. For a given distribution p(x), let's choose code lengths:
$\ell(x) := \lceil -\log_D p(x) \rceil.$
(⌈y⌉ is the smallest integer greater than or equal to y, i.e. np.ceil.)
3. These lengths satisfy Kraft's inequality:
$\sum_x D^{-\ell(x)} \le \sum_x D^{-(-\log_D p(x))} = \sum_x p(x) = 1,$
where we used that ⌈y⌉ ≥ y and thus $D^{-\lceil y \rceil} \le D^{-y}$.
4. For the upper bound on the expected code length:
$\mathbb{E}_{p(x)}[\ell(x)] = \mathbb{E}_{p(x)}[\lceil -\log_D p(x) \rceil] \le \mathbb{E}_{p(x)}[-\log_D p(x) + 1] = -\frac{1}{\ln D} \mathbb{E}_{p(x)}[\ln p(x)] + 1 = \frac{H(p(X))}{\ln D} + 1.$
5. Multiplying both sides by ln D gives us:
$\ln D\; \mathbb{E}_{p(x)}[\ell(x)] \le H(p(X)) + \ln D.$ □

Information Content & Optimal Code Length
The information content H[x] = −ln p(x) has a deep operational meaning: it specifies the optimal code length (in nats) for encoding an event x given its probability p(x). Let's make this connection precise.
1. Our proof showed that the near-optimal code length is: $\ell(x) = \lceil -\log_D p(x) \rceil$.
2. Expressing this in natural logarithms: $\ell(x) \approx -\frac{\ln p(x)}{\ln D} = \frac{H[x]}{\ln D}$
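To close the loop, a short sketch of mine (illustrative only) that builds the $\lceil -\log_2 p(x) \rceil$ code lengths for the A-E source and checks both Kraft's inequality and the $H \le \mathbb{E}[\ell] \le H + 1$ sandwich:

import math

p = {"A": 0.35, "B": 0.30, "C": 0.15, "D": 0.10, "E": 0.10}

# Near-optimal (Shannon) code lengths in bits: l(x) = ceil(-log2 p(x)).
lengths = {x: math.ceil(-math.log2(px)) for x, px in p.items()}

kraft = sum(2 ** -l for l in lengths.values())             # <= 1, so a prefix code exists
expected_length = sum(p[x] * lengths[x] for x in p)        # E_p[l(x)]
entropy_bits = -sum(px * math.log2(px) for px in p.values())

print(lengths)                        # {'A': 2, 'B': 2, 'C': 3, 'D': 4, 'E': 4}
print(kraft)                          # 0.75
print(entropy_bits, expected_length)  # ~2.13 <= 2.55 <= ~3.13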