Foundations: Information Theory PDF - Lecture Slides
Document Details

Uploaded by AdaptiveOpArt68
Technische Universität Darmstadt
Prof. Jan Peters
Summary
This document presents lecture slides on the foundations of information theory. Topics include entropy, channel capacity, and applications of information theory within computer science and beyond, including machine learning and data compression.
Full Transcript
Foundations: Information Theory
Prof. Jan Peters | Probabilistic Methods for Computer Science | WiSe 2024/25 | CS, TU Darmstadt

Basic Cycle of Probabilistic Modeling and Analysis
[Diagram: the basic cycle connects prior assumptions, observed data from experiments, model structure, model parameters, and model quality evaluation via modeling a probability distribution (Lecture 3), estimating uncertain quantities (Lecture 4), and experimental design in computer science and evaluation (Lecture 4); Lectures 1+2 and 6+7 cover the remaining stages.]

Motivation
So far we have learned about probabilities, events, and distributions, about Bayes' rule, and about expectation and variance. What is missing? Quantifying and analyzing information, e.g., the information gained from observing an event.

Outline
1. What is Information?
2. Properties of Information Channel
3. Application Examples of Information Theory
4. Divergences
5. Conclusion

Historical Context of Information Theory
Information theory is the foundation for data compression, cryptography, error detection and correction, and telecommunications. Interest in communication and data transmission has been growing since the early 20th century. Claude Shannon (1916-2001) published "A Mathematical Theory of Communication" in 1948. It introduced the concept of bits as a unit of information and the fundamental principles of data compression and error correction. It influences cybersecurity, wireless communication, machine learning, psychology, linguistics... basically everything! Interesting book: The Information: A History, a Theory, a Flood (James Gleick).

Learning Objectives
Define information, quantify uncertainty, compare probability distributions... and see what this is useful for.

Information Channel
Information flows from a source to a receiver. The source encodes a message; during transmission over the channel, noise is added; the receiver decodes the noisy encoded message. The receiver then gains information based on the data!

Information
Intuitively, which statement has more information?
1. The program terminated due to an error.
2. The program terminated with Error Code 3.
The first sentence only tells whether the program finished successfully (yes/no); the second also tells the possible source of the error (one of the codes 0-4). The more diverse the possible outcomes, the greater the amount of information conveyed by the message!

Defining Information
Less likely events are more surprising, and the more surprise, the more information. Shannon information is the most useful measure. Data does not equal information!

Information vs Data
A bit (0 or 1) is the most basic unit of data. A bit is also the basic unit of information when using log base 2. Why? X bits of uniformly distributed data carry X bits of information, and to encode X bits of information we need at least X bits of data. 0.5 bits of information makes sense; 0.5 bits of data does not.

Poll: In Morse code, some letters (like "E" and "T") use shorter codes, while others (like "Q" and "Z") use longer codes. Why is this design efficient from an information theory perspective?

Information Source
Assume that we have a set of messages M. We want to measure the amount of information I(m) that a message m contains. Axiom: I(m) only depends on the probability p(m) with which the source sent the message m. Definition: we define a source as a pair S = (M, p) of a message set and a probability distribution over it. Let m be a message with probability p(m) > 0; then its (self-)information is I(m) = -log2 p(m). Stationary source (of information): a finite, non-empty set of messages M, where each m in M has p(m) > 0 and the probabilities sum to 1.

Quantifying Uncertainty
How do we measure uncertainty? Suggestion: the average information obtained! This measure of uncertainty is called entropy, and entropy's unit is also the bit. H(S) = Σ_m p(m) I(m) = -Σ_m p(m) log2 p(m) is the entropy of the source S.
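As a quick numerical sketch of these definitions (plain Python, no external libraries; the helper names self_information and entropy are illustrative and not from the slides):

```python
import math

def self_information(p):
    """Self-information I(m) = -log2 p(m), in bits, of a message with probability p."""
    return -math.log2(p)

def entropy(probs):
    """Entropy H(S) = -sum_m p(m) log2 p(m), in bits, of a source with message probabilities probs."""
    return sum(p * self_information(p) for p in probs if p > 0)

# The intuition from above: more diverse outcomes carry more information.
print(self_information(1 / 2))   # "terminated due to an error" (yes/no): 1.0 bit
print(self_information(1 / 5))   # "terminated with Error Code 3", assuming 5 equally likely codes: ~2.32 bits

# A fair coin is maximally uncertain (1 bit); a biased coin carries less entropy.
print(entropy([0.5, 0.5]))       # 1.0
print(entropy([0.9, 0.1]))       # ~0.47
```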
What is Information?
All letters in the alphabet of a language have a different probability of occurring. With n bits we can encode 2^n characters, so an alphabet of N letters can be encoded in ceil(log2 N) bits (e.g., 5 bits for 26 letters). The information of a single character x is I(x) = -log2 p(x). Events with a low probability correspond to high information content. The average information in a character in an English text is therefore lower than the uniform bound, because letter frequencies are far from uniform.
Abbas, Rasha Hassan, and Firas Abdul Elah Abdul Kareem. "Text Language Identification Using Letters (Frequency, Self-information, and Entropy) Analysis for English, French, and German Languages." Journal of Southwest Jiaotong University 54.4 (2019)

Poll: Which of the three languages has the lowest average information (entropy) in a single character?

Letter Probabilities given Language
[Figure: letter probability distributions for English, French, and German, from the reference above.]

Entropy of Continuous Random Variables
Differential entropy (DE), h(X) = -∫ p(x) log2 p(x) dx, extends the concept of entropy to continuous random variables. However, this naive extension does not share the same properties as discrete entropy: the probability density function can be larger than 1, so the DE can be lower than zero! A possible remedy is the Kullback-Leibler divergence (we get to it later).

Poll: In information theory, Shannon entropy is a measure of ...?

Channel
A channel transmits messages and can be subject to errors. An imperfect channel is called a noisy channel. We define a channel as a triple (X, Y, P), where X and Y are non-empty sets of input and output symbols and P: X × Y → [0, 1] is a function with Σ_y P(y | x) = 1 for every x, named the channel transition function, which tells us the probability of transitioning from input symbol x to output symbol y.
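To make the channel transition function concrete, here is a small illustrative sketch (assuming Python with NumPy; not from the slides). It writes the transition function of a noisy binary channel that flips each bit with probability 0.1 as a matrix and samples transmissions from it, anticipating the binary symmetric channel defined on the next slide:

```python
import numpy as np

rng = np.random.default_rng(0)

# Channel transition function P(y | x) as a matrix: rows are inputs x, columns are outputs y.
flip = 0.1
P = np.array([[1 - flip, flip],
              [flip, 1 - flip]])
assert np.allclose(P.sum(axis=1), 1.0)  # each row is a probability distribution over y

def transmit(bits, P, rng):
    """Send each input symbol x through the channel by sampling y ~ P(y | x)."""
    return np.array([rng.choice(len(P[x]), p=P[x]) for x in bits])

x = rng.integers(0, 2, size=20)   # random input bits
y = transmit(x, P, rng)
print("sent:     ", x)
print("received: ", y)
print("bit errors:", int(np.sum(x != y)))
```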
Channel Types
Binary symmetric channel (BSC): a BSC is a communication channel where each transmitted bit has a certain probability of being flipped (i.e., changed from 0 to 1 or 1 to 0) due to noise.
Binary erasure channel (BEC): in a BEC, each transmitted bit is either received correctly or erased (received as the erasure symbol *), meaning the receiver knows a bit is missing but does not know its original value.

Channel
Transinformation: the transinformation quantifies how much information is shared between the input and the output of a channel. It is equal to the mutual information of both variables! Channel capacity: the channel capacity is the maximum rate at which information can be transmitted through a noisy channel with arbitrarily small error probability.

Poll: You are at a crowded place and you try to upload a video, but the upload is unreliable due to many users. Which technique is commonly used to cope with a noisy channel?

Joint Entropy
How do we compute this channel capacity? The joint entropy H(X, Y) = -Σ_{x,y} p(x, y) log2 p(x, y) is a measure of the uncertainty in a pair of random variables: it tells us how many units of information are needed to describe the joint outcome of two random variables, including dependencies between them (e.g., covariance). Here X is the transmitted message and Y is the received message.

Marginal Distributions
Given the joint distribution p(x, y) we may need to compute the marginals.
Categorical: p(x) = Σ_y p(x, y). Continuous: p(x) = ∫ p(x, y) dy.
Example joint table:
            x=0    x=1
  y=0       0.3    0.4
  y=1       0.2    0.1
  y=0 or 1  0.5    0.5   (marginal p(x))

Mutual Information
The difference between the marginal and joint entropies measures the shared information between two variables. Compute it from the entropies or from the distributions:
I(X; Y) = H(X) + H(Y) - H(X, Y) = Σ_{x,y} p(x, y) log2 [ p(x, y) / (p(x) p(y)) ].
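A short sketch (again assuming Python with NumPy; variable names are illustrative) computing the marginals, the entropies, and the mutual information for the example joint table above:

```python
import numpy as np

# Joint distribution p(x, y) from the table above: rows are y, columns are x.
p_xy = np.array([[0.3, 0.4],
                 [0.2, 0.1]])

p_x = p_xy.sum(axis=0)   # marginal over y -> p(x) = [0.5, 0.5]
p_y = p_xy.sum(axis=1)   # marginal over x -> p(y) = [0.7, 0.3]

def H(p):
    """Entropy in bits of a (possibly multi-dimensional) probability table."""
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

H_x, H_y, H_xy = H(p_x), H(p_y), H(p_xy)
mi = H_x + H_y - H_xy   # I(X; Y) = H(X) + H(Y) - H(X, Y)

print(f"H(X) = {H_x:.3f} bit, H(Y) = {H_y:.3f} bit, H(X,Y) = {H_xy:.3f} bit")
print(f"I(X; Y) = {mi:.3f} bit")  # small but positive: X and Y are weakly dependent
```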
Channel Capacity
Using the mutual information we can now compute the channel capacity of the BSC and the BEC: C = max over input distributions p(x) of I(X; Y). Let's assume a BSC with a bit-flip probability of 0.1 and that 0 and 1 are equally likely to be transmitted.

Channel Capacity - Binary Symmetric Channel
We have I(X; Y) = H(Y) - H(Y | X) and still need the conditional entropy H(Y | X). With a bit-flip chance of p, H(Y | X) = H_b(p) = -p log2 p - (1 - p) log2(1 - p), the binary entropy of the flip probability. The maximizing input distribution for the BSC is the uniform distribution, thus C = 1 - H_b(p); for p = 0.1 this is roughly 0.53 bits per channel use.
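A quick numeric check of this result (plain Python; the BEC line uses the standard closed form C = 1 - erasure probability, which the slides mention as computable but do not derive here):

```python
import math

def binary_entropy(p):
    """H_b(p) = -p log2 p - (1 - p) log2 (1 - p), in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(flip_prob):
    """Capacity of a binary symmetric channel: C = 1 - H_b(p)."""
    return 1 - binary_entropy(flip_prob)

def bec_capacity(erasure_prob):
    """Capacity of a binary erasure channel: C = 1 - epsilon (standard result, assumed here)."""
    return 1 - erasure_prob

print(f"BSC, p = 0.1:       C = {bsc_capacity(0.1):.3f} bit/use")   # about 0.531
print(f"BEC, epsilon = 0.1: C = {bec_capacity(0.1):.3f} bit/use")   # 0.900
```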
Poll: Suppose you are tasked with improving the data transmission rate of a wireless network that suffers from significant background noise. Which of the following actions would directly increase the channel capacity?

Block Coding
Definition: block coding is a method of encoding data into fixed-length blocks of bits to improve error detection and correction. Purpose: enhance reliability by encoding information with redundancy, enabling error correction in noisy communication channels. Encoding: each block of k information bits is transformed into a block of n bits, known as a codeword, where n > k. Code rate: defined as R = k/n; higher redundancy (lower R) improves error correction. Error correction capability: errors can be detected and corrected by comparing received blocks to the set of valid codewords.

Convolutional Coding
Definition: convolutional coding is a type of error-correcting code where input data bits are encoded based on current and previous input bits. Purpose: provides continuous protection against errors and is well suited for real-time data transmission. Encoding: generates code bits by passing input bits through a shift register, with each output bit depending on a combination of input bits and previous state bits. Constraint length: determines the length of the encoder memory, which impacts error correction capability. Code rate: as for block codes, the code rate defines the redundancy.

Information Theory Beyond Message Transmission
Information theory can not only be applied to transmitting messages but is relevant in many different applications and research fields!
1. Data compression (images, audio, video): reduce data size using entropy for lossless (ZIP) and lossy (JPEG) compression. Applications: storage, streaming.
2. Machine learning & classification: information gain and mutual information aid in feature selection and decision tree building; the KL divergence is used in optimizing models.
3. AI learning & inference: entropy guides exploration-exploitation in reinforcement learning; variational inference approximates distributions. Applications: autonomous systems, robotics.
4. Neuroscience & cognitive science: neural coding and perception are modeled using entropy and information theory to study brain functions. Applications: brain-computer interfaces.
5. Cryptography & security: Shannon's entropy ensures secure encryption; entropy also measures password strength. Applications: data privacy, encryption algorithms.

Mutual Information - Application Example
The JPEG algorithm uses mutual information to compress images: it leverages the mutual information between neighbouring pixels.

Poll: Mutual information between two random variables X and Y is a measure of ...?

Conditional Entropy
Choose two pixels X and Y with mutual information I(X; Y) > 0. Observing one reduces our uncertainty over the other! The conditional entropy H(X | Y) = H(X, Y) - H(Y) measures the remaining uncertainty over one given the other.

Comparing Distributions
Quantifying the similarity or difference between distributions is not trivial, but many algorithms depend on it. Let's have a look at some options!

Cross Entropy
H(p, q) = -Σ_x p(x) log2 q(x): the average surprise over distribution p when assuming distribution q.

Cross Entropy - Application Example
The average surprise over distribution p assuming distribution q is very common in machine learning as a loss function for classification (e.g., on image datasets such as CIFAR, https://www.cs.toronto.edu/~kriz/cifar.html). It is also used for language transformers (ChatGPT), GANs, VAEs, ...

Divergences
Divergences are the most natural way to compare two distributions! A divergence is a measure of how one probability distribution diverges from a second probability distribution, but it is not a metric; it captures how efficiently one distribution encodes another. Divergences have applications in probabilistic inference, optimization, and learning, and the choice of divergence matters! Properties of divergences:
- Asymmetry: unlike distances, divergences are not necessarily symmetric.
- Non-negativity: divergences are always greater than or equal to zero.
- Zero if the distributions are identical.
Source: https://franknielsen.github.io/Divergence/index.html
Further reading: Nielsen, Frank. "An elementary introduction to information geometry." Entropy 22.10 (2020).

Poll: Why are divergences, such as the Kullback-Leibler (KL) divergence, useful in information theory and machine learning?

Kullback-Leibler (KL) Divergence
D_KL(p || q) = Σ_x p(x) log2 [ p(x) / q(x) ] measures the information loss when q is used to approximate p. It extends the concept of differential entropy as a relative entropy, it is not symmetric, and it is a special case of several generalized divergences!

KL Divergence - Application Example
Reinforcement learning: KL-constrained or KL-regularized policy updates are used in REPS, TRPO, PPO, SAC, MPO, ...

Jensen-Shannon Divergence
JSD(p || q) = 1/2 D_KL(p || m) + 1/2 D_KL(q || m) with the mixture m = 1/2 (p + q). It is a symmetrized and smoothed version of the KL divergence, bounded between 0 and 1 (with log base 2), and it can be transformed into a distance by taking the square root!
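A compact sketch (assuming Python with NumPy and base-2 logarithms; function names are illustrative) relating the quantities from this section for two discrete distributions, including the identity H(p, q) = H(p) + D_KL(p || q):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return float(-(p[p > 0] * np.log2(p[p > 0])).sum())

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x): average surprise under q when data follow p."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(-(p[mask] * np.log2(q[mask])).sum())

def kl(p, q):
    """D_KL(p || q) = sum_x p(x) log2 (p(x) / q(x)): information lost when q approximates p."""
    return cross_entropy(p, q) - entropy(p)

def js(p, q):
    """Jensen-Shannon divergence: symmetrized, smoothed KL via the mixture m = (p + q) / 2."""
    m = (np.asarray(p, dtype=float) + np.asarray(q, dtype=float)) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.7, 0.2, 0.1]   # "true" distribution
q = [0.5, 0.3, 0.2]   # approximating distribution

print(f"H(p)     = {entropy(p):.3f} bit")
print(f"H(p, q)  = {cross_entropy(p, q):.3f} bit")    # equals H(p) + KL(p || q)
print(f"KL(p||q) = {kl(p, q):.3f} bit, KL(q||p) = {kl(q, p):.3f} bit (not symmetric)")
print(f"JS(p, q) = {js(p, q):.3f} (symmetric, between 0 and 1 with log base 2)")
```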
Outlook
Advanced topics: trust region update methods, maximizing information, maximum entropy methods. Applications: machine learning optimization, statistical mechanics, etc.

Recap
Information in computer science: intuition of information, definition of information, and introduction to entropy. Transmission of information: introduction of channels, their capacity, types, and coding schemes. Application examples: information-theoretic measures applied to LLMs, image compression, or robot learning. Divergences: measures to compare distributions, connections between divergences, and the properties of the KL and JS divergences.

Self-Test Questions
What is information? What is the difference between information and data? What are the properties of an information source? What is entropy in information theory? What is a channel? How do the two discussed channel types differ? What is the channel capacity? Why are we using coding schemes? What are practical applications of information theory? What is a divergence? What are the properties of the KL divergence?