Information Theory PDF
Summary
This document introduces information theory, a branch of mathematics and computer science, focusing on the fundamentals of communication channels, including the concept of information as a measurable quantity.
Full Transcript
IT2312 Information Theory

Introduction to Information Theory (Stone, 2019)
Claude Elwood Shannon, a mathematician and computer scientist, laid the groundwork for today's electronic communication networks with his theories on information. One of his published papers is A Mathematical Theory of Communication (1948). This paper changed people's understanding of information, which at the time was perceived as a "poorly defined smelly fluid." With his paper, it became evident that information is a well-defined and measurable quantity. With almost 150,000 citations, this paper single-handedly accelerated the rate of scientific progress. His work was then published as a book in 1949 titled The Mathematical Theory of Communication.

Information Theory
Information theory details the definite and unbreachable limits on exactly how much information can be communicated between any two (2) components of any communication system, whether man-made (TV/radio broadcasting, telephone) or natural (observable in nature). The main goal of information theory is for one person or device (the transmitter) to convey a message (over a channel) to another person or device (the receiver). Shannon analyzed the ability to convey information through a communication channel.

Figure 1. A communication channel. Retrieved from https://arxiv.org/pdf/1802.05968.pdf

Figure 1 shows a communication channel wherein a message or data is encoded before being used as input to a communication channel, which adds noise. A receiver then decodes the output to recover the message. According to Shannon, a simple concept in information theory is that information can be treated like a physical quantity, such as mass or energy. This is why it is essential to remember the following:

1. Finding a Route, Bit by Bit
Information can be measured in bits, the smallest increment of data on a computer, wherein one bit of information allows a choice between two (2) equally probable (equiprobable) alternatives.

Figure 2. Fork in the road. Retrieved from https://arxiv.org/pdf/1802.05968.pdf

To put this into perspective, imagine a person standing at the fork in the road at Point A in Figure 2, who must get to Point D. Assume that this person is unfamiliar with the route and that each fork requires one bit of information to make a correct decision. In Figure 2, the binary digits 0 and 1 are used: 0 for the left fork and 1 for the right. These summarize the instructions needed to arrive at each destination. If this person asks a local, the local might suggest going to the left. This instruction counts as one bit of information and is represented by 0.

Now, at Point B, the person faces another fork, and another binary digit leads to C, the correct road; here the correct binary digit is 1 (right). Note that Point C is one of the four possible destinations the person could have reached after making two decisions, since there is a left and a right option at each of Points A and B. After two forks, the two binary digits that allowed the person to make the correct decisions provided two bits of information, allowing a choice from four equiprobable alternatives: 4 = 2 × 2 = 2².
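To see this bookkeeping concretely, the following is a minimal Python sketch (not part of the handout; the variable names are illustrative) that enumerates every sequence of left/right decisions after two forks and confirms that two binary digits single out one of four equiprobable destinations:

```python
from itertools import product
from math import log2

# Each fork is one binary decision: "0" = left fork, "1" = right fork.
n_forks = 2

# Enumerate every possible sequence of decisions after n_forks forks.
routes = ["".join(decisions) for decisions in product("01", repeat=n_forks)]
print(routes)                    # ['00', '01', '10', '11'] -> 4 destinations

m_destinations = len(routes)     # m = 2^n equiprobable destinations
n_bits = log2(m_destinations)    # number of bits needed to pick one of them
print(m_destinations, n_bits)    # 4 2.0
```

Changing n_forks to 3 reproduces the eight destinations discussed next.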
From Point C, another binary digit is needed to reach Point D. In this case, the correct choice is the left fork, so the binary digit 0 serves as the third binary digit. There are now eight possible roads that the person could have chosen since Point A, and the three binary digits, carrying three bits of information, allowed the person to choose from eight equiprobable alternatives: 2 × 2 × 2 = 2³ = 8.

This can be restated in more general terms if n is used to represent the number of forks and m the number of final destinations. If the person has come through n forks, then they have effectively chosen from m = 2^n final destinations. As the decision at each fork needs one bit of information, n forks require n bits of information. Going back to the fork problem, from Point A to Point D there were a total of n = 3 forks, so m = 2³ = 8 possible destinations. Taking the base-2 logarithm of 8 gives log₂(8) = 3, so it follows that n = log₂ m. To check this, use the four equiprobable alternatives between Point A and Point C as m:

n = log₂ m
n = log₂(4)
n = 2

which matches the number of forks taken to reach Point C.

2. Bits Are Not Binary Digits
The word bit comes from binary digit, but they are fundamentally different types of quantities. A binary digit is the value of a binary variable, while a bit is an amount of information. To confuse a binary digit for a bit is a category error.

Entropy (Stone, 2019)
Entropy is a scientific concept concerning the measure of the disorder of a system. In a communication channel such as the one in Figure 1, entropy is the expected number of bits of information in each message, taken over all possibilities for the transmitted message. For instance, assume the transmitter wants to inform the receiver of the result of a 4-person tournament, with some players being better than others. The entropy of this message would be a weighted average of the amount of information each of the possible messages ("Player 1 Won", "Player 2 Won", etc.) provides.

Entropy is Average Shannon Information
Another setting where entropy can be understood is a coin flip. The outcome of a coin flip can be represented as the random variable x, where the head outcome is x = x_h and the tail outcome is x = x_t. The surprise of one particular value of a random variable is often of limited interest on its own; what is more interesting is how much surprise, on average, is associated with the entire set of possible values. The average surprise of a variable x is called the entropy of its probability distribution p(x) and is represented as H(x).

Assume that a coin lands heads up 90% of the time: p(x_h) = 0.9 (that is, 90/100). When this coin is flipped, it is expected to land heads up, which would be less surprising than if it were to land tails up. The more improbable an outcome is, the more surprising it becomes (see Figure 3).

Figure 3. Surprisal. Retrieved from https://arxiv.org/pdf/1802.05968.pdf

If the logarithm to base 2 is used, then the Shannon information or surprisal of each outcome is measured in bits:

Shannon information = log₂(1 / p(x_h)) bits

It is also often expressed as:

Shannon information = −log₂ p(x_h) bits

Most calculators do not have a specific button for log₂. To evaluate log₂ in this formula, use:

log₂(x) = log(x) / log(2)

where x is the result of 1 / p(x_h) or 1 / p(x_t). Thus, p(x_h) = 0.9 results in:

Shannon information = log₂(1 / p(x_h))
                    = log₂(1 / 0.9)
                    = log₂(1.1111)
                    = log(1.1111) / log(2)
                    = 0.0457 / 0.3010
                    = 0.15 bits
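The surprisal values used throughout this handout are easy to reproduce numerically. Below is a minimal Python sketch (the helper name surprisal is an illustrative assumption, not something defined in the handout) that evaluates log₂(1/p) for the probabilities appearing in the coin examples:

```python
from math import log2

def surprisal(p: float) -> float:
    """Shannon information, in bits, of an outcome with probability p."""
    return log2(1 / p)   # equivalently: -log2(p)

print(round(surprisal(0.9), 2))   # heads on the biased coin -> 0.15 bits
print(round(surprisal(0.1), 2))   # tails on the biased coin -> 3.32 bits
print(round(surprisal(0.5), 2))   # either side of a fair coin -> 1.0 bit
```

The 0.1 and 0.5 cases preview the fair- and unfair-coin calculations that follow.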
Entropy of a Fair Coin
The average surprise over the possible outcomes of a coin flip is computed as follows. If a coin is unbiased, with a 50% probability for both the heads and tails outcomes, then p(x_h) = p(x_t) = 0.5. The Shannon information gained from a head or a tail outcome is log₂(1 / 0.5) = 1 bit. This is also the average Shannon information gained after each coin flip. As entropy is the average Shannon information, the entropy of a fair coin is H(x) = 1 bit.

Entropy of an Unfair Coin
If a coin is biased such that, as in the example above, it lands on heads 90% of the time, then it is easy to predict the result of each coin flip. For a head outcome:

Shannon information = log₂(1 / p(x_h))
                    = log₂(1 / 0.9)
                    = log₂(1.1111)
                    = 0.15 bits (the same answer as the indirect computation of log₂ above)

But if the outcome is a tail, then:

Shannon information = log₂(1 / p(x_t))
                    = log₂(1 / 0.1)
                    = log₂(10)
                    = 3.32 bits

Notice that more information (more bits) comes with the more surprising outcome. As the proportion of coin flips that result in a head is p(x_h) and in a tail is p(x_t), where p(x_h) + p(x_t) = 1, the average surprise formula is:

H(x) = p(x_h) log₂(1 / p(x_h)) + p(x_t) log₂(1 / p(x_t))

Using both Shannon information values above, the average surprise is:

H(x) = (0.9) log₂(1 / 0.9) + (0.1) log₂(1 / 0.1)
     = (0.9) log₂(1.1111) + (0.1) log₂(10)
     = (0.9)(0.152) + (0.1)(3.322)
     = 0.1368 + 0.3322
     = 0.469 bits

Note that parts of the average surprise formula, such as log₂(1 / p(x_h)) and log₂(1 / p(x_t)), were already solved above. It is advisable not to simply substitute the rounded individual answers, as doing so may return a less precise result.

Basic Laws of Information (Stone, 2019)
Along with his definition of information and of the ability to send information through a channel, Shannon also discovered that a channel has a certain maximum transmission rate that cannot be exceeded; today this is called the bandwidth of the channel. Such theorems of information theory can be regarded as the laws of information, given how important they are. For any communication channel, such as the one in Figure 1, the basic laws of information can be summarized as:

1. Channel Capacity: There is a definite upper limit to the amount of information that can be transmitted through a channel;
2. Noise Reduces Channel Capacity: This limit shrinks as the amount of noise increases; and
3. Shannon's Source Coding Theorem: This limit can nearly be reached by suitably encoding or packaging the data.

Channel Capacity (C)
Channel capacity is the maximum amount of information that a channel can communicate at its output about its input. It is expressed in bits per usage, such as bits per output (bits per transmission) or bits per second (bits/s):

C = (1/2) log₂(1 + S/N)

S/N is the signal-to-noise ratio. To recall:

S/N = Vs / Vn   or   S/N = Ps / Pn

where:
Vs = signal voltage, Ps = signal power
Vn = noise voltage, Pn = noise power

Assume that the signal voltage is 4.2 mV and the noise voltage is 0.4 mV. Solving for S/N:

S/N = Vs / Vn
S/N = 4.2 mV / 0.4 mV
S/N = 10.5

Computing the channel capacity:

C = (1/2) log₂(1 + S/N)
C = (1/2) log₂(1 + 10.5)
C = (1/2) log₂(11.5)
C = (0.5)(3.5236)
C = 1.76 bits per transmission
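Both worked examples above can be verified with a short computation. Here is a minimal Python sketch (the function names are illustrative assumptions; channel_capacity takes the signal-to-noise ratio as a voltage ratio, exactly as in the worked example) that reproduces the 0.469-bit entropy of the biased coin and the 1.76 bits per transmission of channel capacity:

```python
from math import log2

def entropy(probs):
    """Average Shannon information H(x) = sum of p * log2(1/p), in bits."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

def channel_capacity(signal_v, noise_v):
    """C = 0.5 * log2(1 + S/N), with S/N computed as a voltage ratio."""
    snr = signal_v / noise_v
    return 0.5 * log2(1 + snr)

print(round(entropy([0.5, 0.5]), 3))         # fair coin   -> 1.0 bit
print(round(entropy([0.9, 0.1]), 3))         # biased coin -> 0.469 bits
print(round(channel_capacity(4.2, 0.4), 2))  # -> 1.76 bits per transmission
```

Computing with full-precision intermediate values, as done here, avoids the rounding loss mentioned after the average surprise formula.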
The rate of information transmission through the channel depends on the entropies of three variables in the channel: the entropy H(x) of the input, the entropy H(y) of the output, and the entropy H(η) of the noise (η is pronounced "eta," a Greek letter).

An important and convenient channel is the additive channel. Noise (η) is added as encoded values pass through an additive channel, so the channel output is a noisy version y of the channel input x:

y = x + η

With entropy,

H(y) = H(x) + H(η)

If the output entropy is high, it provides a large potential for information transmission, but whether this potential is realized depends on the input entropy and the noise level. If the noise is low, the output entropy can be close to the channel capacity, although the capacity gets progressively smaller as the noise increases.

Noise Reduces Channel Capacity
This law examines how noise effectively reduces the maximum amount of information a channel can communicate. If the number of equiprobable input states is m_x, then the input entropy is:

H(x) = log₂ m_x bits

For instance, assume there are m_x = 3 equiprobable input states, such as x1 = 200, x2 = 300, and x3 = 400. The input entropy is:

H(x) = log₂ m_x
H(x) = log₂(3)
H(x) = 1.58 bits

If there are m_η = 2 equiprobable values for the channel noise, such as η1 = 20 and η2 = 40, then the noise entropy is:

H(η) = log₂ m_η
H(η) = log₂(2)
H(η) = 1 bit

Using y = x + η, if the input is x1 = 200, the output can be one of two equiprobable states:

y1 = x1 + η1 = 200 + 20 = 220
or
y2 = x1 + η2 = 200 + 40 = 240

The set of possible outputs depends on the input. For x2, the output can likewise be y3 = 300 + 20 = 320 or y4 = 300 + 40 = 340, and the same idea applies to x3 (y5 = 420, y6 = 440). So, given three equiprobable input states and two equiprobable noise values, there are m_y = 6 equiprobable output states. The output entropy is:

H(y) = log₂ m_y
H(y) = log₂(6)
H(y) = 2.58 bits

which is also the result obtained using H(y) = H(x) + H(η).

Shannon's Source Coding Theorem
This theorem applies only to noiseless channels. It is about re-packaging or encoding data before it is transmitted so that every datum conveys as much information as possible when it is transmitted. This theorem is significant to biological information processing because it defines definite limits on how efficiently sensory data can be re-packaged.

First Fundamental Coding Theorem
Shannon's source coding theorem is considered the first fundamental coding theorem and is also known as Shannon's Fundamental Theorem for a Discrete Noiseless Channel. His statement of this theorem in 1949 reads:

Let a source have entropy H (bits per symbol) and a channel capacity C (bits per second). Then it is possible to encode the output of the source in such a way as to transmit at the average rate of C/H − ε symbols per second over the channel, where ε is arbitrarily small. It is not possible to transmit at an average rate greater than C/H.

Reference:
Stone, J. V. (2019). Information theory: A tutorial introduction [PDF]. Retrieved on October 20, 2023, from https://arxiv.org/pdf/1802.05968.pdf