Introduction to Information Theory
This document is a transcript of lecture slides introducing information theory. It covers sources and channels, probabilities and densities, information and entropy, mutual information, source coding (Kraft's inequality, Huffman coding, Lempel-Ziv and arithmetic coding), discrete and Gaussian channels and their capacities, channel coding, and a short outlook on cryptology, with fundamental concepts illustrated by worked examples.
W. Henkel, Constructor University Bremen 1 Introduction to Information Theory Prof. Dr.-Ing. Werner Henkel http://trsys.jacobs-university.de [email protected] W. Henkel, Constructor University Bremen 2 The IT framework... data source channel modulator source encoder encoder noise channel data source channel demodulator sink decoder decoder Line Coding Source Coding Channel Coding Cryptology W. Henkel, Constructor University Bremen 3 The topics... Introduction Sources and channels, probabilities and densities, information, entropy Entropy and Mutual Information Chain rule, Jensen’s inequality, Fano’s inequality Source coding principles Kraft’s inequality, optimum codes, Huffman coding, Shannon-Fano codes Practical lossless source coding algorithms Lempel-Ziv, arithmetic coding The discrete channel Channel capacity and channel coding theorem The Gaussian and bandlimited channel The single and multiple Gaussian channel and its capacities, water filling, capacities for modulation alphabets W. Henkel, Constructor University Bremen 4 Random coding bound Error exponent, cutoff rate Rate-distortion theory Multiple access and MIMO Capacity region, multiple access, broadcast, relay, capacity of MIMO channels Channel coding Linear block codes, Reed-Solomon codes, convolutional codes, turbo codes Cryptology Public-key encryption W. Henkel, Constructor University Bremen 5 1 Basics of stochastics - the discrete case First let us consider a random variable that can take on values from a certain set X. In an experiment, we may obtain a certain event, i.e., an element x ∈ X. Let the probability of such an event be pX (x). This can be estimated by doing many, say n statical experiments. Let x be obtained nx times. Then, nx pX (x) ≈ n The properties of a probability 1. ∀Q ∈ Q(X ) : p(Q) ≥ 0, where Q(X ) is the set of subsets of X , i.e., the event Q is to belong to a certain subset of X. P 2. P (X ) = 1, i.e., xi ∈X P (xi ) = 1 3. ∀Q, R ∈ Q(X ); Q ∩ R = {} : p(Q ∪ R) = p(Q) + p(R) W. Henkel, Constructor University Bremen 6 Having a random variable, it is suitable to use expectations, e.g., the expected value of a function g(x) would be expressed as X E{g(X)} = g(xi ) · pX (xi ) xi ∈X We use capital ”X” to denote the random variable. Instead of P (X = xi ), we abbreviate pX (xi ). In some books, X is written as x(η). Some special expected values The linear mean or average n n X 1X 1X mX = E{X} = xi · pX (xi ) ≈ xi · ni = ui n i=1 n i=1 xi ∈X The second moment or quadratic mean (average energy) n (2) X 1X 2 mX = E{(X)2 } = 2 xi · pX (xi ) ≈ xi ni n i=1 xi ∈X W. Henkel, Constructor University Bremen 7 The kth moment n (k) X 1X k mX = E{(X)k } = k xi · pX (xi ) ≈ xi ni n i=1 xi ∈X The kth central moment (k) X k µX = E{(X − mX ) } = (xi − mX )k · pX (xi ) xi ∈X especially the variance (2) X 2 2 σX = µX = E{(X − mX ) } = (xi − mX )2 · pX (xi ) , xi ∈X which is the square of the standard deviation σX. Note the relation 2 (2) σX = mX − m2X W. Henkel, Constructor University Bremen 8 Combined experiments Let us do two experiments with possible events xi ∈ X and yi ∈ Y. We just can view this as a single experiment with a pair (xi , yi ) as outcome. Correspondingly, we can define joint moments X X mXY = E{XY } = xi · yi · pXY (xi , yi ) xi ∈X yi ∈Y µXY = C{XY } = E{(X − mX )(Y − mY )} = X X = (xi − mX ) · (yi − mY ) · pXY (xi , yi ) xi ∈X yi ∈Y “C” is the covariance. The random variables are said to be statistically independent, if pXY (xi , yi ) = pX (xi ) · pY (yi ) W. 
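As a quick illustration of the estimate p_X(x) ≈ n_x/n and of the moment relation σ_X² = m_X^(2) − m_X², here is a minimal Python sketch (not part of the original slides; the distribution on X = {1, 2, 3, 4} and the number of trials are arbitrary choices):

import random
from collections import Counter

# An arbitrary example pmf on X = {1, 2, 3, 4} (assumption, not from the slides)
pmf = {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.125}

# Exact moments from the pmf
m1 = sum(x * p for x, p in pmf.items())              # linear mean m_X
m2 = sum(x**2 * p for x, p in pmf.items())           # second moment m_X^(2)
var = sum((x - m1)**2 * p for x, p in pmf.items())   # second central moment (variance)
print(var, m2 - m1**2)                               # identical: sigma^2 = m^(2) - m^2

# Estimate p_X(x) ~ n_x / n from n statistical experiments
n = 100_000
samples = random.choices(list(pmf), weights=list(pmf.values()), k=n)
counts = Counter(samples)
for x in sorted(pmf):
    print(x, pmf[x], counts[x] / n)   # true probability vs. relative frequency

The printed variance agrees with m^(2) − m², and the relative frequencies approach the true probabilities as n grows.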
Henkel, Constructor University Bremen 9 Bayes’ theorem pXY (xi , yi ) = pX|Y (xi |yi ) · pY (yi ) = pY |X (yi |xi ) · pX (xi ) with the conditional probabilities pX|Y (xi |yi ) and pY |X (yi |xi ) Total-probability theorem X pX (xi ) = pXY (xi , yi ) yi ∈Y X = pX|Y (xi |yi ) · pY (yi ) yi ∈Y = EY {pX|Y (xi |yi )} X pY (yi ) = pXY (xi , yi ) xi ∈X X = pY |X (yi |xi ) · pX (xi ) xi ∈X = EX {pY |X (yi |xi )} W. Henkel, Constructor University Bremen 10 2 Information and Entropy Requirements for a measure of information I(xi ): I(xi ) should grow with decreasing probability, P (xi ) < P (xj ) =⇒ I(xi ) > I(xj ) especially P (xi ) = 1 =⇒ I(xi ) = 0 I(xi ) = f (P (xi )), with some continuous function f In case of statistical independence, i.e., P (xi , yi ) = P (xi ) · P (yi ), I(xi , yi ) = I(xi ) + I(yi ) The information content (R. Hartley) is defined as Ir (xi ) = logr (1/P (xi )) = − logr P (xi ) W. Henkel, Constructor University Bremen 11 especially, I2 (xi ) = log2 (1/P (xi )) [bit] −⋆− Ie (xi ) = ln (1/P (xi )) [nat] I10 (xi ) = log10 (1/P (xi )) [Hartley] The Entropy is the expectation of the information, i.e., the average information content: P P H(X) = xi ∈X P (xi ) · I(xi ) = xi ∈X P (xi ) · log (1/P (xi )) W. Henkel, Constructor University Bremen 12 Example Let a with probability 1/2 b with probability 1/4 X = c with probability 1/8 d with probability 1/8 The entropy is H(X) = −1/2 · log 1/2 − 1/4 · log 1/4 − 1/8 · log 1/8 − 1/8 · log 1/8 = 7/4 bit = 1.75 bit How would you choose the sequence of questions to find out which element occurred and what would be the average number of steps needed? W. Henkel, Constructor University Bremen 13 Example Binary source, X = {0, 1}, P (0) = p, P (1) = 1 − p The entropy of the binary source is 1 H(p) = p · log p1 + (1 − p) log 1−p := h(p) (Note that we introduced the abbreviation h(p) for the binary entropy function.) 1 0.9 0.8 0.7 0.6 H(p) 0.5 0.4 0.3 0.2 0.1 0 0 0.2 0.4 0.6 0.8 1 p W. Henkel, Constructor University Bremen 14 Some properties of the Entropy H(X) = 0 ⇐⇒ ∃xi ∈ X : P (xi ) = 1 ∧ ∀j 6= i : P (xj ) = 0; Pn Pn Let pi > 0 and qi > 0 for i = 0,... , n and i=1 pi = i=1 qi = 1. Then, Xn Xn − pi log pi ≤ − pi log qi i=1 i=1 Proof: For all positive real numbers x, we have log(x) ≤ (x − 1) log e , i.e., ln(x) ≤ x − 1 3 x-1 2 1 ln(x) 0 -1 -2 -3 -4 0 0.5 1 1.5 2 2.5 3 3.5 x W. Henkel, Constructor University Bremen 15 This leads to n n n n X X X X qi pi log qi − pi log pi = pi log(qi /pi ) ≤ pi − 1 log e = i=1 i=1 i=1 i=1 pi n n n n ! X X X X = (qi − pi ) log e = log e (qi − pi ) = log e qi − pi =0 i=1 i=1 i=1 i=1 For a fixed number n = |X | of possible events, the entropy achieves its maximum for equal probabilities P (xi ), i = 1,... , n 1 1 H(p1 ,... , pn ) ≤ H ,..., n n Proof: Letting qi = 1/n in the previous property, we obtain n X n X n X H(p1 ,... , pn ) = − pi log pi ≤ − pi log(1/n) = − log(1/n) pi = log n i=1 i=1 i=1 W. Henkel, Constructor University Bremen 16 1 1 1 1 1 1 1 1 1 H , ,..., =H , ,..., +H , ,..., n·l n·l n·l n n n l l l due to 1 1 1 H , ,..., = log(n · l) = log n + log l n·l n·l n·l W. Henkel, Constructor University Bremen 17 Entropy of repeated independent trials – a memoryless source (n-fold source extension) n Y P (x) = P (xk 1 , xk 2 ,... , xk n ) = P (xk i ) i=1 W. 
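A small Python check of the entropy example above (an illustration, nothing beyond the slide's own numbers): the four-symbol source has entropy 1.75 bit, and h(p) is the binary entropy function plotted on the following slide; h(0.11) ≈ 0.5 is only a second sample point.

from math import log2

def entropy(probs):
    # H(X) = sum P(x) log2(1/P(x)); terms with P(x) = 0 contribute nothing
    return sum(p * log2(1.0 / p) for p in probs if p > 0)

def h(p):
    # binary entropy function h(p)
    return entropy([p, 1.0 - p])

print(entropy([1/2, 1/4, 1/8, 1/8]))   # 1.75 bit, the example above
print(h(0.5), h(0.11))                 # 1.0 and roughly 0.5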
Henkel, Constructor University Bremen 18 n X 1 H(X ) = P (x) · log n P (x) x∈X M M X X 1 1 = ··· P (xk 1 ) · · · P (xk n ) · log + · · · + log P (xk 1 ) P (xk n ) k1 =1 kn =1 M M M X 1 X X = P (xk 1 ) · log · P (xk 2 ) · · · P (xk n ) + · · · P (xk 1 ) k1 =1 k2 =1 kn =1 | {z } =1 | {z } =1 M M X 1 X 1 = P (xk 1 ) · log + ··· + P (xk n ) · log = n · H(X) P (xk 1 ) P (xk n ) k1 =1 kn =1 W. Henkel, Constructor University Bremen 19 H(X n ) = n · H(X) Example: Source: X = {−, 0, +}; P (−) = 1/2; P (+) = P (0) = 1/4 X2 −− −0 −+ 0− 00 0+ +− +0 ++ 1 1 1 1 1 1 1 1 1 P (xk1 , xk2 ) 4 8 8 8 16 16 8 16 16 1 1 1 3 H(X) = log2 2 + log2 4 + log2 4 = 2 4 4 2 1 1 1 2 12 16 H(X 2 ) = log2 4 + 4 · log2 8 + 4 · log2 16 = + + = 3 = 2 · H(X) 4 8 16 4 8 16 W. Henkel, Constructor University Bremen 20 3 Joint and conditional entropies The joint entropy we already know, without having used the name. It is defined as XX H(X, Y ) = − P (x, y) log P (x, y) = −E{log P (X, Y )} x∈X y∈Y The conditional entropy is defined as X H(Y |X) = P (x) · H(Y |X = x) x∈X X X = − P (x) P (y|x) log P (y|x) x∈X y∈Y XX = − P (x, y) log P (y|x) x∈X y∈Y = −EP (x,y) {log P (Y |X)} W. Henkel, Constructor University Bremen 21 The chain rule H(X, Y ) = H(X) + H(Y |X) Proof: P P H(X, Y ) = − x∈X y∈Y P (x, y) log P (x, y)... Use Bayes’ theorem to do the proof ! Corollary: H(X, Y |Z) = H(X|Z) + H(Y |X, Z) W. Henkel, Constructor University Bremen 22 Chain rule for a collection of random variables n X H(X1 , X2 ,... , Xn ) = H(Xi |Xi−1 ,... , X1 ) i=1 Proof: H(X1 , X2 ) = H(X1 ) + H(X2 |X1 ) H(X1 , X2 , X3 ) = H(X1 ) + H(X2 , X3 |X1 ) = H(X1 ) + H(X2 |X1 ) + H(X3 |X2 , X1 )... H(X1 , X2 ,... , Xn ) = H(X1 ) + H(X2 |X1 ) + · · · + H(Xn |Xn−1 ,... , X2 , X1 ) Xn = H(Xi |Xi−1 ,... , X1 ) i=1 W. Henkel, Constructor University Bremen 23 Independence bound on the entropy n X H(X1 , X2 ,... , Xn ) ≤ H(Xi ) , i=1 with equality if and only if the Xi are statistically independent. Proof: By the chain rule for entropies, n X n X H(X1 , X2 ,... , Xn ) = H(Xi |Xi−1 ,... , X1 ) ≤ H(Xi ) i=1 i=1 This follows from H(X) ≥ H(X|Y ) W. Henkel, Constructor University Bremen 24 4 Markov sources q1 q2 qm random combination network pure recursive structure Markov source A Markov source is a source with memory m. Probability depends on states s(i) = (x(i − 1), x(i − 2),... , x(i − m)) P (x(i)|s(i)) = P (x(i)|x(i − 1), x(i − 2),... , x(i − m)) If we have q discrete elements, there will be q m states. The state transitions are s(i) = (x(i − 1), x(i − 2),... , x(i − m)) −→ s(i+1)) = (x(i), x(i − 1),... , x(i − m + 1)) W. Henkel, Constructor University Bremen 25 Graphical representation by, e.g., the state diagram Example: q = 2, m = 2, X = {0, 1} Transition probabilities P (1|11) = P (0|00) = 0.8 P (0|11) = P (1|00) = 0.2 P (1|10) = P (0|10) = = P (1|01) = P (0|01) = 0.5 W. Henkel, Constructor University Bremen 26 0.8/0 The state transitions are 00 0.2/1 0.5/0 P (11) := P (1|11) · P (11) + P (1|10) · P (10).. 0.5/0. 10 01 0.5/1 P (00) := P (0|00) · P (00) +P (0|01) · P (01) | {z } | {z } | {z } 0.5/1 0.2/0 i+1 i i 11 and P (00)+P (01)+P (10)+P (11) = 1 0.8/1 W. Henkel, Constructor University Bremen 27 The state transitions will usually be described by a probabilistic matrix, often called the Markov matrix Π ps (i) = ps (i − 1)Π = ps (0)Πi with the property m q X πi,j = 1 j=1 Since this was expressed with row vectors, the row sum should be equal to one. When using the transpose, i.e., column vectors, of course, the column sum of the Markov matrix has to add up to one. 
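To make the state-transition description concrete, here is a short Python sketch (an illustration, not from the slides) that builds the Markov matrix Π of the memory-2 example above, checks that every row sums to one, and iterates ps(i) = ps(i−1)Π; the stationary vector quoted in the final comment is simply what this iteration converges to, not a number taken from the lecture.

# States ordered as (00, 01, 10, 11); row j holds the probabilities of the next state.
# Transition probabilities from the example: P(1|11) = P(0|00) = 0.8, all transitions
# out of 01 and 10 have probability 0.5. Emitting symbol x in state (a, b) leads to (x, a).
PI = [
    [0.8, 0.0, 0.2, 0.0],   # from 00: emit 0 -> 00 (0.8), emit 1 -> 10 (0.2)
    [0.5, 0.0, 0.5, 0.0],   # from 01: emit 0 -> 00 (0.5), emit 1 -> 10 (0.5)
    [0.0, 0.5, 0.0, 0.5],   # from 10: emit 0 -> 01 (0.5), emit 1 -> 11 (0.5)
    [0.0, 0.2, 0.0, 0.8],   # from 11: emit 0 -> 01 (0.2), emit 1 -> 11 (0.8)
]

assert all(abs(sum(row) - 1.0) < 1e-12 for row in PI)   # row sums equal one

def step(ps, PI):
    # one application of ps(i) = ps(i-1) * Pi  (row vector times matrix)
    return [sum(ps[j] * PI[j][k] for j in range(len(ps))) for k in range(len(PI))]

ps = [0.25, 0.25, 0.25, 0.25]       # arbitrary starting distribution
for _ in range(200):
    ps = step(ps, PI)
print(ps)   # converges to about (0.357, 0.143, 0.143, 0.357) = (5/14, 1/7, 1/7, 5/14)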
When a stationary state has been reached, we can write ps = ps Π Eigenvector with eigenvalue 1 W. Henkel, Constructor University Bremen 28 Additionally, the stationary state probabilities have to fulfill m p X pi = 1. i=1 In matrix form ps U = 1 1 1 ··· 1 with 1 = (1, 1,... , 1) and U= · · 1 1 ··· 1 We solve for the steady-state probabilities (if they exist) by adding ps Π = ps and ps U = 1 leading to ps (U + Π) = ps + 1 = ps I + 1 ps (U + Π − I) = 1 ps = 1(U + Π − I)−1 W. Henkel, Constructor University Bremen 29 Example: 1 1 3 Π= 4 2 2 The steady-state probabilities will be ps = 1(U + Π − I)−1 −1 1 1 1 1 3 1 0 = (1 1) + − 1 1 4 2 2 0 1 −1 0 1 1 1 3 = (1 1) + 1 0 4 2 2 −1 1 1 7 1 −2 7 1 = (1 1) = (1 1) · = (2 3) 4 6 2 10 6 −1 5 The average number of steps mii necessary to return to a state i equals the reciprocal of the ps i , i.e., m11 = 5/2, m22 = 5/3 W. Henkel, Constructor University Bremen 30 Special cases without a stationary state States are said to communicate, if P (Sn = j|S0 = i) > 0, i.e., that state j is reachable from state i. Each Markov chain has a unique decomposition of the state space S into a S sequence of disjoint subsets C1 , C2 ,..., S = i Ci A Markov chain for which there is only one communication class is called an irreducable Markov chain. A Markov chain is said to be periodic with period k for k = gcd{n > 0 : P (Sn = j|S0 = i) > 0} This means, when recurrence is occuring after 6,8,10,12 steps, k would be 2. An absorbing state is given by Pii = 1 and Pij = 0 for i 6= j. For determining a stationary state, a Markov chain needs to be irreducable, do not have absorbing states, and not be periodic. W. Henkel, Constructor University Bremen 31 Example: We consider a queuing problem where a customer is served at discrete times kT and the maximum queue length allowed is 2. Let a be the probability that a customer is arriving in between iT and (i + 1)T , b the probability that a customer is served and leaves. In each interval [iT, (i + 1)T ) only one customer can arrive or depart. In case all queue spaces are allocated, further customers are not accepted any more. We model the problem with a Markov chain having 3 states, each state for a number of waiting customers (0,1,2). Let a = 1/4, b = 1/2 W. Henkel, Constructor University Bremen 32 The transition matrix (Markov matrix) is then given by (1 − a) a 0 6 2 0 1 Π= b(1 − a) ab + (1 − a)(1 − b) a(1 − b) = 8 3 4 1 0 b (1 − b) 0 4 4 The stationary probabilities will be 1 ps = (6 4 1) 11 ab + (1 − a)(1 − b) a a(1 − b) 0 1 2 1−a b(1 − a) b 1−b W. Henkel, Constructor University Bremen 33 Some terminology... In general, the transition matrices could be time-variant. We assumed this not to be the case, i.e., we assumed homogeneous and stationary Markov chains. Homogeneous means that transition probabilities only depend on the ‘time’ distance. Stationary means that the transition probabilities are constants. If a stationary steady state is reached after an infinite number of steps, the Markov chain is called regular and if the steady state probability is independent of the initial state it is named ergodic. Let the probability for a transition from state j to state k be ∞ X pjk = pjk (n) , n=1 where n is the duration and pjk (n) the probability for a certain transition j → k of duration n. W. Henkel, Constructor University Bremen 34 A state with pjj < 1 is transient. A state with pjj = 1 is recurrent. A null-recurrent state requires an infinite average number of steps to return. 
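The closed-form solution ps = 1(U + Π − I)^(-1) can be checked numerically. The following sketch assumes NumPy is available and reproduces both examples above: the 2-state matrix with ps = (2/5, 3/5) and mean recurrence times 5/2 and 5/3, and the queueing matrix with ps = (6, 4, 1)/11.

import numpy as np

def steady_state(PI):
    # Solve ps (U + PI - I) = 1 for the stationary row vector,
    # i.e. ps = 1 (U + PI - I)^(-1), as derived above.
    n = PI.shape[0]
    U = np.ones((n, n))
    ones = np.ones(n)
    return ones @ np.linalg.inv(U + PI - np.eye(n))

PI2 = np.array([[1, 3], [2, 2]]) / 4
print(steady_state(PI2))           # [0.4 0.6] = (2/5, 3/5)
print(1 / steady_state(PI2))       # mean recurrence times m11 = 5/2, m22 = 5/3

PI3 = np.array([[6, 2, 0], [3, 4, 1], [0, 4, 4]]) / 8   # queueing example, a = 1/4, b = 1/2
print(steady_state(PI3))           # [6/11 4/11 1/11]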
A positive recurrent state requires a finite average number of steps to return. Positive recurrent states can be periodic or non-periodic. Periodic ones allow only to return at i · k steps, i, k ∈ N, k > 1. Can a state k be reached from state j and vice versa, they are said to communicate. If all state pairs communicate, the Markov chain is said to be irreducible. In such a Markov chain, all states are transient, null-recurrent, positive recurrent, periodic, or non-periodic. W. Henkel, Constructor University Bremen 35 Entropy calculations for the Markov chain... Information content for a state sj : 1 I(xk |sj ) = log , k = 1,... , q P (xk |sj ) Average information content for a state sj : H(X|sj ) = Exk {I(xk |sj )} Xq = P (xk |sj ) · I(xk |sj ) k=1 q X 1 = P (xk |sj ) · log , j = 1,... , q m P (xk |sj ) k=1 W. Henkel, Constructor University Bremen 36 Average information content (entropy) of the Markov chain over all states sj : H(X) = Esj {H(X|sj )} m q X q X = P (sj ) · P (xk |sj ) · j=1 k=1 ·I(xk |sj ) m q X q X H(X) = P (xk , sj ) · I(xk |sj ) j=1 k=1 X X X ··· P (x(i), x(i − 1),... , x(i − m)) · x(i) x(i−1) x(i−m) 1 · log P (x(i)|x(i − 1),... , x(i − m)) W. Henkel, Constructor University Bremen 37 Modeling of English text by Markov chains of different memories m=0: AI NGAE ITF NNR ASAEV OIE BAINTHA BYROO POER SETRYGAIETRWCO EHDUARU EU C FT NSREM DIY EESE F O SRIS R UNNASHOR... m=1: URTESHETHING AD E AT FOULE ITHALIORT WACT D STE MINTSAN OLNS TWID OULY TE THIGHE CO YS TH HR UPAVIDE PAD CTAVED... m=2: IANKS CAN OU ANG RLER THATTED OF SHAR OF TO HAVEMEM A I MAND AND BUT WHISSITABLY THERVEREER EIGHTS TAKILLIS TA... W. Henkel, Constructor University Bremen 38 Examples and derivations have partly been taken from the following references: Wolfgang Sauer-Greff, Einführung in die Informations- und Codierungstheorie, lecture notes, Uni Kaiserslautern, May 2001. Hans Tzschach, Informationstheoretische Codierungstheorie, TU Darmstadt Thomas M. Cover, Joy A. Thomas, Elements of Information Theory, Wiley 1990 Eberhard Hänsler, Grundlagen der Theorie statistischer Signale, Springer 1983. Athanasios Papoulis, Probability, Random Variables, and Stochastic Processes, McGraw-Hill 1985. W. Henkel, Constructor University Bremen 39 5 Some more lemmas on entropy... Fano’s lemma Let X and Y be random variables with values from the set X = {x1 , x2 ,... , xν }, with ν = |X |. Furthermore, let Pe = P (X 6= Y ). We then obtain H(X|Y ) ≤ h(Pe ) + Pe log2 (ν − 1) Proof: We make use of an error indicator 0, if X = Y Z= 1, if X 6= Y From the chain rule, we know H(XZ|Y ) = H(X|Y ) + H(Z|XY ) = H(X|Y ) , W. Henkel, Constructor University Bremen 40 where the final equality is due to the fact that X and Y uniquely determine Z. Furthermore, H(X|Y ) = H(XZ|Y ) = H(Z|Y ) + H(X|Y Z) (chain rule) ≤ H(Z) + H(X|Y Z) First, we note that H(X|Y, Z = 0) = 0, since X is uniquely defined. There are ν − 1 cases with Z = 1, leading to H(X|Y, Z = 1) ≤ log2 (ν − 1) Thus, H(X|Y Z) ≤ P (Z = 1) log2 (ν − 1) = Pe log2 (ν − 1) Finally, we insert H(Z) = h(Pe ) = −Pe log2 Pe − (1 − Pe ) log2 (1 − Pe ) W. Henkel, Constructor University Bremen 41 The special case: no knowledge of Y Without loss of generality, let us assume that the probabilities of the elements of X are ordered as p1 ≥ p2 ≥ · · · ≥ pν. Since the first element has highest probability, the best guess would be X̂ = 1 and the probability of error would then be Pe = 1 − p1. Writing Fano’s inequality for this case, leads to H(Pe ) + Pe · log2 (ν − 1) ≥ H(X). 
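As a worked illustration of H(X) = Σ_j P(s_j) H(X|s_j) (not from the slides), the following Python sketch evaluates the entropy of the memory-2 Markov source used earlier; the stationary state probabilities are those obtained from the steady-state computation above, and the resulting value of roughly 0.80 bit/symbol is this sketch's own output, not a number quoted in the lecture.

from math import log2

# Conditional symbol probabilities P(x | state) for the memory-2 example
# (states 00, 01, 10, 11; P(0|00) = P(1|11) = 0.8, states 01 and 10 emit 0/1 with 0.5 each).
P_cond = {
    '00': {'0': 0.8, '1': 0.2},
    '01': {'0': 0.5, '1': 0.5},
    '10': {'0': 0.5, '1': 0.5},
    '11': {'0': 0.2, '1': 0.8},
}
# Stationary state probabilities (from the steady-state sketch above)
P_state = {'00': 5/14, '01': 1/7, '10': 1/7, '11': 5/14}

# H(X|s_j) = sum_k P(x_k|s_j) log2 (1/P(x_k|s_j))
H_cond = {s: sum(p * log2(1/p) for p in d.values()) for s, d in P_cond.items()}
# H(X) = sum_j P(s_j) H(X|s_j), the entropy of the Markov source
H = sum(P_state[s] * H_cond[s] for s in P_state)
print(H)   # roughly 0.80 bit per symbol for this example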
Equality would be achieved with the probability mass function Pe Pe (p1 , p2 ,... , pν ) = 1 − Pe , ,..., ν−1 ν−1 W. Henkel, Constructor University Bremen 42 What is the intuition behind Fano’s lemma? H(X|Y ) ≤ h(Pe ) + Pe log2 (ν − 1) Let H(X|Y ) be the information that is needed to determine X when Y is known. One way to do this, is to check if X = Y. If this is the case, we are done. If however X 6= Y , ν − 1 possibilities are left for X. The first step, i.e., determining if X = Y , requires h(Pe ) bit. If X 6= Y , which occurs with the probability Pe , at most log2 (ν − 1) further bits will be needed. W. Henkel, Constructor University Bremen 43 The sum h(Pe ) + Pe log2 (ν − 1) is positive for all Pe with 0 < Pe ≤ 1. h(Pe ) + Pe · log2 (ν − 1) log2 ν log2 (ν − 1) ν−1 0 ν 1 Pe If H(X|Y ) > 0, a lower limit for the error probability results. Sometimes, even an upper and lower bound are obtained. W. Henkel, Constructor University Bremen 44 Examples: Let ν = 2 and assume H(X|Y ) = 1/2 =⇒ h(Pe ) ≥ 1/2 and consequently, 0.11 ≤ Pe ≤ 0.89 Let ν = 3, instead, and assume H(X|Y ) = 1/2 1 =⇒ h(Pe )+Pe ≥ 1/2 0.8 Pe should be 0.6 H(p) greater or equal the 0.4 intersection. 0.2 0 0 0.2 0.4 0.6 0.8 1 p W. Henkel, Constructor University Bremen 45 Exercises: E1: We toss dice and see the outcome as possible events of the set {ω1 , ω2 ,... , ω6 }. We distinguish the following random variables: 1. X, where X is the obtained count. 2. Y , where Y = even, if an even number has been obtained, otherwise Y = odd. 3. Z, where Z = S, if a value smaller or equal to three has been obtained, otherwise Z = G. Determine the following entropies: 1. H(X) 5. H(X, Z) 2. H(Y ) 6. H(Y, Z) 3. H(Z) 7. H(X, Y, Z) 4. H(X, Y ) W. Henkel, Constructor University Bremen 46 E2: Let (X1 , X2 , X3 ) be a vector-valued random variable that can only assume the values (000), (011), (101), and (110) with the same probability of 1/4. Determine the entropies 1. H(X1 ) 4. H(X1 , X2 , X3 ) 2. H(X1 , X2 ) 5. H(X3 |X1 , X2 ) 3. H(X2 |X1 ) 6. H(X3 ) W. Henkel, Constructor University Bremen 47 E3: Let the quality of a weather forecast be given by the joint (simultaneous) probabilities of the weather forecast (WF) and the actual weather (AW). The following table shows these two-dimensional probabilities WF \ AW rain sunsh. rain 1/4 1/2 sunsh. 0 1/4 We realize that the forecast is only right in 50% of the time, whereas we would be right in 75% of the time by just assuming that it’s always sunny. Why does such a forecast still make sense? Give an information-theoretic reason. W. Henkel, Constructor University Bremen 48 Solution E3: We abbreviate the actual weather as Y and the weather forecast as X, furthermore rain to be ”0” and sunshine to be ”1”. We rewrite the joint probabilities (y, x) P (y, x) (0, 0) 1/4 (0, 1) 0 (1, 0) 1/2 (1, 1) 1/4 The last two probabilities add up to the 75 % and the inner two add up to 50 %. We thus obtain 1 3 P (y = 0) = 1/4 P (x = 0) = 3/4 H(Y ) = 4 log2 (4) + 4 log2 ( 34 ) =⇒ 3 P (y = 1) = 3/4 P (x = 1) = 1/4 = log2 (4) − 4 log2 (3) = 0.81 W. Henkel, Constructor University Bremen 49 Let us compute the entropy P H(Y |X) = − (yj ,xi ) PY,X (yj , xi ) · log2 PY |X (yj |xi ) Using Bayes’ rule P (y|x) = P (y, x)/P (x), we obtain the table (y, x) P (y|x) (0, 0) 1/4 · 4/3 = 1/3 (0, 1) 0 · 4/1 = 0 (1, 0) 1/2 · 4/3 = 2/3 (1, 1) 1/4 · 4/1 = 1 and thus the conditional entropy, i.e., the uncertainty of the weather after the forecast is 1 1 1 2 1 H(Y |X) = − · log2 + 0 + · log2 + · log2 1 = 0.689 4 3 2 3 4 W. 
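The weather-forecast numbers are easy to reproduce; here is a small Python check (an illustration only) of H(Y) ≈ 0.811 and H(Y|X) ≈ 0.689 from the joint probability table above.

from math import log2

# Joint probabilities P(y, x) from the weather example: y = actual weather, x = forecast,
# 0 = rain, 1 = sunshine.
P_joint = {(0, 0): 1/4, (0, 1): 0.0, (1, 0): 1/2, (1, 1): 1/4}

P_x = {x: sum(p for (yy, xx), p in P_joint.items() if xx == x) for x in (0, 1)}
P_y = {y: sum(p for (yy, xx), p in P_joint.items() if yy == y) for y in (0, 1)}

H_Y = sum(p * log2(1/p) for p in P_y.values() if p > 0)
# H(Y|X) = - sum P(y,x) log2 P(y|x), with P(y|x) = P(y,x)/P(x) by Bayes' rule
H_Y_given_X = -sum(p * log2(p / P_x[x]) for (y, x), p in P_joint.items() if p > 0)

print(H_Y, H_Y_given_X)   # about 0.811 and 0.689: the forecast reduces the uncertainty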
Henkel, Constructor University Bremen 50 If the weather forecast would have been useless, H(Y |X) should be equal to H(Y ), i.e., the weather forecast does not carry addition information that would change the entropy. H(Y ) = 0.81 > H(Y |X) = 0.689 The entropy (uncertainty) is reduced by the weather forecast! W. Henkel, Constructor University Bremen 51... Jensen’s inequality Definition: A function f (x) is said to be convex over an interval (a, b) if for every x1 , x2 ∈ (a, b) and 0 ≤ λ ≤ 1, f (λx1 + (1 − λ)x2 ) ≤ λf (x1 ) + (1 − λ)f (x2 ) A function f is said to be strictly convex if equality holds only if λ = 0 or λ = 1. A function f is concave if −f is convex This means that a function is convex if it always lies below a straight line between the points (x1 , f (x1 )) and (x2 , f (x2 )). Some convex functions are: x2 , |x|, ex , x log x (for x ≥ 0),..., √ some concave functions are: log x, x for x ≥ 0),.... W. Henkel, Constructor University Bremen 52 Theorem: If the function f has a second derivate which is non-negative (positive), then the function is convex (strictly convex). Proof: Taylor expansion around x0 ′′ ′ f (x0 ) f (x) = f (x0 ) + f (x0 )(x − x0 ) + (x − x0 )2 +... 2 Let us assume that f ′′ (x0 ) ≥ 0. Let x0 = λx1 + (1 − λ)x2 and x = x1 to obtain f (x1 ) ≥ f (x0 ) + f ′ (x0 ) [(1 − λ)(x1 − x2 )] | ·λ For x = x2 , we obtain f (x2 ) ≥ f (x0 ) + f ′ (x0 ) [λ(x2 − x1 )] | · (1 − λ) Adding both with the factors λ and (1 − λ), respectively, yields the W. Henkel, Constructor University Bremen 53 theorem. The case of strict convexity can be shown in a similar fashion. More often used! W. Henkel, Constructor University Bremen 54 Jensen’s inequality If f is a convex function and X is a random variable, then E{f (X)} ≥ f (E{X}) Moreover, if f is strictly convex, then equality implies that X = E{X} with probability 1, i.e., X is a constant. Proof: The proof starts from a two mass point distribution p1 f (x1 ) + p2 f (x2 ) ≥ f (p1 x1 + p2 x2 ) This directly follows from convexity. Why? W. Henkel, Constructor University Bremen 55 We now take the induction hypothesis that it was true for k − 1 mass points. Then writing p′i = pi /(1 − pk ) for i = 1, 2,... , k − 1, we have k X k−1 X pi f (xi ) = pk f (xk ) + (1 − pk ) · p′i f (xi ) i=1 i=1 k−1 ! X ≥ pk f (xk ) + (1 − pk ) · f p′i xi i=1 k−1 ! X ≥ f pk xk + (1 − pk ) · p′i xi i=1 k ! X = f pi xi i=1 The transition to continuous distributions is straight forward. W. Henkel, Constructor University Bremen 56... Some more Information and Entropy definitions Fisher information: The Fisher information J(θ) is the variance of the so-called score, ( 2 ) ∂ J(θ) = Eθ ln f (x; θ) ∂θ with an indexed set of densities f (x; θ). Rényi entropy: m 1 X Hm (P1 ,..., Pm ) = log Piα 1−α i=1 with some Rényi order α > 0, α 6= 1 W. Henkel, Constructor University Bremen 57 Gibbs entropy (thermodynamics): X S = −kB Pi ln Pi i with the probabilities of microstates Pi and the Boltzmann constant kB = 1.38065 · 10−23 J/K Boltzmann entropy (thermodynamics): S = kB ln W with the number of microstates W. W. Henkel, Constructor University Bremen 58 Examples and derivations have partly been taken from the following references: Rolf Johannesson, Informationstheorie, Grundlage der Telekommunikation, Addison-Wesley 1992. Thomas M. Cover, Joy A. Thomas, Elements of Information Theory, Wiley 1990. Imre Csiszár, János Körner , Information Theory, Akadémiai Kiadó, Budapest 1981. W. Henkel, Constructor University Bremen 59 6 Something on typical sequences... 
A coin flipping experiment p: probability for head, q = 1 − p: probability for tail If we do n tossing experiments, we can say with a probability near 1, what the contents of the sequence will be. Let rn = ♯heads ♯f lips = k n. Chebyshev’s inequality tells us σ2 pq P (|rn − µr | > ǫ) = P (|rn − p| > ǫ) ≤ 2 = 2 , ǫ nǫ σ 2 and µr = p being the variance and mean of rn , respectively, and ǫ > 0. This is called the asymptotic equipartition property (AEP) and is valid for long sequences of random variables. We will study only the case of “iid” (independent and identically distributed) random variables, although AEP is also valid for other classes of sequences, especially ergodic sequences. W. Henkel, Constructor University Bremen 60 The idea: AEP splits the set of sequences in typical and untypical ones. Although the typical ones will be much less than the whole set, its probability will sum up to almost one. This effect will be more pronounced with increasing sequence length. Thus, reducing ǫ and increasing n would make the probability diminish that an untypical sequence would be used. Hence, only typical sequences would need to be encoded. The set will be much smaller and also the corresponding code sequence can be much shorter. W. Henkel, Constructor University Bremen 61 Proof of the Tschebyscheff ’s inequality Discrete case Continuous case P∞ R∞ σ2 = −∞ (x − µ) 2 PX (x) σ2 = −∞ (x − µ) 2 fX (x)dx P 2 R 2 > |x−µ|≥ǫ (x − µ) PX (x) > |x−µ|≥ǫ (x − µ) fX (x)dx P 2 R 2 > |x−µ|≥ǫ ǫ PX (x) > |x−µ|≥ǫ ǫ fX (x)dx 2 P 2 R = ǫ · |x−µ|≥ǫ PX (x) = ǫ · |x−µ|≥ǫ fX (x)dx = ǫ2 · P (|x − µ| ≥ ǫ) = ǫ2 · P (|x − µ| ≥ ǫ) =⇒ P (|x − µ| ≥ ǫ) < σ 2 /ǫ2 W. Henkel, Constructor University Bremen 62 pq We still need to show that σ 2 = n Let pn (k) be the probability of obtaining k heads amongst n trials. n k n−k n! pn (k) = p q = pk q n−k k (n − k)!k! We note that ( 2 ) n k 1 n o 1 X 2 σ2 = E −p = 2 · E (k − np) = 2 (k − np)2 pn (k) n n n k=0 n X n X n2 σ 2 = k 2 pn (k) − 2np kpn (k) + n2 p2 k=0 k=0 Xn = k 2 pn (k) − 2n2 p2 + n2 p2 k=0 Xn = k 2 pn (k) − n2 p2 k=0 W. Henkel, Constructor University Bremen 63 Pn 2 Pn n! k=0k pn (k) = k=1 k (n−k)!(k−1)! pk q n−k Pn n! k n−k Pn n! = k=1 (k − 1) · (n−k)!(k−1)! p q + k=1 (1) · (n−k)!(k−1)! pk q n−k Pn n! k n−k Pn n! = k=2 · (n−k)!(k−2)! p q + k=1 (n−k)!(k−1)! pk q n−k substituting k ′ = k − 2 in the first term, leads to Pn−2 n! k′ +2 n−k′ −2 Pn n! = k′ =0 (n−k′ +2)!k! p q + k=1 (n−k)!(k−1)! pk q n−k 2 n−2 Pn n! = n(n − 1)p (p + q) + k=1 (n−k)!(k−1)! pk q n−k 2 Pn n! = n(n − 1)p + k=1 k · (n−k)!(k)! pk q n−k = n(n − 1)p2 + np = n2 p2 − np2 + np = n2 p2 + np(1 − p) = n2 p2 + npq pq leads to n2 σ 2 = n2 p2 + npq − n2 p2 = npq =⇒ σ 2 = n W. Henkel, Constructor University Bremen 64 Note: The typical sequence set has in total the highest probability. Its element sequences need not necessarily have higher probability than sequences in the non-typical set. E.g., obtaining np head outputs from the coin experiment would be typical, but the probability of having n times a heads output would actually be bigger for an unfair coin with p > 1/2! Let X1 , X2 ,... , Xn be a sequence of iid random variables, e.g., n coin flipping experiments. The probability of a certain sequence will be n Y PX (x) = PX (xi ) i=1 We say that the event x = x1 , x2 ,... , xn is an ǫ-typical sequence, if the arithmetic mean of its information values n n 1 1X 1X − log PX (x) = − log PX (xi ) = − IX (xi ) n n i=1 n i=1 differs from the stochastic average of IX (x), the entropy, only by ǫ. W. 
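To see the asymptotic equipartition property numerically, the following Python sketch (an illustration; the bias p = 0.3 and the sequence lengths are arbitrary choices) draws iid sequences from a biased coin and compares −(1/n) log2 P(X1, ..., Xn) with the entropy H(X).

from math import log2
import random

p = 0.3                                     # P(head), an arbitrary example value
H = -p * log2(p) - (1 - p) * log2(1 - p)    # entropy of a single flip

def per_symbol_information(n):
    # Draw an iid sequence of length n and return -(1/n) log2 P(sequence).
    k = sum(random.random() < p for _ in range(n))       # number of heads
    return -(k * log2(p) + (n - k) * log2(1 - p)) / n

for n in (10, 100, 1000, 10000):
    print(n, H, per_symbol_information(n))
# As n grows, -(1/n) log2 P(X1..Xn) concentrates around H(X): the drawn sequence is typical.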
Henkel, Constructor University Bremen 65 Definition: The set Aǫ of ǫ-typical sequences x = x1 , x2 ,... , xn is defined as 1 Aǫ (X) := x : − log PX (x) − H(X) ≤ ǫ n The equipartition property will be formulated as follows: Theorem: For each ǫ > 0 exists an n ∈ N, such that Aǫ (X) has the following properties: 1. P (Aǫ (X)) ≥ 1 − ǫ 2. x ∈ Aǫ (X) ⇒ − n1 log PX (x) − H(X) ≤ ǫ 3. (1 − ǫ)2n(H(X)−ǫ) ≤ |Aǫ (X)| ≤ 2n(H(X)+ǫ) W. Henkel, Constructor University Bremen 66 Proof: 1. Follows from the convergence of − n1 log PX (x) in the definition of Aǫ (X) 2. Follows from the definition of Aǫ (X) 3. Follows from X X 1≥ PX (x) ≥ 2−n(H(X)+ǫ) = |Aǫ (X)| · 2−n(H(X)+ǫ) Aǫ (X) Aǫ (X) and X X 1−ǫ≤ PX (x) ≤ 2−n(H(X)−ǫ) = |Aǫ (X)| · 2−n(H(X)−ǫ) Aǫ (X) Aǫ (X) We made use of the the second property, obtaining 2−n(H(X)+ǫ) ≤ PX (x) ≤ 2−n(H(X)−ǫ). W. Henkel, Constructor University Bremen 67 Examples: Assume we have a container with 2 white and one black ball. The experiment is to blindly select 5 times a ball, putting it back every time. There will be 32 possible combinations. The entropy will be H(X) = h(1/3) = 0.918. We choose ǫ = 0.138 (15 % of h(1/3)). We’ll then have 0.0257 ≤ PX (x) ≤ 0.067 According to property 3: 14.94 ≤ |Aǫ (X)| ≤ 38.89 We will actually see 15 such ǫ-typical sequences adding up to a probability of 0.658. W. Henkel, Constructor University Bremen 68 W. Henkel, Constructor University Bremen 69 We now select a smaller ǫ = 0.046 (roughly 5 % of H(X)) and stepwise increase the length n n (1 − ǫ)2n(H(X)−ǫ) |Aǫ (X)| 2n(H(X)+ǫ) P (Aǫ (X)) 100 287.2 292.6 296.4 0.660 500 2436.1 2474.9 2482.1 0.671 1000 2872.3 2953.4 2964.2 0.998 2000 21744.7 21910.3 21928.4 1.000 The number of ǫ-typical sequences of length n = 1000 is 2953.4 , which is only a fraction of 2953.4 /21000 = 2−46.6 ≈ 10−14. W. Henkel, Constructor University Bremen 70 Shannon’s Source Coding Theorem For the probability of a perfect reconstruction approaching one if the length is going to infinity, it is necessary and sufficient to use H(X) bit/source symbol for encoding. Proof: For an arbitrary ǫ > 0 there will be a sufficiently large n such that a block of n source symbols X1 , X2 ,... , Xn can be uniquely encoded with n(H(X) + ǫ) bits, despite of non-typical source sequences, whose probability will in total be less than ǫ. This follows from 1 and 3 of the previous theorem. From the left limit of 3 also follows that H(X) is not only sufficient, but also asymptotically necessary, if the probability of perfect reconstruction should approach 1. Comment: The encoding could practically be carried out by numbering all ntuples in Aǫ (X) and using, e.g., a binary representation of these numbers. W. Henkel, Constructor University Bremen 71 Example: H(X) = 0.3333, ǫ = 0.033 (10 %) n (1 − ǫ)2n(H(X)−ǫ) |Aǫ (X)| 2n(H(X)+ǫ) P (Aǫ (X)) 100 230.0 230.2 236.7 0.165 500 2150.0 2175.5 2183.4 0.544 1000 2300.0 2358.0 2366.4 0.708 2000 2600.0 2723.4 2733.4 0.876 5000 21500.1 21820.5 21833.5 0.987 10000 23000.3 23649.6 23667.1 1.000 W. Henkel, Constructor University Bremen 72 H(X) = 0.3333, ǫ = 0.0167 (5 %) n (1 − ǫ)2n(H(X)−ǫ) |Aǫ (X)| 2n(H(X)+ǫ) P (Aǫ (X)) 100 231.6 230.2 235.0 0.165 500 2158.3 2167.9 2175.0 0.291 1000 2316.7 2342.8 2350.0 0.401 2000 2633.4 2693.1 2700.1 0.571 5000 21583.4 21741.1 21750.2 0.784 10000 23167.0 23490.8 23500.4 0.923 20000 26334.0 26987.0 27000.8 0.987 Note: A smaller ǫ requires a higher n! W. Henkel, Constructor University Bremen 73 7 Source encoding u ∈ Um x ∈ Xn Encoder C Let U be a qary source and X an rary symbol set. 
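The urn example above can be verified by brute force; this Python sketch (an illustration, not part of the slides) enumerates all 32 length-5 sequences, counts the ε-typical ones, and evaluates 2^(n(H−ε)) and 2^(n(H+ε)).

from math import log2
from itertools import product

# Urn example: P(white) = 2/3, P(black) = 1/3, n = 5 draws with replacement, eps = 0.138
p = {'w': 2/3, 'b': 1/3}
n, eps = 5, 0.138
H = sum(q * log2(1/q) for q in p.values())          # h(1/3) = 0.918

typical = []
for seq in product('wb', repeat=n):                 # all 2^5 = 32 sequences
    P = 1.0
    for s in seq:
        P *= p[s]
    if abs(-log2(P) / n - H) <= eps:                # the epsilon-typicality condition
        typical.append(P)

print(len(typical), sum(typical))                   # 15 sequences with total probability ~0.658
print(2**(n*(H - eps)), 2**(n*(H + eps)))           # about 14.9 and 38.9, cf. the bounds above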
For every ui ∈ U m , an encoding rule C selects a code sequence (or block) xj ∈ X W C : ui ∈ U m −→ xj ∈ X n W. Henkel, Constructor University Bremen 74 Example: C1 C2 C3 C4 C5 C6 x1 − −− − − − − x2 +− −+ +−− +− −+ +− x3 −− +− ++− ++− −++ ++− x4 −+ ++ ++ + + +− +++ +++ block code no yes no no no no uniquely decodable no yes no yes yes yes P|C| 1 li i=1 2 1 41 > 1 1 1 (!?) 15 16 ≤1 1 1 instantan. block code comma code decodable no yes no yes no yes Draw the code trees of codes 1, 2, and 6! W. Henkel, Constructor University Bremen 75 The efficiency of source encoding Definition: The average codeword length is given by k X l= li PU (ui ) i=0 with k being the number of possible source vectors ui and li the length of the encoded sequence. We require the following: No two codewords shall be identical, i.e., xi 6= xj for i 6= j. This means that H(U) = H(X) No codeword shall be prefix of another longer codeword. This means that the code sequence (its end) can be detected also when it is in continuous operation. The second property we call prefix-free or instantaneous. W. Henkel, Constructor University Bremen 76 Kraft’s (McMillan’s) inequality An rary prefix-free code with codeword lengths l1 , l2 ,... , lk (∈ N) exists if and only if Pk −li i=1 r ≤1 Proof: We use the fact that for an rary tree of depth n there will be r n−l end nodes diverging from a node at depth l, where l < n. Let us assume there would be an rary prefix-free code, whose codeword lengths are l1 , l2 ,... , lk and let n = maxi {li } + 1. Now construct the tree of the code, where we successively cut parts from the full rary tree of depth n. Each time we cut a sub-tree from the tree to obtain a codeword xi , we eliminate r n−li end nodes. This is due to the prefix-free condition meaning none of these nodes can have paths through xi. In total we have r n end nodes that can be eliminated, we get k X r n−l1 + r n−l2 + · · · + r n−lk ≤ r n =⇒ r −li ≤ 1 i=1 W. Henkel, Constructor University Bremen 77 Proof cont’d: This was only one direction of the proof. Let us now start from assuming Pk l1 , l2 ,... , lk positive integers such that i=1 r −li ≤ 1. Furthermore, let these integer lengths be ordered: l1 ≤ l2 ≤ · · · ≤ lk. Let again n = maxi {li } + 1. We again start from a full tree of depth n. Let us consider the following algorithm 1. i ← 1 2. Choose an arbitrary node xi at depth li that has not yet been processed. Cut the tree at xi. Stop as soon as no node is left. 3. In case i = k, stop 4. Let i ← i + 1 and goto Step 2 W. Henkel, Constructor University Bremen 78 Proof cont’d: If we would be able to freely choose all xi in Step 2, we would have constructed an rary prefix-free code with the given codeword lengths. Let us assume that indeed we were able to freely choose the x1 , x2 ,... , xi−1 in Step 2. The number of remaining nodes at depth n would then be n n−l1 n−l2 n−li−1 r − r +r + ···+ r = i−1 X r n · 1 − r −lj > 0 , i ≤ k , j=1 Pk−1 since j=1 r −lj < 1 If there are nodes at depth n, there should also be some at depth li < n. Since l1 ≤ l2 ≤ · · · ≤ li , no other already chosen codeword could emerge from such a node. Thus, the node can be chosen for xi. This concludes the proof. W. 
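The tree construction used in the proof can be turned into a short program. The following Python sketch (an illustration) checks Kraft's inequality for a list of binary codeword lengths and, if it holds, assigns codewords depth by depth so that no codeword is a prefix of another; the lengths 1, 2, 3, 3 correspond to code C6 of the earlier table.

def kraft_sum(lengths, r=2):
    # Kraft/McMillan sum: an r-ary prefix-free code with these lengths exists iff it is <= 1
    return sum(r**(-l) for l in lengths)

def prefix_code(lengths):
    # Build a binary prefix-free code by assigning, in order of increasing length,
    # the leftmost node at each depth that is not below an already chosen codeword.
    assert kraft_sum(lengths) <= 1, "no prefix-free code with these lengths exists"
    code, next_val, prev_len = {}, 0, 0
    for i, l in sorted(enumerate(lengths), key=lambda t: t[1]):
        next_val <<= (l - prev_len)                 # descend to depth l in the code tree
        code[i] = format(next_val, '0{}b'.format(l))
        next_val += 1                               # next codeword starts one node to the right
        prev_len = l
    return code

lengths = [1, 2, 3, 3]
print(kraft_sum(lengths), prefix_code(lengths))     # 1.0 and {0: '0', 1: '10', 2: '110', 3: '111'}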
Henkel, Constructor University Bremen 79 A bound for the average codeword length P −1 l q l Let Qi := r1 · i 1 j=1 r j with q = |U | Then, by using an earlier theorem q q q Pq −lj X 1 X 1 X j=1 r Hr (U) = Pi · logr ≤ Pi · logr = Pi · logr i=1 P i i=1 Qi i=1 r −li q X q X q X q X q X ≤ Pi · logr r li + Pi · logr r −lj = li · Pi · logr r + logr r −lj | {z } i=1 i=1 j=1 i=1 j=1 | {z } =1 | {z } =l ≤0 =⇒ Hr (U) ≤ l We obtain equality, i.e., the optimal case when q li X −lj 1 1 r = 1 and Pi = Qi = ∀i =⇒ li = logr j=1 r Pi W. Henkel, Constructor University Bremen 80 Since it is only possible to have integer li ∈ N, one may not be able to fulfill the above (right) condition. Then, we would go for the next bigger integer with logr P1i ≤ li < 1 + logr P1i , where we can ensure that the minimum average length, a so-called compact code, is achieved. We obtain q q q q X 1 X X X 1 Pi · logr ≤ Pi · li < Pi + Pi · logr i=1 P i i=1 i=1 i=1 Pi Hr (U) ≤ l < 1 + Hr (U) li Pq li Pq Remark: If (1/r) ≤ Pi , Kraft’s inequality i=1 (1/r) ≤ i=1 Pi = 1 is always fulfilled. W. Henkel, Constructor University Bremen 81 Increasing the block size at the source, e.g., using an n-fold extension, we would obtain Hr (Xn ) ≤ ln < Hr (Xn ) + 1 n · Hr (X) ≤ ln < n · Hr (X) + 1 Hr (X) ≤ ln /n < Hr (X) + 1/n Note that when using the binary logarithm instead of logr , we obtain H(X) H(X) log2 (r) ≤ ln /n < log2 (r) + 1/n This means the entropy H(X) is the lower limit of the average length required for a source code. It thus makes sense to define the efficiency. Definition: The efficiency η of a source code is the ratio of the entropy over the average length, or in other words, the share of the source entropy per code symbol. η = Hr (U)/l , ρ = 1 − η = l − Hr (U) /l W. Henkel, Constructor University Bremen 82 For n → ∞ =⇒ η → 1, ρ → 0 W. Henkel, Constructor University Bremen 83 8 The Huffman code Example: U = {u1 , u2 ,... , u6 } with probabilities P = (0.4, 0.3, 0.1, 0.1, 0.06, 0.04), H(U ) = 2.14 bit P (0) x(0) P (1) x(1) P (2) x(2) P (3) x(3) P (4) x(4) u1 0.4 1 0.4 1 0.4 1 0.4 1 0.6 0 u2 0.3 00 0.3 00 0.3 00 0.3 00 0.4 1 u3 0.1 011 0.1 011 0.2 010 0.3 01 u4 0.1 0100 0.1 0100 0.1 011 u5 0.06 01010 0.1 0101 u6 0.04 01011 l = 2.2 , η = 0.97 Draw the corresponding tree! W. Henkel, Constructor University Bremen 84 Examples and derivations have partly been taken from the following references: Wolfgang Sauer-Greff, Einführung in die Informations- und Codierungstheorie, lecture notes, Uni Kaiserslautern, May 2001. Rolf Johannesson, Informationstheorie, Grundlage der Telekommunikation, Addison-Wesley 1992. W. Henkel, Constructor University Bremen 85 9 Lempel-Ziv 77 algorithm Huffman was fixed-to-variable length encoding. Lempel-Ziv realizes variable-to-fixed length encoding. Let us consider an example where encoding has already been performed until the marked letter ”b”. a b c b b a c d e b b a d e a a... LZ77 checks the already processed data for a biggest match. This leads to bba being encoded as (3,3,d), where the first number specifies that start of the match, the second is the length of the match, and then d specifies the next new character that is not identical to the earlier encoded data. W. Henkel, Constructor University Bremen 86 The practical realization uses a windowed approach: search buffer look-ahead b. z }| { z }| { a b c b b a c d e b b a d e a a... | {z } window LZ77 example Input string: 001010210210212021021200... 
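(Returning briefly to the Huffman example above:) a compact Python sketch of the Huffman procedure, given here as an illustration rather than the construction drawn on the slide. Ties may be merged in a different order than in the table, so the individual codeword lengths can differ, but the average length of 2.2 bit, the entropy of 2.14 bit, and the efficiency of about 0.97 are reproduced.

import heapq
from math import log2

def huffman_lengths(probs):
    # Standard binary Huffman construction with a min-heap; returns codeword lengths.
    heap = [(p, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, ids1 = heapq.heappop(heap)
        p2, ids2 = heapq.heappop(heap)
        for i in ids1 + ids2:
            lengths[i] += 1                   # every merge adds one bit to all merged symbols
        heapq.heappush(heap, (p1 + p2, ids1 + ids2))
    return lengths

P = [0.4, 0.3, 0.1, 0.1, 0.06, 0.04]
L = huffman_lengths(P)
avg = sum(p * l for p, l in zip(P, L))
H = sum(p * log2(1/p) for p in P)
print(L, avg, H, H / avg)   # average length 2.2 bit, H = 2.14 bit, efficiency ~0.97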
Length of look-ahead buffer: Ls = 9 Window size: N = 18 Codeword length: Lc = logα (N − Ls ) + logα (Ls ) + 1 W. Henkel, Constructor University Bremen 87 Encoding: Codewords (in Radix 3) 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 2 1 0 2 1... 22|02|1 0 0 0 0 0 0 0 0 1 0 1 0 2 1 0 2 1 0 2 1... 21|10|2 0 0 0 0 1 0 1 0 2 1 0 2 1 0 2 1 2 0 2 1... 20|21|2 2 1 0 2 1 0 2 1 2 0 2 1 0 2 1 2 0 0... 02|22|0 Decoding: 22|02|1 0 0 0 0 0 0 0 0 0 0 0 1 21|10|2 0 0 0 0 0 0 0 0 1 0 1 0 2 20|21|2 0 0 0 0 1 0 1 0 2 1 0 2 1 0 2 1 2 02|22|0 2 1 0 2 1 0 2 1 2 0 2 1 0 2 1 2 0 0 W. Henkel, Constructor University Bremen 88 LZ77 algorithm while (lookAheadBuffer not empty) { get a reference (position, length) to longest match; if (length > 0) { output (position, length, next symbol); shift the window length+1 positions along; } else { output (0, 0, first symbol in the lookahead buffer); shift the window 1 character along; } } W. Henkel, Constructor University Bremen 89 Lempel-Ziv 78 algorithm builds up a dictionary which, without measures, will become infinite in size. w := NIL; while (there is input){ K := next symbol from input; if (wK exists in the dictionary) { w := wK; } else { output (index(w), K); add wK to the dictionary; w := NIL; } } (This is a variable-to-variable length encoder!) W. Henkel, Constructor University Bremen 90 Example: S = 001212121021012101221011 ♯ entry phrase output ternary 1 0 0 00 00 2 1+1 01 11 11 3 2 2 02 02 4 1 1 01 00 1 5 3+1 21 31 10 1 6 5+0 210 50 12 0 7 6+1 2101 61 20 1 8 7+2 21012 72 21 2 9 7+1 21011 71 21 1 W. Henkel, Constructor University Bremen 91 Lempel-Ziv-Welch 84 algorithm w := INPUT; while (there is input){ K := INPUT; if (wK exists in the dictionary) { w := wK; } else { output (index(w)); add wK to the dictionary; w := K; } output (index(w)) } (variable-to-fixed length encoding!) W. Henkel, Constructor University Bremen 92 Decompression of LZW INCODE := INPUT OLDCODE := INCODE OUTPUT := invcode(INCODE) repeat while codes to read: INCODE := INPUT if code INCODE exists in dictionary table C. W. Henkel, Constructor University Bremen 190 Further simplified (weakened) 1 + Pen log2 (2k ) ≥ k (1 − C/R) 1 + Pen · k ≥ k (1 − C/R) 1 Pen ≥ 1 − C/R − nR For n → ∞ this will result in Pen ≥ 1 − C/R For R > C, the error rate will be lower bounded and cannot be made arbitrarily small! W. Henkel, Constructor University Bremen 191 Again, a Markov-chain definition... Def.: Random variables X, Y, Z are said to form a Markov chain in that order (denoted by X → Y → Z) if the conditional distribution of Z depends only on Y and is conditionally independent of X. Specifically, X, Y , and Z form a Markov chain X → Y → Z if the joint probability mass function can be written as P (x, y, z) = P (x) · P (y|x) · P (z|y) Data processing inequality (lemma) X Y Z processor processor I II I(X; Y ) If X → Y → Z, then I(X; Z) ≤ I(Y ; Z). W. Henkel, Constructor University Bremen 192 Proof: I(X; Z) = H(X) − H(X|Z) ≤ H(X) − H(X|Z, Y ) = H(X) − H(X|Y ) = I(X; Y ). I(X; Z) = H(Z) − H(Z|X) ≤ H(Z) − H(Z|X, Y ) = H(Z) − H(Z|Y ) = I(Y ; Z). Processing cannot increase the information! (Isn’t there a similar theorem in physics?) W. Henkel, Constructor University Bremen 193 Discrete input and analog output... Consider, e.g., additive white Gaussian noise (AWGN) r 1 2 2 N0 fy (y| ± 1) = √ e−(y∓1) /2σy σy = , 2πσy 2Eb where N0 is the one-sided noise-power density and Eb the energy per bit which equals ES , the energy per symbol, in this case. The BER pb of the corresponding BSC would be Z ∞ r ! 
1 −(y+1)2 /2σy2 1 Eb pb = √ e = 1 − erf. 0 2πσy 2 N0 Comparison of the channel-capacity formulas Qj −1 Qi −1 X X P (yi |xj ) CDM C = max P (xj )P (yi |xj ) log P (xj ) j=0 i=0 P (yi ) Qj −1 Z +∞ X f (y|xj ) CAW GN = max P (xj )f (y|xj ) log dy P (xj ) j=0 −∞ f (y) W. Henkel, Constructor University Bremen 194 CDM C = 1 + pb log2 pb + (1 − pb ) log2 (1 − pb ) Z ∞ Z ∞ 1 f (y| + 1) 1 f (y| − 1) CAW GN = f (y| + 1) log2 dy + f (y| − 1) log2 dy 2 |{z} −∞ f (y) 2 |{z} −∞ f (y) P (−1) P (+1) 1 1 2 2 1 1 2 2 with f (y) = √ e−(y−1) /2σy + √ e−(y+1) /2σy 2 2πσy 2 2πσy W. Henkel, Constructor University Bremen 195 Capacities of one- and two-dimensional modulation alphabets, i.e., for PAM, PSK, and QAM C = max I(X; Y ). PX For a discrete input xi and a discrete output yj , the mutual information would be given by My Mx X X P (xi , yj ) I(X; Y ) = P (xi , yj ) log2 , i=1 j=1 P (xi ) · P (yj ) where Mx and My denote the numbers of transmit and receive points. Using Bayes’ theorem, we obtain Mx My X X P (xi ) · P (yj |xi ) I(X; Y ) = P (xi ) P (yj |xi ) log2 i=1 j=1 P (xi ) · P (yj ) W. Henkel, Constructor University Bremen 196 and finally with the total probability theorem Mx My ! X X P (yj |xi ) I(X; Y ) = P (xi ) P (yj |xi ) log2 PMx. i=1 j=1 i′ =1 P (xi′ )P (yj |xi′ ) For the continuous output, we replace the summation by an integration and obtain Mx Z +∞ X p(y|xi ) C = max I(X; Y ) = max P (xi ) p(y|xi ) log2 PMx dy. i′ =1 P (xi′ )p(y|xi′ ) P (xi ) P (xi ) −∞ i=1 We consider the case of an AWGN channel with a density |y−x |2 − 2σ2i √1 one-dimensional (PAM) 2πσ 2 p(y|xi ) = e n · n. 1 two-dimensional (PSK, QAM) 2 2πσn W. Henkel, Constructor University Bremen 197 Assuming equiprobable input signals, i.e., P (xi ) = 1/Mx , we obtain M Z " ! # x +∞ X p(y|xi ) C = 1/Mx · p(y|xi ) log2 PMx + log2 (Mx ) dy i=1 −∞ i′ =1 p(y|xi ) ′ |y−xi′ |2 Mx Z +∞ √ 1 − |y−xi |2 PMx − 2σ2 X 2πσn 2 2 i′ =1 e n = log2 (Mx ) − 1/Mx · · ·e 2σn log2 |y−xi |2 dy. 1 − i=1 −∞ 2πσn 2 e 2σn2 Substituting w = y − xi yields |w+xi −x ′ |2 1 − i Mx Z +∞ √ |w|2 PMx 2 X 2 2πσn − 2σ2 i′ =1 e 2σn C = log2 (Mx )−1/Mx · ·e n ·log2 2 dw. −∞ 1 − |w| 2σ 2 i=1 2πσn2 e n W. Henkel, Constructor University Bremen 198 |w|2 R − 2σ2 One may write...e n... dw as expectation E{... } for Monte-Carlo integration. Mx ( "M #) X Xx |w+xi −xi′ |2 −|w|2 − 2 C = log2 (Mx ) − 1/Mx · E log2 e 2σn i=1 i′ =1 For the following curves, note E(|s |2 )/σ 2 one-dimensional (PAM) i n SNR = Es /N0 = E(|si |2 )/E(|w|2 ) = E(|si |2 )/2σ 2 two-dimensional (PSK, QAM) n W. Henkel, Constructor University Bremen 199 The channel capacities for one-dimensional M -PAM dependent on the signal-to-noise ratio SNR=Es /N0 M = 16 4 M -PAM 1/2 · log2 (1 + Es /N0 ) 3.5 M =8 3 C [bit/symbol] 2.5 M =4 2 1.5 M =2 1 0.5 0 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 Es /N0 W. Henkel, Constructor University Bremen 200 The channel capacities for two-dimensional M -PSK dependent on the signal-to-noise ratio SNR=Es /N0 M = 16 4 M -PSK log2 (1 + Es /N0 ) 3.5 M =8 3 C [bit/symbol] 2.5 M =4 2 1.5 M =2 1 0.5 0 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 Es /N0 W. Henkel, Constructor University Bremen 201 The channel capacities for two-dimensional M -QAM dependent on the signal-to-noise ratio SNR=Es /N0 M = 16 4 log2 (1 + Es /N0 ) M -QAM 3.5 M =8 3 C [bit/symbol] 2.5 M =4 2 1.5 M =2 1 0.5 0 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 Es /N0 W. 
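The Monte-Carlo form of the capacity formula is straightforward to program. The following sketch (assuming NumPy; the SNR grid and sample count are arbitrary choices) evaluates C for equiprobable one-dimensional M-PAM over AWGN and reproduces the qualitative behaviour of the capacity curves shown above: saturation at log2(M) bit/symbol and a gap to 1/2 · log2(1 + Es/N0).

import numpy as np

def pam_capacity(M, snr_db, n_samples=200_000, rng=np.random.default_rng(0)):
    # C = log2(M) - 1/M * sum_i E{ log2 sum_i' exp(-(|w + x_i - x_i'|^2 - |w|^2)/(2 sigma^2)) }
    # for equiprobable one-dimensional M-PAM over AWGN, with SNR = E{|x|^2}/sigma^2.
    x = np.arange(-(M - 1), M, 2, dtype=float)       # PAM levels -(M-1), ..., M-1
    Es = np.mean(x**2)
    sigma2 = Es / 10**(snr_db / 10)
    w = rng.normal(0.0, np.sqrt(sigma2), n_samples)  # AWGN samples
    c = 0.0
    for xi in x:
        d = (w[:, None] + xi - x[None, :])**2 - w[:, None]**2   # |w + x_i - x_i'|^2 - |w|^2
        c += np.mean(np.log2(np.sum(np.exp(-d / (2 * sigma2)), axis=1)))
    return np.log2(M) - c / M

for snr_db in (0, 6, 12, 18):
    print(snr_db, pam_capacity(2, snr_db), pam_capacity(4, snr_db))
# The values saturate at log2(M) bit/symbol and stay below 0.5 * log2(1 + Es/N0).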
Henkel, Constructor University Bremen 202 Application of the Chain Rule of the Mutual Information to Multi-Level Coded Modulation Let us, e.g., assume Ungeböck’s mapping by set partitioning W. Henkel, Constructor University Bremen 203 Every level is protected with a separate code. The code rates are obtained as the differences between neighboring capacity curves. This follows from the Chain Rule. 4 I(X1 − X4 ; Y ) M = 16 M -PSK log2 (1 + Es /N0 ) 3.5 I(X1 ; Y ) M =8 I(X2 − X4 ; Y |X1 ) 3 C [bit/symbol] 2.5 I(X2 ; Y |X1 ) M =4 2 I(X3 , X4 ; Y |X1 , X2 ) 1.5 I(X3 ; Y |X1 − X2 ) M =2 1 I(X4 ; Y |X1 − X3 ) 0.5 I(X4 ; Y |X1 − X3 ) 0 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 Es /N0 To show the principle, let us consider the last two levels. The difference describes the mutual information of the corresponding level bits. I(X3 , X4 ; Y |X1 , X2 ) − I(X4 ; Y |X1 , X2 , X3 ) = I(X3 ; Y |X1 , X2 ) which follows from the Chain Rule: I(X3 , X4 ; Y |X1 , X2 ) = I(X3 ; Y |X1 , X2 ) + I(X4 ; Y |X1 , X2 , X3 ) W. Henkel, Constructor University Bremen 204 Examples and derivations have partly been taken from the following references: Wolfgang Sauer-Greff, Einführung in die Informations- und Codierungstheorie, lecture notes, Uni Kaiserslautern, May 2001. Rolf Johannesson, Informationstheorie, Grundlage der Telekommunikation, Addison-Wesley 1992. Thomas M. Cover, Joy A. Thomas, Elements of Information Theory, Wiley 1990. Ungerböck, G., “Channel Coding with Multilevel/Phase Signals,” IEEE Trans. on Information Theory, Vol. IT-28, No. 1, Januar 1982, S. 55 ff.. W. Henkel, Constructor University Bremen 205 20 Differential entropy and the Gaussian channel Def.: Z ∞ H(X) := − fX (x) log fX (x)dx −∞ is called the differential entropy of a continuous random variable X with a density fX (x) Note that this definition has no physical interpretation. Especially, as introduced here, it does not even make much sense, since fX (x) has a dimension. Thus, at least, it would be required to normalize in some way. We can still define the mutual information as Z ∞Z ∞ fX|Y (x|y) I(X; Y ) := fX,Y (x, y) log dxdy −∞ −∞ fX (x) W. Henkel, Constructor University Bremen 206 and with Z ∞ Z ∞ H(X|Y ) := − fX,Y (x, y) log fX|Y (x|y)dxdy −∞ −∞ we still have the relation I(X; Y ) = H(X) − H(X|Y ) ≥ 0 However, we cannot say anything about signs of the individual differential entropies. W. Henkel, Constructor University Bremen 207 Example: Let us assume a Gaussian-distributed random variable 1 − (x−µ) 2 fX (x) = √ e 2σ 2 2πσ 2 The differential entropy will then be Z ∞ 1 (x−µ) − 2σ2 2 1 (x−µ) 2 H(X) = − √ e log √ e− 2σ2 dx −∞ 2πσ 2 2πσ 2 √ Z ∞ 1 − (x−µ)2 = log 2πσ 2 √ e 2σ2 dx + −∞ 2πσ 2 Z ∞ log e 2 1 − (x−µ) 2 + 2 (x − µ) √ e 2σ2 dx 2σ −∞ 2πσ 2 1 2 log e 2 1 2 = log(2πσ ) + σ = log(2πeσ ) 2 2σ 2 2 W. Henkel, Constructor University Bremen 208 We will now show that the Gaussian distribution is the one with the highest differential entropy. We first need the following Lemma: Let fX (x) and gX (x) be two densities with the same mean and standard deviation (variance) and let 1 − (x−µ) 2 hX (x) = √ e 2σ 2 2πσ 2 When the integrals do exist, we obtain Z ∞ Z ∞ − fX (x) · log hX (x)dx = − gX (x) · log hX (x)dx −∞ −∞ Proof: For convenience, we replace log by ln which just relate by a factor. Z ∞ Z ∞ √ 2 (x − µ) − fX (x) · ln hX (x)dx = fX (x) · ln 2πσ 2 + 2 dx = −∞ −∞ 2σ W. 
Henkel, Constructor University Bremen 209 √ 1 ∞Z = ln 2πσ 2 + 2 (x − µf )2 fX (x)dx + 2σ −∞ Z ∞ 2 Z ∞ µf − µ (µf − µ) + 2 (x − µ )f f X (x)dx + 2 fX (x)dx σ −∞ 2σ −∞ √ σf2 + (µf − µ)2 = ln 2πσ 2 + 2σ 2 µf and σf are the mean and variance, respectively, for fX (x). Since they were said to be equal to µg and σg , respectively, we have already proven the lemma. W. Henkel, Constructor University Bremen 210 Theorem: The Gaussian distribution maximizes the differential entropy amongst all distributions with a given mean µ and variance σ 2. Proof: 1 H(X) − log(2πeσ 2 ) = 2Z Z ∞ ∞ = − fX (x) log fX (x)dx + hX (x) log hX (x)dx −∞ −∞ Z ∞ Z ∞ = − fX (x) log fX (x)dx + fX (x) log hX (x)dx −∞ −∞ Z ∞ hX (x) = fX (x) log dx −∞ fX (x) Z ∞ hX (x) ≤ fX (x) − 1 log(e) dx −∞ fX (x) Z ∞ Z ∞ = hX (x)dx − fX (x)dx · log e = 0 −∞ −∞ W. Henkel, Constructor University Bremen 211 Mutual information for Gaussian source and channel Let Y = X + Z I(X; Y ) = H(Y ) − H(Y |X) Z ∞ Z ∞ H(Y |X) = − fX,Y (x, y) log fY |X (y|x)dxdy −∞ −∞ Z ∞ Z∞ = − fX (x)fY |X (y|x) log fY |X (y|x)dxdy −∞ −∞ Z ∞ Z∞ = − fX (x)fZ (y −
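The closed-form result H(X) = 1/2 · log(2πeσ²) and the maximum-entropy property of the Gaussian can be checked numerically. The following sketch (assuming NumPy; σ = 1.5 is an arbitrary choice) estimates −E{log2 f_X(X)} by Monte Carlo and compares it with a uniform density of the same variance, which must have a smaller differential entropy.

import numpy as np

rng = np.random.default_rng(1)
sigma = 1.5                                   # arbitrary example standard deviation
n = 1_000_000

# Monte-Carlo estimate of the differential entropy H(X) = -E{log2 f_X(X)} of a Gaussian
x = rng.normal(0.0, sigma, n)
f = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
H_gauss_mc = -np.mean(np.log2(f))
H_gauss_closed = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)

# A uniform density with the same variance sigma^2 has width a = sqrt(12)*sigma
# and differential entropy log2(a), which is smaller, as the theorem above predicts.
a = np.sqrt(12) * sigma
H_uniform = np.log2(a)

print(H_gauss_mc, H_gauss_closed, H_uniform)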