Huffman Encoding
Summary
This document, from Vanderbilt University's CS 3250 Algorithms course, contains lecture notes on Huffman codes for data compression. It covers fixed- and variable-length encoding schemes, prefix-free codes, the greedy construction of optimal code trees, and the running time of the algorithm. The material is undergraduate level.
Full Transcript
CS 3250 Algorithms: Huffman Codes (con’t), Divide and Conquer

Announcements
HW4 is due Tuesday, October 29th. Remember to give yourself enough time to write your greedy proofs. The last day to withdraw is Friday, October 25th.

Recap: Introduction to Data Compression
Fixed-length codes are simple and straightforward. Their downside is that they don't take advantage of regularities that occur naturally, such as:
- Some letters are more likely than other letters.
- Letters are more likely than numbers.
- Numbers are more likely than special symbols.

Variable-length codes try to take advantage of these naturally occurring situations to achieve more efficient compression. Their downside is that they are more complicated to design. Consider an alphabet with just four characters, {A, B, C, D}. A fixed-length approach would use exactly two bits for each letter: A=00, B=01, C=10, D=11.

Say we know the letter A shows up 50% of the time, with D being the next most frequent character. A variable-length approach says, "let's try to minimize the number of bits for the letters occurring most frequently, even if we allocate more bits for other letters." Let's try: A=0, B=001, C=01, D=1. Can you find a potential problem with this approach?

The problem: if someone encodes AAD, it looks exactly like the encoding of B (both are 001). Likewise, the code 010 could represent CA or it could represent ADA. That's a problem. This variable-length attempt is ambiguous.

Greedy Algorithms: Huffman Codes
Idea: Maybe a binary tree might help? We could use a binary path (0 = left, 1 = right) to lead us to the correct character.
[Tree diagram giving the codes A=0, B=001, C=01, D=1, E=101.]
Question: Yes or no, do we still have ambiguity in our encoding scheme?

Answer: YES, there is still ambiguity. There are multiple letters on the path from the start to C and from the start to E. Does 01 mean an "A" followed by a "D", or the letter "C"?

Enter the idea of prefix-free encoding. We can use a variable-length encoding scheme that eliminates ambiguity by enforcing a prefix-free constraint: no codeword is a prefix of any other codeword. In the example below, the code 010 can only mean AD.
[Tree diagram giving the codes A=0, D=10, B=110, C=1110.]
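To see why the prefix-free property removes the ambiguity, here is a minimal Python sketch of a decoder (illustrative only and not from the slides; CODE and decode are made-up names). Because no codeword is a prefix of another, the decoder can scan the bits left to right and emit a letter the moment its buffer matches a codeword:

    # Prefix-free code from the slide: A=0, D=10, B=110, C=1110.
    CODE = {"0": "A", "10": "D", "110": "B", "1110": "C"}

    def decode(bits):
        out, buf = [], ""
        for b in bits:
            buf += b
            if buf in CODE:  # prefix-free: the first match is the only possible match
                out.append(CODE[buf])
                buf = ""
        assert buf == "", "input ended in the middle of a codeword"
        return "".join(out)

    print(decode("010"))  # prints AD, and no other reading is possible

With the earlier ambiguous code (where A=0 is a prefix of C=01), this first-match rule breaks down: after reading a 0, the decoder cannot know whether to stop at A or keep reading toward C.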
Question: Where should the letters with the highest frequency of occurrence be located?
A. As close to the root as possible
B. As a leaf in the tree
C. On the same level as each other in the tree
D. None of the above

Answer: A. The letters with the highest frequency of occurrence should be placed as close to the root as possible (why is this helpful?).

Did we gain anything in terms of compression? Let's say A occurs 50% of the time, D occurs 26% of the time, and B and C both occur 12% of the time. Using ABL for Average Bits per Letter:
ABL = 1(.50) + 2(.26) + 3(.12) + 4(.12) = 1.86
That's better than using 2 bits per character via a fixed-length encoding. Nice!

Practice: For each tree, calculate the ABL for both frequency tables and compare with fixed-length encoding (i.e., calculate four ABLs).
[Two tree diagrams over the letters a through e.]

    letter  Frequency        letter  Frequency
    a       .4               a       .35
    b       .2               b       .2
    c       .2               c       .12
    d       .1               d       .2
    e       .1               e       .13

It seems like prefix-free variable-length encoding is a good idea for compression. That leads us to the obvious algorist question: what is the optimal way to encode these characters so that we achieve maximum compression? Oh, it's a binary tree question about minimizing the depth of a tree based on the frequency of occurrence of the symbols.

Question: Is the following encoding scheme, where f_k indicates the frequency of the letter k, optimal? Why or why not?
f_e = 40%, f_i = 25%, f_m = 15%, f_l = 10%, f_s = 7%, f_p = 3%
[Tree diagram with leaves e, i, l, m, s, p.]

Answer: No. (The slide labels one of the interior nodes "Am I useful?"; a node that adds depth without distinguishing any symbols can be removed to shorten the codes beneath it.)

Our input to this problem will be the list of characters along with their relative frequencies. Our output is an optimal encoding that maximizes compression by minimizing tree depth.
Goal: We want to minimize the number of bits needed over all frequencies f, the ABL:
ABL = sum_{i=1..n} f_i * (depth of i in tree T)
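As a quick sanity check on this formula, here is a minimal Python sketch (illustrative only; abl is a made-up helper name) that evaluates the ABL of the four-letter prefix-free code from earlier. For a code table, a codeword's length equals its leaf's depth in the tree:

    # ABL = sum of f_i * depth(i in T); for a code table, depth = codeword length.
    def abl(freqs, code):
        return sum(f * len(code[sym]) for sym, f in freqs.items())

    freqs = {"A": .50, "D": .26, "B": .12, "C": .12}
    code = {"A": "0", "D": "10", "B": "110", "C": "1110"}
    print(round(abl(freqs, code), 2))  # 1.86, matching the earlier calculation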
Eminent researchers Robert Fano (of MIT) and Claude Shannon developed the following approach, better known as Shannon-Fano coding:
1. Divide and conquer: a top-down approach.
2. Repeatedly divide the characters into groups whose frequencies each total as close to 50% as possible.
They were aware their algorithm was pretty good but wasn't always optimal.

David Huffman, age 25, was an EE student at MIT in Prof. Fano's class. Fano offered students a choice of either taking the final exam or writing a term paper for his class. Huffman did not want to take the final, so he started working on a term paper based on a topic supplied by Fano (Fano did not tell the students that he himself was working on the exact same topic and struggling with it). Huffman was ready to give up and take the final exam when the solution came to him. His approach is optimal! Huffman said he would never have attempted the problem if he had known his own professor was struggling with it.

Huffman cleverly realized a better approach would be to work bottom-up in the tree. He devised a greedy approach that constructs the optimal tree.
Pro Tip: Whenever you work with tree algorithms, it's generally a good idea to consider both top-down and bottom-up approaches. We'll see this same trick soon with heapsort. Stay tuned.
David Huffman earned his PhD, joined MIT as faculty, and later moved to UC Santa Cruz.

Analysis: Huffman Codes and Algorithm
Huffman's approach to data encoding is a greedy approach:
1. Find the two lowest frequencies in the list.
2. Combine the two symbols to make a meta-letter.
3. Reinsert the new letter into the list.
4. Repeat until done.

Greedy Algorithms: Huffman Codes
Why is Huffman's a greedy approach? We want to minimize the ABL using a variable-length encoding restricted to prefix-free codes. If we stop Huffman at any point, the subtrees built so far are the optimal codes for the symbols they include (optimal substructure). We repeatedly join (aka merge) the two lowest frequencies as sibling pairs, and we will never regret this decision (greedy choice property).

Recall that a full binary tree is one in which the root of any subtree has either 0 or 2 children. It does not have to look balanced. Huffman realized that storing a code optimally requires a full binary tree. After all, why have a single child when we can have a sibling at the same depth? Huffman trees tend to not look very balanced.

Huffman's Approach: Start with all the characters as leaves of the tree. Repeatedly and "smartly" perform merges.
[Diagram: the leaves A, B, C, D are merged pairwise, step by step, into a single tree.]

Example: Huffman Codes
Huffman uses a bottom-up approach with smart merges: select the two symbols with the smallest frequencies and join them together to create a new "meta symbol" ("de") whose root frequency is the sum of the two frequencies.

    letter  Frequency        letter  Frequency
    a       .32              a       .32
    b       .25              b       .25
    c       .20              c       .20
    d       .18              de      .23
    e       .05

Question: Given the tables above, which two symbols would be combined next?

Answer: We combine .20 and .23 next and merge them into a new meta symbol (cde) whose root frequency is the sum of the two frequencies:

    letter  Frequency        letter  Frequency
    a       .32              a       .32
    b       .25              b       .25
    c       .20              cde     .43
    de      .23

Next, we combine .25 and .32 and merge those into a new meta symbol (ab) whose root frequency is .57:

    letter  Frequency        letter  Frequency
    a       .32              ab      .57
    b       .25              cde     .43
    cde     .43

Finally, we combine .43 and .57 and merge those into the meta symbol (abcde) whose root frequency is 1.0, the root of the finished tree:

    letter  Frequency
    abcde   1.0
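Here is a minimal Python sketch of this list-based procedure (illustrative only; huffman_slow is a made-up name, and ties may be broken differently than in the diagrams, so left/right order can differ). Each round scans the list for the two lowest frequencies, merges them into a meta symbol, and reinserts the sum:

    def huffman_slow(freqs):
        # nodes holds (frequency, subtree) pairs; a subtree is either a
        # symbol or a pair of subtrees produced by an earlier merge.
        nodes = [(f, sym) for sym, f in freqs.items()]
        while len(nodes) > 1:
            u = min(nodes, key=lambda t: t[0]); nodes.remove(u)  # lowest
            v = min(nodes, key=lambda t: t[0]); nodes.remove(v)  # second lowest
            nodes.append((u[0] + v[0], (u[1], v[1])))            # meta symbol
        return nodes[0]

    # Reproduces the merges above: de=.23, cde=.43, ab=.57, abcde=1.0
    # (printed sums may show floating-point rounding).
    print(huffman_slow({"a": .32, "b": .25, "c": .20, "d": .18, "e": .05}))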
Practice: Huffman Codes
We use left branches to represent 0 and right branches to represent 1. What is the ABL of this tree?
[Tree diagram: the root (1.0) has children .43 and .57; the .43 node has children .23 (whose children are d and e) and c; the .57 node has children a and b.]

    letter  Frequency
    a       .32
    b       .25
    c       .20
    d       .18
    e       .05

Answer: The ABL (average bits per letter) of this tree is
ABL = (.32)(2) + (.25)(2) + (.20)(2) + (.18)(3) + (.05)(3)
ABL = 0.64 + 0.50 + 0.40 + 0.54 + 0.15 = 2.23

Question: Are Huffman trees unique for a given frequency table? Is it possible to generate multiple optimal trees from the table of frequencies above?

Answer: No, they are not unique. The left and right decisions are arbitrary. For instance, we can swap d and e in the tree above without changing the ABL.

Question: Consider an encoding tree for the following frequencies: A=.25, B=.08, C=.17, D=.09, E=.41. Is the tree optimal?
[Tree diagram: E at depth 1, B at depth 2, C at depth 3, A and D at depth 4.]

Answer: No. The current ABL is .41(1) + .08(2) + .17(3) + .25(4) + .09(4) = 2.44. Swap A and B in the tree and the ABL will be smaller.

On Your Own: Use Huffman's algorithm to create a full binary tree representing an optimal prefix encoding scheme. What is the ABL?
a = .30, b = .18, c = .06, d = .15, e = .31
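To check practice problems like this one, a short Python tree walk (illustrative only; assign_codes is a made-up name, and the nested-tuple tree shape is assumed from the sketch above) can recover the code table by appending 0 on left branches and 1 on right branches, exactly as the slides do:

    def assign_codes(tree, prefix=""):
        # A leaf is a symbol (string); an internal node is a (left, right) pair.
        if isinstance(tree, str):
            return {tree: prefix or "0"}  # degenerate single-symbol alphabet
        left, right = tree
        codes = assign_codes(left, prefix + "0")
        codes.update(assign_codes(right, prefix + "1"))
        return codes

    # The tree shape from the ABL practice above: (d, e) and c under the
    # root's left child, and (a, b) under its right child.
    tree = ((("d", "e"), "c"), ("a", "b"))
    print(assign_codes(tree))
    # {'d': '000', 'e': '001', 'c': '01', 'a': '10', 'b': '11'}

The codeword lengths (3, 3, 2, 2, 2) are exactly the depths used in the ABL calculation above.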
Huffman Codes and Optimality
Question: What do you think the running time is for Huffman's algorithm as described?
A. Θ(n)
B. Θ(log n)
C. Θ(n log n)
D. Θ(n^2)
E. None of the above

Answer: D, Θ(n^2), as described. (Eventually we'll get it to C, Θ(n log n).)

Analysis: Huffman Codes and Algorithm
What is the time complexity of the algorithm?
1. Find the two lowest frequencies: Θ(n), performed twice.
2. Combine to make a meta-letter: Θ(1).
3. Reinsert the new letter into the list: Θ(1).
4. Repeat until done: n - 1 rounds.
Total: Θ(n^2)
This is not as good as we hoped. We will do better soon, once we learn a new data structure called a heap (stay tuned).

Huffman Codes and Analysis
Question: Can we sort the frequencies one time and then use that list to speed up our runtime?
The problem with sorting is that the list changes dynamically while the algorithm runs; we would need to re-sort the list after adding each meta symbol. That is n re-sorts at n log n each, for n^2 log n total, which is no improvement.

New idea: What if we sort once and then use binary search to insert each new meta symbol back into the list? Binary search is log n, and we do n inserts, so we have n log n. How about that? Is this true?

Not quite. We fixed one bottleneck by using binary search but created another: how do we insert the new symbol where it belongs? In the worst case, the new symbol goes at the start of the list, so we must move everyone else down, which is Θ(n) per insert. If we do this for each of the n elements, it is n^2 again.

Huffman Codes: The Algorithm

    Huffman(S)    // S is a table of symbols and frequencies
        IF there are only two symbols left (|S| = 2) THEN
            return a tree with a root and the 2 symbols as leaves
        ELSE
            FIND the two minimum frequencies in S (call them u and v)
            REMOVE u and v from the symbol list (S = S - {u, v})
            ADD meta symbol ω to S whose frequency is f_u + f_v
            T' = Huffman(S)
            T  = T' with leaf ω removed and replaced with an interior
                 node having two children, u and v; label the left
                 branch 0 and the right branch 1
            return T

The recurrence is T(n) = T(n-1) + 2 * find_min. If we could do the find_min steps faster, that would be much better. Stay tuned!

Summary
Representing characters as fixed-length codes is wasteful when the frequency distribution is uneven. This leads to variable-length encoding. To avoid ambiguity with variable-length encoding, it's important to enforce prefix-free codes. The problem then becomes minimizing the frequency-weighted depth of the binary tree that represents the code.

That's All For Now... Coming to a Slideshow Near You Soon:
1. Divide and Conquer
2. Heaps

© 2024-25 Vanderbilt University
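As a preview of where the heap takes us (an illustrative sketch, not the course's official implementation; it uses Python's standard heapq module), keeping the frequencies in a min-heap makes each find-min and each reinsert O(log n), so the n - 1 merge rounds cost O(n log n) overall:

    import heapq
    from itertools import count

    def huffman_heap(freqs):
        tiebreak = count()  # unique counter keeps entries comparable on frequency ties
        heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
        heapq.heapify(heap)                 # O(n)
        while len(heap) > 1:
            fu, _, u = heapq.heappop(heap)  # lowest frequency, O(log n)
            fv, _, v = heapq.heappop(heap)  # second lowest, O(log n)
            heapq.heappush(heap, (fu + fv, next(tiebreak), (u, v)))  # O(log n)
        return heap[0][2]  # root of the Huffman tree

    print(huffman_heap({"a": .32, "b": .25, "c": .20, "d": .18, "e": .05}))

This matches the recurrence above: each of the roughly n rounds now pays O(log n) for its find_min and reinsert work instead of Θ(n).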