Data Compression and Decompression
40 Questions
1 Views

Choose a study mode

Play Quiz
Study Flashcards
Spaced Repetition
Chat to lesson

Podcast

Play an AI-generated podcast conversation about this lesson

Questions and Answers

What is one of the advantages of compression in terms of disk space?

  • Keep disk space usage the same
  • Use more disk space
  • Use less disk space (correct)
  • Increase disk space needed
  • Why do large search engines keep a significant part of the postings in memory?

  • To slow down the search process
  • Because compression lets you keep more in memory (correct)
  • Because compression is not possible
  • To increase disk space usage
  • What is the main advantage of decompression algorithms?

  • They are fast (correct)
  • They are not used in IR
  • They are not useful
  • They are slow
  • What is the approximate number of documents in Reuters RCV1?

    <p>800,000</p> Signup and view all the answers

    What is the purpose of compression in dictionaries?

    <p>To make them smaller and fit in main memory</p> Signup and view all the answers

    What is the benefit of compression in terms of data transfer?

    <p>It is faster than reading uncompressed data</p> Signup and view all the answers

    What is the average number of tokens per document in Reuters RCV1?

    <p>200</p> Signup and view all the answers

    Why is compression useful in information retrieval?

    <p>It saves a little money and gives users more space</p> Signup and view all the answers

    What type of compression preserves all information?

    <p>Lossless compression</p> Signup and view all the answers

    What is the primary purpose of dictionary compression?

    <p>To reduce storage space</p> Signup and view all the answers

    Which of the following preprocessing steps can be viewed as a form of lossy compression?

    <p>All of the above</p> Signup and view all the answers

    What is the relationship between vocabulary size and collection size?

    <p>Vocabulary size increases with collection size</p> Signup and view all the answers

    What is the main focus of the compression chapter in this course?

    <p>Basic Boolean index compression</p> Signup and view all the answers

    What is the approximate number of different words of length 20?

    <p>7020</p> Signup and view all the answers

    What is the benefit of pruning postings entries in Chapter 7?

    <p>Almost no loss of quality in top k list</p> Signup and view all the answers

    How many bytes are required to store each term in the postings?

    <p>7.5</p> Signup and view all the answers

    What is the purpose of storing the dictionary as a string?

    <p>To save up to 60% of dictionary space</p> Signup and view all the answers

    How many bytes are required to resolve 3.2M positions?

    <p>3 bytes</p> Signup and view all the answers

    What is the average number of bytes per term in the term string?

    <p>8 bytes</p> Signup and view all the answers

    What is the total space required for the dictionary using the fixed width approach?

    <p>11.2 MB</p> Signup and view all the answers

    What is the purpose of blocking in the context of dictionary compression?

    <p>To store pointers to every kth term string</p> Signup and view all the answers

    How many extra bytes are required to store term lengths in blocking?

    <p>1 byte</p> Signup and view all the answers

    What is the net effect of blocking on the dictionary space?

    <p>Save 9 bytes and lose 4 bytes on term lengths</p> Signup and view all the answers

    What is the total space required for the dictionary using the dictionary-as-a-string approach?

    <p>7.6 MB</p> Signup and view all the answers

    What is the main idea behind front-coding in dictionary compression?

    <p>Storing only the differences between sorted words</p> Signup and view all the answers

    How does the blocking technique with k = 4 reduce the size of the dictionary?

    <p>By reducing the number of blocks in the dictionary</p> Signup and view all the answers

    What is the main goal of postings compression?

    <p>To store each posting compactly</p> Signup and view all the answers

    How many bits are required to represent a docID in Reuters (800,000 documents) using 4-byte integers?

    <p>32 bits</p> Signup and view all the answers

    What is the approximate number of bits required to represent a docID in Reuters (800,000 documents) using logarithmic compression?

    <p>20 bits</p> Signup and view all the answers

    What is the term for a posting in the context of information retrieval?

    <p>A document ID</p> Signup and view all the answers

    What is the ratio of the size of the postings file to the size of the dictionary?

    <p>At least 100:1</p> Signup and view all the answers

    What is the main conflict in postings compression?

    <p>Storing rare terms compactly vs. storing common terms compactly</p> Signup and view all the answers

    What is the optimal prefix code when P(n) = 2–n?

    <p>Gamma code</p> Signup and view all the answers

    What is the purpose of the Gamma code?

    <p>To compress better with bit-level codes</p> Signup and view all the answers

    How is the offset represented in the Gamma code?

    <p>As a binary number with the leading bit cut off</p> Signup and view all the answers

    What is the length in the Gamma code of the number 13?

    <p>3</p> Signup and view all the answers

    What is the Gamma code of the number 13?

    <p>1110101</p> Signup and view all the answers

    What is the bitwise operator used for extracting 7 bits from a number?

    <p>&amp; (bitwise and)</p> Signup and view all the answers

    What is the purpose of the bitwise right shift operator?

    <p>To extract a certain number of bits from a number</p> Signup and view all the answers

    What is the Python syntax for a bitwise right shift operator?

    <p>&gt;&gt;</p> Signup and view all the answers

    Study Notes

    Compression in Information Retrieval

    • Compression is useful in IR to reduce disk space, save money, and increase speed.
    • Decompression algorithms are fast, making compression a viable option.

    Why Compression for Inverted Indexes?

    • Dictionary compression: Make the dictionary small enough to fit in main memory, allowing for faster querying.
    • Postings file compression: Reduce disk space needed and decrease the time needed to read postings lists from disk.

    Reuters RCV1 Statistics

    • Number of documents: 800,000
    • Average number of tokens per document: 200
    • Number of terms (word types): approximately 400,000
    • Average number of bytes per token: 6 (including spaces and punctuation), 4.5 (excluding spaces and punctuation)
    • Average number of bytes per term: 7.5
    • Number of non-positional postings: 100,000,000

    Lossless vs. Lossy Compression

    • Lossless compression: All information is preserved, typically used in IR.
    • Lossy compression: Discard some information, used in preprocessing steps like case folding, stop words, stemming, and number elimination.

    Vocabulary Size vs. Collection Size

    • The term vocabulary size grows with the collection size, especially with Unicode.
    • Vocabulary size is important for compression, as it affects the size of the dictionary.

    Compression Schemes

    • Compression can be applied to the dictionary and postings files.
    • Basic Boolean index compression is a starting point, but can be extended to positional indexes.

    Dictionary Compression

    • Dictionary compression is important to reduce disk space and make querying faster.
    • Techniques: store dictionary as a string, blocking, and front coding.

    Dictionary-as-a-String

    • Store dictionary as a long string of characters, with pointers to mark the end of each word.
    • This can save up to 60% of dictionary space.

    Blocking

    • Store pointers to every kth term string, reducing the number of pointers needed.
    • Term lengths need to be stored, adding an extra byte.

    Front Coding

    • Front-coding: store differences between sorted words, taking advantage of common prefixes.
    • This is similar to general string compression.

    RCV1 Dictionary Compression Summary

    • Fixed width: 11.2 MB
    • Dictionary-as-String with pointers to every term: 7.6 MB
    • Dictionary-as-String with blocking (k=4): 7.1 MB
    • Dictionary-as-String with blocking and front coding: 5.9 MB

    Postings Compression

    • Postings compression is important because the postings file is much larger than the dictionary.
    • Goal: store each posting compactly, using fewer than 20 bits per docID.

    Postings: Two Conflicting Forces

    • Store rare postings (e.g., arachnocentric) using fewer bits, but more frequent postings need more bits.
    • Optimal solution: use log2 P(n) bits, where P(n) is the probability of the posting.

    Gamma Codes

    • Gamma codes are a type of bit-level code that can compress better than other methods.
    • Represent a gap G as a pair of length and offset, where offset is G in binary with the leading bit cut off.
    • Length is the length of the offset, encoded with unary code.

    Gamma Code Examples

    • Examples of gamma codes for various numbers, showing the length and offset.

    Studying That Suits You

    Use AI to generate personalized quizzes and flashcards to suit your learning preferences.

    Quiz Team

    Description

    Understand the benefits and principles of data compression and decompression, including saving disk space and increasing speed.

    More Like This

    Data Compression Quiz
    5 questions

    Data Compression Quiz

    UnwaveringWatermelonTourmaline avatar
    UnwaveringWatermelonTourmaline
    CPR: Data Compression
    120 questions
    Data Compression Techniques Quiz
    12 questions
    Use Quizgecko on...
    Browser
    Browser