Questions and Answers
Which indexing method avoids the need for a global dictionary by generating separate dictionaries for each block of the index?
- Naïve in-memory inversion
- Distributed indexing using MapReduce
- Single-pass in-memory indexing (correct)
- Blocked sort-based indexing
Which of the following is a primary reason for compressing inverted indexes?
- To improve the aesthetic appeal of search engine interfaces.
- To make the index fit entirely in main memory and reduce disk I/O. (correct)
- To encrypt the indexed content for security purposes.
- To reduce the computational complexity of search algorithms.
What is a key advantage of using compression algorithms where '[read compressed data | decompress] is faster than [read uncompressed data]'?
- Faster data transfer from disk to memory. (correct)
- Reduced CPU utilization during decompression.
- Increased energy consumption.
- Enhanced security due to encrypted data.
According to information presented, what is the approximate average number of bytes per token (including spaces and punctuation) in the Reuters RCV1 collection?
Which preprocessing step is considered a form of lossy compression in information retrieval?
According to Heaps' Law, what is the relationship between the vocabulary size ($M$) and the number of tokens ($T$) in a collection?
In the context of Zipf's Law, if the most frequent term occurs $cf_1$ times, how many times does the second most frequent term approximately occur?
What is the primary goal of compressing the dictionary in an information retrieval system?
Which dictionary compression technique involves storing the dictionary as a single string of characters and using pointers to indicate the start of each term?
What is the purpose of 'blocking' in dictionary compression?
What is the main idea behind front coding in dictionary compression?
In the context of inverted indexes, what does a 'posting' typically represent?
Why is it effective to store gaps instead of raw docIDs in a postings list?
What is the key idea behind variable length encoding?
In Variable Byte (VB) encoding, what is the purpose of the continuation bit?
According to the procedures outlined, how would you encode the number 130 using Variable Byte (VB) encoding?
What are the two components used to represent a gap (G) in Gamma encoding?
How is the 'length' component encoded in Gamma encoding?
What is the gamma code for the number 5?
Which of the following is a property of Gamma codes?
Which considerations primarily influence the practical usage of Gamma codes in real-world systems compared to Variable Byte encoding?
What type of index is most space efficient?
Sort-based indexing uses what method for disk-based sorting?
What is the primary advantage of using sort-based indexing?
In single-pass in-memory indexing, what approach is used for dictionary management?
What method does single-pass in-memory indexing use for its postings?
What is the chief goal of dictionary compression?
What is the term for the number of times a term appears in the collection?
In the dictionary-as-a-string scheme, what does the pointer point to?
If a term occurs in virtually every document, what can we use?
What is 'offset' in Gamma code?
What is 'length' in Gamma code?
What is the main drawback of unary code?
In VB encoding, what happens in the binary-encoding step if G ≤ 127?
What is a key requirement when storing postings?
Which of these is NOT a dictionary compression technique?
Given the gamma code 1110011, what number does it represent?
Given a document collection of 1,000,000 documents, each document ID requires $\log_2(10^6) \approx 20$ bits. If a term appears in almost every document, which of the following encoding schemes would be most efficient in terms of space?
Considering dictionary compression techniques, what is the trade-off between using a larger block size (k) in blocking and search performance?
Flashcards
Sort-based indexing
Organizing indexes using sorting techniques, useful for disk-based sorting.
Single-Pass In-Memory Indexing
Indexing method that avoids a global dictionary by generating separate dictionaries for each block.
Distributed indexing
Distributing the indexing process across multiple machines.
Dynamic indexing
Maintaining multiple indices and combining them with logarithmic merging as the collection changes.
Why compression?
Saves disk space, lets more data fit in memory, and speeds up data transfer from disk to memory.
Lossless compression
Compression that preserves all information; the norm in IR.
Lossy compression
Compression that discards some information, e.g., case folding, stop-word removal, stemming.
Term vocabulary
The set of distinct terms (word types) in a collection.
Heaps' Law
M = kT^b: the vocabulary size M grows as a power of the number of tokens T, with typical values 30 ≤ k ≤ 100 and b ≈ 0.5.
Zipf's Law
The ith most frequent term has collection frequency proportional to 1/i.
Collection frequency
The number of times a term appears in the collection.
Compression
Reducing the space needed to store the dictionary and the postings of an index.
Why compress the dictionary?
Search begins with the dictionary, so the goal is to keep it in main memory.
Dictionary compression
Techniques such as dictionary-as-a-string, blocking, and front coding that shrink the dictionary.
Dictionary-as-a-String
Storing the dictionary as one long string of characters, with pointers marking where each term begins.
Blocking
Storing a pointer only to every kth term in the string and prefixing each term with its length.
Front coding
Storing only the differences between consecutive sorted terms that share long common prefixes.
Postings Compression
Storing gaps between docIDs rather than raw docIDs and encoding them with variable-length codes.
Variable Byte (VB) codes
Byte-aligned codes that use a continuation bit to mark the last byte of each encoded gap.
Unary code
Representing n as n 1s followed by a final 0.
Gamma codes
Bit-level codes representing a gap as a (length, offset) pair: the offset is the gap in binary without its leading bit, and the length of the offset is encoded in unary.
Study Notes
Index Construction (Last Lecture)
- Sort-based indexing involves naïve in-memory inversion.
- Blocked Sort-Based Indexing uses merge sort, which is effective for disk-based sorting.
- Single-Pass In-Memory Indexing operates without a global dictionary, generating separate dictionaries for each block.
- Postings are accumulated in postings lists as they occur, and are not sorted.
- Distributed indexing can use MapReduce.
- Dynamic indexing uses multiple indices with logarithmic merge.
Topics of Today
- Collection statistics focus on the size of the dictionary and postings, particularly using RCV1 data.
- Other topics include dictionary compression and postings compression techniques.
Why Compression?
- It reduces disk space usage, saving money.
- Compression enables more data to be stored in memory, which increases speed.
- Increases the speed of data transfer from disk to memory because reading compressed data and decompressing it is faster than reading uncompressed data.
- The decompression algorithms must be fast for this to work.
Compression for Inverted Indexes
- Aims to reduce the dictionary size so it can be kept in main memory.
- It reduces the size of postings files which reduces the disk space required.
- The process decreases the time needed to read postings lists from disk.
- This allows large search engines to keep a significant part of the postings in memory due to these benefits.
- There are IR-specific compression schemes.
Reuters RCV1 Statistics
- N (number of documents): 800,000
- L (average number of tokens per document): 200
- M (number of terms/word types): ~400,000
- Average number of bytes per token (including spaces and punctuation): 6
- Average number of bytes per token (without spaces and punctuation): 4.5
- Average number of bytes per term: 7.5
- Number of non-positional postings: 100,000,000
Index Parameters vs Indexing Choices
- The unfiltered dictionary contains 484K terms.
- The unfiltered non-positional index contains 109,971K postings.
- The unfiltered positional index contains 197,879K positional postings.
Lossless vs. Lossy Compression
- Lossless compression preserves all information, common in IR.
- Lossy compression discards some information.
- Preprocessing steps like case folding, stop words, stemming, and number elimination can be viewed as lossy compression.
- Pruning postings entries unlikely to appear in the top k list can also be a form of lossy compression, with almost no loss in quality for the top k list.
Vocabulary vs. Collection Size
- The question is how many distinct words there are, and whether the vocabulary has an upper bound.
- Not really: there are 70^20 ≈ 10^37 possible words of length 20 alone.
- In practice, the vocabulary keeps growing with the collection size, especially with Unicode.
Heaps' Law Formula
- M = kT^b, where M is the vocabulary size, T is the number of tokens, and k and b are parameters.
- Typical values are 30 ≤ k ≤ 100 and b ≈ 0.5.
- Formula predicts a line with a slope of about 1/2 in a log-log plot of M against T.
- The law is an empirical finding.
Heaps' Law for RCV1
- The best least-squares fit for RCV1 is log10 M = 0.49 · log10 T + 1.64.
- k = 10^1.64 ≈ 44 and b = 0.49
- For the first 1,000,020 tokens, the law predicts 38,323 terms while there are actually 38,365 terms.
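A minimal Python sketch (not from the lecture) that plugs the RCV1 fit k ≈ 44, b = 0.49 into Heaps' law:

```python
# Heaps' law: M = k * T**b. The RCV1 least-squares fit gives
# k = 10**1.64 (about 44) and b = 0.49.
k, b = 44, 0.49

def predicted_vocab_size(num_tokens: int) -> int:
    """Predict the vocabulary size M from the number of tokens T."""
    return round(k * num_tokens ** b)

# For the first 1,000,020 tokens the law predicts ~38,323 terms;
# 38,365 terms are actually observed in RCV1.
print(predicted_vocab_size(1_000_020))  # -> 38323 (approximately)
```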
Zipf's Law
- Complements Heaps' law by describing the relative frequencies of terms.
- Natural language contains a few very frequent terms and many rare terms.
- The ith most frequent term has frequency proportional to 1/i.
- $cf_i \propto 1/i$, i.e. $cf_i = K/i$, where $K$ is a normalizing constant and $cf_i$ is the collection frequency of term $t_i$.
Zipf Consequences
- If the most frequent term ("the") occurs $cf_1$ times, the second most frequent ("of") occurs about $cf_1/2$ times, and so on.
- $cf_i = K/i$, where $K$ is a normalizing factor.
- There is a linear relationship between $\log cf_i$ and $\log i$.
- It is another power-law relationship.
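A small sketch of the consequence above; the collection frequency of the top-ranked term is a made-up number for illustration:

```python
# Zipf's law: cf_i = K / i. If the most frequent term occurs cf_1 times,
# the i-th most frequent term occurs about cf_1 / i times.
cf_1 = 1_000_000  # hypothetical count for the most frequent term ("the")

for i in range(1, 5):
    print(f"rank {i}: ~{round(cf_1 / i):,} occurrences")
# rank 1: ~1,000,000   rank 2: ~500,000
# rank 3: ~333,333     rank 4: ~250,000
```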
Compression Focus
- Compressing the dictionary and postings.
- It's for a basic Boolean index only.
- Positional indexes are not covered.
- The focus is on compression schemes only.
Why Compress the Dictionary?
- The search begins with the dictionary.
- The goal is to keep it in memory.
- The dictionary competes for memory footprint with other applications.
- Embedded or mobile devices may have limited memory.
- It is important to compress the dictionary even if it is stored on disk.
Dictionary Storage - First Cut
- An array of fixed-width entries: ~400,000 terms at 28 bytes/term comes to 11.2 MB.
- The size includes terms, frequency, and postings pointer.
Fixed-Width Terms
- Most bytes are wasted in the term column, allotting the same amount of space for shorter and longer terms.
- Written English averages are ~4.5 characters/word, while the average dictionary word in English contains ~8 characters.
- Short words dominate token counts but not the type average.
Dictionary-as-a-String
- The dictionary is stored as a long string of characters.
- This saves space because the pointer to the next word marks the end of the current word.
- The hope is to save up to 60% of dictionary space.
- Total string length = 400K × 8 B = 3.2 MB.
- Pointers must resolve 3.2M positions: log2 3.2M ≈ 22 bits = 3 bytes.
Space Savings via Dictionary-as-a-String
- Term storage now averages ~11 bytes/term (3-byte pointer plus ~8 bytes of string) instead of 20 fixed-width bytes.
- 4 bytes per term for frequency.
- 4 bytes per term for the pointer to postings.
- 3 bytes per term pointer into the string.
- An average of 8 bytes per term in the term string.
- 400K terms × 19 bytes ⇒ 7.6 MB, smaller than the 11.2 MB used for fixed-width entries.
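The arithmetic above, reproduced as a quick sanity check in Python:

```python
# Per-term cost with dictionary-as-a-string (RCV1, ~400K terms):
freq_bytes = 4          # term frequency
postings_ptr_bytes = 4  # pointer to the postings list
term_ptr_bytes = 3      # pointer into the term string (log2 3.2M ≈ 22 bits)
avg_term_bytes = 8      # average term length inside the string

bytes_per_term = freq_bytes + postings_ptr_bytes + term_ptr_bytes + avg_term_bytes
print(bytes_per_term)                  # 19
print(400_000 * bytes_per_term / 1e6)  # 7.6 (MB), vs. 11.2 MB fixed-width
```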
Blocking Technique
- Pointers are stored to every kth term string.
- The example given uses k=4.
- Term lengths need to be stored (1 extra byte).
Net Benefits of Blocking
- Example for block size k = 4.
- Without blocking, every term needs its own 3-byte pointer.
- With blocking, 3 × 4 = 12 bytes of pointers per block are reduced to 3 + 4 = 7 bytes (one 3-byte pointer plus four 1-byte term lengths).
- This shaves another ~0.5 MB, reducing the dictionary size from 7.6 MB to 7.1 MB; more can be saved with a larger k.
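The same savings, worked out in a short sketch:

```python
# Blocking with k = 4: one 3-byte string pointer per block plus one
# 1-byte length per term, instead of a 3-byte pointer for every term.
k = 4
pointers_without = 3 * k   # 12 bytes of pointers per block of 4 terms
pointers_with = 3 + 1 * k  # 7 bytes per block of 4 terms
saved_per_term = (pointers_without - pointers_with) / k  # 1.25 bytes/term
print(saved_per_term * 400_000 / 1e6)  # ~0.5 MB saved: 7.6 MB -> 7.1 MB
```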
Dictionary Search Without Blocking
- Assuming each dictionary term is equally likely in a query (not really true in practice), the average number of binary-search comparisons is ~2.6.
Dictionary Search With Blocking
- A binary search algorithm is used down to a 4-term block.
- Afterward, a linear search is done through terms in a block.
- With blocks of 4 and a binary tree structure, the average number of comparisons is 3.
Front Coding Technique
- Takes advantage of the fact that consecutive sorted words commonly have long common prefixes, storing only the differences.
- This is applied to the last k-1 terms in a block of k.
- The approach begins to resemble general string compression.
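A minimal front-coding sketch; the storage format here ((prefix length, suffix) pairs) is an illustrative assumption, not the exact byte layout from the lecture:

```python
import os

def front_code(block):
    """Encode a sorted block of terms: the first term is kept whole;
    each later term stores (shared-prefix length, remaining suffix)
    relative to the previous term."""
    encoded = [(0, block[0])]
    for prev, term in zip(block, block[1:]):
        p = len(os.path.commonprefix([prev, term]))
        encoded.append((p, term[p:]))
    return encoded

print(front_code(["automata", "automate", "automatic", "automation"]))
# [(0, 'automata'), (7, 'e'), (7, 'ic'), (8, 'on')]
```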
RCV1 Dictionary Compression
- Fixed width: 11.2 MB
- Dictionary-as-String with pointers to every term: 7.6 MB
- Also blocking k = 4: 7.1 MB
- Also, Blocking + front coding: 5.9 MB
Postings Files Compression
- They are much larger than the dictionary, by at least a factor of 10.
- Key consideration: each posting should be stored compactly.
- In this context, a posting is a docID.
- For Reuters (800,000 documents), we could use 32 bits per docID with 4-byte integers.
- But log2 800,000 ≈ 20 bits, so 20 bits per docID suffice.
- We would like to use even fewer than 20 bits per docID.
Postings: Two Conflicting Forces
- A term like arachnocentric occurs in maybe one document in a million, and each of its postings can be stored with log2 1M ≈ 20 bits.
- A term like "the" occurs in virtually every document, so 20 bits per posting is too costly there.
- For such frequent terms, a 0/1 bitmap vector would be preferable.
Postings File Entry
- It stores the list of docs containing a term in increasing order of docID.
- Storing gaps between docIDs is sufficient.
- The hope is that most gaps can be encoded/stored with far fewer than 20 bits.
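A short sketch of the gap transformation; the docIDs are illustrative:

```python
def to_gaps(doc_ids):
    """Turn an increasing list of docIDs into a first ID plus gaps."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Rebuild the docIDs by accumulating the gaps."""
    doc_ids, total = [], 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids

postings = [283047, 283154, 283159, 283202]  # illustrative docIDs
print(to_gaps(postings))                     # [283047, 107, 5, 43]
assert from_gaps(to_gaps(postings)) == postings
```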
Variable Length Encoding
- The goal is to use ~20 bits/gap entry (arachnocentric) or ~1 bit/gap entry ("the").
- If the average gap for a term is G, we want to use ~log2 G bits per gap entry.
- The goal is to encode each integer (gap) with about as few bits as that integer needs, using a variable-length encoding.
- Variable-length codes achieve this by using short codes for small numbers.
Variable Byte (VB) Codes
- For each gap value G, use close to the fewest bytes needed to hold log2 G bits.
- Begin with 1 byte to store G, dedicating 1 bit of it to a continuation bit c.
- If G ≤ 127, binary-encode it in the 7 available bits and set c = 1.
- Else encode G's lower-order 7 bits and then use additional bytes to encode the higher-order bits with the same algorithm.
- At the end, the continuation bit of the last byte is 1 (c = 1) and of the other bytes 0 (c = 0).
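A runnable Python sketch of the scheme just described (7 payload bits per byte, continuation bit set only on the last byte); it also reproduces the VB code for 130 asked about in the quiz above:

```python
def vb_encode_number(n: int) -> bytes:
    """Variable Byte encode one gap: 7 payload bits per byte; the
    continuation bit (high bit) is 1 only on the last byte."""
    chunks = []
    while True:
        chunks.insert(0, n % 128)  # next 7-bit group, high groups first
        if n < 128:
            break
        n //= 128
    chunks[-1] += 128  # set the continuation bit on the final byte
    return bytes(chunks)

def vb_decode(stream: bytes):
    """Decode a concatenation of VB-encoded gaps."""
    numbers, n = [], 0
    for byte in stream:
        if byte < 128:       # continuation bit 0: more bytes follow
            n = 128 * n + byte
        else:                # continuation bit 1: last byte of this gap
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers

code = vb_encode_number(130)
print([f"{b:08b}" for b in code])  # ['00000001', '10000010']
assert vb_decode(code) == [130]
```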
Variable Byte Code with small gaps
- Postings are stored as the concatenation of the encoded bytes.
- VB-encoded postings are uniquely prefix-decodable.
- VB uses a whole byte even for a small gap.
Other Variable Unit Codes
- Instead of bytes, we can use a different “unit of alignment”: 32 bits (words), 16 bits, 4 bits (nibbles).
- Variable byte alignment wastes space if there are many small gaps; nibbles do better in such cases.
- Variable byte codes are used by many commercial/research systems; unlike bit-level codes, they are a good low-tech blend of variable-length coding and sensitivity to computer memory alignment.
- There is also recent work on word-aligned codes that pack a variable number of gaps into one word.
Unary Code
- n represented as n 1s with a final 0.
- Unary code for 3 is 1110.
- The unary code for 40 is forty 1s followed by a 0; for 80, eighty 1s followed by a 0 — the codes get long quickly.
Gamma Codes
- Gamma code uses bit-level codes for compression.
- A gap G is represented as a (length, offset) pair.
- The offset is G in binary with the leading bit cut off (e.g., 13 = 1101 → offset 101).
- The length is the length of the offset (3 for offset 101).
- The length is encoded in unary code (1110 for 3).
- The gamma code of 13 is the concatenation of length and offset: 1110101.
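A minimal sketch of unary and gamma coding for a single gap (decoding a concatenated stream is omitted for brevity); it reproduces the code for 13 above and the two gamma quiz items (the code for 5, and decoding 1110011):

```python
def unary(n: int) -> str:
    """Unary code: n ones followed by a zero (e.g., 3 -> '1110')."""
    return "1" * n + "0"

def gamma_encode(g: int) -> str:
    """Gamma code of g >= 1: unary(length of offset) + offset, where
    offset is g in binary with the leading bit removed."""
    offset = bin(g)[3:]  # strip '0b' and the leading 1 bit
    return unary(len(offset)) + offset

def gamma_decode(code: str) -> int:
    length = code.index("0")       # number of leading 1s = offset length
    offset = code[length + 1:]
    return int("1" + offset, 2) if offset else 1

print(gamma_encode(13))         # '1110101' (length 1110, offset 101)
print(gamma_encode(5))          # '11001'   (length 110,  offset 01)
print(gamma_decode("1110011"))  # 11
```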
Gamma Code Properties
- G is encoded using 2⌊log2 G⌋ + 1 bits.
- The length of the offset is ⌊log2 G⌋ bits.
- The length of the length part is ⌊log2 G⌋ + 1 bits.
- All gamma codes have an odd number of bits.
- The code length is almost within a factor of 2 of the best possible, log2 G.
- Gamma code is uniquely prefix-decodable, like VB
- It can be used for any distribution
- They are parameter-free
Reasons Gamma Codes Seldom Used
- Machines have word boundaries – 8, 16, 32, 64 bits
- Operations that cross word boundaries are slower
- Compressing and manipulating at the granularity of bits can be a slow process
- Variable byte encoding is byte-aligned and can therefore be more efficient.
- Regardless of efficiency, VB is conceptually simpler at little additional space cost.
RCV1 Compression Results
- Dictionary, fixed-width - 11.2 MB
- Dictionary term pointers into string - 7.6MB
- Dictionaries with blocking, k=4 - 7.1MB
- Dictionaries with blocking and front coding - 5.9MB
- Collection (Text, XML markup etc) - 3,600.0MB
- Collection (text) - 960.0MB
- Term-document incidence matrix - 40,000MB
- Postings, uncompressed 32-bit words - 400.0MB
- Postings, uncompressed 20-bits - 250.0MB
- Postings, variable byte encoded - 116.0MB
- Postings, gamma encoded - 101.0MB
Index Compression Summary
- We can build a Boolean retrieval index that is highly space-efficient.
- The index is often only 4% of the total size of the collection.
- It is only 10-15% of the total size of the text in the collection.
- Positional information has been removed.
- Space savings are less for indexes used in practice.
- The techniques, however, are substantially the same.