Questions and Answers
Which indexing method avoids the need for a global dictionary by generating separate dictionaries for each block of the index?
- Naïve in-memory inversion
- Distributed indexing using MapReduce
- Single-pass in-memory indexing (correct)
- Blocked sort-based indexing
Which of the following is a primary reason for compressing inverted indexes?
- To improve the aesthetic appeal of search engine interfaces.
- To make the index fit entirely in main memory and reduce disk I/O. (correct)
- To encrypt the indexed content for security purposes.
- To reduce the computational complexity of search algorithms.
What is a key advantage of using compression algorithms where '[read compressed data | decompress] is faster than [read uncompressed data]'?
- Faster data transfer from disk to memory. (correct)
- Reduced CPU utilization during decompression.
- Increased energy consumption.
- Enhanced security due to encrypted data.
According to information presented, what is the approximate average number of bytes per token (including spaces and punctuation) in the Reuters RCV1 collection?
Which preprocessing step is considered a form of lossy compression in information retrieval?
According to Heaps' Law, what is the relationship between the vocabulary size ($M$) and the number of tokens ($T$) in a collection?
In the context of Zipf's Law, if the most frequent term occurs $cf_1$ times, how many times does the second most frequent term approximately occur?
What is the primary goal of compressing the dictionary in an information retrieval system?
Which dictionary compression technique involves storing the dictionary as a single string of characters and using pointers to indicate the start of each term?
What is the purpose of 'blocking' in dictionary compression?
What is the main idea behind front coding in dictionary compression?
In the context of inverted indexes, what does a 'posting' typically represent?
Why is it effective to store gaps instead of raw docIDs in a postings list?
What is the key idea behind variable length encoding?
In Variable Byte (VB) encoding, what is the purpose of the continuation bit?
According to the procedures outlined, how would you encode the number 130 using Variable Byte (VB) encoding?
What are the two components used to represent a gap (G) in Gamma encoding?
How is the 'length' component encoded in Gamma encoding?
What is the gamma code for the number 5?
Which of the following is a property of Gamma codes?
Which considerations primarily influence the practical usage of Gamma codes in real-world systems compared to Variable Byte encoding?
What type of index is most space efficient?
Sort-based indexing uses what method for disk-based sorting?
What is the primary advantage of using sort-based indexing?
In single-pass in-memory indexing, what approach is used for dictionary management?
What method does single-pass in-memory indexing use for its postings?
What is the chief goal of dictionary compression?
What is the term for the number of times a term appears in the collection?
In the dictionary-as-a-string scheme, what does the pointer point to?
If a term occurs in virtually every document, what can we use?
What is 'offset' in Gamma code?
What is 'length' in Gamma code?
What is the main drawback of unary code?
In VB encoding, what happens in the binary-encoding step if G ≤ 127?
What is a key requirement when storing postings?
Which of these is NOT a dictionary compression technique?
Given the gamma code 1110011, what number does it represent?
Given a document collection of 1,000,000 documents, each document ID requires $\log_2(10^6) \approx 20$ bits. If a term appears in almost every document, which of the following encoding schemes would be most efficient in terms of space?
Considering dictionary compression techniques, what is the trade-off between using a larger block size (k) in blocking and search performance?
Flashcards
Sort-based indexing
Organizing indexes using sorting techniques, useful for disk-based sorting.
Single-Pass In-Memory Indexing
Indexing method that avoids a global dictionary by generating separate dictionaries for each block.
Distributed indexing
Distributing the indexing process across multiple machines.
Dynamic indexing
Maintaining multiple indices and combining them with logarithmic merging as the collection changes.
Why compression?
Saves disk space, lets more data fit in memory, and speeds up data transfer from disk to memory.
Lossless compression
Compression that preserves all information; the norm in IR.
Lossy compression
Compression that discards some information, e.g., case folding, stop-word removal, stemming.
Term vocabulary
The set of distinct terms (word types) in a collection.
Heaps' Law
M = kT^b: the vocabulary size M grows as a power of the number of tokens T, with typical values 30 ≤ k ≤ 100 and b ≈ 0.5.
Zipf's Law
The ith most frequent term has collection frequency proportional to 1/i.
Collection frequency
The number of times a term appears in the collection.
Compression
Reducing the space needed to store the dictionary and the postings of an index.
Why compress the dictionary?
Search begins with the dictionary, so the goal is to keep it in main memory.
Dictionary compression
Techniques such as dictionary-as-a-string, blocking, and front coding that shrink the dictionary.
Dictionary-as-a-String
Storing the dictionary as one long string of characters, with pointers marking where each term begins.
Blocking
Storing a pointer only to every kth term in the string and prefixing each term with its length.
Front coding
Storing only the differences between consecutive sorted terms that share long common prefixes.
Postings Compression
Storing gaps between docIDs rather than raw docIDs and encoding them with variable-length codes.
Variable Byte (VB) codes
Byte-aligned codes that use a continuation bit to mark the last byte of each encoded gap.
Unary code
Representing n as n 1s followed by a final 0.
Gamma codes
Bit-level codes representing a gap as a (length, offset) pair: the offset is the gap in binary without its leading bit, and the length of the offset is encoded in unary.
Study Notes
Index Construction (Last Lecture)
- Sort-based indexing involves naïve in-memory inversion.
- Blocked Sort-Based Indexing uses merge sort, which is effective for disk-based sorting.
- Single-Pass In-Memory Indexing operates without a global dictionary, generating separate dictionaries for each block.
- Postings are accumulated in postings lists as they occur, and are not sorted.
- Distributed indexing can use MapReduce.
- Dynamic indexing uses multiple indices with logarithmic merge.
Topics of Today
- Collection statistics focus on the size of the dictionary and postings, particularly using RCV1 data.
- Other topics include dictionary compression and postings compression techniques.
Why Compression?
- It reduces disk space usage, saving money.
- Compression enables more data to be stored in memory, which increases speed.
- Increases the speed of data transfer from disk to memory because reading compressed data and decompressing it is faster than reading uncompressed data.
- The decompression algorithms must be fast for this to work.
Compression for Inverted Indexes
- Aims to reduce the dictionary size so it can be kept in main memory.
- It reduces the size of postings files which reduces the disk space required.
- The process decreases the time needed to read postings lists from disk.
- This allows large search engines to keep a significant part of the postings in memory due to these benefits.
- There are IR-specific compression schemes.
Reuters RCV1 Statistics
- N (number of documents): 800,000
- L (average number of tokens per document): 200
- M (number of terms/word types): ~400,000
- Average number of bytes per token (including spaces and punctuation): 6
- Average number of bytes per token (without spaces and punctuation): 4.5
- Average number of bytes per term: 7.5
- Number of non-positional postings: 100,000,000
Index Parameters vs Indexing Choices
- The unfiltered dictionary contains 484K terms.
- The unfiltered non-positional index contains 109,971K postings.
- The unfiltered positional index contains 197,879K positional postings.
Lossless vs. Lossy Compression
- Lossless compression preserves all information, common in IR.
- Lossy compression discards some information.
- Preprocessing steps like case folding, stop words, stemming, and number elimination can be viewed as lossy compression.
- Pruning postings entries unlikely to appear in the top k list can also be a form of lossy compression, with almost no loss in quality for the top k list.
Vocabulary vs. Collection Size
- The question is how many distinct words there are, and whether the vocabulary has an upper bound.
- Not really: there are 70^20 ≈ 10^37 possible words of length 20 alone.
- In practice, the vocabulary keeps growing with the collection size, especially with Unicode.
Heaps' Law Formula
- M = kT^b, where M is the vocabulary size, T is the number of tokens, and k and b are parameters.
- Typical values are 30 ≤ k ≤ 100 and b ≈ 0.5.
- Formula predicts a line with a slope of about 1/2 in a log-log plot of M against T.
- The law is an empirical finding.
Heaps' Law for RCV1
- The best least-squares fit for RCV1 is log10 M = 0.49 · log10 T + 1.64.
- k = 10^1.64 ≈ 44 and b = 0.49
- For the first 1,000,020 tokens, the law predicts 38,323 terms while there are actually 38,365 terms.
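A minimal Python sketch (not from the lecture) that plugs the RCV1 fit k ≈ 44, b = 0.49 into Heaps' law:

```python
# Heaps' law: M = k * T**b. The RCV1 least-squares fit gives
# k = 10**1.64 (about 44) and b = 0.49.
k, b = 44, 0.49

def predicted_vocab_size(num_tokens: int) -> int:
    """Predict the vocabulary size M from the number of tokens T."""
    return round(k * num_tokens ** b)

# For the first 1,000,020 tokens the law predicts ~38,323 terms;
# 38,365 terms are actually observed in RCV1.
print(predicted_vocab_size(1_000_020))  # -> 38323 (approximately)
```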
Zipf's Law
- Complements Heaps' law by describing the relative frequencies of terms.
- Natural language contains a few very frequent terms and many rare terms.
- The ith most frequent term has frequency proportional to 1/i.
- $cf_i \propto 1/i$, i.e. $cf_i = K/i$, where $K$ is a normalizing constant and $cf_i$ is the collection frequency of term $t_i$.
Zipf Consequences
- If the most frequent term ("the") occurs $cf_1$ times, the second most frequent ("of") occurs about $cf_1/2$ times, and so on.
- $cf_i = K/i$, where $K$ is a normalizing factor.
- There is a linear relationship between $\log cf_i$ and $\log i$.
- It is another power-law relationship.
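A small sketch of the consequence above; the collection frequency of the top-ranked term is a made-up number for illustration:

```python
# Zipf's law: cf_i = K / i. If the most frequent term occurs cf_1 times,
# the i-th most frequent term occurs about cf_1 / i times.
cf_1 = 1_000_000  # hypothetical count for the most frequent term ("the")

for i in range(1, 5):
    print(f"rank {i}: ~{round(cf_1 / i):,} occurrences")
# rank 1: ~1,000,000   rank 2: ~500,000
# rank 3: ~333,333     rank 4: ~250,000
```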
Compression Focus
- Compressing the dictionary and postings.
- It's for a basic Boolean index only.
- Positional indexes are not covered.
- The focus is on compression schemes only.
Why Compress the Dictionary?
- The search begins with the dictionary.
- The goal is to keep it in memory.
- The dictionary competes for memory footprint with other applications.
- Embedded or mobile devices may have limited memory.
- It is important to compress the dictionary even if it is stored on disk.
Dictionary Storage - First Cut
- An array of fixed-width entries: ~400,000 terms at 28 bytes/term comes to 11.2 MB.
- The size includes terms, frequency, and postings pointer.
Fixed-Width Terms
- Most bytes are wasted in the term column, allotting the same amount of space for shorter and longer terms.
- Written English averages are ~4.5 characters/word, while the average dictionary word in English contains ~8 characters.
- Short words dominate token counts but not the type average.
Dictionary-as-a-String
- The dictionary is stored as a long string of characters.
- This saves space because the pointer to the next word marks the end of the current word.
- The hope is to save up to 60% of dictionary space.
- Total string length = 400K × 8 B = 3.2 MB.
- Pointers must resolve 3.2M positions: log2 3.2M ≈ 22 bits = 3 bytes.
Space Savings via Dictionary-as-a-String
- Term storage now averages ~11 bytes/term (3-byte pointer plus ~8 bytes of string) instead of 20 fixed-width bytes.
- 4 bytes per term for frequency.
- 4 bytes per term for the pointer to postings.
- 3 bytes per term pointer into the string.
- An average of 8 bytes per term in the term string.
- 400K terms × 19 bytes ⇒ 7.6 MB, smaller than the 11.2 MB used for fixed-width entries.
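The arithmetic above, reproduced as a quick sanity check in Python:

```python
# Per-term cost with dictionary-as-a-string (RCV1, ~400K terms):
freq_bytes = 4          # term frequency
postings_ptr_bytes = 4  # pointer to the postings list
term_ptr_bytes = 3      # pointer into the term string (log2 3.2M ≈ 22 bits)
avg_term_bytes = 8      # average term length inside the string

bytes_per_term = freq_bytes + postings_ptr_bytes + term_ptr_bytes + avg_term_bytes
print(bytes_per_term)                  # 19
print(400_000 * bytes_per_term / 1e6)  # 7.6 (MB), vs. 11.2 MB fixed-width
```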
Blocking Technique
- Pointers are stored to every kth term string.
- The example given uses k=4.
- Term lengths need to be stored (1 extra byte).
Net Benefits of Blocking
- Example for block size k = 4.
- Without blocking, every term needs its own 3-byte pointer.
- With blocking, 3 × 4 = 12 bytes of pointers per block are reduced to 3 + 4 = 7 bytes (one 3-byte pointer plus four 1-byte term lengths).
- This shaves another ~0.5 MB, reducing the dictionary size from 7.6 MB to 7.1 MB; more can be saved with a larger k.
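The same savings, worked out in a short sketch:

```python
# Blocking with k = 4: one 3-byte string pointer per block plus one
# 1-byte length per term, instead of a 3-byte pointer for every term.
k = 4
pointers_without = 3 * k   # 12 bytes of pointers per block of 4 terms
pointers_with = 3 + 1 * k  # 7 bytes per block of 4 terms
saved_per_term = (pointers_without - pointers_with) / k  # 1.25 bytes/term
print(saved_per_term * 400_000 / 1e6)  # ~0.5 MB saved: 7.6 MB -> 7.1 MB
```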
Dictionary Search Without Blocking
- Assuming each dictionary term is equally likely in a query (not really true in practice), the average number of binary-search comparisons is ~2.6.
Dictionary Search With Blocking
- A binary search algorithm is used down to a 4-term block.
- Afterward, a linear search is done through terms in a block.
- With blocks of 4 and a binary tree structure, the average number of comparisons is 3.
Front Coding Technique
- Takes advantage of the fact that consecutive sorted words commonly have long common prefixes, storing only the differences.
- This is applied to the last k-1 terms in a block of k.
- The approach begins to resemble general string compression.
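A minimal front-coding sketch; the storage format here ((prefix length, suffix) pairs) is an illustrative assumption, not the exact byte layout from the lecture:

```python
import os

def front_code(block):
    """Encode a sorted block of terms: the first term is kept whole;
    each later term stores (shared-prefix length, remaining suffix)
    relative to the previous term."""
    encoded = [(0, block[0])]
    for prev, term in zip(block, block[1:]):
        p = len(os.path.commonprefix([prev, term]))
        encoded.append((p, term[p:]))
    return encoded

print(front_code(["automata", "automate", "automatic", "automation"]))
# [(0, 'automata'), (7, 'e'), (7, 'ic'), (8, 'on')]
```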
RCV1 Dictionary Compression
- Fixed width: 11.2 MB
- Dictionary-as-String with pointers to every term: 7.6 MB
- Also blocking k = 4: 7.1 MB
- Also, Blocking + front coding: 5.9 MB
Postings Files Compression
- They are much larger than the dictionary, by at least a factor of 10.
- Key consideration: each posting should be stored compactly.
- In this context, a posting is a docID.
- For Reuters (800,000 documents), we could use 32 bits per docID with 4-byte integers.
- But log2 800,000 ≈ 20 bits, so 20 bits per docID suffice.
- We would like to use even fewer than 20 bits per docID.
Postings: Two Conflicting Forces
- A term like arachnocentric occurs in maybe one document in a million, and each of its postings can be stored with log2 1M ≈ 20 bits.
- A term like "the" occurs in virtually every document, so 20 bits per posting is too costly there.
- For such frequent terms, a 0/1 bitmap vector would be preferable.
Postings File Entry
- It stores the list of docs containing a term in increasing order of docID.
- Storing gaps between docIDs is sufficient.
- The hope is that most gaps can be encoded/stored with far fewer than 20 bits.
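A short sketch of the gap transformation; the docIDs are illustrative:

```python
def to_gaps(doc_ids):
    """Turn an increasing list of docIDs into a first ID plus gaps."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Rebuild the docIDs by accumulating the gaps."""
    doc_ids, total = [], 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids

postings = [283047, 283154, 283159, 283202]  # illustrative docIDs
print(to_gaps(postings))                     # [283047, 107, 5, 43]
assert from_gaps(to_gaps(postings)) == postings
```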
Variable Length Encoding
- The goal is to use ~20 bits/gap entry (arachnocentric) or ~1 bit/gap entry ("the").
- If the average gap for a term is G, we want to use ~log2 G bits per gap entry.
- The goal is to encode each integer (gap) with about as few bits as that integer needs, using a variable-length encoding.
- Variable-length codes achieve this by using short codes for small numbers.
Variable Byte (VB) Codes
- For each gap value G, use close to the fewest bytes needed to hold log2 G bits.
- Begin with 1 byte to store G, dedicating 1 bit of it to a continuation bit c.
- If G ≤ 127, binary-encode it in the 7 available bits and set c = 1.
- Else encode G's lower-order 7 bits and then use additional bytes to encode the higher-order bits with the same algorithm.
- At the end, the continuation bit of the last byte is 1 (c = 1) and of the other bytes 0 (c = 0).
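A runnable Python sketch of the scheme just described (7 payload bits per byte, continuation bit set only on the last byte); it also reproduces the VB code for 130 asked about in the quiz above:

```python
def vb_encode_number(n: int) -> bytes:
    """Variable Byte encode one gap: 7 payload bits per byte; the
    continuation bit (high bit) is 1 only on the last byte."""
    chunks = []
    while True:
        chunks.insert(0, n % 128)  # next 7-bit group, high groups first
        if n < 128:
            break
        n //= 128
    chunks[-1] += 128  # set the continuation bit on the final byte
    return bytes(chunks)

def vb_decode(stream: bytes):
    """Decode a concatenation of VB-encoded gaps."""
    numbers, n = [], 0
    for byte in stream:
        if byte < 128:       # continuation bit 0: more bytes follow
            n = 128 * n + byte
        else:                # continuation bit 1: last byte of this gap
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers

code = vb_encode_number(130)
print([f"{b:08b}" for b in code])  # ['00000001', '10000010']
assert vb_decode(code) == [130]
```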
Variable Byte Code with small gaps
- Postings are stored as the concatenation of the encoded bytes.
- VB-encoded postings are uniquely prefix-decodable.
- VB uses a whole byte even for a small gap.
Other Variable Unit Codes
- Instead of bytes, we can use a different “unit of alignment”: 32 bits (words), 16 bits, 4 bits (nibbles).
- Variable byte alignment wastes space if there are many small gaps; nibbles do better in such cases.
- Variable byte codes are used by many commercial/research systems; unlike bit-level codes, they are a good low-tech blend of variable-length coding and sensitivity to computer memory alignment.
- There is also recent work on word-aligned codes that pack a variable number of gaps into one word.
Unary Code
- n represented as n 1s with a final 0.
- Unary code for 3 is 1110.
- The unary code for 40 is forty 1s followed by a 0; for 80, eighty 1s followed by a 0 — the codes get long quickly.
Gamma Codes
- Gamma code uses bit-level codes for compression.
- A gap G is represented as a (length, offset) pair.
- The offset is G in binary with the leading bit cut off (e.g., 13 = 1101 → offset 101).
- The length is the length of the offset (3 for offset 101).
- The length is encoded in unary code (1110 for 3).
- The gamma code of 13 is the concatenation of length and offset: 1110101.
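A minimal sketch of unary and gamma coding for a single gap (decoding a concatenated stream is omitted for brevity); it reproduces the code for 13 above and the two gamma quiz items (the code for 5, and decoding 1110011):

```python
def unary(n: int) -> str:
    """Unary code: n ones followed by a zero (e.g., 3 -> '1110')."""
    return "1" * n + "0"

def gamma_encode(g: int) -> str:
    """Gamma code of g >= 1: unary(length of offset) + offset, where
    offset is g in binary with the leading bit removed."""
    offset = bin(g)[3:]  # strip '0b' and the leading 1 bit
    return unary(len(offset)) + offset

def gamma_decode(code: str) -> int:
    length = code.index("0")       # number of leading 1s = offset length
    offset = code[length + 1:]
    return int("1" + offset, 2) if offset else 1

print(gamma_encode(13))         # '1110101' (length 1110, offset 101)
print(gamma_encode(5))          # '11001'   (length 110,  offset 01)
print(gamma_decode("1110011"))  # 11
```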
Gamma Code Properties
- G is encoded using 2⌊log2 G⌋ + 1 bits.
- The length of the offset is ⌊log2 G⌋ bits.
- The length of the length part is ⌊log2 G⌋ + 1 bits.
- All gamma codes have an odd number of bits.
- The code length is almost within a factor of 2 of the best possible, log2 G.
- Gamma code is uniquely prefix-decodable, like VB
- It can be used for any distribution
- They are parameter-free
Reasons Gamma Codes Seldom Used
- Machines have word boundaries – 8, 16, 32, 64 bits
- Operations that cross word boundaries are slower
- Compressing and manipulating at the granularity of bits can be a slow process
- Variable byte encoding is byte-aligned and can therefore be more efficient.
- Regardless of efficiency, VB is conceptually simpler at little additional space cost.
RCV1 Compression Results
- Dictionary, fixed-width - 11.2 MB
- Dictionary term pointers into string - 7.6MB
- Dictionaries with blocking, k=4 - 7.1MB
- Dictionaries with blocking and front coding - 5.9MB
- Collection (Text, XML markup etc) - 3,600.0MB
- Collection (text) - 960.0MB
- Term-document incidence matrix - 40,000MB
- Postings, uncompressed 32-bit words - 400.0MB
- Postings, uncompressed 20-bits - 250.0MB
- Postings, variable byte encoded - 116.0MB
- Postings, gamma encoded - 101.0MB
Index Compression Summary
- We can build a Boolean retrieval index that is highly space-efficient.
- The index is often only 4% of the total size of the collection.
- It is only 10-15% of the total size of the text in the collection.
- Positional information has been removed.
- Space savings are less for indexes used in practice.
- The techniques, however, are substantially the same.