Full Transcript

# Lecture 18 ## Hash Functions ### Hash Function Applications Hash functions are used in a variety of applications, especially data structures * Hash Tables (obviously!) * Bloom filters * Cryptographic hash functions ### Hashing vs. Encryption Hash functions are often confused with encrypti...

# Lecture 18 ## Hash Functions ### Hash Function Applications Hash functions are used in a variety of applications, especially data structures * Hash Tables (obviously!) * Bloom filters * Cryptographic hash functions ### Hashing vs. Encryption Hash functions are often confused with encryption. While they both transform data, they have different purposes and properties: * **Hashing** * One-way function: Easy to compute the hash, but hard to find the original input * Collision: Different inputs can produce the same hash value * Fixed-size output: Regardless of the input size, the hash value is always the same size * **Encryption** * Two-way function: Encryption and decryption * No collision: Each input produces a unique ciphertext * Same-size output: Ciphertext has the same size as the input ### Hash Table Hash table is a data structure that stores key-value pairs. It uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found. ### Hash Table Operations * **Insert (key, value)** * Compute the hash of the key: $index = hash(key)$ * Store the key-value pair in the bucket at that index * **Get (key)** * Compute the hash of the key: $index = hash(key)$ * Retrieve the key-value pair from the bucket at that index * **Delete (key)** * Compute the hash of the key: $index = hash(key)$ * Delete the key-value pair from the bucket at that index ### Hash Table Example Let's say we have a hash table with 10 buckets, and we want to insert the following key-value pairs: * (apple, 1) * (banana, 2) * (cherry, 3) Let's assume our hash function is simply the sum of the ASCII values of the characters in the key, modulo the number of buckets. * hash(apple) = (97 + 112 + 112 + 108 + 101) % 10 = 530 % 10 = 0 * hash(banana) = (98 + 97 + 110 + 97 + 110 + 97) % 10 = 609 % 10 = 9 * hash(cherry) = (99 + 104 + 101 + 114 + 114 + 121) % 10 = 653 % 10 = 3 So, we would store the key-value pairs in the following buckets: * Bucket 0: (apple, 1) * Bucket 9: (banana, 2) * Bucket 3: (cherry, 3) ### Collision Resolution * **Separate Chaining** * Each bucket stores a list of key-value pairs * When a collision occurs, the new key-value pair is added to the list * **Open Addressing** * When a collision occurs, the hash table probes for an empty slot * Linear probing, quadratic probing, double hashing ### Simple Uniform Hashing * Each key is equally likely to hash to any of the $m$ slots * $\operatorname{Pr}\{h(k)=i\}=1 / m$ for each slot $i$. * However, this is generally not possible to achieve, since we don't know the distribution of keys ### Load Factor The load factor $\alpha$ of a hash table is the number of keys stored in the table divided by the number of slots in the table. $\alpha = \frac{n}{m}$ where: * $n$ is the number of keys stored in the table * $m$ is the number of slots in the table ### Hash Function Design * A good hash function should satisfy the assumption of simple uniform hashing * However, this is generally not possible to achieve, since we don't know the distribution of keys * In practice, we use heuristic techniques to produce a good hash function * **Division method**: $h(k) = k \mod m$ * **Multiplication method**: $h(k) = \lfloor m(k A \mod 1) \rfloor$ ### Division Method $h(k) = k \mod m$ * $m$ should be a prime number not too close to an exact power of 2 * For example, if we have $n = 2000$ strings and we don't mind examining 3 elements on average during an unsuccessful search, we allocate a hash table of size $m = 701$. * $2000/3 \approx 701$ * However, if $m$ is a power of 2, say $m = 2^p$, then $h(k)$ is just the $p$ lowest-order bits of $k$. * Unless we know that all low-order $p$-bit patterns are equally likely, this is not a good choice, since we are not using all the bits of the key. ### Multiplication Method $h(k) = \lfloor m(k A \mod 1) \rfloor$ * $A$ is a constant between 0 and 1 * $k A \mod 1$ means the fractional part of $k A$, i.e., $k A - \lfloor k A \rfloor$ * Multiply $k$ by $A$, extract the fractional part, multiply by $m$, and take the floor. * Advantage: The value of $m$ is not critical, we typically choose it to be a power of 2, since we can then easily implement the function on most computers * Advantage: The value of $A$ is critical, but Knuth suggests that $A = (\sqrt{5} - 1) / 2 = 0.6180339887 \dots$ is likely to work reasonably well. ### Multiplication Method Example $h(k) = \lfloor m(k A \mod 1) \rfloor$ * Suppose $k = 123456$, $m = 10000$, and $A = (\sqrt{5} - 1) / 2 = 0.6180339887 \dots$ * Then $k A = 76300.0041151 \dots$ * $k A \mod 1 = 0.0041151 \dots$ * $m(k A \mod 1) = 41.151 \dots$ * $h(k) = \lfloor 41.151 \dots \rfloor = 41$ ### Universal Hashing * Pick a hash function randomly from a well-designed class of functions. * This guarantees a low number of collisions *in expectation*, even if the data is chosen by an adversary. ### Universal Hash Function Let $H$ be a finite collection of hash functions that map a given universe $U$ of keys into the range $\{0,1, \ldots, m-1\}$. Such a collection is said to be *universal* if for each pair of distinct keys $k, l \in U$, the number of hash functions $h \in H$ for which $h(k)=h(l)$ is at most $|H| / m$. In other words, with a hash function randomly chosen from $H$, the chance of collision between distinct keys $k$ and $l$ is no more than the chance $1/m$ of a collision if $h(k)$ and $h(l)$ were randomly and independently chosen from the range $\{0,1, \ldots, m-1\}$. ### Universal Hashing Example * Choose a prime number $p$ large enough so that every possible key $k$ is in the range $0$ to $p-1$. * Let $a$ and $b$ be integers chosen randomly from the range $\{0,1, \ldots, p-1\}$. * Define the hash function $h_{a,b}(k) = ((a k + b) \mod p) \mod m$ * The set of all such hash functions $H = \{h_{a,b}\}$ is universal. ### Perfect Hashing * If all keys are known in advance (static), we can construct a perfect hash function that has no collisions. * Perfect hash functions can be constructed in $O(n)$ time, where $n$ is the number of keys. * Perfect hash functions are used in situations where collisions are unacceptable, such as in databases and compilers. * In practice, perfect hashing is only used for small sets of keys, since the size of the hash table grows quadratically with the number of keys. * There are two-level perfect hashing schemes that require only $O(n)$ space. ### Bloom Filter * A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. * It allows false positive membership queries, but not false negatives. * In other words, a query returns either "possibly in set" or "definitely not in set". ### Bloom Filter Operations * **Insert (element)** * Hash the element using $k$ different hash functions * Set the bits in the bit array at the $k$ indices to 1 * **Search (element)** * Hash the element using $k$ different hash functions * If all the bits in the bit array at the $k$ indices are 1, then the element is possibly in the set * Otherwise, the element is definitely not in the set ### Bloom Filter Example Let's say we have a Bloom filter with a bit array of size 10, and we use 3 hash functions. We want to insert the following elements: * apple * banana * cherry Let's assume our hash functions return the following indices: * hash1(apple) = 1, hash2(apple) = 3, hash3(apple) = 5 * hash1(banana) = 2, hash2(banana) = 4, hash3(banana) = 6 * hash1(cherry) = 3, hash2(cherry) = 5, hash3(cherry) = 7 So, we would set the bits in the bit array at the following indices to 1: * 1, 2, 3, 4, 5, 6, 7 Now, let's say we want to search for the element "grape". Let's assume our hash functions return the following indices: * hash1(grape) = 1, hash2(grape) = 4, hash3(grape) = 8 Since the bits at indices 1 and 4 are 1, but the bit at index 8 is 0, we can conclude that "grape" is definitely not in the set. ### Cryptographic Hash Functions * A cryptographic hash function is a hash function with some additional security properties. * It should be infeasible to find: * a message that corresponds to a given hash value (preimage resistance), * a second message that produces the same hash value as a given message (2nd-preimage resistance), and * two arbitrary messages that produce the same hash value (collision resistance). * Examples: SHA-256, SHA-3, scrypt, bcrypt. ### Hash Function Summary * Hash functions are used in a variety of applications, especially data structures * Hash tables are a fundamental data structure that use hash functions to map keys to indices in an array * Collision resolution is an important consideration when designing a hash table * Universal hashing guarantees a low number of collisions in expectation * Perfect hashing guarantees no collisions, but is only practical for small sets of keys * Bloom filters are a space-efficient probabilistic data structure that are used to test whether an element is a member of a set * Cryptographic hash functions are hash functions with additional security properties