Data Structures: Mapping and Hash Tables - PDF

Document Details

SuperbAlgorithm

Uploaded by SuperbAlgorithm

University of Connecticut

Tags

hashing data structures object-oriented design

Summary

These are lecture notes from the University of Connecticut. They cover mapping and hash tables, including hash functions and collision handling. The notes also explore the implementation of mapping ADT and dynamic hash mapping.

Full Transcript

2050. Data Structures and Object-Oriented Design Week 8 University of Connecticut Chapter 15: Mapping and hash tables Objectives 1. use vocabulary: mapping, hash function, hash collision, hash table 2. list the basic operations of the mapping ADT 3. explain how hashin...

2050. Data Structures and Object-Oriented Design Week 8 University of Connecticut Chapter 15: Mapping and hash tables Objectives 1. use vocabulary: mapping, hash function, hash collision, hash table 2. list the basic operations of the mapping ADT 3. explain how hashing works 4. list the two fundamental requirements of a hash function 5. explain why keys must be immutable (e.g. why cannot a list be used as the key in a dict) 6. implement the mapping ADT with O(1) for put and get using a hash table Search Search for an item Linear search 5 6 2 0 3 1 9 4 8 7 O(n) Binary search 0 1 2 3 4 5 6 7 8 9 O(log n) Can we do it in O(1) ? Yes, Hashing Dictionaries In Python, you can use a dictionary to store elements with keys of any hashable types (e.g., integers, floats, Booleans, strings, and tuples; but not lists, sets, and dictionaries themselves) and values of any types Mapping A mapping is an association between two sets of things. It associates a value to a key. We refer to these associated pairs as key-value pairs. Keys must be unique. Key Value Student ID Grade Hfg7678 98 khi9879 98 Qew987 54 Lji76586 89 Bhu7699 78 Mry8768 80 Concept of hashing The information to be retrieved is stored in a hash table which is best thought of as an array of m locations, called buckets The mapping between a key and a bucket is called the hash function The time to store and retrieve data is proportional to the time to compute the hash function Hash table Key2:value2 Key1:value1 Key2:value2 Buckets Key1:value1 hash function Optional Operations Mapping ADT supports the following methods. del(key) remove key:value pair Required Operations len put(key, value) return number of keys add key:value pair keys get(key) return set of keys in mapping return value (KeyError) is raised if the given key is not values present. return set of values in mapping items Can we do put and get it in return set of key:value pairs in mapping O(1) ? Idea – What if we know exactly where the key should be? Say all our keys are integers from 0-9 dea 1 – What if we know exactly where the key should be? Say all our keys are integers from 0-9 use key as the index 8 Keys --- > Integer 4 h(x) = x 0 2 9 5 1 3 6 7 possible keys Idea 1: use keys as indices. Say all our keys are integers from 0-9 Values use key as the index 8 0 None Keys --- > Integer 4 1 None h(x) = x 0 2 9 5 2 None 1 3 3 None 6 7 4 None possible keys 5 None 6 None 7 None 8 None 9 None Idea 1: use keys as indices. Say all our keys are integers from 0-9 Values use key as the index 0 None put(1, 'hey') Keys --- > Integer 8 put(6, 3) 4 1 None put(7, [1,2]) h(x) = x 0 2 9 5 2 None 1 3 3 None 6 7 4 None possible keys 5 None 6 None 7 None 8 None 9 None Idea 1: use keys as indices. Say all our keys are integers from 0-9 Values use key as the index 0 None put(1, 'hey') Keys --- > Integer 8 put(6, 3) 4 1 hey put(7, [1,2]) h(x) = x 0 2 9 5 2 None 1 3 3 None 6 7 4 None possible keys 5 None used keys 6 3 7 [1,2] 8 None 9 None Idea 1: use keys as indices. Say all our keys are integers from 0-9 Values use key as the index 0 None put(1, 'hey') Keys --- > Integer 8 put(6, 3) 4 1 hey put(7, [1,2]) h(x) = x 0 2 9 5 2 None 1 3 3 None 6 7 4 None Pros possible keys 5 None used keys 6 3 I can put and get with O(1) – I know where key will be if it exists 7 [1,2] Cons 8 None 9 None What if keys were 1, 6, and 1001? Storage – I need a slot for every possible key Idea 1: use keys as indices. (This approach is called direct-access table or direct-access map) indices 0 1 1 2 5000.. 202 202 5000 1 1 5000 5000 202 202.... 900007 900007 900007.. 900007 h(x) = x Problem: Too much wasted space Solution to the problem: Too much wasted space Idea 2: Map the entire key space into a small set of keys (use f(key))so we can use just the right sized array) indices 5000 0 5000 1 1 202 2 0 1 3 1 4 202 2 5 6 7 900007 7 8 900007 9 h(x) = x% table size Idea 2 – don't use key as the index; use f(key). Modify hash (We call f a hash function: it converts a key to an index) function h(x) = x % Table_size If we have 3 keys (1, 6, and 1001); we want a function with three outputs – 0, 1, and 2. try f(x) = x % 3 key f(key) 1 1 0 None 6 0 put(1, 'hey') 1 None 1001 2 2 None New idea – don't use key as the index; use f(key). Modify hash (We call f a hash function: it converts a key to an index) function h(x) = x % Table_size If we have 3 keys (1, 6, and 1001); we want a function with three outputs – 0, 1, and 2. try f(x) = x % 3 key f(key) 1 1 0 None 6 0 put(1, 'hey') 1 1:'hey' 1001 2 2 None New idea – don't use key as the index; use f(key). Modify hash (We call f a hash function: it converts a key to an index) function h(x) = x % Table_size If we have 3 keys (1, 6, and 1001); we want a function with three outputs – 0, 1, and 2. try f(x) = x % 3 key f(key) 1 1 0 None 6 0 put(1, 'hey') put(6, 3) 1 1:'hey' 1001 2 2 None New idea – don't use key as the index; use f(key). Modify hash (We call f a hash function: it converts a key to an integer)function h(x) = x % Table_size If we have 3 keys (1, 6, and 1001); we want a function with three outputs – 0, 1, and 2. try f(x) = x % 3 key f(key) 1 1 0 6:3 6 0 put(1, 'hey') put(6, 3) 1 1:'hey' 1001 2 2 None New idea – don't use key as the index; use f(key). Modify hash (We call f a hash function: it converts a key to an integer)function h(x) = x % Table_size If we have 3 keys (1, 6, and 1001); we want a function with three outputs – 0, 1, and 2. try f(x) = x % 3 key f(key) 1 1 0 6:3 6 0 put(1, 'hey') put(6, 3) 1 1:'hey' 1001 2 put(1001, [1,2]) 2 None New idea – don't use key as the index; use f(key). Modify hash (We call f a hash function: it converts a key to an integer)function h(x) = x % Table_size If we have 3 keys (1, 6, and 1001); we want a function with three outputs – 0, 1, and 2. try f(x) = x % 3 key f(key) 1 1 0 6:3 6 0 put(1, 'hey') put(6, 3) 1 1:'hey' 1001 2 put(1001, [1,2]) 2 1001:[1,2] 0 6:3 key f(key) put(1, 'hey') put(6, 3) 1 1:'hey' 1 1 put(1001, [1,2]) 2 1001:[1,2] 6 0 1001 2 0 6:3 key f(key) put(1, 'hey') put(6, 3) 1 1:'hey' 1 1 put(1001, [1,2]) 2 1001:[1,2] 6 0 1001 2 By using a function, we can map keys to an index put – O(?) get – O(?) No empty items! 0 6:3 key f(key) put(1, 'hey') put(6, 3) 1 1:'hey' 1 1 put(1001, [1,2]) 2 1001:[1,2] 6 0 1001 2 By using a function, we can map keys to an index put – O(1) get – O(1) No empty items! 0 6:3 key f(key) put(1, 'hey') put(6, 3) 1 1:'hey' 1 1 put(1001, [1,2]) 2 1001:[1,2] 6 0 1001 2 By using a function, we can map keys to an index put – O(?) get – O(?) No empty items! … What if my keys are 3, 6, and 9? 0 6:3 key f(key) put(1, 'hey') put(6, 3) 1 1:'hey' 1 1 put(1001, [1,2]) 2 1001:[1,2] 6 0 1001 2 f(x) = x % 3 By using a function, we can map keys to an index put – O(?) key f(key) get – O(?) 3 0 No empty items! 6 0 9 0 … What if my keys are 3, 6, and 9? 0 6:3 key f(key) put(1, 'hey') put(6, 3) 1 1:'hey' 1 1 put(1001, [1,2]) 2 1001:[1,2] 6 0 1001 2 f(x) = x % 3 By using a function, we can map keys to an index put – O(?) key f(key) get – O(?) 3 0 No empty items! 6 0 9 0 … What if my keys are 3, 6, and 9? Worst case – all keys hash to the same bucket Hash table: insert (1000) Some other value exists in slot at index 0 indices 1000 5000 0 5000 1 1 202 2 0 1 3 1 4 202 2 5 6 7 900007 7 8 900007 9 f(x) = x% table size f(x) = x % 10 Hash collision What is a hash collision? It’s a case when two different keys have the same hash value. Mathematically, h(k1) = h(k2) when k1 ≠ k2 31 Hash collision What is a hash collision? It’s a case when two different keys have the same hash value. Mathematically, h(k1) = h(k2) when k1 ≠ k2 Why is this a problem? - We put keys in slots determined by the hash function. That is, we put k1 at index h(k1), - A collision means the natural choice slot is taken - We cannot replace k1 with k2 (because the keys are different) - So the problem is where do we put k2? 32 Strategies to handle hash collision Collision Handling Schemes Collision occurs when different elements are mapped to the same cell Some collision resolution methods: 1. Separate chaining 2. Open addressing - Linear probing - Quadratic probing -Double hashing Separate chaining (Open Hashing) Separate chaining (Open Hashing) Separate chaining is a collision resolution strategy where collisions are resolved by storing all colliding keys in the same slot (using list, linked list or some other data structure) We could make this faster if we had many short lists instead of one large list. Then, we just need to have a quick way of knowing which short list to search or update. (buckets) key f(key) Example: Separate chaining 3 0 0 3:'hey', 6:3, 9:[1,2] 6 0 put(3, 'hey') put(6, 3) 1 None 9 0 put(9, [1,2]) 2 None By using a function, we can map keys to an index put – O(?) get – O(?) … What if my keys are 3, 6, and 9? Worst case – all keys hash to the same integer → Hash Collision key f(key) Example: Separate chaining 3 0 0 3:'hey', 6:3, 9:[1,2] 6 0 put(3, 'hey') put(6, 3) 1 None 9 0 put(9, [1,2]) 2 None By using a function, we can map keys to an index put – O(n) get – O(n) I can still put and get, but I need to iterate over everything in slot 0 first: O(n) Example 2: Separate chaining Example: insert the keys 23, 13, 21, 14, 7, 8, and 15 , in this order, in a hash table of size 7 with the hash function: h(key) = key % 7 h(23) = 23 % 7 = 2 0 21 14 7 h(13) = 13 % 7 = 6 h(21) = 21 % 7 = 0 1 8 15 h(14) = 14 % 7 = 0 collision 2 23 h(7) = 7 % 7 = 0 collision h(8) = 8 % 7 = 1 3 h(15) = 15 % 7 = 1 collision 4 5 6 13 Open Addressing (Closed Hashing) Open addressing is a collision resolution strategy where collisions are resolved by storing the colliding key in a different location when the natural choice is full. Linear Probing Quadratic Probing Double Hashing Search the table for the closest When collision occurs, we probe Use a secondary hash function following free location and inserts for i2‘th bucket in ith iteration. algorithm to find the next free the new key there available slot.(it is very unlikely that for some key, h(key) == g(key)) 42 Rehashing Dictionaries support O(1) for put and get, regardless of n How? 𝑏𝑢𝑐𝑘𝑒𝑡 = ℎ𝑎𝑠ℎ 𝑘𝑒𝑦 % 𝑚 Splits keys into 𝑚 buckets good for ~𝑚 items If 𝑛 gets too big/small, rehash: Create list with 𝑝 buckets Rehash every key: 𝑏𝑢𝑐𝑘𝑒𝑡 = ℎ𝑎𝑠ℎ 𝑘𝑒𝑦 % 𝑝 Cost: ? Rehashing Dictionaries support O(1) for put and get, regardless of n How? 𝑏𝑢𝑐𝑘𝑒𝑡 = ℎ𝑎𝑠ℎ 𝑘𝑒𝑦 % 𝑚 Splits keys into 𝑚 buckets good for ~𝑚 items If 𝑛 gets too big/small, rehash: Create list with 𝑝 buckets Rehash every key: 𝑏𝑢𝑐𝑘𝑒𝑡 = ℎ𝑎𝑠ℎ 𝑘𝑒𝑦 % 𝑝 Cost: O(n) Average cost: ? Rehashing Dictionaries support O(1) for put and get, regardless of n How? 𝑏𝑢𝑐𝑘𝑒𝑡 = ℎ𝑎𝑠ℎ 𝑘𝑒𝑦 % 𝑚 Splits keys into 𝑚 buckets good for ~𝑚 items If 𝑛 gets too big/small, rehash: Create list with 𝑝 buckets Rehash every key: 𝑏𝑢𝑐𝑘𝑒𝑡 = ℎ𝑎𝑠ℎ 𝑘𝑒𝑦 % 𝑝 Cost: O(n) Average cost: O(1) Lazy Update! When should we Rehash? Load Factor(Number of slots that are occupied) Load Factor (α) Ratio of number of entries in the table to table size. If n is the total number of (key, value) pairs stored in the table and c is capacity of the table (i.e., array), n Load factor α = c The Load factor is a measure that decides when to increase the HashMap capacity to maintain the get() and put() operation complexity of O(1). Rehashing Rehashing Example: insert the keys 25, 37, 83, 98 in this order, in a hash table of size 5 with the hash function: h(key) = key % 5 h(25) = 25 % 5 = 0 0 25 h(37) = 37 % 5 = 2 h(83) = 83 % 5 = 3 1 h(98) = 98 % 5 = 3 2 37 3 83 98 4 Load factor: 4/5 = 0.8 Rehashing Example: insert the keys 25, 37, 83, 98 in this order, in a hash table of size 5 with the hash function: h(key) = key % 5 0 h(25) = 25 % 5 = 0 0 25 1 h(37) = 37 % 5 = 2 2 h(83) = 83 % 5 = 3 1 3 25 h(98) = 98 % 5 = 3 2 37 4 37 3 5 83 98 6 83 4 7 8 Rehash h1(key) = key % 5 to h2(key) = key % 11 9 10 98 Python hash() hash(object) hash() function takes only one parameter. object (required) – The object whose hash value is to be returned. Only immutable objects can be hashed. Hashable – bool, int, long, float, string, tuple Non–Hashable – lists, set, dictionary Python hash() Example >>> hash(2) // 2 >>> hash(2.00) //2 >>> hash('Python’) // 397710083 >>> hash(4.4) //1288491832 #using hash function for tuple >>> hash((1,2,3,4)) ///89902565 #using hash function for list >>> hash([1,2,3,4]) Traceback (most recent call last): File "", line 1, in hash([1,2,3,4]) TypeError: unhashable type: 'list’ #using hash function for set >>> hash({1,2,3,4}) Traceback (most recent call last): File "", line 1, in hash({1,2,3,4}) TypeError: unhashable type: 'set' Optional Operations Mapping ADT supports the following methods. del(key) remove key:value pair Required Operations len put(key, value) return number of keys add key:value pair keys get(key) return set of keys in mapping return value values (KeyError) is raised if the given key is not present. return set of values in mapping items return set of key:value pairs in mapping Implement the Mapping ADT: [ ] Buckets Entries (key-value pair) [ color:blue ] ListMapping class Entry class [ make:Ford ] [ ] Hash table HashMapping class [ model:f150 ] Implement the Mapping ADT: myMapping = [color:blue make:Ford model:f150 ] Key-value Objects from Entry class Object from ListMapping class myMapping = ListMapping() myMapping['color'] = 'blue' # Implements the assignment self[key] = value def __setitem__(self, key, value): myMapping['make'] = 'Ford' myMapping['model'] = 'F150' # Implements the evaluation of self[key] Print(myMapping['model']) // ’f150' def __getitem__(self, key): print(len(myMapping)) //3 # Implements the built-in function len() def __len__(self): # Implements the membership test operator: keyword in (and not in) print('make' not in myMapping) //False def __contains__(self, key): print('year' in myMapping) //False Entry class # A class that stores key-value pairs class Entry: def __init__(self, the_key, the_value): self.key = the_key self.value = the_value def __str__(self): return str(self.key) + " : " + str(self.value) Implement the Mapping ADT: How to store? # Implements the Mapping ADT: # Stores the key-value pairs in a list class ListMapping: def __init__(self): # Initialize an empty list to store the data self._entries = [] List = [ ] Implement the Mapping ADT: # Add a new or update a key-value pair # Private helper method: def put(self, key, value): # Returns the entry associated with a e = self._entry(key) given key, or None if the key is not in # If the key is present in the list the mapping if e is not None: def _entry(self, key): # Update the value # Loop through all entries e.value = value for e in self._entries: else: # Until we find the key # If the key was not present in the list if e.key == key: : Add it return e self._entries.append(Entry(key, value)) # If the key was not found return None # Implements the assignment self[key] = value def __setitem__(self, key, value): self.put(key, value) myMapping = ListMapping() myMapping['color'] = 'blue' myMapping['make'] = 'Ford' myMapping['model'] = 'F150' myMapping = [color:blue make:Ford model:f150 ] Implement the Mapping ADT: def get(self, key): e = self._entry(key) # Implements the evaluation of self[key] # If the key is found in the list def __getitem__(self, key): if e is not None: return self.get(key) # Return the associated value return e.value else: # Else: Raise a KeyError raise KeyError myMapping = ListMapping() myMapping['color'] = 'blue' myMapping['make'] = 'Ford' myMapping = [color:blue ] make:Ford model:f150 myMapping['model'] = 'F150' Print(myMapping['model']) // 'F150' Print(myMapping[‘year’]) File "HashCar.py", line 38, in get raise KeyError KeyError Implement the Mapping ADT: # Implements the to-string conversion of the mapping myMapping = ListMapping() def __str__(self): myMapping['color'] = 'blue' return str([str(e) for e in self._entries]) myMapping['make'] = 'Ford' myMapping['model'] = 'F150' # Implements the built-in function len() def __len__(self): print(len(myMapping)) //3 return len(self._entries) # Implements the membership test operator: keyword in (and not in) def __contains__(self, key): if self._entry(key) is None: print('make' not in myMapping) //False return False print('year' in myMapping) //False else: return True Implement the Mapping ADT: # Returns an iterator: iterates over the keys in the mapping def __iter__(self): return (e.key for e in self._entries) # Returns an iterator: iterates over the values in the mapping def values(self): return (e.value for e in self._entries) # Returns an iterator: iterates over the key-value pairs in the mapping def items(self): return ((e.key, e.value) for e in self._entries) myMapping = ListMapping() for my_key in myMapping: myMapping['color'] = 'blue' print(my_key, myMapping[my_key]) myMapping['make'] = 'Ford' myMapping['model'] = 'F150' color blue make Ford model F-150 from list_mapping import ListMapping # A simple mapping using a hashtable class SimpleHashMapping: Simple def __init__(self): # Number of buckets HashMapping self._size = 100 # Create a list of n empty ListMappings class self._buckets = [ListMapping()] * self._size # Implement the assignment self[key] = value def __setitem__(self, key, value): # Use the helper function to find the bucket m = self._bucket(key) # Use the ListMapping contained in the bucket to find the key What if we need m[key] = value more buckets? # Implement the evaluation of self[key] def __getitem__(self, key): Rehashing. # Use the helper function to find the bucket m = self._bucket(key) # Use the ListMapping contained in the bucket to find the key return m[key] # Identify the right bucket def _bucket(self, key): # Use the hash function of the object used as key return self._buckets[hash(key) % self._size] from list_mapping import ListMapping # A mapping using a hashtable and dynamic reallocation of new buckets when Dynamic hash needed class HashMapping: def __init__(self, size=2): mapping # Number of buckets self._size = size # Create a list of empty ListMappings self._buckets = [ListMapping()] * self._size # Keep track of the number of elements currently in the mapping self._length = 0 # Implement the assignment self[key] = value def __setitem__(self, key, value): # Use the helper function to find the bucket m = self._bucket(key) if key not in m: self._length += 1 # Use the ListMapping contained in the bucket to find the key m[key] = value # Check if we need more buckets. if self._length > self._size: self.Rehashing() # Implement the evaluation of self[key] def __getitem__(self, key): # Use the helper function to find the bucket m = self._bucket(key) Dynamic hash # Use the ListMapping contained in the bucket to find the key return m[key] mapping # Identify the right bucket def _bucket(self, key): return self._buckets[hash(key) % self._size] Rehashing # Rehashing (Double) Increases the number of buckets. def Rehashing(self): print("I need more space!") It's not enough to just append more # Save a reference to the old buckets. buckets to the list, because the old_buckets = self._buckets bucket method that we use to find # Double the size. the right bucket depends on the self._size *= 2 number of buckets. When that # Create new buckets number changes, we have to self._buckets = [ListMapping()] * self._size # Add in all the old entries. reinsert all the items in the mapping for bucket in old_buckets: so that they can be found when we # For each entry in the old buckets next get them. for key, value in bucket.items(): # Identify the new bucket. m = self._bucket(key) # Add it to the bucket m[key] = value def __len__(self): return self._length