Hashing refers to the process of generating a fixed-size output from an input of variable size using mathematical formulas known as hash functions. This technique determines an index or location for the storage of an item in a data structure.

There are three major components of hashing:
1. Key: A key can be anything, such as a string or an integer, that is fed as input to the hash function, which determines an index or location for storage of an item in a data structure.
2. Hash Function: The hash function receives the input key and returns the index of an element in an array called a hash table. The index is known as the hash index.
3. Hash Table: A hash table is a data structure that maps keys to values using a special function called a hash function. It stores the data in an associative manner in an array where each data value has its own unique index.

Types of Hash functions:
There are many hash functions that use numeric or alphanumeric keys. This article focuses on discussing the following hash functions:
1. Division Method.
2. Mid Square Method.
3. Folding Method.
4. Multiplication Method.

What is collision?
The hashing process generates a small number for a big key, so there is a possibility that two keys could produce the same value. The situation where a newly inserted key maps to an already occupied slot in the hash table is called a collision, and it must be handled using some collision handling technique.

Hashing is the process of generating a value from a text or a list of numbers using a mathematical function known as a hash function. A hash function converts a given numeric or alphanumeric key to a small practical integer value. The mapped integer value is used as an index in the hash table. In simple terms, a hash function maps a large number or string to a small integer that can be used as the index in the hash table.
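To make the three components and the idea of a collision concrete, here is a minimal sketch. The function name hash_index and the character-sum formula are illustrative assumptions, not something the article prescribes; any function that maps a key to a slot number would serve.

```python
def hash_index(key: str, table_size: int) -> int:
    """Toy hash function (an assumption for illustration):
    sum the character codes of the key and reduce modulo the table size."""
    return sum(ord(ch) for ch in key) % table_size


table_size = 10
table = [None] * table_size  # the hash table: one slot per index

for key in ("ab", "ba", "cd"):
    idx = hash_index(key, table_size)
    if table[idx] is None:
        table[idx] = key
        print(f"{key!r} stored at index {idx}")
    else:
        # Two different keys produced the same index: a collision.
        print(f"{key!r} collides with {table[idx]!r} at index {idx}")
```

Here "ab" and "ba" contain the same characters, so this toy function sends both to index 5, which is exactly the situation a collision handling technique has to resolve.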
Data in a hash table is stored as pairs of the form (key, value), where for a given key, one can find the value using some kind of "function" that maps keys to values. The key for a given object can be calculated using a function called a hash function. For example, given an array A, if i is the key, then we can find the value by simply looking up A[i].

1. Division Method:
This is the simplest method to generate a hash value. The hash function divides the key k by M and then uses the remainder obtained.
Formula:
h(k) = k mod M
Here, k is the key value, and M is the size of the hash table. It is best for M to be a prime number, as that helps distribute the keys more uniformly. The hash function depends on the remainder of a division.
Example:
k = 12345, M = 95
h(12345) = 12345 mod 95 = 90
k = 1276, M = 11
h(1276) = 1276 mod 11 = 0
Pros:
1. This method works for any value of M.
2. The division method is very fast since it requires only a single division operation.
Cons:
1. This method can lead to poor performance, since consecutive keys map to consecutive hash values in the hash table.
2. Sometimes extra care should be taken in choosing the value of M.

2. Mid Square Method:
The mid-square method is a very good hashing method. It involves two steps to compute the hash value:
1. Square the value of the key k, i.e. k².
2. Extract the middle r digits as the hash value.
Formula:
h(k) = middle r digits of (k x k)
Here, k is the key value. The value of r can be decided based on the size of the table.
Example:
Suppose the hash table has 100 memory locations. So r = 2, because two digits are required to map the key to a memory location.
k = 60
k x k = 60 x 60 = 3600
h(60) = 60
The hash value obtained is 60.
Pros:
1. The performance of this method is good, because most or all digits of the key contribute to the middle digits of the squared result.
2. The result is not dominated by the distribution of the top digit or bottom digit of the original key value.
Cons:
1. The size of the key is one of the limitations of this method: if the key is large, its square has about twice as many digits.
2. Another disadvantage is that collisions can still occur, although we can try to reduce them.

3. Digit Folding Method:
This method involves two steps:
1. Divide the key value k into a number of parts, i.e. k1, k2, k3, ..., kn, where each part has the same number of digits except for the last part, which can have fewer digits than the other parts.
2. Add the individual parts. The hash value is obtained by ignoring the last carry, if any.
Formula:
k = k1, k2, k3, k4, ..., kn
s = k1 + k2 + k3 + k4 + ... + kn
h(k) = s
Here, s is obtained by adding the parts of the key k.
Example:
k = 12345
k1 = 12, k2 = 34, k3 = 5
s = k1 + k2 + k3 = 12 + 34 + 5 = 51
h(k) = 51
Note: The number of digits in each part varies depending upon the size of the hash table. For example, if the size of the hash table is 100, then each part must have two digits, except for the last part, which can have fewer digits.

4. Multiplication Method
This method involves the following steps:
1. Choose a constant value A such that 0 < A < 1.
2. Multiply the key value by A.
3. Extract the fractional part of kA.
4. Multiply the result of the above step by the size of the hash table, i.e. M.
5. The resulting hash value is obtained by taking the floor of the result obtained in step 4.
Formula:
h(k) = floor(M (kA mod 1))
Here, M is the size of the hash table, k is the key value, and A is a constant value.
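The four methods described so far can be sketched in a few lines each. This is only an illustration under stated assumptions: the function names are hypothetical, and the mid-square helper assumes that when the middle digits cannot be centred exactly, the extra digit is left on the right-hand side (the article does not specify this case).

```python
import math


def division_hash(k: int, m: int) -> int:
    """Division method: remainder of the key k divided by the table size m."""
    return k % m


def mid_square_hash(k: int, r: int) -> int:
    """Mid-square method: the middle r digits of k squared.

    Assumption: if the middle cannot be centred exactly, the extra
    digit stays on the right-hand side.
    """
    digits = str(k * k)
    if len(digits) <= r:
        return int(digits)
    start = (len(digits) - r) // 2
    return int(digits[start:start + r])


def folding_hash(k: int, part_len: int) -> int:
    """Digit folding: split k into part_len-digit parts, add them,
    and ignore any carry beyond part_len digits."""
    digits = str(k)
    parts = [int(digits[i:i + part_len]) for i in range(0, len(digits), part_len)]
    return sum(parts) % (10 ** part_len)


def multiplication_hash(k: int, a: float, m: int) -> int:
    """Multiplication method: floor(m * fractional part of k * a), 0 < a < 1."""
    return math.floor(m * ((k * a) % 1))


print(division_hash(12345, 95))   # 90
print(division_hash(1276, 11))    # 0
print(mid_square_hash(60, 2))     # 60
print(folding_hash(12345, 2))     # 51
```

The calls at the bottom reproduce the worked examples for the division, mid-square, and folding methods; multiplication_hash takes the key, the constant A, and the table size M in that order.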
Example:
k = 12345, A = 0.357840, M = 100
h(12345) = floor[100 (12345 * 0.357840 mod 1)]
= floor[100 (4417.5348 mod 1)]
= floor[100 (0.5348)]
= floor[53.48]
= 53
Pros:
The advantage of the multiplication method is that it can work with any value of A between 0 and 1, although some values tend to give better results than others.
Cons:
The multiplication method is generally suitable only when the table size is a power of two; in that case, computing the index from the key using multiplication hashing is very fast.

A good hash function should have the following properties:
1. Efficiently computable.
2. Should uniformly distribute the keys (each table position is equally likely for each key).
3. Should minimize collisions.
4. Should have a low load factor (number of items in the table divided by the size of the table).

There are mainly two methods to handle collision:
1. Separate Chaining
2. Open Addressing

1) Separate Chaining
The idea is to make each cell of the hash table point to a linked list of records that have the same hash function value. Chaining is simple but requires additional memory outside the table.
Example: We are given a hash function and have to insert some elements into the hash table using separate chaining as the collision resolution technique.
Hash function = key % 5, Elements = 12, 15, 22, 25 and 37.
Let's see, step by step, how to solve the above problem:
Step 1: First draw the empty hash table, which will have a possible range of hash values from 0 to 4 according to the hash function provided.
Step 2: Now insert all the keys in the hash table one by one. The first key to be inserted is 12, which is mapped to bucket number 2, calculated using the hash function: 12%5=2.
Step 3: Now the next key is 22. It will map to bucket number 2 because 22%5=2. But bucket 2 is already occupied by key 12, so the separate chaining method handles the collision by adding 22 to the linked list at bucket 2.
Step 4: The next key is 15. It will map to slot number 0 because 15%5=0.
Step 5: Now the next key is 25. Its bucket number will be 25%5=0. But bucket 0 is already occupied by key 15. So the separate chaining method will again handle the collision by adding 25 to the linked list at bucket 0. The last key, 37, maps to bucket 2 (37%5=2) and is added to that bucket's linked list in the same way.
Hence, in this way, the separate chaining method is used as the collision resolution technique.

2) Open Addressing
In open addressing, all elements are stored in the hash table itself. Each table entry contains either a record or NIL. When searching for an element, we examine the table slots one by one until the desired element is found or it is clear that the element is not in the table.

2.a) Linear Probing
In linear probing, the hash table is searched sequentially, starting from the original location of the hash. If the location that we get is already occupied, then we check the next location.
Algorithm:
1. Calculate the hash key, i.e. key = data % size.
2. Check if hashTable[key] is empty. If so, store the value directly: hashTable[key] = data.
3. If the hash index already has some value, check the next index using key = (key+1) % size.
4. If the next index hashTable[key] is available, store the value there. Otherwise, try the next index.
5. Repeat the above process until a free slot is found.
Example: Let us consider a simple hash function, "key mod 5", and the sequence of keys to be inserted: 50, 70, 76, 85, 93.
Step 1: First draw the empty hash table, which will have a possible range of hash values from 0 to 4 according to the hash function provided.
Step 2: Now insert all the keys in the hash table one by one. The first key is 50. It will map to slot number 0 because 50%5=0. So insert it into slot number 0.
Step 3: The next key is 70. It will map to slot number 0 because 70%5=0, but 50 is already at slot number 0, so search for the next empty slot and insert it there (slot number 1).
Step 4: The next key is 76. It will map to slot number 1 because 76%5=1, but 70 is already at slot number 1, so search for the next empty slot and insert it there (slot number 2).
Step 5: The next key is 93. It will map to slot number 3 because 93%5=3. So insert it into slot number 3.
Step 6: The remaining key is 85. It will map to slot number 0 because 85%5=0, but slots 0 to 3 are already occupied, so it is inserted into the next empty slot, slot number 4.

2.b) Quadratic Probing
Quadratic probing is an open addressing scheme in computer programming for resolving hash collisions in hash tables. Quadratic probing operates by taking the original hash index and adding successive values of an arbitrary quadratic polynomial until an open slot is found.
An example sequence using quadratic probing is:
H + 1², H + 2², H + 3², H + 4², ..., H + k²
In this method we look for the i²-th probe (slot) in the i-th iteration, where i = 0, 1, ..., n - 1. Some sources also call this the mid-square method; it should not be confused with the Mid Square hash function described earlier. We always start from the original hash location. Only if that location is occupied do we check the other slots.
Let hash(x) be the slot index computed using the hash function and n be the size of the hash table.
If the slot hash(x) % n is full, then we try (hash(x) + 1²) % n.
If (hash(x) + 1²) % n is also full, then we try (hash(x) + 2²) % n.
If (hash(x) + 2²) % n is also full, then we try (hash(x) + 3²) % n.
This process is repeated for all the values of i until an empty slot is found.
Example: Let us consider table size = 7, hash function Hash(x) = x % 7, and collision resolution strategy f(i) = i². Insert 22, 30, and 50.
Step 1: Create a table of size 7.
Step 2: Insert 22 and 30.
- Hash(22) = 22 % 7 = 1. Since the cell at index 1 is empty, we can easily insert 22 at slot 1.
- Hash(30) = 30 % 7 = 2. Since the cell at index 2 is empty, we can easily insert 30 at slot 2.
Step 3: Insert 50.
- Hash(50) = 50 % 7 = 1.
- In our hash table, slot 1 is already occupied. So we will search for slot 1 + 1², i.e. 1 + 1 = 2.
- Again slot 2 is found occupied, so we will search for cell 1 + 2², i.e. 1 + 4 = 5.
- Now, cell 5 is not occupied, so we will place 50 in slot 5.

2.c) Double Hashing
Double hashing is a collision resolving technique in open addressed hash tables. Double hashing makes use of two hash functions. The first hash function is h1(k), which takes the key and gives out a location in the hash table. If that location is empty, we can simply place our key there. But if the location is occupied (a collision), we use the secondary hash function h2(k) in combination with the first hash function h1(k) to find a new location in the hash table.
This combination of hash functions is of the form
h(k, i) = (h1(k) + i * h2(k)) % n
where
i is a non-negative integer that indicates the collision number,
k = element/key which is being hashed,
n = hash table size.
Complexity of the double hashing algorithm:
Time complexity: O(n) in the worst case
Example: Insert the keys 27, 43, 692, 72 into a hash table of size 7, where the first hash function is h1(k) = k mod 7 and the second hash function is h2(k) = 1 + (k mod 5).
Step 1: Insert 27
- 27 % 7 = 6, location 6 is empty, so insert 27 into slot 6.
Step 2: Insert 43
- 43 % 7 = 1, location 1 is empty, so insert 43 into slot 1.
Step 3: Insert 692
- 692 % 7 = 6, but location 6 is already occupied, and this is a collision.
- So we need to resolve this collision using double hashing.
hnew = [h1(692) + i * h2(692)] % 7
= [6 + 1 * (1 + 692 % 5)] % 7
= 9 % 7
= 2
Now, as 2 is an empty slot, we can insert 692 into slot 2.
Step 4: Insert 72
- 72 % 7 = 2, but location 2 is already occupied, and this is a collision.
- So we need to resolve this collision using double hashing.
hnew = [h1(72) + i * h2(72)] % 7
= [2 + 1 * (1 + 72 % 5)] % 7
= 5 % 7
= 5
Now, as 5 is an empty slot, we can insert 72 into slot 5.

What is meant by Load Factor in Hashing?
The load factor of the hash table can be defined as the number of items the hash table contains divided by the size of the hash table. Load factor is the decisive parameter that is used when we want to rehash the previous hash function or want to add more elements to the existing hash table.
It indicates how full the hash table is, and therefore how likely collisions are and how efficiently the table can be expected to perform.
Load Factor = Total elements in hash table / Size of hash table

What is Rehashing?
As the name suggests, rehashing means hashing again. Basically, when the load factor increases beyond its predefined value (a commonly used default is 0.75, as in Java's HashMap), the cost of operations increases. To overcome this, the size of the array is increased (doubled) and all the values are hashed again and stored in the new double-sized array, to maintain a low load factor and low complexity.

Applications of Hash Data structure
Hash is used in databases for indexing.
Hash is used in disk-based data structures.
In some programming languages, such as Python and JavaScript, hash is used to implement objects.

Real-Time Applications of Hash Data structure
Hash is used for cache mapping for fast access to the data.
Hash can be used for password verification.
Hash is used in cryptography as a message digest.
Rabin-Karp algorithm for pattern matching in a string.
Calculating the number of different substrings of a string.

Advantages of Hash Data structure
Hash provides better synchronization than other data structures.
Hash tables are more efficient than search trees or other data structures for lookups in the average case.
Hash provides constant time for searching, insertion, and deletion operations on average.
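As a rough illustration of the load factor and rehashing described above, combined with separate chaining, the following sketch keeps one Python list per bucket and doubles the table when the load factor exceeds a threshold. The class name ChainedHashTable, its method names, and the use of key % size as the hash function are assumptions made for this example only.

```python
class ChainedHashTable:
    """Separate-chaining table of integer keys that doubles in size
    when the load factor exceeds max_load (hypothetical sketch)."""

    def __init__(self, size: int = 5, max_load: float = 0.75):
        self.buckets = [[] for _ in range(size)]  # one chain per slot
        self.count = 0
        self.max_load = max_load

    def load_factor(self) -> float:
        # number of stored items divided by the number of slots
        return self.count / len(self.buckets)

    def insert(self, key: int) -> None:
        bucket = self.buckets[key % len(self.buckets)]
        if key in bucket:
            return  # ignore duplicate keys
        bucket.append(key)
        self.count += 1
        if self.load_factor() > self.max_load:
            self._rehash()

    def contains(self, key: int) -> bool:
        return key in self.buckets[key % len(self.buckets)]

    def _rehash(self) -> None:
        # double the table and hash every stored key again
        old_keys = [k for bucket in self.buckets for k in bucket]
        self.buckets = [[] for _ in range(2 * len(self.buckets))]
        for k in old_keys:
            self.buckets[k % len(self.buckets)].append(k)


# Chaining example from the text (threshold 1.0, so no rehash is triggered).
chained = ChainedHashTable(size=5, max_load=1.0)
for key in (12, 22, 15, 25, 37):
    chained.insert(key)
print(chained.buckets)  # [[15, 25], [], [12, 22, 37], [], []]

# Same keys with a 0.75 threshold: the fourth insert triggers a rehash.
resizing = ChainedHashTable(size=5, max_load=0.75)
for key in (12, 22, 15, 25, 37):
    resizing.insert(key)
print(len(resizing.buckets), resizing.load_factor())  # 10 0.5
```

With the 0.75 threshold, inserting the fourth key raises the load factor to 4/5 = 0.8, so the table grows from 5 to 10 slots and every key is placed again using key % 10.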
Disadvantages of Hash Data structure
Hash is inefficient when there are many collisions.
Hash collisions are practically unavoidable for a large set of possible keys.
Some hash table implementations do not allow null values (for example, Java's Hashtable).

Conclusion
From the above discussion, we conclude that the goal of hashing is to resolve the challenge of finding an item quickly in a collection. For example, if we have a list of millions of English words and we wish to find a particular term, we would use hashing to locate it more efficiently. It would be inefficient to check each of the millions of items in the list until we find a match. Hashing reduces search time by restricting the search to a smaller set of words at the beginning.
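The three open addressing schemes discussed earlier (linear probing, quadratic probing, and double hashing) differ only in the sequence of slots they probe. The sketch below makes that explicit by passing the probe rule in as a function. All names are hypothetical, and the second hash function 1 + (k mod 5) is simply the one used in the double hashing example, not a general recommendation.

```python
from typing import Callable, List, Optional

Probe = Callable[[int, int, int], int]  # (key, attempt i, table size n) -> slot


def insert_open_addressing(table: List[Optional[int]], key: int, probe: Probe) -> int:
    """Store key in the first free slot of its probe sequence; return that slot."""
    n = len(table)
    for i in range(n):
        slot = probe(key, i, n)
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("no free slot found in n probes")


def linear_probe(key: int, i: int, n: int) -> int:
    return (key % n + i) % n            # H, H+1, H+2, ...


def quadratic_probe(key: int, i: int, n: int) -> int:
    return (key % n + i * i) % n        # H, H+1^2, H+2^2, ...


def double_hash_probe(key: int, i: int, n: int) -> int:
    # h1(k) = k mod n, h2(k) = 1 + (k mod 5), as in the example above
    return (key % n + i * (1 + key % 5)) % n


linear: List[Optional[int]] = [None] * 5
for k in (50, 70, 76, 93, 85):          # 85 is inserted last here
    insert_open_addressing(linear, k, linear_probe)
print(linear)      # [50, 70, 76, 93, 85]

quadratic: List[Optional[int]] = [None] * 7
for k in (22, 30, 50):
    insert_open_addressing(quadratic, k, quadratic_probe)
print(quadratic)   # [None, 22, 30, None, None, 50, None]

double: List[Optional[int]] = [None] * 7
for k in (27, 43, 692, 72):
    insert_open_addressing(double, k, double_hash_probe)
print(double)      # [None, 43, 692, None, None, 72, 27]
```

Note that the helper gives up after n probes; with quadratic probing in particular, the probe sequence is not guaranteed to visit every slot, so a free slot can exist even when the insert fails.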