Lecture-03_mai.pdf
Database 2, Lecture 3

Hashing Techniques
◼ Hash function (randomizing function)
  ◼ Applied to the hash field value of a record
  ◼ Yields the address of the disk block in which the record is stored
◼ This organization is called a hash file
◼ The search condition must be an equality condition on the hash field (e.g., where X = 1234)
◼ The hash field is typically a key field
◼ Hashing is also used as an internal search structure
  ◼ Used when a group of records is accessed exclusively by one field value

Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe

Internal Hashing
◼ Internal hashing: the hash table is created in main memory
[Figure: hash table]

Internal Hashing (cont’d.)
◼ We choose a hash function that transforms the hash field value into an integer between 0 and M − 1.
◼ One common hash function is h(K) = K mod M, which returns the remainder of an integer hash field value K after division by M.
◼ This value is then used as the record address.

Hashing Techniques (cont’d.)
◼ Collision: most hashing functions do not guarantee that distinct values will hash to distinct addresses, because the hash field space (the number of possible values a hash field can take) is usually much larger than the address space (the number of available addresses for records). The hashing function maps the hash field space to the address space.
◼ A collision occurs when the hash field value of a record being inserted hashes to an address that already contains a different record.
◼ The process of finding another position is called collision resolution.

Collision resolution
1. Open addressing: starting from the occupied position, the program checks the subsequent positions in order until an unused (empty) position is found. This raises issues for deletion and search.
2. Chaining: the array is extended with a number of overflow positions.
3. Multiple hashing: the program applies a second hash function if the first results in a collision. If another collision results, the program uses open addressing, or applies a third hash function and then uses open addressing if necessary.
◼ Note that using hash functions can result in empty spaces even though the allocated number of slots is equal to the number of records. Why?

Example
◼ h(k) = k mod m, where:
  ◼ h(k) is the hash function applied to key k
  ◼ m is the number of available addresses (slots) in memory
◼ Let’s assume m = 10, meaning there are 10 available slots in memory.
◼ Example 1: take two different keys, k1 = 12 and k2 = 22.

Example (cont’d.)
◼ Applying the hash function to each key:
  ◼ h(12) = 12 mod 10 = 2
  ◼ h(22) = 22 mod 10 = 2
◼ Even though k1 = 12 and k2 = 22 are different, both are mapped to the same address, 2, because both keys leave the same remainder when divided by 10. This situation is a hash collision.

Hashing Techniques (cont’d.)
◼ The goal of a good hashing function is twofold. The first goal is to distribute the records uniformly over the address space so as to minimize collisions, making it possible to locate a record with a given key in a single access.
◼ The second, somewhat conflicting, goal is to achieve this while occupying the buckets fully, so that few locations are left unused.
◼ Hence, if we expect to store r records in the table, we should choose M locations for the address space such that r/M (the capacity percentage) is between 0.7 and 0.9.

Hashing Techniques (cont’d.)
◼ External hashing: hashing for disk files, where the table resides on disk rather than in main memory
◼ The target address space is made of buckets
◼ Bucket: one disk block or a cluster of contiguous blocks
◼ The hashing function maps a key into a relative bucket number rather than assigning an absolute block address to the bucket
◼ A table in the file header converts the bucket number into a disk block address

Hashing Techniques (cont’d.)
◼ The collision problem is less severe with buckets, because as many records as will fit in a bucket can hash to the same bucket without causing problems
◼ When a bucket is full, a variation of chaining is used: a pointer is maintained in each bucket to a linked list of overflow records for that bucket
◼ The pointers in the linked list are record pointers, which include both a block address and a relative record position within the block

Example:
◼ Assume a bucket can hold 3 records and the hash function maps 5 records to the same bucket. When the fourth and fifth records are inserted, the bucket is already full:
1. The first 3 records are stored directly in the bucket.
2. A pointer is added in the bucket, pointing to an overflow area (typically another block or a chain of blocks).
3. The 4th and 5th records are stored in this overflow area, with record pointers giving their exact location (block address and relative position).
4. When a lookup is performed for one of these overflow records, the system follows the pointer to the correct block and retrieves the record from the linked list of overflow records.
◼ Thus, record pointers allow the hash file to handle bucket overflow without reorganizing the entire structure.

Hashing Techniques (cont’d.)
◼ Static hashing
  ◼ A fixed number of buckets is allocated
  ◼ The hash address space is fixed, so it is difficult to expand or shrink the file dynamically
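The bucket overflow example above can be sketched in a few lines of Python. This is only an illustrative in-memory model, not how a DBMS lays out disk blocks: the names `Bucket` and `StaticHashFile`, the choice of 10 buckets, and the use of a Python list in place of a linked list of record pointers are all assumptions made for the sketch. The hash function is the K mod M function from earlier in the lecture.

```python
BUCKET_CAPACITY = 3   # records per bucket, as in the example above
NUM_BUCKETS = 10      # M; fixed for the life of the file (assumed value)


class Bucket:
    """Hypothetical stand-in for one disk bucket plus its overflow chain."""

    def __init__(self):
        self.records = []    # main block: at most BUCKET_CAPACITY keys
        self.overflow = []   # models the linked list of overflow records


class StaticHashFile:
    """Toy static hash file: the number of buckets never changes."""

    def __init__(self, num_buckets=NUM_BUCKETS, capacity=BUCKET_CAPACITY):
        self.capacity = capacity
        self.buckets = [Bucket() for _ in range(num_buckets)]

    def _bucket_for(self, key):
        # h(K) = K mod M gives a relative bucket number.
        return self.buckets[key % len(self.buckets)]

    def insert(self, key):
        bucket = self._bucket_for(key)
        if len(bucket.records) < self.capacity:
            bucket.records.append(key)      # room left in the main bucket
        else:
            bucket.overflow.append(key)     # bucket full: use overflow chain

    def search(self, key):
        """Return where the key was found: 'bucket', 'overflow', or None."""
        bucket = self._bucket_for(key)
        if key in bucket.records:
            return "bucket"
        if key in bucket.overflow:
            return "overflow"
        return None


hash_file = StaticHashFile()
for key in (7, 17, 27, 37, 47):   # five keys that all hash to bucket 7
    hash_file.insert(key)

print(hash_file.buckets[7].records)    # [7, 17, 27]
print(hash_file.buckets[7].overflow)   # [37, 47]
```

The first three keys fill the bucket and the last two land in the overflow chain, so a lookup for 37 or 47 costs one extra step. Because the number of buckets is fixed, a growing file only makes these chains longer, which is the limitation of static hashing noted above.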

Exercise 1:
◼ Given the following input (4322, 1334, 1471, 9679, 1989, 6171, 6173, 4199) and the hash function x mod 10, which of the following statements are true?
i. 9679, 1989, 4199 hash to the same value
ii. 1471, 6171 hash to the same value
iii. All elements hash to the same value
iv. Each element hashes to a different value
(A) i only (B) ii only (C) i and ii only (D) iii or iv

Answer 1
Key     x mod 10
4322    2
1334    4
1471    1
9679    9
1989    9
6171    1
6173    3
4199    9

Answer 1:
◼ Statement i is true: 9679, 1989, and 4199 all hash to 9.
◼ Statement ii is true: 1471 and 6171 both hash to 1.
◼ Statements iii and iv are false.
◼ The correct choice is (C) i and ii only.

Question 2:
◼ The keys 12, 18, 13, 2, 3, 23, 5 and 15 are inserted into an initially empty hash table of length 10 using open addressing with hash function h(k) = k mod 10 and linear probing. What is the resultant hash table?

Answer 2
Key     k mod 10
12      2
18      8
13      3
2       2
3       3
23      3
5       5
15      5

Answer 2 (cont’d.)
◼ Inserting the keys in order and moving each colliding key to the next free slot gives:
Slot    Key
0       (empty)
1       (empty)
2       12
3       13
4       2
5       3
6       23
7       5
8       18
9       15

Hashing Techniques (cont’d.)
◼ Hashing techniques that allow dynamic file expansion
  ◼ Extendible hashing
    ◼ File performance does not degrade as the file grows
  ◼ Dynamic hashing
    ◼ Maintains a tree-structured directory
  ◼ Linear hashing
    ◼ Allows the hash file to expand and shrink its number of buckets without needing a directory

Extendible hashing
◼ Extendible hashing maintains a directory, which is essentially an array containing bucket addresses.
◼ The directory has 2^d entries, where d is called the global depth.
◼ The global depth d is the number of bits of the hash value used to index into the directory.
◼ Each directory entry points to a bucket where records are stored.
◼ The hash function generates a hash value for a given record, and the first d bits of this hash value determine the index in the directory.

Extendible hashing (cont’d.)
◼ The value of d can be increased or decreased by one at a time, thus doubling or halving the number of entries in the directory array.
◼ Doubling is needed if a bucket whose local depth d’ is equal to the global depth d overflows.
◼ Halving occurs if d > d’ for all the buckets after some deletions occur.
◼ Most record retrievals require two block accesses: one to the directory and the other to the bucket.

16.10 Parallelizing Disk Access Using RAID Technology
◼ Redundant arrays of independent disks (RAID)
  ◼ Goal: improve disk speed and access time
◼ Set of RAID architectures (levels 0 through 6)
◼ Data striping
  ◼ Distributes data transparently over multiple disks to make them appear as a single large, fast disk
  ◼ Bit-level striping
  ◼ Block-level striping
◼ Improving performance with RAID
  ◼ Data striping achieves higher transfer rates

RAID (cont’d.)
a) Bit-level striping: a byte is split and its individual bits are stored on independent disks.
b) Block-level striping: blocks are striped across the disks.
◼ Every disk participates in every read or write operation; the number of accesses per second would remain the same as on a single disk, but the amount of data read in a given time would increase fourfold.

Parallelizing Disk Access Using RAID Technology (cont’d.)
◼ Improving reliability with RAID (Check notes)
◼ Redundancy techniques: mirroring and shadowing
◼ Data is written redundantly to two identical physical disks that are treated as one logical disk. When data is read, it can be retrieved from the disk with the shorter queuing, seek, and rotational delays. If a disk fails, the other disk is used until the first is repaired.

RAID organizations and levels (Read)
◼ Level 0
  ◼ Data striping, no redundant data
  ◼ Splits data evenly across two or more disks
◼ Level 1
  ◼ Uses mirrored disks

Parallelizing Disk Access Using RAID Technology (Read)
◼ RAID organizations and levels (cont’d.)
◼ Level 2
  ◼ Hamming codes for memory-style redundancy
  ◼ Error detection and correction
◼ Level 3
  ◼ Single parity disk, relying on the disk controller
◼ Levels 4 and 5
  ◼ Block-level data striping
  ◼ Data distributed across all disks (level 5)

Parallelizing Disk Access Using RAID Technology (Read)
◼ RAID organizations and levels (cont’d.)
◼ Level 6
  ◼ Applies the P+Q redundancy scheme
  ◼ Protects against up to two disk failures by using just two redundant disks
◼ Rebuilding is easiest for RAID level 1
  ◼ Other levels require reconstruction by reading multiple disks
◼ RAID levels 3 and 5 are preferred for large-volume storage

RAID Levels (Read)
Figure 16.14 Some popular levels of RAID
(a) RAID level 1: Mirroring of data on two disks
(b) RAID level 5: Striping of data with distributed parity across four disks

16.12 Summary
◼ Magnetic disks
  ◼ Accessing a disk block is expensive
◼ Commands for accessing file records
◼ File organizations: unordered, ordered, hashed
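
As a rough illustration of the parity-based redundancy in section 16.10, and of why rebuilding at levels other than RAID 1 means reading multiple disks, the sketch below models one stripe across four disks. It assumes that the parity block is the byte-wise XOR of the data blocks, which is the usual scheme for single-parity RAID but is not stated on the slides. The function name `xor_blocks` and the four-byte sample blocks are hypothetical.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


# One stripe: three data blocks plus one parity block (four disks in total).
data_blocks = [b"DATA", b"BASE", b"RAID"]
parity = xor_blocks(data_blocks)

# Simulate the failure of the disk holding the second data block.
surviving = [data_blocks[0], data_blocks[2], parity]

# Rebuilding needs every surviving disk in the stripe, not just one copy.
rebuilt = xor_blocks(surviving)
print(rebuilt)  # b'BASE'
```

With mirroring (RAID level 1), the lost block would simply be copied from the other disk. Here all three surviving blocks have to be read and combined, which matches the remark that rebuilding is easiest for level 1.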