Database Management Lecture 2 - Indexing Structures
Document Details
Uploaded by AmazedCrocus5543
Princess Nourah Bint Abdulrahman University
Tags
Summary
Lecture notes on database indexing techniques. The lecture covers different types of indexes, such as primary indexes, clustering indexes, and secondary indexes, and how they are used to improve query performance. The notes also introduce multi-level indexes, particularly in the context of B-trees and B+-trees, along with their associated advantages and implementation details.
Full Transcript
1 IS 221 Database Management Lecture 2 Indexing Structures for Files and Physical Database Design Book: Fundamentals of Database Systems – Seventh Edition : Chapter 17: Indexing Structures for Files and Physical Database Design...
1 IS 221 Database Management Lecture 2 Indexing Structures for Files and Physical Database Design Book: Fundamentals of Database Systems – Seventh Edition : Chapter 17: Indexing Structures for Files and Physical Database Design 1 Chapter Outline Types of Single-level Ordered Indexes Primary Indexes Clustering Indexes Secondary Indexes Multilevel Indexes Dynamic Multilevel Indexes Using B-Trees and B+- Trees 2 Book index 3 Introduction Indexes used to speed up record retrieval in response to certain search conditions Index structures provide secondary access paths Any field can be used to create an index The index is usually specified on one field of the data file ,although it could be specified on several fields (Multiple indexes can be constructed). A single-level index is an auxiliary file that makes it more efficient to search for a record in the data file. – One form of an index is a file of entries , which is ordered by field value Most indexes based on ordered files – Tree data structures organize the index 4 Introduction (Contd.) The index file usually occupies considerably less disk blocks than the data file because its entries are much smaller A binary search on the index yields a pointer to the file record Indexes can also be characterized (i) Dense (ii) Sparse or Non- Dense 5 Dense Vs. Sparse Index A dense index has an A sparse (or nondense) index entry for every index, on the other hand, search key value (and has index entries for only hence every record) in the some of the search values data file. 6 Types of Single-Level Ordered Indexes Ordered index similar to index in a textbook Indexing field (attribute) – Index stores each value of the index field with list of pointers to all disk blocks that contain records with that field value Values in index are ordered Types of Single-Level Ordered Indexes: 1. Primary index ▪ Specified on the ordering key field of ordered file of records 2. Clustering index ▪ Used if numerous records can have the same value for the ordering field 3. Secondary index ▪ Can be specified on any nonordering field ▪ Data file can have several secondary indexes 7 Primary index Ordered file with two fields – Primary key, K(i) – Pointer to a disk block, P(i) Defined on an ordered data file The data file is ordered on a key field Includes one index entry for each block in the data file; the index entry has the key field value for the first record in the block, which is called the block anchor A similar scheme can use the last record in a block. A primary index is a nondense (sparse) index, since it includes an entry for each disk block of the data file and the keys of its anchor record rather than for every search value. 8 Primary index on the ordering key field Ordered based on PK values Block 1 Block 2 Block 3 The first record in any Block is called: Block Anchor or Anchor record 9 Example: Given the data file EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL,... ). This file is ordered based on the employee SSN Suppose that: record size R=150 bytes, block size B=512 bytes, r=30000 records ◻ For data file blocking factor Bfr= B div R= 512 div 150= 3 records/block number of file blocks b= (r/Bfr)= (30000/3)= 10000 blocks ◻ For an index on the SSN field, assume the field size V SSN=9 bytes, assume the block pointer size PR=7 bytes. Then: index entry size RI=(VSSN+ PR)=(9+7)=16 bytes index blocking factor BfrI= B div RI= 512 div 16= 32 entries/block number of index blocks bI= (r/ BfrI)= (10000/32)= 313 blocks binary search on the index needs log2bI= log2 313 = 9 block accesses To locate a record with a search key value requires a binary 10 Clustering indexes Ordered file with two fields – Non-key field (clustering field) – Pointer to a disk block, P(i) Defined on an ordered data file The data file is ordered on a non-key field unlike primary index, which requires that the ordering field of the data file have a distinct value for each record. Includes one index entry for each distinct value of the field; the index entry points to the first data block that contains records with that field value. It is another example of nondense index where Insertion and Deletion is relatively straightforward with a clustering index. 11 A Clustering Index Example A clustering index on the DEPTNUMBER ordering non-key field of Ordered based on non-key an EMPLOYEE file. values (Deptnumber) Block 1 Block 2 Question: Insert, Delete? 12 Another Clustering Index Example 13 Secondary indexes A secondary index provides a secondary means of accessing a file for which some primary access already exists. Ordered file with two fields – Some non-ordering field of the data file. – Block pointer or record pointer, P(i) The data file records could be ordered or unordered. The secondary index may be created on a field that is a non- ordering candidate key and has a unique value in every record , OR on a non-ordering non key field with duplicate values. Usually need more storage space and longer search time than primary index There can be many secondary indexes (and hence, indexing fields) for the same file. 14 Example of a Dense Secondary Index(with block pointers) on a non-ordering candidate key field of a file 15 Secondary index on non-ordering non-key field Option 1: is to include duplicate index entries with the same K(i) value—one for each record. This would be a dense index. Option 2 is to have variable-length records for the index entries, with a repeating field for the pointer. We keep a list of pointers in the index entry for K(i)—one pointer to each block that contains a record whose indexing field value equals K(i) Option 3, which is more commonly used, is to keep the index entries themselves at a fixed length and have a single entry for each index field value, but to create an extra level of indirection to handle the multiple pointers. In this nondense scheme, the pointer P(i) in index entry points to a disk block, which contains a set of record pointers; each record pointer in that disk block points to one of the data file records with value K(i) for the indexing field. If some value 16 K(i) occurs in too many records, so that their record pointers Secondary index on non-ordering non-key field 17 Summary of single –level ordered indexes 18 Multilevel Indexes Because a single-level index is an ordered file, we can create a primary index to the index itself. Multilevel index allows faster data retrieval, reduces search space, and improves query performance. Index file: Considered first (or base level) of a multilevel index Second level: Primary index to the first level Third level: Primary index to the second level We can repeat the process, creating a third, fourth,..., top level until all entries of the top level fit in one disk block A multilevel index can be created for any type of first- level index (primary, secondary, clustering) as long as the first-level index consists of more than one disk 19 block A Two-level Primary Index: Example of Multilevel index Figure A two-level primary index resembling ISAM (indexed sequential access method) organization 20 Exercise Assume an ordered file with the following given: The ordering field of the file is a key. The file has 1000000 records of size 1000 bytes each. The disk block is of size 4096 bytes (unspanned allocation). The index record is of size 32 bytes. How many disk block accesses are needed to retrieve a random record when searching for a record? 1. Using no index ? 2. Using a primary index ? 3. Using a multilevel index with two levels ? 21 1. Using no index Average access time = log2 b 4096/1000 = 4.096 = 4 rec/block # of block = 1000000/4 = 250000 block Average access time (binary search) = log2 250000 = 17.93 = 18 block access 2. Using a primary index (FIRST LEVEL) Index bfr= 4096/32 = 128 rec/block # index block = 250000/128 = 1953.29 = 1954 block Average access time (binary search) = log2 1954 = 10.9 = 11 +1 = 12 block access 3.Multi level index bfr= 4096 /32 = 128 rec/block # index block = 1953/128 = 15 block Average access time (binary search) = log2 15 = 4 = 4 +1 + 1 + = 6 block access 22 Multi-Level Indexes Multilevel index reduces the number of blocks accessed when searching for a record , given its indexing field value. We are still faced with the problems of dealing with index insertion and deletions, because all index levels are physically ordered files. To retain the benefits of using multilevel indexing while reducing index insertion and deletion problems, designers adopted a multilevel index that leaves some space in each of its blocks for inserting new entries. This is called a dynamic multilevel index and is often implemented by using data structures called B -trees and B+ -trees 23 Dynamic Multilevel Indexes Using B- Trees and B+ -Trees B -tree and B+ -trees are special cases of the well-known search data structure known as tree. Tree data structure terminology Tree is formed of nodes Each node (except root) has one parent and zero or more child nodes Leaf node has no child nodes Unbalanced if leaf nodes occur at different levels Nonleaf node is called an internal node Subtree of node consists of node and all descendant nodes 24 Tree Data Structure A tree data structure that shows an unbalanced tree 25 Tree Data Structure Search tree used to guide search for a record – Given value of one of record’s fields A node in a search tree with pointers to subtrees below it Algorithms necessary for inserting and deleting search values into and from the tree A search tree of order p = 3 26 Dynamic Multilevel Indexes Using B-Tree and B+ -Tree In B-Tree and B+-Tree data structures, each node corresponds to a disk block Most multi-level indexes use B-tree or B+-tree data structures because of the insertion and deletion problem. These trees are always balanced. These data structures allow efficient insertion and deletion of new search values. o Each node of order p can have at most p-1 search values o Each node is kept between half-full and completely full o An insertion into a node that is not full is quite efficient o If a node is full the insertion causes a split into two nodes. Splitting may propagate to other tree levels 27 B-Tree Provide multi-level access structure B-tree structures (a) A node in a B-tree with q−1 search values (b) A B-tree of order p=3. The values were inserted in the order 8, 5, 1, 7, 3, 12, 9, 6 28 B+ -Tree Data pointers stored only at the leaf nodes o Leaf nodes have an entry for every value of the search field, and a data pointer to the record if search field is a key field Internal nodes o Some search field values from the leaf nodes repeated to guide search The nodes of a B+-tree (a) Internal node of a B+-tree with q−1 search values (b) Leaf node of a B+-tree with q−1 search values and q−1 data pointers 29 Difference between B-tree and B+- tree In a B-tree, pointers to data records exist at all levels of the tree In a B+-tree, all pointers to data records exists at the leaf- level nodes A B+-tree can have less levels (or higher capacity of search values) than the corresponding B-tree 30 Example 1 : Insertion in a B-tree B -tree of order p=4 So, In each node there are 4 pointers, and p-1=4-1= 3 key values 31 Example 1: Insertion in B-tree 9 Add 13.22 ,7 1 5 21 9 Add 10 1 5 7 13 21 22 9 21 1 5 7 10 13 22 32 Add 11 9 21 1 5 7 10 11 13 22 9 13 21 Add 14 1 5 7 10 11 14 22 33 13 7 9 21 1 5 8 10 11 14 16 22 34 Example 2: Insertion in a B+- tree B+ -tree of order p=4 So, In each node there are 4 pointers, and p-1=4-1= 3 key values Node is full – Split is needed Insert 20 Insert 13 N= number of elements= 4 Elements: 1, 4, 9, 13 Calculate Mid element= +1 = Mid element = 9 Insert 15 Insert 15 35 Example 2 Node is full – Split is needed N= number of elements= 4 Elements: 9, 10, 13, 15 Insert 10 Calculate Mid element= +1 = Mid element = 13 Insert 11 36 Extra example in B-tree 37 38 39