Topic 05 - Database indexing.pdf
Document Details
Uploaded by ArticulatePiccoloTrumpet7945
Singapore Management University
Full Transcript
SMU Classification: Restricted MITB [ISSS625] Query Processing and Optimisation Topic 5 Database indexing SMU Classification: Restricted Agenda Part 1: Overview Part 2: Types of indexes Part 3: Common algorithms used...
SMU Classification: Restricted MITB [ISSS625] Query Processing and Optimisation Topic 5 Database indexing SMU Classification: Restricted Agenda Part 1: Overview Part 2: Types of indexes Part 3: Common algorithms used for data indexing Part 4: Execution plans SMU Classification: Restricted Part 1 Overview SMU Classification: Restricted Motivation Data pages with records of a single table All attributes are stored together! INT DOUBLE FLOAT CHAR(8) Data page 1 Does each query INT DOUBLE FLOAT CHAR(8) need to access all? INT DOUBLE FLOAT CHAR(8) INT DOUBLE FLOAT CHAR(8) … INT DOUBLE FLOAT CHAR(8) Data page 2 INT DOUBLE FLOAT CHAR(8) INT DOUBLE FLOAT CHAR(8) INT DOUBLE FLOAT CHAR(8) … INT DOUBLE FLOAT CHAR(8) 3 SMU Classification: Restricted Motivation Data pages with records of a single table Linear search is slow! How to filter efficiently INT DOUBLE FLOAT CHAR(8) Data page 1 records based on INT DOUBLE FLOAT CHAR(8) a given attribute? INT DOUBLE FLOAT CHAR(8) INT DOUBLE FLOAT CHAR(8) … INT DOUBLE FLOAT CHAR(8) Data page 2 INT DOUBLE FLOAT CHAR(8) INT DOUBLE FLOAT CHAR(8) INT DOUBLE FLOAT CHAR(8) … INT DOUBLE FLOAT CHAR(8) 3 SMU Classification: Restricted Motivation Data pages with records of a single table Storage drives are slow! How to limit search to INT DOUBLE FLOAT CHAR(8) Data page 1 pages with relevant INT DOUBLE FLOAT CHAR(8) records only? INT DOUBLE FLOAT CHAR(8) INT DOUBLE FLOAT CHAR(8) … INT DOUBLE FLOAT CHAR(8) Data page 2 INT DOUBLE FLOAT CHAR(8) INT DOUBLE FLOAT CHAR(8) INT DOUBLE FLOAT CHAR(8) … INT DOUBLE FLOAT CHAR(8) 3 SMU Classification: Restricted Motivation Index Data pages with records of a single table INT DOUBLE FLOAT CHAR(8) Data page 1 Create INT DOUBLE FLOAT CHAR(8) an index! INT DOUBLE FLOAT CHAR(8) Data INT DOUBLE FLOAT CHAR(8) structures … improving search INT DOUBLE FLOAT CHAR(8) Data page 2 performance INT DOUBLE FLOAT CHAR(8) INT DOUBLE FLOAT CHAR(8) INT DOUBLE FLOAT CHAR(8) … INT DOUBLE FLOAT CHAR(8) 3 SMU Classification: Restricted Motivation Index pages Data pages with records of a single table Index needs INT DOUBLE FLOAT CHAR(8) Data page 1 to be INT DOUBLE FLOAT CHAR(8) stored! INT DOUBLE FLOAT CHAR(8) Index page 1 INT DOUBLE FLOAT CHAR(8) … INT DOUBLE FLOAT CHAR(8) Data page 2 INT DOUBLE FLOAT CHAR(8) INT DOUBLE FLOAT CHAR(8) INT DOUBLE FLOAT CHAR(8) … INT DOUBLE FLOAT CHAR(8) 3 SMU Classification: Restricted Motivation Benefits Improved search and filtering performance. Accessing a few index pages allows for eliminating many data pages from search. Automated use. Indexes are used transparently averting a constant need for querying a full target table directly. Automated scaling. As records are added, modified or deleted, relevant indexes do not require manual adjustment. SMU Classification: Restricted Motivation Trade-offs Additional storage cost. An index occupies additional memory pages that also require storage alongside data pages. Reduced modification performance. Each time data is inserted, updated or delete in columns, relevant indexes must be modified too. Costly creation performance. Creation of a new index on an existing table may take a lot of time. For optimal performance, monitoring is advised. Indexes are made to improve specific queries at the cost of other queries. SMU Classification: Restricted Optimization opportunity Administrators should check which indexes are being used and consider enabling/disabling indexes based on other queries executed on that table. Many RDBMSs offer statistics showing utilization of indexes and listing unutilized indexes. If a table has multiple indexes, benefits of some may get overshadowed by others. SMU Classification: Restricted Part 2 Types of indexes SMU Classification: Restricted Storage impact Indexes categorized by their impact on storage of data: Clustered Data is stored in a sorted order derived from the internal representation of an index. Modification of data in a table may require shifting all records in storage drives (very costly!). Shifting may not be required when data gets sorted to the last position and appended at the end (this is why AUTO_INCREMENT primary key indexes result in efficient INSERT operations). Non-clustered Data is stored in an order independent from an index. Most RDBMSs cannot store data with multiple distinct sorting orders simultaneously. In such databases, a table can have only one clustered index (usually: PK). SMU Classification: Restricted Role Indexes categorized by their role in a table: Primary Index created on a primary key of a table. Created on user-defined primary key attributes, or on implicit attributes automatically. Usually, it is a clustered index and implies that attributes are UNIQUE and NOT NULL. Required by RDBMS to associate a record with its storage location; usually point to a data page. Secondary User-defined indexes to speed up search and filter queries on a specific attribute set. Non-clustered; usually point to a primary key. SMU Classification: Restricted Two step search 1. Given a predicate for an indexed attribute, use a secondary index to find matching records' primary key pointers. 2. Given a primary key, use the primary index to find a record's data page pointer. Secondary Primary Data pages index index Data page 1 Data page 2 Data page 3 Data page 4 Predicate for Primary key an indexed value Data page 5 ata page 6 SMU Classification: Restricted Constraint Indexes categorized by a constraint: Unique Prevent existence of non-distinct values. An attempt at insertion of a duplicate value results in a statement error. Non-unique Duplicated values are permitted. An index may store [value-pointer] in a multi-associative collection, or store [value-[pointer]]. SMU Classification: Restricted Structure Indexes categorized by their structure: Simple index An index created on a single column of a single table. Composite index An index created on multiple columns of a single table. Various databases limit the maximum number of permitted columns (MySQL: 16). The order of columns matters! A composite index will only apply for lookups including all leftmost columns of the index. With too many columns storage of the index may get expensive. Consider calculated hashed columns as an alternative. https://dev.mysql.com/doc/refman/8.0/en/multiple-column-indexes.html SMU Classification: Restricted Algorithm supported by MySQL Indexes categorized by their algorithm: supported by MariaDB Single level Flat dense indexes. Flat sparse indexes. Bitmap indexes. Hash indexes. Multi-level B-trees (balanced search trees), B+-trees. R-Trees. SMU Classification: Restricted Part 3 Common algorithms used for data indexing SMU Classification: Restricted Algorithm Input Pointer 1 A predicate for an indexed attribute. Pointer 2 Output A set of pointers: Primary index → data page pointers Pointer 3 Secondary index → primary key pointers Predicate for ? Pointer 4 Goal an indexed In a set of pointers and values of an indexed attribute, value find pointers for which the value matches a given predicate. Pointer 5 Pointer 6 Pointer 7 SMU Classification: Restricted Algorithm Considerations: Count and data type of indexed attributes. Complexity of implementation (important for implementers, but not users). Expected count of index pages (the smaller the better). Performance of search operations (the faster the better). Performance of index modifications resulting from data modifications (the faster the better). Ability to produce deterministic results (the same input and relation produce the same output set). Ability not to deteriorate over time (in response to performed data modifications). SMU Classification: Restricted Single level Flat dense index An index is a copy of all values from an indexed column. Works on any data type; values do not have to be sorted. Search is accomplished via a linear search across as many values as records. Search operations are very slow (linear search of indexed data). (still faster than a linear search of data pages as index pages do not contain unnecessary columns) An index has a predictable size and is easy to update with data modifications. Frequently implemented in end-user software, usually too simplistic for use in RDBMSs. Easy to implement: [value-pointer]. SMU Classification: Restricted Single level Flat dense index Pointer 1 Pointer 2 Pointer 3 Predicate for Pointer 4 an indexed value Pointer 5 Pointer 6 Pointer 7 SMU Classification: Restricted Single level Flat sparse index Based on a flat naïve index. Search operations are faster (fewer values to assess). Offers a reduced size (needs fewer index pages than a naïve index). Pointers are sorted and grouped by value; only the first pointer in a group is stored. Does not have a predictable size; with all distinct values is equal to a flat naïve index. Frequently implemented in end-user software, usually too simplistic for use in RDBMSs. Easy to implement: [value-pointer]. SMU Classification: Restricted Single level Flat sparse index Sorted Pointer 1 Pointer 2 Pointer 3 Predicate for Pointer 4 an indexed value Pointer 5 Pointer 6 Pointer 7 SMU Classification: Restricted Single level Bitmap index An index is a map of bits and corresponding pointers. A sequence of bits indicates equality with one of all stored values (equal/not equal). Use only if a set of possible values (cardinality) is small, e.g. enums. Example: Gender ENUM('Male', 'Female') NULL Female Male 0 1 POINTER Points to a record WHERE Gender = 'Female' 1 0 POINTER Points to a record WHERE Gender = 'Male' 0 0 POINTER Points to a record WHERE IsNull(Gender) SMU Classification: Restricted Single level Bitmap index Very small index size; efficient storage that can be implemented as: Storage of pointers (inverted index), e.g.: A list of pointers WHERE Gender = 'Male', A list of pointers WHERE Gender = 'Female'. Storage of bits (bitmap compression). Very fast search performance; rather fast index update operations. Strong support for parallel DML and concurrent access, but implementation may be complex. Often used by data warehousing software (Apache Hive, Oracle 9i Enterprise Edition, and so on). SMU Classification: Restricted Single level Bitmap index Pointer 1 Pointer 2 Pointer 3 Predicate for Pointer 4 an indexed value Pointer 5 Pointer 6 Pointer 7 SMU Classification: Restricted Single level Hash index Maps variable-size attribute values into buckets identified by fixed-size identifiers using a hash function: Each bucket contains a list of pointers that may be subjected to a linear search. A hash function narrows the scope of a search function. Each bucket may be stored in a separate page or a sequence of pages. Efficient for attributes with data of any sizes, data types and cardinality; values do not have to be sorted. Performance of a search function depends on how uniformly values are distributed across the buckets. The worst-case scenario is that all values have been mapped into a single bucket. Frequently implemented in RDBMSs (e.g. in MariaDB, but not in MySQL). Easy to implement: [hash_value-[value-pointer]]. SMU Classification: Restricted Single level Hash index Search is a 3-step process: Map a value from a predicate using a hash function, producing a hash value. f Predicate value Hash value Search for a bucket corresponding to the hash value. Hash value search Bucket 1 Bucket 2 Bucket N Search in a bucket for records with values matching the predicate. Bucket Predicate search Value 1 Value 2 Value M SMU Classification: Restricted Single level Hash index Pointer 1 Bucket 1 Pointer 2 Pointer 3 f Bucket 2 Predicate for Hash Pointer 4 function an indexed value Pointer 5 Bucket 3 Pointer 6 Pointer 7 SMU Classification: Restricted Single level vs. multi-level Observations If values are sorted, one level of search can be seen as search through an index of the next level. We search iteratively a full interval, sub-intervals, down to a set of values matched in predicates. This is logarithmic search though a multilevel index. Each level may be stored on a separate index page. Logarithmic search may be based on: A single comparison A sequence of comparisons with an interval half-point: with multiple interval boundaries: binary search multiway search SMU Classification: Restricted Single level vs. multi-level Observations Binary search converges faster, but also progresses through more iterations: Beneficial when all values are available in memory. Detrimental when an interval for each iteration must be loaded by costly loading of pages into memory. With page switching involved, multiway search is faster: a longer search in a page already loaded limits further the number of pages considered for loading at later levels. A single comparison A sequence of comparisons with an interval half-point: with multiple interval boundaries: binary search multiway search SMU Classification: Restricted Single level vs. multi-level Note: Each node's key K (a predicate value) is also associated Observations with own pointer P. Binary search uses a data structure called a binary search tree (BST): K One parent node has many children nodes. P A parent has at most m = 2 child nodes. A parent is associated with n = 1 = m-1 keys (K) for comparison. x