Summary

This document covers lecture notes on indexes, specifically database indexes and tree structures. It includes a summary of prior knowledge on physical data representation, focusing on relational implementations like star schemas and snowflake schemas. The presentation also explains the use of indexes, highlighting the importance of efficiency and exploring different types of index structures, such as R-Trees.

Full Transcript

Lecture 07 Indexes Summary – last week Physical Level – Relational Implementation through: Star schema:improves query performance for often-used data Snowflake schema:reduce the size of the dimension tables – Array based storage How to perform lineariz...

Lecture 07 Indexes Summary – last week Physical Level – Relational Implementation through: Star schema:improves query performance for often-used data Snowflake schema:reduce the size of the dimension tables – Array based storage How to perform linearization – MOLAP, ROLAP, HOLAP This week: Indexes – Tree based indexes – Bitmap indexes Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 2 Indexes Why index? – Consider a 100 GB table;at 100 MB/s read speed we need 17 minutes for a full table scan – Consider an OLAP query:the number of Bosch S500 washing machines sold in Germany last month? Applying restrictions (product,location, and time) the selectivity would be strongly reduced – If we have 30 location,10000 products and 24 months in the DW,the selectivity is 1/30 * 1/ 10000 * 1/24 = 0,00000014 – So…we rea d 100 GB for 1.4KB of da ta …not very smart Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 3 Indexes (cont’d.) Reduce the size of rea d pages to a m inim um with indexes Product (Article) Product (Article) Product (Article) Product (Article) Time (Days) Time (Days) Time (Days) Time (Days) Full table scan Cluster primary More secondary Optimal multi- index Indexes, bitmap indexes dimensional index Scanned data Selected data Indexes generate some overhead but in DB not much in DW. Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 4 Tree Based Indexes In the beginning…there were B -Trees – Data structures for storing sorted data with amortized run times for insertion and deletion – Basic structure of a node Key Value Data Pointer Tree Node … Node Pointers Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 5 Tree Structures Search in database systems – B -tree structures allow exact search with logarithmic costs DBMS reads the whole block in memory even it needs only one data value and keep this block in memory (unless memory space is not available) for the future purposes. Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 6 Tree Structures (cont’d.) S ea rch in DW s – The data is m ultidim ensiona l, B-trees however, support only one-dimensional search Are there any possibilities to extend tree functionality for multidimensional data? Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 7 R-Tree The R-tree (Guttman,1984) is the prototype of a multi-dimensional extension of the classical B-trees Frequently used for low-dimensional applications (used to about 10 dimensions),such as geographic information systems More scalable versions:R+-Trees,R*-Trees and X- Trees (each up to 20 dimensions for uniform distributed data) Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 8 R-Tree Structure Dyna m ic Index S tructure (insert,update and delete are possible) Data structure – Da ta pages are leaf nodes and store clustered point data and data objects – Director y pages are the internal nodes and store directory entries – Multidimensional data are structured with the help of Minim um B ounding Recta ng les (MBRs) Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 9 R-Tree Example R1 R5 R3 R11 R10 root R4 Time R6 R1 R2 R3 R2 R7 XQ R4 R5 R6 R7 R8 R9 R10 R11 R8 Xp R9 XO root Q P O Product Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 10 R-Tree Characteristics L oca l g rouping for clustering – not effective for uniformly distributed data. Overlapping clusters (the more the clusters overlap the more inefficient is the index) Heig ht ba la nced tree structure (therefore all the children of a node in the tree have about the same number of successors) Objects are stored, only in the leaves – Internal nodes are used for navigation – query can be answered without searching of all levels of tree MBRs are used as a geom etr y Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 11 R-Tree Properties The root has a t lea st two children Each internal node has between m and M children M and m ≤ M / 2 are pre-defined parameters Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 12 Operations of R-Tree The essential operations for the use and management of an R-tree are – Search – Insert – Updates – Delete – Splitting Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 13 Searching in R-Trees The tree is searched recursively from the root to the leaves – One path is selected – If the requested record has not been found in that sub-tree, the next path is traversed The path selection is arbitrary Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 14 Example R1 R5 R3 root R11 R10 R4 R1 R2 R3 R6 Time X X R2 R7 R7 R8 R9 R8 Check all the objects R9 April S root in node R8 Mobile phones Product Check only 7 nodes instead of 12 Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 15 Searching in R-Trees (cont’d.) No g ua ra ntee for good performance In the worst case,all paths must traversed (due to overlaps of the MBRs) Search algorithms try to exclude as many irrelevant regions as possible (“pruning”) Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 16 Insert (or loading in DW) P rocedure – The best lea f page is chosen (ChooseLeaf) considering the spatial criteria Best leaf: the leaf that needs the smallest volume growth to include the new object – The object will be inserted there if there is enough room (number of objects in the node < M) Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 17 Insert (cont’d.) – If there is no more place left in the node,it is considered a case for overflow and the node is divided (SplitNode) Goal of the split is to result in m inim al overlap a nd a s sm all dead spa ce as possible – Interval of the parent node must be adapted to the new object (AdjustTree) – If the root is reached by division, then create a new root whose children are the two split nodes of the old root Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 18 R-Tree Insert Example R2 R2 R7 R2 R7 R1 R5 R3 R11 xP R10 xP R4 R8 R9 R8 R9 R6 R2 R7 xP Inserting P either in R7 or R8 R9 root R9 In R7,it needs more space,but does not overlap Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 19 Heuristics An object is always inserted in the nodes,to which it produces the smallest increase in volume If it falls in the interior of a MBR no enlargement is need If there are several possible nodes,then select the one with the sm a llest volum e Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 20 Insert with Overflow Overflow because M=3 R2 R7 R1 R5 R3 R11 R7b XP R10 R4 R8 R9 R6 R2 R7 XP root R8 R9 root R1 R2 R3 R4 R5 R6 R7 R7b R8 R9 R10 R11 Suppose R7 already have 3 elements. Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 21 Insert with Overflow Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 22 SplitNode If an object is inserted in a full node, then the M+ 1 objects will be divided among two new nodes The goal in splitting is that it should rarely be needed to traverse both resulting nodes on subsequent searches – Therefore use sm a ll MBRs.This leads to minimal overlapping with other MBRs Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 23 Split Example Calculate the minimum total area of two rectangles,and minimize the dea d spa ce R R R1 R2 R1 R2 X Dead space Bad split Better Split Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 24 Overflow Problem Deciding on how exactly to perform the splits is not trivia l – All objects of the old MBR can be divided in different ways on two new MBRs – The volume of both resulting MBRs should remain as small as possible Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 25 Delete P rocedure – Search the leaf node with the object to delete (FindLeaf) – Delete the object (deleteRecord) – The tree is condensed (CondenseTree) if the resulting node has < m objects – When condensing , a node is completely erased and the objects of the node which should have remained are reinserted – If the root remains with just one child, the child will become the new root Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 26 Example An object from R9 is deleted (1 object remains in R9,but m = 2) – Due to few objects R9 is deleted,and R2 is reduced (condenseTree) R2 R7 R2 R7 root R8 R8 R9 R1 R2 R3 m=2 R4 R5 R6 R7 R8 Want to delete this object R10 R11 Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 27 Update If a record is updated,its surrounding rectangle can change The index entry must then be deleted updated and then re-inserted Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 28 Block Access Cost The most efficient search in R-trees is performed when the overlap and the dea d space are minimal root E K B F A B C G S D A H J E I C M H I J K D E F G L M N N Avoiding overlapping is only possible if L data points are known in advance We did the traversing of extra 4 nodes which was not required if we knew that A and B are not overlapped. Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 29 Summary B-Trees are not fit for multidimensional data R-Trees – MBR as geometry to build multidimensional indexes – Operations:select,insert,overflow problem,node splitting ,delete – Inefficient because they a llow overlapping between neighboring MBRs – R+-trees - im prove the search performance Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 30 Next Lecture Index continued Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 31

Use Quizgecko on...
Browser
Browser