Physical File Organization and Indexing

Summary

This document discusses the principles of physical database design, focusing on the organization of storage hardware and physical databases. It covers the basics of storage hardware, mapping logical data models to physical concepts, and various record and file organization methods.

Full Transcript


Physical File Organization and Indexing

Chapter Objectives

In this chapter, you will learn to:
- grasp the basic principles of storage hardware and physical database design;
- understand how data items can be organized into stored records;
- identify various methods for primary and secondary file organization.

Opening Scenario

Now that Sober has its relational logical data model from Chapter 6 ready, it wants to understand how it can be physically implemented. The company is also wondering whether there exist any physical means to speed up the response times of frequently used queries.

This chapter focuses on the most important principles pertaining to the physical organization of records and files. As such, it can be considered as the prerequisite to Chapter 13, which applies these principles in the context of physical database organization. In this way, Chapters 12 and 13 are complementary in covering all facets of physical database design – the translation of a logical data model into an internal data model, including the design of indexes to speed up data access. First, we present some overall properties of storage devices and the impact of the mechanics of hard disk drives on the performance of data retrieval. After that, we overview the mapping of logical modeling constructs onto physical concepts, and thus of the logical data model onto the internal model. Then, we briefly discuss record organization, covering the different alternatives for organizing physical data records, which consist of individual data fields. After that comes the main body of this chapter, focusing on file organization: it covers methods to organize records into physical files, as well as techniques to efficiently search for records with certain characteristics. We discuss several approaches, such as sequential files, hashing, and the use of different index types, including B+-trees. Chapter 13 follows on from this chapter, applying the insights gained from the file organization section to the particular context of physical database organization.

12.1 Storage Hardware and Physical Database Design

Physical database design pertains to the design of an internal data model and to the way in which the logical database concepts discussed in the previous chapters are realized physically as stored records, physical files, and, ultimately, a physical database. In this chapter, we assume a relational database setting unless noted otherwise, although most concepts apply to other database types as well.

Connections
Chapter 6 discusses the basic concepts of the relational model.

12.1.1 The Storage Hierarchy

Before discussing the actual database files, we briefly focus on the storage devices on which these files reside. This section deals with the individual storage device; we come back to storage hardware in the next chapter when we discuss storage device pooling and the overall architecture of enterprise storage subsystems. A computer system's memory can be looked upon as a hierarchy (Figure 12.1), with high-speed memory that is very expensive and limited in capacity at the top, and slower memory that is relatively cheap and much larger in size at the bottom. The top of the hierarchy is the central processing unit (CPU) with its registers, in which the mathematical and logical processor operations are executed. Most often, some high-speed cache memory is physically integrated with the CPU and/or with the motherboard that contains the CPU. Cache memory operates at nearly the same speed as the CPU.
Below that, we have central storage, which is also referred to as internal memory or main memory. It consists of memory chips (also called random access memory, or RAM), whose performance is expressed in nanoseconds. Each individual byte in central storage has its own address and is directly referable by the operating system. The entirety of the memory described so far is called primary storage. This type of memory is considered volatile memory, which means its content is cleared when the power is turned off. Volatile memory certainly has its role in a database system, as it contains the database buffer as well as the runtime code of the applications and DBMS. However, the memory we will focus on in this chapter is secondary storage, which consists of persistent storage media, retaining their content even without being powered. The physical database files reside in secondary storage. The most important secondary storage device is still the hard disk drive (HDD), although solid state drives (SSDs) based on flash memory are catching up quickly.

Figure 12.1 The storage hierarchy.

Connections
Chapter 2 elaborates on the architecture of a DBMS and also includes a discussion of the database buffer.

Primary and secondary storage are divided by what's known as the I/O boundary. All memory above this boundary, although slower than the CPU, still operates at speeds that make it efficient for the CPU to "wait" until data are retrieved from primary storage. In comparison, secondary storage is much slower: the speed of HDDs and SSDs is typically expressed in milliseconds. It is not efficient for the CPU to wait until the interaction with secondary storage is completed. Rather, the CPU will switch to another task or thread until the requested data are copied from secondary storage to primary storage, or until data that were manipulated in primary storage become persistent in secondary storage. The exchange of data between secondary storage and primary storage is called I/O (input/output) and is supervised by the operating system. The operating system signals when the I/O task is completed, such that the CPU can continue processing the data. Still lower in the hierarchy, we have even slower storage technology such as optical drives (e.g., rewritable DVD or Blu-ray) and tape, which mainly serve as media for backup and archiving, rather than being considered a directly accessible layer in the storage hierarchy.

In what follows, and unless noted otherwise, we assume a HDD as the storage medium for the physical database. Given their capacity and cost, hard disks are still the preferred medium for most database settings for the time being. Still, in-memory database technology, directly exploiting central storage for database purposes, is gaining momentum for particular high-performance applications. Hybrid solutions exist as well, caching part of the physical database in RAM for higher performance, as is the case with, e.g., the Memcached NoSQL database discussed in Chapter 11. Moreover, flash memory can be expected to take over from hard disk technology in the not too distant future, but at present the supported capacity is still limited in comparison to hard disk storage. We'll now look at the internals of an HDD, as its physical concepts impact the way in which we deal with file organization and physical database design. That being said, much of the discussion in this chapter is applicable to SSDs, and sometimes in-memory databases, as well.
12.1.2 Internals of Hard Disk Drives

Hard disk drives store their data on circular platters, which are covered with magnetic particles (Figure 12.2). A HDD also contains a hard disk controller: circuitry that oversees the drive's functioning and that interfaces between the disk drive and the rest of the system.

Figure 12.2 Internals of a hard disk drive.

Reading from and writing to hard disks comes down to magnetizing and demagnetizing the spots on these platters to store binary data. HDDs are directly accessible storage devices (DASDs), which means that every location on a platter should be individually addressable and directly reachable to access its content. Since a platter is two-dimensional, movement in two dimensions should be supported by the HDD's mechanics. The first dimension of movement is disk rotation. For that purpose, the platters, or sometimes only a single platter, are secured on a spindle, which rotates at a constant speed. Movement in a second dimension is realized by positioning the read/write heads on arms, which are fixed to an actuator. There is a set of read/write heads for each writable platter surface. The actuator, along with the arms and the read/write heads, can be moved toward or away from the center of the disks. By combining disk rotation with actuator movement, each individual section of the disk is directly reachable.

The magnetic particles on the platters are organized in concentric circular tracks, with each track consisting of individual sectors. The sector is the smallest addressable unit on a HDD. Traditionally, the standard sector size was 512 bytes, but many more recent HDDs have sector sizes of up to 4096 bytes. Also, for reasons of efficiency, it often occurs that the operating system or the HDD itself does not directly address individual sectors, but rather addresses so-called disk blocks (aka clusters, pages, or allocation units), which consist of two or more physically adjacent sectors. By addressing disk blocks encompassing multiple sectors, the number of required addresses and the amount of overhead can be reduced. In what follows, we abstract from these distinctions and always use the generic term disk block, or block for short, to refer to an addressable unit of hard disk storage capacity, consisting of one or more adjacent physical sectors.

When organizing physical files on disk blocks and tracks, it is important to keep the mechanics of movement in a HDD in mind. Indeed, reading from a block, or writing to a block, identified by its block address comes down to the following. First, the actuator needs to be moved in such a way that the read/write heads are positioned above the track that contains the desired block. The time necessary to position the actuator is called the seek time. All sectors on the same track of a platter can then be read without additional seeks. Moreover, since all read/write heads are positioned on arms above one another, all tracks at the same distance from the center on the respective platter surfaces can be read without additional seeks. Such a set of tracks, with the same diameter, is called a cylinder. Once the heads are positioned appropriately, one must wait until the desired sector has rotated under the read/write head of that platter surface. The time required for that is called the rotational delay or latency. Finally, the data can be read or written. The transfer time depends on the block size, the density of the magnetic particles, and the rotation speed of the disks.
The response time to retrieve a disk block from a disk drive can be summarized as follows:

response time = service time + queuing time
service time = seek time + rotational delay + transfer time

We will not discuss the queuing time at this point; it pertains to the waiting time until the device is actually free from other jobs, and depends upon task scheduling, system workload, and facilities for parallelism. Some of these aspects are tackled in the next chapter and in later chapters. The transfer time itself is typically fixed and depends on the hardware properties of the disk drive. Still, physical file organization can be optimized in such a way that the expected seek time and, to a lesser extent, rotational delay are minimized, and this may have a considerable impact on the overall performance of a database system. For that reason, we discriminate between the expected service time for random block accesses (Trba) and the expected service time for sequential block accesses (Tsba). Trba refers to the expected time to retrieve or write a disk block independently of the previous read/write. Tsba denotes the expected time to sequentially retrieve a disk block with the read/write head already in the appropriate position from a previous read/write to a physically adjacent disk block:

Trba = Seek + ROT/2 + BS/TR
Tsba = ROT/2 + BS/TR

Trba is equal to the sum of the expected seek time, the rotational delay, and the transfer time. The seek time depends upon the number of cylinders to be traversed. Hard disk manufacturers tend to provide an average value – "average seek time" – expressed in milliseconds in their descriptions of a drive model. The expected rotational delay is calculated as one-half of the time required for a full disk rotation (ROT), also expressed in milliseconds. If the manufacturer provides the rotation speed of the drive in rotations per minute (RPM), then ROT (in milliseconds) = (60 × 1000)/RPM. The last term in the above equations represents the transfer time, which is equal to the block size (BS) divided by the transfer rate (TR). The transfer rate is expressed in megabytes per second (MBps) or megabits per second (Mbps). For Tsba, the expression is similar, with omission of the seek term.

For example, for a HDD with the following characteristics:
Average seek time: 8.9 ms
Spindle speed: 7200 rpm
Transfer rate: 150 MBps
Block size: 4096 bytes

we have the following results:

Trba = 8.9 ms + 4.167 ms + 0.026 ms = 13.093 ms
Tsba = 4.167 ms + 0.026 ms = 4.193 ms

From this example, it is recommended to organize physical files onto tracks and cylinders in such a way that seeks and, to a lesser extent, rotational delay are minimized as much as possible. This objective is a leading theme throughout most of this and the next chapter.

Drill Down

Solid state drives (SSDs) are based on integrated circuitry (most often flash memory), rather than on magnetic disk technology. Unlike HDDs, SSDs have no moving mechanical components and therefore have lower access times than HDDs and are, in several aspects, more robust. In particular, read performance of SSDs is much better than with HDDs. On the other hand, most elements mentioned with respect to HDDs also apply to SSDs: SSDs are accessible through the same controller types and I/O commands as traditional HDDs. The concepts of file system, blocks, sequential block access, and random block access apply as well.
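To make the arithmetic easy to reproduce, here is a minimal Python sketch (ours, not the book's) that evaluates the two service-time formulas for the example drive; the function and parameter names are our own choices.

```python
# Minimal sketch: expected service times for random versus sequential
# block accesses, using the Trba and Tsba formulas above.

def rot_ms(rpm: float) -> float:
    """Time for one full disk rotation, in milliseconds."""
    return 60_000 / rpm

def transfer_ms(block_bytes: int, mb_per_s: float) -> float:
    """Transfer time BS/TR, in milliseconds."""
    return block_bytes / (mb_per_s * 1024 * 1024) * 1000

def t_rba_ms(seek_ms: float, rpm: float, block_bytes: int, mb_per_s: float) -> float:
    """Random block access: seek + ROT/2 + BS/TR."""
    return seek_ms + rot_ms(rpm) / 2 + transfer_ms(block_bytes, mb_per_s)

def t_sba_ms(rpm: float, block_bytes: int, mb_per_s: float) -> float:
    """Sequential block access: ROT/2 + BS/TR (no seek needed)."""
    return rot_ms(rpm) / 2 + transfer_ms(block_bytes, mb_per_s)

# The drive from the example: 8.9 ms average seek, 7200 rpm, 150 MBps, 4096-byte blocks.
print(f"Trba = {t_rba_ms(8.9, 7200, 4096, 150):.3f} ms")  # 13.093 ms
print(f"Tsba = {t_sba_ms(7200, 4096, 150):.3f} ms")       # 4.193 ms
```

The roughly threefold difference between the two results is exactly why the chapter keeps stressing that file organizations should favor sequential over random block accesses.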
One problem of SSDs (especially in the early days) is that their blocks can only sustain a limited number of writes before they go bad, whereas most HDDs fail due to mechanical failure, not blocks going bad. This has implications for physical data management, in the sense that the DBMS or operating system must make sure not to overwrite the same sector too many times. Modern SSDs include firmware that transparently organizes data to protect the drive, a technique called wear leveling, so this is less of an issue with newer drives. Therefore, whereas a file that is updated on a HDD will generally be rewritten into its original location, a new version of a file on an SSD will typically be written to a different location. Some SSDs are not based on persistent flash memory, but on volatile DRAM (dynamic random access memory) circuitry. Such SSDs are characterized by even faster access times. However, in contrast to flash memory, DRAM does not retain its content in the case of a power outage. Therefore, DRAM-based SSDs are typically equipped with an internal battery or external power supply, which lasts long enough to persist their content into a backup storage facility in case of power failure. Finally, hybrid drives exist as well, combining SSD and HDD technology in a single unit, using the SSD as a cache for the most frequently accessed data.

12.1.3 From Logical Concepts to Physical Constructs

The main focus of this and the next chapter is on how a database is realized as a set of physical files and other constructs. The purpose is mostly to optimize update and retrieval efficiency by minimizing the number of required block accesses, especially random block accesses. An optimal tradeoff is pursued between efficient update/retrieval and efficient use of storage space. The main focus is on the physical organization of structured data, by translating logical structures into physical ones. Chapter 18, dealing with data integration, also discusses the indexing and searching of unstructured data.

Physical database design comes down to the translation of a logical data model into an internal data model, also called the physical data model. This translation takes the physical properties of the storage media into account, as well as the statistical properties of the data and the types of operations (search, insert, update, delete) that are executed on them. The internal data model should provide adequate support for the most frequent and/or most time-critical operations. Figure 12.3 recapitulates the three-layer database architecture and the position of the internal data model in it. As discussed in Chapter 1, this approach guarantees physical data independence.

Figure 12.3 Position of the internal data model.

Connections
Chapter 1 discusses the three-layer database architecture and logical and physical data independence.

The logical data model does not contain any concrete implementation-related specifications, but it does make an assumption about the actual type of DBMS used to implement the model physically. As already stated, we focus on a relational database setting, although most aspects of the discussion on physical record and file organization are applicable to other DBMS types as well. If we position the logical data model next to the internal data model, we can compare the corresponding concepts in each of them, which are to be translated from logical into physical structures.
We use both the "generic" terminology of a logical data model and the more specific terminology of a relational model. In general, a logical model defines entity records, or just records for short, as instances of entity record types (or record types for short). A record is described by its attributes. In a relational setting, we speak of a logical database as a set of tables or relations. A table consists of rows or tuples, which contain values (aka cell values, as they represent the values of the individual cells in a relational table), described by the corresponding column names (see Chapter 6). The internal data model denotes how the former concepts can be realized physically on storage media and also defines physical structures that allow these concepts to be accessed in an efficient way.

Connections
Chapter 6 discusses the basic concepts of the relational model.

The corresponding physical concepts are represented in Table 12.1. A data item (also called field) is a collection of bits or characters that represents a specific value on a physical storage medium. A stored record is a collection of data items that pertain to the same real-world entity and that represent all attributes of this entity. In this way, it is the physical representation of a tuple in a relational table. A physical file (also called dataset) is a collection of stored records that represent similar real-world entities, such as students, wines, or purchase orders. It implements a relational table or, in general, all instances of a logical record type. In most cases, all records in a physical file have a similar structure and contain data items that represent the same set of attribute types. On some occasions, it may be required to combine stored records representing different real-world concepts into a single file. In that case, the records in the file may differ in structure and attribute types, and we speak of a mixed file. Moreover, physical files may contain additional structures, such as indexes and pointers (see Section 12.3.5), to efficiently search and/or update the file and its contents.

Table 12.1 Logical and internal data model concepts

| Logical data model (general terminology) | Logical data model (relational setting) | Internal data model |
| Attribute type and attribute | Column name and (cell) value | Data item or field |
| (Entity) record | Row or tuple | Stored record |
| (Entity) record type | Table or relation | Physical file or dataset |
| Set of (entity) record types | Set of tables or relations | Physical database or stored database |
| Logical data structures | Foreign keys | Physical storage structures |

Finally, a physical database (also called stored database) is an integrated collection of stored files. In this way, it contains data items and stored records describing different kinds of real-world entities (e.g., suppliers and purchase orders) as well as their interrelationships (e.g., the fact that a particular purchase order is placed with a particular supplier). The logical structures that model the relationships between record types, such as the foreign keys in a relational model, also yield a physical representation. A stored database contains physical storage structures to represent these logical interrelations, as well as to support the efficient retrieval and manipulation of stored records according to these interrelations (e.g., to execute a join query in a relational setting).
To conclude this discussion, we present the example of a simple conceptual data model, which is translated into a relational logical data model and, ultimately, into an internal data model (Figure 12.4). The relationship type between a supplier and his/her corresponding purchase orders is represented by means of a foreign key in the relational data model. In the internal data model, it is implemented by storing a supplier record physically adjacent to its purchase order records. If supplier records and the corresponding purchase orders are often retrieved together, this contiguous organization is beneficial, as they can be retrieved with a consecution of sequential block accesses, thus avoiding more inefficient random block accesses. In addition, a separate index is provided, which allows one to efficiently look up the supplier in question. The latter refers to the suppliers not by means of physical contiguity, but by means of pointers.

Figure 12.4 Example of conceptual, logical, and internal data model.

It is important to note that Figure 12.4 provides just one possible way of physically realizing the logical data model. Depending on the storage device and the statistical properties of the data and queries (number of tuples, average number of purchase orders per supplier, most frequently executed query types, etc.), other physical models may be more appropriate. For example, if purchase orders are only rarely retrieved according to the supplier, but mostly according to the PODate, an index on PODate is more appropriate. Moreover, at all levels (stored record, physical file, and physical database) either direct physical contiguity or the indirection of pointers can be used to relate constructs (data items, stored records, index entries, etc.) to one another. In the rest of this chapter we discuss the different possible approaches. We start with a brief overview of some elements of physical record organization, such as the way in which data items are organized into stored records. Then, we deal with file organization and indexing, including the organization of stored records into physical files. In the next chapter we focus on physical database organization, with an emphasis on how physical storage structures can enhance the efficiency of querying data in a single table, as well as across tables, by means of a join query.

Retention Questions
- What is meant by the storage hierarchy?
- Discuss the basic functioning of a hard disk drive.
- Discuss how the following logical data model concepts can be mapped to physical data model concepts in a relational setting: attribute type and attribute; (entity) record; (entity) record type; set of (entity) record types; logical data structures. Illustrate with examples.

12.2 Record Organization

Record organization refers to the organization of data items into stored records. Each data item embodies an attribute of a particular real-world entity that is represented by the stored record. The physical implementation of the data item is a series of bits; the actual format depends on the attribute's data type. Typical data types in a relational database setting were discussed in Chapter 6. They include numeric data types (e.g., integer and float), character strings of variable or fixed length (e.g., character or varchar), date- and time-related data types (e.g., date, time, timestamp), and the Boolean data type to represent truth values.
In many cases, the data types BLOB and CLOB are also supported in order to capture large chunks of binary data and text data, respectively (see Chapter 9). We only briefly highlight some aspects pertaining to organizing data items into stored records, as a database administrator typically has only limited impact on how this is realized in a particular DBMS. We focus on the following techniques: relative location, embedded identification, and the use of pointers and lists.

Connections
Chapter 6 discusses the various data types in a relational database setting. Chapter 9 reviews the BLOB and CLOB data types.

The simplest, and most widespread, technique for record organization is relative location. Here, only the attributes are stored. The data items that represent the attributes of the same entity are stored on physically adjacent addresses. The corresponding attribute types are not stored with them; they are determined implicitly by the relative ordering in which the data items occur, based on metadata about the record structure in the catalog. In this way, each data item can be identified by its relative location. This is the simplest and most efficient approach in terms of storage space. Figure 12.5 shows a partial relational table definition and the corresponding record structure.

Figure 12.5 Example of record organization with relative location.

While relative location is the most common approach, it becomes somewhat problematic if the record structure is highly irregular (e.g., if many attributes are not always present in each record). In that case, the relative location cannot be used to determine which attribute a data item corresponds to. A solution could be to always retain empty storage space for the missing attributes, but this is obviously not efficient in terms of storage use if the irregularities are very frequent. An alternative solution is embedded identification. Here, the data items representing attributes are always preceded by the attribute type. Only non-empty attributes of the record are included (Figure 12.6). Because the attribute types are registered explicitly, missing attributes are not a problem and there is no need to store the attributes in a fixed order to identify them. The obvious disadvantage is the extra storage space required for the attribute types, but for highly irregular records this approach is still more efficient than relative location. Note that this way of working, with explicitly embedded metadata on attribute types, is quite similar to the approaches in languages for semi-structured data such as XML and JSON.

Figure 12.6 Example of record organization with embedded identification.

Connections
Chapter 10 discusses XML and JSON.

A third option is to use pointers and lists. These are discussed in more detail in Section 12.3, but can also be used for the sake of record organization. There are different possibilities. Figure 12.7 shows an example in which only attributes are stored. There is a regular record structure, except for the fact that the number of addresses may be different for each person. Therefore, for each person, one address is stored physically adjacent to the other data items; but if a person has multiple addresses, a pointer is included that refers to the storage location where the other addresses are positioned. In this way, irregularities can be dealt with without affecting the overall record structure.

Figure 12.7 Example of record organization with pointers and lists.

This is just one example of dealing with variable-length records.
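To make the contrast between relative location and embedded identification concrete, here is a small, hypothetical Python sketch of the two encodings; the attribute names, values, and byte formats are illustrative assumptions, not an actual DBMS record layout.

```python
# Minimal sketch: the same record stored with relative location versus
# embedded identification.
import struct

# Relative location: only the values are stored, in a fixed order known
# from the catalog (here assumed: id INT, year_of_birth INT, gender CHAR(1)).
relative = struct.pack("<ii1s", 285719, 1985, b"M")

# Embedded identification: each stored value is preceded by its attribute
# name; empty (None) attributes are simply left out.
def embed(**attrs) -> bytes:
    parts = []
    for name, value in attrs.items():
        if value is not None:                 # omit missing attributes
            parts.append(f"{name}={value}".encode())
    return b";".join(parts)

embedded = embed(id=285719, year_of_birth=1985, gender="M", email=None)

print(len(relative), relative)   # 9 bytes, but every attribute slot is fixed
print(len(embedded), embedded)   # larger, yet irregular records pose no problem
```

The size difference illustrates the tradeoff from the text: relative location is the most compact, while embedded identification pays extra bytes for the attribute names in exchange for tolerating missing attributes, much like JSON does.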
This variability can have different causes. A first cause may be one or more attribute types that have a data type with variable length (e.g., the VARCHAR type). A second possible cause is that one or more attribute types may be multi-valued, as is the case with the Address attribute type in the example above. A third possibility is that certain attribute types are optional and occur only for some entities, as discussed previously. A fourth possible reason for variable-length records is when we have a mixed file containing different kinds of records (e.g., both supplier and purchase order records). Another alternative for dealing with variable-length records is to use delimiters that explicitly separate the respective attributes. Yet another possibility is to use an indirect structure, consisting of pointers, which themselves have a fixed length and are stored physically next to one another. In this way, the record has a regular format, but the pointers point to the location of the actual data items, which may have a variable length. This level of indirection is used very often when dealing with BLOB and CLOB data. BLOB and CLOB data types are mostly stored separately from the other data types, as they are much larger in size and thus require other methods to be stored and retrieved efficiently. In that way, a record can contain the actual values of regular data types such as integer and character, as well as pointers to "large" data items, with the actual BLOB and CLOB values stored in a separate file or file area. In Figure 12.8, we show some typical organizations of both fixed-length and variable-length records.

Figure 12.8 Dealing with fixed- and variable-length records.

A last important concept in the context of record organization is the blocking factor (BF). The latter indicates how many records are stored in a single disk block. For a file with fixed-length records, BF is calculated as follows:

BF = ⌊BS/RS⌋

In this formula, BS denotes the block size and RS the record size, both represented in bytes. The floor function ⌊x⌋ rounds down the x value to an integer. For variable-length records, BF denotes the average number of records in a block. The blocking factor is an important value when organizing physical records for efficient access, as it determines how many records are retrieved with a single read operation, without intermediate seeks and/or rotational delay.

Retention Questions
- Discuss and contrast different techniques to organize data items into stored records.

12.3 File Organization

Physical file organization pertains to the organization of stored records into physical files or datasets. We first introduce some introductory concepts such as search keys and the distinction between primary and secondary file organization, before tackling the actual file organization techniques.

12.3.1 Introductory Concepts: Search Keys, Primary, and Secondary File Organization

The records in a physical file are organized in such a way that they can be retrieved efficiently according to one or more search keys. A search key is a single attribute type, or set of attribute types, whose values determine the criteria according to which records are retrieved. Most often, these criteria are formulated by means of a query language, such as SQL in the case of relational databases.

Connections
Chapter 7 discusses SQL.
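As a quick check of the blocking-factor formula, the following snippet (our own illustration) computes BF for the block and record sizes that reappear in the sequential-file example of Section 12.3.3.

```python
# Minimal sketch: blocking factor for fixed-length records.
import math

def blocking_factor(block_size: int, record_size: int) -> int:
    """BF = floor(BS / RS): fixed-length records per disk block."""
    return math.floor(block_size / record_size)

# 2048-byte blocks and 100-byte records give 20 records per block, so a
# 30,000-record file needs ceil(30000 / 20) = 1500 blocks.
bf = blocking_factor(2048, 100)
print(bf, math.ceil(30_000 / bf))   # 20 1500
```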
A search key can be a primary key or candidate key, in which case the search only yields a single result, since no two stored records can have the same primary or candidate key value in a physical file. Examples of such unique search keys are a customer ID, a license plate number, the combination of a flight number and the date of departure, etc. For example, a search key "customerID" with value "0285719" would yield only a single record, or none at all if there is no record with such a key value in the file. Although primary keys are often used to retrieve data, a search key can just as well consist of one or more non-key attribute types. Indeed, one often needs to retrieve data according to other criteria, which do not represent unique identifiers. For example, the "class" of a flight seat (economy class, business class, first class, etc.) is a search key that can be used to retrieve seats, but the result is not necessarily unique and may contain many seats. A search key can be composite. For example, the search key (country, gender) with values (USA, F) yields all female customers that live in the USA.

Finally, search keys can also be used to specify range queries. These are queries that retrieve all records in which some attribute value is between a lower and an upper limit, such as all customers with a search key value for "YearOfBirth" between 1980 and 1990.

Apart from the above, there are typically also attribute types in a stored record that are never, or only rarely, used as search keys. They can, however, be retrieved to provide additional information to a record that was selected according to other criteria. For example, a customer's street address may never be used as a search key, but can be displayed as part of the result if the record was retrieved according to the customerID or the year of birth. The file organization methods we discuss in this chapter aim at optimizing record access according to primary keys, according to other candidate keys, and/or according to non-key attribute types we consider as relevant and frequent search keys.

We distinguish between two categories of file organization methods. First, there are methods that determine the physical positioning of stored records on a storage medium. We call them primary file organization methods. Examples we discuss are heap files, random file organization, and indexed sequential file organization. Some methods require a linear search on the entire file for records that match the search key: each record in the file is retrieved and assessed against the search key. Yet, the more advanced methods specify a relationship between a record's search key and its physical location. This improves retrieval speed considerably, as it allows directly accessing the storage locations that contain records that correspond to the search key, thus avoiding a full linear search. Hashing and indexing are the techniques to establish such a relationship. When implementing a physical file, the records are physically organized according to the primary file organization method. Since the primary file organization impacts the physical ordering of the records, we can apply only one primary file organization to a particular physical file, at least if we want to avoid duplicating the file. Hence, it is important to organize the file according to the search key that is most often used, or most time-critical, for retrieving the file's records.
However, in most cases we want to be able to retrieve records from the same file according to different criteria. For example, sometimes we want to retrieve customers according to their customerID, sometimes according to their country or gender, and sometimes according to the combination of both country and gender. Therefore, we need secondary file organization methods, which provide constructs to efficiently retrieve records according to a search key that was not used for the primary file organization. Secondary file organization methods do not impact a record's physical location and invariably use some kind of index, called a secondary index. In what follows, we first deal with the most important primary organization methods, starting with heap files. Then, we overview the different uses of pointers to improve physical file organization. We conclude with a discussion of secondary indexes and some particularly important index types, such as B-trees and B+-trees.

12.3.2 Heap File Organization

The most basic primary file organization method is the heap file. New records are inserted at the end of the file; there is no relationship between a record's attributes and its physical location. Consequently, adding records is fairly efficient, but retrieving a particular record or set of records according to a search key is not. The only option is to do a linear search, scanning through the entire file, and to retain the records that match the selection criteria. If a single record is sought according to its primary key, the scanning continues until the record is found, or until the end of the file is reached, meaning that the record is not present. For a file of NBLK blocks, it takes on average NBLK/2 sequential block accesses to find a record. Still, the more requests occur for records that turn out not to be in the file, the more searches will run until the end of the file, requiring NBLK block accesses. Searching records according to a non-unique search key also requires scanning the entire file, hence NBLK block accesses. Deleting a record often comes down to just flagging it as "deleted". The records are then physically removed upon periodical reorganization of the file.

12.3.3 Sequential File Organization

With sequential file organization, records are stored in ascending or descending order of a search key. This is often the primary key, but a non-unique search key (i.e., a non-key attribute type or set of attribute types) can also be used as the ordering criterion. An advantage over heap files is that it becomes much more efficient to retrieve the records in the order determined by this key, since doing so requires only sequential block accesses. Moreover, as with heap files, individual records can still be retrieved by means of a linear search, but now a more effective stopping criterion can be used, since the search key is the same as the attribute type(s) that determine(s) the order of the records. If the records are organized in ascending/descending order of this key, the linear search can be terminated once the first key value higher/lower than the required one is found; one can be assured that no more matching records exist in the remainder of the file. In addition, and even more importantly, if the sequential file is stored on a direct access storage device such as a HDD, a binary search technique can be used, which in the case of large files is far more efficient than a linear search.
A binary search algorithm is applied recursively, halving the search interval with each iteration. For a unique search key K, with values Kj, the algorithm to retrieve a record with key value Kμ is as follows:

Selection criterion: record with search key value Kμ
Set l = 1; h = number of blocks in the file (suppose the records are in ascending order of the search key K)
Repeat until the record is found, or until h < l (in which case the record is not in the file):
– i = (l + h)/2, rounded to the nearest integer
– Retrieve block i and examine the key values Kj of the records in block i:
  - if any Kj = Kμ ➔ the record is found!
  - else if Kμ > all Kj ➔ continue with l = i + 1
  - else if Kμ < all Kj ➔ continue with h = i − 1
  - else ➔ the record is not in the file

The expected number of block accesses to retrieve a record according to its primary key in a sequential file of NBLK blocks by means of a linear search is still NBLK/2 sequential block accesses. If, on the other hand, binary search is used, the expected number is log2(NBLK) random block accesses, which is much more efficient for high values of NBLK. Note that the binary search algorithm needs to be modified slightly when searching according to a non-unique search key. In that case, a few additional sequential block accesses may be called for, to retrieve all successive records with the same search key value.

For example, let's examine a sequential file with the following properties:
Number of records (NR): 30,000
Block size (BS): 2048 bytes
Record size (RS): 100 bytes

The blocking factor (BF) can be calculated in the following way: BF = ⌊BS/RS⌋ = ⌊2048/100⌋ = 20. Each block contains 20 records, and thus the required number of blocks NBLK to store the 30,000 records is 1500. If a single record is retrieved according to its primary key by means of a linear search, the expected number of required block accesses is 1500/2 = 750 sequential block accesses. On the other hand, if a binary search is used, the expected number of block accesses is log2(1500) ≈ 11 random block accesses. Even though random block accesses take more time than sequential block accesses, the binary search algorithm is much more efficient for searching sequential files than linearly scanning sequential files or heap files.

Then again, updating a sequential file is more cumbersome than updating a heap file, since now the records must be kept in order and new records cannot just be added at the end of the file. Updates are often executed in batch; they are organized according to the same ordering attribute type(s) as the actual sequential file, and then the entire file is updated in a single run. A possible alternative is to place newly added records, as well as records for which the value of the ordering attribute type(s) is updated, in a separate "overflow" file (the latter organized as a heap file). Deleted records can just be flagged, without being physically removed. In this way, the file only has to be reorganized periodically, if the overflow file becomes too large. Overflow is discussed more extensively in Section 12.3.4. In a database setting, sequential files are often combined with one or more indexes, resulting in the indexed sequential file organization method (see Section 12.3.5).

12.3.4 Random File Organization (Hashing)

The main disadvantage of sequential file organization is that many other records need to be accessed to retrieve a single required record. This problem is somewhat alleviated with binary search, but even then the number of unnecessary record retrievals may become quite large.
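To make the cost comparison concrete, here is a minimal, hypothetical in-memory stand-in for a blocked sequential file; real blocks would of course live on disk, and the layout is an assumption for illustration only.

```python
# Minimal sketch: binary search on a "sequential file", modeled as a list of
# blocks, each block a sorted list of record keys. Counts block accesses.

def binary_search_file(blocks, target):
    lo, hi = 0, len(blocks) - 1          # 0-based block interval
    accesses = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]              # one (random) block access
        accesses += 1
        if block[0] <= target <= block[-1]:
            found = target in block      # examine the records inside the block
            return (mid if found else None), accesses
        if target > block[-1]:
            lo = mid + 1                 # continue in the "right" half
        else:
            hi = mid - 1                 # continue in the "left" half
    return None, accesses                # interval empty: record not in the file

# 1500 blocks of 20 keys each, as in the example (keys 0, 1, ..., 29999):
blocks = [list(range(b * 20, (b + 1) * 20)) for b in range(1500)]
print(binary_search_file(blocks, 12345))  # found in block 617 after ~log2(1500) ≈ 11 accesses
```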
With random file organization (also called direct file organization or hash file organization), there exists a direct relationship between the value of the search key and a record's physical location. In this way, a record can be retrieved with one, or at most a few, block accesses if the key value is provided.

12.3.4.1 Key-to-Address Transformation

The relationship described above is based on hashing. A hashing algorithm defines a key-to-address transformation, such that the record's physical address can be calculated from its key value. Each time a new record is to be added to the file, this transformation is applied to its key, returning the physical address where the record should be stored (at least, if that address is still available, as discussed below). If later on the record is to be retrieved based on this search key, applying the same transformation to the key returns the address where the record can be found. Clearly, this approach is only feasible on a direct access storage device.

As discussed in Chapter 11, variations of the hashing technique can be applied in multiple contexts. Hashing is used in programming languages to map key values to addresses in internal memory (e.g., an array data type). The recent wave of NoSQL databases often relies on a variant, consistent hashing, to evenly distribute and replicate data records over the respective database nodes in a cluster setting. In what follows, we focus on the use of hashing to transform record keys into physical addresses of a persistent storage device, most often a hard disk drive. This approach is most effective when using a primary key or other candidate key as a search key, for reasons that are clarified later in this section.

Connections
Chapter 11 discusses NoSQL databases and also introduces hashing.

A key-to-address transformation consists of several steps, as represented in Figure 12.9. First, the key is converted into an integer numerical format, if it isn't an integer already. For instance, non-integer numerical values can be rounded, or alphanumerical keys can be turned into an integer by means of the characters' position in the alphabet or their ASCII codes. Then, the actual hashing algorithm is applied, where the integer key values are transformed into a spread of numbers of roughly the same magnitude as the desired addresses. The more uniformly these keys are distributed over this range of numbers, the better. Note that, quite often, the generated addresses do not pertain to individual record addresses, but rather to a contiguous area of record addresses, called a bucket. A bucket contains one or more stored record slots. Its size can be defined at will, but can also be aligned to the physical characteristics of the storage device, such as an integer number of disk blocks, tracks, or cylinders. The hashing algorithm can take on many forms, ranging from a simple hash function to more complex algorithms. However, it invariably consists of a consecution of mathematical operations, applied to a numerical representation of the key and returning a hash value that corresponds to a bucket address. In a next step, this bucket address is translated into an actual block address. For that purpose, it is multiplied by a constant that "compresses" the generated hash values into the exact range of desired block addresses. The latter is still a relative block address, that is, relative to the first block of the file.

Figure 12.9 Key-to-address transformation.
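The following Python sketch walks through these steps for an assumed alphanumeric key; the integer conversion, the prime modulus, and the address-space sizes are all illustrative choices, not a prescribed algorithm.

```python
# Minimal sketch: the successive steps of a key-to-address transformation,
# from an alphanumeric key to a relative block address.

NUM_BUCKETS = 100            # desired address space: buckets 0..99
BLOCKS_PER_BUCKET = 2        # each bucket spans two physical blocks

def key_to_relative_block(key: str) -> int:
    # Step 1: convert the key into an integer (here via the characters' byte codes).
    numeric_key = int.from_bytes(key.encode(), "big")
    # Step 2: apply the hashing algorithm (division by a prime, see below).
    hash_value = numeric_key % 101
    # Step 3: "compress" the hash values (0..100) into the exact bucket range
    # by multiplying with a constant a bit smaller than 1...
    bucket = int(hash_value * (NUM_BUCKETS / 101))
    # ...and translate the bucket address into a relative block address.
    return bucket * BLOCKS_PER_BUCKET

print(key_to_relative_block("CUST0285719"))
```

The final translation from this relative block address to an absolute device/cylinder/track/block address is, as the text notes next, the file system's job.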
A final step, which is governed by the file system and therefore can be considered beyond the scope of the key-to-address transformation, is the translation of the relative block address into an absolute address consisting of a device number, cylinder number, track number, and block number. If a bucket spans multiple physical blocks, retrieving a record according to its key comes down to a single random block access to the first block in the bucket as determined by the key-to-address transformation. This is possibly followed by one or more sequential block accesses until the entire bucket is retrieved and the record is found.

A very important consideration with respect to hashing is that it cannot guarantee that all keys are mapped to different hash values, hence bucket addresses. Whereas its purpose is indeed to distribute all keys evenly over the available address space, several records may be assigned to the same bucket. In that case, we speak of a collision and the corresponding records are called synonyms. Collisions are not a problem per se, since a single bucket normally consists of multiple record slots, but if there are more synonyms than slots for a certain bucket, the bucket is said to be in overflow. As discussed later in this section, there are different ways of dealing with overflow, but regardless of the actual overflow-handling method, it inevitably causes some records to be stored in a different location than where they were initially expected according to the key-to-address transformation. Hence, additional block accesses are needed to retrieve overflow records, and therefore overflow should be avoided as much as possible so as not to jeopardize performance. For that reason, a hashing algorithm should be chosen that distributes the keys as evenly as possible over the respective bucket addresses, thus reducing the risk of overflow.

Many hashing techniques have been developed for transforming key values into addresses, such as division, digit analysis, mid-square, folding, and base transformation. We illustrate the hashing approach with one of them: division. This is one of the simplest hashing techniques, yet in most circumstances it performs remarkably well. With the division technique, the numerical form of the key is divided by a positive integer M. The remainder of the division becomes the record address:

address(keyi) = keyi mod M

The choice of M is very important; if M is inappropriate for the key set at hand, many collisions, and therefore extensive overflow, will be the result. Common factors between the key values and M should be avoided. For this reason, the chosen M is often a prime number. For example, Figure 12.10 shows two series of key values, each time with the remainder of division by 20 and by 23. For the first series, both 20 and 23 result in a nearly uniform distribution of remainders, and hence address values, so they would both be adequate as M. However, in the second series, division by 20 yields a much worse distribution, with many records attributed to bucket addresses 00, 05, 10, and 15, whereas other buckets remain empty. Yet, division by the prime number 23 again results in a nearly uniform distribution. For a particular key set, one prime number may still perform better than another, so ideally multiple candidates are tested. Most often, a prime number is chosen that is close to, but a bit larger than, the number of available addresses, yielding roughly as many values as the number of required addresses.
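The effect of the choice of M can be reproduced in a few lines of Python; the key set below is a made-up clustered series (multiples of 5), in the spirit of the second series of Figure 12.10.

```python
# Minimal sketch: the division technique, key mod M, and why the choice of M
# matters. Keys sharing a common factor with M pile up on a few bucket
# addresses; a prime M spreads them far more evenly.
from collections import Counter

def bucket_addresses(keys, m):
    """Histogram of bucket addresses produced by key mod m."""
    return Counter(key % m for key in keys)

clustered_keys = [k * 5 for k in range(1, 41)]   # multiples of 5: 5, 10, ..., 200

print(bucket_addresses(clustered_keys, 20))  # only addresses 0, 5, 10 and 15 are used
print(bucket_addresses(clustered_keys, 23))  # prime M: nearly uniform spread
```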
The next step in the key-to-address transformation then multiplies them by a factor (a bit) smaller than 1, such that they fit perfectly into the actual address space (see Figure 12.9).

Figure 12.10 Impact of hashing technique and key set distribution on number of collisions.

12.3.4.2 Factors that Determine the Efficiency of Random File Organization

The efficiency of a hashing algorithm for a certain dataset is ultimately measured by the expected number of random and sequential block accesses to retrieve a record. Several factors have an impact on this efficiency. Retrieving a non-overflow record requires only a single random block access to the first block of the bucket denoted by the hashing algorithm. It is possibly followed by one or more sequential block accesses if the bucket consists of multiple physical blocks. To retrieve an overflow record, additional block accesses are required; hence the percentage of overflow records, as well as the overflow-handling technique, affect the performance. The former denotes how many records are in overflow; the latter determines where records that do not fit in the bucket as determined by the hashing algorithm are stored, and how many block accesses are required to retrieve them. We first discuss the parameters that impact the percentage of overflow records. After that, we briefly present some overflow-handling techniques.

As illustrated in Figure 12.10, the percentage of overflow records depends on how appropriate the hashing algorithm is to the key set. In many cases, the key values are not distributed evenly; gaps and clusters may occur, even more so if records are frequently added and deleted. It is up to the hashing algorithm to map this irregularly distributed key set as evenly as possible onto a set of addresses. The term "random" file organization is somewhat misleading in this context: the aim is not to assign records to storage addresses according to a random statistical distribution function. With a random distribution, any record would have the same chance of being assigned to any physical address in the available range. However, the real aim is to achieve a uniform distribution, spreading the set of records evenly over the set of available buckets, because this minimizes the chance of overflow and, therefore, performance degradation. Yet, in practice, the best-performing hashing algorithms closely approximate the results of a theoretical random distribution, rather than the ideal of a uniform distribution. The statistical properties of a random distribution can be used by the file designer to estimate the expected percentage of overflow records. Moreover, theoretically, a perfectly uniform distribution would require only as many record slots as there are records to be stored. Still, the reality of less-than-uniform distributions implies that more record slots are needed, because some buckets are fuller than others and the only way to avoid too extensive an overflow is to provide more record slots than strictly necessary. The following formula expresses the required number of buckets (NB) as the number of records (NR) divided by the product of the bucket size (BS, here the number of record slots per bucket, not to be confused with the block size used earlier) and the loading factor (LF):

NB = NR/(BS × LF)

The number of records is determined by the dataset. The bucket size indicates the number of record slots in a bucket. The tradeoff here is that the larger the bucket size, the smaller the chance of overflow will be.
On the other hand, a larger bucket means more additional sequential block accesses to retrieve all non-overflow records in the bucket. Therefore, the blocking factor, which is the number of records in a single block, plays a role as well. Finally, the loading factor represents the average number of records in a bucket divided by the bucket size, and indicates how "full" every bucket is on average. In this way, the loading factor embodies the tradeoff between efficient use of storage capacity and retrieval performance. A lower loading factor results in less overflow, and hence better performance, but also more wasted storage space. A higher loading factor has the opposite effect. In practice, the loading factor is often set between 0.7 and 0.9 to balance storage efficiency with performance. Clearly, random file organization is most effective with a unique search key, thus a primary key or other candidate key. Hashing can also be applied to non-unique search keys, but in that case there are more records with the same key value and, hence, more collisions. The risk of overflow then depends on the average number of records sharing the same search key value, and on the distribution of this number.

The second main factor impacting the performance of random file organization is the way in which overflow records are stored and retrieved, known as the overflow-handling technique. There are different approaches here, with overflow records being stored either in the primary area or in a separate overflow area. The primary area is the address space that also contains the non-overflow records; some techniques direct overflow records toward partially empty buckets of the primary area. Other techniques use a separate overflow area that only contains overflow records. The standard key-to-address algorithm does not apply in the overflow area. To discuss all techniques would involve far too much detail, but let's examine the small example in Figure 12.11, which illustrates how the combination of an inadequate transformation and inappropriate overflow handling may have a negative effect on retrieval performance.

The simplified file contains 18 records, with numeric keys 12, 14, 15, 19, etc. There are ten buckets, with a bucket size of four. The blocking factor is two and each bucket spans two physical blocks. Suppose we use open addressing as a simple overflow-handling technique. With open addressing, overflow records are stored in the primary area, more particularly in the next free slot after the full bucket where the record would normally have been stored according to the key-to-address transformation. A "mod 10" key-to-address transformation is used, yielding ten possible bucket addresses. The resulting key distribution is far from uniform, and does not even come close to a random distribution, resulting in bucket 5 being overloaded, whereas other buckets remain empty. Thus, despite the low loading factor of 0.45, we already have an overflow record. Indeed, the record with key 35 would normally have been stored in bucket 5, but since this bucket is already filled with other records (95, 125, etc.), the next free slot is used, in bucket 6. Retrieving record 35 now requires one random block access plus three sequential block accesses, whereas retrieving a non-overflow record would require at most one sequential block access in addition to the random block access. So indeed, overflow has a negative impact on performance.
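The following simplified Python sketch (keys only, no real disk blocks) mimics the open-addressing behavior of this example; the extra key values 145 and 165 are our own additions to fill bucket 5, as the figure's full key set is not reproduced here.

```python
# Minimal sketch: hashing with buckets and open addressing. An overflow record
# is placed in the next free slot after its "home" bucket, which is how
# record 35 ends up in bucket 6 in the example.

NUM_BUCKETS, SLOTS_PER_BUCKET = 10, 4     # 40 slots; 18 records -> LF = 0.45
buckets = [[] for _ in range(NUM_BUCKETS)]

def insert(key: int) -> int:
    home = key % 10                        # "mod 10" key-to-address transformation
    for step in range(NUM_BUCKETS):        # probe the home bucket, then successors
        b = (home + step) % NUM_BUCKETS
        if len(buckets[b]) < SLOTS_PER_BUCKET:
            buckets[b].append(key)
            return b
    raise RuntimeError("file is full")

for key in [95, 125, 145, 165, 35]:       # five synonyms that all hash to bucket 5
    insert(key)

print(buckets[5])   # [95, 125, 145, 165] -- the home bucket is full
print(buckets[6])   # [35] -- the overflow record spills into the next bucket
```

Retrieval follows the same probe sequence, which is exactly why fetching record 35 costs extra block accesses compared with a non-overflow record.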
Moreover, so far record 35 is the only overflow record, but if more records are added, record 35 may take up the storage space of other records that would rightfully be directed to bucket 6 according to the key-to-address transformation (e.g., a newly added record 86). Consequently, such records may also end up in overflow, resulting in a domino effect of more and more records being out of place and requiring additional block accesses for retrieval.

Figure 12.11 Impact of overflow on retrieval performance.

An alternative overflow-handling technique is chaining. Here, overflow records are stored in a separate overflow area, with subsequent records that overflow from the same bucket being chained together by pointers. Such a structure, where items are connected sequentially by pointers, is called a linked list. It can be used in several contexts, as discussed in detail in Section 12.3.6.2. In the context of hashing, linked lists can be used to access all records overflowing from the same bucket by following the pointers between them. The advantage is that they do not clutter the primary area, and therefore cannot cause additional overflow. The downside is that accessing linked lists results in additional random block accesses. Yet another overflow-handling technique is to use a second hashing algorithm, different from the primary one, to determine the location of overflow records.

To conclude the discussion on hashing, it is important to note that so far we have assumed that the number of records does not increase or decrease significantly over time, which means the number of insertions is more or less equal to the number of deletions. If the number of records decreases over time, the storage capacity will not be used efficiently after a while. Even worse, if the number of records increases, many buckets will be in overflow, resulting in degrading performance. In both cases, the key-to-address transformation is no longer adequate because of the changed number of records. A new transformation must be chosen, and the entire file should be rearranged according to the newly generated hash values. Luckily, several dynamic hashing techniques exist that allow a file to shrink or grow without the need for it to be completely rearranged. It is beyond the scope of this overview to discuss these techniques in detail.

12.3.5 Indexed Sequential File Organization

While random file organization, if applied adequately, is probably the most effective technique to retrieve individual records by their search key value, it is very inefficient if many records are to be retrieved in a certain order (e.g., sorted according to that same key, or if records are to be retrieved according to a range of key values). For example, retrieving all customers in order of their customerID would be inadequate, because the customer records are scattered throughout the file according to the hashing algorithm, requiring a multitude of random block accesses. In comparison, sequential file organization would be much more efficient for this task, since all records are already ordered according to customerID and just need to be retrieved sequentially. An indexed sequential file organization method reconciles both concerns: in many situations, it is only marginally less efficient in directly retrieving individual records than random file organization, and it still allows records to be stored in ascending or descending order of the search key, to cater for efficient sequential access.
12.3.5.1 Basic Terminology of Indexes

Indexed sequential file organization combines sequential file organization with the use of one or more indexes. To this end, the file is divided into different intervals or partitions. Each interval is represented by an index entry, which contains the search key value of the first record in the interval, as well as a pointer to the physical position of the first record in the interval. Depending on the case, this pointer may be realized as a block pointer or a record pointer. A block pointer refers to the physical block address of the corresponding record. A record pointer consists of the combination of a block address and a record ID or offset within this block, and refers to an actual record. An index itself is then a sequential file, ordered according to the search key values. The index consists of entries with the following format:

Index entry = ⟨search key value, block pointer or record pointer⟩

The search key can be atomic (e.g., a customer ID) or composite (e.g., the combination of year of birth and gender). In the case of composite search keys, the key values in the index entries are composite as well (e.g., ⟨1980, M⟩; ⟨1977, F⟩). For simplicity, we use examples with atomic keys hereafter, but all claims are valid for composite keys as well.

In addition, we discriminate between dense indexes and sparse indexes. A dense index has an index entry for every possible value of the search key that it indexes. Therefore, if the search key is a unique key (e.g., a primary key or other candidate key), a dense index has an index entry for each separate record. If the search key is a non-key attribute type or combination of attribute types, a dense index has an index entry for every group of records with the same value for that attribute type(s). A sparse index, on the other hand, has an index entry for only some of the search key values. Each entry refers to a group of records, and there are fewer index entries than with a dense index. Dense indexes are generally faster, but require more storage space and are more complex to maintain than sparse indexes. With an indexed sequential file organization, both sequential access, based on the physical order of the records, and random access, based on the index, are supported. Different configurations, with one or more index levels, are possible. We discuss the most typical examples in the following subsections.

12.3.5.2 Primary Indexes

With primary index file organization, the data file is ordered on a unique key (this can be the primary key or another candidate key) and an index is defined over this unique search key. For now, we work with only a single index level – multilevel indexes are covered in Section 12.3.5.4. An example is given in Figure 12.12. It depicts a file with intervals of four records. Each interval corresponds to a single disk block; hence the blocking factor is four. The records are ordered according to the primary key CustomerID. For each interval, there is an index entry, consisting of the key of the first record in the interval and a pointer referring to the disk block that contains the records in the interval. There is an index entry for each disk block, and not for each key value, so the index is sparse. To retrieve a record according to the required key value, say 12111, a binary search is executed on the index to retrieve the pointer to the block that should contain the corresponding record, at least if it is present in the file. This is the third block in the example.
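As an illustration, the following sketch (with hypothetical keys and block layout) builds such a sparse primary index, with one entry ⟨first key in block, block number⟩ per block, and retrieves a record by binary-searching the index and then scanning a single data block.

```python
import bisect

# Hypothetical data file: blocks of four records, ordered on CustomerID.
blocks = [
    [10021, 10033, 10087, 10095],
    [10112, 10345, 10789, 11223],
    [11775, 12111, 12390, 12456],
]

# Sparse primary index: one entry per block, holding the first key in it.
index_keys = [block[0] for block in blocks]        # [10021, 10112, 11775]

def lookup(key):
    """Binary search on the index, then at most one data block access."""
    pos = bisect.bisect_right(index_keys, key) - 1  # block covering the key
    if pos < 0:
        return None                                 # key below the first block
    return key if key in blocks[pos] else None      # scan that single block

print(lookup(12111))   # found in the third block -> 12111
print(lookup(10500))   # not present in the file  -> None
```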
In this way, instead of searching the entire file with data records, only the index and a single block of the actual data file need to be accessed; either the record is found in that block, or it is not present in the file. Additionally, the index entries are much smaller than the actual stored records. This means that an index file occupies considerably fewer disk blocks than the data file and can be searched much more quickly.

Figure 12.12 Example of primary index.

The expected number of block accesses to retrieve a single record using a primary index amounts to log2(NBLKI) random block accesses for a binary search on the index, with NBLKI representing the number of blocks in the index. One additional random block access is needed to retrieve the actual record, assuming the intervals correspond to individual disk blocks. We summarize the required block accesses for, respectively, linear search, binary search, and index-based search in Table 12.2.

Table 12.2 Required block accesses for linear search, binary search, and index-based search

Linear search: NBLK sba
Binary search: log2(NBLK) rba
Index-based search: log2(NBLKI) + 1 rba, with NBLKI << NBLK

12.3.8.2 Binary Search Trees

A binary search tree is searched as follows. Starting from the root node, the searched key value Kμ is compared with the key value Ki of the current node. If Ki > Kμ, then the pointer to the root of the "left" subtree, with only key values lower than Ki, is followed. Otherwise, if Ki < Kμ, the pointer to the root of the "right" subtree is followed. This subtree contains only key values higher than Ki. The same procedure is applied recursively to the chosen subtree until the node with Kμ is found or until a leaf node is reached, meaning that the key value Kμ is not present in the tree. This is illustrated in Figure 12.26, with Kμ = 24. The appropriate node is found after three steps, whereas a linear search would have taken nine steps in this case. As already discussed in the context of binary search, this performance gain becomes larger as the number of key values increases.

Figure 12.26 Example of binary search tree.
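A minimal sketch of this recursive descent follows; the node structure and key values are hypothetical, chosen so that searching for Kμ = 24 again takes three steps.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def search(node, target, steps=1):
    """Descend from the root; each comparison rules out one subtree."""
    if node is None:
        return None, steps          # fell off a leaf: target not in the tree
    if node.key == target:
        return node, steps
    if node.key > target:           # target can only be in the left subtree
        return search(node.left, target, steps + 1)
    return search(node.right, target, steps + 1)

# A small balanced tree; searching for 24 takes three steps.
root = Node(37,
            Node(18, Node(11), Node(24)),
            Node(55, Node(46), Node(62)))
node, steps = search(root, 24)
print(steps)   # -> 3
```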
The performance could be increased even further if each node contained more than one key value and more than two children. In that case, with an equal total number of key values, the height of the tree would be reduced and therefore the average and maximum number of steps would be lower. This exact consideration is the basis of the B-tree concept, as discussed in the next section.

12.3.8.3 B-Trees

A B-tree is a tree-structured index. B-trees can be considered as variations of search trees that are explicitly designed for hard disk storage. In particular, each node corresponds to a disk block, and nodes are kept between half full and full to cater for a certain dynamism of the index, hereby accommodating changes in the data file without the need for too extensive rearrangements of the index. Every node contains a set of search key values, a set of tree pointers that refer to child nodes, and a set of data pointers that refer to data records, or blocks with data records, that correspond to the search key values. The data records are stored separately and are not part of the B-tree. A B-tree of order k holds the following properties:

Every non-leaf node is of the following format: ⟨P0, ⟨K1, Pd1⟩, P1, ⟨K2, Pd2⟩, P2, …, ⟨Kq, Pdq⟩, Pq⟩, with q ≤ 2k. Hence, every non-leaf node with q key values has q data pointers and q + 1 tree pointers to child nodes.

Every Pi is a tree pointer: it points to another node in the tree. This node is the root of the subtree that Pi refers to.

Every Pdi is a data pointer: it points to the record with key value Ki, or to the disk block that contains this record.

A B-tree is a balanced tree: all leaf nodes are at the same level in the tree. Every path from the root of the B-tree to any leaf node thus has the same length, which is called the height of the B-tree.

Leaf nodes have the same structure as non-leaf nodes, except that all their tree pointers Pi are null.

Within a node, the property holds that K1 < K2 < … < Kq.

For every key value X in the subtree referred to by Pi, the following holds:
– Ki < X < Ki+1 for 0 < i < q
– X < Ki+1 for i = 0
– Ki < X for i = q

The B-tree's root node has a number of key values, and an equal number of data pointers, that varies between 1 and 2k. The number of tree pointers and child nodes then varies between 2 and 2k + 1.

All "normal" nodes (i.e., internal nodes: non-root and non-leaf nodes) have a number of key values and data pointers between k and 2k. The number of tree pointers and child nodes varies between k + 1 and 2k + 1.

Every leaf node has a number of key values and data pointers between k and 2k and no tree pointers.

If the indexed search key is non-unique, a level of indirection is introduced, similar to the inverted file approach discussed in Section 12.3.7.2. The data pointers Pdi then don't point to the records directly, but to a block containing pointers to all records satisfying the search key value Ki. Figure 12.27 provides a few simple examples to illustrate the principles of B-trees, with varying order and height. The numbers represent key values Ki, whereas the arrows represent tree pointers Pi. The data pointers Pdi are not depicted so as not to clutter the illustration, but in reality there is a data pointer for every key value.

Figure 12.27 Examples of a B-tree.

A B-tree is searched starting from the root. If the desired key value X is found in a node (say Ki = X), then the corresponding data record(s) can be accessed in the data file by following the data pointer Pdi. If the desired value is not found in the node, a tree pointer is followed to the subtree that contains the appropriate range of key values. More precisely, the subtree pointer Pi to be followed is the one corresponding to the smallest value of i for which X < Ki+1. If X is larger than all key values in the node, the last tree pointer Pq is followed. The same procedure is repeated for this subtree, and so on, until the desired key value is found in a node or until a leaf node is reached, meaning that the desired search key value is not present. This approach is again very similar to a binary search algorithm, but since the number of tree pointers is much higher than two, the fan-out, and therefore the search efficiency, is much higher than with binary search.
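The sketch below illustrates this descent over multiway nodes; the node layout is hypothetical and the data pointers are omitted for brevity. Each visited node would correspond to one block access.

```python
class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                  # K1 < K2 < ... < Kq
        self.children = children or []    # q + 1 tree pointers (empty in a leaf)
        # The data pointers Pd1..Pdq are omitted in this sketch.

def btree_search(node, target, accesses=1):
    """Return (found, nodes visited); each node = one block access."""
    i = 0
    while i < len(node.keys) and target > node.keys[i]:
        i += 1                            # stop at the first key >= target
    if i < len(node.keys) and node.keys[i] == target:
        return True, accesses             # here one would follow the data pointer
    if not node.children:
        return False, accesses            # leaf reached: key not present
    return btree_search(node.children[i], target, accesses + 1)

# A B-tree of order k = 1: nodes hold 1-2 keys and up to 3 children.
root = BTreeNode([20, 40],
                 [BTreeNode([10, 15]), BTreeNode([24, 33]), BTreeNode([51])])
print(btree_search(root, 24))   # -> (True, 2): the root plus one child block
```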
The capacity of a node equals the size of a disk block, and all nodes, except for the root, are filled to at least 50%. Hence, a B-tree uses the storage capacity efficiently, but still leaves room for additions to the data file without the need for impactful rearrangements of the index structure. If data records, and hence key values, are added, the empty space in a node is filled up. If the appropriate node for a new key value is already filled to capacity, it is split into two half-full nodes. Both will be sibling children of the original node's parent. If, upon deletion of records and key values, a node becomes less than half full, it is merged with one of its siblings to produce a node filled to at least 50%. Note that splitting or merging a node also impacts the parent node, where a tree pointer and key value are added or deleted, respectively. Therefore, a node split or merger at the parent's level may be called for as well. In rare cases, these changes may work their way up to the root node, but even then the number of changes is limited to the height of the B-tree. This is still substantially less than the required changes if the index had been organized as a sequential file.

It is very complex to make exact predictions about the required number of block accesses when searching a B-tree. There are many possible configurations: every node may contain between k and 2k key values (except for the root), and the tree may assume different shapes depending on node splits and mergers. For example, in Figure 12.27, searching for key value 24 requires three random block accesses in the first B-tree, one block access in the second B-tree, and two block accesses in the third tree. Searching for key value 17 requires three, two, and one block accesses, respectively. The height of the tree, and hence the maximum number of random block accesses to find a certain key value in the tree, will decrease as the order of the tree increases. Note also that a B-tree being balanced is an important property in this context. With a non-balanced tree, the path from root to leaf node would not have the same length for all leaf nodes, resulting in even more variation in search time.

Finally, it is worth mentioning that B-trees are sometimes also used as a primary file organization technique, hence organizing the actual data file as a search tree. The nodes then still contain search key values and tree pointers. However, instead of data pointers, they contain the actual data fields of the records that correspond to the search key values. This approach can be very efficient with small files and records with a very limited number of fields. Otherwise, the number of tree levels quickly becomes too large for efficient access.

Drill Down

B-trees were originally invented in 1971 by Rudolf Bayer and Edward McCreight, who then worked for Boeing Research Labs. It is not entirely clear what the "B" stands for; according to Edward McCreight it could be multiple things: "Boeing", "balanced", and even "Bayer" (Rudolf Bayer was the senior author of the two). However, as he stated at the 24th Symposium on Combinatorial Pattern Matching in 2013: "The more you think about what the B in B-trees means, the better you understand B-trees."

12.3.8.4 B+-Trees

Most DBMS implementations use indexes based on B+-trees rather than B-trees. The main difference is that in a B+-tree, only the leaf nodes contain data pointers. In addition, all key values that exist in the non-leaf nodes are repeated in the leaf nodes, such that every key value occurs in a leaf node, along with a corresponding data pointer. The higher-level nodes only contain a subset of the key values present in the leaf nodes. Finally, every leaf node of a B+-tree also has one tree pointer, pointing to its next sibling. In this way, these tree pointers create a linked list of leaf nodes, sequentially arranging all leaf nodes according to the key values they contain. Figure 12.28 presents some simple B+-tree examples, with the same order and search key values as the B-tree examples. Again, the data pointers, which now exist only in the leaf nodes, are not represented. In contrast to the B-trees, the B+-trees contain some redundancy in their search key values. Note also the "next" tree pointers in the leaf nodes. Sometimes "previous" pointers are present as well.

Figure 12.28 Examples of a B+-tree.
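To make the leaf-level linked list concrete, the following sketch (hypothetical structure, data pointers again omitted) walks the chained leaves to collect all key values in a given range; this is the kind of sequential traversal the "next" pointers enable.

```python
class Leaf:
    def __init__(self, keys, nxt=None):
        self.keys = keys      # key values, each paired with a data pointer in reality
        self.next = nxt       # "next" tree pointer to the sibling leaf

def range_scan(leaf, low, high):
    """Walk the linked list of leaves; no further top-down navigation
    is needed once the first qualifying leaf has been reached."""
    results = []
    while leaf is not None:
        for key in leaf.keys:
            if key > high:
                return results          # past the range: stop scanning
            if key >= low:
                results.append(key)     # in reality: follow the data pointer
        leaf = leaf.next                # sequential hop to the next leaf
    return results

# Chained leaves holding all key values in ascending order.
l3 = Leaf([46, 55, 62])
l2 = Leaf([24, 33, 37], l3)
l1 = Leaf([11, 15, 18], l2)
print(range_scan(l1, 20, 50))   # -> [24, 33, 37, 46]
```

For simplicity the scan starts from the leftmost leaf; in practice one would first descend the tree once to locate the leaf containing the lower bound of the range.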
Searching and updating a B+-tree occur in a similar way as with a B-tree. However, since only the leaf nodes contain data pointers, every search must continue down to the leaf level, which was not the case with B-trees. Still, B+-trees are often more efficient, because the non-leaf nodes do not contain any data pointers, so with the same block size their order can be higher than with a B-tree. As a consequence, the height of a B+-tree is often smaller, resulting in fewer block accesses to search the tree. In most cases, the leaf nodes of a B+-tree, which do contain data pointers, have a different order than the other tree nodes. Also, the "next" tree pointers in the leaf nodes, in combination with the fact that all search key values and data pointers are present in the leaf nodes, provide for an additional way of traversing the tree. The tree is then not navigated top-down, but by accessing several leaf nodes consecutively, starting from the leftmost leaf node and following the "next" pointers between them. The latter allows for more efficient processing of range queries, as discussed in the next chapter.

It is worth noting that other variants exist as well. In particular, there are variations on the fill factor, which denotes how "full" a non-leaf node must minimally be and which is 50% for standard B-trees and B+-trees. For example, a B-tree with a fill factor of two-thirds is often called a B*-tree.

Retention Questions

What are the differences between primary and secondary file organization?

Discuss and contrast the most important primary file organization methods.

Discuss and contrast the most important secondary file organization methods.

Discuss and contrast B-trees and B+-trees.

Summary

In this chapter we dealt with different aspects pertaining to physical file organization. First, we presented the characteristics of storage devices and how these affect the performance of physical data access. Then, we discussed the organization of, respectively, stored records and physical files. We distinguished between primary and secondary file organization methods. In this context, special attention was paid to different types of indexes and the ways in which they improve search performance. B-trees and B+-trees in particular were discussed as index types that often occur in DBMS products. The next chapter builds on these findings to discuss physical database organization, with the physical database consisting of a set of physical files and indexes.

Connections

Chapter 13 applies the principles of record organization and file organization in the context of physical database organization. It also comes back to the topic of storage hardware, discussing how individual storage devices are clustered and managed as larger entities, called enterprise storage subsystems.

Scenario Conclusion

Now that Sober has learned about various file organization methods, it has decided to physically implement each relational table using the indexed sequential file organization method.
Moreover, to speed up the execution time of its queries, it has decided to define the following indexes:

CAR(CAR-NR, CARTYPE): Primary index on CAR-NR; secondary index on CARTYPE
SOBER CAR(S-CAR-NR): Primary index on S-CAR-NR
OTHER CAR(O-CAR-NR, O-CUST-NR): Primary index on O-CAR-NR
ACCIDENT(ACC-NR, ACC-DATE-TIME, ACC-LOCATION): Clustered index on ACC-LOCATION
INVOLVED(I-CAR-NR, I-ACC-NR, DAMAGE AMOUNT): Primary index on I-CAR-NR, I-ACC-NR; secondary index on DAMAGE AMOUNT
RIDE(RIDE-NR, PICKUP-DATE-TIME, DROPOFF-DATE-TIME, DURATION, PICKUP-LOC, DROPOFF-LOC, DISTANCE, FEE, R-CAR-NR): Clustered index on PICKUP-LOC; secondary index on FEE
RIDE HAILING(H-RIDE-NR, PASSENGERS, WAIT-TIME, REQUEST-TYPE, H-CUST-NR): Clustered index on WAIT-TIME; secondary index on PASSENGERS
RIDE SHARING(S-RIDE-NR): Primary index on S-RIDE-NR
CUSTOMER(CUST-NR, CUST-NAME): Primary index on CUST-NR
BOOK(B-CUST-NR, B-S-RIDE-NR): Primary index on B-CUST-NR, B-S-RIDE-NR

Key Terms List

absolute address, actuator, binary search, binary search trees, bitmap index, block pointer, blocking factor, blocking factor of the index (BFI), B-tree, B+-trees, bucket, central storage, chaining, clustered index, collision, cylinder, data item, data pointers, delimiters, dense indexes, directly accessible storage devices (DASDs), directory, disk blocks, dynamic hashing, embedded identification, hard disk controller, hash indexes, hashing, heap file, I/O, I/O boundary, index entry, indexed sequential file organization, indexing, intervals, inverted file, join index, key-to-address transformation, latency, linear list, linear search, linked list, lists, loading factor, mixed file, multilevel indexes, nonlinear list, one-way linked list, open addressing, overflow, overflow area, overflow-handling technique, partitions, persistent storage media, physical database, physical file, pointers, primary area, primary file organization methods, primary index, primary storage, random file organization, read/write heads, record pointer, relative block address, relative location, rotational delay, search key, search key values, search tree, secondary file organization methods, secondary index, secondary storage, sectors, seek time, sequential file organization, sparse indexes, spindle, stored record, synonyms, tracks, transfer time, tree pointers, uniform distribution, variable length records, volatile memory
