Unit-4_5_DBMS.docx.pdf

Transaction

A transaction is a collection of operations that performs a single logical function in a database application. Each transaction is a unit of both atomicity and consistency. Thus, we require that transactions do not violate any database consistency constraints: if the database was consistent when a transaction started, it must be consistent when the transaction successfully terminates. During the execution of a transaction, however, it may be necessary to allow inconsistency temporarily, since either the debit of A or the credit of B must be done before the other. This temporary inconsistency, although necessary, may lead to difficulty if a failure occurs.

It is the programmer's responsibility to define the various transactions properly, so that each preserves the consistency of the database. For example, the transaction to transfer funds from the account of department A to the account of department B could be defined to be composed of two separate programs: one that debits account A, and another that credits account B. The execution of these two programs one after the other will indeed preserve consistency. However, each program by itself does not transform the database from one consistent state to a new consistent state. Thus, those programs are not transactions.

The concept of a transaction has been applied broadly in database systems and applications. While the initial use of transactions was in financial applications, the concept is now used in real-time applications in telecommunication, as well as in the management of long-duration activities such as product design or administrative workflows.

A set of logically related operations is known as a transaction. The main operations of a transaction are:

Read(A): The read operation Read(A) or R(A) reads the value of A from the database and stores it in a buffer in main memory.
Write(A): The write operation Write(A) or W(A) writes the value back to the database from the buffer.

Let us take a debit transaction from an account that consists of the following operations:
1. R(A);
2. A = A - 1000;
3. W(A);

Assume A's value before starting the transaction is 5000. The first operation reads the value of A from the database and stores it in a buffer. The second operation decreases it by 1000, so the buffer contains 4000. The third operation writes the buffer value back to the database, so A's final value is 4000.

It may also happen that the transaction fails after executing some of its operations. The failure can be caused by hardware, software, power, etc. For example, if the debit transaction discussed above fails after executing operation 2, the value of A will remain 5000 in the database, which is not acceptable by the bank. To avoid this, the database has two important operations:

Commit: After all instructions of a transaction are successfully executed, the changes made by the transaction are made permanent in the database.
Rollback: If a transaction is not able to execute all its operations successfully, all the changes made by it are undone.

Properties of a Transaction

Atomicity: As a transaction is a set of logically related operations, either all of them should be executed or none. The debit transaction discussed above should either execute all three operations or none. If it fails after executing operations 1 and 2, the new value 4000 must not be left half-applied: the database must look as if the transaction never ran. A small sketch of this all-or-nothing behaviour is shown below.
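The following is a minimal, illustrative Python sketch of the R(A)/W(A) model described above. The class and method names (SimpleDB, read, write, commit, rollback) are our own invention, not any real DBMS API: the point is only that the database is updated at commit, so a failure before commit leaves the stored value untouched.

# Illustrative sketch of the R(A)/W(A) transaction model above.
# All names here are hypothetical, for illustration only.
class SimpleDB:
    def __init__(self):
        self.store = {"A": 5000}   # committed database state
        self.buffer = {}           # per-transaction buffer in main memory

    def read(self, item):          # R(A): copy the value into the buffer
        self.buffer[item] = self.store[item]
        return self.buffer[item]

    def write(self, item, value):  # W(A): update the buffer only
        self.buffer[item] = value

    def commit(self):              # make buffered changes permanent
        self.store.update(self.buffer)
        self.buffer.clear()

    def rollback(self):            # undo: discard buffered changes
        self.buffer.clear()

db = SimpleDB()
a = db.read("A")         # 1. R(A) -> 5000
db.write("A", a - 1000)  # 2. A = A - 1000 (buffer now holds 4000)
db.commit()              # 3. W(A) made permanent; a crash before this
print(db.store["A"])     #    line would have left A = 5000 in the store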
Consistency: If operations of debit and credit transactions on the same account are executed concurrently, they may leave the database in an inconsistent state. For example, with T1 (debit of Rs. 1000 from A) and T2 (credit of Rs. 500 to A) executing concurrently, the database can reach an inconsistent state. Let us assume the account balance of A is Rs. 5000. T1 reads A (5000) and stores the value in its local buffer space. Then T2 reads A (5000) and also stores the value in its local buffer space. T1 performs A = A - 1000 (5000 - 1000 = 4000) and 4000 is stored in T1's buffer space. Then T2 performs A = A + 500 (5000 + 500 = 5500) and 5500 is stored in T2's buffer space. T1 writes the value from its buffer back to the database, so A is updated to 4000. Then T2 writes the value from its buffer back to the database, and A is updated to 5500. The effect of the debit transaction is lost and the database has become inconsistent. To maintain consistency of the database, we need concurrency control protocols, discussed later in this unit. The operations of T1 and T2 with their buffers and the database are shown in Table 1, and a short simulation of this lost update follows the table.

Table 1

T1             T1's buffer    T2             T2's buffer    Database
                                                            A = 5000
R(A);          A = 5000                                     A = 5000
               A = 5000       R(A);          A = 5000       A = 5000
A = A - 1000;  A = 4000                      A = 5000       A = 5000
               A = 4000       A = A + 500;   A = 5500       A = 5000
W(A);          A = 4000                      A = 5500       A = 4000
               A = 4000       W(A);          A = 5500       A = 5500
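A minimal Python sketch of the interleaving in Table 1 (variable names are our own): each transaction works on its own buffered copy of A, so the later write silently overwrites the earlier one.

database = {"A": 5000}

# Each transaction reads into its own buffer (a local copy).
t1_a = database["A"]      # T1: R(A) -> 5000
t2_a = database["A"]      # T2: R(A) -> 5000

t1_a = t1_a - 1000        # T1: A = A - 1000 -> 4000 in T1's buffer
t2_a = t2_a + 500         # T2: A = A + 500  -> 5500 in T2's buffer

database["A"] = t1_a      # T1: W(A) -> database now 4000
database["A"] = t2_a      # T2: W(A) -> database now 5500

print(database["A"])      # 5500: the Rs. 1000 debit is lost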
Isolation: The result of a transaction should not be visible to other transactions before it is committed. For example, let us assume that A's balance is Rs. 5000 and T1 debits Rs. 1000 from A; A's new balance is 4000. If T2 now credits Rs. 500 to A's new balance, A becomes 4500, and after this T1 fails. Then we have to roll back T2 as well, because it used a value produced by T1. So the results of a transaction are not made visible to other transactions before it commits.

Durability: Once the database has committed a transaction, the changes made by the transaction should be permanent. For example, if a person has credited $500,000 to his account, the bank cannot say that the update has been lost. To avoid this problem, multiple copies of the database are stored at different locations.

Commit Protocol in DBMS

The concept of the commit protocol was developed in the context of database systems. Commit protocols are algorithms used in distributed systems to ensure that a transaction is completed entirely or not at all. They help maintain the integrity, atomicity and consistency of the data, and they help us build robust, efficient and reliable systems.

One-Phase Commit

It is the simplest commit protocol. There is a controlling site, and there are a number of slave sites where the transaction is performed. The steps followed in the one-phase commit protocol are:
1. Each slave sends a 'DONE' message to the controlling site after it has completed its part of the transaction.
2. After sending the 'DONE' message, the slaves wait for a 'Commit' or 'Abort' response from the controlling site.
3. After receiving the 'DONE' message from all the slaves, the controlling site decides whether to commit or abort, and sends its decision to every slave.
4. Each slave performs the operation as instructed by the controlling site and sends an acknowledgement back to it.

Two-Phase Commit Protocol

It is the second type of commit protocol in DBMS. It was introduced to reduce the vulnerabilities of the one-phase commit protocol. There are two phases in the two-phase commit protocol. (A sketch of the controlling site's side of both phases is given after this section.)

Prepare Phase

1. Each slave sends a 'DONE' message to the controlling site after it has completed its transaction.
2. After getting a 'DONE' message from all the slaves, the controlling site sends a 'Prepare' message to all the slaves.
3. The slaves then vote on whether they want to commit or not. If a slave wants to commit, it sends a 'Ready' message; if it does not, it sends a 'Not Ready' message.

Commit/Abort Phase

When the controlling site receives a 'Ready' message from all the slaves:
1. The controlling site sends a 'Global Commit' message to all the slaves; the message contains the details of the transaction to be stored in the databases.
2. Each slave completes the transaction and returns an acknowledgement message to the controlling site.
3. When the controlling site has received an acknowledgement from all the slaves, the transaction is considered completed.

When the controlling site receives a 'Not Ready' message from any slave:
1. The controlling site sends a 'Global Abort' message to all the slaves.
2. After receiving the 'Global Abort' message, the slaves abort the transaction and send an acknowledgement message back to the controlling site.
3. When the controlling site has received an abort acknowledgement from all the slaves, the transaction is considered aborted.

Three-Phase Commit Protocol

It is the third type of commit protocol in DBMS. It was introduced to address the issue of blocking. This protocol has three phases:

Prepare Phase: The same as the prepare phase of the two-phase commit protocol; every site should be prepared or ready to make a commitment.

Prepare-to-Commit Phase: The controlling site issues an 'Enter Prepared State' message and sends it to all the slaves. In response, all the slave sites reply with an 'OK' message.

Commit/Abort Phase: This phase consists of the same steps as in the two-phase commit, except that no acknowledgement is sent after the process.

Advantages of Commit Protocol in DBMS

○ It helps ensure that changes remain consistent throughout the database.
○ It helps ensure that the integrity of the data is maintained throughout the database.
○ It helps maintain atomicity: either all the operations in a transaction are completed successfully or none are done at all.
○ It provides mechanisms for system recovery in the case of system failures.
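Below is a minimal, illustrative Python sketch of the controlling site's logic in the two-phase commit described above. The Slave class and its prepare/commit/abort interface are our own invention, not a real DBMS API; real implementations also log decisions and handle timeouts.

# Illustrative two-phase commit coordinator (controlling site).
def two_phase_commit(slaves):
    # Phase 1 (prepare): collect a vote from every slave.
    votes = [slave.prepare() for slave in slaves]   # "Ready" / "Not Ready"

    # Phase 2 (commit/abort): the decision is global.
    if all(vote == "Ready" for vote in votes):
        for slave in slaves:
            slave.commit()       # each slave acknowledges after committing
        return "Global Commit"
    for slave in slaves:
        slave.abort()            # every slave undoes its work
    return "Global Abort"

class Slave:                     # hypothetical participant site
    def __init__(self, ready=True):
        self.ready = ready
    def prepare(self):
        return "Ready" if self.ready else "Not Ready"
    def commit(self):
        pass                     # make local changes permanent, then ack
    def abort(self):
        pass                     # roll back local changes, then ack

print(two_phase_commit([Slave(), Slave()]))             # Global Commit
print(two_phase_commit([Slave(), Slave(ready=False)]))  # Global Abort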
Introduction to 2PL

Two-Phase Locking (2PL) is a concurrency control mechanism based on the principle that transactions must acquire locks on data items before reading or modifying them. It consists of two main phases: the growing phase, where transactions acquire locks, and the shrinking phase, where locks are released; hence the name 2PL. Note that while two-phase locking (2PL) sounds very similar to two-phase commit (2PC), they are completely different things: 2PL manages locks within a database, while 2PC coordinates distributed transactions for data consistency.

In a system with multiple concurrent users, serializability means each transaction behaves as if it were the sole active transaction in the entire database. For around 30 years there was only one widely used algorithm for serializability in databases: two-phase locking. 2PL is one of the methods to achieve serializability.

In 2PL, several transactions are allowed to concurrently read the same object as long as nobody is writing to it. But as soon as anyone wants to write (modify or delete) an object, exclusive access is required:

1. If transaction A has read an object and transaction B wants to write to that object, B must wait until A commits or aborts before it can continue. (This ensures that B cannot change the object unexpectedly behind A's back.)
2. If transaction A has written an object and transaction B wants to read that object, B must wait until A commits or aborts before it can continue. (Reading an old version of the object is not acceptable under 2PL.)

In 2PL, writers don't just block other writers; they also block readers, and readers block writers. Several transactions reading the same object share a shared lock, while a transaction writing to the object must acquire an exclusive lock. If a transaction first reads and then writes an object, it may upgrade its shared lock to an exclusive lock; the upgrade works the same way as acquiring an exclusive lock directly.

Scenario 1: Transaction A reads, and transaction B wants to write

Transaction B has to wait to acquire the exclusive lock until transaction A releases its shared lock.
1. Growing phase (lock acquisition):
   a. Transaction A begins and reads an object, acquiring a read lock on it.
   b. Transaction B starts and intends to write to the same object, but finds it locked by A.
   c. Transaction B must wait, since it cannot acquire a write lock on the object while A holds a read lock.
2. Shrinking phase (lock release):
   a. Transaction A completes its read operation and releases the read lock on the object.
   b. Transaction B, which was waiting, now acquires a write lock and proceeds with its write operation.
   c. After finishing the write, transaction B releases the lock.

Scenario 2: Transaction A writes, and transaction B wants to read

Transaction B has to wait to acquire the shared lock until transaction A releases its exclusive lock.
1. Growing phase (lock acquisition):
   a. Transaction A begins and writes to an object, obtaining a write lock on it.
   b. Transaction B starts and intends to read the same object, but finds it locked by A.
   c. Transaction B must wait, since it cannot acquire a read lock on the object while A holds a write lock.
2. Shrinking phase (lock release):
   a. Transaction A completes its write operation and releases the write lock on the object.
   b. Transaction B, which was waiting, now acquires a read lock and proceeds with its read operation.
   c. Transaction B is thus guaranteed to read the most up-to-date version of the object.

A small sketch of such a shared/exclusive lock table follows.
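The following is a minimal, illustrative Python sketch of the shared/exclusive lock rules above. The LockTable class and its methods are our own invention; real lock managers also queue waiters, detect deadlocks, and so on. Here a conflicting request simply returns False, which models "the transaction must wait".

# Illustrative shared ("S") / exclusive ("X") lock table for 2PL.
class LockTable:
    def __init__(self):
        # object -> (mode, set of holding transactions)
        self.locks = {}

    def acquire(self, txn, obj, mode):
        """Try to lock obj for txn in mode "S" or "X".
        Returns True on success, False if the transaction must wait."""
        held_mode, holders = self.locks.get(obj, (None, set()))
        if held_mode is None or holders == {txn}:
            # Free, or held only by us (this covers the S -> X upgrade).
            self.locks[obj] = (mode, {txn})
            return True
        if mode == "S" and held_mode == "S":
            holders.add(txn)          # readers share the lock
            return True
        return False                  # reader/writer conflict: must wait

    def release(self, txn, obj):
        held_mode, holders = self.locks.get(obj, (None, set()))
        holders.discard(txn)
        if not holders:
            self.locks.pop(obj, None)

lt = LockTable()
print(lt.acquire("A", "obj", "S"))    # True: A reads
print(lt.acquire("B", "obj", "X"))    # False: B must wait (Scenario 1)
lt.release("A", "obj")                # A commits and releases its lock
print(lt.acquire("B", "obj", "X"))    # True: B can now write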
Database Recovery Techniques in DBMS

Database systems, like any other computer systems, are subject to failures, but the data stored in them must be available as and when required. When a database fails, it must possess facilities for fast recovery. It must also preserve atomicity: either a transaction completes successfully and is committed (its effect is recorded permanently in the database), or it has no effect on the database at all. In any DBMS, data recovery is an essential component: recovery mechanisms ensure that the database remains consistent and correct even in the face of failures such as system crashes, database corruption, or user errors.

Types of Recovery Techniques in DBMS

A transaction is a unit of work with the ACID properties: atomicity, consistency, isolation, durability. DBMS recovery maintains these properties, restoring the database to the last consistent state after a failure.

Backup and Restoration: The most basic recovery technique is the regular backup of the database. DBAs can schedule backups to run periodically, storing snapshots of the database. In the event of a failure, the latest backup is restored.

Log-Based Recovery: The log-based recovery technique uses logs, or history lists, of each transaction. Each transaction log record includes the transaction identifier, the data item affected, the type of operation, and the before and after values of the update. In case of a system failure, the system scans the logs and undoes any transaction that was not completed before the failure, ensuring the atomicity and consistency of the database. (A sketch of this undo pass is given at the end of this section.)

Checkpoints: Checkpoints are points of synchronization between the database and the transaction log. At a checkpoint, all buffers are written to disk, and a special checkpoint record is added to the transaction log. In case of failure, recovery starts from the last checkpoint, reducing recovery time and effort.

Shadow Paging: Shadow paging is a technique suitable for databases stored on disk. It involves maintaining a shadow directory, i.e. a copy of the database. When a transaction begins, the DBMS points to the shadow database; any changes are made in a new copy of the database, preserving the shadow copy. In case of failure, the DBMS reverts to the shadow copy, maintaining database consistency.

Database Replication: Database replication involves copying and maintaining the same set of data across multiple databases. This ensures redundancy, improves availability, and can enhance performance by distributing the workload.

Point-in-Time Recovery: Point-in-Time Recovery (PITR) allows restoring a database to a specific moment in time, rather than just the latest backup. It uses transaction logs to roll forward or backward to the desired timestamp, enabling recovery to a precise state before data loss or corruption.
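A minimal, illustrative Python sketch of the undo pass of log-based recovery. The log record format and the function name are our own assumptions: each update record carries the transaction id, the data item, and the before/after values, and transactions with no COMMIT record are undone using the before-values.

# Illustrative undo pass over a transaction log. Update records are
# (txn, "UPDATE", item, before_value, after_value); commits are (txn, "COMMIT").
def recover(log, database):
    committed = {rec[0] for rec in log if rec[1] == "COMMIT"}
    # Scan backwards, restoring before-values of uncommitted updates.
    for rec in reversed(log):
        if rec[1] == "UPDATE" and rec[0] not in committed:
            txn, _, item, before, _after = rec
            database[item] = before      # undo the incomplete transaction
    return database

db = {"A": 4000, "B": 5500}
log = [
    ("T1", "UPDATE", "A", 5000, 4000),
    ("T1", "COMMIT"),
    ("T2", "UPDATE", "B", 5000, 5500),   # T2 never committed
]
print(recover(log, db))   # {'A': 4000, 'B': 5000}: T2's update is undone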
Trigger in DBMS

A trigger in DBMS is a special type of stored procedure that is automatically executed in response to certain database events such as an INSERT, UPDATE, or DELETE operation. Triggers can be used to perform actions such as data validation, enforcing business rules, or logging. They can be defined to execute before or after the triggering event, and to execute once for every affected row or once per statement. Triggers are a powerful feature of a DBMS that allows developers to define automatic actions based on database events.

Types of Triggers on the Basis of Their Execution

Triggers in DBMS can be classified into two types based on when they are executed:

BEFORE Trigger: Executed before the triggering event (such as an insert, update, or delete) is applied to the table. Often used to modify the data before it is inserted, updated, or deleted.
AFTER Trigger: Executed after the triggering event has been applied to the table. Often used to perform actions that depend on the changes made to the table by the triggering event.

Types of Triggers on the Basis of the Event Triggering Them

Triggers can also be classified by the event that fires them. The most common types are:

INSERT Trigger: Fires when new data is inserted into a table.
UPDATE Trigger: Fires when existing data in a table is updated.
DELETE Trigger: Fires when data is deleted from a table.
INSTEAD OF Trigger: Replaces the regular insert, update, or delete operation on a view with a custom operation; it is fired instead of the regular operation.

Creating a Trigger in DBMS

CREATE TRIGGER trigger_name
[BEFORE|AFTER] [INSERT|UPDATE|DELETE]
ON table_name
[FOR EACH ROW]
BEGIN
   -- code to be executed when the trigger is activated
END;

Let's break down each part of this syntax:
CREATE TRIGGER: The SQL command used to create a trigger.
trigger_name: The name of the trigger; any name is allowed as long as it follows the naming conventions of your DBMS.
[BEFORE|AFTER]: Whether the trigger is activated before or after the triggering event.
[INSERT|UPDATE|DELETE]: The event that triggers execution; one or more of these options can be chosen depending on the trigger you want to create.
ON table_name: The table the trigger is associated with.
[FOR EACH ROW]: Specifies that the trigger executes once for each row affected by the triggering event.
BEGIN and END: The block holding the SQL statements to execute when the trigger is activated.

The exact syntax and functionality of triggers can vary depending on the specific DBMS you are using.

Example of a Trigger in DBMS

Suppose we have a table named "employees" with columns "id", "name", "age", and "salary". We want to ensure that whenever a new employee is added to the table, the employee's age is greater than or equal to 18. We can create a BEFORE INSERT trigger to enforce this rule:

CREATE TRIGGER age_check
BEFORE INSERT ON employees
FOR EACH ROW
BEGIN
   IF NEW.age < 18 THEN
      SIGNAL SQLSTATE '45000'
      SET MESSAGE_TEXT = 'Employees must be at least 18 years old';
   END IF;
END;

This trigger fires before any new row is inserted into the "employees" table. It checks the age of the new employee being inserted and, if it is less than 18, raises an error using the SIGNAL statement, which prevents the insertion of the new row. The age constraint is thus enforced automatically every time a new employee is added, reducing the chance of data inconsistency and improving data quality.

Advantages of Triggers in DBMS

○ They help ensure that data remains consistent and accurate by enforcing rules and constraints on the data.
○ They can simplify the implementation of complex business logic that requires multiple steps or queries.
○ They can be used to log changes to the database, providing an audit trail for compliance, debugging, or analysis purposes.
○ They can automate tasks that would otherwise require manual intervention, improving the efficiency and accuracy of database operations.

Disadvantages of Triggers in DBMS

○ Triggers can impact database performance, especially when they involve complex operations or fire frequently.
○ They can introduce hidden logic into the database, which can make it more difficult to understand and maintain.
○ They can make it harder to debug issues in the database, because they execute automatically and can be difficult to trace.
○ They can introduce security vulnerabilities if they are not properly designed or tested, potentially allowing unauthorized access or modification of data.

Deadlock in DBMS

In database management systems, a deadlock occurs when two or more transactions are unable to proceed because each is waiting for the other to release locks on resources. This creates a cycle of dependencies in which no transaction can continue, bringing the system to a standstill. Deadlocks can severely impact the performance and reliability of a DBMS, making it crucial to understand and manage them effectively.

What is Deadlock?

Deadlock is a condition in a multi-user database environment where transactions cannot complete because each is waiting for resources held by other transactions. This results in a cycle of dependencies in which no transaction can proceed.

Characteristics of Deadlock

Mutual Exclusion: Only one transaction can hold a particular resource at a time.
Hold and Wait: Transactions holding resources may request additional resources held by others.
No Preemption: Resources cannot be forcibly taken from the transaction holding them.
Circular Wait: A cycle of transactions exists in which each transaction waits for a resource held by the next transaction in the cycle.

Example: Suppose transaction T1 holds a lock on some rows in the Students table and needs to update some rows in the Grades table. Simultaneously, transaction T2 holds locks on those very rows in the Grades table (which T1 needs to update) but needs to update the rows in the Students table held by T1. Now the main problem arises: T1 waits for T2 to give up its lock, and T2 likewise waits for T1 to give up its lock. All activity comes to a halt and remains at a standstill forever, unless the DBMS detects the deadlock and aborts one of the transactions.

[Figure: deadlock between T1 and T2 over the Students and Grades tables]

What is Deadlock Avoidance?

When a database is stuck in a deadlock, it is always better to avoid the deadlock than to restart or abort the database. The deadlock avoidance method is suitable for smaller databases, whereas the deadlock prevention method is suitable for larger databases. One method of avoiding deadlock is to use application-consistent logic: in the example above, transactions that access Students and Grades should always access the tables in the same order. That way, T1 simply waits for T2 to release the lock on Grades before it begins; when T2 releases the lock, T1 can proceed freely. Another method is to combine row-level locking with the READ COMMITTED isolation level; however, this does not guarantee the complete removal of deadlocks.

What is Deadlock Detection?

When a transaction waits indefinitely to obtain a lock, the DBMS should detect whether the transaction is involved in a deadlock. The wait-for graph is one method for detecting deadlock; it is suitable for smaller databases. In this method, a graph is drawn based on the transactions and their locks on resources. If the graph contains a closed loop (a cycle), there is a deadlock. A small cycle-detection sketch follows.
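A minimal, illustrative Python sketch of wait-for-graph cycle detection (the function name and graph representation are our own): the graph maps each transaction to the set of transactions it is waiting for, and a depth-first search looks for a cycle.

# Illustrative cycle detection in a wait-for graph.
# graph: transaction -> set of transactions it is waiting for.
def has_deadlock(graph):
    visited, on_stack = set(), set()

    def dfs(node):
        visited.add(node)
        on_stack.add(node)
        for nxt in graph.get(node, ()):
            if nxt in on_stack:           # back edge found: a cycle
                return True
            if nxt not in visited and dfs(nxt):
                return True
        on_stack.discard(node)
        return False

    return any(dfs(t) for t in graph if t not in visited)

# T1 waits for T2 (Grades rows) and T2 waits for T1 (Students rows).
print(has_deadlock({"T1": {"T2"}, "T2": {"T1"}}))  # True: deadlock
print(has_deadlock({"T1": {"T2"}, "T2": set()}))   # False: no cycle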
[Figure: wait-for graph for the scenario above — T1 waits for T2 and T2 waits for T1, forming a cycle]

What is Deadlock Prevention?

For a large database, the deadlock prevention method is suitable. A deadlock can be prevented if resources are allocated in such a way that a deadlock can never occur. The DBMS analyzes whether the operations can create a deadlock situation; if they can, that transaction is never allowed to execute. The deadlock prevention mechanism proposes two schemes. (A small sketch of both decision rules follows the comparison below.)

Wait-Die Scheme: If a transaction requests a resource locked by another transaction, the DBMS checks the timestamps of both transactions and lets only the older transaction wait for the resource. Suppose there are two transactions T1 and T2, and let TS(T) be the timestamp of transaction T (an older transaction has a smaller timestamp). If T1 requests a resource held by T2, the DBMS acts as follows:
○ If TS(T1) < TS(T2), i.e. T1 is the older transaction, T1 is allowed to wait until the resource is available.
○ If TS(T1) > TS(T2), i.e. T1 is the younger transaction, T1 is killed ("dies") and is restarted later after a small delay, but with the same timestamp.
In short, this scheme lets an older transaction wait but kills a younger one.

Wound-Wait Scheme: If an older transaction requests a resource held by a younger transaction, the older transaction "wounds" the younger one: the younger transaction is killed, releases the resource, and is restarted after a small delay with the same timestamp. If a younger transaction requests a resource held by an older one, the younger transaction must wait until the older one releases it.

The differences between the Wait-Die and Wound-Wait prevention schemes:
○ Wait-Die is based on a non-preemptive technique; Wound-Wait is based on a preemptive technique.
○ In Wait-Die, older transactions must wait for a younger one to release its data items; in Wound-Wait, older transactions never wait for younger ones.
○ The number of aborts and rollbacks is higher in Wait-Die; in Wound-Wait it is lower.
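A minimal, illustrative Python sketch of both decision rules (function names are our own; a smaller timestamp means an older transaction): given a requesting transaction and a lock holder, each scheme decides whether the requester waits or some transaction is aborted and restarted with its original timestamp.

# ts maps each transaction to its timestamp; smaller = older.
def wait_die(requester, holder, ts):
    # Older requester waits; younger requester dies (is rolled back).
    return "WAIT" if ts[requester] < ts[holder] else "DIE (restart requester)"

def wound_wait(requester, holder, ts):
    # Older requester wounds (aborts) the younger holder; younger waits.
    return "WOUND (restart holder)" if ts[requester] < ts[holder] else "WAIT"

ts = {"T1": 10, "T2": 20}          # T1 is older than T2
print(wait_die("T1", "T2", ts))    # WAIT: older T1 may wait
print(wait_die("T2", "T1", ts))    # DIE: younger T2 is rolled back
print(wound_wait("T1", "T2", ts))  # WOUND: older T1 preempts younger T2
print(wound_wait("T2", "T1", ts))  # WAIT: younger T2 waits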
Unit 5

SQL INDEX

An index in SQL is a special lookup table used to speed up searching for data in database tables. It is also used when vast amounts of data are retrieved from the tables frequently. An index requires its own space on the hard disk. The index concept in SQL is the same as the index of a book. Indexing is one of the best SQL techniques for improving query performance. The drawback of indexes is that they slow down the execution of UPDATE and INSERT statements; on the other hand, they speed up SELECT statements and WHERE-clause lookups. In SQL, an index is created on the columns of a table, and we can easily build one or more indexes on a table. Creating or deleting an index does not affect the data in the database.

Why SQL Index?

The following reasons tell why an index is necessary in SQL:
○ SQL indexes can quickly search through the information of a large database.
○ Indexing is particularly effective for columns containing many different values.
○ This data structure sorts the data values of the indexed columns in either ascending or descending order, and then assigns an entry for each value.
○ Each index table contains only two columns: the first is row_id, and the other is the indexed column.
○ When indexes are used with smaller tables, the performance benefit of the index may not be noticeable.

Create an INDEX

In SQL, we can easily create an index using the following CREATE statement:

CREATE INDEX Index_Name ON Table_Name (Column_Name);

Here, Index_Name is the name of the index we want to create, Table_Name is the name of the table on which the index is to be created, and Column_Name is the name of the column to which the index is applied. For an index on multiple columns:

CREATE INDEX Index_Name ON Table_Name (column_name1, column_name2, ..., column_nameN);

Example of creating an index in SQL. Let's take an Employee table:

Emp_Id   Emp_Name   Emp_Salary   Emp_City   Emp_State
1001     Akshay     20000        Noida      U.P
1002     Ram        35000        Jaipur     Rajasthan
1003     Shyam      25000        Gurgaon    Haryana
1004     Yatin      30000        Lucknow    U.P

The following SQL query creates an index 'index_state' on the Emp_State column of the Employee table:

CREATE INDEX index_state ON Employee (Emp_State);

Suppose we want to create an index on the combination of the Emp_City and Emp_State columns of the Employee table. For this, we use the following query:

CREATE INDEX index_city_state ON Employee (Emp_City, Emp_State);

Create a UNIQUE INDEX

A unique index, like a primary key, does not allow duplicate values in the indexed column. It is a good way to maintain the data integrity of SQL tables. The syntax for creating a unique index is:

CREATE UNIQUE INDEX Index_Name ON Table_Name (Column_Name);

Example: taking the Employee table above, the following SQL query creates the unique index index_salary on the Emp_Salary column:

CREATE UNIQUE INDEX index_salary ON Employee (Emp_Salary);

Rename an INDEX

We can easily rename an index in a relational database using the ALTER command. Syntax:

ALTER INDEX old_Index_Name RENAME TO new_Index_Name;

Example: the following SQL query renames the index 'index_salary' to 'index_employee_salary' on the Employee table:

ALTER INDEX index_salary RENAME TO index_employee_salary;

Remove an INDEX

An index can easily be removed from the SQL database using the DROP command. To delete an index from the data dictionary, you must be the owner of the database or have the privileges required to remove it. The syntax for removing an index differs between relational databases:

In Oracle: DROP INDEX Index_Name;
In MySQL: ALTER TABLE Table_Name DROP INDEX Index_Name;
In MS Access: DROP INDEX Index_Name ON Table_Name;
In SQL Server: DROP INDEX Table_Name.Index_Name;

Example: suppose we want to remove the 'index_salary' index created above. In Oracle, we use the following SQL query:

DROP INDEX index_salary;
Alter an INDEX

An index can easily be modified in a relational database using the ALTER command. The basic syntax for modifying an index in SQL is:

ALTER INDEX Index_Name ON Table_Name REBUILD;

Hashing in DBMS

Hashing in DBMS is a technique to quickly locate a data record in a database, irrespective of the size of the database. For larger databases containing thousands or millions of records, the indexing data structure becomes inefficient, because searching for a specific record through an index takes more time. This does not align with the goals of a DBMS, where performance matters and data retrieval time should be minimized. To counter this problem, the hashing technique is used. Below we look at the main hashing techniques.

What is Hashing?

The hashing technique utilizes an auxiliary hash table to store the data records using a hash function. There are two key components in hashing:

Hash Table: A hash table is an array-like data structure whose size is determined by the total volume of data records present in the database. Each memory location in a hash table is called a 'bucket' (or hash index); it stores a data record's exact location and is accessed through a hash function. A bucket generally stores a disk block, which in turn may store multiple records.

Hash Function: A hash function is a mathematical function or algorithm that takes a data record's primary key as input and computes the hash index as output, i.e. the location where the record is to be stored in the hash table so that it can be accessed efficiently later. The hash function is the most crucial component in determining the speed of fetching data.

Working of the Hash Function

The hash function generates a hash index from the primary key of the data record. There are two possibilities:
1. The generated hash index is not yet occupied by any other value, so the address of the data record is stored there.
2. The generated hash index is already occupied by some other value. This is called a collision, and a collision resolution technique must be applied.
Whenever we then query a specific record, applying the hash function returns the data record faster than indexing, because we reach the exact location of the record directly rather than searching through index entries one by one.

Types of Hashing in DBMS

There are two primary hashing techniques in DBMS.

1. Static Hashing

In static hashing, the hash function always generates the same bucket address for a given key. For example, if we have a data record for employee_id = 106 and the hash function is mod-5, i.e. H(x) = x mod 5, then H(106) = 106 mod 5 = 1. This indicates that the data record should be placed in, and searched for in, bucket 1 (hash index 1) of the hash table. The primary key is used as input to the hash function, and the hash function outputs the hash index (bucket address), which contains the address of the actual data record on the disk block. A minimal sketch of this lookup is given below.
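A minimal, illustrative Python sketch of static hashing with the mod-5 hash function above (the names are our own): the bucket index for a key never changes, which is both the strength and the weakness of the scheme.

NUM_BUCKETS = 5
buckets = [[] for _ in range(NUM_BUCKETS)]  # each bucket: list of (key, record)

def hash_index(key):
    return key % NUM_BUCKETS       # static hash function H(x) = x mod 5

def insert(key, record):
    buckets[hash_index(key)].append((key, record))

def lookup(key):
    for k, record in buckets[hash_index(key)]:  # scan only one bucket
        if k == key:
            return record
    return None

insert(106, "employee 106")   # H(106) = 1 -> bucket 1
insert(111, "employee 111")   # H(111) = 1 -> collision, same bucket
print(lookup(106))            # found directly via bucket 1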
Static hashing has the following properties:

Data Buckets: The number of buckets in memory remains constant. The size of the hash table is decided initially. It may also implement chaining to handle some collision issues, though this is only a slight optimization and may not prove worthwhile if the database size keeps fluctuating.

Hash Function: It uses the simplest kind of hash function to map the data records to their buckets, generally a modulo hash function.

Efficient for Known Data Size: Static hashing is very efficient when we know the data size and its distribution in the database. It is inefficient and inaccurate when the data size varies dynamically, because space is limited and the hash function always generates the same value for a given input. When the data size fluctuates often, collisions keep happening, resulting in problems such as bucket skew and insufficient buckets.

To resolve the problem of bucket overflow, techniques such as chaining and open addressing are used. Here is a brief description of both:

1. Chaining: Chaining is a mechanism in which the hash table is implemented using an array of nodes, where each bucket is of node type and can contain a chain (a linked list) of data records. So even if the hash function generates the same value for several data records, they can still be stored in the bucket by adding new nodes. However, this gives rise to the problem of bucket skew: if the hash function keeps generating the same value again and again, hashing becomes inefficient, since the remaining buckets stay unoccupied or store minimal data.

2. Open Addressing (Closed Hashing): Also called closed hashing, this aims to solve collisions by looking for the next empty slot that can store the data. It uses techniques like linear probing, quadratic probing, double hashing, etc.

2. Dynamic Hashing

Dynamic hashing, also known as extendible hashing, is used to handle databases whose data sets change frequently. This method offers a way to add and remove data buckets on demand, dynamically: as the number of data records varies, the buckets grow and shrink periodically as changes are made.

Properties of Dynamic Hashing

○ The buckets vary in size dynamically as changes are made, offering more flexibility.
○ Dynamic hashing improves overall performance by minimizing, or completely preventing, collisions.
○ Its major components are the data buckets, a flexible hash function, and directories. A flexible hash function generates more dynamic values and can change periodically with the requirements of the database. Directories are containers that store pointers to buckets; each directory has a directory id. If bucket overflow or bucket-skew problems occur, bucket splitting is done to keep the retrieval time of data records efficient.
○ Global Depth: the number of bits in each directory id. The more records there are, the more bits are used.

Working of Dynamic Hashing

Example: If the global depth is k = 2, keys are mapped to buckets using the k bits of the hash value starting from the LSB (least significant bit). That leaves us with four possible directory ids: 00, 01, 10, 11. The k LSBs of a key's hash value select the directory id, which then maps the key to its bucket. [Figure: dynamic hashing — mapping keys via LSB bits to directories and buckets]
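A minimal, illustrative Python sketch of this directory mapping with global depth k = 2 (names are our own; bucket splitting and directory doubling are deliberately omitted): the two least significant bits of the hash value select the directory entry, which points to a bucket.

GLOBAL_DEPTH = 2                    # k = 2 -> directory ids 00, 01, 10, 11
directory = {i: [] for i in range(2 ** GLOBAL_DEPTH)}   # id -> bucket

def directory_id(key):
    # Take the k bits starting from the LSB of the hash value.
    return hash(key) & ((1 << GLOBAL_DEPTH) - 1)

def insert(key, record):
    directory[directory_id(key)].append((key, record))
    # A real implementation would split an overflowing bucket and, if
    # needed, double the directory (increasing the global depth).

for emp_id in (12, 13, 14, 15):
    insert(emp_id, f"employee {emp_id}")

for did, bucket in directory.items():
    print(f"{did:0{GLOBAL_DEPTH}b} -> {bucket}")   # e.g. 00 -> [(12, ...)]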
To recap, the hash value's k bits select a directory id, the directories store pointers to the buckets, and each bucket holds the records whose key bits match that directory id in binary.

Query Processing in DBMS

Query processing is the activity of extracting data from the database. It takes several steps to fetch the data:
1. Parsing and translation
2. Optimization
3. Evaluation

Query processing works in the following way.

Parsing and Translation

The user writes queries in a high-level database language such as SQL, which is well suited for humans but not for the system's internal representation of a query. The query must therefore be translated into expressions that can be used at the physical level of the file system; relational algebra is well suited for this internal representation. After translation, the actual evaluation of the query and a variety of query-optimizing transformations take place.

The translation step works like a parser. When a user executes a query, the parser checks the syntax of the query and verifies the names of the relations, tuples and attributes used in it, in order to generate the internal form of the query. The parser builds a tree of the query, known as the parse tree, and then translates it into relational algebra form, replacing any views used in the query with their definitions.

[Figure: steps in query processing — parsing and translation, optimization, evaluation]

Suppose a user executes a query: in SQL, the user wants to fetch the records of employees whose salary is greater than 10000:

select emp_name from Employee where salary > 10000;

To make the query understandable to the system, it is translated into relational algebra. Two equivalent relational algebra forms of this query are:

○ π_emp_name (σ_salary>10000 (Employee))
○ σ_salary>10000 (π_emp_name, salary (Employee))

After translating the query, each relational algebra operation can be executed using one of several different algorithms. In this way, query processing begins its work. A small sketch of such an operator tree is shown below.
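A minimal, illustrative Python sketch of the internal representation (the function names and sample rows are our own): the query above becomes a small operator tree, a projection over a selection over a table scan, which the evaluation engine can then execute.

# Illustrative relational-algebra operator tree for:
#   select emp_name from Employee where salary > 10000;
employee = [
    {"emp_name": "Akshay", "salary": 20000},
    {"emp_name": "Ram",    "salary": 9000},
]

def scan(rows):                      # leaf: the Employee relation
    return rows

def select(rows, predicate):         # sigma: filter rows
    return [r for r in rows if predicate(r)]

def project(rows, columns):          # pi: keep only the named columns
    return [{c: r[c] for c in columns} for r in rows]

# pi_emp_name( sigma_salary>10000( Employee ) )
result = project(select(scan(employee), lambda r: r["salary"] > 10000),
                 ["emp_name"])
print(result)                        # [{'emp_name': 'Akshay'}]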
Evaluation

In addition to the relational algebra translation, the translated expression must be annotated with the instructions used for specifying and evaluating each operation. After translating the user query, the system thus executes a query evaluation plan.

Query Evaluation Plan

○ In order to fully evaluate a query, the system needs to construct a query evaluation plan.
○ The annotations in the evaluation plan may refer to the algorithms to be used for a particular index or a specific operation.
○ Relational algebra annotated in this way is referred to as evaluation primitives; the evaluation primitives carry the instructions needed to evaluate each operation.
○ A query evaluation plan thus defines a sequence of primitive operations used to evaluate a query. The query evaluation plan is also referred to as the query execution plan.
○ A query execution engine is responsible for generating the output of the given query: it takes the query execution plan, executes it, and produces the output for the user query.

Optimization

○ The cost of query evaluation can vary for different types of queries. The system is responsible for constructing the evaluation plan, so the user does not need to write the query with efficiency in mind.
○ A database system usually generates an efficient query evaluation plan that minimizes cost; this task, performed by the database system itself, is known as query optimization.
○ To optimize a query, the query optimizer needs an estimated cost analysis of each operation, because the overall cost depends on the memory allocated to the various operations, execution costs, and so on.

Finally, after selecting an evaluation plan, the system evaluates the query and produces its output. A toy sketch of cost-based plan selection follows.
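To close, here is a toy, illustrative Python sketch of cost-based plan selection. The plan names and cost numbers are entirely invented for illustration: the optimizer estimates a cost for each equivalent plan and executes the cheapest one.

# Illustrative cost-based choice between equivalent evaluation plans.
plans = [
    {"name": "full table scan, then filter",         "estimated_cost": 1200},
    {"name": "index lookup on salary, then project", "estimated_cost": 90},
]

best = min(plans, key=lambda p: p["estimated_cost"])
print(f"optimizer picks: {best['name']} (cost {best['estimated_cost']})")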
