INFS2200 Ultimate Study Sheet PDF
Document Details
Tags
Summary
This document contains study notes for an INFS2200 course, focusing on database management systems concepts like record organization, index structures (bitmap index, B+ trees), triggers, and query optimization. The document appears to be a collection of lecture notes or study materials, rather than a specific exam paper.
Full Transcript
**PLEASE READ!!!!!! PLEASE ADD TO THIS DOC ALL NOTES ABOUT EVERYTHING. UNDERLYING KNOWLEDGE FOR QUESTIONS FOR PAST EXAMS, DEFINITIONS E\>T\>C!!!!!!** **Record Organisation** *Record Internal fragmentation*: the wasted space between records due to the fixed record size. This is caused by unused dat...
**PLEASE READ!!!!!! PLEASE ADD TO THIS DOC ALL NOTES ABOUT EVERYTHING. UNDERLYING KNOWLEDGE FOR QUESTIONS FOR PAST EXAMS, DEFINITIONS E\>T\>C!!!!!!** **Record Organisation** *Record Internal fragmentation*: the wasted space between records due to the fixed record size. This is caused by unused data when fields are smaller than requested. *Block Internal fragmentation*: the wasted space in blocks due to the nature of fixed record sizes not fitting "neatly" in the allocated block size. *Fixed Length:* The full potential record size with all fields is used for each record. +-----------------------------------+-----------------------------------+ | **Pros** | **Cons** | +===================================+===================================+ | - - - | - | +-----------------------------------+-----------------------------------+ *Variable Length:* Record size varies based on the data actually stored in the record. Variable length records use special symbols and structures to describe the data stored in a record. +-----------------------------------+-----------------------------------+ | **Pros** | **Cons** | +===================================+===================================+ | - | - - - | +-----------------------------------+-----------------------------------+ **Ordered and Unordered files** *Unordered (heap)* Pros: Fast at inserting Cons: Slow at Searching (n blocks) *Ordered (sorted)* Pros: Fast at searching Cons: Slow at inserting (have to shift n/2 blocks on average) **Bitmap Index:** A bitmap index is a bitmap used on attributes for each field value. It has a 1 for each record it finds that matches the attribute and a 0 for all other records. Cons: Should not be used on data that is modified regularly as it requires modifying all associated bitmap indexes as well. Pros: In general, they are advantageous due to space efficiency and the ability to query on multiple keys. (good for warehouse data sets which are large and not updated frequently) The size of a bitmap (on a single attribute), is the number of rows (in bits). The number of bitmaps required for a field, is the number of distinct values for the field. The total space needed to index one field, is equal to the product number of bitmaps required and the size of each of those bitmaps (in bits). **Triggers** Triggers are an event that occur when a particular condition is found from an event executed on a DBMS. The three components of a trigger are the Events, Condition and action. **B+ Trees:** *Advantages over Multi-Level Index Structure* - - \- Must be able to do the damn Illustrate the figure the tree that changes based on an insert or query. - write down rules for this \- Must be able to determine which nodes are accessed based on a condition given. When searching a node u can travel across leaf nodes to find value instead of heading back up to parent nodes to move forward or back. **Sanity checking rules for B+ Tree correctness** **Node Type** **Children Type** **Min Children** **Max Children** **Example b = 7** **Example b = 100** ------------------------------------------------------ ---------------------------------- ------------------------------------------------ ------------------ ------------------- --------------------- **Root Node (when it is the only node in the tree)** **Records** **1** **b-1** **1 - 6** **1 - 99** **Root Node** **Internal Nodes or Leaf Nodes** **2** **b** **2 - 7** **2 - 100** **Internal Node** **Internal Nodes or Leaf Nodes** \\lceil b/2 \\rceil **b** **4 - 7** **50 - 100** **Leaf Node** **Records** ![ \\lceil b/2 \\rceil - 1 ](media/image2.png) **b - 1** **4 - 6** **50 - 99** **Inserting a new entry into a B+ tree:** - - - - - - **How to split Leaf node:**\ 1. Redistribute entries: if L is of order p, then ceiling(p+1 / 2) entries stay in L leaf node and the rest move to Lnew. E.g. if each leaf node fits 4 entries, then when a new entry is inserted, ceiling(4+1 / 2) = 3 entries stay in the old node and the last moves into Lnew along with the new entry. Can also use (p-1 / 2). 2\. Copy up first key in Lnew into L's parent\ 3. Insert pointer to Lnew into L's parent. **How to split Internal node:** 1\. Redistribute entries evenly 2\. Push up middle key (middle key "pops up" to upper level and is not copied). **Search in a B+ tree index** - - - **Tree height:**\ 1. The number of entries in leaf nodes (e) and 2. Tree order (i.e. fan-out (fo)). Tree height is proportional to log fo (e) Fan out = index blocking factor = block size / entry size **Views** *Materialised Views:* - - *Regular Views:* - - **Data** +-----------------------+-----------------------+-----------------------+ | | **Pros** | **Cons** | +=======================+=======================+=======================+ | Data Partitioning | \- Increased access | \- Popular data | | | rate | stored on same disk, | | | | might lead to request | | | | collision (hotspots) | +-----------------------+-----------------------+-----------------------+ | Data Mirroring | \- Double read rate | \- Expensive, pay the | | | | cost of 2 disk to get | | | \- No problem with | the storage of 1. | | | request collision | | | | | \- Cost of several | | | | small disk is greater | | | | than a single one | | | | with the same | | | | capacity. | +-----------------------+-----------------------+-----------------------+ **Block and Record calculations** **Avg Record Length R** = R fixed + R variable refer to tutorial 5 **\# of Blocks needed for \# records** = totalbytes / (blocksize + pointer to next block) **Records per Block (RPB) or blocking factor** = Block Size / Size of each record **\# of Data File Blocks** = \# of records / RPB(blocking factor) **\# of Index entries =** \# of records (because index is primary index) OR Tuples of blocks when talking about minimum levels index can have **\# of Index Blocks =** Index Entries / (BFI) **Index file Blocking Factor (BFI) =** Floor ( Block Size / (Field key size + Block Pointer)) **Minimum possible number of levels an Index can have =** ceiling(log base index bfr(index records)) **ACID** **A**tomicity: - All actions associated with a transaction occur or none of them occur. **C**onsistency: - At the transaction Boundaries, all integrity constraints on the database are satisfied. **I**solation: - Concurrent transactions evaluate like they were executed serially. **D**urability: - Completed transactions become permanent and survive subsequent failures. **Query Trees** A query tree is a data structure that corresponds to a relational algebra expression. There are two components: LEAF NODES - which relate to input relations & INTERNAL NODES - which relate to relational algebra operations. **Algebraic Query Optimisation** It has three main heuristics to optimise: - - - **SELECTIVITY of S=0.\# of condition C-** Just means that \# % of records in S match condition C. **Remember Operational Signs** Projection (π) Selection (σ) Cross Product (TABLE1 X TABLE2 ) Join (⨝) **Transformation Rules** - - - - - **Two-Phase Locking (2PL)** *Growing Phase:* During this Phase locks are obtained *Shrinking Phase:* During this Phase all locks are released that were obtained during the growing phase. When transaction enters this phase it can't grow. **Cost Based Query Optimisation** Estimates and compares the cost of executing a query using different execution strategies and then choosing the strategy that evaluates to the least cost estimate. Cost Factors: - - - - - **2PL Protocols** *Basic 2PL Protocol:* 1\. The growing phase: In this phase the locks for a transaction are obtained 2\. The shrinking phase: In this phase the locks for a transaction are released As soon as a transaction enters shrinking phase it cannot ask for any more locks on data items. Every 2PL history is serializable. Basic 2PL may cause deadlock. *Strict 2PL Protocol:* The only change from basic 2PL is that exclusive or write locks can only be released after the transaction terminates (commits or aborts). Strict 2PL is not deadlock-free. But it simplifies the recovery process as you only need to rollback the aborted transaction (others not affected since they couldn't read any updates). *Conservative 2PL Protocol:* Has no growing phase. All locks for a transaction need to be obtained before a transaction starts. Deadlock-free protocol. - Wait-For-Graph (WFG) - uses a lock table to detect cycles (deadlocks). One of the transactions in a deadlock is rolled back (aborted). **Timeout Mechanisms** *Long Timeout*: Has a low chance of accidentally aborting non deadlocked transactions. Potentially lots of wasted time in determining a deadlock has occurred. *Short Timeout:* High chance of aborting, non deadlocked transaction. Short amount of time deadlock exists before being aborted. **Schedule S Execution Rules:** W(X) - Means a write on X - Takes a lock on X R(Y) - Means a read on Y T1:Commit - Means a commit of all data to the DB permanently from T1 - releasing all locks. Two major types of locks are utilized: - - All locks happen before you do operations. *Example:* **S = T1:W(X), T2:R(Y), T1:R(Y), T2:R(X), T1:Commit, T2:Commit** The above schedule is not allowed under Strict 2PL. T2 would not be able to issue T2: R(X) before T1 commits. However it would be allowed under Basic 2PL. Read the table line by line, left to right, and it will make sense. Remember read locks are shared and write locks are exclusive. +-----------------------------------+-----------------------------------+ | T1 | T2 | +===================================+===================================+ | write\_lock(X) | read\_lock(Y) | | | | | W(X) | R(Y) | | | | | read\_lock(Y) | read\_lock(X) | | | | | unlock(X) | unlock(Y) | | | | | R(Y) | R(X) | | | | | unlock(Y) | unlock(X) | | | | | Commit | Commit | +-----------------------------------+-----------------------------------+ **Efficient Equality Search** **Database Anomalies** Lost Update: This occurs when two transactions read the same value of a data item but finish after each other. The value that the first transaction writes will be overwritten by the second. Dirty Read: This occurs when a transaction reads another transactions write to a data item before the writing transaction commits. This is a problem because up until that point the writing transaction could roll back and the reading transaction will have read a temporary incorrect value for the data item. Unrepeatable Read: This occurs when a transaction reads from a data item twice but the second read value is different from the first. This is an anomaly because in a serial schedule the two values will be the same. So if the second read is different from the first in a schedule then the schedule is not equivalent to any serial order of transactions. **For S to be serializable -** Serializable schedule means, a schedule that is interleaved but still equates to some serial execution of the transactions. **Serialization Graphs\ **Nodes represent transactions.\ An edge T1 -\> T2 means that one of T1's operations precedes and conflicts with one of T2's operations. **Recovery Protocols\ **Undo action:\ - required for atomicity. Undoes all updates by an uncommitted transaction. Redo action:\ - required for durability. Redoes the update of the committed transaction (which may not have been written to disk yet). **Cache Flushing**\ Steal: a cache page can be flushed before transaction commits\ No-steal: A cache cannot be flushed before transaction commits\ Force: Cache is flushed to disk when transaction commits\ No-force: Cache is not flushed to disk when transaction commits.\ \ Leads to four different ways for handling recovery:\ - steal / no-force - undo / redo "Practical"\ - steal / force - undo / no redo\ - no-steal / no-force - no-undo / redo\ - no-steal / force - no-undo / no-redo "Simple" **Write-ahead Logging** (WAL)\ For undo:\ - Before a data items AFIM is flushed to disk (overwriting the BIFM), the BIFM must be written to the log.\ For redo:\ Before a transaction executes its commit operation, all its AFIMs must be written to the log. **Nested-loop join:** Outer relation is scanned once (Blocks in outer relation) Inner relation is scanned once for each record in outer relation (Records in outer x Blocks in inner) Cost = Blocks in outer + Records in outer x Blocks in inner Outer should be smaller file **Page Oriented Nested Loops:** The idea behind page-oriented nested loop join is for each of the blocks in the outer relation to access all of the blocks in the inner relation and compare the records currently in memory. Total cost = Blocks In the outer relation + Blocks In the outer relation \* Blocks In the inner relation **Block nested loop join:** Reads as much from outer file as possible at a time. One block of memory for buffering result, and one block for reading from the inner file - Num memory blocks - 2 remaining. Cost = Blocks in outer + Blocks in inner x (Blocks in outer / (Number of blocks in memory - 2)) Outer should be smaller file **Single loop join:** Only works if index exists for inner relation. Scans outer and uses index on inner to perform join without nested loop. Cost = Blocks in outer + (Records in outer x (Levels in inner index + 1)) Outer should be smaller file **Steps to Drawing Query Plan** **1.** Break up selections (with conjunctive conditions) into a cascade of selection operators **2.** Push selection operators as far down in the tree as possible **3. ** Convert Cartesian products into joins **4.** Rearrange leaf nodes so that to: Execute first the most restrictive select operators **5.** Move projections as far down as possible **Transfer rate (HDD)** Transfer rate = Block size / block transfer time Transfer rate (bytes/ms) = Track size / time for 1 revolution in ms **Blocks** Block Transfer Time (BTT) = Block size / Transfer rate (bytes/ms) Blocking factor = Floor(block size / record size) **Rotational delay** Average rotational delay = time for one disk revolution (ms) / 2 **File structure properties** *File Type* *Non-unique field Search* *Unique field Search* *Range Search* *Insert* -------------------- --------------------------- ----------------------- --------------------------- ------------------ Heap (Non-Ordered) B 0.5B B 2 (Read + Write) Sorted (Ordered) B log~2~B log~2~B + matching blocks log~2~B + B **Big Data** Four V's: Volume - Scale of Data Variety - Different forms of data Velocity - Analysis of streaming data Veracity - uncertainty of data