Questions and Answers
During query processing, what is the primary role of the 'parser and translator' component?
- To optimize the query execution plan.
- To execute the query and produce the final result.
- To check the syntax of the query and convert it into relational algebra. (correct)
- To estimate query costs using statistical information.
What does an 'evaluation-plan' specify in the context of query processing?
- The equivalent relational algebra expressions that can be used to produce the same result.
- The statistical information used to estimate the cost of query execution.
- The different algorithms available for evaluating relational algebra operations.
- A detailed strategy for how to execute a query, including specific algorithms and access methods. (correct)
What is the main goal of query optimization?
- To choose the evaluation plan with the lowest estimated cost among all equivalent plans. (correct)
- To reduce the number of tuples in each relation.
- To find all equivalent evaluation plans for a given query.
- To ensure the syntax of the query is correct.
When measuring the cost of a query, what factors are typically considered?
In the context of query cost, what does the term 'seek' refer to?
Why is sorting considered an important operation in DBMSs?
What is the key idea behind the external sort-merge algorithm?
In the 2-way external sort-merge algorithm, what is represented by 'N'?
What is the primary purpose of hashing aggregation?
What is the role of the hash function in the first phase of external hashing aggregation?
Flashcards
Parsing and Translation
Transforming a query into a tree-like form and verifying relations. Then, converting it into Relational Algebra.
Query Optimization
Choosing the lowest cost evaluation plan from equivalent options, using database statistics.
Query Evaluation
Executing a query-evaluation plan to get results.
Response Time
Resource Consumption
Disk cost
External Sort-Merge
Run Generation Cost
Sorting Aggregation
Study Notes
Basic Steps in Query Processing
- Query processing involves parsing and translation, optimization, and evaluation.
- Parsing checks for syntax and verifies relations.
- Queries are translated into an internal tree-like form and then into Relational Algebra during parsing.
- Evaluation involves the query-execution engine taking a query-evaluation plan, executing it, and returning the answers.
Optimization of Queries
- A relational algebra expression can have multiple equivalent expressions.
- Relational algebra operations can be evaluated using several different algorithms.
- An evaluation-plan is an annotated expression specifying a detailed evaluation strategy.
- Query optimization chooses the lowest cost evaluation plan from all equivalent plans.
- Cost is estimated using statistical information from the database catalog, like the number of tuples and size of tuples in each relation.
- Measuring query cost makes it possible to compare the different algorithms available for evaluating relational algebra operations.
- Algorithms combine individual operations to evaluate a complete expression.
- Query optimization involves finding an evaluation plan with the lowest estimated cost.
Measures of Query Cost
- Factors contributing to time cost include disk access, CPU usage, and network communication.
- Cost can be measured by response time or total resource consumption.
- Total resource consumption is used as the cost metric since response time is harder to estimate.
- Minimizing resource consumption is beneficial in shared databases.
- CPU costs are ignored for simplicity.
- Real systems consider CPU costs.
- Network costs need consideration in parallel systems.
- Cost is estimated separately for each operation.
- The cost of writing output to disk is not included.
- Disk cost estimation includes the number of seeks multiplied by the average seek cost.
- Disk cost estimation includes the number of blocks read times the average block read cost.
- Disk cost estimation also includes the number of blocks written times the average block write cost.
- The number of block transfers from disk and the number of seeks are used as cost measures.
- tT is the time to transfer one block, assuming write cost is the same as read cost.
- tS is the time for one seek.
- The cost for 'b' block transfers plus 'S' seeks is calculated as b * tT + S * tS.
- tS and tT depend on where the data is stored; typical values, assuming 4 KB blocks:
- For high-end magnetic disks, tS = 4 msec and tT = 0.1 msec.
- For SSDs, tS = 20-90 μsec and tT = 2-10 μsec for 4KB.
- Required data may already reside in the buffer, thus avoiding disk I/O.
- It's difficult to account for data in the buffer for cost estimation purposes.
- Disk I/O can be reduced using extra buffer space with several algorithms.
- The real memory available to the buffer depends on concurrent queries and the OS.
- Worst-case estimates assume no initial data is in the buffer, and only the minimum memory is available.
- More optimistic estimates are used in practice.
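As a quick sanity check, the cost formula above can be evaluated for the quoted HDD and SSD figures (the transfer and seek counts below are made-up example values):

```python
# Hedged sketch: estimating disk access cost as b * tT + S * tS, in ms,
# using the HDD/SSD figures quoted above (example workload, not a measurement).

def disk_cost(block_transfers, seeks, tT, tS):
    """Return estimated disk access time in milliseconds."""
    return block_transfers * tT + seeks * tS

# High-end magnetic disk: tS = 4 ms, tT = 0.1 ms per 4 KB block.
hdd = disk_cost(block_transfers=1000, seeks=10, tT=0.1, tS=4.0)
# SSD (midpoints of the quoted ranges): tS ~ 0.055 ms, tT ~ 0.006 ms.
ssd = disk_cost(block_transfers=1000, seeks=10, tT=0.006, tS=0.055)

print(round(hdd, 2))  # 140.0
print(round(ssd, 2))  # 6.55
```

The same workload is over 20x faster on the SSD, which is why tS and tT must reflect where the data actually lives.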
Operator Execution
- Queries are executed using the DBMS with operator algorithms, query processing models, and runtime architectures.
- Query planning arranges operators in a tree structure where data flows from leaves to the root.
- The result of the query is the output of the root node.
Disk-Oriented DBMS
- There is no assumption that a table fits entirely in memory.
- Similarly, a disk-oriented DBMS does not assume query results fit in memory.
- The buffer pool is used to implement algorithms that need to spill to disk.
- Algorithms that maximize the amount of sequential I/O are preferred.
Sorting
- The relational model (and SQL) makes no guarantees about the order of tuples.
- Sorting is important in DBMSs because it makes many operations (e.g., joins) more efficient.
- "ORDER BY" queries require sorting.
- Sorting makes duplicate elimination with "DISTINCT" trivial.
- Bulk loading sorted tuples into a B+Tree index is faster.
- Sorting helps aggregations using "GROUP BY".
- An index can be built on the relation and then used to read the relation in sorted order, though this may lead to one disk-block access per tuple.
In-Memory Sorting
- A standard sorting algorithm like Quicksort is used if data fits in memory.
- The default algorithm in Python is TimSort (2002), which is a hybrid of insertion sort and merge sort.
- If data does not fit in memory, a technique that is aware of the cost of reading and writing disk pages is needed.
Top-N Heap Sort
- The DBMS only needs to scan the data once to find the top-N elements if a query contains an ORDER BY with a LIMIT.
- If the top-N elements fit in memory, this is the ideal scenario for heap sort.
- Scan data once, maintain an in-memory sorted priority queue.
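The top-N scan described above can be sketched with Python's heapq, keeping a bounded min-heap of the N best keys seen so far (the input rows and limit are made-up example values):

```python
import heapq

# Sketch of Top-N heap sort: scan the data once, maintaining a min-heap
# of at most n elements; the heap root is the weakest element kept so far.

def top_n(rows, n):
    heap = []
    for r in rows:
        if len(heap) < n:
            heapq.heappush(heap, r)
        elif r > heap[0]:              # new row beats the smallest kept so far
            heapq.heapreplace(heap, r)
    return sorted(heap, reverse=True)  # i.e. ORDER BY ... DESC LIMIT n

print(top_n([5, 1, 9, 3, 7, 8, 2], 3))  # [9, 8, 7]
```

Each row costs O(log N) heap work, so the whole query is one pass over the data plus a tiny final sort.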
External Sort-Merge
- External sort-merge splits data into separate runs, sorts them individually, and combines them into longer sorted runs; it is a divide-and-conquer algorithm.
- A run is a list of key/value pairs.
- The key is the attribute(s) being compared.
- The value is either the whole tuple (early materialization) or a record ID (late materialization).
- Phase 1 (Sorting): Sort chunks of data that fit in memory and write the sorted chunks to a file on disk.
- Phase 2 (Merging): Combine sorted runs into larger chunks.
2-Way External Sort Merge
- For each pass, "2" is the number of runs being merged into a new run.
- Data is broken up into N pages.
- The DBMS has B buffer pool pages to hold input and output data.
- Pass #0: read each page of the table into memory, sort it into a "run", and write it back to disk; repeat until the whole table consists of N sorted runs.
- Passes #1,2,3,...: recursively merge pairs of runs into runs twice as long.
- At least 3 buffer pages are needed (2 for input, 1 for output).
- The number of passes = 1 + ⌈ log2 N ⌉.
- The total I/O cost = 2N * (# of passes).
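A minimal in-memory simulation of the 2-way algorithm, treating pages as small lists, confirms the pass-count formula (the page contents are made-up example values):

```python
import math

# Toy simulation of 2-way external sort-merge: "pages" are small lists.
# Pass #0 sorts each page into a run; each later pass merges pairs of runs
# into runs twice as long (sorted() stands in for the streaming merge).

def two_way_sort(pages):
    runs = [sorted(p) for p in pages]        # pass #0: N sorted runs
    passes = 1
    while len(runs) > 1:                     # passes #1,2,3,...
        runs = [sorted(runs[i] + runs[i + 1]) if i + 1 < len(runs)
                else runs[i]
                for i in range(0, len(runs), 2)]
        passes += 1
    return runs[0], passes

data, passes = two_way_sort([[4, 2], [9, 1], [6, 5], [3, 8]])
print(data)                                  # [1, 2, 3, 4, 5, 6, 8, 9]
print(passes, 1 + math.ceil(math.log2(4)))   # 3 3  (matches 1 + ceil(log2 N))
```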
General External Sort-Merge Algorithm
- The previous algorithm only requires 3 buffer pages for sorting and does not effectively utilize bigger buffer sizes.
- Pass #0 uses B buffer pages to produce ⌈N / B⌉ sorted runs of size B.
- Pass #1,2,3,… merges B-1 runs.
- The number of passes = 1 + ⌈ log_(B-1) ⌈N / B⌉ ⌉.
- The total I/O cost = 2N * (# of passes).
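The pass-count and I/O formulas above can be written directly as code (N and B below are assumed example sizes):

```python
import math

# Pass-count formulas as code: N = data pages, B = buffer pool pages.

def passes_2way(N):
    return 1 + math.ceil(math.log2(N))

def passes_general(N, B):
    runs = math.ceil(N / B)                  # runs produced by pass #0
    return 1 + math.ceil(math.log(runs, B - 1)) if runs > 1 else 1

N, B = 1000, 11                              # assumed example sizes
p = passes_general(N, B)
print(p, 2 * N * p)   # passes, then total I/O cost = 2N * (# of passes)
```

With B = 11, pass #0 leaves 91 runs, and merging 10 at a time finishes in 2 more passes, versus 10 extra passes for the 2-way version.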
External Sort-Merge: Book Version
- B denotes memory size in pages.
- Create sorted runs. Let i be 0 initially. Repeatedly:
- (a) Read B blocks of the relation into memory.
- (b) Sort the in-memory blocks.
- (c) Write the sorted data to run Ri; increment i.
- Let the final value of i be N.
- Merge the runs (N-way merge), assuming N < B:
- Use N blocks of memory to buffer input runs and 1 block to buffer output. Read the first block of each run into its buffer page.
- repeat
- i. Select the first record (in sort order) among all buffer pages.
- ii. Write the record to the output buffer; write the output buffer to disk if it is full.
- iii. Delete the record from its input buffer page; if the buffer page becomes empty, read the next block (if any) of the run into it.
- until all input buffer pages are empty.
- Several merge passes are required if N ≥ B.
- During each pass, contiguous groups of B - 1 runs are merged.
- A pass reduces the number of runs by a factor of B - 1 and creates runs longer by the same factor.
- E.g., with B = 11 and 90 runs, one pass reduces the number of runs to 9, each new run being 10 times the size of the initial runs.
- Passes are repeated until all runs have been merged into one.
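The merge step above can be sketched in Python, with a heap standing in for "select the first record (in sort order) among all buffer pages" (the runs are made-up example data):

```python
import heapq

# Sketch of the N-way merge: one buffered record per sorted run; repeatedly
# emit the smallest record and refill the buffer from the run it came from.

def n_way_merge(runs):
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        rec, i, j = heapq.heappop(heap)   # first record in sort order
        out.append(rec)                   # "write to the output buffer"
        if j + 1 < len(runs[i]):          # read next record of run i, if any
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

print(n_way_merge([[1, 4, 7], [2, 5], [3, 6, 8]]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

A real DBMS buffers whole blocks rather than single records, but the control flow is the same.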
External merge sort – Cost analysis
- The number of initial runs is ⌈br / B⌉, where br is the number of blocks containing records of relation r.
- Using 1 block per run leads to too many seeks during the merge.
- To counter this, use bb buffer blocks per run, reading/writing bb blocks at a time.
- This allows ⌊B / bb⌋ - 1 runs to be merged in one pass.
- The total number of merge passes required is ⌈ log_(⌊B/bb⌋-1) (br / B) ⌉.
- Block transfers for initial run creation, and in each pass, total 2br.
- The final write is not counted, since the final output may be sent to the parent operation without being written to disk.
- Thus the total number of block transfers for external sorting is br (2 ⌈ log_(⌊B/bb⌋-1) (br / B) ⌉ + 1).
- Cost of seeks:
- During run generation: one seek to read each run and one seek to write each run, i.e., 2 ⌈br / B⌉ seeks.
- During the merge phase: 2 ⌈br / bb⌉ seeks for each merge pass, except the final one, which does not require a write.
- Total number of seeks: 2 ⌈br / B⌉ + ⌈br / bb⌉ (2 ⌈ log_(⌊B/bb⌋-1) (br / B) ⌉ - 1).
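The transfer and seek formulas above can be expressed as a small calculator (br, B, and bb below are assumed example values):

```python
import math

# External sort-merge cost formulas as code: br = blocks of relation r,
# B = memory size in blocks, bb = buffer blocks per run.

def sort_cost(br, B, bb):
    degree = B // bb - 1                          # runs merged per pass
    merge_passes = math.ceil(math.log(br / B, degree))
    transfers = br * (2 * merge_passes + 1)       # final write not counted
    seeks = (2 * math.ceil(br / B)                # run generation
             + math.ceil(br / bb) * (2 * merge_passes - 1))
    return transfers, seeks

print(sort_cost(br=10000, B=100, bb=2))  # (50000, 15200)
```

With these numbers: 100 initial runs, 49 runs merged per pass, 2 merge passes, hence 10000 * 5 transfers and 200 + 5000 * 3 seeks.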
Using B+-Trees for Sorting
- If the table to be sorted has a B+-Tree index on the sort attribute(s), we can use that to accelerate sorting.
- Retrieve tuples in a desired order by traversing the leaf pages of the tree.
- With a clustered B+Tree, traverse to the leftmost leaf page and retrieve tuples from all leaf pages in order.
- This is better than external sorting because there is no computational sorting cost and all disk access is sequential.
- Unclustered: Follow each pointer to the page that contains the data. One I/O per data record.
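A rough back-of-the-envelope comparison of the two cases, under assumed table sizes:

```python
# Rough I/O comparison under assumed sizes: a clustered B+Tree reads leaf
# pages sequentially, while an unclustered index pays ~1 random I/O per record.

num_records = 1_000_000      # assumed table size
records_per_page = 100       # assumed tuples per leaf page

clustered_io = num_records // records_per_page   # sequential leaf-page reads
unclustered_io = num_records                     # one I/O per data record

print(clustered_io, unclustered_io)  # 10000 1000000
```

A 100x gap in page accesses (and sequential vs. random I/O on top of that) is why unclustered-index sorting is rarely worthwhile for large scans.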
Aggregations
- Aggregation collapses values for an attribute from multiple tuples into a single scalar value.
- The DBMS uses sorting or hashing to quickly find tuples with the same grouping attributes.
- Sorting aggregation sorts the table on the grouping attributes so that matching tuples are adjacent.
- Hashing aggregation:
- Populate an ephemeral hash table as the DBMS scans the table.
- For each record, check whether there is already an entry in the hash table:
- DISTINCT: discard duplicates.
- GROUP BY: perform the aggregate computation.
- If everything fits in memory, this is easy.
- Otherwise, external hashing aggregation is used.
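A minimal sketch of in-memory hashing aggregation for a GROUP BY ... SUM-style query, using a plain dict as the ephemeral hash table (the rows and column layout are made-up example data):

```python
# In-memory hashing aggregation: one scan, one hash-table probe per record.

def hash_aggregate(rows):
    table = {}                        # GroupKey -> running sum
    for group_key, value in rows:
        if group_key in table:        # entry exists: update the running value
            table[group_key] += value
        else:                         # first tuple seen for this group
            table[group_key] = value
    return table

rows = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]
print(hash_aggregate(rows))  # {'a': 4, 'b': 6}
```

For DISTINCT, the value is simply dropped and the dict keys alone are the answer.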
External Hashing Aggregation
- Phase 1 ("Partition"): divide tuples into buckets based on their hash key; write a bucket out to disk when it fills up.
- Phase 2 ("ReHash"): build an in-memory hash table for each partition and compute the aggregation.
- A hash function h1 is used to split tuples into partitions on disk.
- Partitions are spilled to disk via output buffers.
- Assuming B buffer pages are available: B - 1 buffers for the partitions plus 1 buffer for the input data.
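The partition phase can be sketched as follows; B and the tuples are assumed example values, and Python's built-in hash stands in for h1:

```python
# Sketch of the partition phase of external hashing aggregation:
# hash function h1 assigns each tuple to one of B - 1 partitions.

B = 4                                    # buffer pages: B-1 partitions + input
partitions = [[] for _ in range(B - 1)]  # stand-ins for on-disk partitions

def h1(key):
    return hash(key) % (B - 1)

for tup in [("a", 1), ("b", 2), ("c", 3), ("a", 4)]:
    partitions[h1(tup[0])].append(tup)   # spill via that partition's buffer

# Tuples with equal keys always land in the same partition, so each
# partition can later be rehashed and aggregated independently in memory.
```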
Summarization with Hashing
- In the rehash phase, the hash table stores pairs of the form (GroupKey → RunningVal).
- When inserting a tuple into the hash table:
- If a matching GroupKey is found, update its RunningVal.
- Otherwise, insert a new (GroupKey → RunningVal) pair.
Selection Operation
- File scan algorithms search for records that satisfy a selection predicate.
- Algorithm A1 (linear search) scans each file block and tests every record to see if it satisfies the selection condition.
- Cost estimate = br block transfers + 1 seek, where br denotes the number of blocks containing records of relation r.
- If the selection is on a key attribute, the scan can stop as soon as the matching record is found.
- Average cost = (br / 2) block transfers + 1 seek.
- Linear search is applicable regardless of the selection condition, the ordering of records, or the available indices.
- Binary search generally does not make sense, because data is not stored consecutively.
- The exception is when an index is available.
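The A1 cost estimates above as code (the relation size and the HDD timing figures are assumed example values):

```python
# Linear-search (A1) cost: br transfers + 1 seek; on a key attribute the
# scan stops early, so the expected cost is br/2 transfers + 1 seek.

def linear_scan_cost(br, tT, tS, key_attribute=False):
    transfers = br / 2 if key_attribute else br
    return transfers * tT + 1 * tS    # milliseconds

print(round(linear_scan_cost(1000, tT=0.1, tS=4.0), 2))                      # 104.0
print(round(linear_scan_cost(1000, tT=0.1, tS=4.0, key_attribute=True), 2))  # 54.0
```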
Selections when Using Indices
- An index scan is a search algorithm that uses an index.
- For an index scan, the selection condition must be on the search-key of the index.
- A2 (clustering index, equality on key): retrieves a single record satisfying the equality condition.
- Cost = (hi + 1) * (tT + tS), where hi is the height of the index.
- A3 (clustering index, equality on non-key): retrieves multiple records.
- The matching records are on consecutive blocks; let b denote the number of blocks containing matching records.
- Cost = hi * (tT + tS) + tS + tT * b
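The A2/A3 cost formulas as code (hi, b, and the timing figures are assumed example values):

```python
# Clustering-index selection costs, in ms: hi = index height,
# b = number of blocks containing matching records.

def cost_A2(hi, tT, tS):
    return (hi + 1) * (tT + tS)           # equality on key: one record

def cost_A3(hi, b, tT, tS):
    return hi * (tT + tS) + tS + tT * b   # equality on non-key: b blocks

print(round(cost_A2(hi=3, tT=0.1, tS=4.0), 2))        # 16.4
print(round(cost_A3(hi=3, b=20, tT=0.1, tS=4.0), 2))  # 18.3
```

Even retrieving 20 blocks of matching records costs little more than a single-record lookup, because the extra blocks are read sequentially with one seek.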
Selections Using Indices
- A4 (secondary index, equality on key/non-key):
- Retrieves a single record if the search-key is a candidate key.
- Cost = (hi + 1) * (tT + tS)
- Retrieves multiple records if the search-key is not a candidate key; each of the n matching records may be on a different block.
- Cost = (hi + n) * (tT + tS). This can be very expensive.
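The A4 cost formula as code, showing why the non-candidate-key case can be very expensive (parameter values are assumed):

```python
# Secondary-index selection cost, in ms: hi = index height,
# n = number of matching records, each possibly on its own block.

def cost_A4(hi, n, tT, tS):
    return (hi + n) * (tT + tS)

# Candidate key (one record) vs. 10,000 matching records:
print(round(cost_A4(hi=3, n=1, tT=0.1, tS=4.0), 2))       # 16.4
print(round(cost_A4(hi=3, n=10_000, tT=0.1, tS=4.0), 2))  # 41012.3
```

At one seek per record, 10,000 matches cost over 40 seconds on the assumed HDD figures, which is why a linear scan often wins for unselective predicates.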
Selections Involving Comparisons
- Selections of the form σA≤V(r) or σA≥V(r) can be implemented with a linear file scan, or by using indices when they are present.
- A5 (clustering index, comparison):
- For σA≥V(r), use the index to find the first tuple with A ≥ v, then scan the relation sequentially from that point.
- For σA≤V(r), just scan the relation sequentially until the first tuple with A > v is found; the index is not used.
- A6 (secondary index, comparison):
- For σA≥V(r), use the index to find the first index entry with A ≥ v, then scan the index sequentially from there, following the pointers to the records.
- For σA≤V(r), scan the leaf pages of the index, following the pointers to records, until the first entry with A > v.
- In either case, retrieving each record can require a separate I/O, so a linear file scan may be cheaper.