Questions and Answers
During query processing, what is the primary role of the 'parser and translator' component?
- To optimize the query execution plan.
- To execute the query and produce the final result.
- To check the syntax of the query and convert it into relational algebra. (correct)
- To estimate query costs using statistical information.
What does an 'evaluation-plan' specify in the context of query processing?
- The equivalent relational algebra expressions that can be used to produce the same result.
- The statistical information used to estimate the cost of query execution.
- The different algorithms available for evaluating relational algebra operations.
- A detailed strategy for how to execute a query, including specific algorithms and access methods. (correct)
What is the main goal of query optimization?
- To choose the evaluation plan with the lowest estimated cost among all equivalent plans. (correct)
- To reduce the number of tuples in each relation.
- To find all equivalent evaluation plans for a given query.
- To ensure the syntax of the query is correct.
When measuring the cost of a query, what factors are typically considered?
In the context of query cost, what does the term 'seek' refer to?
Why is sorting considered an important operation in DBMSs?
What is the key idea behind the external sort-merge algorithm?
In the 2-way external sort-merge algorithm, what is represented by 'N'?
What is the primary purpose of hashing aggregation?
What is the role of the hash function in the first phase of external hashing aggregation?
Flashcards
Parsing and Translation
Transforming a query into a tree-like form and verifying relations. Then, converting it into Relational Algebra.
Query Optimization
Choosing the lowest cost evaluation plan from equivalent options, using database statistics.
Query Evaluation
Executing a query-evaluation plan to get results.
Response Time
Resource Consumption
Disk cost
External Sort-Merge
Run Generation Cost
Sorting Aggregation
Study Notes
Basic Steps in Query Processing
- Query processing involves parsing and translation, optimization, and evaluation.
- Parsing checks for syntax and verifies relations.
- Queries are translated into an internal tree-like form and then into Relational Algebra during parsing.
- Evaluation involves the query-execution engine taking a query-evaluation plan, executing it, and returning the answers.
Optimization of Queries
- A relational algebra expression can have multiple equivalent expressions.
- Relational algebra operations can be evaluated using several different algorithms.
- An evaluation-plan is an annotated expression specifying a detailed evaluation strategy.
- Query optimization chooses the lowest cost evaluation plan from all equivalent plans.
- Cost is estimated using statistical information from the database catalog, like the number of tuples and size of tuples in each relation.
- Measuring query cost makes it possible to compare the different algorithms available for evaluating relational algebra operations.
- Algorithms combine individual operations to evaluate a complete expression.
- Query optimization involves finding an evaluation plan with the lowest estimated cost.
Measures of Query Cost
- Factors contributing to time cost include disk access, CPU usage, and network communication.
- Cost can be measured by response time or total resource consumption.
- Total resource consumption is used as the cost metric since response time is harder to estimate.
- Minimizing resource consumption is beneficial in shared databases.
- CPU costs are ignored for simplicity.
- Real systems consider CPU costs.
- Network costs need consideration in parallel systems.
- Cost is estimated separately for each operation.
- The cost of writing output to disk is not included.
- Disk cost estimation includes the number of seeks multiplied by the average seek cost.
- Disk cost estimation includes the number of blocks read times the average block read cost.
- Disk cost estimation also includes the number of blocks written times the average block write cost.
- The number of block transfers from disk and the number of seeks are used as cost measures.
- tT is the time to transfer one block, assuming write cost is the same as read cost.
- tS is the time for one seek.
- The cost for 'b' block transfers plus 'S' seeks is calculated as b * tT + S * tS.
- tS and tT depend on where the data is stored; typical values, assuming 4 KB blocks:
- For high-end magnetic disks, tS = 4 msec and tT = 0.1 msec.
- For SSDs, tS = 20-90 μsec and tT = 2-10 μsec for 4KB.
- Required data may already reside in the buffer, thus avoiding disk I/O.
- It's difficult to account for data in the buffer for cost estimation purposes.
- Disk I/O can be reduced using extra buffer space with several algorithms.
- The real memory available to the buffer depends on concurrent queries and the OS.
- Worst-case estimates assume no initial data is in the buffer, and only the minimum memory is available.
- More optimistic estimates are used in practice.
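As a quick sanity check, the cost formula above can be evaluated for the quoted HDD and SSD figures (the transfer and seek counts below are made-up example values):

```python
# Hedged sketch: estimating disk access cost as b * tT + S * tS, in ms,
# using the HDD/SSD figures quoted above (example workload, not a measurement).

def disk_cost(block_transfers, seeks, tT, tS):
    """Return estimated disk access time in milliseconds."""
    return block_transfers * tT + seeks * tS

# High-end magnetic disk: tS = 4 ms, tT = 0.1 ms per 4 KB block.
hdd = disk_cost(block_transfers=1000, seeks=10, tT=0.1, tS=4.0)
# SSD (midpoints of the quoted ranges): tS ~ 0.055 ms, tT ~ 0.006 ms.
ssd = disk_cost(block_transfers=1000, seeks=10, tT=0.006, tS=0.055)

print(round(hdd, 2))  # 140.0
print(round(ssd, 2))  # 6.55
```

The same workload is over 20x faster on the SSD, which is why tS and tT must reflect where the data actually lives.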
Operator Execution
- Queries are executed using the DBMS with operator algorithms, query processing models, and runtime architectures.
- Query planning arranges operators in a tree structure where data flows from leaves to the root.
- The result of the query is the output of the root node.
Disk-Oriented DBMS
- There is no assumption that a table fits entirely in memory.
- Similarly, a disk-oriented DBMS does not assume query results fit in memory.
- The buffer pool is used to implement algorithms that need to spill to disk.
- Algorithms that maximize the amount of sequential I/O are preferred.
Sorting
- The relational model (and SQL) makes no guarantees about the order of tuples.
- Sorting is important in DBMSs because it makes many operations (e.g., joins) more efficient.
- "ORDER BY" queries require sorting.
- Sorting makes duplicate elimination with "DISTINCT" trivial.
- Bulk loading sorted tuples into a B+Tree index is faster.
- Sorting helps aggregations using "GROUP BY".
- An index can be built on the relation and then used to read the relation in sorted order, though this may lead to one disk-block access per tuple.
In-Memory Sorting
- A standard sorting algorithm like Quicksort is used if data fits in memory.
- The default algorithm in Python is TimSort (2002), which is a hybrid of insertion sort and merge sort.
- If data does not fit in memory, a technique that is aware of the cost of reading and writing disk pages is needed.
Top-N Heap Sort
- The DBMS only needs to scan the data once to find the top-N elements if a query contains an ORDER BY with a LIMIT.
- If the top-N elements fit in memory, this is the ideal scenario for heap sort.
- Scan data once, maintain an in-memory sorted priority queue.
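The top-N scan described above can be sketched with Python's heapq, keeping a bounded min-heap of the N best keys seen so far (the input rows and limit are made-up example values):

```python
import heapq

# Sketch of Top-N heap sort: scan the data once, maintaining a min-heap
# of at most n elements; the heap root is the weakest element kept so far.

def top_n(rows, n):
    heap = []
    for r in rows:
        if len(heap) < n:
            heapq.heappush(heap, r)
        elif r > heap[0]:              # new row beats the smallest kept so far
            heapq.heapreplace(heap, r)
    return sorted(heap, reverse=True)  # i.e. ORDER BY ... DESC LIMIT n

print(top_n([5, 1, 9, 3, 7, 8, 2], 3))  # [9, 8, 7]
```

Each row costs O(log N) heap work, so the whole query is one pass over the data plus a tiny final sort.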
External Sort-Merge
- External sort-merge splits data into separate runs, sorts them individually, and combines them into longer sorted runs; it is a divide-and-conquer algorithm.
- A run is a list of key/value pairs.
- The key is the attribute(s) being compared.
- The value is either the whole tuple (early materialization) or a record ID (late materialization).
- Phase 1 (Sorting): Sort chunks of data that fit in memory and write the sorted chunks to a file on disk.
- Phase 2 (Merging): Combine sorted runs into larger chunks.
2-Way External Sort Merge
- For each pass, "2" is the number of runs being merged into a new run.
- Data is broken up into N pages.
- The DBMS has B buffer pool pages to hold input and output data.
- Pass #0: read each page of the table into memory, sort it into a "run", and write it back to disk; repeat until the whole table consists of N sorted runs.
- Passes #1,2,3,...: recursively merge pairs of runs into runs twice as long.
- At least 3 buffer pages are needed (2 for input, 1 for output).
- The number of passes = 1 + ⌈ log2 N ⌉.
- The total I/O cost = 2N * (# of passes).
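A minimal in-memory simulation of the 2-way algorithm, treating pages as small lists, confirms the pass-count formula (the page contents are made-up example values):

```python
import math

# Toy simulation of 2-way external sort-merge: "pages" are small lists.
# Pass #0 sorts each page into a run; each later pass merges pairs of runs
# into runs twice as long (sorted() stands in for the streaming merge).

def two_way_sort(pages):
    runs = [sorted(p) for p in pages]        # pass #0: N sorted runs
    passes = 1
    while len(runs) > 1:                     # passes #1,2,3,...
        runs = [sorted(runs[i] + runs[i + 1]) if i + 1 < len(runs)
                else runs[i]
                for i in range(0, len(runs), 2)]
        passes += 1
    return runs[0], passes

data, passes = two_way_sort([[4, 2], [9, 1], [6, 5], [3, 8]])
print(data)                                  # [1, 2, 3, 4, 5, 6, 8, 9]
print(passes, 1 + math.ceil(math.log2(4)))   # 3 3  (matches 1 + ceil(log2 N))
```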
General External Sort-Merge Algorithm
- The previous algorithm only requires 3 buffer pages for sorting and does not effectively utilize bigger buffer sizes.
- Pass #0 uses B buffer pages to produce ⌈N / B⌉ sorted runs of size B.
- Pass #1,2,3,… merges B-1 runs.
- The number of passes = 1 + ⌈ log_(B-1) ⌈N / B⌉ ⌉.
- The total I/O cost = 2N * (# of passes).
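The pass-count and I/O formulas above can be written directly as code (N and B below are assumed example sizes):

```python
import math

# Pass-count formulas as code: N = data pages, B = buffer pool pages.

def passes_2way(N):
    return 1 + math.ceil(math.log2(N))

def passes_general(N, B):
    runs = math.ceil(N / B)                  # runs produced by pass #0
    return 1 + math.ceil(math.log(runs, B - 1)) if runs > 1 else 1

N, B = 1000, 11                              # assumed example sizes
p = passes_general(N, B)
print(p, 2 * N * p)   # passes, then total I/O cost = 2N * (# of passes)
```

With B = 11, pass #0 leaves 91 runs, and merging 10 at a time finishes in 2 more passes, versus 10 extra passes for the 2-way version.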
External Sort-Merge: Book Version
- B denotes memory size in pages.
- Create sorted runs. Let i be 0 initially. Repeatedly:
- (a) Read B blocks of the relation into memory.
- (b) Sort the in-memory blocks.
- (c) Write the sorted data to run Ri; increment i.
- Let the final value of i be N.
- Merge the runs (N-way merge), assuming N < B:
- Use N blocks of memory to buffer input runs and 1 block to buffer output. Read the first block of each run into its buffer page.
- repeat
- i. Select the first record (in sort order) among all buffer pages.
- ii. Write the record to the output buffer; write the output buffer to disk if it is full.
- iii. Delete the record from its input buffer page; if the buffer page becomes empty, read the next block (if any) of the run into it.
- until all input buffer pages are empty.
- Several merge passes are required if N ≥ B.
- During each pass, contiguous groups of B - 1 runs are merged.
- A pass reduces the number of runs by a factor of B - 1 and creates runs longer by the same factor.
- E.g., with B = 11 and 90 runs, one pass reduces the number of runs to 9, each new run being 10 times the size of the initial runs.
- Passes are repeated until all runs have been merged into one.
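The merge step above can be sketched in Python, with a heap standing in for "select the first record (in sort order) among all buffer pages" (the runs are made-up example data):

```python
import heapq

# Sketch of the N-way merge: one buffered record per sorted run; repeatedly
# emit the smallest record and refill the buffer from the run it came from.

def n_way_merge(runs):
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        rec, i, j = heapq.heappop(heap)   # first record in sort order
        out.append(rec)                   # "write to the output buffer"
        if j + 1 < len(runs[i]):          # read next record of run i, if any
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

print(n_way_merge([[1, 4, 7], [2, 5], [3, 6, 8]]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

A real DBMS buffers whole blocks rather than single records, but the control flow is the same.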
External merge sort – Cost analysis
- The number of initial runs is ⌈br / B⌉, where br is the number of blocks containing records of relation r.
- Using 1 block per run leads to too many seeks during the merge.
- To counter this, use bb buffer blocks per run, reading/writing bb blocks at a time.
- This allows ⌊B / bb⌋ - 1 runs to be merged in one pass.
- The total number of merge passes required is ⌈ log_(⌊B/bb⌋-1) (br / B) ⌉.
- Block transfers for initial run creation, and in each pass, total 2br.
- The final write is not counted, since the final output may be sent to the parent operation without being written to disk.
- Thus the total number of block transfers for external sorting is br (2 ⌈ log_(⌊B/bb⌋-1) (br / B) ⌉ + 1).
- Cost of seeks:
- During run generation: one seek to read each run and one seek to write each run, i.e., 2 ⌈br / B⌉ seeks.
- During the merge phase: 2 ⌈br / bb⌉ seeks for each merge pass, except the final one, which does not require a write.
- Total number of seeks: 2 ⌈br / B⌉ + ⌈br / bb⌉ (2 ⌈ log_(⌊B/bb⌋-1) (br / B) ⌉ - 1).
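The transfer and seek formulas above can be expressed as a small calculator (br, B, and bb below are assumed example values):

```python
import math

# External sort-merge cost formulas as code: br = blocks of relation r,
# B = memory size in blocks, bb = buffer blocks per run.

def sort_cost(br, B, bb):
    degree = B // bb - 1                          # runs merged per pass
    merge_passes = math.ceil(math.log(br / B, degree))
    transfers = br * (2 * merge_passes + 1)       # final write not counted
    seeks = (2 * math.ceil(br / B)                # run generation
             + math.ceil(br / bb) * (2 * merge_passes - 1))
    return transfers, seeks

print(sort_cost(br=10000, B=100, bb=2))  # (50000, 15200)
```

With these numbers: 100 initial runs, 49 runs merged per pass, 2 merge passes, hence 10000 * 5 transfers and 200 + 5000 * 3 seeks.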
Using B+-Trees for Sorting
- If the table to be sorted has a B+-Tree index on the sort attribute(s), we can use that to accelerate sorting.
- Retrieve tuples in a desired order by traversing the leaf pages of the tree.
- With a clustered B+Tree, traverse to the leftmost leaf page and retrieve tuples from all leaf pages in order.
- This is better than external sorting because there is no computational sorting cost and all disk access is sequential.
- Unclustered: Follow each pointer to the page that contains the data. One I/O per data record.
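A rough back-of-the-envelope comparison of the two cases, under assumed table sizes:

```python
# Rough I/O comparison under assumed sizes: a clustered B+Tree reads leaf
# pages sequentially, while an unclustered index pays ~1 random I/O per record.

num_records = 1_000_000      # assumed table size
records_per_page = 100       # assumed tuples per leaf page

clustered_io = num_records // records_per_page   # sequential leaf-page reads
unclustered_io = num_records                     # one I/O per data record

print(clustered_io, unclustered_io)  # 10000 1000000
```

A 100x gap in page accesses (and sequential vs. random I/O on top of that) is why unclustered-index sorting is rarely worthwhile for large scans.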
Aggregations
- Aggregation collapses values for an attribute from multiple tuples into a single scalar value.
- The DBMS uses sorting or hashing to quickly find tuples with the same grouping attributes.
- Sorting aggregation sorts the table on the grouping attributes so that matching tuples are adjacent.
- Hashing aggregation:
- Populate an ephemeral hash table as the DBMS scans the table.
- For each record, check whether there is already an entry in the hash table:
- DISTINCT: discard duplicates.
- GROUP BY: perform the aggregate computation.
- If everything fits in memory, this is easy.
- Otherwise, external hashing aggregation is used.
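A minimal sketch of in-memory hashing aggregation for a GROUP BY ... SUM-style query, using a plain dict as the ephemeral hash table (the rows and column layout are made-up example data):

```python
# In-memory hashing aggregation: one scan, one hash-table probe per record.

def hash_aggregate(rows):
    table = {}                        # GroupKey -> running sum
    for group_key, value in rows:
        if group_key in table:        # entry exists: update the running value
            table[group_key] += value
        else:                         # first tuple seen for this group
            table[group_key] = value
    return table

rows = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]
print(hash_aggregate(rows))  # {'a': 4, 'b': 6}
```

For DISTINCT, the value is simply dropped and the dict keys alone are the answer.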
External Hashing Aggregation
- Phase 1 ("Partition"): divide tuples into buckets based on their hash key; write a bucket out to disk when it fills up.
- Phase 2 ("ReHash"): build an in-memory hash table for each partition and compute the aggregation.
- A hash function h1 is used to split tuples into partitions on disk.
- Partitions are spilled to disk via output buffers.
- Assuming B buffer pages are available: B - 1 buffers for the partitions plus 1 buffer for the input data.
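The partition phase can be sketched as follows; B and the tuples are assumed example values, and Python's built-in hash stands in for h1:

```python
# Sketch of the partition phase of external hashing aggregation:
# hash function h1 assigns each tuple to one of B - 1 partitions.

B = 4                                    # buffer pages: B-1 partitions + input
partitions = [[] for _ in range(B - 1)]  # stand-ins for on-disk partitions

def h1(key):
    return hash(key) % (B - 1)

for tup in [("a", 1), ("b", 2), ("c", 3), ("a", 4)]:
    partitions[h1(tup[0])].append(tup)   # spill via that partition's buffer

# Tuples with equal keys always land in the same partition, so each
# partition can later be rehashed and aggregated independently in memory.
```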
Summarization with Hashing
- In the rehash phase, the hash table stores pairs of the form (GroupKey → RunningVal).
- When inserting a tuple into the hash table:
- If a matching GroupKey is found, update its RunningVal.
- Otherwise, insert a new (GroupKey → RunningVal) pair.
Selection Operation
- File scan algorithms search for records that satisfy a selection predicate.
- Algorithm A1 (linear search) scans each file block and tests every record to see if it satisfies the selection condition.
- Cost estimate = br block transfers + 1 seek, where br denotes the number of blocks containing records of relation r.
- If the selection is on a key attribute, the scan can stop as soon as the matching record is found.
- Average cost = (br / 2) block transfers + 1 seek.
- Linear search is applicable regardless of the selection condition, the ordering of records, or the available indices.
- Binary search generally does not make sense, because data is not stored consecutively.
- The exception is when an index is available.
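The A1 cost estimates above as code (the relation size and the HDD timing figures are assumed example values):

```python
# Linear-search (A1) cost: br transfers + 1 seek; on a key attribute the
# scan stops early, so the expected cost is br/2 transfers + 1 seek.

def linear_scan_cost(br, tT, tS, key_attribute=False):
    transfers = br / 2 if key_attribute else br
    return transfers * tT + 1 * tS    # milliseconds

print(round(linear_scan_cost(1000, tT=0.1, tS=4.0), 2))                      # 104.0
print(round(linear_scan_cost(1000, tT=0.1, tS=4.0, key_attribute=True), 2))  # 54.0
```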
Selections when Using Indices
- An index scan is a search algorithm that uses an index.
- For an index scan, the selection condition must be on the search-key of the index.
- A2 (clustering index, equality on key): retrieves a single record satisfying the equality condition.
- Cost = (hi + 1) * (tT + tS), where hi is the height of the index.
- A3 (clustering index, equality on non-key): retrieves multiple records.
- The matching records are on consecutive blocks; let b denote the number of blocks containing matching records.
- Cost = hi * (tT + tS) + tS + tT * b
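The A2/A3 cost formulas as code (hi, b, and the timing figures are assumed example values):

```python
# Clustering-index selection costs, in ms: hi = index height,
# b = number of blocks containing matching records.

def cost_A2(hi, tT, tS):
    return (hi + 1) * (tT + tS)           # equality on key: one record

def cost_A3(hi, b, tT, tS):
    return hi * (tT + tS) + tS + tT * b   # equality on non-key: b blocks

print(round(cost_A2(hi=3, tT=0.1, tS=4.0), 2))        # 16.4
print(round(cost_A3(hi=3, b=20, tT=0.1, tS=4.0), 2))  # 18.3
```

Even retrieving 20 blocks of matching records costs little more than a single-record lookup, because the extra blocks are read sequentially with one seek.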
Selections Using Indices
- A4 (secondary index, equality on key/non-key):
- Retrieves a single record if the search-key is a candidate key.
- Cost = (hi + 1) * (tT + tS)
- Retrieves multiple records if the search-key is not a candidate key; each of the n matching records may be on a different block.
- Cost = (hi + n) * (tT + tS). This can be very expensive.
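The A4 cost formula as code, showing why the non-candidate-key case can be very expensive (parameter values are assumed):

```python
# Secondary-index selection cost, in ms: hi = index height,
# n = number of matching records, each possibly on its own block.

def cost_A4(hi, n, tT, tS):
    return (hi + n) * (tT + tS)

# Candidate key (one record) vs. 10,000 matching records:
print(round(cost_A4(hi=3, n=1, tT=0.1, tS=4.0), 2))       # 16.4
print(round(cost_A4(hi=3, n=10_000, tT=0.1, tS=4.0), 2))  # 41012.3
```

At one seek per record, 10,000 matches cost over 40 seconds on the assumed HDD figures, which is why a linear scan often wins for unselective predicates.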
Selections Involving Comparisons
- Selections of the form σA≤V(r) or σA≥V(r) can be implemented with a linear file scan, or by using indices when they are present.
- A5 (clustering index, comparison):
- For σA≥V(r), use the index to find the first tuple with A ≥ v, then scan the relation sequentially from that point.
- For σA≤V(r), just scan the relation sequentially until the first tuple with A > v is found; the index is not used.
- A6 (secondary index, comparison):
- For σA≥V(r), use the index to find the first index entry with A ≥ v, then scan the index sequentially from there, following the pointers to the records.
- For σA≤V(r), scan the leaf pages of the index, following the pointers to records, until the first entry with A > v.
- In either case, retrieving each record can require a separate I/O, so a linear file scan may be cheaper.