Podcast
Questions and Answers
What is one fundamental feature of the hard drive technology established in 1956?
What is one fundamental feature of the hard drive technology established in 1956?
Which of the following best describes the memory hierarchy in data storage?
Which of the following best describes the memory hierarchy in data storage?
How do new DB technologies respond to the evolution of data storage?
How do new DB technologies respond to the evolution of data storage?
What role do CPU frequencies and memory access play in the context of hardware evolution?
What role do CPU frequencies and memory access play in the context of hardware evolution?
Signup and view all the answers
What is one of the key challenges related to trust management as mentioned in the context of data storage?
What is one of the key challenges related to trust management as mentioned in the context of data storage?
Signup and view all the answers
Which architecture does NoSQL databases typically use?
Which architecture does NoSQL databases typically use?
Signup and view all the answers
What is the primary function of ACID in distributed data processing?
What is the primary function of ACID in distributed data processing?
Signup and view all the answers
Which of the following trends is emphasized in the evolution of data storage and processing?
Which of the following trends is emphasized in the evolution of data storage and processing?
Signup and view all the answers
Which storage technology is the cheapest per gigabyte?
Which storage technology is the cheapest per gigabyte?
Signup and view all the answers
What has a lower latency than reading 1 MB sequentially from SSD?
What has a lower latency than reading 1 MB sequentially from SSD?
Signup and view all the answers
Which memory level has the highest access speed?
Which memory level has the highest access speed?
Signup and view all the answers
What type of data access structure is a B+ tree classified as?
What type of data access structure is a B+ tree classified as?
Signup and view all the answers
What is the approximate latency for sending a packet from California to the Netherlands?
What is the approximate latency for sending a packet from California to the Netherlands?
Signup and view all the answers
Which memory component is volatile?
Which memory component is volatile?
Signup and view all the answers
What is one of the primary challenges in hardware regarding CPU speed versus memory speed?
What is one of the primary challenges in hardware regarding CPU speed versus memory speed?
Signup and view all the answers
Which type of storage has the highest price per gigabyte?
Which type of storage has the highest price per gigabyte?
Signup and view all the answers
What is a primary advantage of using LSM-trees over B-trees?
What is a primary advantage of using LSM-trees over B-trees?
Signup and view all the answers
What is the latency for a mutex lock/unlock operation?
What is the latency for a mutex lock/unlock operation?
Signup and view all the answers
In the context of memory hierarchy, which memory type is used for faster access by the CPU?
In the context of memory hierarchy, which memory type is used for faster access by the CPU?
Signup and view all the answers
What is the purpose of ROWID in a database management system?
What is the purpose of ROWID in a database management system?
Signup and view all the answers
In the context of LSM-trees, what happens when level Ci is full?
In the context of LSM-trees, what happens when level Ci is full?
Signup and view all the answers
How do database management systems (SGBDs) improve efficiency in loading pages?
How do database management systems (SGBDs) improve efficiency in loading pages?
Signup and view all the answers
What characteristic of LSM-trees makes access to a value potentially slower?
What characteristic of LSM-trees makes access to a value potentially slower?
Signup and view all the answers
What is a defining feature of the storage strategy used by LSM-trees?
What is a defining feature of the storage strategy used by LSM-trees?
Signup and view all the answers
What is the size of an Oracle data block as indicated in the content?
What is the size of an Oracle data block as indicated in the content?
Signup and view all the answers
Which component in PostgreSQL pages contains information about free space?
Which component in PostgreSQL pages contains information about free space?
Signup and view all the answers
What does the recommendation of geometric progression for |Ci+1| imply in LSM-trees?
What does the recommendation of geometric progression for |Ci+1| imply in LSM-trees?
Signup and view all the answers
What aspect does the 'Item Ids' array in PostgreSQL pages represent?
What aspect does the 'Item Ids' array in PostgreSQL pages represent?
Signup and view all the answers
What is a disadvantage of using LSM-trees compared to B-trees?
What is a disadvantage of using LSM-trees compared to B-trees?
Signup and view all the answers
What is the primary storage structure for the first level (C0) of an LSM-tree?
What is the primary storage structure for the first level (C0) of an LSM-tree?
Signup and view all the answers
What are tuples in the context of a PostgreSQL data page?
What are tuples in the context of a PostgreSQL data page?
Signup and view all the answers
How does the log-structured storage approach benefit the writing process?
How does the log-structured storage approach benefit the writing process?
Signup and view all the answers
In the context of physical storage, what is a key factor affecting the number of pages a system can load?
In the context of physical storage, what is a key factor affecting the number of pages a system can load?
Signup and view all the answers
What does the term 'MVCC' refer to in database architectures?
What does the term 'MVCC' refer to in database architectures?
Signup and view all the answers
What information does the Special space in a PostgreSQL data page provide?
What information does the Special space in a PostgreSQL data page provide?
Signup and view all the answers
Which method of data writing is characterized by writing in real-time?
Which method of data writing is characterized by writing in real-time?
Signup and view all the answers
What is a limitation of update-in-place storage in relational databases?
What is a limitation of update-in-place storage in relational databases?
Signup and view all the answers
What is the primary function of logs in a database system?
What is the primary function of logs in a database system?
Signup and view all the answers
In what context is the term 'O2N' used?
In what context is the term 'O2N' used?
Signup and view all the answers
What is the role of a buffer in database management?
What is the role of a buffer in database management?
Signup and view all the answers
Which of the following distinguishes between sequential and random writing?
Which of the following distinguishes between sequential and random writing?
Signup and view all the answers
Which type of storage architecture primarily uses update-in-place methods?
Which type of storage architecture primarily uses update-in-place methods?
Signup and view all the answers
Study Notes
Hardware: Data Storage
- The hard drive was initially created in 1956 by IBM.
- It has multiple disks (platters) that spin rapidly together (thousands of revolutions per minute) and are ferromagnetic.
- Each platter has multiple tracks and each platter has one head on an air cushion.
- It costs roughly €100 and has a storage capacity of a few terabytes.
- Since the 1960s, it has been the primary storage (secondary) for computers.
- DBMSs were designed to be supported by this type of storage.
Memory Hierarchy
- Faster access is associated with higher levels of memory.
- Storage devices, from fastest to slowest access time are: CPU registers, CPU caches, main memory, NVRAM, Flash, hard disk, tape.
- CPU registers and CPU caches are volatile, meaning they lose their data when the power is turned off.
- On early computers, the speed of the CPU was limited by the speed of the memory bus and memory access. However, CPU speed has increased much faster than memory speed.
- SRAM is relatively expensive with a cost of 500$/GB, while DRAM is much cheaper at 5$/GB.
- NVRAM and Flash are even cheaper, at 1$/GB and 0.4$/GB, respectively.
Latency Numbers
- In 2012, the latency for a L1 cache reference was 0.5 ns, while a branch mispredict took 5 ns.
- L2 cache reference took 7 ns, while a mutex lock/unlock took 25 ns.
- A main memory reference took 100 ns, while compressing 1 KB of data with Zippy took 3,000 ns (3 µs).
- Sending 2 KB data over a 1 Gbps network took 20,000 ns (20 µs).
- SSD random read took 150,000 ns (150 µs), while reading 1 MB sequentially from memory took 250,000 ns (250 µs).
- Round trip within the same data center took 500,000 ns (0.5 ms), while sequentially reading 1 MB from an SSD took 1,000,000 ns (1 ms).
- A disk seek took 10,000,000 ns (10 ms), while reading 1 MB sequentially from disk took 20,000,000 ns (20 ms).
- Sending a packet from California to the Netherlands and back to California took 150,000,000 ns (150 ms).
Latency Numbers (2019)
- Latency for a 4 KB block in 2019: Fast NVMe (Optane) took 7 µs, Fast NVMe (Z-SSD) took 12 µs, round trip TCP packet on 10Gb Ethernet took 20-50 µs, and NVMe Flash SSD took 80 µs.
Database Design Levels
- There are three levels of database design: Conceptual, Logical, and Physical.
- The conceptual design provides a user-oriented description of the database, independent of implementation, and can be represented using E-R diagrams or UML.
- The logical design defines the structure of the database independently of the DBMS, using the relational model to define the schema of the tables.
- The physical design specifies how the database is physically implemented, including materialized views, partitions, and indexes.
Data Access
- Physical and logical access paths are used to retrieve data from the database.
- Physical access paths involve B+ tree, R-tree, k-d tree, bitmap, and Z-curve.
- Logical access paths include materialized views and partitions.
- Materialized views are pre-computed results of queries that are stored in the database, while partitions are divisions of data into smaller, more manageable units.
Physical Storage in DBMS (Recap)
- Each record has a physical address, known as the ROWID.
- Records are grouped into pages, also called blocks.
- Reading or writing data requires loading the corresponding page into memory.
- Efficient DBMSs aim to minimize page loads.
- Increasing block size allows access to more records within a single page but reduces the number of pages that can be loaded at once.
Physical Storage: Oracle (1)
- The physical storage in Oracle is based on the logical storage structures, which are defined by the E-R diagram.
- Physical storage is implemented using blocks of data.
Physical Storage: Oracle (2)
- Oracle uses blocks of data that are 8 KB in size.
- Each block contains a data block and system blocks.
- Each data block contains multiple 1 KB blocks.
- The data blocks are stored on disk, while system blocks are stored in memory.
Physical Storage: PostgreSQL Blocks
- PostgreSQL blocks are 8 KB in size.
- Each block contains:
- Page Header: Contains general information about the page and pointers to free space.
- Item Ids: Array of pairs (offset, length), pointing to the actual items.
- Free space: Unallocated space for new items.
- Tuples: The actual items themselves (rows for tables, entries for indexes).
- Special space: Empty in ordinary tables.
Managing Writes in DBMS
- Writes are managed by the DBMS using a buffer pool and a log.
- When a record is updated, the new version is initially written to the buffer pool.
- The log file records all changes, creating a journal of updates.
- The updates are then transferred from the buffer pool to the database file.
- Writes to the database or log can be synchronous (in real-time) or asynchronous (delayed).
- The log file ensures data recovery even if the database is corrupted.
Types of DB Architectures (MVCC)
- MVCC stands for Multi-Version Concurrency Control.
- Two common approaches are oldest-to-newest (O2N) and newest-to-oldest (N2O).
- In O2N, older versions of data are stored at the beginning of the file, while newer versions are stored at the end.
- In N2O, newer versions are stored at the beginning, and older versions are moved toward the end.
- Both approaches have different advantages and disadvantages, depending on the specific workload and needs of the application.
Types of Storage Architectures
- Update-in-place storage is commonly used by relational databases, which update data directly in the main storage area. Updating might lead to fragmentation, requiring extra efforts for accessing data and managing the index.
- Log-structured storage (LSM) focuses on fast writes by appending changes to a log file instead of updating the main storage directly. This approach benefits write performance but may slow down reads because the system needs to scan the log file for the latest data. This architecture is popularized in LSM-tree (Log-Structured Merge tree).
- Delta with main store combines the best of both worlds. The main store is optimized for reads, while updates are made to a write-optimized buffer (similar to SAP Hana, Sans Souci, etc. ). The buffer is periodically merged with the main store, combining the benefits of update-in-place and log-structured storage. This approach allows for high write throughput and efficient read performance.
LSM-Tree
- The LSM-tree is a family of structures optimized for write throughput, characteristic of log-structured storage systems.
- It consists of multiple levels (C0, C1, C2, ..., Ck) with C0 being an in-memory tree or hash table.
- Subsequent levels (C1, C2, etc.) are B-trees stored on disk or slower memory.
- New updates are initially written only to C0.
- When a level (Ci) is full, it is merged with the next level (Ci+1), emptying Ci.
- This means that every level (Ci) contains versions older than the versions in level (Ci + 1).
- This results in up to 'r' versions of tuples coexisting.
- A read operation traverses the levels from C0, C1... until the desired tuple is found.
- The original implementation recommended using a geometric progression for level sizes (|Ci+1| = r * |Ci|), resulting in an overall number of unique keys (N) of O(r^k).
- More recent solutions offer potentially more efficient strategies.
LSM-Tree: Benefits and Disadvantages
- By delaying writes to subsequent levels, the LSM-tree effectively accelerates write operations compared to traditional B-trees.
- Advantages:
- Faster sequential writes
- Reduced fragmentation, leading to more efficient range queries
- Disadvantages:
- Retrieving a specific value might take longer.
References
- Database System Fundamentals (by Ramakrishnan & Gehrke)
- Transaction Processing (by Jim Gray and Andreas Reuter)
- Database Systems: A Practical Approach (by Thomas Connolly and Carolyn Begg)
- Database Systems: Design, Implementation, and Management (by Carlos Coronel and Stephen Morris)
- The Transaction Processing Performance Council (TPC)
- The International Organization for Standardization (ISO)
- The American National Standards Institute (ANSI)
- The VLDB Endowment
- The ACM SIGMOD
In addition to the notes, I recommend reviewing the references listed in the text for a more comprehensive understanding of the topic.
Studying That Suits You
Use AI to generate personalized quizzes and flashcards to suit your learning preferences.
Related Documents
Description
This quiz explores the fundamentals of data storage, specifically focusing on hard drives and their history, as well as the memory hierarchy in computing systems. You'll learn about different types of storage and the speed and efficiency of data access across various components. Test your knowledge of these critical computer hardware concepts.