Database Concepts PDF

6 Database Concepts

6.1. Database Terminology

Database and Managed Database

A database is a structured collection of data that is stored in a computer system. A database management system (DBMS) is the software that end users and client-side applications use to access the data. Together they are referred to as a database system, but this is often simply called a database, the same term that refers to the structured collection of data.

A managed database is a database that a third-party provider maintains and operates on behalf of a user. Cloud databases are an example of managed databases.

Each database has its advantages, limitations, and tradeoffs. Designing and choosing a database that will best suit your application based on the use cases and throughput requirements is a key part of the system design interview.

Database Model and Database Schema

A database model (data model) refers to how data is organized and the relationships that exist within the data. A database model is an abstract model that is not tied to a specific database implementation. Data modeling is the process of defining these data models and is an essential step in designing a database. A database model may be represented by an entity-relationship diagram (ERD).

A database schema refers to the physical implementation of the database model on a specific database platform. The design of the data models is the same regardless of the database platform or type. In contrast, the design of database schemas is specific to a database platform.

Transactions and Operations

A database transaction is a logical unit of work performed on a database and may consist of multiple operations. An operation is a smaller unit of work that helps complete the transaction. For example, a database transaction to transfer funds from one account to another could consist of two operations.
The first operation is to subtract the transfer amount from the first account, and the second operation is to add the same amount to the second account. The behavior of transactions differs depending on the database type and data model.

Availability, Reliability, and Persistence

Availability is the percentage of time that the users of the database can access it. For example, a database can have a stated availability service level of 99.99%, which means that in a year, that database can have up to a total of about 52 minutes where it cannot be accessed (0.01% of the 525,600 minutes in a year).

Reliability is the measure of the database's integrity, safety, and recoverability. A database that has high reliability has low rates of data corruption, secure authorization, and can recover data in the event of a failure.

Persistence means that after a piece of data is written to the database, it is stably stored on non-volatile memory such as a hard drive or a solid-state drive. Non-volatile memory refers to storage that retains data even when it is not powered (such as hard drives), while volatile memory refers to storage that retains data only while it is powered (such as RAM).

An "in-memory" database, where the data is stored in RAM, typically has an order of magnitude faster access latency than databases backed by non-volatile storage. While this database lacks persistence, it could be useful for applications where past information quickly becomes irrelevant, such as gaming or real-time services. This is an example of a database design tradeoff: lower latency is achieved by sacrificing persistence and reliability.

6.2. Data Modeling

Data modeling can be more specifically defined as the process of designing the data types used and stored within a system, the relationships between data types, and the way data is organized through its attributes. An object represented in a database is called an entity, and its type is referred to as its entity type or entity class. An attribute is a property of the entity type.
For example, an entity with an entity type of "person" could have an attribute "name" with a value of "Jon".

There are multiple types of data models. Listed from the least technically detailed to the most:

• Conceptual data model: a high-level data model that defines the product scope, business requirements, and major features
• Logical data model: a more specific data model that outlines the main entities and attributes
• Physical data model: a technical data model that includes specifics about data structure, including entities, attributes, and relationships

The following diagram contains example data models for the music application:

[Diagram: conceptual, logical, and physical data models for the music application]

Conceptual Data Model
    Artist, Song, Album

Logical Data Model
    Artist: artist id, created timestamp, artist name, country
    Song:   song id, created timestamp, title, artist id, album id, genre, song length
    Album:  album id, created timestamp, album name, artist id, description

Physical Data Model
    Song
        song_id: long (8 bytes) (PK)
        created_timestamp: timestamp (8 bytes)
        title: string (128 bytes)
        artist_id: long (8 bytes)
        album_id: long (8 bytes)
        genre: string (20 bytes)
        song_length: int (4 bytes)
    Artist
        artist_id: long (8 bytes) (PK)
        created_timestamp: timestamp (8 bytes)
        artist_name: string (128 bytes)
        country: string (20 bytes)
    Album
        album_id: long (8 bytes) (PK)
        created_timestamp: timestamp (8 bytes)
        album_name: string (128 bytes)
        artist_id: long (8 bytes)
        description: string (256 bytes)

Although conceptual and logical data models are helpful in the product development and brainstorming stages, both are abstract and high-level designs. For the system design interview, we focus on the physical data model, which describes the layout, association, and structure of data used by services and stored in databases. It also illustrates the relationships among entities and attributes. Note that a physical data model is not a database platform-specific model; that is the "database schema" that was defined earlier.
The physical model is usually aligned to a specific platform and can be used to generate the database schema.

Data Model Overload

"Data model" is an overloaded term that is used to describe data used for both databases and services. The layout and schema of messages that we used to send requests and responses in "Designing Services" are also considered a type of data model. We disambiguate between the two by calling one "database model" and the other "message data structure." Also, the terms "field" and "attribute" are often used interchangeably and often mean the same thing. However, we'll use the term "attribute" with database models and "field" for messages. The attributes of a database model often overlap with the fields of a message data model.

6.3. Entity Relationship Diagram

An ERD (Entity Relationship Diagram), or ER diagram, is a diagram that visualizes the relationships of the entities and attributes used during database modeling. ER diagrams are not tied to the physical structure of the database but are used to design databases conceptually; their purpose is to illustrate a data model that can be converted into a database schema. ERDs can help with design decisions before database implementation; altering a database structure after productionization can be complicated.

Attributes and Relationships

There are several formal versions of ERD, but this section covers some of the basic rules that will be useful in system design interviews. There are three main components in ERDs: entities, attributes, and relationships.

An entity is represented by a rectangle or a rounded rectangle, with its name at the top and its attributes listed below. In the list of attributes, the data type of the attribute and expected size are specified. Additionally, the attributes can be denoted as primary keys or foreign keys. Primary keys (PK) are used to uniquely identify an entity.
A foreign key (FK) is used to uniquely identify another entity and is used to link two entities together. In a relational database, which will be described in the next section, foreign keys link two tables together. Additionally, a composite primary key (CPK) is a key that uses two or more attributes to uniquely identify the entity. They are usually used for child entities that can be uniquely identified by other entities.

Relationships are represented by lines between entities. A solid line means a strong (identifying) relationship. This means that the child entity's primary key contains the parent entity's primary key. A dashed line means a weak (non-identifying) relationship where neither entity contains the other's primary key in its primary key.

Cardinality and Modality

Cardinality refers to the maximum number of elements of an entity that are associated with the elements in another entity. Modality refers to the minimum number of elements of an entity that are associated with elements in another entity. The relationship between two entities, Entity A and Entity B, can have the following types of cardinalities:

• One-to-one: an element of Entity A is linked to only one element of Entity B, and vice versa. For example, a user and contact information can have a one-to-one relationship.
• One-to-many: an element of Entity A is linked to multiple elements of Entity B, but an element of Entity B is linked to only one element of Entity A. For example, an album can have multiple songs, but a song belongs to only one album.
• Many-to-many: an element of Entity A is linked to multiple elements of Entity B, and an element of Entity B is linked to multiple elements of Entity A. For example, a song may be written by multiple artists, and an artist may write multiple songs.

The following diagram illustrates the drawing notation for the cardinality in ERDs:

[Diagram: ERD notation. Line styles: identifying and non-identifying. Relationships shown between Entity A and Entity B: one-to-one, one-to-many, many-to-many. Line-end symbols: one and only one, zero or more, one or more, zero or one.]

Example ERD

In the following example, the Song entity has a weak one-to-many relationship with the Album entity since a song belongs to a single album, and an album can have multiple songs. The Saved_Song entity represents a song that is added to a playlist, and it has a weak one-to-many relationship with the Song entity since multiple users can save the same song to a playlist. Saved_Song has a strong many-to-one relationship with the Playlist since it must belong to a playlist, and a playlist can have multiple songs.

    Song
        song_id: long (8 bytes) (PK)
        created_timestamp: timestamp (8 bytes)
        title: string (128 bytes)
        artist_id: long (8 bytes)
        album_id: long (8 bytes)
        genre: string (20 bytes)
        song_length: int (4 bytes)
    Artist
        artist_id: long (8 bytes) (PK)
        created_timestamp: timestamp (8 bytes)
        artist_name: string (128 bytes)
        country: string (20 bytes)
    Album
        album_id: long (8 bytes) (PK)
        created_timestamp: timestamp (8 bytes)
        album_name: string (128 bytes)
        artist_id: long (8 bytes)
        description: string (256 bytes)
    Saved_Song
        saved_song_id: long (8 bytes) (PK)
        created_timestamp: timestamp (8 bytes)
        song_id: long (8 bytes)
        playlist_id: long (8 bytes)
    Playlist
        playlist_id: long (8 bytes) (PK)
        created_timestamp: timestamp (8 bytes)
        playlist_name: string (128 bytes)
        user_id: long (8 bytes)

6.4. Relational vs. NoSQL Databases

A relational database is a type of database that represents data in tables, where each row of a table is a record. The columns of a table can refer to data in another table, thus creating a relationship between the tables. SQL (Structured Query Language) is the language used to communicate with a relational database. "Relational" and "SQL" are tightly associated with one another and often used synonymously. Though not all relational databases use SQL, "SQL database" usually means a relational database, and a "relational database" usually means that the query language is SQL.
In an SQL table, a primary key is a field that uniquely identifies each row (a record) of the table. A foreign key is a field that refers to the primary key of another table and is used to link rows from different tables together.

A NoSQL database ("not only SQL") is a type of database that does not store data in tables. NoSQL is a catch-all category for databases that do not use a relational model and instead store data in a non-tabular format. This non-tabular format is represented by a schema, which describes the organization and layout of data in the NoSQL database. NoSQL databases are also called non-relational databases.

The diagram below illustrates how data for playlists and songs could be organized in a relational and non-relational database:

    Relational

    Playlist table (playlist_id, playlist_name)
        23483   "Playlist 1"
        49310   "Playlist 2"

    Song table (song_id, song_name, playlist_id)
        8201    "Song 1"    23483
        8532    "Song 2"    49310
        8756    "Song 3"    23483

    Non-relational

    Playlist collection (playlist_id, playlist_name, songs)
        23483   "Playlist 1"    ["Song 1", "Song 3"]
        49310   "Playlist 2"    ["Song 2"]

The most common types of NoSQL databases are key-value, column-oriented, and graph. In a key-value database, a data element is stored as a pair consisting of a key and a value. Groups of key-values are stored in collections as opposed to tables in a relational database. Because of their simplicity and ease of use, key-value databases are the most common type of NoSQL database. In this book, we'll use the key-value type for the examples of a NoSQL database.

Understanding the tradeoffs between relational and NoSQL databases is a common theme in system design. Relational databases excel at storing transactional data and allow complex queries across tabular relationships. In contrast, NoSQL databases use flexible schemas to store data and can easily scale to high request throughput and large data sizes.
The table below outlines when relational or NoSQL databases should be used:

    Relational databases                          NoSQL databases
    --------------------------------------------  ------------------------------------
    Structured data                               Unstructured or semi-structured data
    Strict schema / fixed format                  Flexible schema / unknown format
    Relational data                               Non-relational data
    Need for joins or nested lookups              Data can be denormalized
    Transaction-driven workloads                  High QPS workloads
    Known scaling behavior                        Unlimited scaling
    Multiple indexes or unknown lookup patterns   Key-value driven lookups

Some of the real-world use cases for relational databases include:

• Online transaction processing (OLTP)
• Online analytical processing (OLAP)
• Business intelligence and data warehouses
• Logistics and inventory management
• Financial services and payments

Some of the real-world use cases for NoSQL databases include:

• Real-time analytics
• Logging and click-driven data
• Fraud detection
• Recommendations and news feeds
• Internet of things (IoT) sensor data
• High-throughput messaging
• Video and photo sharing

Data modeling is a process that is used to design both relational and NoSQL databases. While ER diagrams are typically associated with relational databases, they can also be used for NoSQL databases. In NoSQL databases, relationships within the data are not enforced by the database, but they can still be visualized by ERDs. NoSQL databases have a more flexible schema, and attributes and relationships can be more easily added compared to relational databases, but the designs of the database models are still critical. Relational database models use a table format, while NoSQL uses a schema.

6.5. Example of Relational Database vs. NoSQL Database

The following example compares how a music application stores songs, playlists, albums, and users in both a relational and NoSQL database.

In a relational database, each entity class typically corresponds to a table, and each entity instance or entity element corresponds to a row in a table.
In the Song table, each row represents a song, and the primary key is song_id:

    Song
    song_id  created_timestamp  title              artist_id  album_id  genre      song_length
    1        1642227231         Sonata in C minor  5442       254634    classical  28
    2        1642248185         Happy Song         366        761456    pop        143
    3        1642268330         Alt Song           1532       593874    indie      197

In the Saved_Song table, a row represents a song that was saved by a user to a playlist, and the primary key is saved_song_id. The foreign key song_id associates a row in the Saved_Song table with a row in the Song table:

    Saved_Song
    saved_song_id  created_timestamp  song_id  playlist_id
    39482          1642286434         …        …
    39483          1642288243         …        …
    39484          1642288930         …        2535

In the Playlist table, a row represents a playlist, and the primary key is the playlist_id. The foreign key user_id associates a row in the Playlist table with a row in the User table (not shown):

    Playlist
    playlist_id  created_timestamp  playlist_name  user_id
    2535         1642287137         gym workout    69405
    2536         1642287523         relax time     24523
    2537         1642288164         all-time favs  53411

Suppose a user with a user_id of 69405 wants to retrieve the songs in their playlist "gym workout," which has a playlist_id of 2535. This can be performed with an SQL SELECT statement, which is a query to retrieve rows from a table. A SELECT on the Saved_Song table can be performed with the following statement:

    SELECT * FROM Saved_Song WHERE playlist_id = 2535;

This would return rows that match a playlist_id of 2535:

    saved_song_id  created_timestamp  song_id  playlist_id
    39484          1642288930         …        2535

This result contains only the song_id, not the song's details. To also retrieve the title, artist, and album, the query can JOIN the Saved_Song table with the Song table:

    SELECT Saved_Song.saved_song_id, Saved_Song.song_id, Song.title,
           Song.artist_id, Song.album_id
    FROM Saved_Song
    JOIN Song ON Saved_Song.song_id = Song.song_id
    WHERE Saved_Song.playlist_id = 2535;

This SELECT and JOIN statement combines the rows from the two tables that match for the key song_id:

    saved_song_id  song_id  title              artist_id  album_id
    39484          …        Sonata in C minor  366        761456

Similar JOIN queries need to be run to retrieve the artist name and album name from their respective tables (not shown).
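The SELECT and JOIN steps above can be tried end to end with Python's built-in sqlite3 module. This is a minimal sketch, not the book's implementation: the tables are trimmed to the columns the query needs, and the song_id and playlist_id values in Saved_Song are assumed for illustration because they are not legible in the tables above. The helper names build_example_db and songs_in_playlist are hypothetical.

```python
import sqlite3


def build_example_db():
    """Create an in-memory database with trimmed Song and Saved_Song tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Song (
            song_id INTEGER PRIMARY KEY,
            title   TEXT NOT NULL
        );
        CREATE TABLE Saved_Song (
            saved_song_id INTEGER PRIMARY KEY,
            song_id       INTEGER NOT NULL REFERENCES Song(song_id),
            playlist_id   INTEGER NOT NULL
        );
        """
    )
    conn.executemany(
        "INSERT INTO Song VALUES (?, ?)",
        [(1, "Sonata in C minor"), (2, "Happy Song"), (3, "Alt Song")],
    )
    # Assumed values: which song sits in which playlist is illustrative only.
    conn.executemany(
        "INSERT INTO Saved_Song VALUES (?, ?, ?)",
        [(39482, 2, 2536), (39483, 3, 2537), (39484, 1, 2535)],
    )
    conn.commit()
    return conn


def songs_in_playlist(conn, playlist_id):
    """Return (saved_song_id, song_id, title) for every song saved to a playlist."""
    query = """
        SELECT Saved_Song.saved_song_id, Saved_Song.song_id, Song.title
        FROM Saved_Song
        JOIN Song ON Saved_Song.song_id = Song.song_id
        WHERE Saved_Song.playlist_id = ?
        ORDER BY Saved_Song.saved_song_id
    """
    return conn.execute(query, (playlist_id,)).fetchall()


if __name__ == "__main__":
    print(songs_in_playlist(build_example_db(), 2535))
```

Note that the JOIN clause comes before the WHERE clause; the filter is applied to the combined rows.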
SQL JOINs are a powerful way to query multiple tables and associate rows with each other. However, JOINs are more computationally expensive than SELECTs or simple lookups, as the database needs to perform lookups on multiple tables and combine the results.

The same data can be organized in a NoSQL database. In a key-value NoSQL database, data is stored in collections instead of tables. Instead of the columns and rows of a table, collections are defined by a schema that outlines how an element in a collection is formatted. For example, the Song collection has the following schema:

    // Song schema
    {
      "song_id": long,
      "created_timestamp": timestamp,
      "title": string,
      "artist_id": long,
      "album_id": long,
      "genre": string,
      "song_length": int
    }

Each key has a defined data type, and each element within the Song collection would be formatted according to this data schema. Likewise, the schemas for the Saved_Song and Playlist collections are:

    // Saved_Song schema
    {
      "saved_song_id": long,
      "created_timestamp": timestamp,
      "song_id": long,
      "playlist_id": long
    }

    // Playlist schema
    {
      "playlist_id": long,
      "created_timestamp": timestamp,
      "playlist_name": string,
      "user_id": long
    }

So far, this doesn't seem that much different from the way the data was organized in the tables; we have simply created a collection for each table. An element of a collection looks similar to a row of the table. For example, an element from the Playlist collection corresponds to a row in the Playlist table:

    // Element of the Playlist collection
    {
      "playlist_id": 2535,
      "created_timestamp": 1642287137,
      "playlist_name": "gym workout",
      "user_id": 69405
    }

If we wanted to perform the same task as before (get the songs in the playlist with a playlist_id of 2535), the routine to do so is:

    playlist = db.playlist.find({"playlist_id": 2535})
    playlist_songs = []
    for playlist_song in playlist.songs:
        playlist_songs.append(db.song.find({"song_id": playlist_song.song_id}))

This routine required lookups in both the Playlist and the Song collections.
This doesn't seem to be that much better than performing a JOIN in the relational database. Both the relational and NoSQL databases have similar lookup computational costs: one lookup for the playlist and a second lookup for the song name for each song of the playlist.

However, NoSQL schemas are flexible and not constrained to storing normalized elements. One way to improve this query is to embed (or nest) the song information as a list within the playlist element. The new schema of the Playlist collection is the following:

    // Playlist schema
    {
      "playlist_id": long,
      "created_timestamp": timestamp,
      "playlist_name": string,
      "user_id": long,
      "songs": [{"song_id": long, "title": string}]
    }

An element using the new Playlist schema:

    // Element of the new Playlist collection
    {
      "playlist_id": 2535,
      "created_timestamp": 1642287137,
      "playlist_name": "gym workout",
      "user_id": 69405,
      "songs": [ ... ]
    }

The routine to fetch the playlist with song information, using the new schema, is:

    playlist = db.playlist.find({"playlist_id": 2535})

For this query, the embedded version of this schema is more efficient than the non-embedded version since the lookup no longer needs to aggregate and combine elements. This also illustrates how storing the playlists in a key-value format is fundamentally different from storing them in tables: the song ids can be embedded within the data of the playlist id, and this type of schema is called an embedded schema.

Embedding does not always improve query performance. In this case, embedding does improve performance because it reduces the number of operations that we need to perform. However, suppose that we need to check if a song is already in an existing playlist. In the embedded schema, this would require a search through all the playlists and the nested song ids. For the non-embedded schema, this would only require a search of elements that match the song id.

As a rule of thumb, embedding works best when:

• The cardinality of the parent to the embedded elements is one-to-few (i.e., hundreds or thousands).
• There is no use case for accessing the embedded elements outside of the parent.

Non-embedded schemas, or using a reference to another object, work best when:

• The cardinality of the parent to the embedded elements is one-to-many or one-to-squillions (i.e., millions or above).
• Embedded elements are independently accessed outside of the parent.

Embedding can also be performed in a relational database; that is, a "songs" column in the Playlist table could be added so that each row contains a list of song information. However, a row within a relational database is not designed to effectively handle complex non-primitive data types and scale to an unlimited number of sub-elements. Rows in a relational database are designed to hold basic primitive and non-primitive data types such as integers and strings and use SELECT and JOIN statements to combine those rows.

Schema design is a critical part of designing performant NoSQL databases, and having a poorly designed schema could mean greater latency and higher query costs. But if there are multiple different use cases for the data that require both embedded and non-embedded schemas, does that mean that we need to choose one over the other? One way to solve this problem is through denormalization, which is described in the next section.

6.6. Normalization or Denormalization

Normalization is a database design technique that stores data in a non-redundant and consistent schema. In a normalized schema, there is no redundant data in a database, which helps eliminate inconsistencies during write and modify operations. SQL tables commonly store data in a normalized format and use table relationships to aggregate rows through SELECT and JOIN statements. For example, the Song and Playlist tables from the previous section are in a normalized format.

Denormalization is a database design technique that stores data in a redundant (repetitive) or grouped schema.
Adding redundant copies of the data within a database increases the read performance at the cost of write performance and consistency. For example, in the Song and Playlist tables, a denormalized approach stores two copies of the song's information: once in the Song table and again in the Playlist table. This denormalized schema reduces the number of JOINs needed and improves the latency of read queries. Denormalization could also mean combining data by grouping or embedding it to avoid a JOIN.

    Normalized
    • Non-redundant and consistent data
    • Uses a smaller amount of storage

    Playlist table (playlist_id, playlist_name, user_id)
        23483   "Playlist 1"    3864
        49310   "Playlist 2"    3295

    Saved Song table (song_id, song_name, playlist_id)
        8201    "Song 1"    23483
        8532    "Song 2"    49310
        8756    "Song 3"    23483

    User table (user_id, user_name)
        3295    "User 1"
        3543    "User 2"
        3864    "User 3"

    Denormalized
    • Combined and/or redundant data
    • Faster queries

    Playlist table (playlist_id, playlist_name, user_name, song_names)
        23483   "Playlist 1"    "User 3"    ["Song 1", "Song 3"]
        49310   "Playlist 2"    "User 1"    ["Song 2"]

    Saved Song table (song_id, song_name, playlist_name, user_name)
        8201    "Song 1"    "Playlist 1"    "User 3"
        8532    "Song 2"    "Playlist 2"    "User 1"
        8756    "Song 3"    "Playlist 1"    "User 3"

    User table (user_id, user_name, playlists)
        3295    "User 1"    ["Playlist 2"]
        3543    "User 2"
        3864    "User 3"    ["Playlist 1"]

To fetch a playlist, the database no longer needs to perform a JOIN to get the titles of the songs. Instead, the redundant copies of the song's title, artist, and album are already in the playlist table. This denormalized schema improves the read efficiency of all Playlist reads.

However, there are also downsides of denormalization: suppose that we need to change the name of a song. This would require writes to both the Song and Playlist tables instead of a single write.
Furthermore, these operations are not guaranteed to be executed simultaneously, meaning the names in the two tables may differ at some point in time, causing inconsistency.

The advantages of denormalization:

• Reduces the number of joins or lookups. JOINs are expensive operations that increase the query time.
• Simpler and more performant queries. Reduces relationships within the data.
• Increases overall performance of the database, especially for read-heavy workloads.

The disadvantages of denormalization:

• Inconsistency within the database due to redundant copies of the data that can be separately modified.
• Reduces performance of writes since multiple locations need to be written to.
• Requires more storage due to redundant data.

While denormalization is a technique that can be used in both relational and NoSQL databases, it is less likely to be used in relational databases that support ACID transactions. By having redundant data, there is a possibility of inconsistent values at different parts of the database, which could break the use cases for transactional data that require consistency. However, denormalization is common in NoSQL databases with eventual consistency; schemas can be designed to use redundant data to achieve higher performance.

Normalization in Relational and NoSQL Databases

Normalization versus denormalization is a separate concept from relational versus NoSQL databases. Both types of databases can be normalized or denormalized. This section used a relational database and its tables as an example of normalization versus denormalization, but the same concepts apply to a NoSQL database and its collections. However, it is common that denormalization is associated with a NoSQL database, and normalization is associated with a relational database.
This is because the consistency use cases of a relational database are aligned with normalization, and the performance use cases of a NoSQL database are aligned with denormalization.

6.7. Vertical and Horizontal Scaling

As a system serves more users, it receives more requests and has greater storage needs. There are two ways to handle the additional workload demands: vertical scaling or horizontal scaling. Vertical scaling means increasing the CPU, memory, or storage of a single machine. Horizontal scaling means increasing the number of machines and distributing the workload among those machines. Horizontal scaling allows a nearly limitless amount of scalability, as any number of machines can be added to handle larger data and workloads. In contrast, vertical scaling is limited by how much a single machine can be upgraded. Note that vertical scaling and horizontal scaling are terms that apply not only to databases but to any software system. In system design, horizontal scaling is usually favored over vertical scaling because there is no limit to the amount that the system can be scaled.

[Diagram: Vertical scaling: add more resources to a single instance. Horizontal scaling: add more instances.]

In databases, the two most common forms of horizontal scaling are replication and sharding, which are described in the next sections. In horizontal scaling, a database becomes distributed, and while this adds scalability, it also introduces more complexities. There is no longer a single view of the data, communication between nodes can be unreliable, and there might be simultaneous writes on the same key or row.

6.8. Replication

Replication is the process of copying data from a primary database to a secondary or replica database. These replicas are often read-only versions of the primary database. Replication results in a distributed database and increases the availability of data since users can access the same data set from multiple nodes.
While writes still need to be handled by the primary database, read requests can be distributed to other nodes, reducing the load on the primary database.

[Diagram: Web servers send reads and writes to the primary database and read-only requests to the replica databases. The primary database replicates its data to each replica database.]

While being able to read data from multiple nodes is useful for read-heavy workloads, replication means that each write needs to be propagated from the primary database to the replicas. This adds complexity to the consistency model since data is no longer located in a single place. Will a read from a replica return data that is up-to-date or stale? In a database with strong consistency, the database needs to verify that the responses to replica reads contain the updated data element(s). If needed, the reads will be blocked until the relevant writes are propagated to the replicas. In a database with eventual consistency, replicas respond to reads immediately but possibly return stale data. The chapter on distributed systems will further describe consistency in distributed systems.

Replication also adds fault tolerance: if one replica fails, the other replicas can still respond and handle the failed replica's requests. If the primary database fails, the replicas choose one of the existing replicas to take over as the primary database in a process called leader election. This process ensures that there is always a single node that handles the writes and propagates those changes to the replicas.

Despite the complexities that replication and distributed databases introduce, replication is one of the primary methods of scaling throughput and availability in databases, and we'll use this technique in our system designs.

6.9. Sharding

Sharding (horizontal partitioning) is a form of horizontal scaling in databases where a dataset is split into smaller datasets by rows or keys, which can then be distributed across multiple machines.
This allows increased throughput, capacity, and availability, as the workload is shared across multiple machines. Vertical partitioning refers to horizontal scaling by splitting a dataset by columns. It is a less common technique since this only applies to relational databases, and tables usually have fewer columns than rows.

In sharding, a portion of the dataset, called a shard, is stored on a node. This achieves increased storage capacity since the dataset is no longer restricted to the capacity of a single machine. Additionally, this increases the throughput of the database system, as each node can process requests. Another benefit of sharding is that it increases availability: even if a single node or a single shard becomes unavailable, the other shards are still functional, and the database can operate in a partially degraded state. A shard can be replicated and is called a replicated shard.

In the diagram below, a user table is split into three shards by using user_id as the shard key. Data for users with an id of 1 to 1000 is placed into shard 1, and so on.

    User table (user_id, user_name), shard key: user_id
        0039 "User 1"    0593 "User 2"    0724 "User 3"    0932 "User 4"
        1204 "User 5"    1340 "User 6"    1729 "User 7"    1879 "User 8"
        1920 "User 9"    2103 "User 10"   2553 "User 11"

    Shard 1: user_id 0001 - 1000    Shard 2: user_id 1001 - 2000    Shard 3: user_id 2001 - 3000
        0039 "User 1"                   1204 "User 5"                   2103 "User 10"
        0593 "User 2"                   1340 "User 6"                   2553 "User 11"
        0724 "User 3"                   1729 "User 7"
        0932 "User 4"                   1879 "User 8"
                                        1920 "User 9"

Evenly splitting the range of an attribute is a simple method to shard a database, but it may not be appropriate for all data. The shards should be created so that traffic is evenly distributed between different shards. Splitting the data based on the range of an attribute may result in unbalanced shards, where some shards can hold more keys than other shards.
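The range-based split described above can be sketched in a few lines of Python. This is an illustrative sketch only: the user ids and shard ranges are taken from the diagram, while the names SHARD_RANGES, shard_for, and keys_per_shard are hypothetical helpers, not part of any database API.

```python
from collections import Counter

# Inclusive user_id ranges for shards 1, 2, and 3, as in the diagram.
SHARD_RANGES = [(1, 1000), (1001, 2000), (2001, 3000)]

# The user ids shown in the example user table.
USER_IDS = [39, 593, 724, 932, 1204, 1340, 1729, 1879, 1920, 2103, 2553]


def shard_for(user_id, ranges=SHARD_RANGES):
    """Return the 1-based shard number whose range contains user_id."""
    for shard_number, (low, high) in enumerate(ranges, start=1):
        if low <= user_id <= high:
            return shard_number
    raise ValueError(f"user_id {user_id} is outside every shard range")


def keys_per_shard(user_ids):
    """Count how many keys land on each shard."""
    return dict(Counter(shard_for(user_id) for user_id in user_ids))


if __name__ == "__main__":
    print(keys_per_shard(USER_IDS))
```

Even with equal-width ranges, the eleven example users do not spread evenly: shard 2 holds five keys while shard 3 holds two.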
Using a monotonically increasing shard key such as a user id or a timestamp may also result in a single shard that handles more writes than the other shards. For example, using user id as the shard key means that new users will fall into the same shard, which becomes problematic if new users make more requests than existing users.

One option to mitigate uneven sharding is to use a hashed shard key. This approach uses a hash function to compute a shard key based on one or more attributes, providing a more even distribution of data among shards.

6.10. ACID vs. BASE

Another difference between a relational and a NoSQL database is the behavior of database transactions. We previously defined a database transaction as a logical unit of work performed on a database that can consist of multiple operations. The most common transaction types are read, write, and read-write. The behavior of these transactions and how they are executed is called the database transaction model.

The two most common database transaction models are ACID and BASE, acronyms for the properties that they represent. The ACID model provides a consistent system and is commonly associated with relational databases. The BASE model provides high availability and is commonly associated with NoSQL databases.

ACID and BASE in Relational and NoSQL Databases

While many relational databases follow the ACID model, not all relational databases have ACID properties. Likewise, not all NoSQL databases have BASE properties. During the interview, you can explicitly state this assumption of associating relational databases with ACID properties and NoSQL databases with BASE properties. But be aware that this association does not apply to all databases. For example, there are several popular NoSQL databases that have strong consistency and can support ACID transactions.

ACID stands for Atomicity, Consistency, Isolation, Durability:

• Atomicity is a property that means either all of a database transaction succeeds or
none of it does. If part of a transaction fails, then the whole transaction fails. In other words, all parts of a transaction behave as if they were a single, indivisible operation.
  For example, suppose you want to transfer money from you to your friend. The database transaction that corresponds to this action would consist of two parts: a write operation to debit your account and another write operation to credit your friend's account. Atomicity ensures that both happen or neither does; it won't be stuck in a state where only the first operation happens.

• Consistency (Correctness) is a property that means the database is in a valid state at the beginning and end of a transaction.
  For example, the database that handles the money transfer from you to your friend may define that the data is in a valid state if the sum total of both accounts before and after the transaction is the same. In other words, no transaction can create an invalid data state, ensuring data integrity and preventing corruption. Note that "consistency" is an overloaded term, and this "ACID consistency" is a different concept from consistency in a distributed system, which is called "CAP consistency."

• Isolation means that a database transaction cannot be affected by any other transaction. In other words, the intermediate state of a transaction is not visible: a read request can't read the result of a write request that has not been completed yet.
  For example, during the money transfer from you to your friend, the transferred money can only be seen in one account or the other, but not both. There is no intermediate state where the transferred money is in both accounts or neither account.

• Durability means that once a database transaction completes, the changes from the transaction are stored permanently, even if there is a database failure immediately afterward.
  For example, the database transaction of the money transfer persists and is not reversed even if there is a database failure.
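
The money transfer example can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration under stated assumptions: the `accounts` table, the `transfer` and `balances` helpers, and the account names are hypothetical and exist only for this example. The `CHECK (balance >= 0)` constraint stands in for a validity rule, and the `with conn:` block commits when both writes succeed and rolls back when either one fails.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts (name, balance) VALUES (?, ?)",
    [("you", 100), ("friend", 50)],
)
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Debit sender and credit receiver as one transaction; True if committed."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
            cur = conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            if cur.rowcount != 1:
                # The debit already ran; raising here undoes it via rollback.
                raise LookupError(f"unknown account: {receiver}")
        return True
    except (sqlite3.IntegrityError, LookupError):
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok = transfer(conn, "you", "friend", 30)         # both writes succeed
missing = transfer(conn, "you", "nobody", 30)    # credit fails, debit is undone
overdraft = transfer(conn, "you", "friend", 500) # debit violates the CHECK rule
final = balances(conn)
print(ok, missing, overdraft, final)
```

After the two failed transfers the balances are unchanged from the first successful one, and the sum of both accounts is the same as at the start, which matches the validity rule used in the consistency example.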
ACID databases are commonly used for transaction-oriented data, where a failure midway through a transaction would result in catastrophic consequences. Many organizations require ACID-compliant databases for transaction-driven workloads such as financial operations, payments, and e-commerce.

BASE stands for Basically Available, Soft state, Eventual consistency:

• Basically Available means that the database has high availability. The database appears to be in a good state most of the time and accepts requests.

• Soft state means that the state of the data can change. Since data does not need immediate consistency in this transaction model, replicated data at different nodes could have different values and change without further transactions.

• Eventual consistency means that if there are no new updates to a data item, eventually, all reads of that item will return the last updated value. Updates and writes are propagated throughout the nodes of a system, but those changes will not be immediately reflected.

A BASE database gives up immediate consistency in return for availability. Databases that have more BASE characteristics usually scale to higher throughput than those that have ACID characteristics. The eventual consistency property of a BASE database allows lower latency and better performance but at the cost of stale data.

For this reason, BASE databases are commonly used for performance-driven applications and use cases that don't require consistency. For example, it is not important to users of a social networking site if their posts are inconsistent (e.g., they don't have the latest tweets) for a short period of time. Instead, users want to be able to read and make posts most of the time (i.e., have high availability).

Modern databases don't necessarily fall into one category or the other but may have some ACID and some BASE properties.
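
The soft state and eventual consistency properties described for BASE can be illustrated with a small in-memory simulation. This is a toy sketch, not how any particular database implements replication: the `Primary` and `Replica` classes and the explicit `propagate` step are assumptions made for this example, standing in for asynchronous replication over a network.

```python
from collections import deque


class Replica:
    """A read-only copy that answers immediately with whatever it has."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)


class Primary:
    """Accepts writes at once and propagates them to replicas later."""

    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas
        self.pending = deque()  # writes not yet applied to the replicas

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))

    def propagate(self):
        """Apply every pending write to every replica, in write order."""
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas:
                replica.data[key] = value


replica = Replica()
primary = Primary([replica])

primary.write("post", "v1")
primary.propagate()

primary.write("post", "v2")
stale = replica.read("post")   # replica has not seen the second write yet
primary.propagate()
fresh = replica.read("post")   # no new updates, so the replica has converged
print(stale, fresh)  # v1 v2
```

The replica answers the first read without waiting, so it returns the older value; once the pending write is propagated and no new updates arrive, reads return the last written value.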
