CS 217-0 Final PDF

Lecture 9: 10/29/24 Relational Database Design Principles - Relational database: A structured set of data held in tables related to one another - Key features: Tables, rows columns, relationships, primary keys, and foreign keys Primary Keys - Primary keys act as a unique identifier for records in a database table, ensuring each entry is distinct - Each primary key is unique, non-null, immutable (non-changing), and minimal - Every table has a primary key - Rather than having a single column has a primary key, tables can have composite primary keys, made up of two columns Foreign Keys - A column in one table that uniquely identifies a row in another table, acting as a reference to a primary key in another table - Creates relationships between tables - Tables are linked by their various foreign keys between tables - Not every table needs to have a foreign key Relationships - The association, or relationship between table can have three types of relationships: One to many, many to many, or one to one One to Many (Parent / Child Tables) Relationships - One row in the first table can be associated with many rows in the second table, but a single row in the second table can be related to only one row in the first table - This is achieved by adding the primary key of the restricted “one” table into the table with “many” as a foreign key - EX: CustomerID in relation to each unique order by customers - Each CustomerID will be associated with various orders, but each order will be related to only one CustomerID - This one to many relationship can be viewed as a parent-child relationship - Parent Table: The table on the “one” side - Child Table: The table on the “many” side - Tables can be both a parent and child when referencing multiple tables Many to Many Relationships - One row in the first table can be associated with many rows in the second table, and a single row in the second table can be related to many rows in the first table - Established by adding a linking table, which holds the primary keys of the tables that form this relationship One to One Relationships - One row in the first table is related to one row in the second table, and a single row in the second table is related to one row in the first table - This can be done when the primary key of the first table is added into the second table as both the primary key and the foreign key Notation of Relationships in Schema Diagram - “1-∞” - 1 represents the “one side” - ∞ represents the “many side” - Arrow notation - Arrow starts from the many side, and then points towards the one side - Crow’s foot notation - One side: Represented by two vertical lines: || - Many side: Represented by the Crow’s foot: ⌱ Building a Relational Database (Developing Conceptual Schema for a Database) - Identify the tables needed within the database - Each table’s name, columns, data types - Which columns will be primary keys or foreign keys - Determining the relationships between tables: one to many, many to many, one to one - Tools like Lucidchart are commonly used to do this schema design Lecture 9: 10/31/24 Design Decisions for Creating a Database - Must avoid incomplete representation, and unnecessary redundancy - Atomicity: Store information in its smallest logical parts - Ex: Split up first name and last name into separate columns - Entity integrity: Each table should have a primary key column that should be unique and not null - Referential integrity: The reference from a row in one table to another table must be valid Representing Relationships - One-to-many relationships: Create one table with a primary key (one table) that links to a foreign key in another table (many table) - Ex: Going from a table of artists to a table of albums - Multiple albums can go to a single artist - Many-to-many relationships: Use a linking table, with each of the many tables giving their primary key to the linking table as a foreign key - Ex: Going from a table of songs to a table of playlists - Can have multiple of either table entries in the other - One-to-one relationships: Define the same primary key in both tables - Both the tables employees and compensation use the same primary key, EmpID Crow’s Foot Notation - Cardinality: The number of times an instance in one entity can be associated with instances in the related entity - Basic symbols to represent cardinality: - The ring (o) represents “zero” - The vertical bar ( | ) represents “one” - The crow’s foot represents “many” or infinite - These symbols are shown on both sides of the relationships between tables - Usually we show both the minimum and maximum cardinality on each side of the entry Creating Databases in SQLiteStudio Creating a Table with Attributes - CREATE TABLE: Creates a table with a specified number of rows each with their own data type - Ex: Create TABLE TableName (column1 dataType NOT NULL PRIMARY KEY, column2 dataType) - Can take in additional constraints on each column - Primary-Key constraint onto a column - Foreign-Key constraint - NOT NULL constraint (or NULL constraint to allow NULL values) - Unique constraint - Default constraint - Check constraint Defining Primary Keys - CREATE TABLE tableName(column1 datatype NOT NULL PRIMARY KEY) - CREATE TABLE tableName(column1 datatype NOT NULL, … PRIMARY KEY(column1)) - Can also list multiple columns in the PRIMARY KEY() function to define a composite primary key (important for all columns to be NOT NULL) - Defines column1 as the primary key - Important for this column to be NOT NULL as it is the primary key Defining Foreign Keys - CREATE TABLE tableName(column1 datatype REFERENCES parentTable(pkColumn)) - CREATE TABLE tableName(column1 datatype, … FOREIGN KEY (column1) REFERENCES parentTable(pkColumn)) Dropping tables - DROP TABLE table1 - DROP TABLE IF EXISTS table1 Populating Tables with New Data - Populate tables with new data in the form of rows with the INSERT statement - EX: INSERT INTO table1 VALUES (column1Val, column2Val, column3Val,...) Lecture 13: 11/5/24 Modifying Databases - UPDATE: Used to change values that already exist within a table - Syntax: UPDATE TableName SET ColumnName = Value WHERE Condition - Can change one or more columns in a single UPDATE call - UPDATE TableName SET Column1 = Value1, Column2 = Value2 … - Can use subqueries as well in order to set a column equal to a value - Math functions also work as well within the set call - DELETE: Used to delete rows within a table - Syntax: DELETE FROM TableName WHERE Condition - Rows that are referenced by a foreign key cannot be deleted! - It’s not allowed to delete data from a parent table if a child data refers to it - When creating a table, the calls, ON DELETE CASCADE, or ON DELETE SET NULL can allow for certain foreign keys to be deleted - ON DELETE CASCADE will delete the child table rows of the rows of the deleted foreign key (entire row) - ON DELETE SET NULL will set the value as NULL within child table rows of the deleted foreign key (only value in the column) - EX: CREATE tableName( PKey INT PRIMARY KEY, FKey INT REFERENCES Table2(FKey) ON DELETE CASCADES)) Data Types - When choosing a data type for a column, make sure it is the efficient, simple, and accurately reflects the nature of the data Integers - Come in sizes of TINYINT, SMALLINT, MEDIUMINT, INT, and BIGINT - Each integer type has a different range of values that can be stored Decimals / Floats - Uses floating-point representation to estimate numbers with decimals - FLOAT: Stores 32 bits, or around 7 decimal digits of precision - Refrain from using FLOAT, because it only acts as an approximate representation - DOUBLE: Stores 64 bits, or around 15-16 decimal digits of precision Text - ASCII uses 1 byte to represent a character, with these characters being encoded by the ASCII table - UTF-8 uses variable length character encoding for encoding all code points in Unicode, with 1-4 bytes being used to represent a character - UTF-8 is the most common text encoding, allowing for variable-length strings - Various text data types in SQL (and their max length) - CHARACTER(20) - VARCHAR(50) - NCHAR(55) - NVARCHAR(100) - TEXT (Most common and most flexible without a limit) Datetime - Date format: YYYY-MM-DD with time HH:MM:SS - When sorting this data type, sorting by ascending or descending gives chronological order - Can take in either the data and time, or just one of the date or the time Lecture 14: Relational Database Design Examples - Create a detailed schema based on the given info - Create a blank database file in SQLite Studio - Use code to create tables that are consistent with the schema - CREATE TABLE TableName(column1 data type, column2 data type, …) - Populate the database with the data - Doing this manually using the INSERT function is tedious, instead we can use the statement, INSERT INTO SELECT to copy data from one table and insert into another - This can work with raw data tables like a CSV, - EX: INSERT INTO targetTable (column1, column2) SELECT column1, column2 FROM rawTable Lecture 15: SQL Transactions - Action queries can be used to modify a database - EX: INSERT, UPDATE, DELETE - INSERT syntax: INSERT INTO TableName(column1, column2, …) VALUES (value1, value2, …) - UPDATE syntax: UPDATE TableName SET column1 = newvalue1, column2 = newvalue2, WHERE condition - By not using a WHERE clause with UPDATE, all rows in the table will be updated - DELETE syntax: DELETE FROM TableName WHERE condition - Action queries have constraints, such as primary keys must be unique, and rows that reference a foreign key cannot be deleted Transactional Control Language - Important for databases to be structured in a way to protect against data loss and failures - Can cause many problems if queries are run individually, and something goes wrong when only one query has been executed - Transactions make a series of operations act as one - Either all operations succeed, or all operations fail - Executing a series of statements at a single time (batch execution) - Transactions allow for data integrity, error recovery, and consistency -Ensures data remains accurate and reliable by treating related operations as one unit -Allows a safe way to cancel operations if an error occurs, preserving the initial database state - Guarantees that changes to data only occur if all parts of the transaction succeed - A DBMS must implement transactions that adhere to ACID properties - A: Atomicity - transactions either fully occur, or do not - C: Consistency - Only valid data is saved - I: Isolation - Transactions do not affect each other - D: Durability - Written data will not be left Transaction Commands - BEGIN TRANSACTION: Begins a new transaction - COMMIT: Saves all operations within a transaction to the database - ROLLBACK: Cancels all operations within a transaction, restoring the previous state - Example: BEGIN TRANSACTION; SELECT available_sets FROM classes WHERE class_id = “CS101”; INSERT INTO student_classes (student_id, class_id) VALUES(‘S123’, ‘CS101’); UPDATE classes SET available_set = available_sets - 1 WHERE class_id = ‘CS101’; COMMIT; - Rollbacks can be added into a transaction to handle errors - Example: BEGIN TRANSACTION; SELECT available_sets FROM classes WHERE class_id = “CS101”; IF available_sets = 0 THEN ROLLBACK; ELSE INSERT INTO student_classes (student_id, class_id) VALUES(‘S123’, ‘CS101’); UPDATE classes SET available_set = available_sets - 1 WHERE class_id = ‘CS101’; COMMIT; - Rollbacks can also be initiated due to a user changing their mind, when an app finds a problem within the database, or a system-initiated abort - Using transactions, DBMS can be accessed by multiple users at the same time - Isolation properties in SQL transactions ensures that each transaction from a user does not chaotically overlap with one another Types of Isolation Levels in SQL Transactions - Read uncommitted: Transactions can see the uncommitted changes from other transactions - Transaction B will be able to read from transaction A even if transaction A has not been committed - Read committed: Transactions can only see the committed changes from other transactions - Transaction A will only be able to read from the data from transaction A if transaction A has been committed - Repeatable read: Prevents others from modifying rows until the transaction is complete - Serializable: Highest isolation, transactions are fully isolated, preventing any conflicts - Can use the SET TRANSACTION ISOLATION LEVEL command: - EX: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE; Best Practices for using Transactions - Keep transactions short, minimizing locking and improving performance - Use transactions when data integrity is important - Handle errors with care, ensuring fallbacks are present using ROLLBACK for unexpected issues - Choose isolation levels wisely, with higher isolation impacting performance Lecture 16: Distributed Storage and Databases - Single node database: An entire database is operated and stored only on one computer - What we have been working with so far - Distributed database: Data is distributed across multiple computers that make up a single database - Ex: NoSQL, Graphlab, Spark - Distributed databases have scalability, performance, and resilience benefits - Scalability: Data could be large, can fit more data into a distributed database than data into a single computer - Performance: Computing all of the data takes a very long time - Resilience: We don’t want our whole system to go down if our database hardware fails Properties of Distributed Databases - Fragmentation: Relations and data is partitioned into several fragments stored in distinct places - Replication: System maintains multiple copies of data, stored in different sites, for faster retrieval and fault tolerance Cloud Based Storage - The cloud can store distributed databases that users can access from their own computer - EX: Google Cloud, Dropbox, Amazon Cloud, etc. - Slower to access than data stored locally, but it allows for larger capacity to store data, and is typically more reliable Storage Fragmentation - Sharding: Data is partitioned into different groups, with each group being stored onto different machines - Arrays: The segments or partitions that data is placed into - Allows for parallel programming, thus faster speeds and better efficiency - Disadvantages of fragmentation - Some nodes or data may not be available - If there is a network failure, some data will not be accessible - If arrays are created without redundancy, then they are too unreliable to be useful - Due to possibility for network failure causing certain arrays to become completely inaccessible Storage Replication - Advantages of replication - Availability: Failure of site containing relations does not result in unavailability of data (fault tolerant) - Parallelism: Queries on a distributed database may be processed by several nodes at the same time, thus being quicker and more efficient - Reduced data transfer: Relations are available locally at each site containing a replica of the relation - Disadvantages of replication - Increased cost of updates: Each replica of a relation must be updated - Increased complexity of concurrency control: Concurrent updates to distinct replicas may lead to inconsistent data, unless certain control mechanisms are in place Redundant Array of Independent Disks (RAID) - A system to support data redundancy and performance improvement - This splits arrays into disks - Replicates the sharding of data, so as to overcome the shortcomings of sharding (not fault tolerant) - Basically, we are keeping copies of the original data within RAID’s - This requires greater capacity to store data within a distributed database to store both the RAID’s, and the original data RAID Level 0 - No backup is implemented, thus the capacity of the database depends on how many disk drives there are, and how much each stores - Disadvantage: If one data drive fails, the data will be lost RAID Level 1 - Mirrors the original data, thus keeping one copy of the original data - This requires half of the data drive disks to be used to store the original database - Thus the database can only store half of its intended capacity RAID Level 5 - Begin to use parity disks, which distributes the data from one disk into that of another disk - If we have N drives or arrays for storage, then there are N-1 data disk drives, while there is 1 parity drive - Disks drives are the ones use to actually store data, which gives the capacity of the distributed database - The parity drive stores the backup of the original data - The parity drive is distributed across all the data drives - This balances a distributed database Limitations of Distributed Storage - Dilemma is inherent between trying to achieve consistency, availability, and partitioning - Consistency: All clients see the same view of data, even when updating and deleting - Availability: All clients can find a replica of data, even with partial node failures - Partitioning: The system continues to work as expected even if partial network failure - Distributed databases can only achieve two of these three at a time (CAP Theorem) Lecture 18: 11/21/24 - MongoDB, databases hold one or more collection of documents - Collections are where documents are stored - Documents are the basic unit of data in MongoDB Using Pymongo (see the setup document on Canvas) - Create a client object - Calling a database and a collection - Ex: col = client[‘sample_airbnb’][‘listingsAndReviews’] - Samplie_airbnb is the collection, with the document, listingsAndReviews - Use find_one() to return one document from the collection - Ex: col.find_one() - Use find() to query multiple documents at a time - Ex: for i in col.find(): print(i) - With the client object created, you can access different collections and documents with filters - Querying for multiple documents with filters: - For i in col.find({“attribute”: “matching_attribute”)}: Print(i) - Can instead of printing all values, you can specify the exact attributes that you want to achieve - Print(i[“attribute”], i[“attribute2”], …) - Important filter keywords (other comparison useful operators) - “$gte” used as greater than within dictionary - Ex: col.find({“beds”: {“$gte”: 10}}) - Using two conditions: col.find({“beds”: - “$and” specifies two conditions that need to be met in a query - “$or” specifies that the query only has to follow one of the conditions - Ex: col.find({‘$and’:{‘beds’: [{‘beds’:{‘$gte’:9}, {‘beds’:{‘$lte’:12}]}})

CS 217-0 Final PDF

Document Details

Tags

Related

Summary

Full Transcript