Data Management Foundations PDF
Document Details
Uploaded by Deleted User
Tags
Summary
This document provides a foundation in data management, especially relational databases. It covers key topics such as database systems, roles, and performance. Key concepts like queries and database design are also touched upon.
Full Transcript
Data Management - Foundations If you see “(ON THE EXAM)” that means it will definitely be on the exam ” ” Are key topics you need to know! Data - nemeric, textual or audio information that describes real-world systems. Data is collected and processed Database - a collection of data in a st...
Data Management - Foundations If you see “(ON THE EXAM)” that means it will definitely be on the exam ” ” Are key topics you need to know! Data - nemeric, textual or audio information that describes real-world systems. Data is collected and processed Database - a collection of data in a structured format Database system - database management system or DBMS, software that reads and writes data in a database Query - to retrieve or request data from a database. Query language is a specific programming language designed for database systems. Database application - software that helps business users interact with database systems Database Roles database administrator - responsible for securing the database system against unauthorized users. enforces procedures for user access and database system availability (ON THE EXAM) database designer - determines the format of each data element and the overall database structure database programmer - develops compouter programs that utilize a database. write applications that combine database query languages and programming languages database user - consumer of data in a database Database Systems performance - fast query response time even when many users access the database authorization - users should have limited access. only access data that the user needs security - ensure authorized access, encrypt data and restrict access rules - data is consistent with structural and business rules recovery - must recover from failures and restore the database to a consistent state without loss of data transaction - group of queries that must be either completed or rejected as a whole. a transaction must process either both queries or neither. ex: query 1 - withdraws 100$ from one account query 2 - sends a 100$ deposit to another account what if query 1 goes thru but query 2 fails, a 100$ is lost ensure transactions are processed completely or not at all prevent conflicts between concurrent transactions - when multiple transactions access the same data at the same time, a conflict may occur ensure transaction results are never lost architecture - describes the internal relationship between components query processor - interprets queries, creates a plan to retrieve data and return results to application, performs query optimization to ensure the most efficient path (ON THE EXAM) storage manager - translates query processor instructions into low-level file-system commands, uses indexes to locate data transaction manager - ensures transactions are properly executed, prevents conflicts between concurrent transactions (ON THE EXAM) log - file containing a complete record of all inserts, updates, deletes, etc catalog - data dictionary, directory table, columns, indexes, a dictionary showing where all data is located metadata - data about the database, such as column names and the number of rows in each table Most leading database systems are relational relational database - stores data in tables, columns, rows, similar to a spreadsheet, relational database systems support the SQL query language. relational systems are ideal for databases that require an accurate record of every transaction, suc has banking airline reservations systems and student records SQL - stands for Structured Query Language and includes statements that read and write data, creata and delete tables and administer the database systems online data = big data - collection of data that is too large, complex, or fast to be processed by traditional data management systems newer non-relational systems are called NoSQL and are only optimized for big data open source - is software tha anyon can inspect, copy, and modify with no licensing fee | Product | Sponsor | Type | License | DB-Engines rank (popularity) | | --- | --- | --- | --- | --- | | Oracle Database | Oracle | Relational | Commercial | 1 | | MySQL | Oracle | Relational | Open source | 2 | | SQL Server | Microsoft | Relational | Commercial | 3 | | PostgreSQL | PostgreSQL Group | Relational | Open source | 4 | | MongoDB | MongoDB (ON THE EXAM) | NoSQL | Open source | 5 | SQL (ON THE EXAM) INSERT - inserts rows into a table SELECT - retrieves data from a table UPDATE - modifies data in a table DELETE - deletes rows from a table CREATE TABLE - creates a new table by specifying the table and column names. each column is assigned a data type that indicates the format of column values. can be numeric, textual, complex INT - integer values DECIMAL - fractional numeric values VARCHAR - textual values VARCHAR(N) - values with 0 to N characters DATE - years, month and day Design type Analysis - specifies database requirements without regard to a specific database system. requirements are presented as entities, relationships and attributes. an entity is a person, place, activity or thing. A relationship is a link between entities, and an attribute is a descriptive property of an entity. also called CONCEPTUAL DESIGN Logical design - implements database requirements in a specific database system. for relational database systems, logical design converts entities, relationships, and attributes into tables, keys and columns Physical design - adds indexes and specifies how tables are organized on storage media. physical design affects query processing speed but never affects the query result. the principle that physical design never affects query results is called data independence Rule 7 Data Indepedence - allows database administrator to improve query performance by changing the organization of data on storage devices, without affecting query results analysis - who are the people, what are their relationships, attributes logical design - put that entity and relationship data into columns and rows physical design - adds indexes, how to organize data MySQL command line client - text interface included in the MySQL server download. MySQL server returns a error code and description when a SQL statement is syntactically incorrect or the database cannot execute the statement tuple - ordered collection of elements enclosed in parentheses ex: (a, b, c) and (c, b, a) Data structure data structure organizes data in tables table - name, fixed tuple of columns, and a varying set of rows column - name, data type row - unnamed tuple of values, each value corresponds to a column and belongs to the column’s data type data type - named set of values, from which column values are drawn synonyms: Table, File, Relation Row, Record, Tuple Column, Field, Attribute Business rules - based on business policy and specific to a particular database Type Description Examples Explicit values that are string numeric or binary. Strings must be Literals (ON surronded by single quotes or double quotes. Binary values are `'String' THE EXAM) represented with x’0’ where the 0 is any hex values "String" 123 x'0fa2'` SELECT, Keywords Words with special meaning FROM, WHERE Type Description Examples City, Name, Identifiers Objects from the database like tables, columns, etc Population Statements intended only for humans and ignored by the `-- single line Comments database comment SQL sublanguages DDL Data Definition Language - defines the structure of the database DQL Data Query Language - retrieves data from the database DML Data Manipulation Language - manipulates data stored in a database DCL Data Control Language - controls database user access DTL Data Transaction Language - manages database transactions all data in a relational database is structured in tables table - name, fixed tuple of columns, and a varying set of rows column - name, data type row - unnamed tuple of values, each value corresponds to a column and belongs to the column’s data type cell - single column of a single row DROP TABLE - deletes a table, along with all the table’s rows, from a database ALTER TABLE - statement adds, deletes, or modifies columns on a exisiting table Integers Arithmetic + Adds two numeric values 4+3 7 Reverses the sign of one numeric - (unary) -(-2) 2 value Subtracts one numeric value from - (binary) 11 - 5 6 another * Multiplies two numeric values 3*5 15 Divides one numeric value by / 4/2 2 another Divides one numeric value by % (modulo) another and returns the integer 5%2 1 remainder Arithmetic + Adds two numeric values 4+3 7 ^ (ON THE Raises one numeric value to the 5^2 25 EXAM) power of another Comparison = Compares two values for equality 1=2 FALSE != (ON THE Compares two values for inequality 1 != 2 TRUE EXAM) < Compares two values with < 2 FALSE '2021-08-13' Category Example Data type Storage Notes 34 and - Integer TINYINT 1 byte Signed range: -128 to 127 739448 Unsigned range: 0 to 255 34 and - Integer SMALLINT 2 bytes Signed range: -32,768 to 32,767 739448 Unsigned range: 0 to 65,535 34 and - Signed range: -8,388,608 to Integer MEDIUMINT 3 bytes 739448 8,388,607 Unsigned range: 0 to 16,777,215 34 and - INTEGER or Signed range: -2,147,483,648 to Integer 4 bytes 739448 INT 2,147,483,647 Unsigned range: 0 to 4,294,967,295 34 and - 8 Integer BIGINT Signed range: -2^63 to 2^63 -1 739448 bytes Unsigned range: 0 to 2^64 -1 UPDATE statement modifies existing rows in a table. The UPDATE statement uses the SET clause to specify to the new column values. An optional WHERE clause specifies which rows are updated. Omitting(leaving out) the WHERE clause results in all rows being updated KNOW THE DIFFERENCE BETWEEN TRUNCATE AND DELETE (ON THE EXAM) DELETE statement The DELETE statement deletes existing rows in a table. The FROM keyword is followed by the table name whose rows are to be deleted. An optional WHERE clause specifies which rows should be deleted. Omitting the WHERE clause results in all rows in the table being deleted TRUNCATE - deletes all rows from table, nearly identical to DELETE, but with no WHERE clause expect minor differences MERGE - selects data from one table, inserts data to another table, called target Primary Key - a column, or group of columns, used to identify a row. the primary key is usually the table’s first column. unique identifiers for rows in a database Simple Primary Key - consists of a single column Composite Primary Key - consists of multiple columns Auto-Increment Column - numeric column that is assigned an automatically incrementing value when a new row is inserted (ON THE EXAM) Database users make the following errors when inserting primary keys: IT WILL ASK WHAT ARE SOME COMMON MISTAKES PEOPLE MAKE (ON THE EXAM) 1. Inserting values for auto-incrementing primary keys (Overridding auto-increment for a primary key is usually a mistake, tho MySQL does allow it) 2. Omitting (leaving out) values for primary keys that are not auto-increment columns Foreign Key - column or group of columns that refer to a primary key. Data type for foreign and primary key are the same. (REFERS TO PRIMARY) Foreign keys may be repeated and may be NULL. Referential integrity, matches a referenced primary key Foreign Key Constraint - database rejects inserts, updates, and deletes statement that violate referential integrity Contraint - rule that governs allowable values in a database. Constraints are based on relational and business rules Databases automatically correct referential integrity with these 4 actions: (ON THE EXAM) RESTRICT - rejects modifications SET NULL - sets invalid foreign keys to NULL SET DEFAULT - sets invalid foreign keys to default value CASCADE - propagates (spreads) primary key changes to foreign keys (ON THE EXAM) Contraints are added and dropped with ALTER TABLE followed by ADD DROP and CHANGE clause BETWEEN operator - provides alternative way to determine if a value is between two other values LIKE operator - when used in a WHERE clause, matches text against a pattern using the two wildcard characters % and _ LIKE % example: LIKE ‘L%t’ matches “Lt” “Lot” “Lift” “Loot” LIKE _ matches exactly one character example: LIKE ‘L_t’ matches “Lit” “Lot” but not “Lt” and “Loot” OR - TRUE if any of the conditions separated by OR is TRUE IN - TRUE if the operand is equal to one of a list of expressions. compare against a list of values KNOW THE DIFFERENCE OF ALL THESE OPERATORS, THIS IS VERY IMPORTANT ORDER BY - clause orders selected rows by one or more clumns in a ascending order DESC - with ORDER BY rows in DESCending order Function Description Example ABS(n) Returns the absolute value of n SELECT ABS(-5); returns 5 | | LOWER(s) | Returns the lowercase s | SELECT LOWER('MySQL'); returns 'mysql' | | TRIM(s) | Returns the string s without leading and trailing spaces | SELECT TRIM(' test '); returns 'test' | | LTRIM(s) (ON THE EXAM) | Only trims the leading spaces, NOT trailing. | SELECT TRIM(' test '); returns 'test ’ | | HOUR(t) MINUTE(t) SECOND(t) | Returns the hour, minute, or second from time t | SELECT HOUR('22:11:45'); returns 22 SELECT MINUTE('22:11:45'); returns 11 SELECT SECOND('22:11:45'); returns 45 | Aggregate functions (the ones to do with numbers) (ON THE EXAM) COUNT() - counts the number of rows in the set MIN() - finds the minimum value in the set MAX() - finds maximum value in the set SUM() - sums all the values in the set AVG() - computes the arithmetic mean of all the values in the set These functions appear in SELECT clauses and prices all rows that satisfy the WHERE clause condition. If a SELECT statement has no WHERE clause, the aggregate function processes all rows HAVING clause - used with GROUP BY clause to filter to group results KNOW THE DIFFERENCE BETWEEN THE JOIN CLAUSES (ON THE EXAM) JOIN clause - combines two tables known as the left and right table, into a single result. Compares columns from the left and right tables, usually with the = operator a column name can be replaced with an alias. The alias follows the column name, seperated by an optional AS keyword JOIN clause determines how a join query handles unmatched rows INNER JOIN - selects only matching left and right table rows FULL JOIN - selects all left and right table rows, regardless of match (unmatched rows appear with NULL values) LEFT JOIN - selects all left table rows, but only matching right table rows RIGHT JOIN - selects all right table rows, but only matching left table rows OUTER JOIN - JOIN that selects unmatched rows, including left, right, and full joins UNION keyword - combines the two results into one table EQUIJOIN - compares columns of two table with the = operator, most joins are EQUIJOINS. NON- EQUIJOIN compares columns with an operator other than =, such as < and > SELF-JOIN - joins a table to itself CROSS JOIN combines two tables without comparing columns. a cross-join uses a CROSS JOIN clause without a ON clause. as a result, all possible combinations of rows from both tables appear in the result subquery - nested query or inner query, is a query within another SQL query alias - temporary name assigned to a column or table. The AS keyword follows a column or table name to create an alias (ON THE EXAM) Materialized view - is a view for which data is stored at all times. whenever a base table changes, the corresponding view tables can also change, so materialized views must be refreshed. view data can be stored WITH CHECK OPTION - database rejects inserts and updates that do not satisfy the view query WHERE clause (ON THE EXAM) entity-relationship model - high level representation of data requiremetns, ignoring implementation details entity - person place product relationship - statement about two entities attribute - descriptive property of an entity reflexive relationship - relates an entity to itself (ON THE EXAM) entity-relationship diagram - called ER diagram, a schematic picture of entities, relationships, and attributes. entities are drawn as rectangles Types and instances entity type - set of things ex: All employees in a company relationship type - set of related things ex: Employee-Manages-Department is a set of employee, departments pairs, where the employee manages the department attribute type - set of values ex: All employees salaries entity instance - individual thing relationship instance - statement about entity instances attribute instance - individual value Analysis - develops an entity, relationship model, capturing data requirements while ignoring implementation details Logical design - converts the entity-relationship model into tables, columns, and keys for a paticular database system Physical design - adds indexes and specifies how tablse are organized on storage media KNOW THESE STEPS IN ORDER ASKED SPECIFICALLY ABOUT STEP 2 IN THE LOGICAL PROCESS: Implement relationships Analysis steps: Step Name 1 Discover entities, relationships, and attributes 2 Determine cardinality 3 Distinguish strong and weak entities 4 Create supertype and subtype entities Cardinality - refers to maxima and minima of relationships and attributes (ON THE EXAM) Logical design steps Step Name 1 Implement entities 2 Implement relationships 3 Implement attributes 4 Apply normal form (consistent format) Relationship maximum - greatest number of instances of one entity that can relate to single instance of another entity Relationship minimum - least number of instances of one entity that can relate to single instance of another entity ER diagrams -attribute maximum and minmum follows the attribute names. minimum appears in parentheses Subtype entity - subset of another entity type, called superentity Superentity - is just the entity that the subtype applies to Identifying relationship is called IsA relationship Partitions - of a supertype entity is a group of mutually exclusive subtype entities Crow’s Foot Notation - depicts carinality as a circle (zero), a short line (one), or three short lines (many). The three short lines look like a bird’s foot, hence the name intangible entity - documented in the data model, but not tracked with data in database PRIMARY KEYS Stable - Primary key values should not change. When a primary key value changes, statements that specify the old value must also change Simple - ****Primary key values should be easy to type and store. Small values are easy to specify in an SQL WHERE clause and speed up query processing. A 2-byte integer is easier to type and faster to process than a 15-byte character string Meaningless - Primary keys should not contain descriptive information. Descriptive information occasionally, changes, sp primary keys containing descriptive information are unstable Artificial Key - single column key primary key created by the database designer when no suitable single-column or composite primary key exists. Usually artificial key values are integers, generated automatically by the database as new rows are inserted to the table. Artificial keys are stable, simple and meaningless Dependence of one column on another is called functional dependence Redundancy is the repetition of related values in a table Normal forms are rules for designing tables with less redundancy Candidate key - simple or composite column that is unique and minimal. Minimal means all columns are necessary for uniqueness. unique (ON THE EXAM) Non-key column - column that is not contained in a candidate key Third normal form - whenever a non-key column A depends on column B, then B is unique Boyce-Codd normal form - whenever column A depends on column B, then B is unique. Columns A and B may be simple or composite. This definition is identical to the definition of third normal form with the term ‘non-key’ removed. GOLD STANDARD Boyce-Codd normal form is ideal for tables with frequent inserts, updates, and deletes Trivial dependencies When the column of A are a subset of the columns of B, A always depends on B ex FareClass depends on FlightCode. These dependencies are called trivial This redundancy is eliminated with normization, the last step of logical design Normalization emilimates redundancy by decomposing a table into two or more tables in higher normal form Denormilzation means intentially introducing redundancy by merging tables Heap table - no order is imposed on rows optimize insert operations. Heap tables are particulary fast for bulk load of many rows, since rows are sotred in load order. Sorted table - database designer identifies a sort column that determienes physical row order Hash table - rows are assigned to buckers. A bucket is a block or group of blocks containing rows. The modulo funciton is a simple hash function with four steps: 1. Convert the hash key by interpreting the key’s bits as an integer value 2. Divide the integer by the number of buckets 3. Interpret the division reaminder as the bucker number 4. Convert the bucket number to the physica address of the block containing the row To execute a SELECT query, the database can perform a table scan or an index scan: Table scan - database operation that reads table blocks directly, without accessing an index Index scan - database operation that reads index blocks sequentially, in order to locate the needed table blocks KNOW DIFFERENCE BETWEEN TABLE AND INDEX SCAN. BOTH ARE ON THE EXAM Hit ratio - also called Filter Factor or Selectivity, is the percentage of table rows selected by a query. When a SELECT query is executed In a binary search, the database repeatedly splits the index in two until it finds the entry containing the search value Indexes may also be dense or sparse Dense index contains an entry for every table row Sparse index contains an entry for every table block Multi-level indexes A multi-level index stores column values and row pointers in a hierarchy. The bottom level of the hierarchy is a sorted single-level index. The bottom level is sparse for primary indexes, or dense for secondary indexes. In a hash index, index entries are assigned to buckets. A bucket is a block or group of blocks containing index entries. Initially, each bucket has one block. As an index grows, some buckets eventually fill up, and additional blocks are allocated and linked to the initial block. Hash index Bitmap index Logical index Function index Tablespace - database objects thatt maps one or more tables to a single file CREATE TABLESPACE Logical design specifies tables, columns , and keys. Physical design specifies indexes table structures and partitions. Physical design affects query performance but never affects query results. A storage engine or storage manager translates instructions generated by a query processor into a low-level commands taht access data on storage media. Statement Description Syntax CREATE INDEX Create an index CREATE INDEX IndexName Statement Description Syntax ON TableName (Column1, Column2,..., ColumnN); Pre-Assessment Notes World database is included with MySQL installation MySQL format DATE YYYY-MM-DD TIME hh:mm:ss MySQL Operator precedences INTERVAL BINARY, COLLATE ! (unary minus), ~ (unary bit inversion) ^ *, /, DIV, %, MOD -, + & | = (comparison), , >=, >,