DBMS Notes - PDF
Document Details
Uploaded by DistinguishedCarnelian3676
Aberdeen International College
Surya Basnet
Tags
Summary
These notes provide an introduction to database management systems (DBMS). Topics covered include the purpose of database systems, data views, database languages, transaction management, database administrators, system structures, data storage hierarchy, information, and database applications. The document contrasts file systems with database systems.
Full Transcript
U Unit 1 Introduction of Database Purpose of database system View of data Database languages Transaction management Database administrator...
U Unit 1 Introduction of Database Purpose of database system View of data Database languages Transaction management Database administrator System structure Data: Data is commonly defined as raw facts or observation, typically about physical phenomena or business transactions. For example of data would be the marks obtained by students in different subjects. Data can be in any form-numerical, textual, graphical, image, sound, video etc. The following figure shows the hierarchy of data storage. Bit Character Field Record File Database Fig:-Data storage hierarchy Information: Information is defined as refined or processed data that has been transformed into meaningful and useful form for specific users. For example, after processing the marks obtained by student it transformed into information, which is meaningful and from which we can decide which student stood first, second and so forth. Information comes from data and takes the form of table, graphs, diagrams etc. Database: A database-management system (DBMS) is a collection of interrelated data and a set of programs to access those data. The collection of data, usually referred to as the database, contains information relevant to an enterprise. The primary goal of a DBMS is to provide a way to store and retrieve database information that is both convenient and efficient. Database systems are designed to manage large bodies of information. Management of data involves both defining structures for storage of information and providing mechanisms for the manipulation of information. In addition, the database system must ensure the safety of the information stored, despite system crashes or attempts at unauthorized access. If data are to be shared among several users, the system must avoid possible anomalous results. Database System Applications Databases are widely used. Here are some representative applications: Banking: For customer information, accounts, and loans, and banking transactions. Airlines: For reservations and schedule information. Airlines were among the first to use databases in a geographically distributed manner—terminals situated around the world accessed the central database system through phone lines and other data networks. Universities: For student information, course registrations, and grades. Credit card transactions: For purchases on credit cards and generation of monthly statements. Telecommunication: For keeping records of calls made, generating monthly bills, maintaining balances on prepaid calling cards, and storing information about the communication networks. Finance: For storing information about holdings, sales, and purchases of financial instruments such as stocks and bonds. Sales: For customer, product, and purchase information. Manufacturing: For management of supply chain and for tracking production of items in factories, inventories of items in warehouses/stores, and orders for items. Human resources: For information about employees, salaries, payroll taxes and benefits, and for generation of paychecks. Database Systems versus File Systems Consider part of a savings-bank enterprise that keeps information about all customers and savings accounts. One way to keep the information on a computer is to store it in operating system files. To allow users to manipulate the information, the- system has a number of application programs that manipulate the files, including A program to debit or credit an account A program to add a new account A program to find the balance of an account A program to generate monthly statements file-processing system is supported by a conventional operating system.The system stores permanent records in various files, and it needs different application programs to extract records from, and add records to, the appropriate files. Before database management systems (DBMSs) came along, organizations usually stored information in such systems. Keeping organizational information in a file-processing system has a number of major disadvantages: Data redundancy and inconsistency. Since different programmers create the files and application programs over a long period, the various files are likely to have different formats and the programs may be written in several programming languages. Moreover, the same information may be duplicated in several places (files). For example, the address and telephone number of a particular customer may appear in a file that consists of savings-account records and in a file that consists of checking-account records. This redundancy leads to higher storage and access cost. In addition, it may lead to data inconsistency; that is, the various copies of the same data may no longer agree. For example, a changed customer address may be reflected in savings-account records but not elsewhere in the system. Difficulty in accessing data. Suppose that one of the bank officers needs to find out the names of all customers who live within a particular postal-code area. The officer asks the data-processing department to generate such a list. Because the designers of the original system did not anticipate this request, there is no application program on hand to meet it. There is, however, an application program to generate the list of all customers. The bank officer has now two choices: either obtain the list of all customers and extract the needed information manually or ask a system programmer to write the necessary application program. Both alternatives are obviously unsatisfactory. Suppose that such a program is written, and that, several days later, the same officer needs to trim that list to include only those customers who have an account balance of $10,000 or more. As expected, a program to generate such a list does not exist. Again, the officer has the preceding two options, neither of which is satisfactory. Data isolation. Because data are scattered in various files, and files may be in different formats, writing new application programs to retrieve the appropriate data is difficult. Integrity problems. The data values stored in the database must satisfy certain types of consistency constraints. For example, the balance of a bank account may never fall below a prescribed amount (say, $25). Developers enforce these constraints in the system by adding appropriate code in the various application programs. However, when new constraints are added, it is difficult to change the programs to enforce them. The problem is compounded when constraints involve several data items from different files. Atomicity problems. A computer system, like any other mechanical or electrical device, is subject to failure. In many applications, it is crucial that, if a failure occurs, the data be restored to the consistent state that existed prior to the failure. Consider a program to transfer $50 from account A to account B. If a system failure occurs during the execution of the program, it is possible that the $50 was removed from account A but was not credited to account B, resulting in an inconsistent database state. Clearly, it is essential to database consistency that either both the credit and debit occur, or that neither occur. That is, the funds transfer must be atomic—it must happen in its entirety or not at all. It is difficult to ensure atomicity in a conventional file-processing system. Concurrent-access anomalies. For the sake of overall performance of the system and faster response, many systems allow multiple users to update the data simultaneously. In such an environment, interaction of concurrent updates may result in inconsistent data. Consider bank account A, containing $500. If two customers withdraw funds (say $50 and $100 respectively) from account A at about the same time, the result of the concurrent executions may leave the account in an incorrect (or inconsistent) state. Suppose that the programs executing on behalf of each withdrawal read the old balance, reduce that value by the amount being withdrawn, and write the result back. If the two programs run concurrently, they may both read the value $500, and write back $450 and $400, respectively. Depending on which one writes the value last, the account may contain either $450 or $400, rather than the correct value of $350. To guard against this possibility, the system must maintain some form of supervision. But supervision is difficult to provide because data may be accessed by many different application programs that have not been coordinated previously. Security problems. Not every user of the database system should be able to access all the data. For example, in a banking system, payroll personnel need to see only that part of the database that has information about the various bank employees. They do not need access to information about customer accounts. But, since application programs are added to the system in an ad hoc manner, enforcing such security constraints is difficult. Characteristics of data in database: The data in a database should have the following features: Shared: Data should be sharable among different users and applications. Persistence: Data should exist permanently in the database. Changes in the database must not be lost because of any failure. Validity/Integrity/Correctness: It should maintain the integrity so that there is always correct data in the database. Security: Data should be protected from unauthorized access. Non-redundancy: Data should not be repeated. Consistency: A consistent state of the database satisfies all the constraints specified in the database. Data in a database is consistent if any changes in the database take the database from one consistent state to another. Independence: The three levels in the schema should be independent of each other so that the changes in the schema in one level should not affect the other levels. Purpose of DBMS (Functions of DBMS): The benefits of using DBMS are: To reduce redundancy:Repeating of the same information in database is called redundancy of data which leads to several problems such as wastage of space, duplication effort for entering data and inconsistency. When DBMS is used and database is created, redundancy is minimized. To avoid inconsistency:The database is said to be inconsistent if various copies of the same data may no longer agree. For example, a changed customer address may be reflected in saving account but not elsewhere in the system. By using DBMS we can avoid inconsistency. To share data:The data in the database can be shared among many users and applications. The data requirements of new applications may be satisfied without having to create any new stored files. To provide support for transactions: A transaction is a sequence of database operations that represents a logical unit of work. It accesses a database and transforms it from one state to another. A transaction can update a record, delete one, modify a set of records etc. when the DBMS does a ‘commit’, the changes made by the transaction are made permanent. We can roll back the transaction to undo the effects of transaction. To maintain integrity:Most database applications have certain integrity constraints that must hold for the data. A DBMS provides capabilities for defining and enforcing these constraints. For example, the value of roll number field of each student in student database should be unique for each student. It is a type of rule. Such a rule is enforced using constraint at the time of creation of database. To enforce security: Not every user of the database system should be able to access all data.Different checks can be established for each type of access (retrieve, modify, delete, etc) toeach piece of information in the database. To provide efficient backup and recovery: Provide facilities for recovering from software and hardware failures to restore database to previous consistent state. To Concurrent Access Database: Concurrent access means access to the same data simultaneously by more than one user. The same data may be used by many users for the purpose reading at the same time. But when a user tries to modify a data, there should be a concurrency control mechanism to avoid the inconsistency of data. A DBMS provides facilities for these operations. Disadvantages of DBMS Problem associated with centralization: Centralization increases the security problems. Cost of software: Today’s there are several softwares which are very costly. Hence from economic point of view it is the drawback. Cost of hardware:To support various software some upgraded hardware components are needed.Hence from economic point of view it is the drawback. Complexity of backup and recovery:DBMS provides the centralization of the data, which requires the adequate backups of data. Overhead for providing generality, security, recovery, integrity, and concurrency control. If the database and applications are simple, well defined, and not expected to change. If there are stringent real-time requirements that may not be met because of DBMS overhead. Differences betn DBMS and file processing system: DBMS File processing system 1. A Database Management System 1. A flat file system stores data in a plain (DBMS) is a collection of interrelated text file.A flat file is a file that data and the set of programs to access contains records, and in which each those data. record is specified in a single line. 2. Data redundancy problem is not 2. Data redundancy problem exist. found. 3. Data inconsistency may exist. 3. Data inconsistency does not exist. 4. Accessing data from database is 4. Accessing data from database is comparatively difficult. easier. 5. Here data are scattered in various files and formats so data isolation problem 5. The problem of data isolations is not exist. found. 6. Here these problems are found. 6. Atomicity and integrating problems 7. Data are less secure. are not found. 8. Here there is no concurrent access and 7. Data are more secure. no recovery. 8. Concurrent access and crash recovery. Views of Data/ Data abstraction: The system hides certain details of how the data are stored and maintained and such view is an abstract view. The Database System provides users with an abstract view of the data. Data Abstraction:-The database designers use the complex data structure to represent the data in the database and developer hides the complexity from user from several level of abstraction such as physical level, logical level, and view level. This process is called data abstraction. Levels of Data Abstraction: The three levels of data abstraction can be shown as follows: External view View 1 View 2 View n ------------ View level Conceptual Level Logical Level Internal Level Physical Level Stored database Fig: Three levels of data abstraction Physical level It is the lowest level of abstractions describes how the data are actually stored. The physical level describes complex low level data structure in details. At this level records such as customer, account can be described as a block of consecutive storage location (e.g. byte, word) The database system hides many of the lowest level storage details from database programmer. Database administrator may be aware of certain details of the physical organization of the data. Logical level It is the next higher level of data abstraction which describes what data are stored in the database, and what relationships exist among those data. At the logical level , each record is described by a type definition Programmers and database administrator works at this level of abstraction. View level It is the highest level of abstraction describes only a part of the database and hides some information to the user. At view level, computer users see a set of application programs that hide details of data types. Similarly at the view level several views of the database are defined and database user see only these views. Views also provides the security mechanism to prevent users from accessing certain parts of the database (that is views can also hide information (such as an employee‘s salary) for security purposes.) Instances and Schemas Instance (Database State): The collection of information stored in the database at a particular moment is called an instance of the database. It is the actual content of the database at a particular point in time The term instance is also applied to individual database components, e.g. record instance, table instance, entity instance. Initial Database State Refers to the database state when it is initially loaded into the system. Valid State A state that satisfies the structure and constraints of the database. Schema: The overall design of the database which is not expected to change frequently is called database schema. Simply, the database schema is the logical structure of the database. The concept of database schema and instances can be understood by analogy to a program written in a programming language A database schema corresponds to the variable declaration and the values of the variables in a program at a point in time correspond to an instance of a database. The database systems have several schemas and partitioned according to the level of abstraction such as physical and logical schema. STUDENT Name Student-number Class Major Fig: Schema diagram for Student Note: The database schema changes very infrequently. The database state changes every time the database is updated. Data Independence: The three schema architecture further explains the concept of data independence, the capacity to change the schema at one level without having to change the schema at the next higher level. Logical Data Independence Physical Data Independence Logical Data Independence: The capacity to change the conceptual schema without having to change the external schemas and their associated application programs is called logical data independence. When modification is done to the conceptual schema (i.e tables) the mapping called “external mapping” is changes automatically by DBMS. Physical Data Independence: The capacity to change the internal schema without having to change the conceptual schema is called physical data independence.When a schema at a lower level is changed, only the mappings between this schema and higher-level schemas need to be changed in a DBMS. This mapping is called “logical mapping”. For example, the internal schema may be changed when certain file structures are reorganized or new indexes are created to improve database performance. Database Languages: DBMS provides two languages: Data-Definition Language (DDL) and Data-Manipulation Language (DML). Data Definition Language (DDL): Data definition language is the specification notation for defining the database schema. Used by the DBA and database designers to specify the conceptual schema of a database. In many DBMSs, the DDL is also used to define internal and external schemas (views). Example: CREATE TABLE account ( account-number CHAR(10), balance INTEGER ) Execution of the above DDL statement creates the account table. It updates a special set of tables called the data dictionary. Data dictionary: DDL compiler generates a set of tables stored in a data dictionary. Simply, Data dictionary is a special set of tables that contain the information about tables. Data dictionary contains metadata (i.e., data about data) Metadata: data that describes the database or one of its parts is called metadata. The schema of a table is an example of metadata The DDL provides the facilities to define: Database scheme Database tables Integrity constraints Domain constraints Referential integrity (references constraint in SQL) Assertions Triggers Views Security and Authorization Modify the Scheme The common DDL Commands are: CREATE, ALTER,DROP Data Manipulation Language (DML): A Data-manipulation language (DML) is a language that enables users to access or manipulate data organized by the appropriate data model.DML also known as query language. Ex. SELECT * FROM account WHERE balance(SELECT MAX (S2.age) FROM Sailors S2 WHERE S2.rating =9); The result-set will look like this: Sname Robin Example 7: find the maximum and minimum aged sailors name. SELECT sname FROM Sailors WHERE age=(SELECT max(age) FROM Sailors) UNION SELECT S1.sname FROM Sailors S1 WHERE S1.age=(SELECT min(age) FROM Sailors); The result-set will look like this: Sname Robin Raju GROUP BY Clause: The SQL GROUP BY clause is use to divide the rows in a table into groups. The GROUP BY statement is used along with the SQL aggregate functions. In GROUP BY clause, the tuples with same values are placed in one group. Example: Find the age of the youngest sailor for each rating level. SELECT rating, MIN(age) FROM Sailors GROUP BY rating; The result-set will look like this: Rating Min(age) 4 19 7 22 9 31 11 43 12 33 32 28 This table display the minimum age of each group according to their rating. SQL HAVING Clause: The SQL HAVING clause allows us to specify conditions on the rows for each group. It is used instead of the WHERE clause when Aggregate Functions are used. HAVING clause should follow the GROUP BY clause if we are using it. Example: let’s take an instance S3 of Sailors, S3 Sid sname Rating age 1 Ajaya 12 33 2 Robin 11 43 3 Ganga 32 28 4 Manoj 9 31 7 Rahul 7 22 9 Sanjaya 9 42 11 Raju 4 19 22 Robin 11 54 32 Anish 7 21 Example: Find the age of the youngest sailor who is eligible to vote (i.e., is at least 18 years old) for each rating level with at least two such sailors. SELECT S.rating, MIN (S.age) AS minage FROM Sailors S WHERE S.age>= 18 GROUP BY S.rating HAVING COUNT (*) >1 The result-set will look like this: Rating Minage 7 21 9 31 11 43 Example 2: Find the average age of sailors who are of voting age (i.e., at least 18 years old) for each rating level that has at least two sailors. SELECT S.rating, AVG ( S.age ) FROM Sailors S WHERE S. age >= 18 GROUP BY S.rating HAVING 1 = SOME, = SOME, SOME =ALL, = ALL, ALL The keyword ANY is synonymous to SOME in SQL. Example 1: let’s take an instance S4 of Sailors as: S4 Sid Sname Rating Age 1 Ajaya 12 33 2 Robin 11 43 3 Ganga 32 28 4 Manoj 9 31 7 Rahul 7 22 9 Sanjaya 9 42 11 Raju 4 19 8 Rahul 6 76 Find the id and names of sailors whose rating is better than some sailor called “Rahul”. SELECT sid, sname FROM S4 WHERE rating >ANY (SELECT rating FROM S4 WHERE sname=’Rahul’); The result-set will look like this: Sid Sname 1 Ajaya 2 Robin 3 Ganga 4 Manoj 7 Rahul 9 Sanjaya Example 2: Find the id and names of sailors whose rating is better than every sailor called “Rahul”. SELECT sid, sname FROM S4 WHERE rating >ALL (SELECT rating FROM S4 WHERE sname=’Rahul’); The result-set will look like this: Sid Sname 1 Ajaya 2 Robin 3 Ganga 4 Manoj 9 Sanjaya Example 3: Find the id and name of sailorwith height rating. SELECT sid, sname FROM S4 WHERE rating >=ALL (SELECT rating FROM S4); The result-set will look like this: Sid Sname 3 Ganga Note: IN and NOT IN are equivalent to =ANY and respectively. Views: A database view is a logical table. It does not physically store data like tables but represent data stored in underlying tables in different formats. A view does not require desk space and we can use view in most places where a table can be used. Since the views are derived from other tables thus when the data in its source tables are updated, the view reflects the updates as well. They also can be used by DBA to enforce database security. Advantages of Views: Database security: view allows users to access only those sections of database that directly concerns them. View provides data independence. Easier querying Shielding from change Views provide group of users to access the data according to their criteria. Vies allow the same data to be seen by different users in different ways at the same time. Syntax for creating view is: Optional CREATE VIEW AS Where, is any legal query expression. Example: Following view contains the id, name, rating+5 and age of those Sailors whose age is greater than 30: CREATE VIEW Sailor_view AS SELECT sid, sname, rating+5, age FROM Sailors WHERE age>30; Now by executing this query we get following view (logical table); Sailor_view Sid sname Rating+5 age 1 Ajaya 17 33 2 Robin 16 43 9 Sanjaya 14 42 8 Rahul 11 76 Now any valid database operations can be performed in this view like in that of general table. Modification of the database: Until now we only study about how information can be extract from the database. Now, we show how to add, remove, or change information with SQL. To modify database, there are mainly three operations are used: Insertion Deletion and Updates 1.Insertion: To insert data into a relation, we either specify a tuple to be inserted or write a query whose result is a set of tuples to be inserted. Example 1: suppose we need to insert a new record of Sailors of id is11, name is “Rahul”, rating is 9 and of age is 29 then we write following SQL query, INSERT INTO Sailors VALUES(11,’Rahul’,9,29); OR INSERT INTO Sailors (sid, sname, rating, age) VALUES (11,’Rahul’,9,29); More generally, we might want to insert tuples on the basis of the result of query. Example 2:suppose we have already some tuples on the relation ‘Sailores’. Suppose we need to insert those tuples of sailors into their own relation whose rating is less than 7, this can be write as, INSERT INTO Sailors SELECT * FROM Sailors WHERE rating40; Joined relations: An SQL JOIN clause is used to combine rows from two or more tables, based on a common field between them. The types the different SQL JOINs are: INNER JOIN LEFT OUTER JOIN RIGHT OUTER JOIN FULL OUTER JOIN INNER JOIN: It is most common type of join. An SQL INNER JOIN return all rows from multiple tables where the join condition is met. SQL INNER JOIN Syntax: SELECT column_name(s) FROM table1 INNER JOIN table2 ON table1.column_name=table2.column_name; Note: INNER JOIN is the same as JOIN. Example 1: Find the sailor id, boat id, boat name, boat color of those sailors who have reserved a red boat. SELECT sailors.sid, boats.bid, boats.bid, boats.bname, boats.color FROM sailors INNER JOIN reserver INNER JOIN boats WHERE sailors.sid=reserver.sid AND reserver.bid=boats.bid; The result-set will look like this: Example 2: Find the name and age of those sailors who have reserved a Marine boat. SELECT sailors.sname, sailors.age FROM sailors INNER JOIN reserver INNER JOIN boats WHERE sailors.sid=reserver.sid AND reserver.bid=boats.bid; The result-set will look like this: LEFT OUTER JOIN: The LEFT JOIN keyword returns all rows from the left table (table1), with the matching rows in the right table (table2). The result is NULL in the right side when there is no match. SQL LEFT JOIN Syntax: SELECT column_name(s) FROM table1 LEFT OUTER JOIN table2 ON table1.column_name=table2.column_name; RIGHT OUTER JOIN: The RIGHT JOIN keyword returns all rows from the right table (table2), with the matching rows in the left table (table1). The result is NULL in the left side when there is no match. SQL RIGHT JOIN Syntax: SELECT column_name(s) FROM table1 RIGHT OUTER JOIN table2 ON table1.column_name=table2.column_name; FULL OUTER JOIN: The FULL OUTER JOIN keyword returns all rows from the left table (table1) and from the right table (table2).The FULL OUTER JOIN keyword combines the result of both LEFT and RIGHT joins. SQL FULL OUTER JOIN Syntax: SELECT column_name(s) FROM table1 FULL OUTER JOIN table2 ON table1.column_name=table2.column_name; Data definition languages: The DDL part of SQL permits database tables to be created or deleted. It also defines indexes (keys), specify links between tables, and impose constraints between tables. The most important DDL statements in SQLare: CREATE DATABASE- creates a new database ALTER DATABASE- modifies a database CREATE TABLE- creates a new table ALTER TABLE- modifies a table DROP TABLE- deletes a table CREATE INDEX- creates an index (search key) DROP INDEX- deletes an index Domain type (data type) in SQL: When we create a table each column of the table must be specified by their domain or data type. Due to e it helps us what type of data will be stored in the field. The SQL standard supports a variety of build-in domain types which are: Char (n): A fixed length character data (string). Also we can use full form character. Varchar(n): A variable character string. Int: used to represent whole number. Also we can use it’s full form integer. Numeric(p, d): A fixed point number with user-specified precision. The number consists of p digits (plus a sign), and d represent the number of digits to right of decimal point. Real, double precision: floating point numbers with machine dependent precisions. Float(n): floating point number with precision of at least n digits. Date: A calendar date containing a four digit year, month and day. Eg ‘2006-04-22’ Time: The time of day, in hours, minutes, and seconds.eg ’09:34:23’ Timestamp: combination of date and time.eg ‘2008-05-21 11:23:08’ CREATE DATABASE: The CREATE DATABASEstatement is used to create a database. Syntax: CREATE DATABASE dbname; Example: CREATE DATABASE my_db; After creating database ‘my_db’ we need to connect it as; CONNECT my_db; ALTER DATABASE: Allow us to modify existing database name. Syntax: CREATE TABLE: Allow us to create a new table within given database. Syntax: CREATE TABLE ( [not null] [unique][], [not null] [unique][], ………………. ……………… [not null] [unique][] ) Note: [ ]: optional Example: CREATE TABLE Sailors ( Sid INTEGER NOT NULL, Sname VARCHAR(12), Rating INTEGER, Age INTEGER, PRIMARY KEY (sid) ) ALTER TABLE: Allow us to modifya given table. The structure of given table can be changed either of the following: By adding new column in existing table By deleting some columns from an existing table and By modifying some columns of given table A new column can be added to the table as follows: Syntax: ALTER TABLE ADD (); Example: suppose we want to add a new column ‘addresses’ to an existing table Sailors, ALTER TABLE Sailors ADD (addresses varchar(15)); An existing column can be removed from the table as, ALTER TABLE DROP (column_name); Example: suppose we want to remove an existing column ‘addresses’ from the table sailors as, ALTER TABLE Sailors DROP (addresses); An existing column can be modified as, ALTER TABLE MODIFY (); Example: modify sailors relation by changing the range of the name of sailors by 20, ALTER TABLE sailors MODIFY (snamevarchar(20)); DROP TABLE It allows us to remove an existing table from the database. Syntax: DROP TABLE Example: if we want to remove a table ‘Sailors’ from the database my_db as, DROP TABLE Sailors; Embedded SQL: The programming language in which SQL queries are embedded is called host language. And the SQL structures permitted in the host language is called embedded SQL. They are compiled by the embedded SQL processor. Writing queries in SQL is usually much easier than coding same query in a programming language. However, a programmer must access to a database from a general purpose programming language for following two reasons: Not all queries can be expressed in SQL Non-declarative actions such as printing a report, interacting with user, or sending the result of query to a graphical user interface etc. cannot be done from the SQL. Dynamic SQL: The dynamic SQL component of SQL allows programs to construct and submit SQLqueries at run time. In contrast, embedded SQL statements must be completely presentat compile time; they are compiled by the embedded SQL preprocessor. SQL Practice:- Query 1. Retrieve the name and address of all employees who work for the ‘Research’ department. Q1: SELECT Fname, Lname, Address FROM EMPLOYEE, DEPARTMENT WHERE Dname=‘Research’ AND Dnumber=Dno; Query 2. For every project located in ‘Stafford’, list the project number, the controlling department number, and the department manager’s last name, address, and birth date. Q2: SELECT Pnumber, Dnum, Lname, Address, Bdate FROM PROJECT, DEPARTMENT, EMPLOYEE WHERE Dnum=Dnumber AND Mgr_ssn=Ssn AND Plocation=‘Stafford’; Query 8. For each employee, retrieve the employee’s first and last name and the first and last name of his or her immediate supervisor. Q8: SELECT E.Fname, E.Lname, S.Fname, S.Lname FROM EMPLOYEE AS E, EMPLOYEE AS S WHERE E.Super_ssn=S.Ssn; Queries 9 and 10. Select all EMPLOYEE Ssns (Q9) and all combinations of EMPLOYEE Ssn and DEPARTMENT Dname (Q10) in the database. Q9: SELECT Ssn FROM EMPLOYEE; Q10: SELECT Ssn, Dname FROM EMPLOYEE, DEPARTMENT; Query 11. Retrieve the salary of every employee (Q11) and all distinct salary values (Q11A). Q11: SELECT ALL Salary FROM EMPLOYEE; Q11A: SELECT DISTINCT Salary FROM EMPLOYEE; Query 4. Make a list of all project numbers for projects that involve an employee whose last name is ‘Smith’, either as a worker or as a manager of the department that controls the project. ( SELECT DISTINCT Pnumber FROM PROJECT, DEPARTMENT, EMPLOYEE WHERE Dnum=Dnumber AND Mgr_ssn=Ssn AND Lname=‘Smith’ ) UNION ( SELECT DISTINCT Pnumber FROM PROJECT, WORKS_ON, EMPLOYEE WHERE Pnumber=Pno AND Essn=Ssn AND Lname=‘Smith’ ); Query 12. Retrieve all employees whose address is in Houston, Texas. SELECT Fname, Lname FROM EMPLOYEE WHERE Address LIKE ‘%Houston,TX%’; Query 12A. Find all employees who were born during the 1950s. SELECT Fname, Lname FROM EMPLOYEE WHERE Bdate LIKE ‘_ _ 5 _ _ _ _ _ _ _’; Query 13. Show the resulting salaries if every employee working on the ‘ProductX’ project is given a 10 percent raise. SELECT E.Fname, E.Lname, 1.1 * E.Salary AS Increased_sal FROM EMPLOYEE AS E, WORKS_ON AS W, PROJECT AS P WHERE E.Ssn=W.Essn AND W.Pno=P.Pnumber AND P.Pname=‘ProductX’; Query 14. Retrieve all employees in department 5 whose salary is between $30,000 and $40,000. Q14: SELECT * FROM EMPLOYEE WHERE (Salary BETWEEN 30000 AND 40000) AND Dno = 5; Query 15. Retrieve a list of employees and the projects they are working on, ordered by department and, within each department, ordered alphabetically by last name, then first name. Q15: SELECT D.Dname, E.Lname, E.Fname, P.Pname FROM DEPARTMENT D, EMPLOYEE E, WORKS_ON W, PROJECT P WHERE D.Dnumber= E.Dno AND E.Ssn= W.Essn AND W.Pno= P.Pnumber ORDER BY D.Dname, E.Lname, E.Fname; Insert update delete command:- INSERT INTO EMPLOYEE VALUES ( ‘Richard’, ‘K’, ‘Marini’, ‘653298653’, ‘1962-12-30’, ‘98 Oak Forest, Katy, TX’, ‘M’, 37000, ‘653298653’, 4 ); DELETE FROM EMPLOYEE WHERE Lname=‘Brown’; DELETE FROM EMPLOYEE WHERE Ssn=‘123456789’; DELETE FROM EMPLOYEE WHERE Dno=5; DELETE FROM EMPLOYEE; give all employees in the ‘Research’ department a 10 percent raise in salary UPDATE EMPLOYEE SET Salary = Salary * 1.1 WHERE Dno = 5; Query 18. Retrieve the names of all employees who do not have supervisors. Q18: SELECT Fname, Lname FROM EMPLOYEE WHERE Super_ssn IS NULL; Query 16. Retrieve the name of each employee who has a dependent with the same first name and is the same sex as the employee. Q16: SELECT E.Fname, E.Lname FROM EMPLOYEE AS E WHERE E.Ssn IN ( SELECT Essn FROM DEPENDENT AS D WHERE E.Fname=D.Dependent_name AND E.Sex=D.Sex ); Query 6. Retrieve the names of employees who have no dependents. Q6: SELECT Fname, Lname FROM EMPLOYEE WHERE NOT EXISTS ( SELECT * FROM DEPENDENT WHERE Ssn=Essn ); Query 7. List the names of managers who have at least one dependent. Q7: SELECT Fname, Lname FROM EMPLOYEE WHERE EXISTS ( SELECT * FROM DEPENDENT WHERE Ssn=Essn ) AND EXISTS ( SELECT * FROM DEPARTMENT WHERE Ssn=Mgr_ssn ); Query 17. Retrieve the Social Security numbers of all employees who work on project numbers 1, 2, or 3. Q17: SELECT DISTINCT Essn FROM WORKS_ON WHERE Pno IN (1, 2, 3); Consider query Q1, which retrieves the name and address of every employee who works for the ‘Research’ department. SELECT Fname, Lname, Address FROM (EMPLOYEE JOIN DEPARTMENT ON Dno=Dnumber) WHERE Dname=‘Research’; Query 19. Find the sum of the salaries of all employees, the maximum salary, the minimum salary, and the average salary. Q19: SELECT SUM (Salary), MAX (Salary), MIN (Salary), AVG (Salary) FROM EMPLOYEE; Query 20. Find the sum of the salaries of all employees of the ‘Research’ department, as well as the maximum salary, the minimum salary, and the aver-age salary in this department. Q20: SELECT SUM (Salary), MAX (Salary), MIN (Salary), AVG (Salary) FROM (EMPLOYEE JOIN DEPARTMENT ON Dno=Dnumber) WHERE Dname=‘Research’; Queries 21 and 22. Retrieve the total number of employees in the company (Q21) and the number of employees in the ‘Research’ department (Q22). Q21: SELECT COUNT (*) FROM EMPLOYEE; Q22: SELECT COUNT (*) FROM EMPLOYEE, DEPARTMENT WHERE DNO=DNUMBER AND DNAME=‘Research’; Query 23. Count the number of distinct salary values in the database. Q23: SELECT COUNT (DISTINCT Salary) FROM EMPLOYEE; Query 24. For each department, retrieve the department number, the number of employees in the department, and their average salary. Q24: SELECT Dno, COUNT (*), AVG (Salary) FROM EMPLOYEE GROUP BY Dno; Query 25. For each project, retrieve the project number, the project name, and the number of employees who work on that project. Q25: SELECT Pnumber, Pname, COUNT (*) FROM PROJECT, WORKS_ON WHERE Pnumber=Pno GROUP BY Pnumber, Pname; Query 26. For each project on which more than two employees work, retrieve the project number, the project name, and the number of employees who work on the project. SELECT Pnumber, Pname, COUNT (*) FROM PROJECT, WORKS_ON WHERE Pnumber=Pno GROUP BY Pnumber, Pname HAVING COUNT (*) > 2; Query 27. For each project, retrieve the project number, the project name, and the number of employees from department 5 who work on the project. Q27: SELECT Pnumber, Pname, COUNT (*) FROM PROJECT, WORKS_ON, EMPLOYEE WHERE Pnumber=Pno AND Ssn=Essn AND Dno=5 GROUP BY Pnumber, Pname; Query 28. For each department that has more than five employees, retrieve the department number and the number of its employees who are making more than $40,000. Q28: SELECT Dnumber, COUNT (*) FROM DEPARTMENT, EMPLOYEE WHERE Dnumber=Dno AND Salary>40000 AND ( SELECT Dno FROM EMPLOYEE GROUP BY Dno HAVING COUNT (*) > 5) Specification of Views in SQL CREATE VIEW WORKS_ON1 AS SELECT Fname, Lname, Pname, Hours FROM EMPLOYEE, PROJECT, WORKS_ON WHERE Ssn=Essn AND Pno=Pnumber; CREATE VIEW DEPT_INFO(Dept_name, No_of_emps, Total_sal) AS SELECT Dname, COUNT (*), SUM (Salary) FROM DEPARTMENT, EMPLOYEE WHERE Dnumber=Dno GROUP BY Dname; To retrieve the last name and first name of all employees who work on the ‘ProductX’ project, we can utilize the WORKS_ON1 view SELECT Fname, Lname FROM WORKS_ON1 WHERE Pname=‘ProductX’; To Drop View:- DROP VIEW WORKS_ON1; Exercise 1:Consider the following relations: Student(snum: integer, sname: string, major: string, level: string, age: integer) Class(cname: string, meets-at: time, room: string, fid: integer) Enrolled(snum: integer, cname: string) Faculty(fid: integer, fname: string, deptid: integer) 1. Find the names of all Juniors (Level = JR) who are enrolled in a class taught by I. Teach. 2. Find the age of the oldest student who is either a History major or is enrolled in a course taught by I. Teach. 3. Find the names of all classes that either meet in room R128 or have five or more students enrolled. 4. Find the names of all students who are enrolled in two classes that meet at the same time. 5. Find the names of faculty members who teach in every room in which some class is taught. 6. Find the names of faculty members for whom the combined enrollment of the courses that they teach is less than _ve. 7. Print the Level and the average age of students for that Level, for each Level. 8. Print the Level and the average age of students for that Level, for all Levels except JR. 9. Find the names of students who are enrolled in the maximum number of classes. 10. Find the names of students who are not enrolled in any class. Exercise 2 Consider the following schema: Suppliers(sid: integer, sname: string, address: string) Parts(pid: integer, pname: string, color: string) Catalog(sid: integer, pid: integer, cost: real) Write the followingqueries in SQL: 1. Find the pnames of parts for which there is some supplier. 2. Find the snames of suppliers who supply every part. 3. Find the snames of suppliers who supply every red part. 4. Find the pnames of parts supplied by Acme Widget Suppliers and by no one else. 5. Find the sids of suppliers who charge more for some part than the average cost of that part (averaged over all the suppliers who supply that part). 7. Find the sids of suppliers who supply only red parts. 8. Find the sids of suppliers who supply a red part and a green part. 9. Find the sids of suppliers who supply a red part or a green part. Exercise 3Consider the following schema of the relational database Books(Bid,Btitle, Bauthor, Bpublisher, Bprice) Members( Member_id, Name, Designation, Age) Reserve( Member_id,Bid, Date) 1. Create the tables using Books, Members and Reserve by specifying the Primary key, Not NULL , Foreign key Constraints DDL Statement in MySQL database 2. Write SQL DML statement to insert any five tuples ( Five records) in each relation(table) 3. Find the Books of Database System title and price above 500. 4. List the books published by TaTa McGraw Hill publication 5. Find the Name of the member who made reserve book in 12-10-2011 Exercise 4Consider the following schema of the relational database Department(dept_no,d_name, city) Employee(emp_Id, e_name, salary) Works(dept_no, emp_Id) 1. Create the above tables by specifying the Primary key, Not NULL , Foreign key Constraints DDL Statement in MySQL database 2. Write DML Statement to Insert any five records in each tables 3. Display the name of the employees. 4. Find the name of the employees whose salary is greater than 10000. 5. Find the department (d_name) of the employee ‘Binek’. Exercise 5: Consider the following insurance database, where primary keys are underlined: Teacher (Tid, Tname, Address, Age) Student (Sid,Sname, Age, sex) Takes (sid,course-id) Course (course-id,course_name, text_book) Teaches(Tid, Cousrse-id) Taught-by{Sid, Tid} Construct the following RA expressions for this relational database a. Fine name, age and sex of all students who takes course “DBMS” b. Find total number of students who are taught by the teacher “T01” c. List all course names text books taught by teaher “T16” d. Find average age of teachers for each course. e. Insert the record of new teacher “T06” named “Bhupi” of age 27 into database who lives in “Balkhu” and takes course “DBMS” Employee Database --------------------------------- employee (employee_name, street, city) works (employee_name, company_name, salary) company (company_name, city) manages (employee_name, manager_name) Consider the employee database , where the primary keys are underlined. Give an expression in SQL for each of the following queries. a. Find the names of all employees who work for First Bank Corporation. b. Find all employees in the database who live in the same cities as the companies for which they work. c. Find all employees in the database who live in the same cities and on the same streets as do their managers. d. Find all employees who earn more than the average salary of all employees of their company. e. Find the company that has the smallest payroll. Answer: 3 a. Find the names of all employees who work for First Bank Corporation. select employee_name from works where company_name = ’First Bank Corporation’; b. Find all employees in the database who live in the same cities as the com- panies for which they work. select e.employee_name from employee e, works w, company c where e.employee_name = w.employee_name and e.city = c.city and w.company_name = c.company_name c. Find all employees in the database who live in the same cities and on the same streets as do their managers. select P.employee_name from employee P, employee R, manages M where P.employee_name = M.employee_name and M.manager_name = R.employee_name and P.street = R.street and P.city = R.city d. Find all employees who earn more than the average salary of all employees of their company. The following solution assumes that all people work for at most one company. select employee_name from works where salary > (select avg (salary) from works ) ) e. Find the company that has the smallest payroll. select company name from works group by company name having sum (salary) Y holds if whenever two tuples have the same value for X, they must have the same value for Y. For any tuples t1 and t2 in any relation instance r(R): if t1[X]=t2[X], then t1[Y]=t2[Y] Customer(Cid,Cname,Age,Salary) Cid->Cname It is true because Cid uniquely determine the customer name even in the case of duplicate Cname. Cname-> Cid It is false because the name of the customer may be same for different Cid. So Cname not uniquely determine the customer. cid-> Age [True] cid-> salary[True] Age-> cid [False] Example 2:-Consider a relation supplier Supplier(supplier_id,sname,status, city) Here, sname,status and city are functionally dependent on supplier_id. Meaning is that each supplier_id uniquely determines the value of attributes supplier_name, supplier_status and city. This can be expressed by supplier_id->sname supplier_id-> status supplier_id-> city 102 Types of Functional Dependency:- -------------------------------------------- There are many types of functional dependencies, depending on several criteria. Fully functional dependency For a relation schema R, in a functional dependency X->Y, Y is said to be fully functional dependent on X. Partial Functional Dependency For a relation schema R, in a functional dependency X->Y, Y is said to be partial functional dependent on X if by removal of some attributes from X and the dependency still holds. Example The dependency {Emp_id,Proj_No} -> emp_name is partial becauase only emp_id-> emp_name can holds. Transitive Dependency if X->Y and Y->Z then X->Z Trivial and Non-Trivial dependencies:- A functional dependency X->Y is said to be trivial dependency. If Y is subset of X(not necessarily a proper subset) The functional dependency that satisfied by all the relations is called trivial dependency. Example A->A is satisfied by all the relations involving attribute A. The functional dependency that is not trivial is said to be non-trivial dependency. Armstrong Axioms (Inference Rules for FDs) :- Reflexive:- If Y subset of X, then X->Y Augmentation: If X-> Y , then XZ->YZ Transitive: If X->Y and Y->Z , then X->Z Decomposition: If X->YZ then X-> Y and X->Z Union: If X->Y and X->Z then X->YZ Pseudo-transitivity: If X->Y and WY->Z, then WX->Z Closure of Set of Functional Dependencies:- The Closure Of Functional Dependency means the complete set of all possible attributes that can be functionally derived from given functional dependency using the inference rules known as Armstrong's Rules. If “F” is a functional dependency then closure of functional dependency can be denoted using “{F}+”. Closure Of Functional Dependency : Introduction The Closure Of Functional Dependency means the complete set of all possible attributes that can be functionally derived from given functional dependency using the inference rules known as Armstrong’s Rules. If “F” is a functional dependency then closure of functional dependency can be denoted using “{F}+”. There are three steps to calculate closure of functional dependency. These are: 103 Step-1 : Add the attributes which are present on Left Hand Side in the original functional dependency. Step-2 : Now, add the attributes present on the Right Hand Side of the functional dependency. Step-3 : With the help of attributes present on Right Hand Side, check the other attributes that can be derived from the other given functional dependencies. Repeat this process until all the possible attributes which can be derived are added in the closure. Seems difficult? Check out the example explained below and it will surely clear your doubt on how to calculate closure of functional dependency. Closure Of Functional Dependency : Examples Example-1 : Consider the table student_details having (Roll_No, Name,Marks, Location) as the attributes and having two functional dependencies. FD1 : Roll_No-> Name, Marks FD2 : Name -> Marks, Location Now, We will calculate the closure of all the attributes present in the relation using the three steps mentioned below. Step-1 : Add attributes present on the LHS of the first functional dependency to the closure. {Roll_no}+ = {Roll_No} Step-2 : Add attributes present on the RHS of the original functional dependency to the closure. {Roll_no}+ = {Roll_No, Marks} Step-3 : Add the other possible attributes which can be derived using attributes present on the RHS of the closure. So Roll_No attribute cannot functionally determine any attribute but Name attribute can determine other attributes such as Marks and Location using 2nd Functional Dependency(Name Marks, Location). Therefore, complete closure of Roll_No will be : 104 {Roll_no}+ = {Roll_No, Marks, Name, Location} Similarly, we can calculate closure for other attributes too i.e “Name”. Step-1 : Add attributes present on the LHS of the functional dependency to the closure. {Name}+ = {Name} Step-2 : Add the attributes present on the RHS of the functional dependency to the closure. {Name}+ = {Name, Marks, Location} Step-3 : Since, we don’t have any functional dependency where “Marks or Location” attribute is functionally determining any other attribute , we cannot add more attributes to the closure. Hence complete closure of Name would be : {Name}+ = {Name, Marks, Location} NOTE : We don’t have any Functional dependency where marks and location can functionally determine any attribute. Hence, for those attributes we can only add the attributes themselves in their closures. Therefore, {Marks}+ = {Marks} and {Location}+ = { Location} Example-2 : Consider a relation R(A,B,C,D,E) having below mentioned functional dependencies. FD1 : A -> BC FD2 : C -> B FD3 : D -> E FD4 : E -> D 105 Now, we need to calculate the closure of attributes of the relation R. The closures will be: {A}+ = {A, B, C} {B}+ = {B} {C}+ = {B, C} {D}+ = {D, E} {E}+ = {E} Closure Of Functional Dependency : Calculating Candidate Key “A Candidate Key of a relation is an attribute or set of attributes that can determine the whole relation or contains all the attributes in its closure." Let’s try to understand how to calculate candidate keys. Example-1 : Consider the relation R(A,B,C) with given functional dependencies : FD1 : A -> B FD2 : B -> C Now, calculating the closure of the attributes as : {A}+ = {A, B, C} {B}+ = {B, C} {C}+ = {C} Clearly, “A” is the candidate key as, its closure contains all the attributes present in the relation “R”. Example-2 : Consider another relation R(A, B, C, D, E) having the Functional dependencies : FD1 : A -> BC FD2 : C -> B FD3 : D -> E FD4 : E -> D Now, calculating the closure of the attributes as : {A}+ = {A, B, C} {B}+ = {B} {C}+ = {C, B} 106 {D}+ = {E, D} {E}+ = {E, D} In this case, a single attribute does is unable to determine all the attribute on its own like in previous example. Here, we need to club two or more attributes to determine the candidate keys. {A, D}+ = {A, B, C, D, E} {A, E}+ = {A, B, C, D, E} Hence, "AD" and "AE" are the two possible keys of the given relation “R”. Any other combination other than these two would have acted as extraneous attributes. NOTE : Any relation “R” can have either single or multiple candidate keys. Closure Of Functional Dependency : Key Definitions Prime Attributes : Attributes which are indispensable part of candidate keys. For example : “A, D, E” attributes are prime attributes in above example-2. Non-Prime Attributes : Attributes other than prime attributes which does not take part in formation of candidate keys. For example. Extraneous Attributes : Attributes which does not make any effect on removal from candidate key. For example : Consider the relation R(A, B, C, D) with functional dependencies : FD1 : A -> BC FD2 : B -> C FD3 : D -> C Here, Candidate key can be “AD” only. Hence, Prime Attributes : A, D. Non-Prime Attributes : B, C Extraneous Attributes : B, C(As if we add any of the to the candidate key, it will remain unaffected). Those attributes, which if removed does not affect closure of that set. Example 1: Let R = (A,B,C,D) F={A->B,A->C,BC->D}. List several member of F+. Soln:- 107 Since A->B,A->C then A->BC [ By Union Rule] Since A->BC , BC->D then AB-> D [ By pseudo transitivity rule] Hence F+ ={A->B,A->C,BC->D,A->BC,A-->D,AB->D} Example 2: let R=(A,B,C,G,H,I) F={A->B, A->C,CG->H,CG->I,B->H}.Compute closure of F (F+) Soln:- since A->B,B->H then A->H [ By transitivity rule] since A->B,A->C then A->BC [by union rule] since A->C then AG->CG [ by augmentation rule] since AG->CG, CG-> I then AG->I [ by transitivity rule] since CG->H , CG->I then CG-> HI [ by union rule] Augmentation of CG->I to infer CG->CGI , augmentation of CG->H to infer CGI->HI and then transitivity. Hence, F+= {A->A ,B->B,C->C,H->H,G->G,I->I,A->B,A->C,CG->H,G->I,CG->HI,B->H,A->H,AG- >I,CG->HI} Here, first six FDs obtain by reflexive axioms. Example 3: let R=(A,B,C,D) and F={A->B,A->C,BC->D} then compute F+ since A->B and A->C then by union rule A->BC Since BC->D then by projective/Decomposition B->D,C->D Again by transitivity A->B and B->D and A->C and C->D => A->D Hence F+ = { A->A,B->B,C->C , D->D,A->B,A->C,BC->D,B->D,C->D,A->D} Normalization:- 1 Informal Design Guidelines for Relational Databases:- Semantics of the Relation Attributes,Redundant Information in Tuples and Update Anomalies, Null Values in Tuples , Spurious Tuples 2 Functional Dependencies (FDs): Definition of FD, Inference Rules for FDs, Equivalence of Sets of FDs, Minimal Sets of FDs 3 Normal Forms Based on Primary Keys:- Normalization of Relations , Practical Use of Normal Forms , Definitions of Keys and Attributes Participating in Keys , First Normal Form, Second Normal Form, Third Normal Form 108 4 General Normal Form Definitions (For Multiple Keys) 5 BCNF (Boyce-Codd Normal Form) GUIDELINE 1: Informally, each tuple in a relation should represent one entity or relationship instance. (Applies to individual relations and their attributes). Attributes of different entities (EMPLOYEEs, DEPARTMENTs, PROJECTs) should not be mixed in the same relation Only foreign keys should be used to refer to other entities Entity and relationship attributes should be kept apart as much as possible. Bottom Line: Design a schema that can be explained easily relation by relation. The semantics of attributes should be easy to interpret. 1.2 Redundant Information in Tuples and Update Anomalies Mixing attributes of multiple entities may cause problems Information is stored redundantly wasting storage – Problems with update anomalies: Insertion anomalies, Deletion anomalies , Modification anomalies EXAMPLE OF AN UPDATE ANOMALY 109 Consider the relation: EMP_PROJ ( Emp#, Proj#, Ename, Pname, No_hours) Update Anomaly: Changing the name of project number P1 from “Billing” to “Customer- Accounting” may cause this update to be made for all 100 employees working on project P1. Insert Anomaly: Cannot insert a project unless an employee is assigned to. Inversely - Cannot insert an employee unless an he/she is assigned to a project. Delete Anomaly: When a project is deleted, it will result in deleting all the employees who work on that project. Alternately, if an employee is the sole employee on a project, deleting that employee would result in deleting the corresponding project. 110 GUIDELINE 2: Design a schema that does not suffer from the insertion, deletion and update anomalies. If there are any present, then note them so that applications can be made to take them into account GUIDELINE 3: Relations should be designed such that their tuples will have as few NULL values as possible Attributes that are NULL frequently could be placed in separate relations (with the primary key) Reasons for nulls: – attribute not applicable or invalid , attribute value unknown (may exist) ,value known to exist, but unavailable 1.4 Spurious Tuples Bad designs for a relational database may result in erroneous results for certain JOIN operations The "lossless join" property is used to guarantee meaningful results for join operations GUIDELINE 4: The relations should be designed to satisfy the lossless join condition. No spurious tuples should be generated by doing a natural-join of any relations. There are two important properties of decompositions: (a) non-additive or losslessness of the corresponding join (b) preservation of the functional dependencies. 111 Note that property (a) is extremely important and cannot be sacrificed. Property (b) is less stringent and may be sacrificed. (See Chapter 11). 2.1 Functional Dependencies Functional dependencies (FDs) are used to specify formal measures of the "goodness" of relational designs FDs and keys are used to define normal forms for relations FDs are constraints that are derived from the meaning and interrelationships of the data attributes A set of attributes X functionally determines a set of attributes Y if the value of X determines a unique value for Y X -> Y holds if whenever two tuples have the same value for X, they must have the same value for Y For any two tuples t1 and t2 in any relation instance r(R): If t1[X]=t2[X], then t1[Y]=t2[Y] X -> Y in R specifies a constraint on all relation instances r(R) Written as X -> Y; can be displayed graphically on a relation schema as in Figures. ( denoted by the arrow: ). FDs are derived from the real-world constraints on the attributes Examples of FD constraints social security number determines employee name , SSN -> ENAME project number determines project name and location , PNUMBER -> {PNAME, PLOCATION} employee ssn and project number determines the hours per week that the employee works on the project. {SSN, PNUMBER} -> HOURS Examples of FD constraints An FD is a property of the attributes in the schema R The constraint must hold on every relation instance r(R) If K is a key of R, then K functionally determines all attributes in R (since we never have two distinct tuples with t1[K]=t2[K]) 2.2 Inference Rules for FDs 112 Given a set of FDs F, we can infer additional FDs that hold whenever the FDs in F hold Armstrong's inference rules: IR1. (Reflexive) If Y subset-of X, then X -> Y IR2. (Augmentation) If X -> Y, then XZ -> YZ , (Notation: XZ stands for X U Z) IR3. (Transitive) If X -> Y and Y -> Z, then X -> Z IR1, IR2, IR3 form a sound and complete set of inference rules Some additional inference rules that are useful: (Decomposition) If X -> YZ, then X -> Y and X -> Z (Union) If X -> Y and X -> Z, then X -> YZ (Psuedotransitivity) If X -> Y and WY -> Z, then WX -> Z The last three inference rules, as well as any other inference rules, can be deduced from IR1, IR2, and IR3 (completeness property) Closure of a set F of FDs is the set F+ of all FDs that can be inferred from F Closure of a set of attributes X with respect to F is the set X + of all attributes that are functionally determined by X X + can be calculated by repeatedly applying IR1, IR2, IR3 using the FDs in F Equivalence of Sets of FDs Two sets of FDs F and G are equivalent if: - every FD in F can be inferred from G, and , - every FD in G can be inferred from F Hence, F and G are equivalent if F + =G + Definition: F covers G if every FD in G can be inferred from F (i.e., if G + subset-of F +) F and G are equivalent if F covers G and G covers F There is an algorithm for checking equivalence of sets of FDs 2.4 Minimal Sets of FDs A set of FDs is minimal if it satisfies the following conditions: (1) Every dependency in F has a single attribute for its RHS. (2) We cannot remove any dependency from F and have a set of dependencies that is equivalent to F. 113 (3) We cannot replace any dependency X -> A in F with a dependency Y -> A, where Y proper- subset-of X ( Y subset-of X) and still have a set of dependencies that is equivalent to F. (4) Every set of FDs has an equivalent minimal set, There can be several equivalent minimal sets (5) There is no simple algorithm for computing a minimal set of FDs that is equivalent to a set F of FDs (6) To synthesize a set of relations, we assume that we start with a set of dependencies that is a minimal set (e.g., see algorithms 11.2 and 11.4) Normal Forms Based on Primary Keys The normalization process, as first proposed by Codd (1972a), takes a relation schema through a series of tests to certify whether it satisfies a certain normal form. The process, which proceeds in a top-down fashion by evaluating each relation against the criteria for normal forms and decomposing relations as necessary, can thus be considered as relational design by analysis. Initially, Codd proposed three normal forms, which he called first, second, and third normal form. A stronger def-inition of 3NF—called Boyce-Codd normal form (BCNF)—was proposed later by Boyce and Codd. All these normal forms are based on a single analytical tool: the functional dependencies among the attributes of a relation. Later, a fourth normal form (4NF) and a fifth normal form (5NF) were proposed, based on the concepts of multivalued dependencies and join dependencies. Normalization of data can be considered a process of analyzing the given relation schemas based on their FDs and primary keys to achieve the desirable properties of minimizing redundancy and (2) minimizing the insertion, deletion, and update anomalies. It can be considered as a “filtering” or “purifi-cation” process to make the design have successively better quality. Unsatisfactory relation schemas that do not meet certain conditions—the normal form tests—are decomposed into smaller relation schemas that meet the tests and hence possess the desirable properties. Thus, the normalization procedure provides database design-ers with the following: A formal framework for analyzing relation schemas based on their keys and on the functional dependencies among their attributes A series of normal form tests that can be carried out on individual relation schemas so that the relational database can be normalized to any desired degree Definition. The normal form of a relation refers to the highest normal form condition that it meets, and hence indicates the degree to which it has been nor-malized. Normalization: The process of decomposing unsatisfactory "bad" relations by breaking up their attributes into smaller relations Normal form: Condition using keys and FDs of a relation to certify whether a relation schema is in a particular normal form 114 2NF, 3NF, BCNF based on keys and FDs of a relation schema 4NF based on keys, multi-valued dependencies : MVDs; 5NF based on keys, join dependencies : JDs (Chapter 11) Additional properties may be needed to ensure a good relational design (lossless join, dependency preservation; Chapter 11) Normalization is carried out in practice so that the resulting designs are of high quality and meet the desirable properties The practical utility of these normal forms becomes questionable when the constraints on which they are based are hard to understand or to detect The database designers need not normalize to the highest possible normal form. (usually up to 3NF, BCNF or 4NF) Denormalization: the process of storing the join of higher normal form relations as a base relation—which is in a lower normal form A superkey of a relation schema R = {A1, A2,...., An} is a set of attributes S subset-of R with the property that no two tuples t1 and t2 in any legal relation state r of R will have t1[S] = t2[S] A key K is a superkey with the additional property that removal of any attribute from K will cause K not to be a superkey any more. If a relation schema has more than one key, each is called a candidate key. One of the candidate keys is arbitrarily designated to be the primary key, and the others are called secondary keys. A Prime attribute must be a member of some candidate key A Nonprime attribute is not a prime attribute—that is, it is not a member of any candidate key. 3.2 First Normal Form First normal form (1NF) is now considered to be part of the formal definition of a relation in the basic (flat) relational model; historically, it was defined to disallow multivalued attributes, composite attributes, and their combinations. It states that the domain of an attribute must include only atomic (simple, indivisible) values and that the value of any attribute in a tuple must be a single value from the domain of that attribute. Hence, 1NF disallows having a set of values, a tuple of values, or a combination of both as an attribute value for a single tuple. In other words, 1NF dis-allows relations within relations or relations as attribute values within tuples. The only attribute values permitted by 1NF are single atomic (or indivisible) values. Disallows composite attributes, multivalued attributes, and nested relations; attributes whose values for an individual tuple are non-atomic Considered to be part of the definition of relation 115 116 3.3 Second Normal Form Uses the concepts of FDs, primary key Second normal form (2NF) is based on the concept of full functional dependency. A functional dependency X → Y is a full functional dependency if removal of any attribute A from X means that the dependency does not hold any more; that is, for any attribute A ε X, (X – {A}) does not functionally determine Y. A functional dependency X → Y is a partial dependency if some attribute A ε X can be removed from X and the dependency still holds; that is, for some A ε X, (X – {A}) → Y. In Figure 15.3(b), {Ssn, Pnumber} → Hours is a full dependency (neither Ssn → Hours nor Pnumber → Hours holds). However, the dependency {Ssn, Pnumber} → Ename is partial because Ssn → Ename holds. Definition. A relation schema R is in 2NF if every nonprime attribute A in R is fully functionally dependent on the primary key of R. Definitions: Prime attribute - attribute that is member of the primary key K Full functional dependency - a FD Y -> Z where removal of any attribute from Y means the FD does not hold any more Examples: - {SSN, PNUMBER} -> HOURS is a full FD since neither SSN -> HOURS nor PNUMBER -> HOURS hold - {SSN, PNUMBER} -> ENAME is not a full FD (it is called a partial dependency ) since SSN -> ENAME also holds A relation schema R is in second normal form (2NF) if every non-prime attribute A in R is fully functionally dependent on the primary key R can be decomposed into 2NF relations via the process of 2NF normalization 117 118 Third Normal Form Definition: Transitive functional dependency - a FD X -> Z that can be derived from two FDs X -> Y and Y -> Z Examples: - SSN -> DMGRSSN is a transitive FD since SSN -> DNUMBER and DNUMBER -> DMGRSSN hold - SSN -> ENAME is non-transitive since there is no set of attributes X where SSN -> X and X - > ENAME 119 Third Normal Form Third normal form (3NF) is based on the concept of transitive dependency. A functional dependency X → Y in a relation schema R is a transitive dependency if there exists a set of attributes Z in R that is neither a candidate key nor a subset of any key of R,10 and both X → Z and Z → Y hold. The dependency Ssn → Dmgr_ssn is transitive through Dnumber in EMP_DEPT A relation schema R is in third normal form (3NF) if it is in 2NF and no non-prime attribute A in R is transitively dependent on the primary key R can be decomposed into 3NF relations via the process of 3NF normalization NOTE: In X -> Y and Y -> Z, with X as the primary key, we consider this a problem only if Y is not a candidate key. When Y is a candidate key, there is no problem with the transitive dependency. E.g., Consider EMP (SSN, Emp#, Salary ). Here, SSN -> Emp# -> Salary and Emp# is a candidate key. General Normal Form Definitions Definition: Superkey of relation schema R - a set of attributes S of R that contains a key of R A relation schema R is in third normal form (3NF) if whenever a FD X -> A holds in R, then either: (a) X is a superkey of R, or , (b) A is a prime attribute of R NOTE: Boyce-Codd normal form disallows condition (b) above 5 BCNF (Boyce-Codd Normal Form) A relation schema R is in Boyce-Codd Normal Form (BCNF) if whenever an FD X -> A holds in R, then X is a superkey of R Each normal form is strictly stronger than the previous one – Every 2NF relation is in 1NF, Every 3NF relation is in 2NF, Every BCNF relation is in 3NF 120 There exist relations that are in 3NF but not in BCNF , The goal is to have each relation in BCNF (or 3NF) Achieving the BCNF by Decomposition Two FDs exist in the relation TEACH: fd1: { student, course} -> instructor fd2: instructor -> course {student, course} is a candidate key for this relation and that the dependencies shown follow the pattern in Figure 10.12 (b). So this relation is in 3NF but not in BCNF A relation NOT in BCNF should be decomposed so as to meet this property, while possibly forgoing the preservation of all functional dependencies in the decomposed relations. (See Algorithm 11.3) Achieving the BCNF by Decomposition Three possible decompositions for relation TEACH 1. {student, instructor} and {student, course} 2. {course, instructor } and {course, student} 121 3. {instructor, course } and {instructor, student} All three decompositions will lose fd1. We have to settle for sacrificing the functional dependency preservation. But we cannot sacrifice the non-additivity property after decomposition. Out of the above three, only the 3rd decomposition will not generate spurious tuples after join.(and hence has the non-additivity property). A test to determine whether a binary decomposition (decomposition into two relations) is nonadditive (lossless) is discussed in section 11.1.4 under Property LJ1. Verify that the third decomposition above meets the property. 122 Transaction concurrency control Recovery 1. Introduction to Transaction Processing Concepts and Theory The concept of transaction provides a mechanism for describing logical units of database processing. Transaction processing systems are systems with large databases and hundreds of concurrent users executing database transactions. Examples of such systems include airline reservations, banking, credit card processing, online retail purchasing, stock markets, supermarket checkouts, and many other applications. These systems require high availability and fast response time for hundreds of concurrent users. We define the concept of a transaction, which is used to represent a logical unit of database processing that must be completed in its entirety to ensure correct-ness. A transaction is typically implemented by a computer program, which includes database commands such as retrievals, insertions, deletions, and updates. Introduction to Transaction Processing 1.1 Single-User versus Multiuser Systems One criterion for classifying a database system is according to the number of users who can use the system concurrently. A DBMS is single-user if at most one user at a time can use the system, and it is multiuser if many users can use the system—and hence access the database—concurrently. Single-user DBMSs are mostly restricted to personal computer systems; most other DBMSs are multiuser. For example, an airline reservations system is used by hundreds of travel agents and reservation clerks concurrently. Database systems used in banks, insurance agencies, stock exchanges, supermarkets, and many other applications are multiuser systems. In these systems, hundreds or thousands of users are typically operating on the data-base by submitting transactions concurrently to the system process, and so on. A process is resumed at the point where it was suspended when-ever it gets its turn to use the CPU again. Hence, concurrent execution of processes is actually interleaved, which shows two processes, A and B, executing concurrently in an interleaved fashion. Interleaving keeps the CPU busy when a process requires an input or output (I/O) operation, such as reading a block from disk. The CPU is switched to execute another process rather than remaining idle during I/O time. Interleaving also prevents a long process from delaying other processes. If the computer system has multiple hardware processors (CPUs), parallel process-ing of multiple processes is possible, as illustrated by processes C and D in Figure. Most of the theory concerning concurrency control in databases is developed in terms of interleaved concurrency. In a multiuser DBMS, the stored data items are the primary resources that may be accessed concurrently by interactive users or application programs, which are constantly retrieving information from and modifying the database. 1.1.2 Transactions, Database Items, Read and Write Operations, and DBMS Buffers A transaction is an executing program that forms a logical unit of database process-ing. A transaction includes one or more database access operations—these can include insertion, deletion, modification, or 123 retrieval operations. The database operations that form a transaction can either be embedded within an application program or they can be specified interactively via a high-level query language such as SQL. One way of specifying the transaction boundaries is by specifying explicit begin transaction and end transaction statements in an application program; in this case, all database access operations between the two are considered as forming one transaction. A single application program may contain more than one transac-tion if it contains several transaction boundaries. If the database operations in a transaction do not update the database but only retrieve data, the transaction is called a read-only transaction; otherwise it is known as a read-write transaction. The database model that is used to present transaction processing concepts is quite simple when compared to the data models that we discussed earlier in the book, such as the relational model or the object model. A database is basically represented as a collection of named data items. The size of a data item is called its granularity. A data item can be a database record, but it can also be a larger unit such as a whole disk block, or even a smaller unit such as an individual field (attribute) value of some record in the database. The transaction processing concepts we discuss are inde-pendent of the data item granularity (size) and apply to data items in general. Each data item has a unique name, but this name is not typically used by the programmer; rather, it is just a means to uniquely identify each data item. For example, if the data item granularity is one disk block, then the disk block address can be used as the data item name. Using this simplified database model, the basic database access operations that a transaction can include are as follows: read_item(X). Reads a database item named X into a program variable. To simplify our notation, we assume that the program variable is also named X. write_item(X). Writes the value of program variable X into the database item named X. The basic unit of data transfer from disk to main memory is one block. Executing a read_item(X) command includes the following steps: Find the address of the disk block that contains item X. Copy that disk block into a buffer in main memory (if that disk block is not already in some main memory buffer). Copy item X from the buffer to the program variable named X. Executing a write_item(X) command includes the following steps: Find the address of the disk block that contains item X. Copy that disk block into a buffer in main memory (if that disk block is not already in some main memory buffer). Copy item X from the program variable named X into its correct location in the buffer. Store the updated block from the buffer back to disk (either immediately or at some later point in time). It is step 4 that actually updates the database on disk. In some cases the buffer is not immediately stored to disk, in case additional changes are to be made to the buffer. Usually, the decision about when to store a modified disk block whose contents are in a main memory buffer is handled by the recovery manager of the DBMS in coop-eration with the underlying operating system. The DBMS will maintain in the database cache a number of data buffers in main memory. Each buffer typically holds the contents of one database disk block, which contains some of the database items being processed. When these buffers are all occupied, and additional database disk blocks must be copied into memory, some buffer 124 replacement policy is used to choose which of the current buffers is to be replaced. If the chosen buffer has been modified, it must be written back to disk before it is reused. A transaction includes read_item and write_item operations to access and update the database. Figure 1.2 shows examples of two very simple transactions. The read-set of a transaction is the set of all items that the transaction reads, and the write-set is the set of all items that the transaction writes. For example, the read-set of T1 in Figure 1.2 is {X, Y} and its write-set is also {X, Y}. Concurrency control and recovery mechanisms are mainly concerned with the database commands in a transaction. Transactions submitted by the various users may execute concurrently and may access and update the same database items. If this concurrent execution is uncontrolled, it may lead to problems, such as an incon-sistent database. In the next section we informally introduce some of the problems that may occur. 1.1.3 Why Concurrency Control Is Needed Several problems can occur when concurrent transactions execute in an uncon-trolled manner. We illustrate some of these problems by referring to a much simpli-fied airline reservations database in which a record is stored for each airline flight. Each record includes the number of reserved seats on that flight as a named (uniquely identifiable) data item, among other information. Figure 1.2(a) shows a transac- tion T1 that transfers N reservations from one flight whose number of reserved seats is stored in the database item named X to another flight whose number of reserved seats is stored in the database item named Y. Figure 1.2(b) shows a simpler trans-action T2 that just reserves M seats on the first flight (X) referenced in transaction T1.2 To simplify our example, we do not show additional portions of the transac-tions, such as checking whether a flight has enough seats available before reserving additional seats. Figure 1.2 Two sample transac-tions. (a) Transaction T1. (b) Transaction T2. We will not discuss buffer replacement policies here because they are typically discussed in operating systems textbooks. A similar, more commonly used example assumes a bank database, with one transaction doing a trans-fer of funds from account X to account Y and the other transaction doing a deposit to account X. When a database access program is written, it has the flight number, flight date, and the number of seats to be booked as parameters; hence, the same program can be used to execute many 125 different transactions, each with a different flight number, date, and number of seats to be booked. For concurrency control purposes, a trans-action is a particular execution of a program on a specific date, flight, and number of seats. In Figure 1.2(a) and (b), the transactions T1 and T2 are specific executions of the programs that refer to the specific flights whose numbers of seats are stored in data items X and Y in the database. Next we discuss the types of problems we may encounter with these two simple transactions if they run concurrently. The Lost Update Problem. This problem occurs when two transactions that access the same database items have their operations interleaved in a way that makes the value of some database items incorrect. Suppose that transactions T1 and T2 are submitted at approximately the same time, and suppose that their operations are interleaved as shown in Figure 21.3(a); then the final value of item X is incorrect because T2 reads the value of X before T1 changes it in the database, and hence the updated value resulting from T1 is lost. For example, if X = 80 at the start (originally there were 80 reservations on the flight), N = 5 (T1 transfers 5 seat reservations from the flight corresponding to X to the flight corresponding to Y), and M = 4 (T2 reserves 4 seats on X), the final result should be X = 79. However, in the interleaving of operations shown in Figure 21.3(a), it is X = 84 because the update in T1 that removed the five seats from X was lost. The Temporary Update (or Dirty Read) Problem. This problem occurs when one transaction updates a database item and then the transaction fails for some rea-son. Meanwhile, the updated item is accessed (read) by another transaction before it is changed back to its original value. Figure 1.3(b) shows an example where T1 updates item X and then fails before completion, so the system must change X back to its original value. Before it can do so, however, transaction T2 reads the temporary value of X, which will not be recorded permanently in the data-base because of the failure of T1. The value of item X that is read by T2 is called dirty data because it has been created by a transaction that has not completed and com-mitted yet; hence, this problem is also known as the dirty read problem. The Incorrect Summary Problem. If one transaction is calculating an aggregate summary function on a number of database items while other transactions are updating some of these items, the aggregate function may calculate some values before they are updated and others after they are updated. For example, suppose that a transaction T3 is calculating the total number of reservations on all the flights; meanwhile, transaction T1 is executing. If the interleaving of operations shown in Figure 21.3(c) occurs, the result of T3 will be off by an amount N because T3 reads the value of X after N seats have been 126 subtracted from it but reads the value of Y before those N seats have been added to it. Figure 1.3 Some problems that occur when concurrent execution is uncontrolled. (a) The lost update problem. (b) The temporary update problem. (c) The incorrect summary problem. Item X has an incorrect value because its update by T1 is lost (overwritten). Transaction T1 fails and must change the value of X back to its old value; meanwhile T2 has read the temporary incorrect value of X. T3 reads X after N is subtracted and reads Y before N is added; a wrong summary is the result (off by N ). The Unrepeatable Read Problem. Another problem that may occur is called unrepeatable read, where a transaction T reads the same item twice and the item is changed by another transaction T between the two reads. Hence, T receives different values for its two reads of the same item. This may occur, for example, if during an airline reservation transaction, a customer inquires about seat availability on several flights. When the customer decides on a particular flight, the transaction then reads the number of seats on that flight a second time before completing the reservation, and it may end up reading a different value for the item. Concurrency Control :- ==================== Process of managing simultaneous operations on the database without having them interfere with one another. Prevents interference when two or more users are accessing database simultaneously and at least one 127 is updating data. Although two transactions may be correct in themselves, interleaving of operations may produce an incorrect result. Why Recovery Is Needed 128 Whenever a transaction is submitted to a DBMS for execution, the system is respon-sible for making sure that either all the operations in the transaction are completed successfully and their effect is recorded permanently in the database, or that the transaction does not have any effect on the database or any other transactions. In the first case, the transaction is said to be committed, whereas in the second case, the transaction is aborted. The DBMS must not permit some operations of a trans-action T to be applied to the database while other operations of T are not, because the whole transaction is a logical unit of database processing. If a transaction fails after executing some of its operations but before executing all of them, the opera-tions already executed must be undone and have no lasting effect. Types of Failures. Failures are generally classified as transaction, system, and media failures. There are several possible reasons for a transaction to fail in the mid-dle of execution: A computer failure (system crash). A hardware, software, or network error occurs in the computer system during transaction execution. Hardware crashes are usually media failures—for example, main memory failure. A transaction or system error. Some operation in the transaction may cause it to fail, such as integer overflow or division by zero. Transaction failure may also occur because of erroneous parameter values or because of a logical programming error. 3 Additionally, the user may interrupt the transaction during its execution. Local errors or exception conditions detected by the transaction. During transaction execution, certain conditions may occur that necessitate cancel-lation of the transaction. For example, data for the transaction may not be found. An exception condition,4 such as insufficient account balance in a banking database, may cause a transaction, such as a fund withdrawal, to be canceled. This exception could be programmed in the transaction itself, and in such a case would not be considered as a transaction failure. Disk failure. Some disk blocks may lose their data because of a read or write malfunction or because of a disk read/write head crash. This may happen during a read or a write operation of the transaction. Physical problems and catastrophes. This refers to an endless list of prob-lems that includes power or air-conditioning failure, fire, theft, sabotage, overwriting disks or tapes by mistake, and mounting of a wrong tape by the operator. Failures of types 1, 2, 3, and 4 are more common than those of types 5 or 6. Whenever a failure of type 1 through 4 occurs, the system must keep sufficient information to quickly recover from the failure. Disk failure or other catastrophic failures of type 5 or 6 do not happen frequently; if they do occur, recovery is a major task. we will discuss recovery in later. The concept of transaction is fundamental to many techniques for concurrency control and recovery from failures. 129 1.2 Transaction and System Concepts A transaction is an atomic unit of work that should either be completed in its entirety or not done at all. For recovery purposes, the system needs to keep track of when each transaction starts, terminates, and commits or aborts.Therefore, the recovery manager of the DBMS needs to keep track of the following operations: BEGIN_TRANSACTION. This marks the beginning of transaction execution. READ or WRITE. These specify read or write operations on the database items that are executed as part of a transaction. END_TRANSACTION. This specifies that READ and WRITE transaction oper-ations have ended and marks the end of transaction execution. However, at this point it may be necessary to check whether the changes introduced by the transaction can be permanently applied to the database (committed) or whether the transaction has to be aborted because it violates serializability. COMMIT_TRANSACTION. This signals a successful end of the transaction so that any changes (updates) executed by the transaction can be safely committed to the database and will not be undone. ROLLBACK (or ABORT). This signals that the transaction has ended unsuc-cessfully, so that any changes or effects that the transaction may have applied to the database must be undone. Figure 1.4 shows a state transition diagram that illustrates how a transaction moves through its execution states. A transaction goes into an acti