Database Management Systems PDF

Document Details

Malla Reddy College of Engineering & Technology

Tags

database management systems DBMS database design SQL

Summary

This document is a syllabus for a Database Management Systems course, likely for undergraduate computer science students at a college or university. It outlines the course's objectives, major units (including database system applications, data modeling and SQL), topics covered, required and recommended textbooks. The course appears to focus on fundamentals of database systems and SQL.

Full Transcript

1 DATABASE MANAGEMENT SYSTEM (DBMS) (R-13 Autonomous) Department of Computer Science & Engineering s Malla Reddy College of Engineering & Technology (Accredited by NBA with NAAC-A Grade, UGC-Autonomous, ISO Cert...

1 DATABASE MANAGEMENT SYSTEM (DBMS) (R-13 Autonomous) Department of Computer Science & Engineering s Malla Reddy College of Engineering & Technology (Accredited by NBA with NAAC-A Grade, UGC-Autonomous, ISO Certified Institution) Maisammaguda, Near Kompally, Medchal Road, Sec’bad-500 100. 2 SYLLABUS (R15A0509) DATABASE MANAGEMENT SYSTEMS Objectives: To Understand the basic concepts and the applications of database systems To Master the basics of SQL and construct queries using SQL To understand the relational database design principles To become familiar with the basic issues of transaction processing and concurrency control To become familiar with database storage structures and access techniques UNIT I: Data base System Applications, Purpose of Database Systems, View of Data – Data Abstraction – Instances and Schemas – data Models – the ER Model – Relational Model – Other Models – Database Languages – DDL – DML – database Access for applications Programs – data base Users and Administrator – Transaction Management – data base Architecture – Storage Manager – the Query Processor Data base design and ER diagrams – ER Model - Entities, Attributes and Entity sets – Relationships and Relationship sets – ER Design Issues – Concept Design – Conceptual Design for University Enterprise. Introduction to the Relational Model – Structure – Database Schema, Keys – Schema Diagrams UNIT II: Relational Query Languages, Relational Operations. Relational Algebra – Selection and projection set operations – renaming – Joins – Division – Examples of Algebra overviews – Relational calculus – Tuple relational Calculus – Domain relational calculus. Overview of the SQL Query Language – Basic Structure of SQL Queries, Set Operations, Aggregate Functions – GROUPBY – HAVING, Nested Sub queries, Views, Triggers. UNIT III: Normalization – Introduction, Non loss decomposition and functional dependencies, First, Second, and third normal forms – dependency preservation, Boyee/Codd normal form. Higher Normal Forms - Introduction, Multi-valued dependencies and Fourth normal form, Join dependencies and Fifth normal form UNIT IV: Transaction Concept- Transaction State- Implementation of Atomicity and Durability – Concurrent – Executions – Serializability- Recoverability – Implementation of Isolation – Testing for serializability- Lock –Based Protocols – Timestamp Based Protocols- Validation- Based Protocols – Multiple Granularity. Recovery and Atomicity – Log – Based Recovery – Recovery with Concurrent Transactions – Buffer Management – Failure with loss of nonvolatile storage-Advance Recovery systems- Remote Backup systems. UNIT V: File organization:– File organization – various kinds of indexes. Query Processing – Measures of query cost - Selection operation – Projection operation, - Join operation – set operation and aggregate operation – Relational Query Optimization – Transacting SQL queries – Estimating the cost – Equivalence Rules. 3 TEXT BOOKS: 1. Data base System Concepts, Silberschatz, Korth, McGraw hill, Sixth Edition.(All UNITS except III th) 2. Data base Management Systems, Raghurama Krishnan, Johannes Gehrke, TATA McGrawHill 3rd Edition. REFERENCE BOOKS: 1. Fundamentals of Database Systems, Elmasri Navathe Pearson Education. 2. An Introduction to Database systems, C.J. Date, A.Kannan, S.Swami Nadhan, Pearson, Eight Edition for UNIT III. URLs: Outcomes: Demonstrate the basic elements of a relational database management system Ability to identify the data models for relevant problems Ability to design entity relationship and convert entity relationship diagrams into RDBMS and formulate SQL queries on the respect data Apply normalization for the development of application software’s 4 UNIT-1 Introduction to Database Management System As the name suggests, the database management system consists of two parts. They are: 1. Database and 2. Management System What is a Database? To find out what database is, we have to start from data, which is the basic building block of any DBMS. Data: Facts, figures, statistics etc. having no particular meaning (e.g. 1, ABC, 19 etc). Record: Collection of related data items, e.g. in the above example the three data items had no meaning. But if we organize them in the following way, then they collectively represent meaningful information. Roll Name Age 1 ABC 19 Table or Relation: Collection of related records. Roll Name Age 1 ABC 19 2 DEF 22 3 XYZ 28 The columns of this relation are called Fields, Attributes or Domains. The rows are called Tuples or Records. Database: Collection of related relations. Consider the following collection of tables: T1 T2 Roll Name Age Roll Address 1 ABC 19 1 KOL 2 DEF 22 2 DEL 3 XYZ 28 3 MUM T3 T4 Roll Year Year Hostel 1 I I H1 2 II II H2 3 I We now have a collection of 4 tables. They can be called a “related collection” because we can clearly find out that there are some common attributes existing in a selected pair of tables. Because of these common attributes we may combine the data of two or more tables together to find out the complete details of a student. Questions like “Which hostel does the youngest student live in?” can be answered now, although 5 Age and Hostel attributes are in different tables. A database in a DBMS could be viewed by lots of different people with different responsibilities. Figure 1.1: Empolyees are accessing Data through DBMS For example, within a company there are different departments, as well as customers, who each need to see different kinds of data. Each employee in the company will have different levels of access to the database with their own customized front-end application. In a database, data is organized strictly in row and column format. The rows are called Tuple or Record. The data items within one row may belong to different data types. On the other hand, the columns are often called Domain or Attribute. All the data items within a single attribute are of the same data type. What is Management System? A database-management system (DBMS) is a collection of interrelated data and a set of programs to access those data. This is a collection of related data with an implicit meaning and hence is a database. The collection of data, usually referred to as the database, contains information relevant to an enterprise. The primary goal of a DBMS is to provide a way to store and retrieve database information that is both convenient and efficient. By data, we mean known facts that can be recorded and that have implicit meaning. The management system is important because without the existence of some kind of rules and regulations it is not possible to maintain the database. We have to select the particular attributes which should be included in a particular table; the common attributes to create relationship between two tables; if a new record has to be inserted or deleted then which tables should have to be handled etc. These issues must be resolved by having some kind of rules to follow in order to maintain the integrity of the database. Database systems are designed to manage large bodies of information. Management of data involves both defining structures for storage of information and providing mechanisms for the manipulation of information. In addition, the database system must ensure the safety of the information stored, despite system crashes or attempts at unauthorized access. If data are to be shared among several users, the system must avoid possible anomalous results. Because information is so important in most organizations, computer scientists have developed a large body of concepts and techniques for managing data. These concepts and technique form the focus of this book. This 6 chapter briefly introduces the principles of database systems. Database Management System (DBMS) and Its Applications: A Database management system is a computerized record-keeping system. It is a repository or a container for collection of computerized data files. The overall purpose of DBMS is to allow he users to define, store, retrieve and update the information contained in the database on demand. Information can be anything that is of significance to an individual or organization. Databases touch all aspects of our lives. Some of the major areas of application are as follows: 1. Banking 2. Airlines 3. Universities 4. Manufacturing and selling 5. Human resources Enterprise Information ◦ Sales: For customer, product, and purchase information. ◦ Accounting: For payments, receipts, account balances, assets and other accounting information. ◦ Human resources: For information about employees, salaries, payroll taxes, and benefits, and for generation of paychecks. ◦ Manufacturing: For management of the supply chain and for tracking production of items in factories, inventories of items inwarehouses and stores, and orders for items. Online retailers: For sales data noted above plus online order tracking,generation of recommendation lists, and maintenance of online product evaluations. Banking and Finance ◦ Banking: For customer information, accounts, loans, and banking transactions. ◦ Credit card transactions: For purchases on credit cards and generation of monthly statements. ◦ Finance: For storing information about holdings, sales, and purchases of financial instruments such as stocks and bonds; also for storing real-time market data to enable online trading by customers and automated trading by the firm. Universities: For student information, course registrations, and grades (in addition to standard enterprise information such as human resources and accounting). Airlines: For reservations and schedule information. Airlines were among the first to use databases in a geographically distributed manner. Telecommunication: For keeping records of calls made, generating monthly bills, maintaining balances on prepaid calling cards, and storing information about the communication networks. Purpose of Database Systems Database systems arose in response to early methods of computerized management of commercial data. As an example of such methods, typical of the 1960s, consider part of a university organization that, among other data, keeps information about all instructors, students, departments, and course offerings. One way to keep the information on a computer is to store it in operating system files. To allow users to manipulate the information, the system has a number of application programs that manipulate the files, including programs to:  Add new students, instructors, and courses  Register students for courses and generate class rosters  Assign grades to students, compute grade point averages (GPA), and generate transcripts 7 System programmers wrote these application programs to meet the needs of the university. New application programs are added to the system as the need arises. For example, suppose that a university decides to create a new major (say, computer science).As a result, the university creates a new department and creates new permanent files (or adds information to existing files) to record information about all the instructors in the department, students in that major, course offerings, degree requirements, etc. The university may have to write new application programs to deal with rules specific to the new major. New application programs may also have to be written to handle new rules in the university. Thus, as time goes by, the system acquires more files and more application programs. This typical file-processing system is supported by a conventional operating system. The system stores permanent records in various files, and it needs different application programs to extract records from, and add records to, the appropriate files. Before database management systems (DBMSs) were introduced, organizations usually stored information in such systems. Keeping organizational information in a file- processing system has a number of major disadvantages: Data redundancy and inconsistency. Since different programmers create the files and application programs over a long period, the various files are likely to have different structures and the programs may be written in several programming languages. Moreover, the same information may be duplicated in several places (files). For example, if a student has a double major (say, music and mathematics) the address and telephone number of that student may appear in a file that consists of student records of students in the Music department and in a file that consists of student records of students in the Mathematics department. This redundancy leads to higher storage and access cost. In addition, it may lead to data inconsistency; that is, the various copies of the same data may no longer agree. For example, a changed student address may be reflected in the Music department records but not elsewhere in the system. Difficulty in accessing data. Suppose that one of the university clerks needs to find out the names of all students who live within a particular postal-code area. The clerk asks the data-processing department to generate such a list. Because the designers of the original system did not anticipate this request, there is no application program on hand to meet it. There is, however, an application program to generate the list of all students. The university clerk has now two choices: either obtain the list of all students and extract the needed information manually or ask a programmer to write the necessary application program. Both alternatives are obviously unsatisfactory. Suppose that such a program is written, and that, several days later, the same clerk needs to trim that list to include only those students who have taken at least 60 credit hours. As expected, a program to generate such a list does not exist. Again, the clerk has the preceding two options, neither of which is satisfactory. The point here is that conventional file-processing environments do not allow needed data to be retrieved in a convenient and efficient manner. More responsive data-retrieval systems are required for general use. Data isolation. Because data are scattered in various files, and files may be in different formats, writing new application programs to retrieve the appropriate data is difficult. Integrity problems. The data values stored in the database must satisfy certain types of consistency constraints. Suppose the university maintains an account for each department, and records the balance amount in each account. Suppose also that the university requires that the account balance of a department may never fall below zero. Developers enforce these constraints in the system by adding appropriate code in the various application programs. However, when new constraints are added, it is difficult to change the programs to enforce them. The problem is compounded when constraints involve several data items from different files. 8 Atomicity problems. A computer system, like any other device, is subject to failure. In many applications, it is crucial that, if a failure occurs, the data be restored to the consistent state that existed prior to the failure. Consider a program to transfer $500 from the account balance of department A to the account balance of department B. If a system failure occurs during the execution of the program, it is possible that the $500 was removed from the balance of department A but was not credited to the balance of department B, resulting in an inconsistent database state. Clearly, it is essential to database consistency that either both the credit and debit occur, or that neither occur. That is, the funds transfer must be atomic—it must happen in its entirety or not at all. It is difficult to ensure atomicity in a conventional file-processing system. Concurrent-access anomalies. For the sake of overall performance of the system and faster response, many systems allow multiple users to update the data simultaneously. Indeed, today, the largest Internet retailers may have millions of accesses per day to their data by shoppers. In such an environment, interaction of concurrent updates is possible and may result in inconsistent data. Consider department A, with an account balance of $10,000. If two department clerks debit the account balance (by say $500 and $100, respectively) of department A at almost exactly the same time, the result of the concurrent executions may leave the budget in an incorrect (or inconsistent) state. Suppose that the programs executing on behalf of each withdrawal read the old balance, reduce that value by the amount being withdrawn, and write the result back. If the two programs run concurrently, they may both read the value $10,000, and write back $9500 and $9900, respectively. Depending on which one writes the value last, the account balance of department A may contain either $9500 or $9900, rather than the correct value of $9400. To guard against this possibility, the system must maintain some form of supervision. But supervision is difficult to provide because data may be accessed by many different application programs that have not been coordinated previously. As another example, suppose a registration program maintains a count of students registered for a course, in order to enforce limits on the number of students registered. When a student registers, the program reads the current count for the courses, verifies that the count is not already at the limit, adds one to the count, and stores the count back in the database. Suppose two students register concurrently, with the count at (say) 39. The two program executions may both read the value 39, and both would then write back 40, leading to an incorrect increase of only 1, even though two students successfully registered for the course and the count should be 41. Furthermore, suppose the course registration limit was 40; in the above case both students would be able to register, leading to a violation of the limit of 40 students. Security problems. Not every user of the database system should be able to access all the data. For example, in a university, payroll personnel need to see only that part of the database that has financial information. They do not need access to information about academic records. But, since application programs are added to the file-processing system in an ad hoc manner, enforcing such security constraints is difficult. These difficulties, among others, prompted the development of database systems. In what follows, we shall see the concepts and algorithms that enable database systems to solve the problems with file-processing systems. Advantages of DBMS: Controlling of Redundancy: Data redundancy refers to the duplication of data (i.e storing same data multiple times). In a database system, by having a centralized database and centralized control of data by the DBA the unnecessary duplication of data is avoided. It also eliminates the extra time for processing the large volume of data. It results in saving the storage space. 9 Improved Data Sharing : DBMS allows a user to share the data in any number of application programs. Data Integrity : Integrity means that the data in the database is accurate. Centralized control of the data helps in permitting the administrator to define integrity constraints to the data in the database. For example: in customer database we can can enforce an integrity that it must accept the customer only from Noida and Meerut city. Security : Having complete authority over the operational data, enables the DBA in ensuring that the only mean of access to the database is through proper channels. The DBA can define authorization checks to be carried out whenever access to sensitive data is attempted. Data Consistency : By eliminating data redundancy, we greatly reduce the opportunities for inconsistency. For example: is a customer address is stored only once, we cannot have disagreement on the stored values. Also updating data values is greatly simplified when each value is stored in one place only. Finally, we avoid the wasted storage that results from redundant data storage. Efficient Data Access : In a database system, the data is managed by the DBMS and all access to the data is through the DBMS providing a key to effective data processing Enforcements of Standards : With the centralized of data, DBA can establish and enforce the data standards which may include the naming conventions, data quality standards etc. Data Independence : Ina database system, the database management system provides the interface between the application programs and the data. When changes are made to the data representation, the meta data obtained by the DBMS is changed but the DBMS is continues to provide the data to application program in the previously used way. The DBMs handles the task of transformation of data wherever necessary. Reduced Application Development and Maintenance Time : DBMS supports many important functions that are common to many applications, accessing data stored in the DBMS, which facilitates the quick development of application. Disadvantages of DBMS 1) It is bit complex. Since it supports multiple functionality to give the user the best, the underlying software has become complex. The designers and developers should have thorough knowledge about the software to get the most out of it. 2) Because of its complexity and functionality, it uses large amount of memory. It also needs large memory to run efficiently. 3) DBMS system works on the centralized system, i.e.; all the users from all over the world access this database. Hence any failure of the DBMS, will impact all the users. 4) DBMS is generalized software, i.e.; it is written work on the entire systems rather specific one. Hence some of the application will run slow. View of Data A database system is a collection of interrelated data and a set of programs that allow users to access and modify these data. A major purpose of a database system is to provide users with an abstract view of the data. That is, the system hides certain details of how the data are stored and maintained. 10 Data Abstraction For the system to be usable, it must retrieve data efficiently. The need for efficiency has led designers to use complex data structures to represent data in the database. Since many database-system users are not computer trained, developers hide the complexity from users through several levels of abstraction, to simplify users’ interactions with the system: Database DISK Figure 1.2 : Levels of Abstraction in a DBMS Physical level (or Internal View / Schema): The lowest level of abstraction describes how the data are actually stored. The physical level describes complex low-level data structures in detail. Logical level (or Conceptual View / Schema): The next-higher level of abstraction describes what data are stored in the database, and what relationships exist among those data. The logical level thus describes the entire database in terms of a small number of relatively simple structures. Although implementation of the simple structures at the logical level may involve complex physical-level structures, the user of the logical level does not need to be aware of this complexity. This is referred to as physical data independence. Database administrators, who must decide what information to keep in the database, use the logical level of abstraction. View level (or External View / Schema): The highest level of abstraction describes only part of the entire database. Even though the logical level uses simpler structures, complexity remains because of the variety of information stored in a large database. Many users of the database system do not need all this information; instead, they need to access only a part of the database. The view level of abstraction exists to simplify their interaction with the system. The system may provide many views for the same database. Figure 1.2 shows the relationship among the three levels of abstraction. An analogy to the concept of data types in programming languages may clarify the distinction among levels of abstraction. Many high-level programming languages support the notion of a structured type. For example, we may describe a record as follows: type instructor = record ID : char (5); name : char (20); dept name : char (20); salary : numeric (8,2); end; This code defines a new record type called instructor with four fields. Each field has a name and a type associated with it. A university organization may have several such record types, including 11 department, with fields dept_name, building, and budget course, with fields course_id, title, dept_name, and credits student, with fields ID, name, dept_name, and tot_cred At the physical level, an instructor, department, or student record can be described as a block of consecutive storage locations. The compiler hides this level of detail from programmers. Similarly, the database system hides many of the lowest-level storage details from database programmers. Database administrators, on the other hand, may be aware of certain details of the physical organization of the data. At the logical level, each such record is described by a type definition, as in the previous code segment, and the interrelationship of these record types is defined as well. Programmers using a programming language work at this level of abstraction. Similarly, database administrators usually work at this level of abstraction. Finally, at the view level, computer users see a set of application programs that hide details of the data types. At the view level, several views of the database are defined, and a database user sees some or all of these views. In addition to hiding details of the logical level of the database, the views also provide a security mechanism to prevent users from accessing certain parts of the database. For example, clerks in the university registrar office can see only that part of the database that has information about students; they cannot access information about salaries of instructors. Instances and Schemas Databases change over time as information is inserted and deleted. The collection of information stored in the database at a particular moment is called an instance of the database. The overall design of the database is called the database schema. Schemas are changed infrequently, if at all. The concept of database schemas and instances can be understood by analogy to a program written in a programming language. A database schema corresponds to the variable declarations (along with associated type definitions) in a program. Each variable has a particular value at a given instant. The values of the variables in a program at a point in time correspond to an instance of a database schema. Database systems have several schemas, partitioned according to the levels of abstraction. The physical schema describes the database design at the physical level, while the logical schema describes the database design at the logical level. A database may also have several schemas at the view level, sometimes called subschemas, which describe different views of the database. Of these, the logical schema is by far the most important, in terms of its effect on application programs, since programmers construct applications by using the logical schema. The physical schema is hidden beneath the logical schema, and can usually be changed easily without affecting application programs. Application programs are said to exhibit physical data independence if they do not depend on the physical schema, and thus need not be rewritten if the physical schema changes. Data Models Underlying the structure of a database is the data model: a collection of conceptual tools for describing data, data relationships, data semantics, and consistency constraints. A data model provides a way to describe the design of a database at the physical, logical, and view levels. The data models can be classified into four different categories: 12 Relational Model. The relational model uses a collection of tables to represent both data and the relationships among those data. Each table has multiple columns, and each column has a unique name. Tables are also known as relations. The relational model is an example of a record-based model. Record-based models are so named because the database is structured in fixed-format records of several types. Each table contains records of a particular type. Each record type defines a fixed number of fields, or attributes. The columns of the table correspond to the attributes of the record type. The relational data model is the most widely used data model, and a vast majority of current database systems are based on the relational model. Entity-Relationship Model. The entity-relationship (E-R) data model uses a collection of basic objects, called entities, and relationships among these objects. An entity is a “thing” or “object” in the real world that is distinguishable from other objects. The entity- relationship model is widely used in database design. Object-Based Data Model. Object-oriented programming (especially in Java, C++, or C#) has become the dominant software-development methodology. This led to the development of an object-oriented data model that can be seen as extending the E-R model with notions of encapsulation, methods (functions), and object identity. The object-relational data model combines features of the object-oriented data model and relational data model. Semi-structured Data Model. The semi-structured data model permits the specification of data where individual data items of the same type may have different sets of attributes. This is in contrast to the data models mentioned earlier, where every data item of a particular type must have the same set of attributes. The Extensible Markup Language (XML) is widely used to represent semi-structured data. Historically, the network data model and the hierarchical data model preceded the relational data model. These models were tied closely to the underlying implementation, and complicated the task of modeling data. As a result they are used little now, except in old database code that is still in service in some places. Database Languages A database system provides a data-definition language to specify the database schema and a data-manipulation language to express database queries and updates. In practice, the data- definition and data-manipulation languages are not two separate languages; instead they simply form parts of a single database language, such as the widely used SQL language. Data-Manipulation Language A data-manipulation language (DML) is a language that enables users to access or manipulate data as organized by the appropriate data model. The types of access are: Retrieval of information stored in the database Insertion of new information into the database Deletion of information from the database Modification of information stored in the database There are basically two types: Procedural DMLs require a user to specify what data are needed and how to get those data. Declarative DMLs (also referred to as nonprocedural DMLs) require a user to specify what data are needed without specifying how to get those data. Declarative DMLs are usually easier to learn and use than are procedural DMLs. However, since a user does not have to specify how to get the data, the database system has to figure out an efficient means of accessing 13 data. A query is a statement requesting the retrieval of information. The portion of a DML that involves information retrieval is called a query language. Although technically incorrect, it is common practice to use the terms query language and data-manipulation language synonymously. Data-Definition Language (DDL) We specify a database schema by a set of definitions expressed by a special language called a data-definition language (DDL). The DDL is also used to specify additional properties of the data. We specify the storage structure and access methods used by the database system by a set of statements in a special type of DDL called a data storage and definition language. These statements define the implementation details of the database schemas, which are usually hidden from the users. The data values stored in the database must satisfy certain consistency constraints. For example, suppose the university requires that the account balance of a department must never be negative. The DDL provides facilities to specify such constraints. The database system checks these constraints every time the database is updated. In general, a constraint can be an arbitrary predicate pertaining to the database. However, arbitrary predicates may be costly to test. Thus, database systems implement integrity constraints that can be tested with minimal overhead. Domain Constraints. A domain of possible values must be associated with every attribute (for example, integer types, character types, date/time types). Declaring an attribute to be of a particular domain acts as a constraint on the values that it can take. Domain constraints are the most elementary form of integrity constraint. They are tested easily by the system whenever a new data item is entered into the database. Referential Integrity. There are cases where we wish to ensure that a value that appears in one relation for a given set of attributes also appears in a certain set of attributes in another relation (referential integrity). For example, the department listed for each course must be one that actually exists. More precisely, the dept name value in a course record must appear in the dept name attribute of some record of the department relation. Database modifications can cause violations of referential integrity. When a referential-integrity constraint is violated, the normal procedure is to reject the action that caused the violation. Assertions. An assertion is any condition that the database must always satisfy. Domain constraints and referential-integrity constraints are special forms of assertions. However, there are many constraints that we cannot express by using only these special forms. For example, “Every department must have at least five courses offered every semester” must be expressed as an assertion. When an assertion is created, the system tests it for validity. If the assertion is valid, then any future modification to the database is allowed only if it does not cause that assertion to be violated. Authorization. We may want to differentiate among the users as far as the type of access they are permitted on various data values in the database. These differentiations are expressed in terms of authorization, the most common being: read authorization, which allows reading, but not modification, of data; insert authorization, which allows insertion of new data, but not modification of existing data; update authorization, which allows modification, but not deletion, of data; and delete authorization, which allows deletion of data. We may assign the user all, none, or a combination of these types of authorization. The DDL, just like any other programming language, gets as input some instructions (statements) and generates some output. The output of the DDL is placed in the data dictionary,which contains metadata—that is, data about data. The data dictionary is considered to be a special type of table that can only be accessed and updated by the database system itself (not a regular user). The database system consults the data dictionary before reading or modifying actual data. 14 Data Dictionary We can define a data dictionary as a DBMS component that stores the definition of data characteristics and relationships. You may recall that such “data about data” were labeled metadata. The DBMS data dictionary provides the DBMS with its self describing characteristic. In effect, the data dictionary resembles and X-ray of the company’s entire data set, and is a crucial element in the data administration function. The two main types of data dictionary exist, integrated and stand alone. An integrated data dictionary is included with the DBMS. For example, all relational DBMSs include a built in data dictionary or system catalog that is frequently accessed and updated by the RDBMS. Other DBMSs especially older types, do not have a built in data dictionary instead the DBA may use third party stand alone data dictionary systems. Data dictionaries can also be classified as active or passive. An active data dictionary is automatically updated by the DBMS with every database access, thereby keeping its access information up-to-date. A passive data dictionary is not updated automatically and usually requires a batch process to be run. Data dictionary access information is normally used by the DBMS for query optimization purpose. The data dictionary’s main function is to store the description of all objects that interact with the database. Integrated data dictionaries tend to limit their metadata to the data managed by the DBMS. Stand alone data dictionary systems are more usually more flexible and allow the DBA to describe and manage all the organization’s data, whether or not they are computerized. Whatever the data dictionary’s format, its existence provides database designers and end users with a much improved ability to communicate. In addition, the data dictionary is the tool that helps the DBA to resolve data conflicts. Although, there is no standard format for the information stored in the data dictionary several features are common. For example, the data dictionary typically stores descriptions of all: Data elements that are define in all tables of all databases. Specifically the data dictionary stores the name, datatypes, display formats, internal storage formats, and validation rules. The data dictionary tells where an element is used, by whom it is used and so on. Tables define in all databases. For example, the data dictionary is likely to store the name of the table creator, the date of creation access authorizations, the number of columns, and so on. Indexes define for each database tables. For each index the DBMS stores at least the index name the attributes used, the location, specific index characteristics and the creation date. Define databases: who created each database, the date of creation where the database is located, who the DBA is and so on. End users and The Administrators of the data base Programs that access the database including screen formats, report formats application formats, SQL queries and so on. Access authorization for all users of all databases. Relationships among data elements which elements are involved: whether the relationship are mandatory or optional, the connectivity and cardinality and so on. Database Administrators and Database Users A primary goal of a database system is to retrieve information from and store new information in the database. People who work with a database can be categorized as database users or database administrators. Database Users and User Interfaces There are four different types of database-system users, differentiated by the way they expect to interact with the system. Different types of user interfaces have been designed for the different types of users. Naive users are unsophisticated users who interact with the system by invoking one of the application programs that have been written previously. For example, a bank teller who needs to transfer $50 from account A to account B invokes a program called transfer. This program asks the teller for the amount of money to be transferred, the account from which the money is to be transferred, and the account to which the money is to be transferred. 15 As another example, consider a user who wishes to find her account balance over the World Wide Web. Such a user may access a form, where she enters her account number. An application program at the Web server then retrieves the account balance, using the given account number, and passes this information back to the user. The typical user interface for naive users is a forms interface, where the user can fill in appropriate fields of the form. Naive users may also simply read reports generated from the database. Application programmers are computer professionals who write application programs. Application programmers can choose from many tools to develop user interfaces. Rapid application development (RAD) tools are tools that enable an application programmer to construct forms and reports without writing a program. There are also special types of programming languages that combine imperative control structures (for example, for loops, while loops and if-then-else statements) with statements of the data manipulation language. These languages, sometimes called fourth-generation languages, often include special features to facilitate the generation of forms and the display of data on the screen. Most major commercial database systems include a fourth generation language. Sophisticated users interact with the system without writing programs. Instead, they form their requests in a database query language. They submit each such query to a query processor, whose function is to break down DML statements into instructions that the storage manager understands. Analysts who submit queries to explore data in the database fall in this category. Online analytical processing (OLAP) tools simplify analysts’ tasks by letting them view summaries of data in different ways. For instance, an analyst can see total sales by region (for example, North, South, East, and West), or by product, or by a combination of region and product (that is, total sales of each product in each region). The tools also permit the analyst to select specific regions, look at data in more detail (for example, sales by city within a region) or look at the data in less detail (for example, aggregate products together by category). Another class of tools for analysts is data mining tools, which help them find certain kinds of patterns in data. Specialized users are sophisticated users who write specialized database applications that do not fit into the traditional data-processing framework. Among these applications are computer-aided design systems, knowledge base and expert systems, systems that store data with complex data types (for example, graphics data and audio data), and environment-modeling systems. 16 Database Architecture: We are now in a position to provide a single picture (Figure 1.3) of the various components of a database system and the connections among them. The architecture of a database system is greatly influenced by the underlying computer system on which the database system runs. Database systems can be centralized, or client-server, where one server machine executes work on behalf of multiple client machines. Database systems can also be designed to exploit parallel computer architectures. Distributed databases span multiple geographically separated machines. Figure 1.3: Database System Architecture A database system is partitioned into modules that deal with each of the responsibilities of the overall system. The functional components of a database system can be broadly divided into the storage manager and the query processor components. The storage manager is important because databases typically require a large amount of storage space. The query processor is important because it helps the database system simplify and facilitate access to data. It is the job of the database system to translate updates and queries written in a nonprocedural language, at the logical level, into an efficient sequence of operations at the physical level. Database applications are usually partitioned into two or three parts, as in Figure 1.4. In a two-tier architecture, the application resides at the client machine, where it invokes database system functionality at the server machine through query language statements. Application program interface standards like ODBC and JDBC are used for interaction between the client and the server. In contrast, in a three-tier architecture, the client machine acts as merely a front end and does not contain any direct database calls. Instead, the client end communicates with an application server, usually through a forms interface. 17 The application server in turn communicates with a database system to access data. The business logic of the application, which says what actions to carry out under what conditions, is embedded in the application server, instead of being distributed across multiple clients. Three-tier applications are more appropriate for large applications, and for applications that run on the WorldWideWeb. Figure 1.4: Two-tier and three-tier architectures. Query Processor: The query processor components include · DDL interpreter, which interprets DDL statements and records the definitions in the data dictionary. · DML compiler, which translates DML statements in a query language into an evaluation plan consisting of low-level instructions that the query evaluation engine understands. A query can usually be translated into any of a number of alternative evaluation plans that all give the same result. The DML compiler also performs query optimization, that is, it picks the lowest cost evaluation plan from among the alternatives. Query evaluation engine, which executes low-level instructions generated by the DML compiler. Storage Manager: A storage manager is a program module that provides the interface between the lowlevel data stored in the database and the application programs and queries submitted to the system. The storage manager is responsible for the interaction with the file manager. The raw data are stored on the disk using the file system, which is usually provided by a conventional operating system. The storage manager translates the various DML statements into low-level file-system commands. Thus, the storage manager is responsible for storing, retrieving, and updating data in the database. The storage manager components include: 18 · Authorization and integrity manager, which tests for the satisfaction of integrity constraints and checks the authority of users to access data. · Transaction manager, which ensures that the database remains in a consistent (correct) state despite system failures, and that concurrent transaction executions proceed without conflicting. · File manager, which manages the allocation of space on disk storage and the data structures used to represent information stored on disk. · Buffer manager, which is responsible for fetching data from disk storage into main memory, and deciding what data to cache in main memory. The buffer manager is a critical part of the database system, since it enables the database to handle data sizes that are much larger than the size of main memory. Transaction Manager: A transaction is a collection of operations that performs a single logical function in a database application. Each transaction is a unit of both atomicity and consistency. Thus, we require that transactions do not violate any database-consistency constraints. That is, if the database was consistent when a transaction started, the database must be consistent when the transaction successfully terminates. Transaction - manager ensures that the database remains in a consistent (correct) state despite system failures (e.g., power failures and operating system crashes) and transaction failures. Conceptual Database Design - Entity Relationship(ER) Modeling: Database Design Techniques 1. ER Modeling (Top down Approach) 2. Normalization (Bottom Up approach) What is ER Modeling? A graphical technique for understanding and organizing the data independent of the actual database implementation We need to be familiar with the following terms to go further. Entity Any thing that has an independent existence and about which we collect data. It is also known as entity type. In ER modeling, notation for entity is given below. Entity instance Entity instance is a particular member of the entity type. Example for entity instance : A particular employee Regular Entity An entity which has its own key attribute is a regular entity. Example for regular entity : Employee. Weak entity An entity which depends on other entity for its existence and doesn't have any key attribute of its own is a weak 19 entity. Example for a weak entity : In a parent/child relationship, a parent is considered as a strong entity and the child is a weak entity. In ER modeling, notation for weak entity is given below. Attributes Properties/characteristics which describe entities are called attributes. In ER modeling, notation for attribute is given below. Domain of Attributes The set of possible values that an attribute can take is called the domain of the attribute. For example, the attribute day may take any value from the set {Monday, Tuesday... Friday}. Hence this set can be termed as the domain of the attribute day. Key attribute The attribute (or combination of attributes) which is unique for every entity instance is called key attribute. E.g the employee_id of an employee, pan_card_number of a person etc.If the key attribute consists of two or more attributes in combination, it is called a composite key. In ER modeling, notation for key attribute is given below. Simple attribute If an attribute cannot be divided into simpler components, it is a simple attribute. Example for simple attribute : employee_id of an employee. Composite attribute If an attribute can be split into components, it is called a composite attribute. Example for composite attribute : Name of the employee which can be split into First_name, Middle_name, and Last_name. Single valued Attributes 20 If an attribute can take only a single value for each entity instance, it is a single valued attribute. example for single valued attribute : age of a student. It can take only one value for a particular student. Multi-valued Attributes If an attribute can take more than one value for each entity instance, it is a multi-valued attribute. Multi-valued example for multi valued attribute : telephone number of an employee, a particular employee may have multiple telephone numbers. In ER modeling, notation for multi-valued attribute is given below. Stored Attribute An attribute which need to be stored permanently is a stored attribute Example for stored attribute : name of a student Derived Attribute An attribute which can be calculated or derived based on other attributes is a derived attribute. Example for derived attribute : age of employee which can be calculated from date of birth and current date. In ER modeling, notation for derived attribute is given below. Relationships Associations between entities are called relationships Example : An employee works for an organization. Here "works for" is a relation between the entities employee and organization. In ER modeling, notation for relationship is given below. However in ER Modeling, To connect a weak Entity with others, you should use a weak relationship notation as given below 21 Degree of a Relationship Degree of a relationship is the number of entity types involved. The n-ary relationship is the general form for degree n. Special cases are unary, binary, and ternary ,where the degree is 1, 2, and 3, respectively. Example for unary relationship : An employee ia a manager of another employee Example for binary relationship : An employee works-for department. Example for ternary relationship : customer purchase item from a shop keeper Cardinality of a Relationship Relationship cardinalities specify how many of each entity type is allowed. Relationships can have four possible connectivities as given below. 1. One to one (1:1) relationship 2. One to many (1:N) relationship 3. Many to one (M:1) relationship 4. Many to many (M:N) relationship The minimum and maximum values of this connectivity is called the cardinality of the relationship Example for Cardinality – One-to-One (1:1) Employee is assigned with a parking space. One employee is assigned with only one parking space and one parking space is assigned to only one employee. Hence it is a 1:1 relationship and cardinality is One-To-One (1:1) In ER modeling, this can be mentioned using notations as given below 22 Example for Cardinality – One-to-Many (1:N) Organization has employees One organization can have many employees , but one employee works in only one organization. Hence it is a 1:N relationship and cardinality is One-To-Many (1:N) In ER modeling, this can be mentioned using notations as given below Example for Cardinality – Many-to-One (M :1) It is the reverse of the One to Many relationship. employee works in organization One employee works in only one organization But one organization can have many employees. Hence it is a M:1 relationship and cardinality is Many-to-One (M :1) 23 In ER modeling, this can be mentioned using notations as given below. Cardinality – Many-to-Many (M:N) Students enrolls for courses One student can enroll for many courses and one course can be enrolled by many students. Hence it is a M:N relationship and cardinality is Many-to-Many (M:N) In ER modeling, this can be mentioned using notations as given below Relationship Participation 1. Total In total participation, every entity instance will be connected through the relationship to another instance of the other participating entity types 2. Partial Example for relationship participation Consider the relationship - Employee is head of the department. Here all employees will not be the head of the department. Only one employee will be the head of the department. In other words, only few instances of employee entity participate in the above relationship. So employee entity's participation is partial in the said relationship. However each department will be headed by some employee. So department entity's participation is total in the said relationship. 24 Advantages and Disadvantages of ER Modeling ( Merits and Demerits of ER Modeling ) Advantages 1. ER Modeling is simple and easily understandable. It is represented in business users language and it can be understood by non-technical specialist. 2. Intuitive and helps in Physical Database creation. 3. Can be generalized and specialized based on needs. 4. Can help in database design. 5. Gives a higher level description of the system. Disadvantages 1. Physical design derived from E-R Model may have some amount of ambiguities or inconsistency. 2. Sometime diagrams may lead to misinterpretations Relational Model The relational model is today the primary data model for commercial data processing applications. It attained its primary position because of its simplicity, which eases the job of the programmer, compared to earlier data models such as the network model or the hierarchical model. In this, we first study the fundamentals of the relational model. A substantial theory exists for relational databases. Structure of Relational Databases: A relational database consists of a collection of tables, each of which is assigned a unique name. For example, consider the instructor table of Figure:1.5, which stores information about instructors. The table has four column headers: ID, name, dept name, and salary. Each row of this table records information about an instructor, consisting of the instructor’s ID, name, dept name, and salary. Similarly, the course table of Figure 1.6 stores information about courses, consisting of a course id, title, dept name, and credits, for each course. Note that each instructor is identified by the value of the column ID, while each course is identified by the value of the column course id. Figure 1.7 shows a third table, prereq, which stores the prerequisite courses for each course. The table has two columns, course id and prereq id. Each row consists of a pair of course identifiers such that the second course is a prerequisite for the first course. Thus, a row in the prereq table indicates that two courses are related in the sense that one course is a prerequisite for the other. As another example, we consider the table instructor, a row in the table can be thought of as representing the relationship between a specified ID and the corresponding values for name,dept name, and salary values. 25 Figure 1.5: The instructor relation (2.1) In general, a row in a table represents a relationship among a set of values. Since a table is a collection of such relationships, there is a close correspondence between the concept of table and the mathematical concept of relation, from which the relational data model takes its name. In mathematical terminology, a tuple is simply a sequence (or list) of values. A relationship between n values is represented mathematically by an n-tuple of values, i.e., a tuple with n values, which corresponds to a row in a table. Figure: 1.6: The course relation (2.2) Figure: 1.7: The prereq relation. (2.3) Thus, in the relational model the term relation is used to refer to a table, while the term tuple is used to refer to a row. Similarly, the term attribute refers to a column of a table. Examining Figure 1.5, we can see that the relation instructor has four attributes: ID, name, dept name, and salary. We use the term relation instance to refer to a specific instance of a relation, i.e., containing a specific set of rows. The instance of instructor shown in Figure 1.5 has 12 tuples, corresponding to 12 instructors. 26 In this topic, we shall be using a number of different relations to illustrate the various concepts underlying the relational data model. These relations represent part of a university. They do not include all the data an actual university database would contain, in order to simplify our presentation. The order in which tuples appear in a relation is irrelevant, since a relation is a set of tuples. Thus, whether the tuples of a relation are listed in sorted order, as in Figure 1.5, or are unsorted, as in Figure 1.8, does not matter; the relations in the two figures are the same, since both contain the same set of tuples. For ease of exposition, we will mostly show the relations sorted by their first attribute. For each attribute of a relation, there is a set of permitted values, called the domain of that attribute. Thus, the domain of the salary attribute of the instructor relation is the set of all possible salary values, while the domain of the name attribute is the set of all possible instructor names. We require that, for all relations r, the domains of all attributes of r be atomic. A domain is atomic if elements of the domain are considered to be indivisible units. Figure: 1.8: Unsorted display of the instructor relation. (2-4) For example, suppose the table instructor had an attribute phone number, which can store a set of phone numbers corresponding to the instructor. Then the domain of phone number would not be atomic, since an element of the domain is a set of phone numbers, and it has subparts, namely the individual phone numbers in the set. The important issue is not what the domain itself is, but rather how we use domain elements in our database. Suppose now that the phone number attribute stores a single phone number. Even then, if we split the value from the phone number attribute into a country code, an area code and a local number, we would be treating it as a nonatomic value. If we treat each phone number as a single indivisible unit, then the attribute phone number would have an atomic domain. The null value is a special value that signifies that the value is unknown or does not exist. For example, suppose as before that we include the attribute phone number in the instructor relation. It may be that an instructor does not have a phone number at all, or that the telephone number is unlisted. We would then have to use the null value to signify that the value is unknown or does not exist. We shall see later that null values cause a number of difficulties when we access or update the database, and thus should be eliminated if at all possible. We shall assume null values are absent initially. Database Schema When we talk about a database, we must differentiate between the database schema, which is the logical design of the database, and the database instance, which is a snapshot of the data in the database at a given 27 instant in time. The concept of a relation corresponds to the programming-language notion of a variable, while the concept of a relation schema corresponds to the programming-language notion of type definition. In general, a relation schema consists of a list of attributes and their corresponding domains. The concept of a relation instance corresponds to the programming-language notion of a value of a variable. The value of a given variable may change with time; Figure 1.9: The department relation.(2-5) similarly the contents of a relation instance may change with time as the relation is updated. In contrast, the schema of a relation does not generally change. Although it is important to know the difference between a relation schema and a relation instance, we often use the same name, such as instructor, to refer to both the schema and the instance. Where required, we explicitly refer to the schema or to the instance, for example “the instructor schema,” or “an instance of the instructor relation.” However, where it is clear whether we mean the schema or the instance, we simply use the relation name. Consider the department relation of Figure 1.9. The schema for that relation is department (dept name, building, budget) Note that the attribute dept name appears in both the instructor schema and the department schema. This duplication is not a coincidence. Rather, using common attributes in relation schemas is one way of relating tuples of distinct relations. For example, suppose we wish to find the information about all the instructors who work in the Watson building. We look first at the department relation to find the dept name of all the departments housed in Watson. Then, for each such department, we look in the instructor relation to find the information about the instructor associated with the corresponding dept name. Let us continue with our university database example. Each course in a university may be offered multiple times, across different semesters, or even within a semester.We need a relation to describe each individual offering, or section, of the class. The schema is section (course id, sec id, semester, year, building, room number, time slot id) Figure 1.10 shows a sample instance of the section relation. We need a relation to describe the association between instructors and the class sections that they teach. The relation schema to describe this association is teaches (ID, course id, sec id, semester, year) 28 Figure 1.10: The section relation.(2-6) Figure 1.11 shows a sample instance of the teaches relation. As you can imagine, there are many more relations maintained in a real university database. In addition to those relations we have listed already, instructor, department, course, section, prereq, and teaches,we use the following relations in this text: Figure: 1.11: The teaches relation.(2-7) student (ID, name, dept name, tot cred) advisor (s id, i id) takes (ID, course id, sec id, semester, year, grade) classroom (building, room number, capacity) time slot (time slot id, day, start time, end time) Keys We must have a way to specify how tuples within a given relation are distinguished. This is expressed in terms of their attributes. That is, the values of the attribute values of a tuple must be such that they can uniquely identify the tuple. In other words, no two tuples in a relation are allowed to have exactly the same value for all attributes. A superkey is a set of one or more attributes that, taken collectively, allow us to identify uniquely a tuple in the relation. For example, the ID attribute of the relation instructor is sufficient to distinguish one instructor tuple from 29 another. Thus, ID is a superkey. The name attribute of instructor, on the other hand, is not a superkey, because several instructors might have the same name. Formally, let R denote the set of attributes in the schema of relation r. If we say that a subset K of R is a superkey for r , we are restricting consideration to instances of relations r in which no two distinct tuples have the same values on all attributes in K. That is, if t1 and t2 are in r and t1 = t2, then t1.K = t2.K. A superkey may contain extraneous attributes. For example, the combination of ID and name is a superkey for the relation instructor. If K is a superkey, then so is any superset of K. We are often interested in superkeys for which no proper subset is a superkey. Such minimal superkeys are called candidate keys. It is possible that several distinct sets of attributes could serve as a candidate key. Suppose that a combination of name and dept name is sufficient to distinguish among members of the instructor relation. Then, both {ID} and {name, dept name} are candidate keys. Although the attributes ID and name together can distinguish instructor tuples, their combination, {ID, name}, does not form a candidate key, since the attribute ID alone is a candidate key. We shall use the term primary key to denote a candidate key that is chosen by the database designer as the principal means of identifying tuples within a relation. A key (whether primary, candidate, or super) is a property of the entire relation, rather than of the individual tuples. Any two individual tuples in the relation are prohibited from having the same value on the key attributes at the same time. The designation of a key represents a constraint in the real-world enterprise being modeled. Primary keys must be chosen with care. As we noted, the name of a person is obviously not sufficient, because there may be many people with the same name. In the United States, the social-security number attribute of a person would be a candidate key. Since non-U.S. residents usually do not have social-security numbers, international enterprises must generate their own unique identifiers. An alternative is to use some unique combination of other attributes as a key. The primary key should be chosen such that its attribute values are never, or very rarely, changed. For instance, the address field of a person should not be part of the primary key, since it is likely to change. Social-security numbers, on the other hand, are guaranteed never to change. Unique identifiers generated by enterprises generally do not change, except if two enterprises merge; in such a case the same identifier may have been issued by both enterprises, and a reallocation of identifiers may be required to make sure they are unique. It is customary to list the primary key attributes of a relation schema before the other attributes; for example, the dept name attribute of department is listed first, since it is the primary key. Primary key attributes are also underlined. A relation, say r1, may include among its attributes the primary key of another relation, say r2. This attribute is called a foreign key from r1, referencing r2. The relation r1 is also called the referencing relation of the foreign key dependency, and r2 is called the referenced relation of the foreign key. For example, the attribute dept name in instructor is a foreign key from instructor, referencing department, since dept name is the primary key of department. In any database instance, given any tuple, say ta, from the instructor relation, there must be some tuple, say tb, in the department relation such that the value of the dept name attribute of ta is the same as the value of the primary key, dept name, of tb. Now consider the section and teaches relations. It would be reasonable to require that if a section exists for a course, it must be taught by at least one instructor; however, it could possibly be taught by more than one instructor. To enforce this constraint, we would require that if a particular (course id, sec id, semester, year) combination appears in section, then the same combination must appear in teaches. However, this set of values does not form a primary key for teaches, since more than one instructor may teach one such section. As a result, we cannot declare a foreign key constraint from section to teaches (although we can define a foreign key constraint in the other direction, from teaches to section). 30 The constraint from section to teaches is an example of a referential integrity constraint; a referential integrity constraint requires that the values appearing in specified attributes of any tuple in the referencing relation also appear in specified attributes of at least one tuple in the referenced relation. Schema Diagrams A database schema, along with primary key and foreign key dependencies, can be depicted by schema diagrams. Figure 1.12 shows the schema diagram for our university organization. Each relation appears as a box, with the relation name at the top in blue, and the attributes listed inside the box. Primary key attributes are shown underlined. Foreign key dependencies appear as arrows from the foreign key attributes of the referencing relation to the primary key of the referenced relation. Figure 1.12 : Schema diagram for the university database. Referential integrity constraints other than foreign key constraints are not shown explicitly in schema diagrams. We will study a different diagrammatic representation called the entity-relationship diagram. UNIT-2 Relational Algebra and Calculus PRELIMINARIES In defining relational algebra and calculus, the alternative of referring to fields by position is more convenient than referring to fields by name: Queries often involve the computation of intermediate results, which are themselves relation instances, and if we use field names to refer to fields, the definition of query language constructs must specify the names of fields for all intermediate relation instances. This can be tedious and is really a secondary issue because we can refer to fields by position anyway. On the other hand, field names make queries more readable. Due to these considerations, we use the positional notation to formally define relational algebra and calculus. We also introduce simple conventions that allow intermediate relations to ‘inherit’ field names, for convenience. We present a number of sample queries using the following schema: Sailors (sid: integer, sname: string, rating: integer, age: real) Boats (bid: integer, bname: string, color: string) Reserves (sid: integer, bid: integer, day: date) The key fields are underlined, and the domain of each field is listed after the field name. Thus sid is the key for Sailors, bid is the key for Boats, and all three fields together form the key for Reserves. Fields in an instance of one of these relations will be referred to by name, or positionally, using the order in which they are listed above. In several examples illustrating the relational algebra operators, we will use the in-stances S1 and S2 (of Sailors) and R1 (of Reserves) shown in Figures 4.1, 4.2, and 4.3, respectively, Dept of CSE, Unit-2 Page 1 RELATIONAL ALGEBRA Relational algebra is one of the two formal query languages associated with the re- lational model. Queries in algebra are composed using a collection of operators. A fundamental property is that every operator in the algebra accepts (one or two) rela- tion instances as arguments and returns a relation instance as the result. This property makes it easy to compose operators to form a complex query—a relational algebra expression is recursively defined to be a relation, a unary algebra operator applied to a single expression, or a binary algebra operator applied to two expressions. We describe the basic operators of the algebra (selection, projection, union, cross-product, and difference), as well as some additional operators that can be defined in terms of the basic operators but arise frequently enough to warrant special attention, in the following sections.Each relational query describes a step-by-step procedure for computing the desired answer, based on the order in which operators are applied in the query. The procedural nature of the algebra allows us to think of an algebra expression as a recipe, or a plan, for evaluating a query, and relational systems in fact use algebra expressions to represent query evaluation plans. Selection and Projection Relational algebra includes operators to select rows from a relation (σ) and to project columns (π). These operations allow us to manipulate data in a single relation. Con- sider the instance of the Sailors relation shown in Figure 4.2, denoted as S2. We can retrieve rows corresponding to expert sailors by using the σ operator. The expression, σrating>8(S2) evaluates to the relation shown in Figure 4.4. The subscript rating>8 specifies the selection criterion to be applied while retrieving tuples. sname rating yuppy 9 Lubber 8 Guppy 5 Rusty 10 sid sname rating age 28 Yuppy 9 35.0 58 Rusty 10 35.0 Figure 4.4 σrating>8(S2) Figure 4.5πsname,rating(S2) Dept of CSE, Unit-2 Page 2 The selection operator σ specifies the tuples to retain through a selection condition. In general, the selection condition is a boolean combination (i.e., an expression using the logical connectives ∧ and ∨ ) of terms that have the form attribute op constant or attribute1 op attribute2, where op is one of the comparison operators. The reference to an attribute can be by position (of the form.i or i) or by name (of the form.name or name). The schema of the result of a selection is the schema of the input relation instance The projection operator π allows us to extract columns from a relation; for example, we can find out all sailor names and ratings by using π. The expression πsname,rating (S2) Suppose that we wanted to find out only the ages of sailors. The expression πage(S2) a single tuple with age=35.0 appears in the result of the projection. This follows from the definition of a relation as a set of tuples. In practice, real systems often omit the expensive step of eliminating duplicate tuples, leading to relations that are multisets. However, our discussion of relational algebra and calculus assumes that duplicate elimination is always done so that relations are always sets of tuples. We can compute the names and ratings of highly rated sailors by combining two of the preceding queries. The expression πsname,rating(σrating>8(S2)) age sname rating 35.0 yuppy 9 55.5 Rusty 10 Figure 4.6 πage(S2) Figure 4.7 πsname,rating(σrating>8(S2)) Set Operations The following standard operations on sets are also available in relational algebra: union (U), intersection (∩), set-difference (−), and cross-product (×). Dept of CSE, Unit-2 Page 3  Union: R u S returns a relation instance containing all tuples that occur in either relation instance R or relation instance S (or both). R and S must be union- compatible, and the schema of the result is defined to be identical to the schema of R.  Intersection: R ∩ S returns a relation instance containing all tuples that occur in both R and S. The relations R and S must be union-compatible, and the schema of the result is defined to be identical to the schema of R.  Set-difference: R − S returns a relation instance containing all tuples that occur in R but not in S. The relations R and S must be union-compatible, and the schema of the result is defined to be identical to the schema of R.  Cross-product: R × S returns a relation instance whose schema contains all the fields of R (in the same order as they appear in R) followed by all the fields of S (in the same order as they appear in S). The result of R × S contains one tuple 〈r, s〉 (the concatenation of tuples r and s) for each pair of tuples r ∈ R, s ∈ S. The cross-product opertion is sometimes called Cartesian product. We now illustrate these definitions through several examples. The union of S1 and S2 is shown in Figure 4.8. Fields are listed in order; field names are also inherited from S1. S2 has the same field names, of course, since it is also an instance of Sailors.In general, fields of S2 may have different names; recall that we require only domains to match. Note that the result is a set of tuples. Tuples that appear in both S1 and S2 appear only once in S1 ∪ S2. Also, S1 ∪ R1 is not a valid operation because the two relations are not union-compatible. The intersection of S1 and S2 is shown in Figure 4.9, and the set-difference S1 − S2 is shown in Figure 4.10. sid sname rating age 22 Dustin 7 45.0 31 Lubber 8 55.5 58 Rusty 10 35.0 28 Yuppy 9 35.0 44 Guppy 5 35.0 Figure 4.8 S1 ∪ S2 Dept of CSE, Unit-2 Page 4 sid sname rating age sid sname rating age 31 Lubbe 8 55.5 r 58 Rusty 10 35.0 22 Dustin 7 45.0 Figure 4.9 S1 ∩ S2 Figure 4.10 S1 − S2 The result of the cross-product S1 × R1 is shown in Figure 4.11 The fields in S1 × R1 have the same domains as the corresponding fields in R1 and S1. In Figure 4.11 sid is listed in parentheses to emphasize that it is not an inherited field name; only the corresponding domain is inherited. (sid) sname rating age (sid) bid day 22 Dustin 7 45.0 22 101 10/10/96 22 Dustin 7 45.0 58 103 11/12/96 31 Lubber 8 55.5 22 101 10/10/96 31 Lubber 8 55.5 58 103 11/12/96 58 Rusty 10 35.0 22 101 10/10/96 58 Rusty 10 35.0 58 103 11/12/96 Figure 4.11 S1 × R1 Renaming We introduce a renaming operator ρ for this purpose. The expression ρ(R(F ), E) takes an arbitrary relational algebra expression E and returns an instance of a (new) relation called R. R contains the same tuples as the result of E, and has the same schema as E, but some fields are renamed. The field names in relation R are the same as in E, except for fields renamed in the renaming list F. For example, the expression ρ(C(1 → sid1, 5 → sid2), S1 × R1) returns a relation that contains the tuples shown in Figure 4.11 and has the followi ng schema: C(sid1: integer, sname: string, rating: integer, age: real, sid2: integer, bid: integer,day: dates). It is customary to include some additional operators in the algebra, but they can all be defined in terms of the operators that we have defined thus far. (In fact, the renaming operator is only needed for syntactic convenience, and even the ∩ operator is redundant; R ∩ S can be defined as R − (R − S).) We will consider these additional operators,and their definition in terms of the basic operators, in the next two subsections. Dept of CSE, Unit-2 Page 5 Joins The join operation is one of the most useful operations in relational algebra and is the most commonly used way to combine information from two or more relations. Although a join can be defined as a cross-product followed by selections and projections, joins arise much more frequently in practice than plain cross-products. joins have received a lot of attention, and there are several variants of the join operation. Condition Joins The most general version of the join operation accepts a join condition c and a pair of relation instances as arguments, and returns a relation instance. The join condition is identical to a selection condition in form. The operation is defined as follows: R ⊲⊳c S = σc(R × S) Thus ⊲⊳ is defined to be a cross-product followed by a selection. Note that the condition c can (and typically does) refer to attributes of both R and S. The reference to an attribute of a relation, say R, can be by position (of the form R.i) or by name (of the form R.name).As an example, the result of S1 ⊲⊳S1.sid7 is applied. The answer contains those instances of S that pass this test. On instance S3 of Sailors, the answer contains Sailors tuples with sid 31, 32, 58, 71, and 74. Syntax of TRC Queries We now define these concepts formally, beginning with the notion of a formula. Let Rel be a relation name, R and S be tuple variables, a an attribute of R, and b an attribute of S. Let op denote an operator in the set {,=,≤, ≥, =}. An atomic formula is one of the following: a. R E Rel b. R.a op S.b c. R.a op constant, or constant op R.a A formula is recursively defined to be one of the following, where p and q are them- selves formulas, and p(R) denotes a formula in which the variable R appears: any atomic formula ¬ p, p V q, p ^ q, or p  q  R(p(R)), where R is a tuple variable Dept of CSE, Unit-2 Page 15  R(p(R)), where R is a tuple variable We observe that every variable in a TRC formula appears in a subformula that is atomic, and every relation schema specifies a domain for each field; this observation ensures that each variable in a TRC formula has a well-defined domain from which values for the variable are drawn. That is, each variable has a well-defined type, in the programming language sense. Informally, an atomic formula R ∈ Rel gives R the type of tuples in Rel, and comparisons such as R.a op S.b and R.a op constant induce type restrictions on the field R.a. If a variable R does not appear in an atomic formula of the form R E Rel (i.e., it appears only in atomic formulas that are comparisons), we will follow the convention that the type of R is a tuple whose fields include all (and only) fields of R that appear in the formula. We will not define types of variables formally, but the type of a variable should be clear in most cases, and the important point to note is that comparisons of values having different types should always fail. (In discussions of relational calculus, the simplifying assumption is often made that there is a single domain of constants and that this is the domain associated with each field of each relation.) A TRC query is defined to be expression of the form {T | p(T)}, where T is the only free variable in the formula p. Semantics of TRC Queries What does a TRC query mean? More precisely, what is the set of answer tuples for a given TRC query? The answer to a TRC query {T | p(T)}, as we noted earlier, is the set of all tuples t for which the formula p(T ) evaluates to true with variable T assigned the tuple value t. To complete this definition, we must state which assignments of tuple values to the free variables in a formula make the formula evaluate to true. A query is evaluated on a given instance of the database. Let each free variable in a formula F be bound to a tuple value. For the given assignment of tuples to variables, Dept of CSE, Unit-2 Page 16 with respect to the given database instance, F evaluates to (or simply ‘is’) true if one of the following holds:  F is an atomic formula R  Rel, and R is assigned a tuple in the instance of relation Rel.  F is a comparison R.a op S.b, R.a op constant, or constant op R.a, and the tuples assigned to R and S have field values R.a and S.b that make the comparison true.  F is of the form ¬p, and p is not true; or of the form p ^ q, and both p and q are true; or of the form p V q, and one of them is true, or of the form p  q and q is true whenever 4 p is true.  F is of the form R(p(R)), and there is some assignment of tuples to the free variables in p(R), including the variable R,5 that makes the formula p(R) true.  F is of the form R(p(R)), and there is some assignment of tuples to the free variables in p(R) that makes the formula p(R) true no matter what tuple is assigned to R. Examples of TRC Queries We now illustrate the calculus through several examples, using the instances B1 of Boats, R2 of Reserves, and S3 of Sailors shown in Figures 4.15, 4.16, and 4.17. We will use parentheses as needed to make our formulas unambiguous. Often, a formula p(R) includes a condition RRel, and the meaning of the phrases some tuple R and for all tuples R is intuitive. We will use the notation R  Rel(p(R)) for  R(R  Rel  p(R)). Similarly, we use the notation R  Rel(p(R)) for R(R  Rel  p(R)). (Q12) Find the names and ages of sailors with a rating above 7. {P | S  Sailors(S.rating > 7  P.name = S.sname  P.age = S.age)} This query illustrates a useful convention: P is considered to be a tuple variable with exactly two fields, which are called name and age, because these are the only fields of P that are mentioned and P does not range over any of the relations in the query; that is, there is no subformula of the form P  Relname. The result of this query is a relation with two fields, name and age. The atomic formulas P.name = S.sname Dept of CSE, Unit-2 Page 17 and P.age = S.age give values to the fields of an answer tuple P. On instances B1, R2, and S3, the answer is the set of tuples 〈Lubber, 55.5〉, 〈Andy, 25.5〉, 〈Rusty, 35.0〉, 〈Zorba, 16.0〉, and 〈Horatio,35.0〉. (Q13) Find the sailor name, boat id, and reservation date for each reservation {P | R Reserves S  Sailors (R.sid = S.sid  P.bid = R.bid  P.day = R.day  P.sname = S.sname)} For each Reserves tuple, we look for a tuple in Sailors with the same sid. Given a pair of such tuples, we construct an answer tuple P with fields sname, bid, and day bycopying the corresponding fields from these two tuples. This query illustrates how we can combine values from different relations in each answer tuple. The answer to this query on instances B1, R2, and S3 is shown in Figure 4.20. sname bid day Dustin 10 10/10/98 Dustin 102 10/10/98 Dustin 103 10/8/98 Dustin 104 10/7/98 Lubber 102 11/10/98 Lubber 103 11/6/98 Lubber 104 11/12/98 Horati 101 9/5/98 ooo Horati 102 9/8/98 o Horati 103 9/8/98 ooo Figure 4.20 Answer to Query Q13 (Q1) Find the names of sailors who have reserved boat 103. {P | S  Sailors R  Reserves(R.sid = S.sid  R.bid = 103 P.sname = S.sname)} This query can be read as follows: “Retrieve all sailor tuples for which there exists a tuple in Reserves, having the same value in the sid field, and with bid = 103.” That is, for each sailor tuple, we look for a tuple in Reserves that shows that this sailor has reserved boat 103. The answer tuple P contains just one field, sname Dept of CSE, Unit-2 Page 18 (Q2) Find the names of sailors who have reserved a red boat. {P | S  Sailors R  Reserves(R.sid = S.sid  P.sname = S.sname  B Boats(B.bid = R.bid  B.color =′red′))} This query can be read as follows: “Retrieve all sailor tuples S for which there exist tuples R in Reserves and B in Boats such that S.sid = R.sid, R.bid = B.bid, and B.color =′red′.” Another way to write this query, which corresponds more closely to this reading, is as follows: (Q7) Find the names of sailors who have reserved at least two boats. {P | S  Sailors R1  Reserves R2  Reserves (S.sid = R1.sid  R1.sid = R2.sid  R1.bid ≠ R2.bid  P.sname = S.sname)} Contrast this query with the algebra version and see how much simpler the calculus version is. In part, this difference is due to the cumbersome renaming of fields in the algebra version, but the calculus version really is simpler. (Q9) Find the names of sailors who have reserved all boats. {P | S  Sailors B  Boats (R  Reserves(S.sid = R.sid R.bid = B.bid P.sname = S.sname))} This query was expressed using the division operator in relational algebra. Notice how easily it is expressed in the calculus. The calculus query directly reflects how we might express the query in English: “Find sailors S such that for all boats B there is a Reserves tuple showing that sailor S has reserved boat B.” (Q14) Find sailors who have reserved all red boats. {S | S  Sailors  B  Boats (B.color =′red′ (R ∈ Reserves(S.sid = R.sid  R.bid = B.bid)))} This query can be read as follows: For each candidate (sailor), if a boat is red, the sailor must have reserved it. That is, for a candidate sailor, a boat being red must imply the sailor having reserved it. Observe that since we can return an entire sailor tuple as the answer instead of just the sailor’s name, we have avoided introducing a new free variable (e.g., the variable P in the previous example) to hold the answer values. On instances B1, R2, and S3, the answer contains the Sailors tuples with sids 22 Dept of CSE, Unit-2 Page 19 and 31.We can write this query without using implication, by observing that an expression of the form p q is logically equivalent to ¬p  q: {S | S  Sailors B  Boats (B.color ≠′red′  (R  Reserves(S.sid = R.sid  R.bid = B.bid)))} This query should be read as follows: “Find sailors S such that for all boats B, either the boat is not red or a Reserves tuple shows that sailor S has reserved boat B.” Domain Relational Calculus A domain variable is a variable that ranges over the values in the domain of some attribute (e.g., the variable can be assigned an integer if it appears in an attribute whose domain is the set of integers). A DRC query has the form {〈x1, x2,... , xn〉 | p(〈x1,x2,..., xn〉)}, where each xi is either a domain variable or a constant and p(〈x1,x2,..., xn〉) denotes a DRC formula whose only free variables are thevari-ables among the xi, 1 ≤ i ≤ n. The result of this query is the set of all tuples 〈x1, x2,...,xn〉 for which the formula evaluates to true. A DRC formula is defined in a manner that is very similar to the definition of a TRC formula. The main difference is that the variables are now domain variables. Let op denote an operator in the set {, =, ≤, ≥, =} and let X and Y be domain variables. An atomic formula in DRC is one of the following: 〈x1,x2,...,xn〉 ∈ Rel, where Rel is a relation with n attributes; each xi, 1 ≤ i ≤ n is either a variable or a constant. X op Y X op constant, or constant op X A formula is recursively defined to be one of t

Use Quizgecko on...
Browser
Browser