Chapter 1 Introduction

These difficulties, among others, prompted both the initial development of database systems and the transition of file-based applications to database systems, back in the 1960s and 1970s.

In what follows, we shall see the concepts and algorithms that enable database systems to solve the problems with file-processing systems. In most of this book, we use a university organization as a running example of a typical data-processing application.

1.3 View of Data

A database system is a collection of interrelated data and a set of programs that allow users to access and modify these data. A major purpose of a database system is to provide users with an abstract view of the data. That is, the system hides certain details of how the data are stored and maintained.

1.3.1 Data Models

Underlying the structure of a database is the data model: a collection of conceptual tools for describing data, data relationships, data semantics, and consistency constraints. There are a number of different data models that we shall cover in the text. The data models can be classified into four different categories:

• Relational Model. The relational model uses a collection of tables to represent both data and the relationships among those data. Each table has multiple columns, and each column has a unique name. Tables are also known as relations. The relational model is an example of a record-based model. Record-based models are so named because the database is structured in fixed-format records of several types. Each table contains records of a particular type. Each record type defines a fixed number of fields, or attributes. The columns of the table correspond to the attributes of the record type. The relational data model is the most widely used data model, and a vast majority of current database systems are based on the relational model. Chapter 2 and Chapter 7 cover the relational model in detail.

• Entity-Relationship Model. The entity-relationship (E-R) data model uses a collection of basic objects, called entities, and relationships among these objects. An entity is a “thing” or “object” in the real world that is distinguishable from other objects. The entity-relationship model is widely used in database design. Chapter 6 explores it in detail.

• Semi-structured Data Model. The semi-structured data model permits the specification of data where individual data items of the same type may have different sets of attributes. This is in contrast to the data models mentioned earlier, where every data item of a particular type must have the same set of attributes. JSON and Extensible Markup Language (XML) are widely used semi-structured data representations. Semi-structured data models are explored in detail in Chapter 8.

• Object-Based Data Model. Object-oriented programming (especially in Java, C++, or C#) has become the dominant software-development methodology. This led initially to the development of a distinct object-oriented data model, but today the concept of objects is well integrated into relational databases. Standards exist to store objects in relational tables. Database systems allow procedures to be stored in the database system and executed by the database system. This can be seen as extending the relational model with notions of encapsulation, methods, and object identity. Object-based data models are summarized in Chapter 8.

A large portion of this text is focused on the relational model because it serves as the foundation for most database applications.

1.3.2 Relational Data Model

In the relational model, data are represented in the form of tables. Each table has multiple columns, and each column has a unique name. Each row of the table represents one piece of information.
Figure 1.1 presents a sample relational database comprising two tables: one shows details of university instructors and the other shows details of the various university departments. The first table, the instructor table, shows, for example, that an instructor named Einstein with ID 22222 is a member of the Physics department and has an annual salary of $95,000. The second table, department, shows, for example, that the Biology department is located in the Watson building and has a budget of $90,000. Of course, a real-world university would have many more departments and instructors. We use small tables in the text to illustrate concepts. A larger example for the same schema is available online.

    ID     name        dept_name   salary
    22222  Einstein    Physics     95000
    12121  Wu          Finance     90000
    32343  El Said     History     60000
    45565  Katz        Comp. Sci.  75000
    98345  Kim         Elec. Eng.  80000
    76766  Crick       Biology     72000
    10101  Srinivasan  Comp. Sci.  65000
    58583  Califieri   History     62000
    83821  Brandt      Comp. Sci.  92000
    15151  Mozart      Music       40000
    33456  Gold        Physics     87000
    76543  Singh       Finance     80000

    (a) The instructor table

    dept_name   building  budget
    Comp. Sci.  Taylor    100000
    Biology     Watson    90000
    Elec. Eng.  Taylor    85000
    Music       Packard   80000
    Finance     Painter   120000
    History     Painter   50000
    Physics     Watson    70000

    (b) The department table

Figure 1.1 A sample relational database.

1.3.3 Data Abstraction

For the system to be usable, it must retrieve data efficiently. The need for efficiency has led database system developers to use complex data structures to represent data in the database. Since many database-system users are not computer trained, developers hide the complexity from users through several levels of data abstraction, to simplify users’ interactions with the system:

• Physical level. The lowest level of abstraction describes how the data are actually stored. The physical level describes complex low-level data structures in detail.

• Logical level. The next-higher level of abstraction describes what data are stored in the database, and what relationships exist among those data. The logical level thus describes the entire database in terms of a small number of relatively simple structures. Although implementation of the simple structures at the logical level may involve complex physical-level structures, the user of the logical level does not need to be aware of this complexity. This is referred to as physical data independence. Database administrators, who must decide what information to keep in the database, use the logical level of abstraction.

• View level. The highest level of abstraction describes only part of the entire database. Even though the logical level uses simpler structures, complexity remains because of the variety of information stored in a large database. Many users of the database system do not need all this information; instead, they need to access only a part of the database. The view level of abstraction exists to simplify their interaction with the system. The system may provide many views for the same database.

Figure 1.2 shows the relationship among the three levels of abstraction.

[Figure 1.2 The three levels of data abstraction: the view level (view 1, view 2, …, view n) sits above the logical level, which sits above the physical level.]

An important feature of data models, such as the relational model, is that they hide such low-level implementation details from not just database users, but even from database-application developers. The database system allows application developers to store and retrieve data using the abstractions of the data model, and converts the abstract operations into operations on the low-level implementation.

An analogy to the concept of data types in programming languages may clarify the distinction among levels of abstraction. Many high-level programming languages support the notion of a structured type.
We may describe the type of a record abstractly as follows:¹

    type instructor = record
        ID : char (5);
        name : char (20);
        dept_name : char (20);
        salary : numeric (8,2);
    end;

This code defines a new record type called instructor with four fields. Each field has a name and a type associated with it. For example, char(20) specifies a string with 20 characters, while numeric(8,2) specifies a number with 8 digits, two of which are to the right of the decimal point. A university organization may have several such record types, including:

• department, with fields dept_name, building, and budget.
• course, with fields course_id, title, dept_name, and credits.
• student, with fields ID, name, dept_name, and tot_cred.

¹ The actual type declaration depends on the language being used. C and C++ use struct declarations. Java does not have such a declaration, but a simple class can be defined to the same effect.

At the physical level, an instructor, department, or student record can be described as a block of consecutive bytes. The compiler hides this level of detail from programmers. Similarly, the database system hides many of the lowest-level storage details from database programmers. Database administrators, on the other hand, may be aware of certain details of the physical organization of the data. For example, there are many possible ways to store tables in files. One way is to store a table as a sequence of records in a file, with a special character (such as a comma) used to delimit the different attributes of a record, and another special character (such as a newline character) used to delimit records. If all attributes have fixed length, the lengths of attributes may be stored separately, and delimiters may be omitted from the file. Variable-length attributes could be handled by storing the length, followed by the data.
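The two storage layouts just described can be sketched in a few lines of Python. This is only an illustration, not how any particular database system lays out its files: the `Instructor` class, the helper function names, and the choice of a 2-byte big-endian length prefix are all assumptions made for the example.

```python
import struct
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Instructor:
    """Logical-level description: four named, typed fields."""
    ID: str
    name: str
    dept_name: str
    salary: Decimal


def to_delimited(rec):
    """Layout 1: commas delimit attributes, a newline delimits records."""
    return f"{rec.ID},{rec.name},{rec.dept_name},{rec.salary}\n"


def from_delimited(line):
    ID, name, dept_name, salary = line.rstrip("\n").split(",")
    return Instructor(ID, name, dept_name, Decimal(salary))


def to_length_prefixed(rec):
    """Layout 2: each attribute is stored as its length, followed by the data.
    The 2-byte big-endian length is an assumption of this sketch."""
    out = b""
    for value in (rec.ID, rec.name, rec.dept_name, str(rec.salary)):
        data = value.encode("utf-8")
        out += struct.pack(">H", len(data)) + data
    return out


def from_length_prefixed(buf):
    values, pos = [], 0
    while pos < len(buf):
        (n,) = struct.unpack_from(">H", buf, pos)  # read the length prefix
        pos += 2
        values.append(buf[pos:pos + n].decode("utf-8"))
        pos += n
    ID, name, dept_name, salary = values
    return Instructor(ID, name, dept_name, Decimal(salary))


einstein = Instructor("22222", "Einstein", "Physics", Decimal("95000.00"))
text_form = to_delimited(einstein)
byte_form = to_length_prefixed(einstein)
```

Both layouts decode back to the same logical record, which is the point of the abstraction: code written against `Instructor` does not care which layout is used. The delimited layout in this sketch would break on a value that itself contains a comma; the length-prefixed layout does not have that limitation.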
Databases use a type of data structure called an index to support efficient retrieval of records; these too form part of the physical level.

At the logical level, each such record is described by a type definition, as in the previous code segment. The interrelationship of these record types is also defined at the logical level; a requirement that the dept_name value of an instructor record must appear in the department table is an example of such an interrelationship. Programmers using a programming language work at this level of abstraction. Similarly, database administrators usually work at this level of abstraction.

Finally, at the view level, computer users see a set of application programs that hide details of the data types. At the view level, several views of the database are defined, and a database user sees some or all of these views. In addition to hiding details of the logical level of the database, the views also provide a security mechanism to prevent users from accessing certain parts of the database. For example, clerks in the university registrar office can see only that part of the database that has information about students; they cannot access information about salaries of instructors.

1.3.4 Instances and Schemas

Databases change over time as information is inserted and deleted. The collection of information stored in the database at a particular moment is called an instance of the database. The overall design of the database is called the database schema. The concept of database schemas and instances can be understood by analogy to a program written in a programming language. A database schema corresponds to the variable declarations (along with associated type definitions) in a program. Each variable has a particular value at a given instant. The values of the variables in a program at a point in time correspond to an instance of a database schema.

Database systems have several schemas, partitioned according to the levels of abstraction. The physical schema describes the database design at the physical level, while the logical schema describes the database design at the logical level. A database may also have several schemas at the view level, sometimes called subschemas, that describe different views of the database.

Of these, the logical schema is by far the most important in terms of its effect on application programs, since programmers construct applications by using the logical schema. The physical schema is hidden beneath the logical schema and can usually be changed easily without affecting application programs. Application programs are said to exhibit physical data independence if they do not depend on the physical schema and thus need not be rewritten if the physical schema changes.

We also note that it is possible to create schemas that have problems, such as unnecessarily duplicated information. For example, suppose we store the department budget as an attribute of the instructor record. Then, whenever the value of the budget for a department (say the Physics department) changes, that change must be reflected in the records of all instructors associated with the department. In Chapter 7, we shall study how to distinguish good schema designs from bad schema designs.

Traditionally, logical schemas were changed infrequently, if at all. Many newer database applications, however, require more flexible logical schemas where, for example, different records in a single relation may have different attributes.

1.4 Database Languages

A database system provides a data-definition language (DDL) to specify the database schema and a data-manipulation language (DML) to express database queries and updates. In practice, the data-definition and data-manipulation languages are not two separate languages; instead they simply form parts of a single database language, such as the SQL language.
Almost all relational database systems employ the SQL language, which we cover in great detail in Chapter 3, Chapter 4, and Chapter 5.

1.4.1 Data-Definition Language

We specify a database schema by a set of definitions expressed by a special language called a data-definition language (DDL). The DDL is also used to specify additional properties of the data.

We specify the storage structure and access methods used by the database system by a set of statements in a special type of DDL called a data storage and definition language. These statements define the implementation details of the database schemas, which are usually hidden from the users.

The data values stored in the database must satisfy certain consistency constraints. For example, suppose the university requires that the account balance of a department must never be negative. The DDL provides facilities to specify such constraints. The database system checks these constraints every time the database is updated. In general, a constraint can be an arbitrary predicate pertaining to the database. However, arbitrary predicates may be costly to test. Thus, database systems implement only those integrity constraints that can be tested with minimal overhead:

• Domain Constraints. A domain of possible values must be associated with every attribute (for example, integer types, character types, date/time types). Declaring an attribute to be of a particular domain acts as a constraint on the values that it can take. Domain constraints are the most elementary form of integrity constraint. They are tested easily by the system whenever a new data item is entered into the database.

• Referential Integrity. There are cases where we wish to ensure that a value that appears in one relation for a given set of attributes also appears in a certain set of attributes in another relation (referential integrity). For example, the department listed for each course must be one that actually exists in the university. More precisely, the dept_name value in a course record must appear in the dept_name attribute of some record of the department relation. Database modifications can cause violations of referential integrity. When a referential-integrity constraint is violated, the normal procedure is to reject the action that caused the violation.

• Authorization. We may want to differentiate among the users as far as the type of access they are permitted on various data values in the database. These differentiations are expressed in terms of authorization, the most common being: read authorization, which allows reading, but not modification, of data; insert authorization, which allows insertion of new data, but not modification of existing data; update authorization, which allows modification, but not deletion, of data; and delete authorization, which allows deletion of data. We may assign the user all, none, or a combination of these types of authorization.

The processing of DDL statements, just like those of any other programming language, generates some output. The output of the DDL is placed in the data dictionary, which contains metadata, that is, data about data. The data dictionary is considered to be a special type of table that can be accessed and updated only by the database system itself (not a regular user). The database system consults the data dictionary before reading or modifying actual data.

1.4.2 The SQL Data-Definition Language

SQL provides a rich DDL that allows one to define tables with data types and integrity constraints.
For instance, the following SQL DDL statement defines the department table:

    create table department
        (dept_name char (20),
         building char (15),
         budget numeric (12,2));

Execution of the preceding DDL statement creates the department table with three columns: dept_name, building, and budget, each of which has a specific data type associated with it. We discuss data types in more detail in Chapter 3.

The SQL DDL also supports a number of types of integrity constraints. For example, one can specify that the dept_name attribute value is a primary key, ensuring that no two departments can have the same department name. As another example, one can specify that the dept_name attribute value appearing in any instructor record must also appear in the dept_name attribute of some record of the department table. We discuss SQL support for integrity constraints and authorizations in Chapter 3 and Chapter 4.

1.4.3 Data-Manipulation Language

A data-manipulation language (DML) is a language that enables users to access or manipulate data as organized by the appropriate data model. The types of access are:

• Retrieval of information stored in the database.
• Insertion of new information into the database.
• Deletion of information from the database.
• Modification of information stored in the database.

There are basically two types of data-manipulation language:

• Procedural DMLs require a user to specify what data are needed and how to get those data.
• Declarative DMLs (also referred to as nonprocedural DMLs) require a user to specify what data are needed without specifying how to get those data.

Declarative DMLs are usually easier to learn and use than are procedural DMLs. However, since a user does not have to specify how to get the data, the database system has to figure out an efficient means of accessing data.

A query is a statement requesting the retrieval of information. The portion of a DML that involves information retrieval is called a query language.
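The integrity constraints of Section 1.4.2 and the four kinds of DML access listed above can be tried out together in a small sketch. It uses Python's standard `sqlite3` module with an in-memory SQLite database, which is an assumption of this example rather than a system the text prescribes. The `primary key` and `references` clauses are one way of writing the two constraints the text describes, and the helper `rejected` is hypothetical. SQLite checks foreign keys only when asked to, hence the `PRAGMA`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# DDL: define the schema, including a primary key and a referential-integrity constraint.
conn.execute("""
    create table department
        (dept_name char(20) primary key,
         building  char(15),
         budget    numeric(12,2))""")
conn.execute("""
    create table instructor
        (ID        char(5) primary key,
         name      char(20),
         dept_name char(20) references department(dept_name),
         salary    numeric(8,2))""")

# DML: insertion.
conn.execute("insert into department values ('Physics', 'Watson', 70000)")
conn.execute("insert into instructor values ('22222', 'Einstein', 'Physics', 95000)")

# DML: modification.
conn.execute("update instructor set salary = 96000 where ID = '22222'")

# DML: retrieval.
salary = conn.execute(
    "select salary from instructor where ID = '22222'").fetchone()[0]


def rejected(sql):
    """Return True if the database system refuses the statement (hypothetical helper)."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


# A second 'Physics' department violates the primary-key constraint.
duplicate_key = rejected("insert into department values ('Physics', 'Taylor', 1)")
# An instructor in a nonexistent department violates referential integrity.
dangling_ref = rejected(
    "insert into instructor values ('33456', 'Gold', 'Chemistry', 87000)")

# DML: deletion.
conn.execute("delete from instructor where ID = '22222'")
remaining = conn.execute("select count(*) from instructor").fetchone()[0]
```

In both violation cases the offending statement is rejected and the stored data are left unchanged, which matches the normal procedure described for referential-integrity violations in Section 1.4.1.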
Although technically incorrect, it is common practice to use the terms query language and data-manipulation language synonymously.

There are a number of database query languages in use, either commercially or experimentally. We study the most widely used query language, SQL, in Chapter 3 through Chapter 5.

The levels of abstraction that we discussed in Section 1.3 apply not only to defining or structuring data, but also to manipulating data. At the physical level, we must define algorithms that allow efficient access to data. At higher levels of abstraction, we emphasize ease of use. The goal is to allow humans to interact efficiently with the system. The query processor component of the database system (which we study in Chapter 15 and Chapter 16) translates DML queries into sequences of actions at the physical level of the database system. In Chapter 22, we study the processing of queries in the increasingly common parallel and distributed settings.

1.4.4 The SQL Data-Manipulation Language

The SQL query language is nonprocedural. A query takes as input several tables (possibly only one) and always returns a single table. Here is an example of an SQL query that finds the names of all instructors in the History department:

    select instructor.name
    from instructor
    where instructor.dept_name = 'History';

The query specifies that those rows from the table instructor where the dept_name is History must be retrieved, and the name attribute of these rows must be displayed. The result of executing this query is a table with a single column labeled name and a set of rows, each of which contains the name of an instructor whose dept_name is History. If the query is run on the table in Figure 1.1, the result consists of two rows, one with the name El Said and the other with the name Califieri.

Queries may involve information from more than one table. For instance, the following query finds the instructor ID and department name of all instructors associated with a department with a budget of more than $95,000.

    select instructor.ID, department.dept_name
    from instructor, department
    where instructor.dept_name = department.dept_name and
          department.budget > 95000;

If the preceding query were run on the tables in Figure 1.1, the system would find that there are two departments with a budget of greater than $95,000 (Computer Science and Finance); there are five instructors in these departments. Thus, the result consists of a table with two columns (ID, dept_name) and five rows: (12121, Finance), (45565, Computer Science), (10101, Computer Science), (83821, Computer Science), and (76543, Finance).

1.4.5 Database Access from Application Programs

Nonprocedural query languages such as SQL are not as powerful as a universal Turing machine; that is, there are some computations that are possible using a general-purpose programming language but are not possible using SQL. SQL also does not support actions such as input from users, output to displays, or communication over the network. Such computations and actions must be written in a host language, such as C/C++, Java, or Python, with embedded SQL queries that access the data in the database. Application programs are programs that are used to interact with the database in this fashion. Examples in a university system are programs that allow students to register for courses, generate class rosters, calculate student GPA, generate payroll checks, and perform other tasks.

To access the database, DML statements need to be sent from the host to the database where they will be executed. This is most commonly done by using an application-program interface (set of procedures) that can be used to send DML and DDL statements to the database and retrieve the results.
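As an illustration of a host-language program sending DDL and DML statements through such an interface, the sketch below loads the data of Figure 1.1 and runs the two queries of Section 1.4.4. It assumes Python as the host language and the standard `sqlite3` module with an in-memory SQLite database as the interface; the text itself does not prescribe either.

```python
import sqlite3

# Rows of Figure 1.1.
INSTRUCTORS = [
    ("22222", "Einstein", "Physics", 95000),
    ("12121", "Wu", "Finance", 90000),
    ("32343", "El Said", "History", 60000),
    ("45565", "Katz", "Comp. Sci.", 75000),
    ("98345", "Kim", "Elec. Eng.", 80000),
    ("76766", "Crick", "Biology", 72000),
    ("10101", "Srinivasan", "Comp. Sci.", 65000),
    ("58583", "Califieri", "History", 62000),
    ("83821", "Brandt", "Comp. Sci.", 92000),
    ("15151", "Mozart", "Music", 40000),
    ("33456", "Gold", "Physics", 87000),
    ("76543", "Singh", "Finance", 80000),
]
DEPARTMENTS = [
    ("Comp. Sci.", "Taylor", 100000),
    ("Biology", "Watson", 90000),
    ("Elec. Eng.", "Taylor", 85000),
    ("Music", "Packard", 80000),
    ("Finance", "Painter", 120000),
    ("History", "Painter", 50000),
    ("Physics", "Watson", 70000),
]

conn = sqlite3.connect(":memory:")

# DDL statements sent from the host program.
conn.execute("create table department "
             "(dept_name char(20), building char(15), budget numeric(12,2))")
conn.execute("create table instructor "
             "(ID char(5), name char(20), dept_name char(20), salary numeric(8,2))")

# DML statements sent from the host program.
conn.executemany("insert into department values (?, ?, ?)", DEPARTMENTS)
conn.executemany("insert into instructor values (?, ?, ?, ?)", INSTRUCTORS)

# First query of Section 1.4.4: instructors in the History department.
history_names = [row[0] for row in conn.execute(
    "select instructor.name from instructor "
    "where instructor.dept_name = 'History'")]

# Second query: instructors in departments with a budget above $95,000.
well_funded = conn.execute("""
    select instructor.ID, department.dept_name
    from instructor, department
    where instructor.dept_name = department.dept_name
      and department.budget > 95000""").fetchall()

# Output to the display is the host language's job, not SQL's.
print(sorted(history_names))
print(sorted(well_funded))
```

The queries carry no `order by` clause, so the order of the returned rows is not guaranteed; the host program sorts them before printing. The department appears in the result as `Comp. Sci.`, the value stored in Figure 1.1.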
The Open Database Connectivity (ODBC) standard defines application program interfaces for use with C and several other languages. The Java Database Connectivity (JDBC) standard defines a corresponding interface for the Java language.

1.5 Database Design

Database systems are designed to manage large bodies of information. These large bodies of information do not exist in isolation. They are part of the operation of some enterprise whose end product may be information from the database or may be some device or service for which the database plays only a supporting role.

Database design mainly involves the design of the database schema. The design of a complete database application environment that meets the needs of the enterprise being modeled requires attention to a broader set of issues. In this text, we focus on the writing of database queries and the design of database schemas, but discuss application design later, in Chapter 9.

A high-level data model provides the database designer with a conceptual framework in which to specify the data requirements of the database users and how the database will be structured to fulfill these requirements. The initial phase of database design, then, is to characterize fully the data needs of the prospective database users. The database designer needs to interact extensively with domain experts and users to carry out this task. The outcome of this phase is a specification of user requirements.

Next, the designer chooses a data model, and by applying the concepts of the chosen data model, translates these requirements into a conceptual schema of the database. The schema developed at this conceptual-design phase provides a detailed overview of the enterprise. The designer reviews the schema to confirm that all data requirements are indeed satisfied and are not in conflict with one another. The designer can also examine the design to remove any redundant features. The focus at this point is on describing the data and their relationships, rather than on specifying physical storage details.

In terms of the relational model, the conceptual-design process involves decisions on what attributes we want to capture in the database and how to group these attributes to form the various tables. The “what” part is basically a business decision, and we shall not discuss it further in this text. The “how” part is mainly a computer-science problem. There are principally two ways to tackle the problem. The first one is to use the entity-relationship model (Chapter 6); the other is to employ a set of algorithms (collectively known as normalization) that takes as input the set of all attributes and generates a set of tables (Chapter 7).

A fully developed conceptual schema indicates the functional requirements of the enterprise. In a specification of functional requirements, users describe the kinds of operations (or transactions) that will be performed on the data. Example operations include modifying or updating data, searching for and retrieving specific data, and deleting data. At this stage of conceptual design, the designer can review the schema to ensure it meets functional requirements.

The process of moving from an abstract data model to the implementation of the database proceeds in two final design phases. In the logical-design phase, the designer maps the high-level conceptual schema onto the implementation data model of the database system that will be used. The designer uses the resulting system-specific database schema in the subsequent physical-design phase, in which the physical features of the database are specified. These features include the form of file organization and the internal storage structures; they are discussed in Chapter 13.

1.6 Database Engine

A database system is partitioned into modules that deal with each of the responsibilities of the overall system.
The functional components of a database system can be broadly divided into the storage manager, the query processor components, and the transaction management component.

The storage manager is important because databases typically require a large amount of storage space. Corporate databases commonly range in size from hundreds of gigabytes to terabytes of data. A gigabyte is approximately 1 billion bytes, or 1000 megabytes (more precisely, 1024 megabytes), while a terabyte is approximately 1 trillion bytes or 1 million megabytes (more precisely, 1024 gigabytes). The largest enterprises have databases that reach into the multi-petabyte range (a petabyte is 1024 terabytes). Since the main memory of computers cannot store this much information, and since the contents of main memory are lost in a system crash, the information is stored on disks. Data are moved between disk storage and main memory as needed. Since the movement of data to and from disk is slow relative to the speed of the central processing unit, it is imperative that the database system structure the data so as to minimize the need to move data between disk and main memory. Increasingly, solid-state disks (SSDs) are being used for database storage. SSDs are faster than traditional disks but also more costly.

The query processor is important because it helps the database system to simplify and facilitate access to data. The query processor allows database users to obtain good performance while being able to work at the view level and not be burdened with understanding the physical-level details of the implementation of the system. It is the job of the database system to translate updates and queries written in a nonprocedural language, at the logical level, into an efficient sequence of operations at the physical level.
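One way to watch this translation happen is to ask a database system for the plan it has chosen. The sketch below is an illustration under assumptions, not a description of any particular engine's internals: it uses Python's standard `sqlite3` module and SQLite's `explain query plan` statement, whose output wording varies between SQLite versions, and the index name `instructor_dept` and the helpers `plan` and `names` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table instructor "
             "(ID char(5), name char(20), dept_name char(20), salary numeric(8,2))")
conn.executemany("insert into instructor values (?, ?, ?, ?)", [
    ("32343", "El Said", "History", 60000),
    ("58583", "Califieri", "History", 62000),
    ("22222", "Einstein", "Physics", 95000),
])

# The logical-level query never changes.
QUERY = "select name from instructor where dept_name = 'History'"


def plan(sql):
    """Return the plan steps SQLite reports; the last column holds the description."""
    return [row[-1] for row in conn.execute("explain query plan " + sql)]


def names():
    return sorted(row[0] for row in conn.execute(QUERY))


plan_before, result_before = plan(QUERY), names()

# Change only the physical level: add an index on dept_name.
conn.execute("create index instructor_dept on instructor (dept_name)")

plan_after, result_after = plan(QUERY), names()

print(plan_before)
print(plan_after)
```

The query text and its result are identical before and after the index is created; only the sequence of physical-level operations chosen by the system differs. An application that depends only on the query therefore needs no change when the physical organization does.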
The transaction manager is important because it allows application developers to treat a sequence of database accesses as if they were a single unit that either happens in its entirety or not at all. This permits application developers to think at a higher level of abstraction about the application without needing to be concerned with the lower-level details of managing the effects of concurrent access to the data and of system failures.

While database engines were traditionally centralized computer systems, today parallel processing is key for handling very large amounts of data efficiently. Modern database engines pay a lot of attention to parallel data storage and parallel query processing.

1.6.1 Storage Manager

The storage manager is the component of a database system that provides the interface between the low-level data stored in the database and the application programs and queries submitted to the system. The storage manager is responsible for the interaction with the file manager. The raw data are stored on the disk using the file system provided by the operating system. The storage manager translates the various DML statements into low-level file-system commands. Thus, the storage manager is responsible for storing, retrieving, and updating data in the database.

The storage manager components include:

• Authorization and integrity manager, which tests for the satisfaction of integrity constraints and checks the authority of users to access data.

• Transaction manager, which ensures that the database remains in a consistent (correct) state despite system failures, and that concurrent transaction executions proceed without conflicts.

• File manager, which manages the allocation of space on disk storage and the data structures used to represent information stored on disk.

• Buffer manager, which is responsible for fetching data from disk storage into main memory, and deciding what data to cache in main memory. The buffer manager is a critical part of the database system, since it enables the database to handle data sizes that are much larger than the size of main memory.

The storage manager implements several data structures as part of the physical system implementation:

• Data files, which store the database itself.

• Data dictionary, which stores metadata about the structure of the database, in particular the schema of the database.

• Indices, which can provide fast access to data items. Like the index in this textbook, a database index provides pointers to those data items that hold a particular value. For example, we could use an index to find the instructor record with a particular ID, or all instructor records with a particular name.

We discuss storage media, file structures, and buffer management in Chapter 12 and Chapter 13. Methods of accessing data efficiently are discussed in Chapter 14.

1.6.2 The Query Processor

The query processor components include:

• DDL interpreter, which interprets DDL statements and records the definitions in the data dictionary.

• DML compiler, which translates DML statements in a query language into an evaluation plan consisting of low-level instructions that the query-evaluation engine understands. A query can usually be translated into any of a number of alternative evaluation plans that all give the same result. The DML compiler also performs query optimization; that is, it picks the lowest cost evaluation plan from among the alternatives.

• Query evaluation engine, which executes low-level instructions generated by the DML compiler.

Query evaluation is covered in Chapter 15, while the methods by which the query optimizer chooses from among the possible evaluation strategies are discussed in Chapter 16.

1.6.3 Transaction Management

Often, several operations on the database form a single logical unit of work.
An example is a funds transfer, as in Section 1.2, in which one account A is debited and another account B is credited. Clearly, it is essential that either both the credit and debit occur, or that neither occur. That is, the funds transfer must happen in its entirety or not at all. This all-or-none requirement is called atomicity. In addition, it is essential that the execution of the funds transfer preserves the consistency of the database. That is, the value of the sum of the balances of A and B must be preserved. This correctness requirement is called consistency. Finally, after the successful execution of a funds transfer, the new values of the balances of accounts A and B must persist, despite the possibility of system failure. This persistence requirement is called durability.

A transaction is a collection of operations that performs a single logical function in a database application. Each transaction is a unit of both atomicity and consistency. Thus, we require that transactions do not violate any database-consistency constraints. That is, if the database was consistent when a transaction started, the database must be consistent when the transaction successfully terminates. However, during the execution of a transaction, it may be necessary temporarily to allow inconsistency, since either the debit of A or the credit of B must be done before the other. This temporary inconsistency, although necessary, may lead to difficulty if a failure occurs.

It is the programmer’s responsibility to properly define the various transactions so that each preserves the consistency of the database. For example, the transaction to transfer funds from account A to account B could be defined to be composed of two separate programs: one that debits account A and another that credits account B. The execution of these two programs one after the other will indeed preserve consistency.
However, each program by itself does not transform the database from a consistent state to a new consistent state. Thus, those programs are not transactions.

Ensuring the atomicity and durability properties is the responsibility of the database system itself, specifically of the recovery manager. In the absence of failures, all transactions complete successfully, and atomicity is achieved easily. However, because of various types of failure, a transaction may not always complete its execution successfully. If we are to ensure the atomicity property, a failed transaction must have no effect on the state of the database. Thus, the database must be restored to the state in which it was before the transaction in question started executing. The database system must therefore perform failure recovery; that is, it must detect system failures and restore the database to the state that existed prior to the occurrence of the failure.

Finally, when several transactions update the database concurrently, the consistency of data may no longer be preserved, even though each individual transaction is correct. It is the responsibility of the concurrency-control manager to control the interaction among the concurrent transactions, to ensure the consistency of the database. The transaction manager consists of the concurrency-control manager and the recovery manager.

The basic concepts of transaction processing are covered in Chapter 17. The management of concurrent transactions is covered in Chapter 18. Chapter 19 covers failure recovery in detail.

The concept of a transaction has been applied broadly in database systems and applications. While the initial use of transactions was in financial applications, the concept is now used in real-time applications in telecommunication, as well as in the management of long-duration activities such as product design or administrative workflows.
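The all-or-none behavior of the funds transfer can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for a database system; the account table, the transfer function, and the fail_midway flag are hypothetical names introduced only for this illustration and do not come from the text.

```python
import sqlite3


def transfer(conn, src, dst, amount, fail_midway=False):
    """Debit src and credit dst as one transaction (all or nothing)."""
    try:
        # Used as a context manager, the connection commits if the block
        # finishes normally and rolls back if an exception escapes it.
        with conn:
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
            if fail_midway:
                # Simulated failure between the debit and the credit.
                raise RuntimeError("simulated failure")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
        return True
    except RuntimeError:
        return False


def balances(conn):
    """Return the current balances as a dict keyed by account id."""
    return dict(conn.execute("SELECT id, balance FROM account ORDER BY id"))


# Hypothetical starting state: A holds 100, B holds 50.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()

ok = transfer(conn, "A", "B", 30)                      # both updates applied
after_success = balances(conn)
failed = transfer(conn, "A", "B", 30, fail_midway=True)  # debit is undone
after_failure = balances(conn)
```

In this sketch the failed transfer leaves the balances exactly as they were after the successful one, so the sum of the balances of A and B is unchanged in both cases. Durability and concurrency control are not exercised here, since the database lives in memory and only one connection is used.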
1.7 Database and Application Architecture

We are now in a position to provide a single picture of the various components of a database system and the connections among them. Figure 1.3 shows the architecture of a database system that runs on a centralized server machine. The figure summarizes how different types of users interact with a database, and how the different components of a database engine are connected to each other.

[Figure 1.3 System structure. Naive users (tellers, agents, web users), application programmers, sophisticated users (analysts), and database administrators use application interfaces, application programs, query tools, and administration tools. These feed the query processor (compiler and linker, DML queries, DDL interpreter, DML compiler and organizer, query evaluation engine) and the storage manager (buffer manager, file manager, authorization and integrity manager, transaction manager), which sit above disk storage holding data, indices, the data dictionary, and statistical data.]

The centralized architecture shown in Figure 1.3 is applicable to shared-memory server architectures, which have multiple CPUs and exploit parallel processing, but all the CPUs access a common shared memory. To scale up to even larger data volumes and even higher processing speeds, parallel databases are designed to run on a cluster consisting of multiple machines. Further, distributed databases allow data storage and query processing across multiple geographically separated machines.

In Chapter 20, we cover the general structure of modern computer systems, with a focus on parallel system architectures. Chapter 21 and Chapter 22 describe how query processing can be implemented to exploit parallel and distributed processing. Chapter 23 presents a number of issues that arise in processing transactions in a parallel or a distributed database and describes how to deal with each issue.
The issues include how to store data, how to ensure atomicity of transactions that execute at multiple sites, how to perform concurrency control, and how to provide high availability in the presence of failures.

We now consider the architecture of applications that use databases as their backend. Database applications can be partitioned into two or three parts, as shown in Figure 1.4. Earlier-generation database applications used a two-tier architecture, where the application resides at the client machine, and invokes database system functionality at the server machine through query language statements.

In contrast, modern database applications use a three-tier architecture, where the client machine acts as merely a front end and does not contain any direct database calls; web browsers and mobile applications are the most commonly used application clients today. The front end communicates with an application server. The application server, in turn, communicates with a database system to access data. The business logic of the application, which says what actions to carry out under what conditions, is embedded in the application server, instead of being distributed across multiple clients. Three-tier applications provide better security as well as better performance than two-tier applications.

[Figure 1.4 Two-tier and three-tier architectures. (a) Two-tier architecture: user and application at the client, connected over the network to the database system at the server. (b) Three-tier architecture: user and application client at the client, connected over the network to the application server and database system at the server.]

1.8 Database Users and Administrators

A primary goal of a database system is to retrieve information from and store new information in the database. People who work with a database can be categorized as database users or database administrators.

1.8.1 Database Users and User Interfaces

There are four different types of database-system users, differentiated by the way they expect to interact with the system.
Different types of user interfaces have been designed for the different types of users.

• Naïve users are unsophisticated users who interact with the system by using predefined user interfaces, such as web or mobile applications. The typical user interface for naïve users is a forms interface, where the user can fill in appropriate fields of the form. Naïve users may also read reports generated from the database.
  As an example, consider a student who, during the class registration period, wishes to register for a class by using a web interface. Such a user connects to a web application program that runs at a web server. The application first verifies the identity of the user and then allows her to access a form where she enters the desired information. The form information is sent back to the web application at the server, which then determines if there is room in the class (by retrieving information from the database) and, if so, adds the student information to the class roster in the database.
• Application programmers are computer professionals who write application programs. Application programmers can choose from many tools to develop user interfaces.
• Sophisticated users interact with the system without writing programs. Instead, they form their requests either using a database query language or by using tools such as data analysis software. Analysts who submit queries to explore data in the database fall in this category.

1.8.2 Database Administrator

One of the main reasons for using DBMSs is to have central control of both the data and the programs that access those data. A person who has such central control over the system is called a database administrator (DBA). The functions of a DBA include:

• Schema definition. The DBA creates the original database schema by executing a set of data definition statements in the DDL.
• Storage structure and access-method definition.
The DBA may specify some parameters pertaining to the physical organization of the data and the indices to be created.
• Schema and physical-organization modification. The DBA carries out changes to the schema and physical organization to reflect the changing needs of the organization, or to alter the physical organization to improve performance.
• Granting of authorization for data access. By granting different types of authorization, the database administrator can regulate which parts of the database various users can access. The authorization information is kept in a special system structure that the database system consults whenever a user tries to access the data in the system.
• Routine maintenance. Examples of the database administrator’s routine maintenance activities are:
  ° Periodically backing up the database onto remote servers, to prevent loss of data in case of disasters such as flooding.
  ° Ensuring that enough free disk space is available for normal operations, and upgrading disk space as required.
  ° Monitoring jobs running on the database and ensuring that performance is not degraded by very expensive tasks submitted by some users.

1.9 History of Database Systems

Information processing drives the growth of computers, as it has from the earliest days of commercial computers. In fact, automation of data processing tasks predates computers. Punched cards, invented by Herman Hollerith, were used at the very beginning of the twentieth century to record U.S. census data, and mechanical systems were used to process the cards and tabulate results. Punched cards were later widely used as a means of entering data into computers.

Techniques for data storage and processing have evolved over the years:

• 1950s and early 1960s: Magnetic tapes were developed for data storage. Data-processing tasks such as payroll were automated, with data stored on tapes.
Processing of data consisted of reading data from one or more tapes and writing data to a new tape. Data could also be input from punched card decks and output to printers. For example, salary raises were processed by entering the raises on punched cards and reading the punched card deck in synchronization with a tape containing the master salary details. The records had to be in the same sorted order. The salary raises would be added to the salary read from the master tape and written to a new tape; the new tape would become the new master tape.
  Tapes (and card decks) could be read only sequentially, and data sizes were much larger than main memory; thus, data-processing programs were forced to process data in a particular order by reading and merging data from tapes and card decks.

• Late 1960s and early 1970s: Widespread use of hard disks in the late 1960s changed the scenario for data processing greatly, since hard disks allowed direct access to data. The position of data on disk was immaterial, since any location on disk could be accessed in just tens of milliseconds. Data were thus freed from the tyranny of sequentiality. With the advent of disks, the network and hierarchical data models were developed, which allowed data structures such as lists and trees to be stored on disk. Programmers could construct and manipulate these data structures.
  A landmark paper by Edgar Codd in 1970 defined the relational model and nonprocedural ways of querying data in the relational model, and relational databases were born. The simplicity of the relational model and the possibility of hiding implementation details completely from the programmer were enticing indeed. Codd later won the prestigious Association for Computing Machinery Turing Award for his work.
• Late 1970s and 1980s: Although academically interesting, the relational model was not used in practice initially because of its perceived performance disadvantages; relational databases could not match the performance of existing network and hierarchical databases. That changed with System R, a groundbreaking project at IBM Research that developed techniques for the construction of an efficient relational database system. The fully functional System R prototype led to IBM’s first relational database product, SQL/DS. At the same time, the Ingres system was being developed at the University of California at Berkeley. It led to a commercial product of the same name. Also around this time, the first version of Oracle was released. Initial commercial relational database systems, such as IBM DB2, Oracle, Ingres, and DEC Rdb, played a major role in advancing techniques for efficient processing of declarative queries.
  By the early 1980s, relational databases had become competitive with network and hierarchical database systems even in the area of performance. Relational databases were so easy to use that they eventually replaced network and hierarchical databases. Programmers using those older models were forced to deal with many low-level implementation details, and they had to code their queries in a procedural fashion. Most importantly, they had to keep efficiency in mind when designing their programs, which involved a lot of effort. In contrast, in a relational database, almost all these low-level tasks are carried out automatically by the database system, leaving the programmer free to work at a logical level. Since attaining dominance in the 1980s, the relational model has reigned supreme among data models.
  The 1980s also saw much research on parallel and distributed databases, as well as initial work on object-oriented databases.
• 1990s: The SQL language was designed primarily for decision support applications, which are query-intensive, yet the mainstay of databases in the 1980s was transaction-processing applications, which are update-intensive.
  In the early 1990s, decision support and querying re-emerged as a major application area for databases. Tools for analyzing large amounts of data saw a large growth in usage. Many database vendors introduced parallel database products in this period. Database vendors also began to add object-relational support to their databases.
  The major event of the 1990s was the explosive growth of the World Wide Web. Databases were deployed much more extensively than ever before. Database systems now had to support very high transaction-processing rates, as well as very high reliability and 24 × 7 availability (availability 24 hours a day, 7 days a week, meaning no downtime for scheduled maintenance activities). Database systems also had to support web interfaces to data.

• 2000s: The types of data stored in database systems evolved rapidly during this period. Semi-structured data became increasingly important. XML emerged as a data-exchange standard. JSON, a more compact data-exchange format well suited for storing objects from JavaScript or other programming languages, subsequently grew increasingly important. Increasingly, such data were stored in relational database systems as support for the XML and JSON formats was added to the major commercial systems. Spatial data (that is, data that include geographic information) saw widespread use in navigation systems and advanced applications. Database systems added support for such data.
  Open-source database systems, notably PostgreSQL and MySQL, saw increased use. “Auto-admin” features were added to database systems in order to allow automatic reconfiguration to adapt to changing workloads. This helped reduce the human workload in administering a database.
  Social network platforms grew at a rapid pace, creating a need to manage data about connections between people and their posted data that did not fit well into a tabular row-and-column format. This led to the development of graph databases.
  In the latter part of the decade, the use of data analytics and data mining in enterprises became ubiquitous. Database systems were developed specifically to serve this market. These systems featured physical data organizations suitable for analytic processing, such as “column-stores,” in which tables are stored by column rather than the traditional row-oriented storage of the major commercial database systems.
  The huge volumes of data, as well as the fact that much of the data used for analytics was textual or semi-structured, led to the development of programming frameworks, such as map-reduce, to facilitate application programmers’ use of parallelism in analyzing data. In time, support for these features migrated into traditional database systems. Even in the late 2010s, debate continued in the database research community over the relative merits of a single database system serving both traditional transaction processing applications and the newer data-analysis applications versus maintaining separate systems for these roles.
  The variety of new data-intensive applications and the need for rapid development, particularly by startup firms, led to “NoSQL” systems that provide a lightweight form of data management. The name was derived from those systems’ lack of support for the ubiquitous database query language SQL, though the name is now often viewed as meaning “not only SQL.” The lack of a high-level query language based on the relational model gave programmers greater flexibility to work with new types of data. The lack of traditional database systems’ support for strict data consistency provided more flexibility in an application’s use of distributed data stores.
The NoSQL model of “eventual consistency” allowed for distributed copies of data to be inconsistent as long as they would eventually converge in the absence of further updates.

• 2010s: The limitations of NoSQL systems, such as lack of support for consistency and lack of support for declarative querying, were found acceptable by many applications (e.g., social networks), in return for the benefits they provided, such as scalability and availability. However, by the early 2010s it was clear that the limitations made life significantly more complicated for programmers and database administrators. As a result, these systems evolved to provide features to support stricter notions of consistency, while continuing to support high scalability and availability. Additionally, these systems increasingly support higher levels of abstraction to avoid the need for programmers to have to reimplement features that are standard in a traditional database system.
  Enterprises are increasingly outsourcing the storage and management of their data. Rather than maintaining in-house systems and expertise, enterprises may store their data in “cloud” services that host data for various clients in multiple, widely distributed server farms. Data are delivered to users via web-based services. Other enterprises are outsourcing not only the storage of their data but also whole applications. In such cases, termed “software as a service,” the vendor not only stores the data for an enterprise but also runs (and maintains) the application software. These trends result in significant savings in costs, but they create new issues not only in responsibility for security breaches, but also in data ownership, particularly in cases where a government requests access to data.
  The huge influence of data and data analytics in daily life has made the management of data a frequent aspect of the news. There is an unresolved tradeoff between an individual’s right of privacy and society’s need to know.
Various national governments have put regulations on privacy in place. High-profile security breaches have created a public awareness of the challenges in cybersecurity and the risks of storing data.

Chapter 6 Database Design Using the E-R Model

Higher- and lower-level entity sets also may be designated by the terms superclass and subclass, respectively. The person entity set is the superclass of the employee and student subclasses.

For all practical purposes, generalization is a simple inversion of specialization. We apply both processes, in combination, in the course of designing the E-R schema for an enterprise. In terms of the E-R diagram itself, we do not distinguish between specialization and generalization. New levels of entity representation are distinguished (specialization) or synthesized (generalization) as the design schema comes to express fully the database application and the user requirements of the database. Differences in the two approaches may be characterized by their starting point and overall goal.

Specialization stems from a single entity set; it emphasizes differences among entities within the set by creating distinct lower-level entity sets. These lower-level entity sets may have attributes, or may participate in relationships, that do not apply to all the entities in the higher-level entity set. Indeed, the reason a designer applies specialization is to represent such distinctive features. If student and employee have exactly the same attributes as person entities, and participate in exactly the same relationships as person entities, there would be no need to specialize the person entity set.

Generalization proceeds from the recognition that a number of entity sets share some common features (namely, they are described by the same attributes and participate in the same relationship sets). On the basis of their commonalities, generalization synthesizes these entity sets into a single, higher-level entity set.
Generalization is used to emphasize the similarities among lower-level entity sets and to hide the differences; it also permits an economy of representation in that shared attributes are not repeated.

6.8.3 Attribute Inheritance

A crucial property of the higher- and lower-level entities created by specialization and generalization is attribute inheritance. The attributes of the higher-level entity sets are said to be inherited by the lower-level entity sets. For example, student and employee inherit the attributes of person. Thus, student is described by its ID, name, street, and city attributes, and additionally a tot_cred attribute; employee is described by its ID, name, street, and city attributes, and additionally a salary attribute. Attribute inheritance applies through all tiers of lower-level entity sets; thus, instructor and secretary, which are subclasses of employee, inherit the attributes ID, name, street, and city from person, in addition to inheriting salary from employee.

A lower-level entity set (or subclass) also inherits participation in the relationship sets in which its higher-level entity (or superclass) participates. Like attribute inheritance, participation inheritance applies through all tiers of lower-level entity sets. For example, suppose the person entity set participates in a relationship person_dept with department. Then, the student, employee, instructor, and secretary entity sets, which are subclasses of the person entity set, also implicitly participate in the person_dept relationship with department. These entity sets can participate in any relationships in which the person entity set participates.

Whether a given portion of an E-R model was arrived at by specialization or generalization, the outcome is basically the same:

• A higher-level entity set with attributes and relationships that apply to all of its lower-level entity sets.
• Lower-level entity sets with distinctive features that apply only within a particular lower-level entity set.

In what follows, although we often refer to only generalization, the properties that we discuss belong fully to both processes.

Figure 6.18 depicts a hierarchy of entity sets. In the figure, employee is a lower-level entity set of person and a higher-level entity set of the instructor and secretary entity sets. In a hierarchy, a given entity set may be involved as a lower-level entity set in only one ISA relationship; that is, entity sets in this diagram have only single inheritance. If an entity set is a lower-level entity set in more than one ISA relationship, then the entity set has multiple inheritance, and the resulting structure is said to be a lattice.

6.8.4 Constraints on Specializations

To model an enterprise more accurately, the database designer may choose to place certain constraints on a particular generalization/specialization.

One type of constraint on specialization, which we saw earlier, specifies whether a specialization is disjoint or overlapping. Another type of constraint on a specialization/generalization is a completeness constraint, which specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within the generalization/specialization. This constraint may be one of the following:

• Total specialization or generalization. Each higher-level entity must belong to a lower-level entity set.
• Partial specialization or generalization. Some higher-level entities may not belong to any lower-level entity set.

Partial specialization is the default. We can specify total specialization in an E-R diagram by adding the keyword “total” in the diagram and drawing a dashed line from the keyword to the corresponding hollow arrowhead to which it applies (for a total specialization), or to the set of hollow arrowheads to which it applies (for an overlapping specialization).
The specialization of person to student or employee is total if the university does not need to represent any person who is neither a student nor an employee. However, if the university needs to represent such persons, then the specialization would be partial.

The completeness and disjointness constraints do not depend on each other. Thus, specializations may be partial-overlapping, partial-disjoint, total-overlapping, and total-disjoint.

We can see that certain insertion and deletion requirements follow from the constraints that apply to a given generalization or specialization. For instance, when a total completeness constraint is in place, an entity inserted into a higher-level entity set must also be inserted into at least one of the lower-level entity sets. An entity that is deleted from a higher-level entity set must also be deleted from all the associated lower-level entity sets to which it belongs.

6.8.5 Aggregation

One limitation of the E-R model is that it cannot express relationships among relationships. To illustrate the need for such a construct, consider the ternary relationship proj_guide, which we saw earlier, between an instructor, student, and project (see Figure 6.6).

Now suppose that each instructor guiding a student on a project is required to file a monthly evaluation report. We model the evaluation report as an entity evaluation, with a primary key evaluation_id. One alternative for recording the (student, project, instructor) combination to which an evaluation corresponds is to create a quaternary (4-way) relationship set eval_for between instructor, student, project, and evaluation. (A quaternary relationship is required: a binary relationship between student and evaluation, for example, would not permit us to represent the (project, instructor) combination to which an evaluation corresponds.) Using the basic E-R modeling constructs, we obtain the E-R diagram of Figure 6.19.
(We have omitted the attributes of the entity sets, for simplicity.)

[Figure 6.19 E-R diagram with redundant relationships: entity sets project, instructor, student, and evaluation; relationship sets proj_guide and eval_for.]

It appears that the relationship sets proj_guide and eval_for can be combined into one single relationship set. Nevertheless, we should not combine them into a single relationship, since some instructor, student, project combinations may not have an associated evaluation.

There is redundant information in the resultant figure, however, since every instructor, student, project combination in eval_for must also be in proj_guide. If evaluation were modeled as a value rather than an entity, we could instead make evaluation a multivalued composite attribute of the relationship set proj_guide. However, this alternative may not be an option if an evaluation may also be related to other entities; for example, each evaluation report may be associated with a secretary who is responsible for further processing of the evaluation report to make scholarship payments.

The best way to model a situation such as the one just described is to use aggregation. Aggregation is an abstraction through which relationships are treated as higher-level entities. Thus, for our example, we regard the relationship set proj_guide (relating the entity sets instructor, student, and project) as a higher-level entity set called proj_guide. Such an entity set is treated in the same manner as is any other entity set. We can then create a binary relationship eval_for between proj_guide and evaluation to represent which (student, project, instructor) combination an evaluation is for. Figure 6.20 shows a notation for aggregation commonly used to represent this situation.

[Figure 6.20 E-R diagram with aggregation: the relationship set proj_guide among project, instructor, and student is treated as a higher-level entity set, related to evaluation through eval_for.]
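To make the effect of aggregation concrete, the following is a rough relational sketch using Python's built-in sqlite3 module. The column names instructor_id, student_id, and project_id, the sample rows, and the choice of the union of both keys as the primary key of eval_for are assumptions made for this illustration; the text does not state cardinality constraints for eval_for. The point of the sketch is that eval_for refers to a proj_guide combination as a whole, so an evaluation cannot name a combination that is absent from proj_guide, while a combination may exist without any evaluation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite checks foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE proj_guide (
    instructor_id TEXT NOT NULL,
    student_id    TEXT NOT NULL,
    project_id    TEXT NOT NULL,
    PRIMARY KEY (instructor_id, student_id, project_id)
);
CREATE TABLE evaluation (
    evaluation_id TEXT PRIMARY KEY
);
-- eval_for relates an evaluation to one whole proj_guide combination.
CREATE TABLE eval_for (
    evaluation_id TEXT NOT NULL REFERENCES evaluation (evaluation_id),
    instructor_id TEXT NOT NULL,
    student_id    TEXT NOT NULL,
    project_id    TEXT NOT NULL,
    PRIMARY KEY (evaluation_id, instructor_id, student_id, project_id),
    FOREIGN KEY (instructor_id, student_id, project_id)
        REFERENCES proj_guide (instructor_id, student_id, project_id)
);
""")

# Hypothetical sample data: two guided combinations, one evaluation.
conn.execute("INSERT INTO proj_guide VALUES ('i1', 's1', 'p1')")
conn.execute("INSERT INTO proj_guide VALUES ('i1', 's2', 'p1')")  # no evaluation yet
conn.execute("INSERT INTO evaluation VALUES ('e1')")
conn.execute("INSERT INTO eval_for VALUES ('e1', 'i1', 's1', 'p1')")

# An evaluation for a combination that is not in proj_guide is refused.
try:
    conn.execute("INSERT INTO eval_for VALUES ('e1', 'i9', 's9', 'p9')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Combinations that exist in proj_guide but have no evaluation.
unevaluated = conn.execute("""
    SELECT instructor_id, student_id, project_id FROM proj_guide
    EXCEPT
    SELECT instructor_id, student_id, project_id FROM eval_for
""").fetchall()
```

Here the foreign key from eval_for to proj_guide plays the role of the aggregation: the requirement that every combination in eval_for also appears in proj_guide is enforced once, rather than being recorded redundantly in two independent relationship sets.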
6.8.6 Reduction to Relation Schemas

We are in a position now to describe how the extended E-R features can be translated into relation schemas.

6.8.6.1 Representation of Generalization

There are two different methods of designing relation schemas for an E-R diagram that includes generalization. Although we refer to the generalization in Figure 6.18 in this discussion, we simplify it by including only the first tier of lower-level entity sets, that is, employee and student. We assume that ID is the primary key of person.

1. Create a schema for the higher-level entity set. For each lower-level entity set, create a schema that includes an attribute for each of the attributes of that entity set plus one for each attribute of the primary key of the higher-level entity set. Thus, for the E-R diagram of Figure 6.18 (ignoring the instructor and secretary entity sets) we have three schemas:

   person (ID, name, street, city)
   employee (ID, salary)
   student (ID, tot_cred)

   The primary-key attributes of the higher-level entity set become primary-key attributes of the higher-level entity set as well as all lower-level entity sets. In the preceding example, the primary key of each of the three schemas is ID.
   In addition, we create foreign-key constraints on the lower-level entity sets, with their primary-key attributes referencing the primary key of the relation created from the higher-level entity set. In the preceding example, the ID attribute of employee would reference the primary key of person, and similarly for student.

2. An alternative representation is possible if the generalization is disjoint and complete, that is, if no entity is a member of two lower-level entity sets directly below a higher-level entity set, and if every entity in the higher-level entity set is also a member of one of the lower-level entity sets. Here, we do not create a schema for the higher-level entity set.
Instead, for each lower-level entity set, we create a schema that includes an attribute for each of the attributes of that entity set plus one for each attribute of the higher-level entity set. Then, for the E-R diagram of Figure 6.18, we have two schemas:

employee (ID, name, street, city, salary)
student (ID, name, street, city, tot_cred)

Both these schemas have ID, which is the primary-key attribute of the higher-level entity set person, as their primary key.

One drawback of the second method lies in defining foreign-key constraints. To illustrate the problem, suppose we have a relationship set R involving entity set person. With the first method, when we create a relation schema R from the relationship set, we also define a foreign-key constraint on R, referencing the schema person. Unfortunately, with the second method, we do not have a single relation to which a foreign-key constraint on R can refer. To avoid this problem, we need to create a relation schema person containing at least the primary-key attributes of the person entity.

If the second method were used for an overlapping generalization, some values would be stored multiple times, unnecessarily. For instance, if a person is both an employee and a student, values for street and city would be stored twice.

If the generalization were disjoint but not complete (that is, if some person is neither an employee nor a student), then an extra schema

person (ID, name, street, city)

would be required to represent such people. However, the problem with foreign-key constraints mentioned above would remain. As an attempt to work around the problem, suppose employees and students are additionally represented in the person relation. Unfortunately, name, street, and city information would then be stored redundantly in the person relation and the student relation for students, and similarly in the person relation and the employee relation for employees.
That suggests storing name, street, and city information only in the person relation and removing that information from student and employee. If we do that, the result is exactly the first method we presented.

6.8.6.2 Representation of Aggregation

Designing schemas for an E-R diagram containing aggregation is straightforward. Consider Figure 6.20. The schema for the relationship set eval_for between the aggregation of proj_guide and the entity set evaluation includes an attribute for each attribute in the primary keys of the entity set evaluation and the relationship set proj_guide. It also includes an attribute for any descriptive attributes, if they exist, of the relationship set eval_for. We then transform the relationship sets and entity sets within the aggregated entity set following the rules we have already defined.

The rules we saw earlier for creating primary-key and foreign-key constraints on relationship sets can be applied to relationship sets involving aggregations as well, with the aggregation treated like any other entity set. The primary key of the aggregation is the primary key of its defining relationship set. No separate relation is required to represent the aggregation; the relation created from the defining relationship is used instead.

6.9 Entity-Relationship Design Issues

The notions of an entity set and a relationship set are not precise, and it is possible to define a set of entities and the relationships among them in a number of different ways. In this section, we examine basic issues in the design of an E-R database schema. Section 6.11 covers the design process in further detail.
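Before turning to those design issues, the first method of Section 6.8.6.1 can be made concrete with a small sketch. It uses Python's standard sqlite3 module; the column types and the sample row are assumptions made for illustration and are not part of the text. It shows that name, street, and city are stored once even when the same person is both an employee and a student, and that the foreign keys keep lower-level rows tied to person.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
-- Method 1: one schema for the higher-level entity set ...
CREATE TABLE person (
    ID     TEXT PRIMARY KEY,
    name   TEXT,
    street TEXT,
    city   TEXT
);
-- ... and one per lower-level entity set, keyed by the same ID.
CREATE TABLE employee (
    ID     TEXT PRIMARY KEY REFERENCES person (ID),
    salary NUMERIC
);
CREATE TABLE student (
    ID       TEXT PRIMARY KEY REFERENCES person (ID),
    tot_cred INTEGER
);
""")

# Hypothetical sample data: one person who is both an employee and a student.
conn.execute("INSERT INTO person VALUES ('p1', 'Ana', 'Main St', 'Lisbon')")
conn.execute("INSERT INTO employee VALUES ('p1', 50000)")
conn.execute("INSERT INTO student VALUES ('p1', 12)")

# The shared attributes live only in person; a join reassembles the full view.
rows = conn.execute("""
    SELECT p.name, p.city, e.salary, s.tot_cred
    FROM person p
    JOIN employee e ON e.ID = p.ID
    JOIN student  s ON s.ID = p.ID
""").fetchall()

# A lower-level row without a matching person row violates the foreign key.
try:
    conn.execute("INSERT INTO employee VALUES ('p2', 1)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

Under the second method, the same overlapping person would carry name, street, and city in both employee and student, which is the redundancy the text warns about.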
Figure 6.21 Example of erroneous E-R diagrams: (a) incorrect use of attribute; (b) erroneous use of relationship attributes.

6.9.1 Common Mistakes in E-R Diagrams

A common mistake when creating E-R models is the use of the primary key of an entity set as an attribute of another entity set, instead of using a relationship. For example, in our university E-R model, it is incorrect to have dept_name as an attribute of student, as depicted in Figure 6.21a, even though it is present as an attribute in the relation schema for student. The relationship stud_dept is the correct way to represent this information in the E-R model, since it makes the relationship between student and department explicit, rather than implicit via an attribute. Having an attribute dept_name as well as a relationship stud_dept would result in duplication of information.

Another related mistake that people sometimes make is to designate the primary-key attributes of the related entity sets as attributes of the relationship set. For example, ID (the primary key of student) and ID (the primary key of instructor) should not appear as attributes of the relationship advisor. This should not be done since the primary-key attributes are already implicit in the relationship set (see footnote 6).

A third common mistake is to use a relationship with a single-valued attribute in a situation that requires a multivalued attribute. For example, suppose we decided to represent the marks that a student gets in different assignments of a course offering (section). A wrong way of doing this would be to add two attributes assignment and marks to the relationship takes, as depicted in Figure 6.21b.
The problem with this design is that we can only represent a single assignment for a given student-section pair, since relationship instances must be uniquely identified by the participating entities, student and section.

Footnote 6. When we create a relation schema from the E-R schema, the attributes may appear in a schema created from the advisor relationship set, as we shall see later; however, they should not appear in the advisor relationship set.

One solution to the problem depicted in Figure 6.21b, shown in Figure 6.22c, is to model assignment as a weak entity identified by section, and to add a relationship marks_in between assignment and student; the relationship would have an attribute marks. An alternative solution, shown in Figure 6.22d, is to add a multivalued composite attribute {assignment_marks} to takes, where assignment_marks has component attributes assignment and marks. Modeling an assignment as a weak entity is preferable in this case, since it allows recording other information about the assignment, such as maximum marks or deadlines.

Figure 6.22 Correct versions of the E-R diagram of Figure 6.21: (c) and (d) are correct alternatives to the erroneous E-R diagram (b).

When an E-R diagram becomes too big to draw in a single piece, it makes sense to break it up into pieces, each showing part of the E-R model. When doing so, you may need to depict an entity set on more than one page. As discussed in Section 6.2.2, attributes of the entity set should be shown only once, in its first occurrence. Subsequent occurrences of the entity set should be shown without any attributes, to avoid repeating the same information at multiple places, which may lead to inconsistency.
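The third mistake above, and its weak-entity fix, can be illustrated with a minimal sketch using Python's standard sqlite3 module. Several simplifications are assumptions made for illustration: a section is identified by a single section_id rather than its full key, the table name takes_flawed and the attribute max_marks are hypothetical, and the student and section tables are left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
-- Flawed design: assignment and marks as attributes of the relationship,
-- whose instances are identified by the (student, section) pair alone.
CREATE TABLE takes_flawed (
    student_id TEXT,
    section_id TEXT,
    assignment TEXT,
    marks      INTEGER,
    PRIMARY KEY (student_id, section_id)
);
-- Weak entity: an assignment is identified by its section plus a name.
CREATE TABLE assignment (
    section_id TEXT,
    name       TEXT,
    max_marks  INTEGER,
    PRIMARY KEY (section_id, name)
);
-- Relationship marks_in between student and assignment, with attribute marks.
CREATE TABLE marks_in (
    student_id TEXT,
    section_id TEXT,
    name       TEXT,
    marks      INTEGER,
    PRIMARY KEY (student_id, section_id, name),
    FOREIGN KEY (section_id, name) REFERENCES assignment (section_id, name)
);
""")

# In the flawed design, a second assignment for the same pair is refused.
conn.execute("INSERT INTO takes_flawed VALUES ('s1', 'sec1', 'hw1', 90)")
try:
    conn.execute("INSERT INTO takes_flawed VALUES ('s1', 'sec1', 'hw2', 85)")
    flawed_accepts_second = True
except sqlite3.IntegrityError:
    flawed_accepts_second = False

# In the weak-entity design, each assignment gets its own marks row.
conn.executemany("INSERT INTO assignment VALUES (?, ?, ?)",
                 [("sec1", "hw1", 100), ("sec1", "hw2", 100)])
conn.executemany("INSERT INTO marks_in VALUES (?, ?, ?, ?)",
                 [("s1", "sec1", "hw1", 90), ("s1", "sec1", "hw2", 85)])
marks = conn.execute(
    "SELECT name, marks FROM marks_in WHERE student_id = 's1' ORDER BY name"
).fetchall()
```

The assignment table also gives a natural home for extra information such as max_marks, which is the reason the text prefers the weak-entity design over the multivalued composite attribute.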
6.9.2 Use of Entity Sets versus Attributes

Consider the entity set instructor with the additional attribute phone_number (Figure 6.23a). It can be argued that a phone is an entity in its own right with attributes phone_number and location; the location may be the office or home where the phone is located, with mobile (cell) phones perhaps represented by the value “mobile.” If we take this point of view, we do not add the attribute phone_number to the instructor. Rather, we create:

• A phone entity set with attributes phone_number and location.
• A relationship set inst_phone, denoting the association between instructors and the phones that they have.

This alternative is shown in Figure 6.23b.

Figure 6.23 Alternatives for adding phone to the instructor entity set.

What, then, is the main difference between these two definitions of an instructor? Treating a phone as an attribute phone_number implies that instructors have precisely one phone number each. Treating a phone as an entity phone permits instructors to have several phone numbers (including zero) associated with them. However, we could instead easily define phone_number as a multivalued attribute to allow multiple phones per instructor.

The main difference then is that treating a phone as an entity better models a situation where one may want to keep extra information about a phone, such as its location, or its type (mobile, IP phone, or plain old phone), or all who share the phone. Thus, treating phone as an entity is more general than treating it as an attribute and is appropriate when the generality may be useful.

In contrast, it would not be appropriate to treat the attribute name (of an instructor) as an entity; it is difficult to argue that name is an entity in its own right (in contrast to the phone).
Thus, it is appropriate to have name as an attribute of the instructor entity set.

Two natural questions thus arise: What constitutes an attribute, and what constitutes an entity set? Unfortunately, there are no simple answers. The distinctions mainly depend on the structure of the real-world enterprise being modeled and on the semantics associated with the attribute in question.

6.9.3 Use of Entity Sets versus Relationship Sets

It is not always clear whether an object is best expressed by an entity set or a relationship set. In Figure 6.15, we used the takes relationship set to model the situation where a student takes a (section of a) course. An alternative is to imagine that there is a course-registration record for each course that each student takes. Then, we have an entity set to represent the course-registration record. Let us call that entity set registration. Each registration entity is related to exactly one student and to exactly one section, so we have two relationship sets, one to relate course-registration records to students and one to relate course-registration records to sections. In Figure 6.24, we show the entity sets section and student from Figure 6.15 with the takes relationship set replaced by one entity set and two relationship sets:

• registration, the entity set representing course-registration records.
• section_reg, the relationship set relating registration and section.
• student_reg, the relationship set relating registration and student.

Figure 6.24 Replacement of takes by registration and two relationship sets.

Note that we use double lines to indicate total participation by registration entities.

Both the approach of Figure 6.15 and that of Figure 6.24 accurately represent the university’s information, but the use of takes is more compact and probably preferable.
However, if the registrar’s office associates other information with a course-registration record, it might be best to make it an entity in its own right.

One possible guideline in determining whether to use an entity set or a relationship set is to designate a relationship set to describe an action that occurs between entities. This approach can also be useful in deciding whether certain attributes may be more appropriately expressed as relationships.

6.9.4 Binary versus n-ary Relationship Sets

Relationships in databases are often binary. Some relationships that appear to be nonbinary could actually be better represented by several binary relationships. For instance, one could create a ternary relationship parent, relating a child to his/her mother and father. However, such a relationship could also be represented by two binary relationships, mother and father, relating a child to his/her mother and father separately. Using the two relationships mother and father provides us with a record of a child’s mother, even if we are not aware of the father’s identity; a null value would be required if the ternary relationship parent were used. Using binary relationship sets is preferable in this case.

In fact, it is always possible to replace a nonbinary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets. For simplicity, consider the abstract ternary (n = 3) relationship set R, relating entity sets A, B, and C. We replace the relationship set R with an entity set E, and we create three relationship sets as shown in Figure 6.25:

• RA, a many-to-one relationship set from E to A.
• RB, a many-to-one relationship set from E to B.
• RC, a many-to-one relationship set from E to C.

E is required to have total participation in each of RA, RB, and RC.
If the relationship set R had any attributes, these are assigned to entity set E; further, a special identifying attribute is created for E (since it must be possible to distinguish different entities in an entity set on the basis of their attribute values). For each relationship (ai, bi, ci) in the relationship set R, we create a new entity ei in the entity set E. Then, in each of the three new relationship sets, we insert a relationship as follows:

• (ei, ai) in RA.
• (ei, bi) in RB.
• (ei, ci) in RC.

We can generalize this process in a straightforward manner to n-ary relationship sets. Thus, conceptually, we can restrict the E-R model to include only binary relationship sets. However, this restriction is not always desirable.

• An identifying attribute may have to be created for the entity set created to represent the relationship set. This attribute, along with the extra relationship sets required, increases the complexity of the design and (as we shall see in Section 6.7) overall storage requirements.
• An n-ary relationship set shows more clearly that several entities participate in a single relationship.
• There may not be a way to translate constraints on the ternary relationship into constraints on the binary relationships. For example, consider a constraint that says that R is many-to-one from A, B to C; that is, each pair of entities from A and B is associated with at most one C entity. This constraint cannot be expressed by using cardinality constraints on the relationship sets RA, RB, and RC.

Figure 6.25 Ternary relationship versus three binary relationships: (a) the ternary relationship set R; (b) the entity set E with binary relationship sets RA, RB, and RC.

Consider the relationship set proj_guide in Section 6.2.2, relating instructor, student, and project. We cannot directly split proj_guide into binary relationships between instructor and project and between instructor and student.
If we did so, we would be able to record that instructor Katz works on projects A and B with students Shankar and Zhang; however, we would not be able to record that Katz works on project A with student Shankar and works on project B with student Zhang, but does not work on project A with Zhang or on project B with Shankar.

The relationship set proj_guide can be split into binary relationships by creating a new entity set as described above. However, doing so would not be very natural.

6.10 Alternative Notations for Modeling Data

A diagrammatic representation of the data model of an application is a very important part of designing a database schema. Creation of a database schema requires not only data modeling experts, but also domain experts who know the requirements of the application but may not be familiar with data modeling. An intuitive diagrammatic representation is particularly important since it eases communication of information between these groups of experts.

A number of alternative notations for modeling data have been proposed, of which E-R diagrams and UML class diagrams are the most widely used. There is no universal standard for E-R diagram notation, and different books and E-R diagram software use different notations. In the rest of this section, we study some of the alternative E-R diagram notations, as well as the UML class diagram notation. To aid in comparison of our notation with these alternatives, Figure 6.26 summarizes the set of symbols we have used in our E-R diagram notation.

6.10.1 Alternative E-R Notations

Figure 6.27 indicates some of the alternative E-R notations that are widely used. One alternative representation of attributes of entities is to show them in ovals connected
E E entity set A1 attributes: A2 simple (A1), A2.1 composite (A2) and R relationship set A2.2 multivalued (A3) derived (A4)