CMP 214: DATA MANAGEMENT I (3 Units)

Course Contents
Information storage and retrieval, information management applications, information capture and representation, analysis and indexing, search, retrieval, information privacy, integrity, security, scalability, efficiency and effectiveness. Introduction to database systems, components of a database system, DBMS functions, database architecture and data independence, use of a database query language.

Objectives:
1. Demonstrate good knowledge of basic database concepts, including the structure and operation of the relational data model.
2. Understand and successfully apply logical database design principles, including E-R diagrams and database normalization.
3. Understand information storage and retrieval.
4. Assess the quality and ease of use of data modelling and diagramming tools.
5. Understand information capture and representation.
6. Design and implement a small database project.

Learning outcomes: At the end of this course, students should be able to:
1. Define data management and its importance.
2. Know the basic database terminologies.
3. Know database management systems.
4. Know DBMS architecture.
5. Know DBMS languages.
6. Know DBMS data models.
7. Know the ERM model.
8. Understand design theory for relational databases.
9. Understand relational algebra.
10. Understand file organization methods.
11. Understand file characteristics and file processing activities.
12. Understand the concept of buffers and file security techniques.
13. Design and implement a small database project using Microsoft Access/MySQL.

References/Further Reading
1. Silberschatz, A., Korth, H. F., & Sudarshan, S. (2011). Database System Concepts (6th ed.). McGraw-Hill, New York.

Course Writer/Developer: Dr. F.E. AYO, Department of Mathematical Sciences, Computer Science Unit, Olabisi Onabanjo University, Ago-Iwoye. E-mail: [email protected]

INTRODUCTION
Data management plays a significant role in an organization's ability to generate revenue, control costs and mitigate risks. Successfully being able to share, store, protect and retrieve the ever-increasing amount of data can be the competitive advantage needed to grow in today's business environment. Management of data generally focuses on defining the data element and how it is structured, stored and moved. Management of information is more concerned with the security, accuracy, completeness and timeliness of multiple pieces of data. These are all concerns that accountants are trained to assess and help manage for an organization.

What is Data Management and Why is It Important?
The definition provided by the Data Management Association (DAMA) is: "Data management is the development, execution and supervision of plans, policies, programs and practices that control, protect, deliver and enhance the value of data and information assets". Data management is the process of storing, organizing and maintaining the data created and collected by an organization. The data management process includes a combination of different functions that collectively aim to make sure the data in the organization is accurate, available and accessible. Managing customer data results in improved customer relationships, which ultimately drives revenue. While expanded data storage requirements have increased equipment investments, there are also many other hidden costs associated with data management. Some of these costs include power consumption, cooling requirements, installation, cabling, backup management, and data recovery.
Inherent within all of these costs is the need for more time and space, leading to increases in payroll and occupancy expenses. Lastly, but just as important, data management plays a key role in helping an organization mitigate risks. For example, establishing a formal data retention policy can help decrease storage costs and reduce litigation risks.

Basic Terminologies

Data
Data is the raw representation of unprocessed facts, figures, concepts or instructions. It can exist in any form, usable or not. Data are facts presented without relation to other things. E.g. it is raining.

Information
Information is processed data that has been given meaning by way of relational connection. This "meaning" can be useful, but does not have to be. In computer parlance, a relational database makes information from the data stored within it. Information embodies understanding of some sort. E.g. the temperature dropped to 15 degrees and then it started raining.

Database
A database is a collection of organized and related files. For instance, the collection of staff-file, student-file and equipment-file of an institution is referred to as the institution's database. It is a collection of data organized for storage in computer memory and designed for easy access by authorized users. The data may be in the form of text, numbers, or encoded graphics. A database is a shared, integrated computer structure that is a repository for:
- End-user data, that is, raw facts of interest to the end user
- Metadata, or data about data, which describes the data characteristics and the set of relationships that link the data found within the database.
A database is a collection of data, typically describing the activities of one or more related organizations. For example, a university database might contain information about the following: entities such as students, faculty, courses, and classrooms; and relationships between entities, such as students' enrolment in courses, faculty teaching courses, and the use of rooms for courses. Proper storage of data in a database will enhance efficient:
- Data management
- Data processing
- Data retrieval

Database System
Refers to an organization of components that define and regulate the collection, storage and management of data. From a general management point of view, the DB system is composed of:
- Hardware
- Software
- People: system administrators (oversee database system operations), DB administrators (manage the DBMS and ensure the DB is functioning properly), DB designers, and system analysts and programmers (design and implement the application programs), as well as end users
- Procedures
- Data

Database management system (DBMS)
A database management system (DBMS) is a collection of programs that manages the database structure and controls access to the data stored in the database. In a sense, a database resembles a very well-organized electronic filing cabinet in which powerful software (the DBMS) helps manage the cabinet's contents.

Advantages of a DBMS
Data independence: This is the technique that allows data to be changed without affecting the applications that process it. We can change the way the database is physically stored and accessed without having to make corresponding changes to the way the database is perceived by the user. Changing the way the database is physically stored and accessed is almost always done to improve performance, and the fact that we can make such changes without having to change the way the database looks to the user means that existing application programs, queries, and the like can all still work after the change.
Application programs should be as independent as possible from details of data representation and storage. The DBMS can provide an abstract view of the data to insulate application code from such details.
Efficient data access: A DBMS deploys sophisticated techniques to store and retrieve data efficiently.
Data integrity control: The DBMS can enforce integrity constraints on the data. For example, before inserting salary information for an employee, the DBMS can check that the department budget is not exceeded. Also, updating the status for supplier S1 to 200 will be rejected if status values are never supposed to exceed 100.
Security control: The DBMS can enforce access controls that govern what data is visible to different classes of users. Users are only allowed to perform operations they are permitted to carry out on the data.
Concurrent access and crash recovery: A DBMS schedules concurrent accesses to the data in such a manner that users can think of the data as being accessed by only one user at a time. Further, the DBMS protects users from the effects of system failures.
Reduced application development time: Clearly, the DBMS supports many important functions that are common to many applications accessing data stored in the DBMS. This, in conjunction with the high-level interface to the data, facilitates quick development of applications. Such applications are also likely to be more robust than applications developed from scratch because many important tasks are handled by the DBMS instead of being implemented by the application.
Data Dictionary Management: The data dictionary is where the DBMS stores definitions of the data elements and their relationships (metadata). The DBMS uses this function to look up the required data component structures and relationships.
Data Storage Management: This particular function is used for the storage of data and any related data entry forms or screen definitions, report definitions, data validation rules, procedural code, and structures that can handle video and picture formats. Users do not need to know how data is stored or manipulated. Also involved with this function is a term called performance tuning, which relates to a database's efficiency in relation to storage and access speed.
Data Transformation and Presentation: This function exists to transform any data entered into the required data structures. By using the data transformation and presentation function, the DBMS can determine the difference between logical and physical data formats.
Multiuser Access Control: Data integrity and data consistency are the basis of this function. Multiuser access control is a very useful tool in a DBMS; it enables multiple users to access the database simultaneously without affecting the integrity of the database.
Backup and Recovery Management: Backup and recovery come to mind whenever there are potential outside threats to a database. For example, if there is a power outage, recovery management determines how long it takes to recover the database after the outage. Backup management refers to data safety and integrity; for example, backing up all your mp3 files on a disk.
Data Integrity Management: The DBMS enforces integrity rules to reduce problems such as data redundancy, which is when data is stored in more than one place unnecessarily, and to maximize data consistency, making sure the database returns the same, correct answer each time the same question is asked.
Database Access Languages and Application Programming Interfaces: A query language is a nonprocedural language.
An example of this is SQL (structured query language). SQL is the most common query language supported by the majority of DBMS vendors. The use of this language makes it easy for user to specify what they want done without the headache of explaining how to specifically do it. Database Communication Interfaces: This refers to how a DBMS can accept different end user requests through different network environments. An example of this can be easily related to the internet. A DBMS can provide access to the database using the Internet through Web Browsers (Mozilla Firefox, Internet Explorer, and Netscape). Transaction Management: This refers to how a DBMS must supply a method that will guarantee that all the updates in a given transaction are made or not made. All transactions must follow what is called the ACID (Atomicity, Consistency, Isolation, Durability) properties. DBMS ARCHITECTURE The DBMS provides users with an abstract view of the data in it i.e. the system hides certain details of how the data is stored and maintained from users. A DBMS can be viewed as divided into levels of abstraction. A common architecture generally used is the ANSI/SPARC (American National Standards Institute - Standards Planning and Requirements Committee) model. The ANSI/SPARC model abstracts the DBMS into a 3-tier architecture as follows: - External level - Conceptual level - Internal level DBMS architecture External level: The external level is the user’s view of the database and closest to the users. It presents only the relevant part of the DBMS to the user. E.g. A bank database stores a lot more information but an account holder is only interested in his/her account details such as the current account balance, transaction history etc. An external schema describes each external view. The external schema consists of the definition of the logical records and the relationships in the external view. In the external level, the different views may have different representations of the same data. Conceptual level: At this level of database abstraction, all the database entities and relationships among them are included. Conceptual level provides the community view of the database and describes what data is stored in the database and the relationships among the data. In other words, the conceptual view represents the entire database of an organization. It is a complete view of the data requirements of the organization that is independent of any storage consideration. The conceptual schema defines conceptual view. It is also called the logical schema. There is only one conceptual schema per database. The figure shows the conceptual view record of a data base. Internal level or physical level: The lowest level of abstraction is the internal level. It is the one closest to physical storage device. This level is also termed as physical level, because it describes how data are actually stored on the storage medium such as hard disk, magnetic tape etc. This level indicates how the data will be stored in the database and describe the data structures, file structures and access methods to be used by the database. The internal schema defines the internal level. The internal schema contains the definition of the stored record, the methods of representing the data fields and accessed methods used. The figure shows the internal view record of a database. Database Architecture Database applications are usually partitioned into two or three parts as in Figure (a) below. 
Two-tier architecture
In a two-tier architecture, the application resides at the client machine, where it invokes database system functionality at the server machine through query language statements. Application program interface standards like ODBC and JDBC are used for interaction between the client and the server.
Three-tier architecture
In contrast, in a three-tier architecture, the client machine acts as merely a front end and does not contain any direct database calls. Instead, the client end communicates with an application server, usually through a forms interface. The application server in turn communicates with a database system to access data. The business logic of the application, which says what actions to carry out under what conditions, is embedded in the application server, instead of being distributed across multiple clients. Three-tier applications are more appropriate for large applications, and for applications that run on the World Wide Web.
Data independence
A database system normally contains a lot of data in addition to users' data. For example, it stores data about data, known as metadata, to locate and retrieve data easily. It is rather difficult to modify or update a set of metadata once it is stored in the database, yet as a DBMS expands, it needs to change over time to satisfy the requirements of the users. If all the data were dependent, this would become a tedious and highly complex job. Metadata itself follows a layered architecture, so that when we change data at one layer, it does not affect the data at another level. This data is independent but mapped to each other.
Logical Data Independence
Logical data is data about the database; that is, it stores information about how data is managed inside. For example, a table (relation) stored in the database and all the constraints applied on that relation. Logical data independence is a mechanism that keeps the logical level independent of the actual data stored on the disk. If we make changes to the table format, it should not change the data residing on the disk.
Physical Data Independence
All the schemas are logical, and the actual data is stored in bit format on the disk. Physical data independence is the power to change the physical data without impacting the schema or logical data. For example, if we want to change or upgrade the storage system itself (suppose we want to replace hard disks with SSDs), it should not have any impact on the logical data or schemas.
INFORMATION STORAGE AND RETRIEVAL
An information storage and retrieval system is a network with a built-in user interface that facilitates the creation, searching, and modification of stored data. Files on a computer can be created, moved, modified, grown, shrunk and deleted. In most cases, computer programs that are executed on the computer handle these operations, but the user of a computer can also manipulate files if necessary. For instance, Microsoft Word files are normally created and modified by the Microsoft Word program in response to user commands, but the user can also move, rename, or delete these files directly by using a file manager program such as Windows Explorer.
DBMS Languages
The workings of a DBMS are controlled by four different languages. They are:
Data Definition Language (DDL): Used by the DBA and database designers to specify the conceptual schema of a database. In many DBMSs, the DDL is also used to define internal and external schemas (views).
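As an illustration of how such an external view can be declared with ordinary DDL, the sketch below (not part of the original notes) defines a view for the bank example given earlier; the table name account and its columns are assumptions made for the purpose of the example.

-- Hypothetical external view: an account holder's window onto a bank
-- database, exposing only the account number and current balance rather
-- than the full account table.
CREATE VIEW customer_account_view AS
SELECT account_no, balance
FROM account
WHERE holder_id = CURRENT_USER;   -- holder_id is a hypothetical column naming the owning database user

A DBMS would typically combine a view like this with access controls so that each account holder sees only his or her own data.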
In some DBMSs, separate storage definition language (SDL) and view definition language (VDL) are used to define internal and external schemas. SDL is typically realized via DBMS commands provided to the DBA and database designers. Some examples of DDL commands include:
- CREATE - create objects in the database
- ALTER - alter the structure of the database
- DROP - delete objects from the database
- TRUNCATE - remove all records from a table, including all space allocated for the records
- COMMENT - add comments to the data dictionary
- RENAME - rename an object
CREATE creates new databases, tables, and views in an RDBMS. For example:
Create database tutorialspoint;
Create table article;
Create view for_students;
DROP removes objects such as views, tables, and databases from an RDBMS. For example:
Drop object_type object_name;
Drop database tutorialspoint;
Drop table article;
Drop view for_students;
ALTER modifies the database schema:
Alter object_type object_name parameters;
For example:
Alter table article add subject varchar;
This command adds an attribute named subject, of string type, to the relation article.
Data Manipulation Language (DML): These statements manage data within schema objects. They specify database retrievals and updates. DML commands (a data sublanguage) can be embedded in a general-purpose programming language (host language), such as COBOL, C, C++, or Java.
- A library of functions can also be provided to access the DBMS from a programming language.
- Alternatively, stand-alone DML commands can be applied directly (called a query language).
Some examples in SQL include:
- SELECT - retrieve data from a database
- INSERT - insert data into a table
- UPDATE - update existing data within a table
- DELETE - delete all records from a table; the space for the records remains
- MERGE - UPSERT operation (insert or update)
- CALL - call a PL/SQL or Java subprogram
- EXPLAIN PLAN - explain the access path to data
- LOCK TABLE - control concurrency
SELECT/FROM/WHERE
▪ SELECT: This is one of the fundamental query commands of SQL. It is similar to the projection operation of relational algebra. It selects the attributes based on the condition described by the WHERE clause.
▪ FROM: This clause takes a relation name as an argument from which attributes are to be selected/projected. If more than one relation name is given, this clause corresponds to the Cartesian product.
▪ WHERE: This clause defines the predicate or conditions that must be matched in order for a tuple to qualify for retrieval.
For example:
Select author_name
From book_author
Where age > 50;
This command will yield the names of authors from the relation book_author whose age is greater than 50.
INSERT INTO/VALUES
This command is used for inserting values into the rows of a table (relation). Syntax:
INSERT INTO table (column1 [, column2, column3... ]) VALUES (value1 [, value2, value3... ])
Or
INSERT INTO table VALUES (value1, [value2,... ])
For example:
INSERT INTO tutorialspoint (Author, Subject) VALUES ("anonymous", "computers");
UPDATE/SET/WHERE
This command is used for updating or modifying the values of columns in a table (relation). Syntax:
UPDATE table_name SET column_name = value [, column_name = value...] [WHERE condition]
For example:
UPDATE tutorialspoint SET Author="webmaster" WHERE Author="anonymous";
DELETE/FROM/WHERE
This command is used for removing one or more rows from a table (relation).
Syntax:
DELETE FROM table_name [WHERE condition];
For example:
DELETE FROM tutorialspoint WHERE Author="unknown";
Data Control Language (DCL): Used for granting and revoking user access on a database.
- To grant access to a user - GRANT
- To revoke access from a user - REVOKE
Transaction Control Language (TCL): These statements are used to manage the changes made by DML statements. It allows statements to be grouped together into logical transactions. Some examples include:
- COMMIT - save work done
- SAVEPOINT - identify a point in a transaction to which you can later roll back
- ROLLBACK - restore the database to its state as of the last COMMIT
- SET TRANSACTION - change transaction options like the isolation level and what rollback segment to use
In practice, the data definition, data manipulation and data control languages are not separate languages; rather, they are parts of a single database language such as SQL.
Example
Write the SQL code that will create the table structure for a table named EMP_1. This table is a subset of the EMPLOYEE table. The basic EMP_1 table structure is summarized in the following table. EMP_NUM is the primary key and JOB_CODE is the FK to JOB.
Create Table Emp_1(
Emp_Num Char(6) Not Null,
Emp_Lname Varchar(15),
Emp_Fname Varchar(15),
Emp_Initial Char(1),
Emp_Hiredate Date,
Job_Code Char(3),
Primary Key (Emp_Num),
Foreign Key (Job_Code) References Job (Job_Code) );
Having created the table structure in (a), write the SQL code to enter the first two rows for the table Emp_1 below:
Insert into Emp_1 (Emp_Num, Emp_Lname, Emp_Fname, Emp_Initial, Emp_Hiredate, Job_Code)
Values ("101", "News", "John", "G", "08-Nov-00", "502"),
("102", "Senior", "David", "H", "12-Jul-89", "500");
Assuming the data shown in the Emp_1 table have been entered, write the SQL code that will list all attributes for a job code of 502.
Select * From Emp_1 Where Job_Code = '502';
Write the SQL code that will save the changes made to the Emp_1 table.
Commit Work; (NB: Work is optional.)
DBMS DATA MODEL
A data model is a notation for describing data or information. The description generally consists of three parts:
i. Structure of the data: The data structures used to implement data in the computer are sometimes referred to, in discussions of database systems, as a physical data model.
ii. Operations on the data: In database data models, there is usually a limited set of operations that can be performed, but the DBA can describe database operations at a very high level, yet have the database management system implement the operations efficiently.
iii. Constraints on the data: Database data models usually have a way to describe limitations on what the data can be. These constraints can range from the simple (e.g., "a day of the week is an integer between 1 and 7" or "a movie has at most one title") to some very complex limitations.
STRUCTURE OF THE DATA
Traditionally, there are three DBMS data models. These three data models also represent the historical development of the DBMS:
Hierarchical Database Model
This is the oldest DBMS data model. In this model, information is organized as a collection of inverted trees of records. The record at the root of a tree has zero or more child records; the child records, in turn, serve as parent records for their immediate descendants. This parent-child relationship recursively continues down the tree. The records consist of fields, where each field may contain simple data values (e.g. integer, real, text), or a pointer to a record. The pointer graph is not allowed to contain cycles.
Some combinations of fields may form the key for a record relative to its parent. Only a few hierarchical DBMSs support null values or variable-length fields. Example of Hierarchical data model Applications can navigate a hierarchical database by starting at a root and successively navigate downward from parent to children until the desired record is found. Searching down a hierarchical tree is very fast since the storage layer for hierarchical databases use contiguous storage for hierarchical structures. All other types of queries require sequential search techniques. A DDL for hierarchical data model must allow the definition of record types, fields types, pointers, and parent-child relationships. And the DML must support direct navigation using the parent-child relationships and through pointers. Limitations - Hierarchical model only permits one to many relationship. The concept of Logical relationship is often used to circumvent this limitation. Logical relationship superimpose another set of connection between data items separate from the physical tree structure. This of course increases its complexity - Often a natural hierarchy does not exist and it is awkward to impose a parent-child relationship. Pointers partially compensate for this weakness, but it is still difficult to specify suitable hierarchical schemas for large models and this means Programs have to navigate very close to the physical data structure level, implying that the hierarchical data model offers only very limited data independence. - Lack of ad hoc query capability placed burden on programmers to generate code for reports. Network model: It represents complex data relationships more effectively than the hierarchical model. The major improvement is that the one-to-many limitation was removed; the models still views data in a hierarchical one-to-many structure but now record may have more than one parent. Network data models represent data in a symmetric manner, unlike the hierarchical data model (distinction between a parent and a child). Information is organized as a collection of graphs of record that are related with pointers. More flexible than a hierarchical data model and still permits efficient navigation. Example of network data model The records consist of lists of fields (fixed or variable length with maximum length), where each field contains a simple value (fixed or variable size). The network data model also introduces the notion of indexes of fields and records, sets of pointers, and physical placement of records. A DDL for network data models must allow the definition of record types, fields types, pointers and indexes. And the DML must allow navigation through the graphs through the pointers and indexes. Programs also navigates closely to the physical storage structures, implying that the network data model only supports limited data independence, and are therefore difficult to maintain as the data models evolve over time. Concepts introduced under the network model include: - Schema: conceptual design of the entire database usually managed by the dba - Sub-schema: virtual view of portions of the database visible to application programmers - Data management language: enables definition of and access to the schema and sub- schema. It consist of DDL to construct the schema and DML to develop programs - Data Definition Language. 
Limitations - Cumbersome - Lack of ad hoc query capability placed burden on programmers to generate code for reports - Structural change in the database could produce havoc in all application programs The relational database Model Developed by E.F. Codd (IBM) in 1970, the relational data model has a mathematical foundation in relational algebra. The model is based on first-order predicate logic and defines a table as an n-ary relation. Data is organized in relations (two-dimensional tables). Each relation contains a set of tuples (records). Each tuple contains a number of fields. A field may contain a simple value (fixed or variable size) from some domain (e.g. integer, real, text, etc.). Advantages of relational model - Built-in multilevel integrity: Data integrity is built into the model at the field level to ensure the accuracy of the data; at the table level to ensure that records are not duplicated and to detect missing primary key values; at the relationship level to ensure that the relationship between a pair of tables is valid; and at the business level to ensure that the data is accurate in terms of the business itself. (Integrity is discussed in detail as the design process unfolds). - Logical and physical data independence from database applications: Neither changes a user makes to the logical design of the database, nor changes a database software vendor makes to the physical implementation of the database, will adversely affect the applications built upon it. - Guaranteed data consistency and accuracy: Data is consistent and accurate due to the various levels of integrity you can impose within the database. (This will become quite clear as you work through the design process.) - Easy data retrieval: At the user’s command, data can be retrieved either from a particular table or from any number of related tables within the database. This enables a user to view information in an almost unlimited number of ways. One commonly perceived disadvantage of the relational database was that software programs based on it ran very slowly. SOME DEFINITIONS RELATION A relation, as defined by E. F. Codd, is a set of tuples (d1, d2,..., dn), where each element dj is a member of Dj, a data domain, for each j=1, 2,..., n. A data domain is simply a data type. It specifies a data abstraction: the possible values for the data and the operations available on the data. For example, a String can have zero or more characters in it, and has operations for comparing strings, concatenating string, and creating strings. A relation is a truth predicate. It defines what attributes are involved in the predicate and what the meaning of the predicate is. In relational data model, relations are represented in the table format. This format stores the relation among entities. A table has rows and columns, where rows represent records and columns represent the attributes. E.g. TUPLE A single row of a table, which contains a single record for that relation is called a tuple. A tuple has attribute values which match the required attributes in the relation. The ordering of attribute values is immaterial. Every tuple in the body of a given relation is required to conform to the heading (attribute) of that relation, i.e. it contains exactly one value, of the applicable type, for each attribute. ATTRIBUTE The columns of a relation are named by attributes. Attributes appear at the tops of the columns. Usually, an attribute describes the meaning of entries in the column below. 
For instance, the column with attribute length holds the length, in minutes, of each movie. ATTRIBUTE DOMAIN Every attribute has some predefined value scope, known as attribute domain. ATTRIBUTE VALUE/INSTANCE An attribute value is the value for an attribute in a particular tuple. An attribute value must come from the domain that the attribute specifies. Most relational DBMS allows NULL attribute values. Each attribute value in a relational model must be atomic i.e. it must be of some elementary type such as integer or string. It is not permitted for a value to be a record structure, set, list, array, or any other type that reasonably can have its values broken into smaller components. SCHEMAS The name of a relation and the set of attributes for a relation is called the schema for that relation. The schema is depicted by the relation name followed by a parenthesized list of its attributes. Thus, the schema for relation Movies above is Movies (title, year, length, genre) In the relational model, a database consists of one or more relations. The set of schemas for the relations of a database is called a relational database schema, or just a database schema. Data Types All attributes must have a data type. The following are the primitive data types that are supported by SQL (Structured Query Language) systems. i. Character strings of fixed or varying length. The type CHAR(n) denotes a fixed- length string of up to n characters. VARCHAR(n) also denotes a string of up to n characters. The difference is implementation-dependent; typically CHAR implies that short strings are padded to make n characters, while VARCHAR implies that an endmarker or string-length is used. Normally, a string is padded by trailing blanks if it becomes the value of a component that is a fixed-length string of greater length. For example, the string ’foo’ if it became the value of a component for an attribute of type CHAR(5), would assume the value ’foo ’ (with two blanks following the second o). ii. Bit strings of fixed or varying length. These strings are analogous to fixed and varying-length character strings, but their values are strings of bits rather than characters. The type BIT (n) denotes bit strings of length n, while BIT VARYING (n) denotes bit strings of length up to n. iii. BOOLEAN. The type BOOLEAN denotes an attribute whose value is logical. The possible values of such an attribute are TRUE, FALSE. iv. INT or INTEGER. The type INT or INTEGER (these names are synonyms) denotes typical integer values. The type SHORTINT also denotes integers, but the number of bits permitted may be less, depending on the implementation (as with the types int and short int in C). v. FLOAT or REAL. Floating-point numbers can be represented in a variety of ways. We may use the type FLOAT or REAL (these are synonyms) for typical floating point numbers. A higher precision can be obtained with the type DOUBLE PRECISION. We can also specify real numbers with a fixed decimal point. For example, DECIMAL(n,d) allows values that consist of n decimal digits, with the decimal point assumed to be d positions from the right. Thus, 0123.45 is a possible value of type DECIMAL(6,2). NUMERIC is almost a synonym for DECIMAL, although there are possible implementation-dependent differences. vi. DATE and TIME. Dates and times can be represented by the data types DATE and TIME, respectively. These values are essentially character strings of a special form. 
We may, in fact, coerce dates and times to string types, and we may do the reverse if the string "makes sense" as a date or time.
A relational database schema is depicted by stating both the attributes and their data types:
Movies (title CHAR(100), year INT, length INT, genre CHAR(10), studioName CHAR(30), producer INT)
Relation instance: A finite set of tuples in the relational database system represents a relation instance. Relation instances do not have duplicate tuples. It is more common and concise to show a relation value as a table. All ordering within the table is artificial and meaningless.
DESIGN THEORY FOR RELATIONAL DATABASE
A common problem with schema design involves trying to combine too much into one relation, thus leading to redundancy. Thus, improvements to relational schemas pay close attention to eliminating redundancy. The theory of "dependencies" is a well-developed theory for relational databases providing guidelines on how to develop good schemas and eliminate flaws, if any. The first concept we need to consider is Functional Dependency (FD).
FUNCTIONAL DEPENDENCY: The term functional dependence can be defined most easily this way:
Definition: Let A and B be subsets of the attributes of a relation R. Then the functional dependency (FD) A → B holds in R if and only if, whenever two tuples of R have the same value for A, they also have the same value for B. A and B are the determinant and the dependent, respectively, and the FD overall can be read as "A functionally determines B" or "B is functionally dependent on A", or more simply just as A → B.
If A and B are composite, then we have A1, A2, …, An → B1, B2, …, Bm. This is also equivalent to A1, A2, …, An → B1; A1, A2, …, An → B2; ...; A1, A2, …, An → Bm.
The attribute(s) B is functionally dependent on the attribute(s) A if A determines B. E.g. STU_PHONE is functionally dependent on STU_NUM. STU_NUM is not functionally dependent on STU_PHONE because the STU_PHONE value 2267 is associated with two STU_NUM values: 324274 and 324291. (This could happen when roommates share a single land line phone number.) The functional dependence definition can be generalized to cover the case in which the determining attribute values occur more than once in a table. Functional dependence can then be defined this way: attribute B is functionally dependent on A if all of the rows in the table that agree in value for attribute A also agree in value for attribute B.
RELATION KEYS: The key's role is based on a concept known as determination, i.e. the statement "A determines B" indicates that if you know the value of attribute A, you can look up (determine) the value of attribute B. E.g.: an invoice number identifies all of the invoice attributes such as the invoice date and the customer name. If we know STU_NUM in a STUDENT table, we can look up (determine) the student's last name, grade point average, phone number, etc.
Table name: Student
The shorthand notation for "A determines B" is A → B. If A determines B, C, and D, we write A → B, C, D. For the student example we can write: STU_NUM → STU_LNAME, STU_FNAME, STU_INIT, STU_DOB, STU_TRANSFER. In contrast, STU_NUM is not determined by STU_LNAME because it is quite possible for several students to have the last name Smith. Proper understanding of the principle of determination is vital to the understanding of a central relational database concept known as functional dependence (FD).
Definitions
Key Attribute(s): We say a set of one or more attributes {A1, A2,..., An} is a key for a relation R if:
i. Those attributes functionally determine all other attributes of the relation. That is, it is impossible for two distinct tuples of R to agree on all of A1, A2,..., An (uniqueness).
ii. No proper subset of {A1, A2,..., An} functionally determines all other attributes of R; i.e., a key must be minimal.
When a key consists of a single attribute A, we often say that A (rather than {A}) is a key. An attribute that is part of a key is called a key attribute. Consider the relation Movies below: attributes {title, year, starName} form a key for the relation Movies because they meet the two conditions:
Condition 1: Do they functionally determine all the other attributes? Yes.
Condition 2: Does any proper subset of {title, year, starName} functionally determine all other attributes? {title, year} does not determine starName, thus {title, year} is not a key. {year, starName} is not a key because we could have a star in two movies in the same year; therefore {year, starName} → title is not an FD. {title, starName} is not a key, because two movies with the same title, made in different years, can have a star in common. Therefore, no proper subset of {title, year, starName} functionally determines all other attributes.
Super Key (a superset of a key): An attribute or a combination of attributes that is used to identify the records uniquely is known as a super key. It is to be noted that some superkeys are not (minimal) keys. Note that every superkey satisfies the first condition of a key: it functionally determines all other attributes of the relation. However, a superkey need not satisfy the second condition: minimality. A table can have many super keys. Examples of super keys: {ID}; {ID, Name}; {ID, Address}; {ID, Department_ID}; {ID, Salary}; {Name, Address}.
Candidate Key: It can be defined as a minimal super key or irreducible super key; in other words, an attribute or a combination of attributes that identifies the record uniquely but none of whose proper subsets can identify the records uniquely. Examples of candidate keys: {Code}; {Name, Address}.
Primary Key: A candidate key that is used by the database designer for unique identification of each row in a table is known as the primary key. A primary key can consist of one or more attributes of a table. The database designer can use one of the candidate keys as the primary key. In this case we have {Code} and {Name, Address} as candidate keys; the designer may prefer Code as the primary key, since the other key is a combination of more than one attribute.
Null values should never be part of a primary key; they should also be avoided to the greatest extent possible in other attributes too. A null is no value at all. It does not mean a zero or a space. There are rare cases in which nulls cannot be reasonably avoided when you are working with non-key attributes. For example, one of an EMPLOYEE table's attributes is likely to be the EMP_INITIAL. However, some employees do not have a middle initial. Therefore, some of the EMP_INITIAL values may be null. A null can also exist because of the nature of the relationship between two entities. Conventionally, the existence of nulls in a table is often an indication of poor database design. Nulls, if used improperly, can create problems because they have many different meanings. For example, a null can represent: an unknown attribute value; a known, but missing, attribute value; or a "not applicable" condition.
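To make the key terminology concrete, here is a minimal SQL sketch (not from the original notes) using the Code / (Name, Address) example above; the table name and the Budget column are illustrative assumptions.

-- Code is the chosen primary key; (Name, Address) is the remaining
-- candidate key, i.e. the alternate key, enforced here with UNIQUE.
CREATE TABLE Dept_Example (
    Code    CHAR(4)     NOT NULL,
    Name    VARCHAR(30) NOT NULL,
    Address VARCHAR(60) NOT NULL,
    Budget  DECIMAL(10,2),           -- non-key attribute; may be NULL
    PRIMARY KEY (Code),
    UNIQUE (Name, Address)
);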
Foreign Key: A foreign key is an attribute or combination of attributes in one base table that points to the candidate key (generally it is the primary key) of another table. The purpose of the foreign key is to ensure referential integrity of the data i.e. only values that are supposed to appear in the database are permitted. E.g. Consider two table Employee (EmployeeID, EmployeeName, DOB, DOJ, SSN, DeptID, MgrID) and DeptTbl (Dept_ID, Dept_Name, Manager_ID, Location_ID) Dept_ID is the primary key in Table DeptTbl, the DeptID attribute of table Employee (dependent or child table) can be defined as the Foreign Key as it can reference to the Dept_ID attribute of the table DeptTbl (the referenced or parent table), a Foreign Key value must match an existing value in the parent table or be NULL. Composite Key: If we use multiple attributes to create a Primary Key then that Primary Key is called Composite Key (also called a Compound Key or Concatenated Key). Full functional dependency (FFD): If the attribute (B) is functionally dependent on a composite key (A) but not on any subset of that composite key, the attribute (B) is fully functionally dependent on (A). Alternate Key: Alternate Key can be any of the Candidate Keys except for the Primary Key. Secondary Key: The attributes that are not even the Super Key but can be still used for identification of records (not unique) are known as Secondary Key. E.g. of Secondary Key can be Name, Address, Salary, Department_ID etc. as they can identify the records but they might not be unique. An Example of relational db with primary key and foreign key Exercise Suppose R is a relation with attributes A1, A2,..., An. As a function of n, tell how many superkeys R has, if: i. The only key is A1. ii. The only keys are A1 and A2 iii. The only keys are {A1, A2} and {A3, A4} iv. The only keys are {A1, A2} and {A1, A3} Rules About Functional Dependencies These rules guide us on how we can infer a functional dependency from other given FD’s. Transitive rule E.g., given that a relation R (A, B, C) satisfies the FD’s A —> B and B —> C, then we can deduce that R also satisfies the FD A —> C. Proof: Consider two tuples of R that agree on A. Let the tuples agreeing on attribute A be (a, b1, c1) and (a, b2, c2). Since R satisfies A → B, and these tuples agree on A, they must also agree on B. That is, b1 = b2. The tuples are now (a, b, c1) and (a, b, c2), where b is both b1 and b2. Similarly, since R satisfies B → C, and the tuples agree on B, they agree also on C. Thus, c1= c2; i.e., the tuples do agree on C. We have proved that any two tuples of R that agree on A also agree on C, and that is the FD A → C. This rule is called the transitive rule. The Splitting/Combining Rule The splitting/ combining rule is stated as follows: Suppose we have two tuples that agree in A1, A2,..., An. As a single FD, we would assert “then the tuples must agree in all of B1, B2,..., Bm.” As individual FD’s, we assert “then the tuples agree in B1, and they agree in B2, and,..., and they agree in Bm.” Recall that the FD: A1, A2, …, An → B1, B2, …, Bm is equivalent to the set of FD’s: A1, A2, …, An → B1, A1, A2, …, An → B2,..., A1, A2, …, An → Bm In other words, we may split attributes on the right side so that only one attribute appears on the right of each FD. Likewise, we can replace a collection of FD’s having a common left side by a single FD with the same left side and all the right sides combined into one set of attributes. In either event, the new set of FD’s is equivalent to the old. 
The equivalence noted above can be used in two ways. We can replace an FD A1, A2, …, An → B1, B2, …, Bm by a set of FD's A1, A2, …, An → Bi for i = 1, 2,..., m; we call this transformation the splitting rule. We can replace a set of FD's A1, A2, …, An → Bi for i = 1, 2,..., m by the single FD A1, A2, …, An → B1, B2, …, Bm; we call this transformation the combining rule.
E.g. the set of FD's:
title year → length
title year → genre
title year → studioName
is equivalent to the single FD:
title year → length, genre, studioName
Trivial-dependency rule. Trivial Functional Dependencies: If a functional dependency (FD) α → β holds in a relation R, the dependency is called trivial if it is satisfied by every possible instance r(R); this is the case exactly when β ⊆ α (β is a subset of α). E.g. title, year → title and title → title are both trivial FDs. There is an intermediate situation in which some, but not all, of the attributes on the right side of an FD are also on the left; such an FD is not trivial.
Non-trivial: If an FD X → Y holds, where Y is not a subset of X, then it is called a non-trivial FD.
Completely non-trivial: If an FD X → Y holds, where X ∩ Y = ∅, it is said to be a completely non-trivial FD.
Computing the Closure of Attributes
Given a set a = {A1, A2,..., An} of attributes of R and a set F of functional dependencies, we need a way to find all of the attributes of R that are functionally determined by a. This set of attributes is called the closure of a under F and is denoted a+. Finding a+ is useful because:
- if a+ = R, then a is a superkey for R;
- with closures we can test FD's easily: to check whether X → A holds, compute X+ and check whether A ∈ X+;
- if we find a+ for all a ⊆ R, we have computed F+ (except that we would need to use decomposition to get all of it).
Formal definition of closure: Suppose a = {A1, A2,..., An} is a set of attributes and S is a set of FD's. The closure of a under the FD's in S is the set of attributes B such that every relation that satisfies all the FD's in set S also satisfies A1, A2, …, An → B. That is, A1, A2, …, An → B follows from the FD's of S. We denote the closure of a set of attributes A1, A2, …, An by {A1, A2,..., An}+. Note that A1, A2,..., An are always in {A1, A2, …, An}+ because the FD A1, A2, …, An → Ai is trivial when i is one of 1, 2,..., n.
An algorithm for computing a+:
result := a
repeat
    temp := result
    for each functional dependency β → γ in F do
        if β ⊆ result then result := result ∪ γ
until temp = result
Example: Consider a relation with attributes A, B, C, D, E, and F. Suppose that this relation has the FD's AB → C, BC → AD, D → E, and CF → B. What is the closure of {A, B}?
Solution: First, split BC → AD into BC → A and BC → D. Result = {A, B}.
For AB → C: AB ⊆ Result, so Result = Result ∪ {C}, i.e. Result = {A, B, C}.
For BC → A and BC → D: BC ⊆ Result, so Result = Result ∪ {A, D}, i.e. Result = {A, B, C, D}.
For D → E: D ⊆ Result, so Result = Result ∪ {E}, i.e. Result = {A, B, C, D, E}.
For CF → B: {C, F} is not a subset of Result, so no more changes to Result are possible.
Thus, {A, B}+ = {A, B, C, D, E}.
By computing the closure of any set of attributes, we can test whether any given FD A1, A2, …, An → B follows from a set of FD's S. First compute {A1, A2, …, An}+ using the set of FD's S. If B is in {A1, A2, …, An}+, then A1, A2, …, An → B does follow from S, and if B is not in {A1, A2, …, An}+, then this FD does not follow from S.
More generally, A1, A2, …, An → B1, B2, …, Bm follows from set of FD’s S if and only if all of B1, B2,..., Bm are in {A1, A2, …, An}+. Example: Consider the relation and FD’s in the example above, suppose we wish to test whether AB → D follows from these FD’s. We compute {A, B}+, which is {A, B, C, D, E}. Since D is a member of the closure, we conclude that AB → D does follow. On the other hand, consider the FD D → A. To test whether this FD follows from the given FD’s, first compute {D}+. {D}+ = {D, E}. Since A is not a member of {D, E}, we conclude that D → A does not follow. Armstrong's Axioms If F is a set of functional dependencies then the closure of F, denoted as F+, is the set of all functional dependencies logically implied by F. Armstrong's Axioms are a set of rules, that when applied repeatedly, generates a closure of functional dependencies. Reflexivity / reflexive rule: If {B1, B2,..., Bm} ⊆{A1, A2,..., An}, then A1, A2, …, An → B1, B2, …, Bm. These are what we have called trivial FD’s. Augmentation rule: If A1A2 … An → B1B2 … Bm, then A1A2 … AnC1C2 … Ck → B1B2, … BmC1C2 … Ck for any set of attributes C1, C2,..., Ck Since some of the C ’s may also be A’s or B’s or both, we should eliminate from the left side duplicate attributes and do the same for the right side. Transitivity rule: If A1, A2, …, An → B1, B2, …, Bm and B1, B2, …, Bm → C1, C2,…, Ck hold in relation R, then A1, A2, …, An → C1, C2, …, Ck also holds in R. Additional rules: - Union: If X → Y and X → Z, then X → Y Z - Pseudotransitivity: If X → Y and W Y → Z, then W X → Z - Composition: If X → Y and Z → W, then XZ → Y W Transitive dependence: an attribute Y is said to be transitively dependent on attribute X if Y is functionally dependent on another attribute Z which is functionally dependent on X. Exercise Consider a relation with schema R (A, B, C, D) and FD’s AB → C, D → D and D → A. i. What are all the nontrivial FD’s that follow from the given FD’s? You should restrict yourself to FD’s with single attributes on the right side. ii. What are all the keys of R? iii. What are all the superkeys for R that are not keys? THE RELATIONAL ALGEBRA The relational algebra is a procedural query language. It consists of a set of operations that take one or two relations as input and produce a new relation as their result. The fundamental operations in the relational algebra are i. Select ii. Project iii. Union iv. set difference v. Cartesian product, and vi. Rename. In addition to the fundamental operations, there are several other operations—namely, set intersection natural join, and assignment. We shall define these operations in terms of the fundamental operations. Fundamental Operations The select, project, and rename operations are called unary operations, because they operate on one relation. The other three operations operate on pairs of relations and are, therefore, called binary operations. The Select Operation The select operation selects tuples that satisfy a given predicate. We use the lowercase Greek letter sigma (𝜎) to denote selection. The predicate appears as a subscript to 𝜎. The argument relation is in parentheses after the 𝜎. The instructor relation. 
Thus, to select those tuples of the instructor relation where the instructor is in the "Physics" department, we write:
σ dept_name = "Physics" (instructor)
Result of σ dept_name = "Physics" (instructor)
We can find all instructors with salary greater than $90,000 by writing:
σ salary > 90000 (instructor)
Furthermore, we can combine several predicates into a larger predicate by using the connectives and (∧), or (∨), and not (¬). Thus, to find the instructors in Physics with a salary greater than $90,000, we write:
σ dept_name = "Physics" ∧ salary > 90000 (instructor)
To find all departments whose name is the same as their building name, we can write:
σ dept_name = building (department)
The Project Operation
Suppose we want to list all instructors' ID, name, and salary, but do not care about the dept_name. The project operation allows us to produce this relation. The project operation is a unary operation that returns its argument relation, with certain attributes left out. Since a relation is a set, any duplicate rows are eliminated. Projection is denoted by the uppercase Greek letter pi (∏). We list those attributes that we wish to appear in the result as a subscript to ∏. The argument relation follows in parentheses. We write the query to produce such a list as:
∏ ID, name, salary (instructor)
Result of ∏ ID, name, salary (instructor).
Composition of Relational Operations
The fact that the result of a relational operation is itself a relation is important. Consider the more complicated query "Find the name of all instructors in the Physics department." We write:
∏ name (σ dept_name = "Physics" (instructor))
The Union Operation
Consider a query to find the set of all courses taught in the Fall 2009 semester, the Spring 2010 semester, or both. The information is contained in the section relation below.
The section relation.
To find the set of all courses taught in the Fall 2009 semester, we write:
∏ course_id (σ semester = "Fall" ∧ year = 2009 (section))
To find the set of all courses taught in the Spring 2010 semester, we write:
∏ course_id (σ semester = "Spring" ∧ year = 2010 (section))
To answer the query, we need the union of these two sets; that is, we need all section IDs that appear in either or both of the two relations. We find these data by the binary operation union, denoted, as in set theory, by ∪. So the expression needed is:
∏ course_id (σ semester = "Fall" ∧ year = 2009 (section)) ∪ ∏ course_id (σ semester = "Spring" ∧ year = 2010 (section))
The result relation for this query appears in the figure below: Courses offered in either Fall 2009, Spring 2010, or both semesters.
Observe that, in our example, we took the union of two sets, both of which consisted of course_id values. In general, we must ensure that unions are taken between compatible relations. For example, it would not make sense to take the union of the instructor relation and the student relation. Although both relations have four attributes, they differ on the salary and tot_cred domains. The union of these two attributes would not make sense in most situations. Therefore, for a union operation r ∪ s to be valid, we require that two conditions hold:
1. The relations r and s must be of the same arity. That is, they must have the same number of attributes.
2. The domains of the ith attribute of r and the ith attribute of s must be the same, for all i.
Note that r and s can be either database relations or temporary relations that are the result of relational-algebra expressions.
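In SQL, the same union query can be sketched as follows (an illustration added here, not part of the original notes), assuming a section table with course_id, semester and year columns as described above; like the relational-algebra union, SQL's UNION removes duplicate rows.

SELECT course_id FROM section WHERE semester = 'Fall' AND year = 2009
UNION
SELECT course_id FROM section WHERE semester = 'Spring' AND year = 2010;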
The Set-Difference Operation
The set-difference operation, denoted by −, allows us to find tuples that are in one relation but are not in another. The expression r − s produces a relation containing those tuples in r but not in s. We can find all the courses taught in the Fall 2009 semester but not in the Spring 2010 semester by writing:
∏ course_id (σ semester = "Fall" ∧ year = 2009 (section)) − ∏ course_id (σ semester = "Spring" ∧ year = 2010 (section))
The result relation for this query appears in the figure below: Courses offered in the Fall 2009 semester but not in the Spring 2010 semester.
As with the union operation, we must ensure that set differences are taken between compatible relations. Therefore, for a set-difference operation r − s to be valid, we require that the relations r and s be of the same arity, and that the domains of the ith attribute of r and the ith attribute of s be the same, for all i.
The Cartesian-Product Operation
The Cartesian-product operation, denoted by a cross (×), allows us to combine information from any two relations. We write the Cartesian product of relations r1 and r2 as r1 × r2. Recall that a relation is by definition a subset of a Cartesian product of a set of domains. From that definition, we should already have an intuition about the definition of the Cartesian-product operation. However, since the same attribute name may appear in both r1 and r2, we need to devise a naming schema to distinguish between these attributes. We do so here by attaching to an attribute the name of the relation from which the attribute originally came. For example, the relation schema for r = instructor × teaches is:
(instructor.ID, instructor.name, instructor.dept_name, instructor.salary, teaches.ID, teaches.course_id, teaches.sec_id, teaches.semester, teaches.year)
With this schema, we can distinguish instructor.ID from teaches.ID. For those attributes that appear in only one of the two schemas, we shall usually drop the relation-name prefix. This simplification does not lead to any ambiguity. We can then write the relation schema for r as:
(instructor.ID, name, dept_name, salary, teaches.ID, course_id, sec_id, semester, year)
This naming convention requires that the relations that are the arguments of the Cartesian-product operation have distinct names. This requirement causes problems in some cases, such as when the Cartesian product of a relation with itself is desired. A similar problem arises if we use the result of a relational-algebra expression in a Cartesian product, since we shall need a name for the relation so that we can refer to the relation's attributes.
The teaches relation.
The Rename Operation
Unlike relations in the database, the results of relational-algebra expressions do not have a name that we can use to refer to them. It is useful to be able to give them names; the rename operator, denoted by the lowercase Greek letter rho (ρ), lets us do this. Given a relational-algebra expression E, the expression ρ x (E) returns the result of expression E under the name x.
The Set-Intersection Operation
The first additional relational-algebra operation that we shall define is set intersection (∩). Suppose that we wish to find the set of all courses taught in both the Fall 2009 and the Spring 2010 semesters. Using set intersection, we can write:
∏ course_id (σ semester = "Fall" ∧ year = 2009 (section)) ∩ ∏ course_id (σ semester = "Spring" ∧ year = 2010 (section))
The result relation for this query appears in the figure below: Courses offered in both the Fall 2009 and Spring 2010 semesters.
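For comparison, the set-difference and set-intersection queries above can be sketched in SQL as shown below (an added illustration, not part of the original notes); keyword support varies by DBMS, and some systems use MINUS in place of EXCEPT.

-- Courses taught in Fall 2009 but not in Spring 2010 (set difference):
SELECT course_id FROM section WHERE semester = 'Fall' AND year = 2009
EXCEPT
SELECT course_id FROM section WHERE semester = 'Spring' AND year = 2010;

-- Courses taught in both semesters (set intersection):
SELECT course_id FROM section WHERE semester = 'Fall' AND year = 2009
INTERSECT
SELECT course_id FROM section WHERE semester = 'Spring' AND year = 2010;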
The Natural-Join Operation
The natural join is a binary operation that allows us to combine certain selections and a Cartesian product into one operation. It is denoted by the join symbol ⋈. The natural-join operation forms a Cartesian product of its two arguments, performs a selection forcing equality on those attributes that appear in both relation schemas, and finally removes duplicate attributes.
Returning to the example of the relations instructor and teaches, computing instructor natural join teaches considers only those pairs of tuples where both the tuple from instructor and the tuple from teaches have the same value on the common attribute ID. For example: Find the names of all instructors together with the course_id of all courses they taught. We express this query by using the natural join as follows:
∏ name, course_id (instructor ⋈ teaches)
Since the schemas for instructor and teaches have the attribute ID in common, the natural-join operation considers only pairs of tuples that have the same value on ID. It combines each such pair of tuples into a single tuple on the union of the two schemas; that is, (ID, name, dept_name, salary, course_id). After performing the projection, we obtain the relation in the Figure below.
Result of ∏ name, course_id (instructor ⋈ teaches).
We are now ready for a formal definition of the natural join. Consider two relations r(R) and s(S). The natural join of r and s, denoted by r ⋈ s, is a relation on schema R ∪ S formally defined as follows:
r ⋈ s = ∏ R ∪ S (σ r.A1 = s.A1 ∧ r.A2 = s.A2 ∧ … ∧ r.An = s.An (r × s))
where R ∩ S = {A1, A2, ..., An}.
Please note that if r(R) and s(S) are relations without any attributes in common, that is, R ∩ S = ∅, then r ⋈ s = r × s.
RELATIONSHIPS IN RELATIONAL DATABASE
Relationships are classified as: one-to-one (1:1), one-to-many (1:M), and many-to-many (M:N or M:M). In developing good database designs, we must focus on the following points:
The 1:1 relationship should be rare in any relational database design.
The 1:M relationship is the relational modeling ideal. Therefore, this relationship type should be the norm in any relational database design.
M:N relationships cannot be implemented as such in the relational model.
The 1:M Relationship
The 1:M relationship between PAINTER and PAINTING
The implemented 1:M relationship between PAINTER and PAINTING
The one-to-many (1:M) relationship is easily implemented in the relational model by putting the primary key of the “1” side in the table of the “many” side as a foreign key.
The 1:M relationship between COURSE and CLASS
The implemented 1:M relationship between COURSE and CLASS
The 1:1 Relationship: As the 1:1 label implies, in this relationship, one entity can be related to only one other entity, and vice versa. For example, one department chair—a professor—can chair only one department, and one department can have only one department chair. The entities PROFESSOR and DEPARTMENT thus exhibit a 1:1 relationship.
The 1:1 relationship between PROFESSOR and DEPARTMENT
If we examine the PROFESSOR and DEPARTMENT tables, we note some important features:
Each professor is a College employee; thus, the professor identification is through the EMP_NUM. (However, note that not all employees are professors—there is another optional relationship.)
The 1:1 PROFESSOR chairs DEPARTMENT relationship is implemented by having the EMP_NUM as a foreign key in the DEPARTMENT table.
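To make the foreign-key mechanics concrete, here is a minimal SQL sketch of both patterns. The table and column names (PTR_NUM, PTG_NUM, DEPT_CODE, and so on) are illustrative assumptions rather than a prescribed schema; the point is simply that the primary key of the “1” side reappears as a foreign key on the other side, and that a UNIQUE constraint restricts the “many” side to a single occurrence, which is one way the 1:1 “PROFESSOR chairs DEPARTMENT” relationship can be enforced.

-- 1:M "PAINTER paints PAINTING": the PK of the "1" side (PAINTER)
-- appears as a foreign key in the "many" side (PAINTING).
CREATE TABLE PAINTER (
    PTR_NUM   INT PRIMARY KEY,
    PTR_LNAME VARCHAR(30) NOT NULL
);

CREATE TABLE PAINTING (
    PTG_NUM   INT PRIMARY KEY,
    PTG_TITLE VARCHAR(60),
    PTR_NUM   INT NOT NULL,
    FOREIGN KEY (PTR_NUM) REFERENCES PAINTER (PTR_NUM)
);

-- 1:1 "PROFESSOR chairs DEPARTMENT": the same foreign-key pattern,
-- but UNIQUE limits each professor to chairing at most one department.
CREATE TABLE DEPARTMENT (
    DEPT_CODE VARCHAR(10) PRIMARY KEY,
    DEPT_NAME VARCHAR(50) NOT NULL,
    EMP_NUM   INT UNIQUE   -- the chair's employee number
    -- FOREIGN KEY (EMP_NUM) REFERENCES PROFESSOR (EMP_NUM), assuming a PROFESSOR table exists
);

-- A join then recombines the related rows, for example each painter with his or her paintings:
SELECT p.PTR_LNAME, g.PTG_TITLE
FROM PAINTER p
JOIN PAINTING g ON g.PTR_NUM = p.PTR_NUM;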
Note that the 1:1 relationship is treated as a special case of the 1:M relationship in which the “many” side is restricted to a single occurrence. In this case, DEPARTMENT contains the EMP_NUM as a foreign key to indicate that it is the department that has a chair. Also, note that the PROFESSOR table contains the DEPT_CODE foreign key to implement the 1:M DEPARTMENT employs PROFESSOR relationship. This is a good example of how two entities can participate in two (or even more) relationships simultaneously.
The preceding “PROFESSOR chairs DEPARTMENT” example illustrates a proper 1:1 relationship. In fact, the use of a 1:1 relationship ensures that two entity sets are not placed in the same table when they should not be.
The M:N Relationship: A many-to-many (M:N) relationship is not supported directly in the relational environment. However, M:N relationships can be implemented by creating a new entity in 1:M relationships with the original entities. To explore the many-to-many (M:N) relationship, consider a rather typical college environment in which each STUDENT can take many CLASSes, and each CLASS can contain many STUDENTs. The ER model for this M:N relationship is shown below:
The ERM’s M:N relationship between STUDENT and CLASS
Note the features of the ERM above:
Each CLASS can have many STUDENTs, and each STUDENT can take many CLASSes.
There can be many rows in the CLASS table for any given row in the STUDENT table, and there can be many rows in the STUDENT table for any given row in the CLASS table.
To examine the M:N relationship more closely, imagine a small college with two students, each of whom takes three classes. The table below shows the enrolment data for the two students.
Given the data relationship and the sample data in the table above, it might wrongly be assumed that the M:N relationship can be implemented by simply adding a foreign key in the “many” side of the relationship that points to the primary key of the related table. This is not correct: such tables would create many redundancies. For example, note that the STU_NUM values occur many times in the STUDENT table. In a real-world situation, additional student attributes such as address, classification, major, and home phone would also be contained in the STUDENT table, and each of those attribute values would be repeated in each of the records shown here. Similarly, the CLASS table contains many duplications: each student taking the class generates a CLASS record. The problem would be even worse if the CLASS table included such attributes as credit hours and course description. Given the structure and contents of the two tables, the relational operations become very complex and are likely to lead to system efficiency errors and output errors.
The problems inherent in the many-to-many (M:N) relationship can easily be avoided by creating a composite entity (also referred to as a bridge entity or an associative entity). Because such a table is used to link the tables that were originally related in an M:N relationship, the composite entity structure includes—as foreign keys—at least the primary keys of the tables that are to be linked. The database designer can then define the composite table’s primary key either by using the combination of those foreign keys or by creating a new primary key. In the example above, we can create the composite entity ENROLL to link CLASS and STUDENT.
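A linking-table sketch in SQL makes the conversion visible: ENROLL carries the two foreign keys and, in this version, uses their combination as its primary key, while ENROLL_GRADE is the kind of extra attribute such a table often carries. Column types and sizes are illustrative assumptions.

-- The M:N relationship between STUDENT and CLASS becomes two 1:M relationships
-- through the composite (linking) entity ENROLL.
CREATE TABLE STUDENT (
    STU_NUM   INT PRIMARY KEY,
    STU_LNAME VARCHAR(30) NOT NULL
);

CREATE TABLE CLASS (
    CLASS_CODE VARCHAR(10) PRIMARY KEY,
    CLASS_TIME VARCHAR(20)
);

CREATE TABLE ENROLL (
    CLASS_CODE   VARCHAR(10) NOT NULL,
    STU_NUM      INT NOT NULL,
    ENROLL_GRADE CHAR(1),
    PRIMARY KEY (CLASS_CODE, STU_NUM),   -- composite PK built from the two foreign keys
    FOREIGN KEY (CLASS_CODE) REFERENCES CLASS (CLASS_CODE),
    FOREIGN KEY (STU_NUM) REFERENCES STUDENT (STU_NUM)
);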
In this example, the ENROLL table’s primary key is the combination of its foreign keys CLASS_CODE and STU_NUM. But the designer could have decided to create a single-attribute new primary key such as ENROLL_LINE, using a different line value to identify each ENROLL table row uniquely. (Microsoft Access users might use the AutoNumber data type to generate such line values automatically.) Because the ENROLL table links two tables, STUDENT and CLASS, it is also called a linking table. In other words, a linking table is the implementation of a composite entity. The ENROLL table yields the required M:N to 1:M conversion. Observe that the composite entity represented by the ENROLL table must contain at least the primary keys of the CLASS and STUDENT tables (CLASS_CODE and STU_NUM, respectively) for which it serves as a connector. Also note that the STUDENT and CLASS tables now contain only one row per entity. The ENROLL table contains multiple occurrences of the foreign key values, but those controlled redundancies are incapable of producing anomalies as long as referential integrity is enforced. Additional attributes may be assigned as needed. In this case, ENROLL_GRADE is selected to satisfy a reporting requirement. Also note that the ENROLL table’s primary key consists of the two attributes CLASS_CODE and STU_NUM because both the class code and the student number are needed to define a particular student’s grade.
Naturally, the conversion is reflected in the ERM, too. The revised relationship is shown below:
Changing the M:N relationship to two 1:M relationships
Note that the composite entity named ENROLL represents the linking table between STUDENT and CLASS. We can increase the amount of available information even as we control the database’s redundancies. Below is the expanded ERM, including the 1:M relationship between COURSE and CLASS. Note that the model is able to handle multiple sections of a CLASS while controlling redundancies by making sure that all of the COURSE data common to each CLASS are kept in the COURSE table.
Expanded entity relationship model
The relationship diagram that corresponds to the ERM shown above is as below:
CODD’S RELATIONAL DATABASE RULES
In 1985, Dr. E. F. Codd published a list of 12 rules to define a relational database system. The reason Dr. Codd published the list was his concern that many vendors were marketing products as “relational” even though those products did not meet minimum relational standards. Dr. Codd’s list serves as a frame of reference for what a truly relational database should be. Note that even the dominant database vendors do not fully support all 12 rules.
Dr. Codd’s 12 Relational Database Rules
1. Information: All information in a relational database must be logically represented as column values in rows within tables.
2. Guaranteed Access: Every value in a table is guaranteed to be accessible through a combination of table name, primary key value, and column name.
3. Systematic Treatment of Nulls: Nulls must be represented and treated in a systematic way, independent of data type.
4. Dynamic Online Catalogue Based on the Relational Model: The metadata must be stored and managed as ordinary data, that is, in tables within the database. Such data must be available to authorized users using the standard database relational language.
5. Comprehensive Data Sublanguage: The relational system may support many languages; however, it must support one well-defined, declarative language with support for data definition, view definition, and data manipulation.
6. View Updating: Any view that is theoretically updatable must be updatable through the system.
7. High-Level Insert, Update, and Delete: The database must support set-level inserts, updates, and deletes.
8. Physical Data Independence: Application programs and ad hoc facilities are logically unaffected when physical access methods or storage structures are changed.
9. Logical Data Independence: Application programs and ad hoc facilities are logically unaffected when changes are made to the table structures that preserve the original table values (such as changing the order of columns or inserting columns).
10. Integrity Independence: All relational integrity constraints must be definable in the relational language and stored in the system catalog, not at the application level.
11. Distribution Independence: The end users and application programs are unaware of, and unaffected by, the data location (distributed vs. local databases).
12. Nonsubversion: If the system supports low-level access to the data, there must not be a way to bypass the integrity rules of the database.
Rule Zero: All preceding rules are based on the notion that in order for a database to be considered relational, it must use its relational facilities exclusively to manage the database.
THE ENTITY RELATIONSHIP MODEL (ERM)
Peter Chen first introduced the ER data model in 1976; it was a graphical representation of entities and their relationships in a database structure that quickly became popular because it complemented the relational data model concepts. The relational data model and the ERM combined to provide the foundation for tightly structured database design. ER models are normally represented in an entity relationship diagram (ERD), which uses graphical representations to model database components. The ERD represents the conceptual database as viewed by the end user. ERDs depict the database’s main components: entities, attributes, and relationships. Because an entity represents a real-world object, the words entity and object are often used interchangeably.
The notations used with ERDs are the original Chen notation and the newer Crow’s Foot and UML notations. Some conceptual database modeling concepts can be expressed only using the Chen notation. Because of its implementation emphasis, the Crow’s Foot notation can represent only what could be implemented. In summary:
The Chen notation favors conceptual modeling.
The Crow’s Foot notation favors a more implementation-oriented approach.
The UML notation can be used for both conceptual and implementation modeling.
Components of the ER model:
Entity: An entity is anything about which data are to be collected and stored. An entity is represented in the ERD by a rectangle, also known as an entity box. The name of the entity, a noun, is written in the center of the rectangle. The entity name is generally written in capital letters and in the singular form: PAINTER rather than PAINTERS, and EMPLOYEE rather than EMPLOYEES. Usually, when applying the ERD to the relational model, an entity is mapped to a relational table. Each row in the relational table is known as an entity instance or entity occurrence in the ER model. Each entity is described by a set of attributes that describes particular characteristics of the entity. For example, the entity EMPLOYEE will have attributes such as a Social Security number, a last name, and a first name.
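As a small illustration of that mapping, the EMPLOYEE entity and its attributes might be implemented as the table sketched below. The column names and types are assumptions chosen for the example, and the identifier EMP_NUM anticipates the discussion of identifiers later in this section.

-- One entity (EMPLOYEE) maps to one table; each attribute becomes a column,
-- and each row of the table is one entity instance (occurrence).
CREATE TABLE EMPLOYEE (
    EMP_NUM   INT PRIMARY KEY,   -- identifier of each entity instance
    EMP_SSN   CHAR(11),
    EMP_LNAME VARCHAR(30),
    EMP_FNAME VARCHAR(30)
);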
A collection of like entities is known as an entity set. The word entity in the ERM corresponds to a table—not to a row—in the relational environment. The ERM refers to a table row as an entity instance or entity occurrence.
Attributes: Attributes are characteristics of entities. For example, the STUDENT entity includes, among many others, the attributes STU_LNAME, STU_FNAME, and STU_INITIAL. In the original Chen notation, attributes are represented by ovals and are connected to the entity rectangle with a line. Each oval contains the name of the attribute it represents. In the Crow’s Foot notation, the attributes are written in the attribute box below the entity rectangle. Because the Chen representation is rather space-consuming, software vendors have adopted the Crow’s Foot attribute display.
Required and Optional Attributes: A required attribute is an attribute that must have a value; in other words, it cannot be left empty. As shown below, there are two boldfaced attributes in the Crow’s Foot notation; the boldface indicates that a data entry is required. In this example, STU_LNAME and STU_FNAME require data entries because of the assumption that all students have a last name and a first name. But students might not have a middle name, and perhaps they do not (yet) have a phone number and an e-mail address. Therefore, those attributes are not presented in boldface in the entity box. An optional attribute is an attribute that does not require a value; therefore, it can be left empty.
Attributes of the STUDENT entity: Chen and Crow’s Foot
Attribute Domains: Attributes have a domain. A domain is the set of possible values for a given attribute. For example, the domain for the grade point average (GPA) attribute is written (0,4) because the lowest possible GPA value is 0 and the highest possible value is 4. The domain for the gender attribute consists of only two possibilities: M or F (or some other equivalent code). The domain for a company’s date of hire attribute consists of all dates that fit in a range (for example, company startup date to current date). Attributes may share a domain. For instance, a student address and a professor address share the same domain of all possible addresses. In fact, the data dictionary may let a newly declared attribute inherit the characteristics of an existing attribute if the same attribute name is used. For example, the PROFESSOR and STUDENT entities may each have an attribute named ADDRESS and could therefore share a domain.
Identifiers (Primary Keys): The ERM uses identifiers, that is, one or more attributes that uniquely identify each entity instance. In the relational model, such identifiers are mapped to primary keys (PKs) in tables. Identifiers are underlined in the ERD. Key attributes are also underlined in a frequently used table structure shorthand notation using the format:
Table Name (Key_Attribute 1, Attribute 2, Attribute 3, ... Attribute K)
Composite Identifiers: Ideally, an entity identifier is composed of only a single attribute. However, it is possible to use a composite identifier, that is, a primary key composed of more than one attribute. For example, the CLASS entity could use the composite identifier CRS_CODE + CLASS_SECTION instead of a single-attribute identifier such as CLASS_CODE. Either approach uniquely identifies each entity instance.
Composite and Simple Attributes: Attributes are classified as simple or composite. A composite attribute, not to be confused with a composite key, is an attribute that can be further subdivided to yield additional attributes.
For example, the attribute ADDRESS can be subdivided into street, city, state, and zip code. Similarly, the attribute PHONE_NUMBER can be subdivided into area code and exchange number. A simple attribute is an attribute that cannot be subdivided. For example, age, sex and marital status would be classified as simple attributes. To facilitate detailed queries, it is wise to change composite attributes into a series of simple attributes. Single-Valued Attributes: A single-valued attribute is an attribute that can have only a single value. For example, a person can have only one Social Security number, and a manufactured part can have only one serial number. Keep in mind that a single-valued attribute is not necessarily a simple attribute. For instance, a part’s serial number, such as SE-08-02-189935, is single-valued, but it is a composite attribute because it can be subdivided into the region in which the part was produced (SE), the plant within that region (08), the shift within the plant (02), and the part number (189935). Multivalued Attributes: Multivalued attributes are attributes that can have many values. For instance, a person may have several college degrees, and a household may have several different phones, each with its own number. Similarly, a car’s color may be subdivided into many colors (that is, colors for the roof, body, and trim). In the Chen ERM, the multivalued attributes are shown by a double line connecting the attribute to the entity. The Crow’s Foot notation does not identify multivalued attributes. A multivalued attribute in an entity The ERD above contains all of the components introduced thus far. Note that CAR_VIN is the primary key, and CAR_COLOR is a multivalued attribute of the CAR entity. Implementing Multivalued Attributes Although the conceptual model can handle M:N relationships and multivalued attributes, it is poor practice to implement them in the RDBMS. In the relational table, each column/row intersection represents a single data value. The designer must decide on one of two possible courses of action to handle multivalued attributes. 1. Split the multivalued attribute to create several new attributes. For example, the CAR entity’s attribute CAR_COLOR can be split to create the new attributes CAR_TOPCOLOR, CAR_BODYCOLOR, and CAR_TRIMCOLOR, which are then assigned to the CAR entity. Although this solution seems to work, its adoption can lead to major structural problems in the table. For example, if additional color components—such as a logo color—are added for some cars, the table structure must be modified to accommodate the new color section. In that case, cars that do not have such color sections generate nulls for the nonexisting components, or their color entries for those sections are entered as N/A to indicate “not applicable.” Also consider the employee entity containing employee degrees and certifications. If some employees have 10 degrees and certifications while most have fewer or none, the number of degree/certification attributes would number 10, and most of those attribute values would be null for most of the employees.) In short, while solution 1 is practicable, it is not an acceptable solution. Splitting the multivalued attribute into new attributes 2. Create a new entity composed of the original multivalued attribute’s components. This new entity allows the designer to define color for different sections of the car. (See Table below). 
Components of the Multivalued Attribute
SECTION     COLOR
Top         White
Body        Blue
Trim        Gold
Interior    Blue
Another benefit we can derive from this approach is that we are now able to assign as many colors as necessary without having to change the table structure.
A new entity set composed of a multivalued attribute’s components
Note that the ERM shown in the Figure above reflects the components listed in the previous table. This is the preferred way to deal with multivalued attributes. Creating a new entity in a 1:M relationship with the original entity yields several benefits: it is a more flexible, expandable solution, and it is compatible with the relational model.
Derived Attributes: A derived attribute is an attribute whose value is calculated (derived) from other attributes. The derived attribute need not be physically stored within the database; instead, it can be derived by using an algorithm. For example, an employee’s age, EMP_AGE, may be found by computing the integer value of the difference between the current date and the EMP_DOB. In Microsoft Access, we use INT((DATE() - EMP_DOB)/365). In Microsoft SQL Server, we use SELECT DATEDIFF(YEAR, EMP_DOB, GETDATE()), where DATEDIFF is a function that computes the difference between dates; the first parameter indicates the measurement, in this case, years. In Oracle, we use SYSDATE instead of DATE(). A derived attribute is indicated in the Chen notation by a dashed line connecting the attribute and the entity. The Crow’s Foot notation does not have a method for distinguishing the derived attribute from other attributes.
Depiction of a derived attribute
Derived attributes are sometimes referred to as computed attributes. A derived attribute computation can be as simple as adding two attribute values located on the same row, or it can be the result of aggregating the sum of values located on many table rows (from the same table or from a different table). The decision to store derived attributes in database tables depends on the processing requirements and the constraints placed on a particular application. The designer should be able to balance the design in accordance with such constraints. The Table below shows the advantages and disadvantages of storing (or not storing) derived attributes in the database.
Advantages and disadvantages of storing (or not storing) derived attributes in the database.
Relationships: Relationships describe associations among data. Most relationships describe associations between two entities. The three types of relationships among data are one-to-many (1:M), many-to-many (M:N), and one-to-one (1:1). The ER model uses the term connectivity to label the relationship types. The name of the relationship is usually an active or passive verb. For example, a PAINTER paints many PAINTINGs; an EMPLOYEE learns many SKILLs; an EMPLOYEE manages a STORE.
Illustrated below are the different types of relationships using two ER notations: the original Chen notation and the more current Crow’s Foot notation. The left side of the ER diagram shows the Chen notation, based on Peter Chen’s landmark paper. In this notation, the connectivities are written next to each entity box. Relationships are represented by a diamond connected to the related entities through a relationship line. The relationship name is written inside the diamond. The right side illustrates the Crow’s Foot notation. The name “Crow’s Foot” is derived from the three-pronged symbol used to represent the “many” side of the relationship.
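Before continuing with the notation, the derived-attribute idea can be tied together with one short query. This sketch assumes a MySQL environment (one of the DBMSs used in this course) and an EMPLOYEE table that includes an EMP_DOB column; TIMESTAMPDIFF computes whole-year differences, so the derived EMP_AGE never has to be stored.

-- Derived attribute: EMP_AGE is computed at query time rather than stored.
SELECT EMP_NUM,
       EMP_LNAME,
       TIMESTAMPDIFF(YEAR, EMP_DOB, CURDATE()) AS EMP_AGE
FROM EMPLOYEE;

If the age were needed constantly, the designer could instead store it and accept the cost of keeping it up to date, which is exactly the trade-off summarized above.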
In the basic Crow’s Foot ERD represented above, the connectivities are represented by symbols. For example, the “1” is represented by a short line segment, and the “M” is represented by the three-pronged “crow’s foot.” The relationship name is written above the relationship line. In Figure above, entities and relationships are shown in a horizontal format, but they may also be oriented vertically. The entity location and the order in which the entities are presented are immaterial; just remember to read a 1:M relationship from the “1” side to the “M” side. Connectivity and Cardinality As stated above, the term connectivity is used to describe the relationship classification. Cardinality expresses the minimum and maximum number of entity occurrences associated with one occurrence of the related entity. In the ERD, cardinality is indicated by placing the appropriate numbers beside the entities, using the format (x,y). The first value represents the minimum number of associated entities, while the second value represents the maximum number of associated entities. Many database designers who use Crow’s Foot modeling notation do not depict the specific cardinalities on the ER diagram itself because the specific limits described by the cardinalities cannot be implemented directly through the database design. Correspondingly, some Crow’s Foot ER modeling tools do not print the numeric cardinality range in the diagram; instead, you can add it as text if you want to have it shown. Connectivity and Cardinality Knowing the minimum and maximum number of entity occurrences is very useful at the application software level. A college might want to ensure that a class is not taught unless it has at least 10 students enrolled. Similarly, if the classroom can hold only 30 students, the application software should use that cardinality to limit enrollment in the class. However, keep in mind that the DBMS cannot handle the implementation of the cardinalities at the table level—that capability is provided by the application software or by triggers. Existence Dependence: An entity is said to be existence-dependent if it can exist in the database only when it is associated with another related entity occurrence. In implementation terms, an entity is existence-dependent if it has a mandatory foreign key—that is, a foreign key attribute that cannot be null. For example, if an employee wants to claim one or more dependents for tax-withholding purposes, the relationship “EMPLOYEE claims DEPENDENT” would be appropriate. In that case, the DEPENDENT entity is clearly existence dependent on the EMPLOYEE entity because it is impossible for the dependent to exist apart from the EMPLOYEE in the database. If an entity can exist apart from all of its related entities (it is existence-independent), then that entity is referred to as a strong entity or regular entity. Relationship Strength: The concept of relationship strength is based on how the primary key of a related entity is defined. To implement a relationship, the primary key of one entity appears as a foreign key in the related entity. For example, the 1:M relationship between VENDOR and PRODUCT is implemented by using the VEND_CODE primary key in VENDOR as a foreign key in PRODUCT. There are times when the foreign key also is a primary key component in the related entity. Relationship strength decisions affect primary key arrangement in database design. 
Weak (Non-identifying) Relationships: A weak relationship, also known as a non-identifying relationship, exists if the PK of the related entity does not contain a PK component of the parent entity. By default, relationships are established by having the PK of the parent entity appear as an FK on the related entity. For example, suppose that the COURSE and CLASS entities are defined as: COURSE(CRS_Code, Dept_Code, CRS_Description, CRS_Credit) CLASS(Class_Code, CRS_Code, Class_Section, Class_Time, Room_Code, Prof_Num) In this case, a weak relationship exists between COURSE and CLASS because the CLASS_CODE is the CLASS entity’s PK, while the CRS_CODE in CLASS is only an FK. In this example, the CLASS PK did not inherit the PK component from the COURSE entity. Crow’s Foot notation depicts a weak relationship Strong (Identifying) Relationships: A strong relationship, also known as an identifying relationship, exists when the PK of the related entity contains a PK component of the parent entity. For example, the definitions of the COURSE and CLASS entities COURSE(CRS_CODE,DEPT_CODE,CRS_DESCRIPTION,CRS_CREDIT) CLASS(CRS_CODE,CLASS_SECTION,CLASS_TIME,ROOM_CODE,PROF_NUM) indicate that a strong relationship exists between COURSE and CLASS, because the CLASS entity’s composite PK is composed of CRS_CODE + CLASS_SECTION. (Note that the CRS_CODE in CLASS is also the FK to the COURSE entity.) The Crow’s Foot notation depicts the strong (identifying) relationship with a solid line between the entities. Whether the relationship between COURSE and CLASS is strong or weak depends on how the CLASS entity’s primary key is defined. Keep in mind that the order in which the tables are created and loaded is very important. For example, in the “COURSE generates CLASS” relationship, the COURSE table must be created before the CLASS table. After all, it would not be acceptable to have the CLASS table’s foreign key reference a COURSE table that does not yet exist. Crow’s Foot notation depicts a strong relationship Weak Entities: a weak entity is one that meets two conditions: The entity is existence-dependent; that is, it cannot exist without the entity with which it has a relationship. The entity has a primary key that is partially or totally derived from the parent entity in the relationship. For example, a company insurance policy insures an employee and his/her dependents. For the purpose of describing an insurance policy, an EMPLOYEE might or might not have a DEPENDENT, but the DEPENDENT must be associated with an EMPLOYEE. Moreover, the DEPENDENT cannot exist without the EMPLOYEE; that is, a person cannot get insurance coverage as a dependent unless s(he) happens to be a dependent of an employee. DEPENDENT is the weak entity in the relationship “EMPLOYEE has DEPENDENT.” Note that the Chen notation above identifies the weak entity by using a double-walled entity rectangle. The Crow’s Foot notation generated by Visio Professional uses the relationship line and the PK/FK designation to indicate whether the related entity is weak. A strong (identifying) relationship indicates that the related entity is weak. Such a relationship means that both conditions for the weak entity definition have been met—the related entity is existence-dependent, and the PK of the related entity contains a PK component of the parent entity. Remember that the weak entity inherits part of its primary key from its strong counterpart. 
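A DDL sketch of the two CLASS definitions above makes the difference concrete. The column types are assumptions, and the tables are named CLASS_WEAK and CLASS_STRONG only so that both variants can be shown side by side; in a real design there would be a single CLASS table defined one way or the other.

-- Weak (non-identifying) relationship: CLASS has its own single-attribute PK,
-- and CRS_CODE is only a foreign key to COURSE.
CREATE TABLE COURSE (
    CRS_CODE        VARCHAR(10) PRIMARY KEY,
    DEPT_CODE       VARCHAR(10),
    CRS_DESCRIPTION VARCHAR(60),
    CRS_CREDIT      INT
);

CREATE TABLE CLASS_WEAK (
    CLASS_CODE    VARCHAR(10) PRIMARY KEY,
    CRS_CODE      VARCHAR(10) NOT NULL,
    CLASS_SECTION VARCHAR(5),
    FOREIGN KEY (CRS_CODE) REFERENCES COURSE (CRS_CODE)
);

-- Strong (identifying) relationship: the parent's key CRS_CODE is part of the
-- child's composite PK, so CLASS inherits a key component from COURSE.
CREATE TABLE CLASS_STRONG (
    CRS_CODE      VARCHAR(10) NOT NULL,
    CLASS_SECTION VARCHAR(5) NOT NULL,
    CLASS_TIME    VARCHAR(20),
    PRIMARY KEY (CRS_CODE, CLASS_SECTION),
    FOREIGN KEY (CRS_CODE) REFERENCES COURSE (CRS_CODE)
);

Whether the relationship is weak or strong therefore depends entirely on how the CLASS primary key is defined, not on the data the table stores.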
For example, at least part of the DEPENDENT entity’s key shown in the Figure above was inherited from the EMPLOYEE entity:
EMPLOYEE (EMP_NUM, EMP_LNAME, EMP_FNAME, EMP_INITIAL, EMP_DOB, EMP_HIREDATE)
DEPENDENT (EMP_NUM, DEP_NUM, DEP_FNAME, DEP_DOB)
Crow’s Foot symbols
Relationship Degree: A relationship degree indicates the number of entities or participants associated with a relationship. A unary relationship exists when an association is maintained within a single entity. A binary relationship exists when two entities are associated. A ternary relationship exists when three entities are associated. Although higher degrees exist, they are rare and are not specifically named. (For example, an association of four entities is described simply as a four-degree relationship.)
FILE SYSTEM
In computing, a file system is a method for storing and organizing computer files and the data they contain to make it easy to find and access them. File systems may use a data storage device such as a hard disk or CD-ROM and involve maintaining the physical location of the files. More formally, a file system is a special-purpose database for the storage, hierarchical organization, manipulation, navigation, access, and retrieval of data.
The most familiar file systems make use of an underlying data storage device that offers access to an array of fixed-size blocks, sometimes called sectors, generally a power of 2 in size (512 bytes or 1, 2, or 4 KiB are most common). The file system software is responsible for organizing these sectors into files and directories, and for keeping track of which sectors belong to which file and which are not being used. Most file systems address data in fixed-size units called “clusters” or “blocks”, which contain a certain number of disk sectors (usually 1 to 64). This is the smallest logical amount of disk space that can be allocated to hold a file.
File names
Whether the file system has an underlying storage device or not, file systems typically have directories which associate file names with files, usually by connecting the file name to an index in a file allocation table of some sort. Directory structures may be flat, or allow hierarchies where directories may contain subdirectories. In some file systems, file names are structured, with special syntax for filename extensions and version numbers. In others, file names are simple strings, and per-file metadata is stored elsewhere.
File organization method
There are a large number of ways data can be organized on disk or tape. The main methods of file organization used for files are:
Serial
Sequential
Indexed Sequential
Random (or Direct)
Serial Organization
Serial files are stored in chronological order; that is, as each record is received, it is stored in the next available storage position. In general, serial organization is only used on a serial medium such as magnetic tape. This type of file organization means that the records are in no particular order, and therefore, to retrieve a single record, the whole file needs to be read from beginning to end. Serial organization is usually the method used for creating Transaction files (unsorted), Work files, and Dump files.
Sequential Organization
Sequential files are serial files whose records are sorted and stored in ascending or descending order on a particular key field.
The physical order of the records on the disk is not necessarily sequential, as most manufacturers support an organization where certain records (inserted after the file has been set up) are held in a logical sequence but are physically placed into an overflow area. They are no longer physically contiguous with the preceding and following logical records, but they can still be retrieved in sequence.
Indexed Sequential Organization
Indexed sequential file organization is logically the same as sequential organization, but an index is built indicating the block containing the record with a given value for the key field. This method combines the advantages of a sequential file with the possibility of direct access using the primary key (the primary key is the field that is used to control the sequence of the records). These days, manufacturers providing indexed sequential software allow for the building of indexes using fields other than the primary key. These additional fields on which indexes are built are called secondary keys.
There are three major types of indexes used:
Basic Index: This provides a location for each record (key) that exists in the system.
Implicit Index: This type of index gives a location for all possible records (keys), whether they exist or not.
Limit Index: This index groups the records (keys) and only provides the location of the highest key in the group. Generally, limit indexes form a hierarchical index.
Data records are blocked before being written to disk. An index may consist of the highest key in each block (or on each track). In the example above, data records are shown as being three to a block. The index, then, holds the key of the highest record in each block. (An essential element of the index, which has been omitted from the diagram for simplicity, is the physical address of the block of data records.)
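In a modern DBMS, the same idea surfaces as explicitly created indexes. The sketch below is purely illustrative (the table and column names are invented for the example): the primary key provides the main ordered access path, and CREATE INDEX builds a secondary-key index so that records can also be located quickly by a non-key field instead of scanning the whole file.

-- Primary key: the field that controls the logical sequence of the records.
CREATE TABLE CUSTOMER (
    CUS_NUM   INT PRIMARY KEY,
    CUS_LNAME VARCHAR(30),
    CUS_CITY  VARCHAR(30)
);

-- Secondary key: an additional index built on a field other than the primary key.
CREATE INDEX IDX_CUSTOMER_CITY ON CUSTOMER (CUS_CITY);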
