Database Design (Functional Dependencies) PDF
Document Details
Uploaded by Deleted User
Tags
Summary
This document provides an introduction to database design, focusing on functional dependencies. It explores the concept of functional dependencies as a many-to-one relation between attributes. The document also discusses different types of relationships and use cases within a database context.
Full Transcript
Database Design (Functional Dependencies) Introduction Task Given some body of data to be represented in database, to decide on a suitable logical structure for that data. In other words, one of the hardest things is to take an informally set-up table co...
Database Design (Functional Dependencies) Introduction Task Given some body of data to be represented in database, to decide on a suitable logical structure for that data. In other words, one of the hardest things is to take an informally set-up table containing some information and convert this to a database appropriately. Correct Approach for Design Correct approach to do database design is get the logical design right first without paying any attention to physical design. Functional Dependency in a relation A functional dependency is a property of the semantics of the attributes in a relation that is i) how attributes relate to one another, and ii) specify the functional dependencies between attributes. Functional Dependency is a many-to-one relationship from one set attribute to another within a given relvar (relation). Example 1: Relationship between Entity Sets In a many-to-one relationship from Loan to Customer entity sets a loan is associated with several (including 0) customers via borrower, and a customer is associated with at most one loan via borrower Example 2: Between Attributes In a many-to-one relationship between attributes S# and CITY of S relation : a) Many Suppliers can reside in same City, but b) a Supplier can reside in only one City. Important When a functional dependency is present, the dependency is specified as a constraint between the attributes. Definition : FD Consider a relation R with attributes A and B, where attribute B is functionally dependent on attribute A whenever two tuples of R agree on their A value, they also agree on their B value. In the figure above, A is the determinant of B and B is the consequent of A (or B is dependent on A) The DETERMINANT of a functional dependency is the attribute or group of attributes on the left- hand side of the arrow in the functional dependency The DEPENDENT (Consequent) of a FD is the attribute or group of attributes on the right- hand side of the arrow. Observe If we know the value of A and we examine the relation that holds this dependency, we will find only one value of B in all of the tuples that have a given value of A, at any moment in time. however, for a given value of B there may be several different values of A Example : In Supplier relation, a particular Supplier (S1) will have a specified City value (London), but there may be other suppliers who are living in that particular City {S#} → {CITY} Important When identifying FDs between attributes in a relation, it is important to distinguish clearly between a) the values held by an attribute at a given point in time and b) the set of all possible values that an attributes may hold at different times. Example 1: Consider shipment relvar (relation) SP For a given pair of attributes S# and , P#, there is a one corresponding value of of attribute QTY and Many distinct values of the pair S#, P# can have same value for attribute QTY { S#,P# } → { QTY } Example 2 : Consider the following Relational schema STAFFBRANCH The functional dependency staff# → position holds on this relation instance but, The relationship between staff# and position is 1:1 i.e. for each staff member there is only one position. the reverse functional dependency position → staff# does not hold. the relationship between position and staff# is 1:M – there are several staff numbers associated with a given position. Types of FDs Time Dependent FD Definition of functional dependency based on an instance of a relvar (or relation value) Let r be a relation, and let X and Y be arbitrary subset of the set of attributes of r. Then, Y is functionally dependent on X X→Y (or X functionally determines Y) if and only if each X value in r has associated with it precisely one Y value in r i.e. whenever two tuples of r value agree on their X value they also agree on their Y value. Example 3 : Consider relation SCP(S#, CITY, P#, QTY) with primary key {S#,P#} and consider a possible value of SCP. As clear from relation SCP {S#} → {CITY} If S# value is same then City value will also be same. whenever two tuples of R agree on their A value, they also agree on their B value. Other FDs in relation SCP are { S#,P# } → { QTY } { S#,P# } → { CITY } { S#,P# } → { CITY,QTY } { S#,P# } → { S# } { S#,P# } → { S#,P#,CITY,QTY } Example : Time Dependent FDs { S# } → { QTY } { QTY} → {S#} It is important to note Whenever S# is S1 value of Quantity 100 is fixed and whenever Quantity is 400 value of S# is S4 is fixed. Note last two FDs in the list hold only for a particular instance of relation SCP not for all possible values of the relation ( Relation Variable). Time Independent Fd Definition of functional dependency based on Base relvar (or relation variable) Let R be a relation variable, and let X and Y be arbitrary subset of set of attributes of R. Then Y is functionally dependent on X X→ Y ( X functionally determines Y ) if and only if, in every possible legal value of R, each X value has associated with it precisely one Y value, i.e. whenever two tuples of R agree on their X value they also agree on their Y value. Example : Consider time independent FDs in SCP { S#,P# } → { QTY } { S#,P# } → { CITY } { S#,P# } → { CITY,QTY } { S#,P# } → { S# } { S#,P# } → { S#,P#,CITY,QTY } {S#} → {CITY} Functional dependency {S#} → {CITY} hold for all possible values of the relvar SCP, i.e. holds for all time and is an integrity constraint on relvar SCP. Why Time Independent Fds? The reason that we need to identify Time Independent Fds of a relation is that these Fds represent the types of integrity constraints that we need to identify Such constraints indicate the limitations on the values that a relation can legitimately assume. In other words, They identify the legal instances which are possible. Requirement In order to identify the time invariant Fds, we need to clearly understand the semantics of the various attributes in each of the relation schemas in question. IMPORTANT: Observations If relvar R satisfies the FD X→ Y 1. If X is a candidate key of relvar R, then all attributes Y of relvar R must be functionally dependent on X Example : PARTS relvar P and SUPPLIER relvar S. 2. If X is not a candidate key, then relation R will involve some redundancy, Example : In SCP relation {S#} → {CITY} Primary key is { S#,P# } attribute CITY appears many times as S# appears in SCP relation. Important : a functional dependency is a property of a relational schema (its intension) and not a property of a particular instance of the schema (extension). Fds represent integrity constraints, and the DBMS needs to enforce them. Problems : with all fds : As complete set of FDs S for a relation can be very large, it is desirable to find set of FDs T that is much smaller than S. When enforced then, all FDs in S will automatically be implemented i.e. set of FDs in T imply all Fds in S. Trivial and Non-Trivial FDs Identifying Fds which hold for all possible values of the attributes involved in the Fd, we also want to ignore trivial functional dependencies. Trivial Functional dependency A functional dependency is trivial if, the dependent is a subset of the determinant. Example 1: Consider relation SCP { S#,P# } → { S# } Example 2 : Using the relation instances STAFFBRANCH, the trivial dependencies include: { staff#, sname} → sname { staff#, sname} → staff# Although trivial Fds are valid, they offer no additional information about integrity constraints for the relation. As far as normalization is concerned, trivial Fds are ignored. Non-Trivial Functional dependency If a FD is not trivial then it is Non- Trivial. Closure of a Set of Dependencies FDs might imply other Fds. Example 1 : consider FD in relation SCP { S#,P# } → { CITY,QTY } implies { S#,P# } → { CITY } { S#,P# } → { QTY } Example 2 : Consider a relation (relvar) R with attributes A, B and, C such that Fds A → B and B → C holds in R then it can easily be observed that FD A→C also holds in R i.e. C is transitively dependent on A via B. Consider S as the set of functional dependencies that are specified on a relational schema R. Typically, the schema designer specifies the Fds that are semantically obvious usually however, numerous other Fds hold in all legal relation instances that satisfy the dependencies in S These additional Fds can be inferred or deduced from the Fds in S. Definition : Closure of Fd The set of all functional dependencies implied by a set of functional dependencies S is called the closure of S and is denoted by S+. Inference Rules for Functional Dependencies Axioms or rules of inference provide simpler techniques for reasoning about functional dependencies. These rules are called Armstrong’s Axioms, by which new s, set of FDs can be inferred from given one’s. Axioms Or Rules of Inference Let A, B, and C be arbitrary subsets of the set of attributes of the given relvar R, and Assume that AB means union of A and B. Then: 1. Reflexivity: If B is a subset of A, then A → B 2. Augmentation: If A → B, then AC → BC 3. Transitivity: If A → B and B → C, then A →C The rules are complete, in the sense that, given a set S of FDs, all FDs implied by S can be derived from S using the rules. They are also sound, in the sense that no additional FDs (i.e., FDs not implied by S) can be so derived. In other words, the rules can be used to derive precisely the closure S+. Additional rules can be used to simplify the practical task of computing S+ from S. Consider D as another arbitrary subset of the set of attributes of R. 4. Self-determination: A → A 5. Decomposition: If A → BC, then A → B and A → C 6. Union: If A → B and A → C then A → BC 7. Composition: If A → B and C → D then AC → BD Darwen proves the General Unification Theorem: 8. lf A → B and C→ D, then A U(C - B) → BD (where ”U” is union and ”-” is set difference). Example: Consider a relvar R with attributes A, B, C, D, E, F, and the FDs A → BC B →E CD → EF Here we are using BC for the set consisting of attributes B and C Show that the FD AD → F holds for R and is thus a member of the closure of the given set: Now, to make problem more clear consider : A as employee number, B as department number, C as manager’s employee number, D as project number for a project directed by that manager (unique within manager), E as department name F as percentage of time allocated by the specified manager to the specified project. Given A → BC, B → E, CD → EF Show that the FD AD → F Solution : 1. A → BC (given) 2. A → C (Rule 4, decomposition) If A → BC, then A → B and A → C 3. AD → CD (Rule 2, augmentation) If A → B, then AC → BC 4. CD → EF (given) 5. AD → EF (Rule 3, transitivity) If A → B and B → C, then A →C 6. AD → F (Rule 5, decomposition) Proved Example 2 : Given R = (A,B,C,D,E,F,G,H, I, J) and F = {AB → E, AG→ J, BE→ I, E→ G, GI→ H} Show that AB→ GH Solution : 1. AB→ E, given in F 2. AB→AB, reflexive rule IR1 3. AB→B, projective rule IR4 from step 2 4. AB→ BE, additive rule IR5 from steps 1 and 3 5. BE→ I, given in F 6. AB→ I, transitive rule IR3 from steps 4 and 5 7. E→ G, given in F 8. AB→ G, transitive rule IR3 from steps 1 and 7 9. AB→ GI, additive rule IR5 from steps 6 and 8 10. GI→ H, given in F 11. AB→ H, transitive rule IR3 from steps 9 and 12. AB→ GH, additive rule IR5 from steps 8 CLOSURE OF A SET OF ATTRIBUTES In principle, we can compute the closure S+ of a given set S of FDs by means of an algorithm that says “Repeatedly apply the rules from the previous section until they stop producing new FDs.” It can be shown that, given a relvar R, a set Z of attributes of R, and a set S of FD that hold for R, we can determine the set of all attributes of R that are functionally dependent on Z—the so- called closure Z+ of Z under S Example: Suppose we are given relvar R with attributes A, B, C, D, E, F, and FDs A → BC E → CF B → E CD → EF Steps To Compute The Closure (closure Z+ of Z under S) We now compute the closure {A,B }+ of the set of attributes {A,B } under this set of FDs. We initialize the result CLOSURE[Z,S ] to {A,B} We now go round the inner loop four times, once for each of the given FDs. Iteration 1: On the first iteration (for the FD A → BC) we find that the left-hand side is indeed a subset of CLOSURE[Z,S] as computed so far, so we add attributes B and C to the result. CLOSURE[Z,S] is now the set {A,B,C}. Iteration 2:On the second iteration (for the FD E → CF). we find that the left-hand side is not a subset of the result as computed so far, which thus remains unchanged. Iteration 3:On the third iteration (for the FD B → E), we add E to to CLOSURE[Z,S], which now has the value {A,B,C,E} Iteration 4:On the fourth iteration (for the FD CD → EF), CLOSURE[Z,S] remains unchanged Now we go round the inner loop four times again. On the first iteration, FD A → BC the result does not change; On the second, E → CF it expands to {A,B,C,E,F}; On the third B → E and fourth, CD → EF it does not change Now we go round the inner loop four times again. CLOSURE[Z,S ] does not change, and so the whole process terminates, with {A,B}+ = {A,B,C,E,F}. An important corollary of the foregoing is as follows: Given a set S of FDs. we can easily tell whether a specific FD X → Y follows from S. because that FD will follow if and only if Y is a subset of the closure X+ of X under S. In other words, we now have a simple way of determining whether a given FD X → Y is in the closure S+ of S. without actually having to compute that closure S. Another important corollary is the following: We know that a superkey for a relvar R is a set of attributes of R that includes some candidate key of R as a subset— not necessarily a proper subset, of course. Now, it follows immediately from the definition that the superkeys for a given relvar R are precisely those subsets K of the attributes of R such that the FD K→A holds true for every attribute A of R. In other words, K is a superkey if and only if the closure K of K—under the given set of FDs—is precisely the set of all attributes of R (and K is a candidate key if and only if it is an irreducible superkey). Irreducible Sets Of Dependencies (Related terms) Let S1 and S2 be two sets of Fds, Cover if every FD implied by S1 is implied by S2 i.e. if S1+ is a subset of S2+ we say that S2 is a cover for S1(Cover here means equivalent set). if the DBMS enforces the Fds in S2, then it will automatically be enforcing the Fds in S1. Equivalent if S2 is a cover for S1 and S1 is a cover for S2 i.e. if S1+ = S2+ we say that S1 and S2 are equivalent, if the DBMS enforces the Fds in S2 it will automatically be enforcing the Fds in S1, and vice versa. Properties of Irreducible Set S of Fds Now we define a set S of Fds to be irreducible (Usually called minimal in the literature) if and only if , it satisfies the following three properties 1. The right hand side (the dependent) of every Fds in S involves just one attribute (that is, it is singleton set) 2. The left hand side (determinant) of every FDs in S is irreducible i.e. no attribute can be discarded from the determinant without changing the closure S+ (that is, with out converting S into some set not equivalent to S). We will say that such an Fd is left irreducible. 3. No Fd in S can be discarded from S without changing the closure S+(That is, without converting S into some set not equivalent to S) Database Design (Normalisation Process) Topics Covered Database Design : Introduction to Normalization theory. 1NF relations and update anomalies Nonloss decomposition Normal forms Normalisation in Relational Databases One of the hardest things is to take an informally set-up table containing some information and convert this to a database appropriately. Normal Forms The process normalisation is built around the concept of normal forms. A relvar is said to be in a particular normal form if it satisfies a certain prescribed set of conditions. For example : A relvar is said to be in second normal form (2NF) if and only if it is in 1NF and also satisfies another condition. Many normal forms have been defined as shown in figure below. The first three (1NF, 2NF, 3NF were defined by Codd) As Fig. suggests, all normalized relvar, in 1NF; some 1NF relvars are also in 2NF; and some 2NF relvars are also in 3NF. The motivation behind Codd’s definitions was that 2NF was “more desirable” than 1NF, and 3NF in turn was more desirable than 2NF. Important Thus, the database designer should generally aim for a design involving relvars in 3NF, not ones that merely in 2NF or 1NF. Normalisation Process By applying Normalisation Procedure, a relvar that happens to be in some given normal form, say 2NF, be replaced by a set of relvars in some more desirable form, say 3NF. Definition Normalisation process can be characterized as that procedure of the, successive reduction of a given collection of relvars to some more desirable form and that procedure is reversible. Problem Description Suppose we have some information about the newspaper-reading habits of various persons: Smith reads the Record and the Mail Lee reads the Herald etc. Let’s represent this information in a table: Unnormalised Relations Here’s a first try: Problem This is not ideal. Each person is associated with an unspecified number of papers. The items in the PaperList column do not have a consistent form. Requirement: Generally, in RDBMS each entry in a table needs to have a single data item in it. Above relation is an unnormalised relation. All RDBMS require relations not to be like this i.e. not to have multiple values in any column (no repeating groups) First Normal Form (1NF) Here is another try: Observation Above relation clearly contains the same information, and It has the property that we sought. It is in First Normal Form (1NF). Definition : 1NF Relation A relation is in 1NF if no entry consists of more than one value (i.e. does not have repeating groups) Finally This will be the first requirement in designing our databases: our relations must be in 1NF Achieving 1NF In general, to achieve 1NF we need to get rid of repeating groups in our tables Two Ways There are two alternative ways of doing this: One-Table Approach Two-Table Approach One-Table Approach we extend the table rows by replicating the non-repeated columns for each repeated item value. This is what we did in the above example. Two-Table Approach split the repeating and non-repeating data into separate tables (Non-loss Decomposition) then choose a primary key for the repeating data table, and insert this as a foreign key in the non- repeating data table Note The two-table approach is often better as it takes up less space and leads us to the next 2nd Normal Form Overall Normalisation Process Using Supplier_Part Database Example 2 : Let us consider Supplier-Part data base, consists of three relations S(Supplier), P(Part) and, SP(Shipment), with logical design (schema) given below : S{S#,SNAME, STATUS, CITY} Primary Key {S#} P{P#, PNAME, COLOR, WEIGHT, CITY} Primary Key {P#} SP{S#, P#, QTY} Primary Key {S#,P#} Important These relations are almost correctly designed and various attributes are appropriately placed. Problems : With Poorly Designed Relations let us consider another relation SCP with following schema SCP{S#, CITY, P#, QTY} Primary Key {S#, P#} Although, SCP relation is normalized ( 1NF), SCP relation has a good deal of redundancy, as City of a Supplier is repeated as many time as a particular Supplier supplies some part, which in turn causes other problems. If relvar R satisfies the FD X→ Y If X is not a candidate key, then relation R will involve some redundancy, Problems : The relation SCP suffers from update anomalies, i.e. Insert, delete and update anomalies Important The principle of normalisation allows us to recognise such problems and replace (or non-loss decompose) such relation in more desirable state. Solution Replace SCP relvar by two relvars with heading (S#, CITY) and (S#,P#, QTY) Nonloss Decomposition (NLD) Normalisation procedure involves decomposing a given relation into other relations (Relvar) but this decomposition should be nonloss, so that no information is lost i.e. join of the component relations should give back original relation. Example 1: Show it with SCP relation Example 2: consider relation S with attributes S#, Status and, City satisfies FDs S# → CITY ans S# → STATUS This relation can be decomposed into (a) SST(S#, Status) and SC(S#, City) It is lossless decomposition. It is easy to tell about a supplier’s Status and City (b) SST(S#, Status) and STC(Status, City) This decomposition is lossy and it is difficult to tell about a about City of a Supplier. In decomposition (a), above mentioned FDs for relation S, can easily be observed. S# → CITY ans S# → STATUS Heath’s theorem Concept of nonloss decompositions are based on Heath’s theorem. Let R{A,B,C} be a relation Variable, and A, B, C are set of attribute. If R satisfies the FD A→B then R is equal to the join of its projections on R1 [A, B] and R2 [A,C]. Here A is S#, B is CITY, C is Status. Points to remember Left-irreducible FDs As discussed earlier, that a FD is said to be left- irreducible if its left-hand side is “not too big.” For example, relvar SCP satisfies the FD {S#, P# } → CITY we also have the FD S# →CITY (i.e., CITY is also functionally dependent on S# alone). This latter FD is left- irreducible, but the previous one is not. Left-irreducible FDs and irreducible dependencies will turn out to be important in the definition of second and third normal form. FD Diagrams Let R be a relvar and let I be some irreducible set of FDs that apply to R. It is convenient to represent the set I by means of a functional dependency diagram (FD diagram). FD Diagrams for Relvars S, SP, and P S{S#,SNAME, STATUS, CITY} Primary Key {S#} P{P#, PNAME, COLOR, WEIGHT, CITY} Primary Key {P#} SP{S#, P#, QTY} Primary Key {S#,P#} Note that every arrow in the above Fig., is an arrow out of a candidate key (actually the primary key) of the relevant relvar. By definition, there will always be arrows out of each candidate key, because, for one value of each candidate key, there is always one value of everything else; those arrows can never be eliminated. Important It is important to note that if there are any other arrows then difficulties arise. The normalisation procedure can be characterized, as a procedure for eliminating arrows that are not arrows out of candidate keys. FDs are a semantic notion FDs are, of course, a special kind of integrity constraint. they are definitely a semantic notion. Recognizing the FDs is part of the process of understanding what the data means. Example the fact that relvar S satisfies the FD S# → CITY, means that each supplier is located in precisely one city. To look at this another way: There is a constraint in the real world that the database represents, namely that each supplier is located in precisely one city; Since it is part of the semantics of the situation, that constraint must somehow be observed in the database; The way to ensure that it is so observed is to specify it in the database definition, so that the DBMS can enforce it; The way to specify it in the database definition is to declare the FD. 1st, 2nd and 3rd Normal Forms To understand complete normalisation process we first start with informal definition of 3rd Normal Form. 3 NF: Defnition 1 A relation (relvar) is in 3NF if and only if the non-key attributes (if any) are mutually independent, and irreducibly dependent on the primary key. Example : Relation P is in 3 NF. Also S and SP. 3 NF: Definition 2 A relvar (relation) is in 3NF if and only if, for all time, each tuple consists of a primary key value that identifies some entity, together with a set of zero and more mutually independent attribute values that describe that entity in some way. Description Of Complete Normalization Process Lets begin our discussion with definition of a relation in 1 NF. A relvar (relation) is in 1 NF if and only if, in every legal value of that relvar, every tuple contains exactly one value for each attribute. Example 1: Staff borrowers in a Library. their staff number functions as a Library number and someone has had the ingenious idea of adding details of books borrowed (in the same relation/table) Here’s a first go: Primary key: (Sno,Bno) Problems With A 1NF Relation: Duplication Now suppose that Smith borrows another book. To ensure 1NF, we shall have to have a complete new row: We have stored all the other details about this member of staff again, in the new row i.e. complete information about smith is entered again Update Anomalies Such repetition as mentioned above means that updates can be difficult. Suppose that Smith goes on to a new grade. o Changes would be required to all records for Smith. o (And there is a danger that we may miss some.) Suppose that the salary for grade 2.7 is changed. o All records for all staff members on grade 2.7 would have to be changed. A fact should be stored only once. Updates are then problem free. This example relation is poorly structured, being subject to update anomalies. Insertion and Deletion Anomalies Suppose that a new scale point is created (2.8 earns £27,491) and as yet there is no-one on that scale point. How do we record this? The following violates entity integrity i.e. Primary Key is missing: We call this an insertion anomaly. Also: suppose that Smith returns all her books – do we delete all the corresponding rows? - removing all trace of Smith, – or do we remove the Bno and Date_out entries from all the rows? - leaving duplicated information about Smith These are deletion anomalies. Solution: Non-loss decomposition (NLD) How do we deal with these problems? We carry out a non-loss decomposition (NLD): we replace the relation by projections from which the original relation can be re-created (by joining) The re-creation must give no less and no more than we started with. Example 2 : Consider a relation FIRST, is in 1 NF and contains attributes taken from relations of supplier-Part database. FIRST(S#, Status, City, P#,Qty) primary key {S#,P#} additional constraint is City → Status The FD diagram of relation FIRST is more complex than relation P, as shown below. The redundancies in relation FIRST lead to various update anomalies as shown below. Update Anomalies in relation FIRST Problems in insertion, deleltion and updation of information in relation FIRST Insertion : It is not possible to insert that a particular supplier S5 is residing in Athens, because a supplier should supply some part because Part number is component of a primary key. Deletion : We cannot delete the sole first tuple for a particular supplier, because we will also lose information about that supplier’s City, e.g. tuple related to S3. Updation : This problem is due to redundancy and involves both searching and udation operation. The City value for a supplier appears as many times as a supplier number appears e.g. if a supplier S1 moves from London to Amsterdam, we face a problem of searching a City, London and replacing it with Amsterdam, may cause inconsistent results, as in one tuple City is Amsterdam and at another place it shows London. Solution The solution to the above problem is to replace FIRST by two projections SECOND{S#, Status, City} and SP{S#, P#, QTY} FD diagrams and Result relations are shown below. Now, problem faced with FIRST will not arise and it is possible to perform insertion (S5 in SECOND), deletion (S3 in SP) and updation (changing CITY value for S1) operations in the database. Now relations SECOND and SP (similar to Supplier- Part SP relation) are in 2 NF. 2 NF definition : A relation (relvar) is in 2 NF if and only if it is in 1 NF and every nonkey attribute is irreducibly dependent on the primary key. (assume only one candidate key which is a the primary key). Relations SECOND with primary key {S#} and SP with primary key {S#, P#} are in 2 NF and join of relations SECOND and SP will give back relation FIRST. Important First step towards Normalization is to eliminate non-irreducible functional dependencies. Relation SP is satisfactory (is already in 3NF). But relation SECOND still suffers from a lack of mutual independence among non key attributes i.e. attribute Status transitively dependent on S# via City. Problems with Relation SECOND Relation SECOND still suffers from update anomalies. Insertion : It is not possible to insert that a City has a particular Status e.g. Status of Rome is 50, until some supplier resides their. Deletion : A sole tuple for a particular city cannot be deleted (S5), because information about a particular supplier in that city and status of a city is also lost. Updation: the Status value for a particular City is repeated many times. It is difficult to change status of a particular City (London, 20 to 30), because it will cause inconsistency. Solution Again, solution to the problem is to decompose relation SECOND into relations SC {S#, City} and CS {City, Status} FD diagram and contents are shown below. Now, problem faced with Second will not arise it is possible to perform insertion (Status of Rome is 50), deletion (S5 in S) and updation (changing STATUS value for CITY) operations in the database. New structure does not have problems faced in relation SECOND. 3 NF definition : A relvar (relation) is in 3NF if and only if it is in 2NF and every non-key attribute is non transitively dependent on the primary key. Dependency Preservation Concept of independent projections (Good and Bad decomposition). Relation second can be nonloss decomposed in two different combinations : SECOND{S#, Status, City} (a) SC{S#, City} and CS{City, Status} (b) SC{S#, City} and SS{S#, Status} case (a) SC{S#, City} and CS{City, Status} projections are independent of one another, i.e. updates can be made easily provided Foreign key constraints are not violated and FD City → Status is not violated and FD S# → Status will be enforced automatically if remaining constraints are followed. case (b) SC{S#, City} and SS{S#, Status} less satisfactory because FD City → Status become a database constraint and spans over two relations, in other words whenever two tuples have same City, they must have same Status, i.e. whenever City is London, Status should be 20. In decomposition (a), by contrast, it is the transitive FD S# → STATUS that has become the database constraint, and that constraint will be enforced automatically if the two relvar constraints S# → CITY and CITY → STATUS are enforced. Independent Projections The concept of independent projections thus provides a guideline for choosing a particular decomposition when there is more than one possibility. Rissanen’s Theorem for independent projections Projections R1 and R2 of a relvar R are independent if and only if: Every FD in R is a logical consequence of those in R1 and R2, and The common attributes of R1 and R2 form a candidate key for at least one of the pair. In option (a) the two projections are independent, because their common attribute City constitutes the primary key for CS, and every FD in SECOND either appears in one of the two projections or is a logical consequence of those that do. In option (b), the two projections are not independent, because the FD City → Status cannot be deduced from the FDs for those projections although their common attribute, S#, does constitute a candidate key for both. Atomic Relation (relvar) A relvar that cannot be decomposed into independent projections is referred to as atomic. Important The concept of independent projections, according to Rissanen’s theorem is known as dependency preservation. Decomposition (a) satisfies this condition, so is independent. Boyce/Codd Normal Form (BCNF) For considering BCNF, previous condition that a relation has only one candidate key and that is the primary key is not applicable. Problems with 3NF Definition Codd’s 3NF definition did not consider a relation (relvar) with Two or more candidate keys such as a) Composite Candidate keys b) Overlapped Candidate keys Boyce and Codd proposed stronger definition called BCNF BCNF Definition 1: A relvar (realtion) is in BCNF if and only if every nontrivial, left-irreducible FD has a candidate key as its determinant BCNF Definition 2: A relvar is in BCNF if and only if the only determinants are candidate keys. Trivial FD -- LHS is a superset of RHS. Determinant -- LHS part of the FD. Observations : a) Relation FIRST is not in BCNF because determinants S# and City is not a candidate key. FIRST(S#, Status, City, P#,Qty) primary key {S#,P#} and additional constraint (Determinant) is City → Status b) Similarly, relation SECOND is also not in BCNF because determinant City is not a candidate key. SECOND{S#, Status, City}, primary key {S#} additional constraint (Determinant) is City → Status c) But relations SP,SC and CS are in BCNF SP{S#, P#, QTY}, SC {S#, City} and CS {City, Status} Here every determinant is the candidate key Case 1 : Relations with non-overlapping (disjoint) Candidate keys Consider Supplier relation (relvar) S{S#, Sname, Status, City}, with S# and Sname as candidate keys and attributes City and Status are mutually independent FD City → Status is not applicable as considered for relation FIRST. FD diagram for relation S is shown below. Relation S is in BCNF because all the determinants are candidate keys. Case 2 : Relations with overlapping candidate keys Example 1: Consider relation SSP{S#, Sname, P#,QTY}, candidate keys are {S#,P#} and {Sname, P#} assume that S# and Sname are unique but not candidate keys. Observations: The relation SSP is not in BCNF because S# and Sname are determinants but not candidate keys. Relation SSP is in 3 NF, because 3 NF definition does not require an attribute to irreducibly dependent on each candidate key if it was itself a component of some candidate key of the relation. Problem The relation SSP also has redundancies and suffers from similar type of update anomalies as with relations FIRST and SECOND. Solution to the above problem is to decompose Relation SSP into projections SS{S#, Sname} and SP{S#, P#, QTY} or SS{S#, Sname} and SP{Sname, P#, Qty} Example 2 : Consider relvar (relation) SJT with attributes S for student, J for subject and T for teacher. Primary key is {S, J}. A SJT tuple {S:s, J:j, T:t} means that, a student s is taught subject j by teacher t. Constraints on SJT are as follows : 1. For each subject, each student of that subjects taught by only one teacher. 2. Each teacher teaches only one subject 3. but each subject is taught by several teacher FDs in SJT are shown below {S,J} → T and T → J Relational value of relvar SJT and FDs diagram Overlapping candidate keys are {S, T} and {S, J} Problem The relation SJT is in 3NF but not in BCNF because of FD T→J Update Anomalies Relation SJT also suffers from update anomalies 1. Insertion Partial information for particular teacher and subject can not be inserted as {S,J} is primary key and 2. Deletion Sole tuple for a teacher Jones is studying Physics can not be deleted without loosing info about Prof. Brown. Solution Again the solution to the above problem is to decompose relation SJT into two projections ST {S, T} and TJ {T, J} Further Problem these projections are not independent because FD {S,J} → T can not be observed from two projections. Care has to be taken that a student should not be taught by a teacher of same subject again in relation ST. Example : Insertion of a tuple Smith taught by Prof. Brown, is wrong because Physics was taught by Prof. Green to Smith FOURTH NORMAL FORM Consider a relvar HCTX (H for “hierarchic”) containing information about courses, teachers, and texts, in which the attributes corresponding to teachers and texts are relation-valued. Each HCTX tuple consists of a course name, plus a relation containing teacher names, plus a relation containing text names. Meaning : The specified course can be taught by any of the specified teachers and uses all of the specified texts as references. Assumptions 1. for a given course, there can exist any number of corresponding teachers and any number of corresponding texts 2. teachers and texts at quite independent of one another; that is, no matter who actually teaches any particular offering of a given course, the same texts are used. 3. a given teacher or a given text can be associated with any number of courses. Solution Replace relvar HCTX by a relvar CTX with three scalar attributes COURSE, TEACHER, and TEXT as indicated in Fig. below. Note the resulting relvar CTX is “all key” the sole candidate key for HCTX, by contrast, was just COURSE. The meaning of relvar CTX A tuple {COURSE:c,TEACHER: t, TEXT:x} appears in CTX if and only if course c can be taught by teacher t and uses text x as a reference. Important Observe that, for a given course, all possible combinations of teacher and text appear, Constraint CTX satisfies the (relvar) constraint If tuples (c,t1,x1), (c,t2,x2) both appear then tuples (c,t1,x2), (c,t2,x1) both appear also Important It is important to note that relvar CTX involves a good deal of redundancy, leading as usual to certain update anomalies. Update Problem 1. Addition Add the information that the physics course can be taught by a new teacher, it is necessary to insert two new tuples, one for each of the two texts. Observation 1: The problems in CTX are because teachers and texts are completely independent of one another Solution Decompose CTX into its two projections CT {COURSE,TEACHER} and CX {COURSE,TEXT} Now, above information that the physics course can be taught by a new teacher insert a single tuple into relvar CT. Important decomposition of relvar CTX is nonloss and, can be recovered by joining CT and CX again. Problem Informally, therefore, it is obvious that the design of CTX is bad and the decomposition into CT and CX is better. The trouble is, however, these facts are not formally obvious. Observation 2: CTX satisfies no functional dependencies at all (apart from trivial ones such as COURSE → COURSE); CTX is in BCNF, since it is all key projections CT and CX are also all key and hence in BCNF Important The knowledge of the previous work (FDs) are therefore of no help. The existence of “problem” BCNF relvars like CTX was recognized very early on, and the way to deal with them was also soon understood. Multi-Valued Dependencies In 1977 Fagin introduces the notion of Multi- Valued Dependencies (MVDs). Multi-valued dependencies are a generalization of functional dependencies, in the sense that every FD is an MVD, but the converse is not true. i.e., there exist MVDs that are not FDs In the case of relvar CTX there are two MVDs that hold: COURSE →→ TEACHER COURSE →→ TEXT the MVD : A →→ B read as “B is multi-dependent on A” or, equivalently, “A multi-determines B.” Explanation COURSE →→ TEACHER Intuitively, this MVD means is that, although a course does not have a single corresponding teacher—i.e., the functional dependence COURSE → TEACHER does not hold nevertheless, each course does have a well- defined set of corresponding teachers i.e. for a given course c and given text x, the set of teachers t matching the pair (c,x) in CTX depends on the value c alone— it makes no difference which particular value of text x is chosen. The second MVD COURSE →→ TEXT is interpreted analogously. Definition: Multi-valued dependence Let R be a relvar, and let A, B, and C be subsets of the attributes of R. Then we say that B is multi-dependent on A, A →→ B (read “A multidetermines B,” or simply “A double arrow B”) if and only if, in every possible legal value of R, the set of B values matching a given (A value, C value) pair depends only on the A value and is independent of the C value. Important It can be shown that, given the relvar R{A,B,C}, the MVD A →→ B holds if and only if the MVD A →→ C also holds. MVDs always go together in pairs. For this reason it is common to represent them both in one statement, thus: A →→ B │ C For example COURSE →→ TEACHER │TEXT Important Now, we stated above that multi-valued dependencies are a generalization of functional dependencies, in the sense that every FD is an MVD. More precisely, an FD is an MVD in which the set of dependent (right-hand side) values matching a given determinant (left-hand side) value is always a singleton set. Thus, if A→B, then certainly A →→ B. CTX problem Problems with, like CTX relvars is that 1.they involve MVDs that are not also FDs, because CTX is all key 2.The existence of those MVDs, leads to the necessity of inserting two tuples to add another physics teacher. Those two tuples are needed in order to maintain the integrity constraint that is represented by the MVD.) CTX satisfies the (relvar) constraint If tuples (c,t1,x1), (c,t2,x2) both appear then tuples (c,t1,x2), (c,t2,x1) both appear also Note: The two projections CT and CX do not involve any such MVDs, which is why they represent an improvement over the original design. An important theorem proved by Fagin allows us to make exactly that replacement: Fagin’s Theorem for decomposition Let R{A,B,C) be a relvar, where A, B, and C are sets of attributes. Then R is equal to the join of its projections on {A,B} and {A,C} if and only if R satisfies the MVDs A →→ B │C. This is a stronger version of Heath’s theorem. Fourth Normal Form Following Fagin’s theorem, we can now define fourth normal form : Definition 1 : Relvar R is in 4NF if and only if, whenever i) there exist subsets A and B of the attributes of R such that the nontrivial* MVD A →→ B is satisfied, ii) then all attributes of R are also functionally dependent on A. In other words, the only nontrivial dependencies (FDs or MVDs) in R, are of the form K→X i.e., a functional dependency from a superkey K to some other attribute X. Definition 2 : Equivalently R is in 4NF a) if it is in BCNF and b) all MVDs in R are in fact “FDs out of keys.” Note 4NF implies BCNF Observation for CTX Relvar CTX is not in 4NF, since it involves an MVD that is not an FD at all, let alone an FD “out of a key.” The two projections CT and CX are both in 4NF, however. Thus 4NF is an improvement over BCNF, in that it eliminates another form of undesirable dependency. Fagin shows that 4NF is always achievable; that is, any relvar can be nonloss-decomposed into an equivalent collection of 4NF relvars Important in some cases (e.g. SJT{Student,Subject,Teacher}), it might not be desirable to carry the decomposition that far (or even as far as BCNF) Rissanen’s work : Independent Projections Rissanen’ s work on independent projections in terms of FDs, is applicable to MVDs also. FDs Recall that a relvar R{A,B,C) satisfying the FDs A → B and B → C is better decomposed into its projections on {A,B} and {B,C} rather than into those on {A,B} and {A,C}. MVDs The same holds true if we replace the FDs by the MVDs A →→ B and B →→ C JOIN DEPENDENCIES AND FIFTH NORMAL FORM Upto the 4 NF it was assumed that the solution to further normalisation is to nonloss decompose a relation exactly into two projections. There exist relvars that cannot be nonloss- decomposed into two projections but can be nonloss-decomposed into three (or more) i.e. such a relvar are “n-decomposable” (for some n > 2). The phenomenon of n-decomposability for n > 2 was first noted by Aho, Beeri, and Ullman. The particular case n=3 was also studied by Nicolas. Example Consider relvar SPJ from the suppliers-parts- projects database as shown below. Observation Relvar SPJ is all key and involves no nontrivial FDs or MVDs at all, and therefore in 4NF Description : Join Operation i. The three binary projections SP, PJ, and JS are corresponding to the SPJ relation value. ii. The effect of joining the SP and PJ projections (over P#). The result of the first join produces A copy of the original SPJ relation plus one additional (spurious) tuple, and iii. The effect of joining that result and the JS projection (over J# and S#). The effect of the second join brings back to the original SPJ relation. Which shows that the original SPJ relation is 3- decomposable. Note Now, the example is expressed in terms of relations, not relvars. Time-Independent Integrity Constraint The 3-decomposability of SPJ could be a more fundamental, time-independent property, if the relvar satisfies a certain time- independent integrity constraint. Constraint 1 To understand what that constraint must be, observe first that the statement “SPJ is equal to the join of its three projections SP, PJ, and JS” is precisely equivalent to the following statement: because the triple (s1,p1,j1) appears in the join of SP, PJ and JS. The converse of the above statement, that if (s1, p1, j1) appears in SPJ then (s1,p1) appears in projection SP and (p1,j1) appears in projection PJ and (j1,s1) appears in projection JS is also true for any degree-3 relation. Constraints 2 Since (s1,p1) appears in SP if and only if (s1,p1,j2) appears in SPJ for some j2. and Similarly, for (p1,j1) and (j1,s1). we can rewrite the statement above as a constraint on SPJ And if this statement is true for all time i.e.. for all possible legal values of relvar SPJ— then we do have a time-independent constraint on the relvar. The cyclic nature of the constraint If s1 is linked to p1 and p1 is linked to j1 and j1 is linked back to s1 again, Then s1 and p1 and j1 must all coexist in the same tuple Important A relvar will be n-decomposable for some n > 2 if and only if it satisfies some such (n-way) cyclic constraint. Suppose then that relvar SPJ does in fact satisfy that time- independent constraint. For brevity, let us agree to refer to that constraint as Constraint 3D (3D for 3-decomposable). Meaning of 3D in real-world terms The constraint says that in the portion of the real world that relvar SPJ is supposed to represent. Example If a.Smith supplies(S1) monkey wrenches(P1), and b.Monkey wrenches (P1) are used in the Manhattan project (J1) and c.Smith supplies(S1) the Manhattan project (J1) Then d.Smith supplies monkey wrenches to the Manhattan project. Important It is important to note that statements a, b and c together normally do not imply d. In the case at hand, however, we are saying there is no trap because, there is an additional real-world constraint in effect, namely Constraint 3D that makes the inference of d, from a, b, and c. Important Because constraint 3D is satisfied if and only if, the relvar concerned is equal to the join of certain of its projections This constraint is referred to as a join dependency (JD). A JD is a constraint on the relvar concerned, just as an MVD or an FD is a constraint on the relvar concerned. Definition Join Dependency Let R be a relvar and let A, B,..., Z be subsets of the attributes of R. Then we say that R satisfies the JD * { A, B,...,Z } (read “star A, B,..., Z”) if and only if every possible legal value of R is equal to the join of its projections on A, B,..., Z. Example If we agree to use SP to mean the subset { S#,P#} of the set of attributes of SPJ and similarly for PJ and JS, then relvar SPJ satisfies the JD *(SP,PJ,JS) Update Anomalies with SPJ 1. Problem to Insert a Tuple Because of following constraint mentioned earlier 2. Problem to Delete a Tuple Reconsidering constraint discussed earlier, we can solve above problem easily. Fagin’s theorem of Join Dependency Fagin’s theorem for MVD R{A, B, C} can be nonloss—decomposed into its projections on {A,B} and {A,C}, if and only if the MVDs A →→ B and A →→ C hold in R. can now he restated for JD as follows: R{A, B, C} satisfies the JD *{AB, AC} if and only if it satisfies the MVDs A →→ B │C Important It follows that an MVD is just a special case of a JD or (equivalently) that JDs are a generalisation of MVDs. It is clear from the definition that join dependencies are the most general form of dependency possible, that is, there does not exist a still higher form of dependency such that JDs are merely a special case of that higher form so long as we restrict our attention to dependencies that deal with a relvar being decomposed via projection and recomposed via join. (However, if we permit other decomposition and recomposition operators, then other types of dependencies might come into play. We have seen that Problem with relvar SPJ the problem with relvar SPJ is that it involves a JD that is not an MVD, and hence not an FD either. it is possible, and probably desirable, to decompose such a relvar into smaller components—namely, into the projections specified by the join dependency. Important The decomposition process can be repeated until all resulting relvars are in fifth normal form, Fifth normal form A relvar R is in 5NF - also called Projection-Join Normal Form (PJ/NF) — if and only if every nontrivial join dependency that holds for R is implied by the candidate keys of R. Relvar SPJ is not in 5NF as it satisfies a certain join dependency, namely Constraint 3D, that is certainly not implied by its sole candidate key (that key being the combination of all of its attributes). To state this differently, relvar SPJ is not in 5NF, because (a) it can be 3-decomposed and (b) that 3-decomposability is not implied by the fact that the combination {S#,P#,J#} is a candidate key. Important By contrast after 3-decomposition the three projections SP, PJ, and JS are each in 5NF, since they do not involve any (nontrivial) JDs at all. It is a fact that any relvar in 5NF is automatically in 4NF also, because (as we have seen) an MVD is a special case of a JD. Fagin also shows that any given relvar can be nonloss- decomposed into an equivalent collection of 5NF relvars; that is, 5NF is always achievable. Fifth normal form : Def 2 A relvar R is in 5NF – also called projection-join normal form (PJ/NF) - if and only if every nontrivial join dependency that holds for R is implied by the candidate keys of R. Explanation Suppliers relvar S has two candidate keys, {S#} and {SNAME}. Then that relvar satisfies several join dependencies. Example 1 It satisfies the JD * { { S#, SNANE, STATUS }, { S#, CITY} } That is, relvar S is equal to the join of its projections on {S#,SNAME,STATUS) and {S#,CITY } and hence can be nonloss-decomposed into those projections This JD is implied by the fact that {S#} is a candidate key. Example 2 Likewise, relvar S also satisfies the JD *{{ S#, SNAME }, { S#, STATUS }, { SNANE, CITY} } This JD is implied by the fact that {S#} and {SNAME} are both candidate keys. As the foregoing examples suggest, a given JD * {A,B,..., Z} is implied by candidate keys if and only if each of A, B,..., Z is in fact a superkey for the relvar in question. Thus, given a relvar R, we can tell if R is in 5NF so long as we know all candidate keys and all JDs in R. In conclusion, we note that it follows from the definition that 5NF is the ultimate normal form with respect to decomposition and composition of projections. That is a reIvar in 5NF is guaranteed to be free of anomalies that can be eliminated by taking projections.