Database Normalization for Relational Databases PDF

12/15/2024 CHAPTER 14 Basics of Functional Dependencies and Normalization for Relational Databases 1 1 We need to create a database schema design based on the following (simplified) requirements of the COMPANY Database: The company is organized into DEPARTMENTs. Each department has a name, number and an employee who manages the department. We keep track of the start date of the department manager. A department may have several locations. Each department controls a number of PROJECTs. Each project has a unique name, unique number and is located at a single location. 2 2 1 12/15/2024 Outline 1 Informal Design Guidelines for 3 Normal Forms Based on Primary Keys Relational Databases 3.1 Normalization of Relations 1.1 Semantics of the Relation Attributes 3.2 Practical Use of Normal Forms 1.2 Redundant Information in Tuples and 3.3 Definitions of Keys and Attributes Update Anomalies Participating in Keys 1.3 Null Values in Tuples 3.4 First Normal Form 1.4 Spurious Tuples 3.5 Second Normal Form 3.6 Third Normal Form 2 Functional Dependencies (FDs) 2.1 Definition of Functional Dependency 3 3 1. Design Guidelines for Relational Databases What is a good relational database design? The grouping of attributes to form "good" relation schemas Informal Design Guidelines for Relation Schemas: Making sure that the semantics of the attributes is clear in the schema Reducing the redundant information in tuples Reducing the NULL values in tuples Disallowing the possibility of generating false tuples 4 4 2 12/15/2024 1.1 Semantics of the Relational Attributes must be clear GUIDELINE 1: Each tuple in a relation should represent one entity or relationship instance. (Applies to individual relations and their attributes). Attributes of different entities (EMPLOYEEs, DEPARTMENTs, PROJECTs) should not be mixed in the same relation Only foreign keys should be used to refer to other entities Entity and relationship attributes should be kept apart as much as possible. Bottom Line: Design a schema that can be explained easily relation by relation. The semantics of attributes should be easy to interpret. D_project_number 5 5 Figure 14.1 A simplified COMPANY relational database schema The ease with which the meaning of a relation’s attributes can be explained is an informal measure of how well the relation is designed. 6 6 3 12/15/2024 1.1 Semantics of the Relational Attributes must be clear Example of Violating Guideline 1. This schema combines the attributes of two real word entities Employee and Department which violates guideline 1. 7 7 1.2 Redundant Information in Tuples and Update Anomalies One goal of schema design is to minimize the storage space used by the base relations Consider the relation schema below: The table EMP_DEPT is the result of join of EMPLOYEE and DEPARTMENT 8 8 4 12/15/2024 1.2 Redundant Information in Tuples and Update Anomalies 9 9 1.2 Redundant Information in Tuples and Update Anomalies The table EMP_DEPT is the result of join of EMPLOYEE and DEPARTMENT. In EMP_DEPT, the attribute values (Dnumber, Dname, Dmgr_ssn) are repeated for every employee who works for that department. Information is stored redundantly Wastes storage Causes problems with update anomalies (means unexpected or unintended problem) Insertion anomalies Deletion anomalies Modification anomalies 10 10 5 12/15/2024 EXAMPLE OF AN INSERT ANOMALY Consider the relation: Insert Anomaly: To insert a new tuple for an employee who works in department number 5, we must enter all the attribute values of department 5 correctly so that they are consistent with the corresponding values for department 5 in other tuples It is difficult to insert a new department that has no employees as yet in the EMP_DEPT relation 11 11 EXAMPLE OF A DELETE ANOMALY Consider the relation: Deletion Anomalies: If we delete an employee tuple that happens to be the last employee working for a particular department, the information on that department is lost from the database. 12 12 6 12/15/2024 EXAMPLE OF A MODIFICATION ANOMALY Consider the relation: Modification Anomalies: If we change the value of one of the attributes of a particular department, e.g. the manager of department 5, we must update the tuples of all employees who work in that department; otherwise, the database will be inconsistent 13 13 Guideline for Redundant Information in Tuples and Update Anomalies GUIDELINE 2: Design a schema that does not suffer from the insertion, deletion and update anomalies. If there are any anomalies present, then note them so that applications can be made to take them into account. 14 14 7 12/15/2024 Figure 14.3 Two relation schemas suffering from update anomalies 15 15 Figure 14.4 Sample states for EMP_DEPT and EMP_PROJ Figure 14.4 Sample states for EMP_DEPT and EMP_PROJ resulting from applying NATURAL JOIN to the relations in Figure 14.2. These may be stored as base relations for performance reasons. 16 16 8 12/15/2024 1.3 Null Values in Tuples GUIDELINE 3: Relations should be designed such that their tuples will have as few NULL values as possible. This is to reduce ambiguity, query complexity, storage overheads and data integrity issues. Attributes that are NULL frequently could be placed in separate relations (with the primary key) Reasons for nulls: Attribute unapplicable or invalid Attribute value unknown (may exist) Value known to exist, but unavailable 17 17 1.4 Generation of Spurious Tuples – avoid at any cost Bad designs for a relational database may result in erroneous results for certain JOIN operations The "lossless join" property is used to guarantee meaningful results for join operations. "lossless join" property - when a relation is decomposed into two or more smaller relations, the original relation can be reconstructed by joining these smaller relations, without any loss of data or spurious/extra tuples. 18 18 9 12/15/2024 1.4 Generation of Spurious Tuples – avoid at any cost 19 19 Result of applying NATURAL JOIN to the tuples in EMP_PROJ1 and EMP_LOCS of Figure 14.5 just for employee with Ssn = “123456789”. Generated spurious tuples are marked by asterisks. 20 20 10 12/15/2024 1.4 Generation of Spurious Tuples GUIDELINE 4: The relations should be designed to satisfy the lossless join condition. No spurious tuples should be generated by doing a natural-join of any relations. 21 21 EXERCISE: Which update anomalies occur in the EMP_PROJ? Insertion anomalies: To insert a new tuple for an employee who works in, e.g. project number 5, we must enter all the attribute values of project 5 correctly so that they are consistent with the corresponding values for project 5 in other tuples in EMP_PROJ relation It is difficult to insert a new PROJECT that has no employees as yet in the EMP_PROJ relation 22 22 11 12/15/2024 EXERCISE: Which update anomalies occur in the EMP_PROJ? Deletion anomalies: If we delete an employee tuple that happens to represent the last employee working for a particular department, the information concerning that department is lost from the database. 23 23 EXERCISE: Which update anomalies occur in the EMP_PROJ? Modification Anomalies In EMP_PROJ, if we change the value of one of the attributes of a particular PROJECT we must update the tuples of all employees who work in that PROJECT ; otherwise, the database will become inconsistent 24 24 12 12/15/2024 Chapter Outline 2 Functional Dependencies (FDs) 2.1 Definition of Functional Dependency 3 Normal Forms Based on Primary Keys 3.1 Normalization of Relations 3.2 Practical Use of Normal Forms 3.3 Definitions of Keys and Attributes Participating in Keys 3.4 First Normal Form 3.5 Second Normal Form 3.6 Third Normal Form 25 25 2. Functional Dependencies Functional dependency exists when one attribute (or a group of attributes) uniquely determines another attribute. A functional dependency 𝑋→𝑌 is defined as a constraint between two sets of attributes from the database. Notation: 𝑋→𝑌 x→y indicates that knowledge of 𝑋x uniquely determines 𝑌y. This relationship means that if two tuples have the same value for 𝑋X, they must also have the same value for 𝑌Y. 26 26 13 12/15/2024 Examples of FD constraints (1) Employee SSN →→ Employee Name: Knowing an employee's SSN allows you to determine their name. SSN  ENAME Project Number →→ Project Name and Location: A project's number determines both its name and location. PNUMBER  {PNAME, PLOCATION} Composite Dependency: {EmpSSN, ProjectNum} →→ Hours Worked: The combination of an SSN and a project number determines the hours worked. {SSN, PNUMBER}  HOURS 27 27 Defining FDs from instances 1. It’s crucial to know what each data point represents and how different data points relate to one another. 2. FDs are based on the database schema, which is like a blueprint that defines how data is organized. 3. By examining the actual data, you can identify if certain data points always lead to specific other data points, suggesting a potential FD. 4. If you find any cases where the expected pattern doesn’t occur (like two same IDs with different Names), then the proposed FD is not valid. 28 28 14 12/15/2024 Figure 14.7 Ruling Out FDs Note that given the state of the TEACH relation, we can say that: 1. FD: Text → Course may exist (possible). ✓ 2. However, the FDs Teacher → Course, Teacher → Text and Couse → Text are ruled out. Х 29 29 Figure 14.8 What FDs may exist? A relation R(A, B, C, D) with its extension. Which FDs may exist in this relation? The following FDs may hold : B → C; C → B; {A, B} → C; {A, B} → D ; {C, D} → B. The following do not hold : A → B(tuples 1 and 2 violate this constraint); B → A(tuples 2 and 3 violate this constraint); D → C(tuples 3 and 4 violate it). 30 30 15 12/15/2024 Chapter Outline 2 Functional Dependencies (FDs) 2.1 Definition of Functional Dependency 3 Normal Forms Based on Primary Keys 3.1 Normalization of Relations 3.2 Practical Use of Normal Forms 3.3 Definitions of Keys and Attributes Participating in Keys 3.4 First Normal Form 3.5 Second Normal Form 3.6 Third Normal Form 31 31 3.1 Normalization of Relations (1) Normalization: The process of decomposing unsatisfactory "bad" relations by breaking up their attributes into smaller relations Normal form: Condition using keys and FDs of a relation to certify whether a relation schema is in a particular normal form A relation is in a specific normal form if it satisfies certain criteria or constraints regarding the organization of attributes and functional dependencies. Each successive normal form builds upon the previous one, imposing stricter rules. 32 32 16 12/15/2024 Normalization of Relations (2) 2NF, 3NF, BCNF based on keys and FDs of a relation schema 4NF based on keys, multi-valued dependencies : MVDs; 5NF based on keys, join dependencies : JDs Additional properties may be needed to ensure a good relational design (lossless join, dependency preservation; see Chapter 15) 33 33 3.2 Practical Use of Normal Forms Purpose of Normalization is carried out in practice so that the resulting designs are of high quality and meet the desirable properties Limitations of Normal Forms: The practical utility of these normal forms becomes questionable when the constraints on which they are based are hard to understand or to detect Extent of Normalization: The database designers need not normalize to the highest possible normal form, (usually up to 3NF and BCNF, 4NF rarely used in practice), i.e. reach a balance between data integrity and performance. Denormalization: The process of storing the join of higher normal form relations as a base relation—which is in a lower normal form 34 34 17 12/15/2024 3.3 Definitions of Keys and Attributes Participating in Keys (1) A superkey of a relation schema R = {A1, A2,...., An} is a set of attributes S subset-of R with the property that no two tuples t1 and t2 in any legal relation state r of R will have t1[S] = t2[S] A key K is a superkey with the additional property that removal of any attribute from K will cause K not to be a superkey any more. 35 35 Definitions of Keys and Attributes Participating in Keys (2) If a relation schema has more than one key, each is called a candidate key. One of the candidate keys is arbitrarily designated to be the primary key, and the others are called secondary keys. A Prime attribute must be a member of some candidate key A Nonprime attribute is not a prime attribute—that is, it is not a member of any candidate key. 36 36 18 12/15/2024 First Normal Form (1NF): Rule: Each table cell should contain a single value, and each record needs to be unique. Purpose: This eliminates duplicate records and organizes data where each column contains values of a single type. Also Disallowing : composite attributes multivalued attributes nested relations - attributes whose values for an individual tuple are non-atomic 37 37 Figure 14.9 Normalization into 1NF A relation schema that is not in 1NF. Sample state of relation DEPARTMENT 1NF version of the same relation with redundancy 38 38 19 12/15/2024 Figure 14.10 Normalizing nested relations into 1NF Figure 14.10 Normalizing nested relations into 1NF. (a) Schema of the EMP_PROJ relation with a nested relation attribute PROJS. (b) Sample extension of the EMP_PROJ relation showing nested relations within each tuple. (c) Decomposition of EMP_PROJ into relations EMP_PROJ1 and EMP_PROJ2 by propagating the primary key. 39 39 Second Normal Form (2NF): Rule: The table must be in 1NF and all non-key attributes must be fully functionally dependent on the primary key. Purpose: This reduces data redundancy and dependency by ensuring that all data attributes in a table rely only on the primary key for identification. 40 40 20 12/15/2024 3.5 Second Normal Form (1) Uses the concepts of FDs, primary key Definitions Prime attribute: An attribute that is member of the primary key K Full functional dependency: a FD Y -> Z where removal of any attribute from Y means the FD does not hold any more Examples: {SSN, PNUMBER} -> HOURS is a full FD since neither SSN -> HOURS nor PNUMBER -> HOURS hold {SSN, PNUMBER} -> ENAME is not a full FD (it is called a partial dependency ) since SSN -> ENAME also holds 41 41 Figure 14.11 Normalizing into 2NF and 3NF Normalizing EMP_PROJ into 2NF relations Normalizing EMP_DEPT into 3NF relations 42 42 21 12/15/2024 Third Normal Form (3NF): Rule: The table must be in 2NF and all its columns must not be transitively dependent on the primary key. Purpose: This further reduces redundancy by eliminating columns that do not depend directly on the primary key but rather on other non-primary key columns. 43 43 3.6 Third Normal Form (1) Definition: Transitive functional dependency: a FD X -> Z that can be derived from two FDs X -> Y and Y -> Z, i.e. an attribute that indirectly depends on the primary key (or candidate key). Examples: SSN -> DMGRSSN is a transitive FD Since SSN -> DNUMBER and DNUMBER -> DMGRSSN hold SSN -> ENAME is non-transitive Since there is no set of attributes X where SSN -> X and X -> ENAME 44 44 22 12/15/2024 Third Normal Form (2) A relation schema R is in third normal form (3NF) if it is in 2NF and no non- prime attribute A in R is transitively dependent on the primary key R can be decomposed into 3NF relations via the process of 3NF normalization NOTE: In X -> Y and Y -> Z, with X as the primary key, we consider this a problem only if Y is not a candidate key. When Y is a candidate key, there is no problem with the transitive dependency. E.g., Consider EMP (SSN, Emp#, Salary ). Here, SSN -> Emp# -> Salary and Emp# is a candidate key. 45 45 Figure 14.12 Normalization into 2NF and 3NF (a) The LOTS relation with its functional dependencies FD1 through FD4 (c) Decomposing LOTS1 into the 3NF relations LOTS1A and LOTS1B The LOTS schema violates the 2NF because Tax_rate is partially dependent on the candidate key {County_name, Lot#}, due to FD3. (d) Progressive normalization of LOTS (b) Decomposing into the 2NF relations LOTS1 and LOTS2 into a 3NF design 46 46 23 12/15/2024 Normal Forms Defined Informally 1st normal form :All attributes depend on the key 2nd normal form: All attributes depend on the whole key 3rd normal form: All attributes depend on nothing but the key 47 47 END CHAPTER 14 48 48 24 12/15/2024 49 49 25

Database Normalization for Relational Databases PDF

Document Details

Tags

Related

Summary

Full Transcript