Normilization.pdf
Document Details
Uploaded by IngeniousKeytar
Department of Computer Science
Full Transcript
CSC 2702 Basics of Functional Dependencies and Normalization Department of Computer Science School of Natural Sciences University of Zambia Overview Relational Database Design Focuses on evaluating and improving relational schemas for better design quality The design process is guided...
CSC 2702 Basics of Functional Dependencies and Normalization Department of Computer Science School of Natural Sciences University of Zambia Overview Relational Database Design Focuses on evaluating and improving relational schemas for better design quality The design process is guided by the goals of information preservation and minimizing redundancy Approaches to Database Design Bottom-up (Design by Synthesis) Starts with basic relationships among individual attributes Not commonly used in practice due to complexity Top-down (Design by Analysis) Begins with natural groupings of attributes and refines them More practical and widely applicable, especially for decomposing real-world forms and reports Overview Design Criteria Good relational schemas ensure logical grouping of attributes, clear user interpretation, and efficient physical storage Poor design may lead to redundancy and data anomalies Functional Dependencies A formal constraint among attributes used to assess the appropriateness of attribute groupings Normalization Process Successive normal forms (e.g., 1NF, 2NF, 3NF) are defined to meet desirable constraints based on primary keys and functional dependencies Relations are decomposed as needed to meet these normal forms Informal Design Guidelines for Relation Schemas Four informal guidelines that may be used as measures to determine the quality of relation schema design: 1. Clear Semantics of Attributes: Ensure that the meaning of each attribute is well- defined and understood within the schema 2. Reducing Redundant Information: Minimize the repetition of data across tuples to avoid unnecessary duplication 3. Reducing NULL Values: Design schemas in a way that minimizes the presence of NULL values, as they can complicate data handling and interpretation 4. Avoiding Spurious Tuples: Ensure that the schema design prevents the generation of meaningless or incorrect tuples when performing join operations These guidelines help assess and improve the quality of a relational schema, although they are interconnected and sometimes overlap in their application. Informal Design Guidelines for Relation Schemas - Imparting Clear Semantics to Attributes in Relations The meaning of a relation schema is derived from the interpretation of the attribute values in its tuples A well-designed schema should have a clear, unambiguous meaning Following a systematic conceptual design process ensures that the relational schema has clear semantics Each relation should be easy to explain. For example: EMPLOYEE: Represents an employee, with attributes like name, Social Security number, birth date, address, and department number DEPARTMENT: Represents a department, with attributes like department name, number, and manager's Social Security number PROJECT: Represents a project, with attributes like project name, number, location, and controlling department number Informal Design Guidelines for Relation Schemas - Imparting Clear Semantics to Attributes in Relations Guideline 1: Design a relation schema to be easily explainable Avoid combining attributes from multiple entity types and relationship types into a single relation, as this can create semantic ambiguities Violations of Guideline 1 EMP_DEPT: Combines employee information with department information, leading to mixed semantics EMP_PROJ: Combines employee information with project information and the WORKS_ON relationship, creating semantic confusion These mixed schemas may be logical but are problematic as base relations due to their complex, ambiguous semantics They might be more appropriate as views rather than base relations. Informal Design Guidelines for Relation Schemas - Imparting Clear Semantics to Attributes in Relations Violations of Guideline 1 Informal Design Guidelines for Relation Schemas - Redundant Information in Tuples & Update Anomalies Redundant Information Effective schema design minimizes storage space by reducing redundancy in the database Example of Redundancy: The EMP_DEPT relation (a NATURAL JOIN of EMPLOYEE and DEPARTMENT) contains repeated information, such as department attributes, for every employee in that department, leading to increased storage space compared to the original EMPLOYEE and DEPARTMENT tables Informal Design Guidelines for Relation Schemas - Redundant Information in Tuples & Update Anomalies Update Anomalies Insertion Anomalies Inserting a new employee requires including consistent department information, risking inconsistencies Inserting a department with no employees is difficult without violating entity integrity due to NULLs Deletion Anomalies Deleting the last employee in a department can result in the inadvertent loss of department information Modification Anomalies Changing department information (e.g., manager) requires updating all related employee tuples, risking inconsistencies if not all tuples are updated Informal Design Guidelines for Relation Schemas - Redundant Information in Tuples & Update Anomalies Guideline 2: Design base relations to avoid insertion, deletion, and modification anomalies If anomalies are present, document them and ensure update operations maintain consistency Trade-offs Sometimes, anomalies may be tolerated for performance reasons, such as in a materialized view like EMP_DEPT In such cases, triggers or stored procedures should be used to maintain consistency when the base relations are updated Recommendation: Use anomaly-free base relations and create views to combine frequently accessed attributes Informal Design Guidelines for Relation Schemas - Redundant Information in Tuples & Update Anomalies Informal Design Guidelines for Relation Schemas - Redundant Information in Tuples & Update Anomalies Informal Design Guidelines for Relation Schemas - NULL Values in Tuples Grouping many attributes into a single relation can lead to numerous NULL values if many attributes do not apply to all tuples Storage and Logical Problems Wasted Space: Storing many NULLs wastes space Logical Issues: NULLs can complicate understanding the meaning of attributes and specifying JOIN operations Aggregate Operations and NULLs NULLs can cause unpredictable results in operations like COUNT or SUM Comparisons involving NULLs in SELECT and JOIN operations may also yield unexpected outcomes Informal Design Guidelines for Relation Schemas - NULL Values in Tuples Multiple Interpretations of NULLs Attribute does not apply (e.g., Visa_status for citizens attending a local university) Attribute value is unknown (e.g., Date_of_birth for an employee) Value is known but absent or not recorded yet (e.g., Home_Phone_Number not available) Guideline 3: Avoid placing attributes in a base relation if their values may frequently be NULL If NULLs are unavoidable, ensure they are used only in exceptional cases and do not affect the majority of tuples Practical Example: If only a small percentage (e.g., 15%) of employees have individual offices, do not include Office_number in the EMPLOYEE relation Instead, create a separate relation like EMP_OFFICES(Essn, Office_number) for those employees This approach optimizes space usage and reduces complications related to NULLs Informal Design Guidelines for Relation Schemas - Generation of Spurious Tuples Example of Decomposition: The relations EMP_LOCS and EMP_PROJ1 (Figure 14.5(a)) are derived from EMP_PROJ (Figure 14.3(b)) EMP_LOCS: Indicates that an employee (Ename) works on at least one project at a given location (Plocation) EMP_PROJ1: Describes an employee (Ssn) working a certain number of hours (Hours) on a specific project (Pname, Pnumber, Plocation) Issue with Decomposition When decomposing EMP_PROJ into EMP_LOCS and EMP_PROJ1, the original information in EMP_PROJ cannot be fully recovered by simply performing a NATURAL JOIN on EMP_PROJ1 and EMP_LOCS The NATURAL JOIN produces additional tuples that were not in the original EMP_PROJ, known as spurious tuples Informal Design Guidelines for Relation Schemas - Generation of Spurious Tuples Informal Design Guidelines for Relation Schemas - Generation of Spurious Tuples Informal Design Guidelines for Relation Schemas - Generation of Spurious Tuples Example In Figure 14.6, the NATURAL JOIN on EMP_PROJ1 and EMP_LOCS for an employee with Ssn = "123456789" generates spurious tuples (marked with asterisks) that do not reflect the correct original data from EMP_PROJ Problem Cause: The issue arises because Plocation, the attribute used to relate EMP_LOCS and EMP_PROJ1, is neither a primary key nor a foreign key in either relation Guideline 4: Design relations so that they can be joined with equality conditions on appropriately related attributes (i.e., primary key, foreign key pairs) to ensure no spurious tuples are generated Avoid relations that contain matching attributes that are not primary key/foreign key pairs, as joining on such attributes can lead to spurious tuples Informal Design Guidelines for Relation Schemas - Generation of Spurious Tuples Informal Design Guidelines for Relation Schemas - Summary and Discussion of Design Guidelines Problematic Relation Schemas Anomalies: These can cause redundant work during insertion and modification, and may result in accidental loss of data during deletion NULL Values: Excessive use of NULLs leads to wasted storage space and complicates selection, aggregation, and join operations Spurious Data: Joins on base relations with improperly matched attributes (not forming valid foreign key-primary key relationships) can generate invalid or spurious data Formal Concepts and Theory Functional Dependency: A critical tool for analyzing the appropriateness of relation schemas Normal Forms: The three normal forms and Boyce-Codd Normal Form (BCNF) are established standards for ensuring quality in relational design Decomposition Strategy: Badly designed relations should be decomposed appropriately to achieve higher normal forms, ensuring better design quality Functional Dependencies Functional dependency is a formal tool used to analyze relational schemas It helps detect and describe issues like redundancy, anomalies, and spurious data The concept of functional dependency is fundamental in relational schema design theory, and it plays a crucial role in determining the quality of database designs Functional dependencies are used to define normal forms, which are standards for evaluating the structure and quality of relation schemas Functional dependency allows us to detect design issues and provides a precise method for improving database structures Functional Dependencies - Definition of Functional Dependency Functional Dependency (FD) Denoted as X → Y between two sets of attributes X and Y (subsets of relation schema R) X → Y means that for any two tuples t1 and t2 in a relation state r(R), if t1[X] = t2[X], then t1[Y] = t2[Y] X (left-hand side) determines Y (right-hand side) in a relation schema. If two tuples share the same value for X, they must also share the same value for Y Interpretation If X is a candidate key for a relation R, then X → R, meaning that X uniquely identifies the entire relation The FD X → Y doesn't imply that Y → X Functional Dependencies - Definition of Functional Dependency Example: In a relation EMP_PROJ: Ssn → Ename: Social Security number uniquely determines the employee name Pnumber → {Pname, Plocation}: Project number uniquely determines the project name and location {Ssn, Pnumber} → Hours: A combination of employee SSN and project number uniquely determines the hours worked Characteristics of FD FD is a property of the relation schema: It is derived from the semantics (meaning) of the attributes, not just the data Legal Relation States: Extensions of a relation r(R) that satisfy FD constraints are called legal relation states Functional Dependencies - Definition of Functional Dependency Functional Dependencies - Definition of Functional Dependency Inference of FD Functional dependencies cannot be directly inferred from relation data but are defined based on the semantic understanding of the attributes A single counterexample can disprove an FD - for instance, if two tuples have the same value for X but different values for Y, then X → Y does not hold FD in Practice When analyzing relation schemas, designers define FDs based on attribute relationships and meaning This helps maintain consistency, eliminate redundancy, and reduce anomalies in database design In essence, FD is a formal way to describe how attributes are related within a database, ensuring the correctness and integrity of relational schemas It serves as the foundation for higher-level database design concepts, such as normal forms Normal Forms Based on Primary Keys Functional Dependencies and Normalization Functional dependencies are used to develop a formal methodology for testing and improving relation schemas Each relation has a set of functional dependencies and a designated primary key This information, combined with normal form conditions, drives the normalization process for schema design Two Main Approaches to Relational Design 1. Conceptual Schema Design: Use models like ER to create a schema and map it into relational schemas 2. Design Based on Existing Knowledge: Derive schemas from existing systems, such as files, forms, or reports Normal Forms Based on Primary Keys Normalization Process After creating initial relations, they are evaluated for goodness using normalization theory Relations are decomposed further as needed to achieve higher normal forms Focus on First Three Normal Forms (1NF, 2NF, 3NF) First Normal Form (1NF): Ensures all values are atomic and each relation has a primary key Second Normal Form (2NF): Eliminates partial dependencies (non-key attributes must depend on the entire primary key) Third Normal Form (3NF): Removes transitive dependencies (non-key attributes must depend only on the primary key) The goal is to achieve higher normal forms to reduce redundancy, avoid anomalies, and ensure the integrity of data Normal Forms Based on Primary Keys - Normalization of Relations Normalization Process Normalization evaluates a relation schema against normal form criteria and decomposes relations if necessary It is a top-down, relational design process by analysis, aiming to improve schema quality Key Normal Forms 1. First, Second, and Third Normal Forms (1NF, 2NF, 3NF): Initially proposed by Codd 2. Boyce-Codd Normal Form (BCNF): A stronger version of 3NF by Boyce and Codd 3. Fourth (4NF) and Fifth Normal Form (5NF): Based on multivalued and join dependencies Normal Forms Based on Primary Keys - Normalization of Relations Goal of normalization is to minimizing redundancy and to reduce insertion, deletion, and update anomalies Procedure The process involves analyzing relation schemas using functional dependencies (FDs) and primary keys Relations that don't meet normal form tests are decomposed into smaller, more efficient schemas Normalization Provides A formal method for analyzing schemas based on keys and FDs Tests for normal forms that help normalize databases to the desired level Normal Forms Based on Primary Keys - Normalization of Relations Definition of Normal Form: A relation's normal form is the highest normal form condition it meets, indicating its degree of normalization Properties to Achieve Nonadditive (Lossless) Join Property: Ensures no spurious tuples are generated after decomposition Dependency Preservation: Ensures all functional dependencies are preserved in the decomposed relations Important Considerations The nonadditive join property is essential and must always be achieved The dependency preservation property is desirable but can sometimes be sacrificed Normal Forms Based on Primary Keys - Practical Use of Normal Forms In practice, database designs are often acquired from legacy systems or existing files The goal is to ensure designs are of high quality and sustainable over time by applying normal form tests Normalization is used to improve existing designs and ensure desirable properties like minimizing redundancy and anomalies Focus on 3NF, BCNF, or 4NF Although higher normal forms like 4NF and 5NF exist, their practical utility is limited The constraints for higher normal forms are rare and difficult for designers to identify or understand Most industrial practices normalize up to 3NF, BCNF, or at most 4NF Normal Forms Based on Primary Keys - Practical Use of Normal Forms Performance Considerations Designers may choose not to fully normalize to the highest normal form for performance reasons Leaving a relation in a lower normal form, like 2NF, may improve performance but introduces the risk of anomalies Denormalization Definition: Storing the join of relations in higher normal forms as a single base relation in a lower normal form Denormalization can be used to improve performance but comes with the drawback of dealing with potential data anomalies Normal Forms Based on Primary Keys - Definitions of Keys and Attributes Participating in Keys Superkey: A set of attributes in a relation schema R = {A1, A2, … , An} such that no two tuples in any legal state of R have the same values for the attributes in the set Example: In Figure 14.1, {Ssn} is a superkey for EMPLOYEE Key: A minimal superkey If any attribute is removed from a key, it will no longer be a superkey Example: {Ssn} is a key for EMPLOYEE, while sets like {Ssn, Ename} are superkeys but not keys (since they are not minimal) Normal Forms Based on Primary Keys - Definitions of Keys and Attributes Participating in Keys Normal Forms Based on Primary Keys - Definitions of Keys and Attributes Participating in Keys Candidate Key: If there are multiple keys, each is a candidate key. One of these is chosen as the primary key Example: In Figure 14.1, {Ssn} is the primary key for EMPLOYEE Prime Attribute: An attribute that is part of a candidate key Example: Ssn in EMPLOYEE and Pnumber in WORKS_ON are prime attributes Nonprime Attribute: An attribute that is not part of any candidate key Normal Forms Based on Primary Keys - Definitions of Keys and Attributes Participating in Keys Normal Forms (1NF, 2NF, 3NF) These normal forms were introduced by Codd to improve the design of relation schemas by resolving problems related to functional dependencies The process typically follows this order 1. First Normal Form (1NF) 2. Second Normal Form (2NF) 3. Third Normal Form (3NF) 3NF resolves the functional dependency issues while also ensuring that the schema satisfies both 1NF and 2NF Normal Forms Based on Primary Keys - First Normal Form Ensures that attribute domains contain only atomic (simple, indivisible) values Each attribute value in a tuple must be a single value from its domain Disallows multivalued, composite attributes or combinations of both within a single tuple No relations within relations or nested structures are allowed in 1NF Normal Forms Based on Primary Keys - First Normal Form Example - DEPARTMENT Relation If a department has multiple locations (attribute Dlocations), it violates 1NF as Dlocations is not atomic There are three ways to handle this violation 1. Decomposition: Create a new relation for Dlocations with the department's primary key (Dnumber) - this is the preferred method 2. Expand the key: Create a new tuple for each location, but this leads to redundancy 3. Multiple attributes: Use separate attributes (e.g., Dlocation1, Dlocation2), which introduces NULL values and complexity Normal Forms Based on Primary Keys - First Normal Form Normal Forms Based on Primary Keys - First Normal Form Dealing with Multivalued Attributes If a relation contains a nested attribute (e.g., an employee's projects and hours), normalize it by separating the nested relation into a new table Example: EMP_PROJ(Ssn, Ename, {PROJS(Pnumber, Hours)}) becomes two tables: EMP_PROJ1 and EMP_PROJ2 Handling Multiple Multivalued Attributes Example: If a person has multiple car licenses and phone numbers, decomposing into two relations (one for car licenses and one for phone numbers) is the correct approach to avoid redundancy Complex Objects Modern databases store objects like images, documents, or videos as atomic values (BLOBs or CLOBs), maintaining 1NF Normal Forms Based on Primary Keys - First Normal Form Normal Forms Based on Primary Keys - Second Normal Form A relation schema is in 2NF if every nonprime attribute (an attribute that is not part of any candidate key) is fully functionally dependent on the primary key Full Functional Dependency A functional dependency X → Y is considered full if no attribute can be removed from X without breaking the dependency Example: In the relation {Ssn, Pnumber} → Hours, neither Ssn → Hours nor Pnumber → Hours hold, making it a full dependency Partial Dependency A functional dependency X → Y is partial if some attribute A can be removed from X and the dependency still holds Example: In the dependency {Ssn, Pnumber} → Ename, the attribute Ssn → Ename holds, making it a partial dependency Normal Forms Based on Primary Keys - Second Normal Form Test for 2NF 1. Check if any nonprime attributes are partially dependent on the primary key 2. If the primary key has only one attribute, the relation is automatically in 2NF 3. If the relation is not in 2NF, decompose it by ensuring each nonprime attribute is fully dependent on the part of the primary key it relies on Normal Forms Based on Primary Keys - Second Normal Form Example - EMP_PROJ Relation In the relation {Ssn, Pnumber} → {Ename, Pname, Plocation, Hours} Ename depends on Ssn (partial dependency) Pname and Plocation depend on Pnumber (partial dependency) Hours fully depends on {Ssn, Pnumber} (full dependency) This means the relation is in 1NF but not in 2NF due to partial dependencies of Ename, Pname, and Plocation Normal Forms Based on Primary Keys - Second Normal Form Example - EMP_PROJ Relation In the relation {Ssn, Pnumber} → {Ename, Pname, Plocation, Hours} Ename depends on Ssn (partial dependency) Pname and Plocation depend on Pnumber (partial dependency) Hours fully depends on {Ssn, Pnumber} (full dependency) This means the relation is in 1NF but not in 2NF due to partial dependencies of Ename, Pname, and Plocation Normal Forms Based on Primary Keys - Second Normal Form Decomposition to 2NF To achieve 2NF, decompose the relation into smaller relations where each nonprime attribute is fully dependent on its corresponding part of the primary key For the EMP_PROJ relation, it decomposes into three relations: 1. EP1(Ssn, Ename) 2. EP2(Pnumber, Pname, Plocation) 3. EP3(Ssn, Pnumber, Hours) This ensures each relation is free of partial dependencies and satisfies 2NF Normal Forms Based on Primary Keys - Second Normal Form Normal Forms Based on Primary Keys - Third Normal Form A relation schema is in 3NF if: 1. It satisfies 2NF 2. No nonprime attribute (an attribute that is not part of any candidate key) is transitively dependent on the primary key Transitive Dependency: A functional dependency X → Y is transitive if: There is a set of attributes Z in the relation such that: X → Z and Z → Y both hold Z is not a candidate key and not a subset of any key Example: In the EMP_DEPT relation, the dependency Ssn → Dmgr_ssn is transitive because: Ssn → Dnumber and Dnumber → Dmgr_ssn hold Dnumber is neither a key nor a subset of the key in EMP_DEPT Normal Forms Based on Primary Keys - Third Normal Form Violation of 3NF The EMP_DEPT relation in Figure 14.3(a) violates 3NF because of the transitive dependency Dmgr_ssn on Ssn through Dnumber Solution (Normalization to 3NF) To remove transitive dependencies, decompose the relation into smaller relations that do not have these dependencies Example: Decompose EMP_DEPT into: ED1(Ssn, Ename, Dnumber) ED2(Dnumber, Dname, Dmgr_ssn) This decomposition represents independent facts about employees and departments, ensuring no spurious tuples are generated when joining Normal Forms Based on Primary Keys - Third Normal Form Normal Forms Based on Primary Keys - Third Normal Form Any functional dependency where: The left-hand side is a subset of the primary key, or The left-hand side is a nonkey attribute, is problematic. 2NF and 3NF normalization address these issues by decomposing the relation into new relations Although transitive dependencies can be removed without removing partial dependencies first, 3NF was historically defined with the assumption that a relation is tested for 2NF before being tested for 3NF Normal Forms Based on Primary Keys - Third Normal Form Normal Forms Based on Primary Keys - Third Normal Form Informal Summary of Normal Forms 1. 1NF: Remove multivalued and composite attributes 2. 2NF: Remove partial dependencies (dependencies on part of a composite primary key) 3. 3NF: Remove transitive dependencies (dependencies on nonprime attributes through another nonprime attribute) General Definitions of Second and Third Normal Forms Goal: To design relation schemas free from Partial dependencies (dependencies on part of a candidate key) Transitive dependencies (dependencies through non-prime attributes) Why?: These types of dependencies lead to update anomalies (insert, delete, and update issues) Previously The definitions of 2NF and 3NF focused on the primary key only However, relations may have multiple candidate keys, and dependencies should be evaluated for all candidate keys General Definitions of Second and Third Normal Forms General Definitions Prime Attribute Any attribute that is part of any candidate key (not just the primary key) Partial/Full Dependencies A full dependency means an attribute is fully functionally dependent on the entire candidate key A partial dependency means an attribute is functionally dependent on only a part of the candidate key Transitive Dependency A dependency where one non-prime attribute depends on another non-prime attribute, which, in turn, depends on a candidate key General Definitions of Second and Third Normal Forms More General Definitions of 2NF and 3NF 2NF: A relation is in 2NF if every non-prime attribute is fully dependent on all candidate keys, not just the primary key 3NF: A relation is in 3NF if there are no transitive dependencies on any candidate key This general approach ensures a more thorough normalization process by considering all candidate keys, not just the primary key, thus further reducing the potential for anomalies General Definitions of Second and Third Normal Forms - General Definition of Second Normal Form To be added later General Definitions of Second and Third Normal Forms - General Definition of Third Normal Form To be added later General Definitions of Second and Third Normal Forms - Interpreting the General Definition of Third Normal Form To be added later Boyce-Codd Normal Form To be added later Boyce-Codd Normal Form - Decomposition of Relations not in BCNF To be added later Multivalued Dependency and Fourth Normal Form To be added later Multivalued Dependency and Fourth Normal Form - Formal Definition of Multivalued Dependency To be added later Join Dependencies and Fifth Normal Form To be added later