Normalization of Database Tables
**A-1 Database Tables and Normalization**

Having good relational database software is not enough to avoid data redundancy. If database tables are treated as though they were files in a file system, the relational database management system (RDBMS) never has a chance to demonstrate its superior data-handling capabilities.

Normalization is a process for evaluating and correcting table structures to minimize data redundancies, thereby reducing the likelihood of data anomalies. Normalization works through a series of stages called normal forms. The first three stages are described as first normal form (1NF), second normal form (2NF), and third normal form (3NF). From a structural point of view, 2NF is better than 1NF, and 3NF is better than 2NF. For most purposes in business database design, 3NF is as high as you need to go in the normalization process. However, you will discover that properly designed 3NF structures also meet the requirements of fourth normal form (4NF).

Although normalization is a very important ingredient in database design, you should not assume that the highest level of normalization is always the most desirable. Generally, the higher the normal form, the more relational join operations you need to produce a specified output, and the more resources the database system requires to respond to end-user queries. A successful design must also consider end-user demand for fast performance. Therefore, you will occasionally need to *denormalize* some portions of a database design to meet performance requirements. Denormalization produces a lower normal form; that is, a 3NF table is converted to 2NF through denormalization. The price you pay for the increased performance, however, is greater data redundancy.

**Functional Dependence**

Before outlining the normalization process, it is a good idea to review the concepts of determination and functional dependence. Table 6.3 summarizes the main concepts. It is crucial to understand these concepts because they are used to derive the set of functional dependencies for a given relation. The normalization process works one relation at a time, identifying the dependencies on that relation and normalizing the relation. As you will see in the following sections, **normalization starts by identifying the dependencies of a given relation and progressively breaking up the relation (table) into a set of new relations (tables) based on the identified dependencies.**

Two types of functional dependencies are of special interest in normalization: **partial dependencies and transitive dependencies**. A **partial dependency** exists when there is a functional dependence in which the determinant is only part of the primary key (remember the assumption that there is only one candidate key). For example, if (A, B) → (C, D), B → C, and (A, B) is the primary key, then the functional dependence B → C is a partial dependency because only part of the primary key (B) is needed to determine the value of C. Partial dependencies tend to be straightforward and easy to identify. A **transitive dependency** exists when there are functional dependencies such that X → Y and Y → Z, and X is the primary key. In that case, the dependency X → Z is a transitive dependency because X determines the value of Z via Y. Unlike partial dependencies, transitive dependencies are more difficult to identify among a set of data.
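One practical aid is to test a candidate dependency empirically against sample rows. The following Python sketch is not from the text; `holds_fd` and the sample values are illustrative assumptions. Note that sample data can only *refute* a dependency, never prove it; the real dependencies must come from the business rules.

```python
# A minimal sketch (hypothetical helper and data): check whether a
# candidate functional dependency X -> Y holds in a set of sample rows.

def holds_fd(rows, lhs, rhs):
    """True if every pair of rows that agrees on the lhs attributes
    also agrees on the rhs attributes."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if key in seen and seen[key] != val:
            return False  # same determinant value, different dependents
        seen[key] = val
    return True

rows = [
    {"EMP_NUM": 103, "JOB_CLASS": "Programmer", "CHG_HOUR": 35.75},
    {"EMP_NUM": 108, "JOB_CLASS": "Programmer", "CHG_HOUR": 35.75},
    {"EMP_NUM": 101, "JOB_CLASS": "Systems Analyst", "CHG_HOUR": 96.75},
]

print(holds_fd(rows, ["JOB_CLASS"], ["CHG_HOUR"]))  # True in this sample
print(holds_fd(rows, ["CHG_HOUR"], ["EMP_NUM"]))    # False
```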
Fortunately, there is an effective way to identify transitive dependencies: they occur only when a functional dependence exists among nonprime attributes. In the previous example, the actual transitive dependency is X → Z. However, the dependency Y → Z signals that a transitive dependency exists. Hence, throughout the discussion of the normalization process, the existence of a functional dependence among nonprime attributes will be considered a sign of a transitive dependency. To address the problems related to transitive dependencies, changes to the table structure are made based on the functional dependence that signals the transitive dependency's existence. Therefore, to simplify the description of normalization, from this point forward the signaling dependency will be called the *transitive dependency*.

- **First Normal Form**

Because the relational model views data as part of a table or a collection of tables in which all key values must be identified, the data depicted in Figure 6.1 might not be stored as shown.

![](media/image2.png)

Note that Figure 6.1 contains what is known as repeating groups. A **repeating group** derives its name from the fact that a group of multiple entries of the same type can exist for any *single* key attribute occurrence. In Figure 6.1, note that each single project number (PROJ_NUM) occurrence can reference a group of related data entries. For example, the Evergreen project (PROJ_NUM = 15) shows five entries at this point, and those entries are related because they each share the PROJ_NUM = 15 characteristic. Each time a new record is entered for the Evergreen project, the number of entries in the group grows by one.

A relational table must not contain repeating groups. The existence of repeating groups provides evidence that the RPT_FORMAT table in Figure 6.1 fails to meet even the lowest normal form requirements, thus reflecting data redundancies. Normalizing the table structure will reduce the data redundancies. If repeating groups do exist, they must be eliminated by making sure that each row defines a single entity. In addition, the dependencies must be identified to diagnose the normal form. Identification of the normal form lets you know where you are in the normalization process. Normalization starts with a simple three-step procedure.

**Step 1: Eliminate the Repeating Groups**

Start by presenting the data in tabular format, where each cell has a single value and there are no repeating groups. To eliminate the repeating groups, eliminate the nulls by making sure that each repeating group attribute contains an appropriate data value. That change converts the table in Figure 6.1 to the 1NF layout in Figure 6.2.

**Step 2: Identify the Primary Key**

The layout in Figure 6.2 represents more than a mere cosmetic change. Even a casual observer will note that PROJ_NUM is not an adequate primary key because the project number does not uniquely identify all of the remaining entity (row) attributes. For example, the PROJ_NUM value 15 can identify any one of five employees. To maintain a proper primary key that will *uniquely* identify any attribute value, the new key must be composed of a *combination* of PROJ_NUM and EMP_NUM.
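To make Steps 1 and 2 concrete, here is a small Python sketch of my own; the nested layout and sample values merely imitate the spirit of Figure 6.1 and are not the book's exact data. It flattens a repeating group into 1NF rows and confirms that only the combination (PROJ_NUM, EMP_NUM) identifies a row uniquely.

```python
# Hypothetical data in the spirit of Figure 6.1: one project row holds a
# repeating group of employee entries.
nested = {
    "PROJ_NUM": 15,
    "PROJ_NAME": "Evergreen",
    "EMPLOYEES": [  # the repeating group
        {"EMP_NUM": 103, "EMP_NAME": "June E. Arbough",
         "JOB_CLASS": "Elect. Engineer", "CHG_HOUR": 84.50, "HOURS": 23.8},
        {"EMP_NUM": 101, "EMP_NAME": "John G. News",
         "JOB_CLASS": "Database Designer", "CHG_HOUR": 105.00, "HOURS": 19.4},
    ],
}

# Step 1: eliminate the repeating group, producing one flat row per group
# entry and repeating the project attributes so that no cell is empty.
flat = [
    {"PROJ_NUM": nested["PROJ_NUM"], "PROJ_NAME": nested["PROJ_NAME"], **emp}
    for emp in nested["EMPLOYEES"]
]

# Step 2: PROJ_NUM alone repeats, but (PROJ_NUM, EMP_NUM) is unique.
keys = [(r["PROJ_NUM"], r["EMP_NUM"]) for r in flat]
assert len(keys) == len(set(keys))  # the composite key identifies each row
```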
**Step 3: Identify All Dependencies**

The identification of the PK in Step 2 means that you have already identified the following dependency:

PROJ_NUM, EMP_NUM → PROJ_NAME, EMP_NAME, JOB_CLASS, CHG_HOUR, HOURS

That is, the PROJ_NAME, EMP_NAME, JOB_CLASS, CHG_HOUR, and HOURS values are all dependent on (that is, determined by) the combination of PROJ_NUM and EMP_NUM.

There are additional dependencies. For example, the project number identifies (determines) the project name; in other words, the project name is dependent on the project number. You can write that dependency as:

PROJ_NUM → PROJ_NAME

Also, if you know an employee number, you also know that employee's name, job classification, and charge per hour. Therefore, you can identify the dependency shown next:

EMP_NUM → EMP_NAME, JOB_CLASS, CHG_HOUR

In simpler terms, an employee has the following attributes: a number, a name, a job classification, and a charge per hour. However, by further studying the data in Figure 6.2, you can see that knowing the job classification means knowing the charge per hour for that job classification. (Notice that all "Systems Analyst" or "Programmer" positions have the same charge per hour, regardless of the project or employee.) In other words, the charge per hour depends on the job classification, not the employee. Therefore, you can identify one last dependency:

JOB_CLASS → CHG_HOUR

- **Second Normal Form**

Conversion to 2NF occurs only when the 1NF has a composite primary key. If the 1NF has a single-attribute primary key, then the table is automatically in 2NF. The 1NF-to-2NF conversion is simple. Starting with the 1NF format displayed in Figure 6.3, you take the following steps:

**Step 1: Make New Tables to Eliminate Partial Dependencies**

For each component of the primary key that acts as a determinant in a partial dependency, create a new table with a copy of that component as the primary key. While these components are placed in the new tables, it is important that they also remain in the original table, because they will be the foreign keys for the relationships needed to relate the new tables to the original table. To construct the revised dependency diagram, write each key component on a separate line and then write the original (composite) key on the last line. For example:

PROJ_NUM
EMP_NUM
PROJ_NUM EMP_NUM

Each component will become the key in a new table. In other words, the original table is now divided into three tables (PROJECT, EMPLOYEE, and ASSIGNMENT).

**Step 2: Reassign Corresponding Dependent Attributes**

The attributes that are dependent in a partial dependency are removed from the original table and placed in the new table with the dependency's determinant. Any attributes that are not dependent in a partial dependency will remain in the original table. In other words, the three tables that result from the conversion to 2NF are given appropriate names (PROJECT, EMPLOYEE, and ASSIGNMENT) and are described by the following relational schemas:

PROJECT (**PROJ_NUM**, PROJ_NAME)
EMPLOYEE (**EMP_NUM**, EMP_NAME, JOB_CLASS, CHG_HOUR)
ASSIGNMENT (**PROJ_NUM, EMP_NUM**, ASSIGN_HOURS)

Because the number of hours spent on each project by each employee is dependent on both PROJ_NUM and EMP_NUM, you leave those hours in the ASSIGNMENT table as ASSIGN_HOURS. Notice that the ASSIGNMENT table contains a composite primary key composed of the attributes PROJ_NUM and EMP_NUM.
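The same split can be sketched in code. The following Python fragment is illustrative only; the sample rows are stand-ins for Figure 6.3's data, and `project` is a hypothetical helper. Each partial-dependency determinant gets its own table, built by projecting and de-duplicating the 1NF rows.

```python
# A minimal sketch of the 1NF -> 2NF split, with made-up sample rows.
flat = [
    {"PROJ_NUM": 15, "PROJ_NAME": "Evergreen", "EMP_NUM": 103,
     "EMP_NAME": "June E. Arbough", "JOB_CLASS": "Elect. Engineer",
     "CHG_HOUR": 84.50, "HOURS": 23.8},
    {"PROJ_NUM": 15, "PROJ_NAME": "Evergreen", "EMP_NUM": 101,
     "EMP_NAME": "John G. News", "JOB_CLASS": "Database Designer",
     "CHG_HOUR": 105.00, "HOURS": 19.4},
    {"PROJ_NUM": 18, "PROJ_NAME": "Amber Wave", "EMP_NUM": 101,
     "EMP_NAME": "John G. News", "JOB_CLASS": "Database Designer",
     "CHG_HOUR": 105.00, "HOURS": 11.6},
]

def project(rows, cols):
    """Keep only cols from each row, dropping duplicate rows."""
    out = []
    for row in rows:
        sub = {c: row[c] for c in cols}
        if sub not in out:
            out.append(sub)
    return out

# One table per partial-dependency determinant, plus the original
# composite key with its fully dependent attribute (renamed ASSIGN_HOURS).
PROJECT    = project(flat, ["PROJ_NUM", "PROJ_NAME"])
EMPLOYEE   = project(flat, ["EMP_NUM", "EMP_NAME", "JOB_CLASS", "CHG_HOUR"])
ASSIGNMENT = [{"PROJ_NUM": r["PROJ_NUM"], "EMP_NUM": r["EMP_NUM"],
               "ASSIGN_HOURS": r["HOURS"]} for r in flat]

print(EMPLOYEE)  # John G. News now appears once, not once per project
```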
Notice that by leaving the determinants in the original table as well as making them the primary keys of the new tables, primary key/foreign key relationships have been created. For example, in the EMPLOYEE table, EMP_NUM is the primary key. In the ASSIGNMENT table, EMP_NUM is part of the composite primary key (PROJ_NUM, EMP_NUM) and is a foreign key relating the EMPLOYEE table to the ASSIGNMENT table. The results of Steps 1 and 2 are displayed in Figure 6.4. At this point, most of the anomalies discussed earlier have been eliminated.

![](media/image4.png)

- **Third Normal Form**

Figure 6.4 still shows a transitive dependency, which can generate anomalies. For example, if the charge per hour changes for a job classification held by many employees, that change must be made for *each* of those employees. If you forget to update some of the employee records that are affected by the charge per hour change, different employees with the same job description will generate different hourly charges. The data anomalies created by the database organization shown in Figure 6.4 are easily eliminated by completing the following two steps:

**Step 1: Make New Tables to Eliminate Transitive Dependencies**

For every transitive dependency, write a copy of its determinant as a primary key for a new table. A **determinant** is any attribute whose value determines other values within a row. If you have three different transitive dependencies, you will have three different determinants. As with the conversion to 2NF, it is important that the determinant remain in the original table to serve as a foreign key. Figure 6.4 shows only one table that contains a transitive dependency. Therefore, write the determinant for this transitive dependency as:

JOB_CLASS

**Step 2: Reassign Corresponding Dependent Attributes**

Using Figure 6.4, identify the attributes that are dependent on each determinant identified in Step 1. Place the dependent attributes in the new tables with their determinants and remove them from their original tables. In this example, eliminate CHG_HOUR from the EMPLOYEE table shown in Figure 6.4 to leave the EMPLOYEE table dependency definition as:

EMP_NUM → EMP_NAME, JOB_CLASS

Draw a new dependency diagram to show all of the tables you have defined in Steps 1 and 2. Name the new table to reflect its contents and function; in this case, JOB seems appropriate. Check all of the tables to make sure that each table has a determinant and that no table contains inappropriate dependencies. When you have completed these steps, you will see the results in Figure 6.5. In other words, after the 3NF conversion has been completed, your database will contain four tables:

PROJECT (**PROJ_NUM**, PROJ_NAME)
EMPLOYEE (**EMP_NUM**, EMP_NAME, JOB_CLASS)
JOB (**JOB_CLASS**, CHG_HOUR)
ASSIGNMENT (**PROJ_NUM, EMP_NUM**, ASSIGN_HOURS)

Note that this conversion has eliminated the original EMPLOYEE table's transitive dependency. The tables are now said to be in third normal form (3NF).
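As a sketch of this 2NF-to-3NF step (again with made-up sample rows, not the book's exact data), the following Python fragment pulls the transitive dependency JOB_CLASS → CHG_HOUR into a separate JOB table and leaves JOB_CLASS behind as the foreign key.

```python
# A minimal, illustrative sketch of the 2NF -> 3NF decomposition.
employee_2nf = [
    {"EMP_NUM": 101, "EMP_NAME": "John G. News",
     "JOB_CLASS": "Database Designer", "CHG_HOUR": 105.00},
    {"EMP_NUM": 103, "EMP_NAME": "June E. Arbough",
     "JOB_CLASS": "Elect. Engineer", "CHG_HOUR": 84.50},
    {"EMP_NUM": 105, "EMP_NAME": "Alice K. Johnson",
     "JOB_CLASS": "Database Designer", "CHG_HOUR": 105.00},
]

# Step 1: one row per determinant value (JOB_CLASS) in the new JOB table.
JOB = [{"JOB_CLASS": jc, "CHG_HOUR": ch}
       for jc, ch in {(r["JOB_CLASS"], r["CHG_HOUR"]) for r in employee_2nf}]

# Step 2: remove the dependent attribute from EMPLOYEE; JOB_CLASS stays
# behind as the foreign key into JOB.
EMPLOYEE = [{k: v for k, v in r.items() if k != "CHG_HOUR"}
            for r in employee_2nf]

# A charge-per-hour change is now a single-row update in JOB, so the
# update anomaly described above disappears.
```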
**Note:** It is interesting to note the similarities between resolving 2NF and 3NF problems. To convert a table from 1NF to 2NF, it is necessary to remove the partial dependencies. To convert a table from 2NF to 3NF, it is necessary to remove the transitive dependencies.

**Improving the Design**

Now that the table structures have been cleaned up to eliminate the troublesome partial and transitive dependencies, you can focus on improving the database's ability to provide information and on enhancing its operational characteristics.

- **Evaluate PK Assignments** Each time a new employee is entered into the EMPLOYEE table, a JOB_CLASS value must be entered. Unfortunately, it is too easy to make data-entry errors that lead to referential integrity violations. For example, entering *DB Designer* instead of *Database Designer* for the JOB_CLASS attribute in the EMPLOYEE table will trigger such a violation. Therefore, it would be better to add a JOB_CODE attribute to create a unique identifier. The addition of a JOB_CODE attribute produces the following dependency:

  JOB_CODE → JOB_CLASS, CHG_HOUR

  If you assume that JOB_CODE is a proper primary key, this new attribute does produce the following dependency:

  JOB_CLASS → CHG_HOUR

  However, this dependency is not a transitive dependency because the determinant is a candidate key. Further, the presence of JOB_CODE greatly decreases the likelihood of referential integrity violations. Note that the new JOB table now has two candidate keys, JOB_CODE and JOB_CLASS. In this case, JOB_CODE is the chosen primary key.

- **Evaluate Naming Conventions** It is best to adhere to the naming conventions outlined in Chapter 2, Data Models. Therefore, CHG_HOUR will be changed to JOB_CHG_HOUR to indicate its association with the JOB table. In addition, the attribute name JOB_CLASS does not quite describe entries such as *Systems Analyst*, *Database Designer*, and so on; the label JOB_DESCRIPTION fits the entries better. Also, you might have noticed that HOURS was changed to ASSIGN_HOURS in the conversion from 1NF to 2NF. That change lets you associate the hours worked with the ASSIGNMENT table.

- **Refine Attribute Atomicity** It is generally good practice to pay attention to the *atomicity* requirement. An **atomic attribute** is one that cannot be further subdivided. Such an attribute is said to display **atomicity**. Clearly, the use of EMP_NAME in the EMPLOYEE table is not atomic because EMP_NAME can be decomposed into a last name, a first name, and an initial. By improving the degree of atomicity, you also gain querying flexibility. For example, if you use EMP_LNAME, EMP_FNAME, and EMP_INITIAL, you can easily generate phone lists by sorting last names, first names, and initials. Such a task would be very difficult if the name components were stored within a single attribute. In general, designers prefer to use simple, single-valued attributes, as indicated by the business rules and processing requirements.

- **Identify New Attributes** If the EMPLOYEE table were used in a real-world environment, several other attributes would have to be added. For example, year-to-date gross salary payments, Social Security payments, and Medicare payments would be desirable. An employee hire date attribute (EMP_HIREDATE) could be used to track an employee's job longevity, and it could serve as a basis for awarding bonuses to long-term employees and for other morale-enhancing measures. The same principle must be applied to all other tables in your design.
- **Identify New Relationships** According to the original report, the users need to track which employee is acting as the manager of each project. This can be implemented as a relationship between EMPLOYEE and PROJECT. From the original report, it is clear that each project has only one manager. Therefore, the system's ability to supply detailed information about each project's manager is ensured by using EMP_NUM as a foreign key in PROJECT. That action ensures that you can access the details of each PROJECT's manager data without producing unnecessary and undesirable data duplication. The designer must take care to place the right attributes in the right tables by using normalization principles.

- **Refine Primary Keys as Required for Data Granularity**

- **Maintain Historical Accuracy** Writing the job charge per hour into the ASSIGNMENT table is crucial to maintaining the historical accuracy of the table's data. It would be appropriate to name this attribute ASSIGN_CHG_HOUR. Although this attribute would appear to have the same value as JOB_CHG_HOUR, that is true *only* if the JOB_CHG_HOUR value remains the same forever; it is reasonable to assume that the job charge per hour will change over time. Suppose instead that the charges to each project were calculated and billed by multiplying the hours worked from the ASSIGNMENT table by the charge per hour from the JOB table. Those charges would always reflect the *current* charge per hour stored in the JOB table rather than the charge per hour that was in effect at the time of the assignment.

- **Evaluate Using Derived Attributes** Finally, you can use a derived attribute in the ASSIGNMENT table to store the actual charge made to a project. That derived attribute, named ASSIGN_CHARGE, is the result of multiplying ASSIGN_HOURS by ASSIGN_CHG_HOUR. This creates a transitive dependency such that:

  (ASSIGN_CHARGE + ASSIGN_HOURS) → ASSIGN_CHG_HOUR

  From a system functionality point of view, such derived attribute values can be calculated when they are needed to write reports or invoices. However, storing the derived attribute in the table makes it easy to write the application software that produces the desired results, as in the sketch following this list.
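A minimal Python sketch of these last two points follows; the helper `make_assignment` and the sample values are hypothetical. It snapshots the current job charge into ASSIGN_CHG_HOUR for historical accuracy and stores the derived ASSIGN_CHARGE at the time the row is created.

```python
# A minimal sketch (my example, not the text's) of historical accuracy
# plus a stored derived attribute.

def make_assignment(proj_num, emp_num, assign_hours, job_chg_hour):
    """Build an ASSIGNMENT row, copying the *current* JOB charge."""
    return {
        "PROJ_NUM": proj_num,
        "EMP_NUM": emp_num,
        "ASSIGN_HOURS": assign_hours,
        # snapshot of JOB_CHG_HOUR, kept for historical accuracy
        "ASSIGN_CHG_HOUR": job_chg_hour,
        # derived attribute: ASSIGN_HOURS * ASSIGN_CHG_HOUR
        "ASSIGN_CHARGE": round(assign_hours * job_chg_hour, 2),
    }

row = make_assignment(15, 103, 23.8, 84.50)
print(row["ASSIGN_CHARGE"])  # 2011.1, unaffected by later JOB updates
```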
**Higher-Level Normal Forms**

Tables in 3NF will perform suitably in business transactional databases. However, higher normal forms are sometimes useful. In this section, you will learn about a special case of 3NF, known as Boyce-Codd normal form, and about fourth normal form (4NF).

- **The Boyce-Codd Normal Form**

A table is in Boyce-Codd normal form (BCNF) when every determinant in the table is a candidate key. (Recall from Chapter 3 that a candidate key has the same characteristics as a primary key, but for some reason it was not chosen to be the primary key.) Clearly, when a table contains only one candidate key, 3NF and BCNF are equivalent. In other words, BCNF can be violated only when the table contains more than one candidate key. In the previous normal form examples, tables with only one candidate key were used to simplify the explanations. Remember, however, that multiple candidate keys are always possible, and normalization rules focus on candidate keys, not just the primary key. Most designers consider BCNF to be a special case of 3NF. In fact, if the techniques shown in this chapter are used, most tables conform to the BCNF requirements once 3NF is reached.

So how can a table be in 3NF and not be in BCNF? To answer that question, keep in mind that a transitive dependency exists when one nonprime attribute is dependent on another nonprime attribute. In other words, a table is in 3NF when it is in 2NF and there are no transitive dependencies. But what about a case in which one key attribute is the determinant of another key attribute? That condition does not violate 3NF, yet it fails to meet the BCNF requirements (see Figure 6.8) because BCNF requires that every determinant in the table be a candidate key.

![](media/image7.png)

A + B → C, D
A + C → B, D
C → B

Notice that this structure has two candidate keys: (A + B) and (A + C). The table structure shown in Figure 6.8 has no partial dependencies, nor does it contain transitive dependencies. (The condition C → B indicates that *one key attribute determines part of the primary key*, and *that* dependency is neither transitive nor partial because the dependent is a prime attribute!) Thus, the table structure in Figure 6.8 meets the 3NF requirements, although the condition C → B causes the table to fail to meet the BCNF requirements.

To convert the table structure in Figure 6.8 into table structures that are in 3NF and in BCNF, first change the primary key to A + C. This change is appropriate because the dependency C → B means that C is effectively a superset of B. At this point, the table is in 1NF because it contains a partial dependency, C → B. Next, follow the standard decomposition procedures to produce the results shown in Figure 6.9.

**Denormalization**

When we normalize tables, we break them into multiple smaller tables, so retrieving data that spans those tables requires some kind of join operation. Denormalization is the technique that addresses this drawback of normalization. It is used by database administrators to optimize the efficiency of their database infrastructure: redundant data is added back into a normalized database to alleviate performance problems with queries that merge data from several tables. The concept builds on the definition of normalization as arranging a database into tables correctly for a particular purpose.

NOTE: Denormalization does not mean skipping normalization. It is an optimization strategy applied after normalization has been achieved.

**Pros of Denormalization**

The following are the advantages of denormalization:

1. **Enhance query performance.** Queries against a normalized database generally require joining a large number of tables, and the more joins, the slower the query. To overcome this, we can add redundancy to a database by copying values between parent and child tables, minimizing the number of joins needed for a query.

2. **Make the database more convenient to manage.** A normalized database does not store the calculated values that applications require, and calculating those values on the fly takes longer, slowing down query execution. With denormalization, such values can be stored, and fetch queries become simpler because they need to look at fewer tables.

3. **Facilitate and accelerate reporting.** Suppose you need certain statistics very frequently. Producing them from live data takes a long time and slows down the entire system. For example, suppose you want to monitor client revenues over a certain year for any or all clients. Generating such reports from live data requires searching throughout the entire database, significantly slowing it down. A precomputed summary table avoids that cost, as in the sketch below.
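As an illustration of the reporting case, the following Python sketch (hypothetical names and figures, not from the text) precomputes a client-revenue summary so that frequent reports become cheap lookups instead of scans over live data.

```python
# A minimal sketch of denormalizing for reporting: refresh a redundant
# summary table periodically, then serve reports from it.

from collections import defaultdict

invoices = [  # hypothetical live data
    {"CLIENT": "Evergreen", "YEAR": 2023, "AMOUNT": 2011.10},
    {"CLIENT": "Evergreen", "YEAR": 2023, "AMOUNT": 1837.00},
    {"CLIENT": "Amber Wave", "YEAR": 2023, "AMOUNT": 1218.00},
]

# The denormalized summary table (the deliberate redundancy).
client_revenue = defaultdict(float)
for inv in invoices:
    client_revenue[(inv["CLIENT"], inv["YEAR"])] += inv["AMOUNT"]

# Reports are now a single lookup rather than a scan/join over live data.
print(client_revenue[("Evergreen", 2023)])  # 3848.1
```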
**Cons of Denormalization**

The following are the disadvantages of denormalization:

- It requires more storage because of the data redundancy.
- It makes updates and inserts into a table more expensive.
- It makes update and insert code harder to write.
- Because the same data can be modified in several places, it can make the data inconsistent; every copy of a duplicated piece of data must be updated together. This applies equally to the derived values used to produce reports. We can enforce this by using triggers, transactions, and/or procedures for all operations that must be performed together, as in the sketch below.
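As a closing sketch of that remedy, the following Python example uses SQLite; the schema and values are hypothetical, and the denormalized table carries a redundant copy of JOB_CHG_HOUR so queries can avoid a join. A trigger propagates updates so every copy stays consistent.

```python
# A minimal sketch: keep redundant data consistent with a trigger.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE JOB (
    JOB_CODE     INTEGER PRIMARY KEY,
    JOB_CHG_HOUR REAL NOT NULL
);
CREATE TABLE EMP_DENORM (
    EMP_NUM      INTEGER PRIMARY KEY,
    JOB_CODE     INTEGER NOT NULL REFERENCES JOB,
    JOB_CHG_HOUR REAL NOT NULL        -- redundant copy, avoids a join
);
-- Propagate every charge change to the redundant copies.
CREATE TRIGGER job_chg_sync AFTER UPDATE OF JOB_CHG_HOUR ON JOB
BEGIN
    UPDATE EMP_DENORM
    SET JOB_CHG_HOUR = NEW.JOB_CHG_HOUR
    WHERE JOB_CODE = NEW.JOB_CODE;
END;

INSERT INTO JOB VALUES (502, 105.00);
INSERT INTO EMP_DENORM VALUES (101, 502, 105.00);
UPDATE JOB SET JOB_CHG_HOUR = 112.00 WHERE JOB_CODE = 502;
""")
print(con.execute("SELECT JOB_CHG_HOUR FROM EMP_DENORM").fetchone())
# (112.0,) because the trigger kept the redundant copy consistent
```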