Data_management4.pdf
Document Details
Uploaded by CapableAmethyst
Tags
Related
- Session 3 - Database Design Process .pdf
- BUSA1000 Introduction to Data, Analytics and People Lecture Notes PDF
- Week 10 - Data Models PDF
- Chapter 3: Modeling Data in the Organization PDF
- Chapter 4 Data Modeling Using the Entity-Relationship (ER) Model PDF
- Data Management and Storage Lecture Notes PDF
Full Transcript
4.1 ENTITIES, RELATIONSHIPS, AND ATTRIBUTES An entity-relationship model is a high-level representation of data requirements, ignoring implementation details. An entity-relationship model guides implementation in a particular database system, such as MySQL. An entity-relationship model includes thre...
4.1 ENTITIES, RELATIONSHIPS, AND ATTRIBUTES An entity-relationship model is a high-level representation of data requirements, ignoring implementation details. An entity-relationship model guides implementation in a particular database system, such as MySQL. An entity-relationship model includes three kinds of objects: ● An entity is a person, place, product, concept, or activity. ● A relationship is a statement about two entities. ● An attribute is a descriptive property of an entity. A relationship is usually a statement about two different entities, but the two entities may be the same. A reflexive relationship relates an entity to itself. When the model is implemented in SQL, entities typically become tables. Relationships and attributes typically become foreign keys and columns, respectively. Entity-relationship diagram and glossary An entity-relationship diagram, commonly called an ER diagram, is a schematic picture of entities, relationships, and attributes. Entities are drawn as rectangles. Relationships are drawn as lines connecting rectangles. Attributes appear as additional text within an entity rectangle, under the entity name. Entities and relationships always appear in ER diagrams. Attributes are optional and only appear when additional detail is needed. A glossary, also known as a data dictionary or repository, documents additional detail in text format. A glossary includes names, synonyms, and descriptions of entities, relationships, and attributes. Types and instances In entity-relationship modeling, a type is a set: An entity type is a set of things. Ex: All employees in a company. A relationship type is a set of related things. Ex: Employee-Manages-Department is a set of (employee, department) pairs, where the employee manages the department. ● An attribute type is a set of values. Ex: All employee salaries. Entity, relationship, and attribute types usually become tables, foreign keys, and columns, respectively. An instance is an element of a set: ● ● An entity instance is an individual thing. Ex: The employee Sam Snead. A relationship instance is a statement about entity instances. Ex: "Maria Rodriguez manages Sales." ● An attribute instance is an individual value. Ex: The salary $35,000. Entity, relationship, and attribute instances usually become rows, foreign key values, and column values, respectively ● ● Database design Complex databases are developed in three phases: 1. Analysis develops an entity-relationship model, capturing data requirements while ignoring implementation details. 2. Logical design converts the entity-relationship model into tables, columns, and keys for a particular database system. 3. Physical design adds indexes and specifies how tables are organized on storage media. Analysis is particularly important for complex databases with many users when documenting requirements is challenging. For small databases with just a few tables and users, analysis is less important and often omitted. Analysis and logical design steps are summarized in the table below. Although these steps are presented in sequence, in practice execution is not always sequential. Often an early step is revisited after a later step is completed. Analysis steps Step Name 1 Discover entities, relationships, and attributes 2 Determine cardinality 3 Distinguish strong and weak entities 4 Create supertype and subtype entities Logical design steps Step Name 5 Implement entities 6 Implement relationships 7 Implement attributes 8 Apply normal form 4.2 DISCOVERY Entities, relationships, and attributes are discovered in interviews with database users and managers. Users and managers are usually familiar with data requirements from an old database, or perhaps a manual process with paper records. When users are difficult to reach, a database designer may communicate with surrogates. Ex: A sales representative might communicate on behalf of prospective customers. In addition to interviews, written documents are a good source of data requirements. Ex: The user manual for an older version of the database is a good source of requirements. In interviews and documents, entities, relationships, and attributes surface as nouns and verbs: ● ● ● Entities usually appear as nouns, but not all nouns are entities. Designers should ignore nouns that denote specific data or are not relevant to the database. Relationships are often expressed as verbs. Designers should ignore statements that are not about entities, not relevant to the database, or redundant to other relationships. Designers should look for relationships that are not explicitly stated, since users may overlook important information. Attributes are usually nouns that denote specific data, such as names, dates, quantities, and monetary values. Names Entity names are a singular noun. Ex: Employee rather than Employees. The best names are commonly used and easily understood by database users. Relationships names have the form Entity-Verb-Entity, such as Division-Contains-Department. When the related entities are obvious, in ER diagrams or informal conversation, Verb is sufficient and entity names can be omitted. The verb should be active rather than passive. Ex: Manages rather than IsManagedBy. Occasionally, the same verb relates different entity pairs. Ex: Order-Contains-LineItem and Division-Contains-Department. Attribute names have the form EntityQualifierType, such as EmployeeFirstName: ● ● ● Entity is the name of the entity that the attribute describes. When the entity is obvious, in ER diagrams or informal conversation, QualifierType is sufficient and the entity name can be omitted. Qualifier describes the meaning of the attribute. Ex: First, Last, and Alternate. Sometimes a qualifier is unnecessary and can be omitted. Ex: StudentNumber. Type is chosen from a list of standard attribute types such as Name, Number, and Count. Attribute types are not identical to SQL data types. Ex: "Amount" might be an attribute type representing monetary values, implemented as the MONEY data type in SQL. "Count" might be an attribute type representing quantity, implemented as NUMBER in SQL. Synonyms and descriptions Often, entity, relationship, and attribute names have synonyms. Ex: Representative may be a synonym for SalesAgent. Synonyms are common in informal communications. To avoid confusion, one official name is selected for each entity, relationship, and attribute. Other names are documented in the glossary as synonyms. The glossary also contains complete descriptions of entities, relationships, and attributes. The description states the meaning of each entity, relationship, or attribute in complete sentences. The description begins with the name and includes examples and counterexamples to illustrate usage. Database design The first step of the analysis phase is discovery of entities, relationships, and attributes in interviews and document review. As discovery proceeds, the designer draws an ER diagram, determines standard attributed types, and documents names, synonyms, and descriptions in the glossary. Although the step numbers suggest a sequence, database designers commonly move back and forth between steps. As names, synonyms, and descriptions are documented, additional entities, relationships, and attributes are discovered. The ER diagram and glossary are usually developed in parallel. 4.3 CARDINALITY Relationship maximum In entity-relationship modeling, cardinality refers to maxima and minima of relationships and attributes. Relationship maximum is the greatest number of instances of one entity that can relate to a single instance of another entity. A relationship has two maxima, one for each of the related entities. Maxima are usually specified as one or many. A related entity is singular when the maximum is one and plural when the maximum is many. Relationship minimum Relationship minimum is the least number of instances of one entity that can relate to a single instance of another entity. A relationship has two minima, one for each of the related entities. Minima are usually specified as zero or one. A related entity is optional when the minimum is zero and required when the minimum is one. Attribute maximum and minimum Attribute maximum is the greatest number of attribute values that can describe each entity instance. Attribute maximum is usually specified as one (singular) or many (plural). Attribute minimum is the least number of attribute values that can describe each entity instance. Attribute minimum is usually specified as zero (optional) or one (required). In ER diagrams, attribute maximum and minimum follow the attribute name. The minimum appears in parentheses. Unique attributes Each value of a unique attribute describes at most one entity instance. A unique attribute is not the same as a singular attribute: ● A unique attribute has at most one entity instance for each attribute value. ● A singular attribute has at most one attribute value for each entity instance. In ER diagrams, 1 indicates a unique attribute and M indicates a non-unique attribute. The 1 or M appears before the attribute maximum and minimum. 4.4 STRONG AND WEAK ENTITIES Strong entities An identifying attribute is unique, singular, and required. Identifying attribute values correspond one-to-one to, or identify, entity instances. A strong entity has one or more identifying attributes. When a strong entity is implemented as a table, one of the identifying attributes may become the primary key. Weak entities A weak entity does not have an identifying attribute. Instead, a weak entity usually has a relationship, called an identifying relationship, to another entity, called an identifying entity. Cardinality of the identifying entity is 1(1). In an ER diagram, an identifying relationship has a diamond next to the identifying entity. Cardinality of the identifying entity is always 1(1), so the diamond replaces the cardinality symbol. Identifying entities A weak entity is usually identified by a strong entity. However, a weak entity can be identified by another weak entity or by several entities. 4.5 SUPERTYPE AND SUBTYPE ENTITIES Supertype and subtype entities An entity type is a set of entity instances. A subtype entity is a subset of another entity type, called the supertype entity. Ex: Managers are a subset of employees, so Manager is a subtype entity of the Employee supertype entity. On ER diagrams, subtype entities are drawn within the supertype. A supertype entity usually has several subtypes. Attributes of the supertype apply to all subtypes. Attributes of a subtype do not apply to other subtypes or the supertype. A supertype entity identifies its subtype entities. The identifying relationship is called an IsA relationship. Ex: Manager-IsAn-Employee relates each manager instance to the corresponding employee instance. Since a supertype entity always identifies its subtypes, the IsA relationship is assumed and can be omitted from the ER diagram. Similar entities and optional attributes Supertype and subtype entities are often created from similar entities and optional attributes. Similar entities are entities that have many common attributes and relationships. Similar entities become subtypes of a new supertype entity, Partitions A partition of a supertype entity is a group of mutually exclusive subtype entities. A supertype entity can have several partitions. Subtype entities within each partition are disjoint and do not share instances. Subtype entities in different partitions overlap and do share instances. In diagrams, subtype entities within each partition are vertically aligned. Subtype entities in different partitions are horizontally aligned. Each partition corresponds to an optional partition attribute of the supertype entity. The partition attribute indicates which subtype entity is associated with each supertype instance. Database design After entities, relationships, attributes, cardinality, and strong and weak entities are determined, the database designer looks for supertype and subtype entities. Similar entities and optional attributes suggest new supertype and subtype entities and warrant special attention. Mutually exclusive subtype entities are grouped into partitions. For each partition, a partition attribute is added to the supertype entity. Creating supertype and subtype entities is the last of four analysis steps: 4. Discover entities, relationships, and attributes 5. Determine cardinality 6. Distinguish strong and weak entities 7. Create supertype and subtype entities Logical design follows analysis. Logical design converts an entity-relationship model to tables, columns, and keys for a specific database system. 4.6 ALTERNATIVE MODELING CONVENTIONS Diagram conventions ER diagram conventions vary widely. Ex: Some ER diagrams may: ● Depict relationship names inside a diamond. ● Depict weak entities and identifying relationships with double lines. ● Depict subtype entities with IsA relationships rather than inside of supertype entities. ● Use color, dashed lines, or double lines to convey additional information. Variations in cardinality conventions are common. One popular convention, called crow's foot notation, depicts cardinality as a circle (zero), a short line (one), or three short lines (many). The three short lines look like a bird's foot, hence the name "crow's foot notation". Model conventions ER modeling concepts also vary. Ex: Some ER models may: ● Allow relationships between three or more entities. ● Decompose a complex model into a group of related entities, called a subject area. ● Refer to strong entities as independent and weak entities as dependent. Several model conventions are standardized and widely used. Leading conventions include: Unified Modeling Language, or UML, is commonly used for software development. Software data structures are similar to database structures, so UML includes ER conventions. ● IDEF1X stands for Information DEFinition version 1X. IDEF1X became popular, in part, due to early adoption by the United States Department of Defense. ● Chen notation appeared in an early ER modeling paper by Peter Chen. Chen notation is not standardized but often appears in literature and tools. By and large, differences between conventions are stylistic rather than substantial. The choice of convention does not usually affect the resulting database design. ● An intangible entity is documented in the data model, but not tracked with data in the database. Ex: In the animations above, the Part entity is intangible. In an ER diagram, intangible entities are distinguished with special notation, such as a dashed rectangle or distinct color. 4.7 IMPLEMENTING ENTITIES Selecting primary keys In the first step of the logical design phase, each entity becomes a table and each attribute becomes a column. Tables and columns are revised in subsequent steps. As tables and columns are specified, primary keys are selected. Primary keys must be unique and not NULL, and thus correspond to unique and required attributes. Primary keys should also be: Stable. Primary key values should not change. When a primary key value changes, statements that specify the old value must also change. Furthermore, the new primary key value must cascade to matching foreign keys. ● Simple. Primary key values should be easy to type and store. Small values are easy to specify in an SQL WHERE clause and speed up query processing. Ex: A 2-byte integer is easier to type and faster to process than a 15-byte character string. ● Meaningless. Primary keys should not contain descriptive information. Descriptive information occasionally changes, so primary keys containing descriptive information are unstable. Stable, simple, and meaningless primary keys are desirable but not required. Depending on database standards, these guidelines may be violated in some cases. ● Implementing strong entities A strong entity becomes a strong table. The primary key must be unique and non-NULL, and should be stable, simple, and meaningless. Single-column primary keys are best, but if no such column exists, a composite primary key may have the required properties. An artificial key is a single-column primary key created by the database designer when no suitable single-column or composite primary key exists. Usually artificial key values are integers, generated automatically by the database as new rows are inserted to the table. Artificial keys are stable, simple, and meaningless. Implementing subtype entities A subtype entity becomes a subtype table and is implemented as follows: ● ● The primary key is identical to the supertype primary key. The primary key is also a foreign key that references the supertype primary key. The foreign key implements the IsA identifying relationship. Foreign keys that implement identifying relationships usually have the following referential integrity actions: ● Cascade on primary key update and delete ● Restrict on foreign key insert and update 4.8 IMPLEMENTING RELATIONSHIPS Implementing many-one relationships The 'implement relationships' step converts relationships into keys or tables, depending on relationship cardinality. A many-one or one-many relationship becomes a foreign key: The foreign key goes in the table on the 'many' side of the relationship. The foreign key refers to the primary key on the 'one' side. The foreign key name is the primary key name with an optional prefix. The prefix is derived from the relationship name and clarifies the meaning of the foreign key. Implementing one-one relationships A one-one relationship becomes a foreign key: ● ● ● The foreign key can go in the table on either side of the relationship. Usually, the foreign key is placed in the table with fewer rows, to minimize the number of NULL values. ● The foreign key refers to the primary key on the opposite side of the relationship. ● The foreign key name is the primary key name with an optional prefix. The prefix is derived from the relationship name and clarifies the meaning of the foreign key. Implementing many-many relationships A many-many relationship becomes a new weak table: ● The new table contains two foreign keys, referring to the primary keys of the related tables. ● The primary key of the new table is the composite of the two foreign keys. ● The new table is identified by the related tables, so primary key cascade and foreign key restrict rules are specified. ● The new table name consists of the related table names with an optional qualifier in between. The qualifier is derived from the relationship name and clarifies the meaning of the table. Database design The 'implement relationships' step adds foreign keys to the initial table design. Each many-one and one-one relationship becomes a new foreign key. Each many-many relationship becomes a new dependent table containing two foreign keys. ● Foreign keys that implement dependency relationships usually have the following referential integrity actions: ● ● Cascade on primary key update and delete Restrict on foreign key insert and update 4.9 IMPLEMENTING ATTRIBUTES Implementing plural attributes In the 'implement entities' step, entities become tables and attributes become columns. Singular attributes remain in the initial table, but plural attributes move to a new weak table: ● ● ● ● The new table contains the plural attribute and a foreign key referencing the initial table. The primary key of the new table is the composite of the plural attribute and the foreign key. The new table is identified by the initial table, so primary key cascade and foreign key restrict rules are specified. The new table name consists of the initial table name followed by the attribute name. If a plural attribute has a small, fixed maximum, the plural attribute can be implemented as multiple columns in the initial table. However, implementing plural attributes in a new table simplifies queries and is usually a better solution. 4.10 FIRST, SECOND, AND THIRD NORMAL FORM Functional dependence Column A depends on column B means each B value is related to at most one A value. Columns A and B may be simple or composite. 'A depends on B' is denoted B → A. Dependence of one column on another is called functional dependence. Functional dependence reflects business rules. Normal forms Redundancy is the repetition of related values in a table. Redundancy causes database management problems. When related values are updated, all copies must be changed, which makes queries slow and complex. If copies are not updated uniformly, the copies become inconsistent and the correct version is uncertain. Normal forms are rules for designing tables with less redundancy. Normal forms are numbered, first through fifth. An additional normal form, Boyce-Codd, is an improved version of third normal form. The six normal forms comprise a sequence, with each successive normal form allowing less redundancy. First normal form Every cell of a table contains exactly one value. A table is in first normal form when, in addition, the table has a primary key. This definition has two corollaries: In a first normal form table, every non-key column depends on the primary key. Each primary key value appears in exactly one row, and each non-key cell contains exactly one value. So each primary key value is related to exactly one non-key value. ● A first normal form table has no duplicate rows. Every row contains a different primary key value and therefore every row is different. Second normal form A table is in second normal form when all non-key columns depend on the whole primary key. In other words, a non-key column cannot depend on part of a composite primary key. A table with a simple primary key is automatically in second normal form. ● Third normal form Redundancy can occur in a second normal form table when a non-key column depends on another non-key column. Informally, a table is in third normal form when all non-key columns depend on the key, the whole key, and nothing but the key. 4.11 BOYCE-CODD NORMAL FORM In a Boyce-Codd normal form table, all dependencies are on unique columns. Dependence on a unique column never creates redundancy, so Boyce-Codd normal form eliminates all redundancy arising from functional dependence. A candidate key is a simple or composite column that is unique and minimal. Minimal means all columns are necessary for uniqueness. A table may have several candidate keys. The database designer designates one candidate key as the primary key. A non-key column is a column that is not contained in a candidate key. A table is in third normal form if, whenever a non-key column A depends on column B, then B is unique. Columns A and B may be simple or composite. Although B is unique, B is not necessarily minimal and therefore is not necessarily a candidate key. Boyce-Codd normal form The definition of third normal form applies to non-key columns only, which allows for occasional redundancy. Boyce-Codd normal form applies to all columns and eliminates this redundancy. A table is in Boyce-Codd normal form if, whenever column A depends on column B, then B is unique. Columns A and B may be simple or composite. This definition is identical to the definition of third normal form with the term 'non-key' removed. Boyce-Codd normal form is considered the gold standard of table design. Trivial dependencies When the columns of A are a subset of the columns of B, A always depends on B. Ex: FareClass depends on (FlightCode, FareClass). These dependencies are called trivial. 4.12 APPLYING NORMAL FORM Normalization Implementing entities, relationships, and attributes usually generates tables with no redundancy. Occasionally, however, implementation results in redundant tables. This redundancy is eliminated with normalization, the last step of logical design. Normalization eliminates redundancy by decomposing a table into two or more tables in higher normal form. Ex: A table in first normal form might be replaced by two tables in third normal form. In principle, normalization decomposes tables to any higher normal form. Column A depends on column B when each B value is related to at most one A value. A and B may be simple or composite columns. In a Boyce-Codd normal form table, if column A depends on column B, then B must be unique. Normalizing a table to Boyce-Codd normal form involves three steps: 8. List all unique columns. Unique columns may be simple or composite. In composite columns, remove any columns that are not necessary for uniqueness. The primary key is unique and therefore always on this list. 9. Identify dependencies on non-unique columns. Non-unique columns are either external to all unique columns or contained within a composite unique column. 10. Eliminate dependencies on non-unique columns. If column A depends on a non-unique column B, A is removed from the original table. A new table is created containing A and B. B is a primary key in the new table and a foreign key in the original table. Normalization eliminates redundancy by removing A from the original table. Since the data relating A and B is recorded in a new table, no information is lost. Denormalization Boyce-Codd normal form is ideal for tables with frequent inserts, updates, and deletes. In a database used primarily for reporting, changes are infrequent and redundancy is acceptable. In fact, redundancy can be desirable in reporting databases, as processing is faster and queries are simpler. Therefore, reporting databases may contain tables that, by design, are not in third normal form. Denormalization means intentionally introducing redundancy by merging tables. Denormalization eliminates join queries and therefore improves query performance. Denormalization results in first and second normal form tables and should be applied selectively and cautiously. Database design As tables and keys are specified, the database designer reviews each table for Boyce-Codd normal form. Dependencies and unique columns are identified. If any dependencies are not on unique columns, the table is decomposed into smaller tables in Boyce-Codd normal form. Tables that experience infrequent inserts, updates, and deletes may be denormalized to simplify and accelerate SELECT queries.