Chapter5- Data Modelling & Design.pdf

C H AP T ER 5 Data Modeling and Design 1. Introduction D ata modeling is the process of discovering, analyzing, and scoping data requirements, and then representing and communicating these data requirements in a precise form called the data model. Data modeling is a critical component of data management. The modeling process requires that organizations discover and document how their data fits together. The modeling process itself designs how data fits together (Simsion, 2013). Data models depict and enable an organization to understand its data assets. 123 Order 11611 by LEXIE MAY on August 25, 2017 124 D MBO K 2 There are a number of different schemes used to represent data. The six most commonly used schemes are: Relational, Dimensional, Object-Oriented, Fact-Based, Time-Based, and NoSQL. Models of these schemes exist at three levels of detail: conceptual, logical, and physical. Each model contains a set of components. Examples of components are entities, relationships, facts, keys, and attributes. Once a model is built, it needs to be reviewed and once approved, maintained. Data Modeling and Design Definition: Data modeling is the process of discovering, analyzing, and scoping data requirements, and then representing and communicating these data requirements in a precise form called the data model.This process is iterative and may include a conceptual, logical, and physical model. Goal: To confirm and document an understanding of different perspectives, which leads to applications that more closely align with current and future business requirements, and creates a foundation to successfully complete broad-scoped initiatives such as master data management and data governance programs. Business Drivers Activities: 1. Plan for Data Modeling (P) 2. Build the Data Models (D) 1. Create the Conceptual Data Model 2. Create the Logical Data Model 3. Create the Physical Data Model 3. Review the Data Models (C) Inputs: Existing data models and databases Data standards Data sets Initial data requirements Original data requirements Data architecture Enterprise taxonomy 4. Suppliers: Consumers: Business Analysts Data Modelers Technical Drivers Tools: Naming conventions Database design Data modeling tools Lineage tools Metadata repositories Data model patterns Database type selection Industry data models Techniques: Physical Data Model Manage the Data Models (O) Participants: Business Professionals Business Analysts Data Architects Database Administrators and Developers Subject Matter Experts Data Stewards Metadata Administrators Deliverables: Conceptual Data Model Logical Data Model Business Analysts Data Modelers Database Administrators and Developers Software Developers Data Stewards Data Quality Analysts Data Consumers Metrics: Data model validation measurement (P) Planning, (C) Control, (D) Development, (O) Operations Figure 28 Context Diagram: Data Modeling and Design Order 11611 by LEXIE MAY on August 25, 2017 D A T A MO D E LI NG A ND D E SI GN 125 Data models comprise and contain Metadata essential to data consumers. Much of this Metadata uncovered during the data modeling process is essential to other data management functions. For example, definitions for data governance and lineage for data warehousing and analytics. This chapter will describe the purpose of data models, the essential concepts and common vocabulary used in data modeling, and data modeling goals and principles. It will use a set of examples from data related to education to illustrate how data models work and to show differences between them. 1.1 Business Drivers Data models are critical to effective management of data. They: Provide a common vocabulary around data Capture and document explicit knowledge about an data and systems Serve as a primary communications tool during projects Provide the starting point for customization, integration, or even replacement of an application 1.2 Goals and Principles The goal of data modeling is to confirm and document understanding of different perspectives, which leads to applications that more closely align with current and future business requirements, and creates a foundation to successfully complete broad-scoped initiatives such as Master Data Management and data governance programs. Proper data modeling leads to lower support costs and increases the reusability opportunities for future initiatives, thereby reducing the costs of building new applications. Data models are an important form of Metadata. Confirming and documenting understanding of different perspectives facilitates: Formalization: A data model documents a concise definition of data structures and relationships. It enables assessment of how data is affected by implemented business rules, for current as-is states or desired target states. Formal definition imposes a disciplined structure to data that reduces the possibility of data anomalies occurring when accessing and persisting data. By illustrating the structures and relationships in the data, a data model makes data easier to consume. Scope definition: A data model can help explain the boundaries for data context and implementation of purchased application packages, projects, initiatives, or existing systems. Knowledge retention/documentation: A data model can preserve corporate memory regarding a system or project by capturing knowledge in an explicit form. It serves as documentation for future projects to use as the as-is version. Data models help us understand an organization or business area, an existing application, or the impact of modifying an existing data structure. The data model becomes a reusable map to help business professionals, project managers, analysts, modelers, and developers Order 11611 by LEXIE MAY on August 25, 2017 126 D MBO K 2 understand data structure within the environment. In much the same way as the mapmaker learned and documented a geographic landscape for others to use for navigation, the modeler enables others to understand an information landscape (Hoberman, 2009). 1.3 Essential Concepts This section will explain the different types of data that can be modeled, the component pieces of data models, the types of data models that can be developed, and the reasons for choosing different types in different situations. This set of definitions is extensive, in part, because data modeling itself is about the process of definition. It is important to understand the vocabulary that supports the practice. 1.3.1 Data Modeling and Data Models Data modeling is most frequently performed in the context of systems development and maintenance efforts, known as the system development lifecycle (SDLC). Data modeling can also be performed for broad-scoped initiatives (e.g., Business and Data Architecture, Master Data Management, and data governance initiatives) where the immediate end result is not a database but an understanding of organizational data. A model is a representation of something that exists or a pattern for something to be made. A model can contain one or more diagrams. Model diagrams make use of standard symbols that allow one to understand content. Maps, organization charts, and building blueprints are examples of models in use every day. organization understands it, or as the organization wants it to be. A data model contains a set of symbols with text labels that attempts visually to represent data requirements as communicated to the data modeler, for a specific set of data that can range in size from small, for a project, to large, for an organization. The model is a form of documentation for data requirements and data definitions resulting from the modeling process. Data models are the main medium used to communicate data requirements from business to IT and within IT from analysts, modelers, and architects, to database designers and developers. 1.3.2 Types of Data that are Modeled Four main types of data can be modeled (Edvinsson, 2013). The types of data being modeled in any given organization reflect the priorities of the organization or the project that requires a data model: Category information: Data used to classify and assign types to things. For example, customers classified by market categories or business sectors; products classified by color, model, size, etc.; orders classified by whether they are open or closed. Order 11611 by LEXIE MAY on August 25, 2017 D A T A MO D E LI NG A ND D E SI GN 127 Resource information: Basic profiles of resources needed conduct operational processes such as Product, Customer, Supplier, Facility, Organization, and Account. Among IT professionals, resource entities are sometimes referred to as Reference Data. Business event information: Data created while operational processes are in progress. Examples include Customer Orders, Supplier Invoices, Cash Withdrawal, and Business Meetings. Among IT professionals, event entities are sometimes referred to as transactional business data. Detail transaction information: Detailed transaction information is often produced through point-ofsale systems (either in stores or online). It is also produced through social media systems, other Internet interactions (clickstream, etc.), and by sensors in machines, which can be parts of vessels and vehicles, industrial components, or personal devices (GPS, RFID, Wi-Fi, etc.). This type of detailed information can be aggregated, used to derive other data, and analyzed for trends, similar to how the business information events are used. This type of data (large volume and/or rapidly changing) is usually referred to as Big Data. These types refer Data in motion can also be modeled, for example, in schemes for systems, including protocols, and schemes for messaging and event-based systems. 1.3.3 Data Model Components As will be discussed later in the chapter, different types of data models represent data through different conventions (See Section 1.3.4). However, most data models contain the same basic building blocks: entities, relationships, attributes, and domains. 1.3.3.1 Entity Outside of data modeling, the definition of entity is a thing that exists separate from other things. Within data modeling, an entity is a thing about which an organization collects information. Entities are sometimes referred to as the nouns of an organization. An entity can be thought of as the answer to a fundamental question who, what, when, where, why, or how or to a combination of these questions (see Chapter 4). Table 7 defines and gives examples of commonly used entity categories (Hoberman, 2009). Table 7 Commonly Used Entity Categories Category Who Definition Person or organization of interest. That is, Who is with a party generalization, or role such as Customer or Vendor. Persons or organizations can have multiple roles or be included in multiple parties. Examples Employee, Patient, Player, Suspect, Customer, Vendor, Student, Passenger, Competitor, Author Order 11611 by LEXIE MAY on August 25, 2017 128 D MBO K 2 Category Definition Examples What Product or service of interest to the enterprise. It often refers to what the organization makes or what service it provides. That is, What is important to the business? Attributes for categories, types, etc. are very important here. Product, Service, Raw Material, Finished Good, Course, Song, Photograph, Book When Calendar or time interval of interest to the enterprise. That is, When is the business in operation? Time, Date, Month, Quarter, Year, Calendar, Semester, Fiscal Period, Minute, Departure Time Where Location of interest to the enterprise. Location can refer to actual places as well as electronic places. That is, Where is business conducted? Mailing Address, Distribution Point, Website URL, IP Address Why Event or transaction of interest to the enterprise. These events keep the business afloat. That is, Why is the business in business? Order, Return, Complaint, Withdrawal, Deposit, Compliment, Inquiry, Trade, Claim How Documentation of the event of interest to the enterprise. Documents provide the evidence that the events occurred, such as a Purchase Order recording an Order event. That is, How do we know that an event occurred? Invoice, Contract, Agreement, Account, Purchase Order, Speeding Ticket, Packing Slip, Trade Confirmation Measurement Counts, sums, etc. of the other categories (what, where) at or over points in time (when). Sales, Item Count, Payments, Balance 1.3.3.1.1 Entity Aliases The generic term entity can go by other names. The most common is entity-type, as a type of something is being represented (e.g., Jane is of type Employee), therefore Jane is the entity and Employee is the entity type. However, in widespread use today is using the term entity for Employee and entity instance for Jane. Table 8 Entity, Entity Type, and Entity Instance Usage Entity Common Use Jane Recommended Use Employee Entity Type Entity Instance Employee Jane Entity instances are the occurrences or values of a particular entity. The entity Student may have multiple student instances, with names Bob Jones, Joe Jackson, Jane Smith, and so forth. The entity Course can have instances of Data Modeling Fundamentals, Advanced Geology, and English Literature in the 17 th Century. Entity aliases can also vary based on scheme. (Schemes will be discussed in Section 1.3.4.) In relational schemes the term entity is often used, in dimensional schemes the terms dimension and fact table are often used, Order 11611 by LEXIE MAY on August 25, 2017 D A T A MO D E LI NG A ND D E SI GN 129 in object-oriented schemes the terms class or object are often used, in time-based schemes the terms hub, satellite, and link are often used, and in NoSQL schemes terms such as document or node are used. Entity aliases can also vary based on level of detail. (The three levels of detail will be discussed in Section 1.3.5.) An entity at the conceptual level can be called a concept or term, an entity at the logical level is called an entity (or a different term depending on the scheme), and at the physical level the terms vary based on database technology, the most common term being table. 1.3.3.1.2 Graphic Representation of Entities In data models, entities are generally depicted as rectangles (or rectangles with rounded edges) with their names inside, such as in Figure 29, where there are three entities: Student, Course, and Instructor. Student Course Instructor Figure 29 Entities 1.3.3.1.3 Definition of Entities Entity definitions are essential contributors to the business value of any data model. They are core Metadata. High quality definitions clarify the meaning of business vocabulary and provide rigor to the business rules governing entity relationships. They assist business and IT professionals in making intelligent business and application design decisions. High quality data definitions exhibit three essential characteristics: Clarity: The definition should be easy to read and grasp. Simple, well-written sentences without obscure acronyms or unexplained ambiguous terms such as sometimes or normally. Accuracy: The definition is a precise and correct description of the entity. Definitions should be reviewed by experts in the relevant business areas to ensure that they are accurate. Completeness: All of the parts of the definition are present. For example, in defining a code, examples of the code values are included. In defining an identifier, the scope of uniqueness in included in the definition. 1.3.3.2 Relationship A relationship is an association between entities (Chen, 1976). A relationship captures the high-level interactions between conceptual entities, the detailed interactions between logical entities, and the constraints between physical entities. Order 11611 by LEXIE MAY on August 25, 2017 130 D MBO K 2 1.3.3.2.1 Relationship Aliases The generic term relationship can go by other names. Relationship aliases can vary based on scheme. In relational schemes the term relationship is often used, dimensional schemes the term navigation path is often used, and in NoSQL schemes terms such as edge or link are used, for example. Relationship aliases can also vary based on level of detail. A relationship at the conceptual and logical levels is called a relationship, but a relationship at the physical level may be called by other names, such as constraint or reference, depending on the database technology. 1.3.3.2.2 Graphic Representation of Relationships Relationships are shown as lines on the data modeling diagram. See Figure 30 for an Information Engineering example. Instructor Teach Student Attend Course Figure 30 Relationships In this example, the relationship between Student and Course captures the rule that a Student may attend Courses. The relationship between Instructor and Course captures the rule than an Instructor may teach Courses. The symbols on the line (called cardinality) capture the rules in a precise syntax. (These will be explained in Section 1.3.3.2.3.) A relationship is represented through foreign keys in a relational database and through alternative methods for NoSQL databases such as through edges or links. 1.3.3.2.3 Relationship Cardinality In a relationship between two entities, cardinality captures how many of one entity (entity instances) participates in the relationship with how many of the other entity. Cardinality is represented by the symbols that appear on both ends of a relationship line. Data rules are specified and enforced through cardinality. Without cardinality, the most one can say about a relationship is that two entities are connected in some way. For cardinality, the choices are simple: zero, one, or many. Each side of a relationship can have any combination of zero, one, or many to capture whether or not an entity instance is required in a relationship. Specifying one or many allows us to capture how many of a particular instance participates in a given relationship. Order 11611 by LEXIE MAY on August 25, 2017 D A T A MO D E LI NG A ND D E SI GN 131 These cardinality symbols are illustrated in the following information engineering example of Student and Course. Attend Student Course Figure 31 Cardinality Symbols The business rules are: Each Student may attend one or many Courses. Each Course may be attended by one or many Students. 1.3.3.2.4 Arity of Relationships relationship. The most common are unary, binary, and ternary relationships. 1.3.3.2.4.1 Unary (Recursive) Relationship A unary (also known as a recursive or self-referencing) relationship involves only one entity. A one-to-many recursive relationship describes a hierarchy, whereas a many-to-many relationship describes a network or graph. In a hierarchy, an entity instance has at most one parent (or higher-level entity). In relational modeling, child entities are on the many side of the relationship, with parent entities on the one side of the relationship. In a network, an entity instance can have more than one parent. For example, a Course can require prerequisites. If, in order to take the Biology Workshop, one would first need to complete the Biology Lecture, the Biology Lecture is the prerequisite for the Biology Workshop. In the following relational data models, which use information engineering notation, one can model this recursive relationship as either a hierarchy or network: Require as a pre-requisite Figure 32 Unary Relationship - Hierarchy Require as a pre-requisite Course Figure 33 Unary Relationship - Network Order 11611 by LEXIE MAY on August 25, 2017 132 D MBO K 2 This first example (Figure 32) is a hierarchy and the second (Figure 33) is a network. In the first example, the Biology Workshop requires first taking the Biology Lecture and the Chemistry Lecture. Once the Biology Lecture is chosen as the prerequisite for the Biology Workshop, the Biology Lecture cannot be the prerequisite for any other courses. The second example allows the Biology Lecture to be the prerequisite for other courses as well. 1.3.3.2.4.2 Binary Relationship An arity of two is also known as binary. A binary relationship, the most common on a traditional data model diagram, involves two entities. Figure 34, a UML class diagram, shows that both Student and Course are entities participating in a binary relationship. Figure 34 Binary Relationship 1.3.3.2.4.3 Ternary Relationship An arity of three, known as ternary, is a relationship that includes three entities. An example in fact-based modeling (object-role notation) appears in Figure 35. Here Student can register for a particular Course in a given Semester. Figure 35 Ternary Relationship 1.3.3.2.5 Foreign Key A foreign key is used in physical and sometimes logical relational data modeling schemes to represent a relationship. A foreign key may be created implicitly when a relationship is defined between two entities, depending on the database technology or data modeling tool, and whether the two entities involved have mutual dependencies. In the example shown in Figure 36, Registration contains two foreign keys, Student Number from Student and Course Code from Course. Foreign keys appear in the entity on the many side of the relationship, often called the child entity. Student and Course are parent entities and Registration is the child entity. Order 11611 by LEXIE MAY on August 25, 2017 D A T A MO D E LI NG A ND D E SI GN Student 133 Registration Student Number (FK) Student Number Registration Date Have registered Course Name Figure 36 Foreign Keys 1.3.3.3 Attribute An attribute is a property that identifies, describes, or measures an entity. Attributes may have domains, which will be discussed in Section 1.3.3.4. The physical correspondent of an attribute in an entity is a column, field, tag, or node in a table, view, document, graph, or file. 1.3.3.3.1 Graphic Representation of Attributes In data models, attributes are generally depicted as a list within the entity rectangle, as shown in Figure 37, where the attributes of the entity Student include Student Number, Student First Name, Student Last Name, and Student Birth Date. Student Student Number Student First Name Student Last Name Student Birth Date Figure 37 Attributes 1.3.3.3.2 Identifiers An identifier (also called a key) is a set of one or more attributes that uniquely defines an instance of an entity. This section defines types of keys by construction (simple, compound, composite, surrogate) and function (candidate, primary, alternate). 1.3.3.3.2.1 Construction-type Keys A simple key is one attribute that uniquely identifies an entity instance. Universal Product Codes (UPCs) and Vehicle Identification Numbers (VINs) are examples of simple keys. A surrogate key is also an example of a simple key. A surrogate key is a unique identifier for a table. Often a counter and always system-generated without intelligence, a surrogate key is an integer whose meaning is unrelated to its face value. (In other words, a Month Identifier of 1 cannot be assumed to represent January.) Surrogate keys serve technical functions and should not be visible to end users of a database. They remain behind the scenes to help maintain uniqueness, allow for more efficient navigation across structures, and facilitate integration across applications. Order 11611 by LEXIE MAY on August 25, 2017 134 D MBO K 2 A compound key is a set of two or more attributes that together uniquely identify an entity instance. Examples are US phone number (area code + exchange + local number) and credit card number (issuer ID + account ID + check digit). A composite key contains one compound key and at least one other simple or compound key or non-key attribute. An example is a key on a multi-dimensional fact table, which may contain several compound keys, simple keys, and optionally a load timestamp. 1.3.3.3.2.2 Function-type Keys A super key is any set of attributes that uniquely identify an entity instance. A candidate key is a minimal set of one or more attributes (i.e., a simple or compound key) that identifies the entity instance to which it belongs. Minimal means that no subset of the candidate key uniquely identifies the entity instance. An entity may have multiple candidate keys. Examples of candidate keys for a customer entity are email address, cell phone number, and customer account number. Candidate keys can be business keys (sometimes called natural keys). A business key is one or more attributes that a business professional would use to retrieve a single entity instance. Business keys and surrogate keys are mutually exclusive. A primary key is the candidate key that is chosen to be the unique identifier for an entity. Even though an entity may contain more than one candidate key, only one candidate key can serve as the primary key for an entity. An alternate key is a candidate key that although unique, was not chosen as the primary key. An alternate key can still be used to find specific entity instances. Often the primary key is a surrogate key and the alternate keys are business keys. 1.3.3.3.2.3 Identifying vs. Non-Identifying Relationships An independent entity is one where the primary key contains only attributes that belong to that entity. A dependent entity is one where the primary key contains at least one attribute from another entity. In relational schemes, most notations depict independent entities on the data modeling diagram as rectangles and dependent entities as rectangles with rounded corners. In the student example shown in Figure 38, Student and Course are independent entities and Registration is a dependent entity. Student Registration Student Number Student Number (FK) Student First Name Student Last Name Registration Date Have registered Figure 38 Dependent and Independent Entity Order 11611 by LEXIE MAY on August 25, 2017 Course Name D A T A MO D E LI NG A ND D E SI GN 135 Dependent entities have at least one identifying relationship. An identifying relationship is one where the primary key of the parent (the entity on the one side of the relationship) is migrated as a foreign key to the ith the relationship from Student to Registration, and from Course to Registration. In non-identifying relationships, the primary key of the parent is migrated as a non-primary foreign key attribute to the child. 1.3.3.4 Domain In data modeling, a domain is the complete set of possible values that an attribute can be assigned. A domain may be articulated in different ways (see points at the end of this section). A domain provides a means of standardizing the characteristics of the attributes. For example, the domain Date, which contains all possible valid dates, can be assigned to any date attribute in a logical data model or date columns/fields in a physical data model, such as: EmployeeHireDate OrderEntryDate ClaimSubmitDate CourseStartDate All values inside the domain are valid values. Those outside the domain are referred to as invalid values. An attribute should not contain values outside of its assigned domain. EmployeeGenderCode, for example, may be limited to the domain of female and male. The domain for EmployeeHireDate may be defined simply as valid dates. Under this rule, the domain for EmployeeHireDate does not include February 30 of any year. One can restrict a domain with additional rules, called constraints. Rules can relate to format, logic, or both. For example, by restricting the EmployeeHireDate March 10, 2050 from the domain of valid values, even though it is a valid date. EmployeeHireDate could also be restricted to days in a typical workweek (e.g., dates that fall on a Monday, Tuesday, Wednesday, Thursday, or Friday). Domains can be defined in different ways. Data Type: Domains that specify the standard types of data one can have in an attribute assigned to that domain. For example, Integer, Character(30), and Date are all data type domains. Data Format: Domains that use patterns including templates and masks, such as are found in postal codes and phone numbers, and character limitations (alphanumeric only, alphanumeric with certain special characters allowed, etc.) to define valid values. List: Domains that contain a finite set of values. These are familiar to many people from functionality like dropdown lists. For example, the list domain for OrderStatusCode can restrict values to only {Open, Shipped, Closed, Returned}. Order 11611 by LEXIE MAY on August 25, 2017 136 D MBO K 2 Range: Domains that allow all values of the same data type that are between one or more minimum and/or maximum values. Some ranges can be open-ended. For example, OrderDeliveryDate must be between OrderDate and three months in the future. Rule-based: Domains defined by the rules that values must comply with in order to be valid. These include rules comparing values to calculated values or other attribute values in a relation or set. For example, ItemPrice must be greater than ItemCost. 1.3.4 Data Modeling Schemes The six most common schemes used to represent data are: Relational, Dimensional, Object-Oriented, FactBased, Time-Based, and NoSQL. Each scheme uses specific diagramming notations (see Table 9). Table 9 Modeling Schemes and Notations Scheme Relational Dimensional Object-Oriented Fact-Based Time-Based NoSQL Sample Notations Information Engineering (IE) Integration Definition for Information Modeling (IDEF1X) Barker Notation Chen Dimensional Unified Modeling Language (UML) Object Role Modeling (ORM or ORM2) Fully Communication Oriented Modeling (FCO-IM) Data Vault Anchor Modeling Document Column Graph Key-Value This section will briefly explain each of these schemes and notations. The use of schemes depends in part on the database being built, as some are suited to particular technologies, as shown in Table 10. For the relational scheme, all three levels of models can be built for RDBMS, but only conceptual and logical models can be built for the other types of databases. This is true for the fact-based scheme as well. For the dimensional scheme, all three levels of models can be built for both RDBMS and MDBMS databases. The object-oriented scheme works well for RDBMS and object databases. The time-based scheme is a physical data modeling technique primarily for data warehouses in a RDBMS environment. The NoSQL scheme is heavily dependent on the underlying database structure (document, column, graph, or key-value), and is therefore a physical data modeling technique. Table 10 illustrates several important points including that even with a non-traditional database such as one that is document-based, a relational CDM and LDM can be built followed by a document PDM. Order 11611 by LEXIE MAY on August 25, 2017 D A T A MO D E LI NG A ND D E SI GN 137 Table 10 Scheme to Database Cross Reference Scheme Relational Dimensional Object-Oriented Fact-Based Time-Based NoSQL CDM LDM PDM CDM LDM PDM CDM LDM PDM CDM LDM PDM PDM CDM LDM CDM LDM CDM LDM CDM LDM CDM LDM CDM LDM CDM LDM PDM CDM LDM CDM LDM CDM LDM CDM LDM CDM LDM PDM PDM PDM PDM PDM CDM LDM PDM CDM LDM 1.3.4.1 Relational First articulated by Dr. Edward Codd in 1970, relational theory provides a systematic way to organize data so that they reflected their meaning (Codd, 1970). This approach had the additional effect of reducing redundancy data could most effectively be managed in terms of two-dimensional relations. The term relation was derived from the mathematics (set theory) upon which his approach was based. (See Chapter 6.) The design objectives for the relational model are to have an exact expression of business data and to have one fact in one place (the removal of redundancy). Relational modeling is ideal for the design of operational systems, which require entering information quickly and having it stored accurately (Hay, 2011). There are several different kinds of notation to express the association between entities in relational modeling, including Information Engineering (IE), Integration Definition for Information Modeling (IDEF1X), Barker Notation, and Chen Notation. The depict cardinality. (See Figure 39.) Student Attend Course Figure 39 IE Notation Order 11611 by LEXIE MAY on August 25, 2017 138 D MBO K 2 1.3.4.2 Dimensional The concept of dimensional modeling started from a joint research project conducted by General Mills and 33 Dartmouth College in the In dimensional models, data is structured to optimize the query and analysis of large amounts of data. In contrast, operational systems that support transaction processing are optimized for fast processing of individual transactions. Dimensional data models capture business questions focused on a particular business process. The process being measured on the dimensional model in Figure 40 is Admissions. Admissions can be viewed by the Zone the student is from, School Name, Semester, and whether the student is receiving financial aid. Navigation can be made from Zone up to Region and Country, from Semester up to Year, and from School Name up to School Level. Geography Country Region Zone Calendar Year Semester Admissions Name Level School Yes/No Financial Aid Figure 40 Axis Notation for Dimensional Models The diagramming notation used to build this model can be a very effective communication tool with those who prefer not to read traditional data modeling syntax. Both the relational and dimensional conceptual data models can be based on the same business process (as in this example with Admissions). The difference is in the meaning of the relationships, where on the relational model the relationship lines capture business rules, and on the dimensional model, they capture the navigation paths needed to answer business questions. http://bit.ly/2tsSP7w. Order 11611 by LEXIE MAY on August 25, 2017 D A T A MO D E LI NG A ND D E SI GN 139 1.3.4.2.1 Fact Tables Within a dimensional scheme, the rows of a fact table correspond to particular measurements and are numeric, such as amounts, quantities, or counts. Some measurements are the results of algorithms, in which case Metadata is critical to proper understanding and usage. Fact tables take up the most space in the database (90% is a reasonable rule of thumb), and tend to have large numbers of rows. 1.3.4.2.2 Dimension Tables Dimension tables represent the important objects of the business and contain mostly textual descriptions. Dimensions serve as the primary source acting as the entry points or links into the fact tables. Dimensions are typically highly denormalized and typically account for about 10% of the total data. Dimensions must have a unique identifier for each row. The two main approaches to identifying keys for dimension tables are surrogate keys and natural keys. Dimensions also have attributes that change at different rates. Slowly changing dimensions (SCDs) manage changes based on the rate and type of change. The three main types of change are sometimes known by ORC. Overwrite (Type 1): The new value overwrites the old value in place. New Row (Type 2): The new values are written in a new row, and the old row is marked as not current. New Column (Type 3): Multiple instances of a value are listed in columns on the same row, and a new value means writing the values in the series one spot down to make space at the front for the new value. The last value is discarded. 1.3.4.2.3 Snowflaking Snowflaking is the term given to normalizing the flat, single-table, dimensional structure in a star schema into the respective component hierarchical or network structures. 1.3.4.2.4 Grain The term grain stands for the meaning or description of a single row of data in a fact table; this is the most detail any row will have. Defining the grain of a fact table is one of the key steps in dimensional design. For example, if a dimensional model is measuring the student registration process, the grain may be student, day, and class. Order 11611 by LEXIE MAY on August 25, 2017 140 D MBO K 2 1.3.4.2.5 Conformed Dimensions Conformed dimensions are built with the entire organization in mind instead of just a particular project; this allows these dimensions to be shared across dimensional models, due to containing consistent terminology and values. For example, if Calendar is a conformed dimension, a dimensional model built to count student applicants by Semester will contain the same values and definition of Semester as a dimensional model built to count student graduates. 1.3.4.2.6 Conformed Facts Conformed facts use standardized definitions of terms across individual marts. Different business users may use different concepts across organizations, or conversely things that are named differently but are actually the same concept across organizations. 1.3.4.3 Object-Oriented (UML) The Unified Modeling Language (UML) is a graphical language for modeling software. The UML has a variety of notations of which one (the class model) concerns databases. The UML class model specifies classes (entity types) and their relationship types (Blaha, 2013). Class Name Attributes Operations Student Stdntno : integer Strtdt: date Prgm: text ExpctGraddt: date ActlGraddt: date Figure 41 UML Class Model Figure 41 illustrates the characteristics of a UML Class Model: A Class diagram resembles an ER diagram except that the Operations or Methods section is not present in ER. In ER, the closest equivalent to Operations would be Stored Procedures. Attribute types (e.g., Date, Minutes) are expressed in the implementable application code language and not in the physical database implementable terminology. Default values can be optionally shown in the notation. Access to data is through the ncapsulation or data hiding is based on a and the instances that it maintains are exposed through Operations. Order 11611 by LEXIE MAY on August 25, 2017 D A T A MO D E LI NG A ND D E SI GN 141 business logic because it still needs to be sequenced and timed. In ER terms, the table has stored procedures/triggers. Class Operations can be: Public: Externally visible Internally Visible: Visible to children Objects Private: Hidden In comparison, ER Physical models only offer Public access; all data is equally exposed to processes, queries, or manipulations. 1.3.4.4 Fact-Based Modeling (FBM) Fact-Based Modeling, a family of conceptual modeling languages, originated in the late 1970s. These languages are based in the analysis of natural verbalization (plausible sentences) that might occur in the business domain. Fact-based languages view the world in terms of objects, the facts that relate or characterize those objects, and each role that each object plays in each fact. An extensive and powerful constraint system relies on fluent automatic verbalization and automatic checking against the concrete examples. Fact-based models do not use attributes, reducing the need for intuitive or expert judgment by expressing the exact relationships between objects (both entities and values). The most widely used of the FBM variants is Object Role Modeling (ORM), which was formalized as a first-order logic by Terry Halpin in 1989. 1.3.4.4.1 Object Role Modeling (ORM or ORM2) required information or queries presented in any external formulation familiar to users, and then verbalizes these examples at the conceptual level, in terms of simple facts expressed in a controlled natural language. This language is a restricted version of natural language that is unambiguous, so the semantics are readily grasped by humans; it is also formal, so it can be used to automatically map the structures to lower levels for implementation (Halpin, 2015). Figure 42 illustrates an ORM model. Semester Student Course in enrolled in Figure 42 ORM Model Order 11611 by LEXIE MAY on August 25, 2017 142 D MBO K 2 1.3.4.4.2 Fully Communication Oriented Modeling (FCO-IM) FCO-IM is similar in notation and approach to ORM. The numbers in Figure 43 are references to verbalizations of facts. For example, 2 Semester Student 4 5 6 Course Attendance Figure 43 FCO-IM Model 1.3.4.5 Time-Based Time-based patterns are used when data values must be associated in chronological order and with specific time values. 1.3.4.5.1 Data Vault The Data Vault is a detail-oriented, time-based, and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach, encompassing the best of breed between third normal form (3NF, to be discussed in Section 1.3.6) and star schema. Data Vaults are designed specifically to meet the needs of enterprise data warehouses. There are three types of entities: hubs, links, and satellites. The Data Vault design is focused around the functional areas of business with the hub representing the primary key. The links provide transaction integration between the hubs. The satellites provide the context of the hub primary key (Linstedt, 2012). In Figure 44, Student and Course are hubs, which represent the main concepts within a subject. Attendance is a link, which relates two hubs to each other. Student Contact, Student Characteristics, and Course Description are satellites that provide the descriptive information on the hub concepts and can support varying types of history. Anchor Modeling is a technique suited for information that changes over time in both structure and content. It provides graphical notation used for conceptual modeling similar to traditional data modeling, with extensions for working with temporal data. Anchor Modeling has four basic modeling concepts: anchors, attributes, ties, Order 11611 by LEXIE MAY on August 25, 2017 D A T A MO D E LI NG A ND D E SI GN 143 and knots. Anchors model entities and events, attributes model properties of anchors, ties model the relationships between anchors, and knots are used to model shared properties, such as states. Student Student Contact Attendance Student Characteristics Course Course Description Figure 44 Data Vault Model 1.3.4.5.2 Anchor Modeling On the anchor model in Figure 45, Student, Course, and Attendance are anchors, the gray diamonds represent ties, and the circles represent attributes. Figure 45 Anchor Model 1.3.4.6 NoSQL NoSQL is a name for the category of databases built on non-relational technology. Some believe that NoSQL is not a good name for what it represents, as it is less about how to query the database (which is where SQL comes in) and more about how the data is stored (which is where relational structures comes in). There are four main types of NoSQL databases: document, key-value, column-oriented, and graph. Order 11611 by LEXIE MAY on August 25, 2017 144 D MBO K 2 1.3.4.6.1 Document Instead of taking a business subject and breaking it up into multiple relational structures, document databases frequently store the business subject in one structure called a document. For example, instead of storing Student, Course, and Registration information in three distinct relational structures, properties from all three will exist in a single document called Registration. 1.3.4.6.2 Key-value Keyfeature of storing both simple (e.g., dates, numbers, codes) and complex information (unformatted text, video, 1.3.4.6.3 Column-oriented Out of the four types of NoSQL databases, column-oriented is closest to the RDBMS. Both have a similar way of looking at data as rows and values. The difference, though, is that RDBMSs work with a predefined structure and simple data types, such as amounts and dates, whereas column-oriented databases, such as Cassandra, can work with more complex data types including unformatted text and imagery. In addition, column-oriented databases store each column in its own structure. 1.3.4.6.4 Graph A graph database is designed for data whose relations are well represented as a set of nodes with an undetermined number of connections between these nodes. Examples where a graph database can work best are social relations (where nodes are people), public transport links (where nodes could be bus or train stations), or roadmaps (where nodes could be street intersections or highway exits). Often requirements lead to traversing the graph to find the shortest routes, nearest neighbors, etc., all of which can be complex and time-consuming to navigate with a traditional RDMBS. Graph databases include Neo4J, Allegro, and Virtuoso. 1.3.5 Data Model Levels of Detail In 1975, the American National Standards Standards Planning and Requirements Committee (SPARC) published their three-schema approach to database management. The three key components were: Conceptual: This embodies the of the enterprise being modeled in the database. It Order 11611 by LEXIE MAY on August 25, 2017 D A T A MO D E LI NG A ND D E SI GN 145 External: The various users of the database management system operate on sub-sets of the total enterprise model that are relevant to their particular needs. These subsets are represented Internal: The the data is described by the internal schema. This schema describes These three levels most commonly translate into the conceptual, logical, and physical levels of detail, respectively. Within projects, conceptual data modeling and logical data modeling are part of requirements planning and analysis activities, while physical data modeling is a design activity. This section provides an overview of conceptual, logical, and physical data modeling. In addition, each level will be illustrated with examples from two schemes: relational and dimensional. 1.3.5.1 Conceptual A conceptual data model captures the high-level data requirements as a collection of related concepts. It contains only the basic and critical business entities within a given realm and function, with a description of each entity and the relationships between entities. For example, if we were to model the relationship between students and a school, as a relational conceptual data model using the IE notation, it might look like Figure 46. School Contain Student Submit Application Figure 46 Relational Conceptual Model Each School may contain one or many Students, and each Student must come from one School. In addition, each Student may submit one or many Applications, and each Application must be submitted by one Student. The relationship lines capture business rules on a relational data model. For example, Bob the student can attend County High School or Queens College, but cannot attend both when applying to this particular university. In addition, an application must be submitted by a single student, not two and not zero. Recall Figure 40, which is reproduced below as Figure 47. This dimensional conceptual data model using the Axis notation, illustrates concepts related to school: Order 11611 by LEXIE MAY on August 25, 2017 146 D MBO K 2 Geography Country Region Zone Calendar Year Semester Admissions Name Level School Yes/No Financial Aid Figure 47 Dimensional Conceptual Model 1.3.5.2 Logical A logical data model is a detailed representation of data requirements, usually in support of a specific usage context, such as application requirements. Logical data models are still independent of any technology or specific implementation constraints. A logical data model often begins as an extension of a conceptual data model. In a relational logical data model, the conceptual data model is extended by adding attributes. Attributes are assigned to entities by applying the technique of normalization (see Section 1.3.6), as shown in Figure 48. There is a very strong relationship between each attribute and the primary key of the entity in which it resides. For instance, School Name has a strong relationship to School Code. For example, each value of a School Code brings back at most one value of a School Name. School School Code School Name Contain Student Student Number School Code (FK) Student First Name Student Last Name Student Birth Date Application Submit Application Number Student Number (FK) Application Submission Date Figure 48 Relational Logical Data Model Order 11611 by LEXIE MAY on August 25, 2017 D A T A MO D E LI NG A ND D E SI GN 147 A dimensional logical data model is in many cases a fully-attributed perspective of the dimensional conceptual data model, as illustrated in Figure 49. Whereas the logical relational data model captures the business rules of a business process, the logical dimensional captures the business questions to determine the health and performance of a business process. Admissions Count in Figure 49 is the measure that answers the business questions related to Admissions. The entities surrounding the Admissions provide the context to view Admissions Count at different levels of granularity, such as by Semester and Year. Country Name Semester Code Figure 49 Dimensional Logical Data Model Order 11611 by LEXIE MAY on August 25, 2017 148 D MBO K 2 1.3.5.3 Physical A physical data model (PDM) represents a detailed technical solution, often using the logical data model as a starting point and then adapted to work within a set of hardware, software, and network tools. Physical data models are built for a particular technology. Relational DBMSs, for example, should be designed with the specific capabilities of a database management system in mind (e.g., IBM DB2, UDB, Oracle, Teradata, Sybase, Microsoft SQL Server, or Microsoft Access). Figure 50 illustrates a relational physical data model. In this data model, School has been denormalized into the Student entity to accommodate a particular technology. Perhaps whenever a Student is accessed, their school information is as well and therefore storing school information with Student is a more performant structure than having two separate structures. STUDENT STUDENT_NUM STUDENT_FIRST_NAM STUDENT_LAST_NAM STUDENT_BIRTH_DT SCHOOL_CD SCHOOL_NAM APPLICATION Submit APPLICATION_NUM STUDENT_NUM (FK) APPLICATION_SUBMISSION_DT Figure 50 Relational Physical Data Model Because the physical data model accommodates technology limitations, structures are often combined (denormalized) to improve retrieval performance, as shown in this example with Student and School. Figure 51 illustrates a dimensional physical data model (usually a star schema, meaning there is one structure for each dimension). Similar to the relational physical data model, this structure has been modified from its logical counterpart to work with a particular technology to ensure business questions can be answered with simplicity and speed. 1.3.5.3.1 Canonical A variant of a physical scheme is a Canonical Model, used for data in motion between systems. This model describes the structure of data being passed between systems as packets or messages. When sending data through web services, an Enterprise Service Bus (ESB), or through Enterprise Application Integration (EAI), the canonical model describes what data structure the sending service and any receiving services should use. These structures should be designed to be as generic as possible to enable re-use and simplify interface requirements. This structure may only be instantiated as a buffer or queue structure on an intermediary messaging system (middleware) to hold message contents temporarily. Order 11611 by LEXIE MAY on August 25, 2017 D A T A MO D E LI NG A ND D E SI GN 149 Country Name Figure 51 Dimensional Physical Data Model 1.3.5.3.2 Views A view is a virtual table. Views provide a means to look at data from one or many tables that contain or reference the actual attributes. A standard view runs SQL to retrieve data at the point when an attribute in the used to simplify queries, control data access, and rename columns, without the redundancy and loss of referential integrity due to denormalization. 1.3.5.3.3 Partitioning Partitioning refers to the process of splitting a table. It is performed to facilitate archiving and to improve retrieval performance. Partitioning can be either vertical (separating groups of columns) or horizontal (separating groups of rows). Vertically split: To reduce query sets, create subset tables that contain subsets of columns. For example, split a customer table in two based on whether the fields are mostly static or mostly volatile (to improve load / index performance), or based on whether the fields are commonly or uncommonly included in queries (to improve table scan performance). Order 11611 by LEXIE MAY on August 25, 2017 150 D MBO K 2 Horizontally split: To reduce query sets, create subset tables using the value of a column as the differentiator. For example, create regional customer tables that contain only customers in a specific region. 1.3.5.3.4 Denormalization Denormalization is the deliberate transformation of normalized logical data model entities into physical tables with redundant or duplicate data structures. In other words, denormalization intentionally puts one attribute in multiple places. There are several reasons to denormalize data. The first is to improve performance by: Combining data from multiple other tables in advance to avoid costly run-time joins Creating smaller, pre-filtered copies of data to reduce costly run-time calculations and/or table scans of large tables Pre-calculating and storing costly data calculations to avoid run-time system resource competition Denormalization can also be used to enforce user security by segregating data into multiple views or copies of tables according to access needs. This process does introduce a risk of data errors due to duplication. Therefore, denormalization is frequently chosen if structures such as views and partitions fall short in producing an efficient physical design. It is good practice to implement data quality checks to ensure that the copies of the attributes are correctly stored. In general, denormalize only to improve database query performance or to facilitate enforcement of user security. Although the term denormalization is used in this section, the process does not apply just to relational data models. For example, one can denormalize in a document database, but it would be called something different such as embedding. In dimensional data modeling, denormalization is called collapsing or combining. If each dimension is collapsed into a single structure, the resulting data model is called a Star Schema (see Figure 51). If the dimensions are not collapsed, the resulting data model is called a Snowflake (See Figure 49). 1.3.6 Normalization Normalization is the process of applying rules in order to organize business complexity into stable data structures. The basic goal of normalization is to keep each attribute in only one place to eliminate redundancy and the inconsistencies that can result from redundancy. The process requires a deep understanding of each a Normalization rules sort attributes according to primary and foreign keys. Normalization rules sort into levels, with each level applying granularity and specificity in search of the correct primary and foreign keys. Each level Order 11611 by LEXIE MAY on August 25, 2017 D A T A MO D E LI NG A ND D E SI GN 151 comprises a separate normal form, and each successive level does not need to include previous levels. Normalization levels include: First normal form (1NF): Ensures each entity has a valid primary key, and every attribute depends on the primary key; removes repeating groups, and ensures each attribute is atomic (not multi-valued). 1NF includes the resolution of many-to-many relationships with an additional entity often called an associative entity. Second normal form (2NF): Ensures each entity has the minimal primary key and that every attribute depends on the complete primary key. Third normal form (3NF): Ensures each entity has no hidden primary keys and that each attribute Boyce / Codd normal form (BCNF): Resolves overlapping composite candidate keys. A candidate key is either a primary attributes primary or alternate keys), and means there are hidden business rules between the keys. Fourth normal form (4NF): Resolves all many-to-many-to-many relationships (and beyond) in pairs until they cannot be broken down into any smaller pieces. Fifth normal form (5NF): Resolves inter-entity dependencies into basic pairs, and all join dependencies use parts of primary keys. The term normalized model usually means the data is in 3NF. Situations requiring BCNF, 4NF, and 5NF occur rarely. 1.3.7 Abstraction Abstraction is the removal of details in such a way as to broaden applicability to a wide class of situations while preserving the important properties and essential nature from concepts or subjects. An example of abstraction is the Party/Role structure, which can be used to capture how people and organizations play certain roles (e.g., employee and customer). Not all modelers or developers are comfortable with, or have the ability to work with abstraction. The modeler needs to weigh the cost of developing and maintaining an abstract structure versus the amount of rework required if the unabstracted structure needs to be modified in the future (Giles 2011). Abstraction includes generalization and specialization. Generalization groups the common attributes and relationships of entities into supertype entities, while specialization separates distinguishing attributes within an entity into subtype entities. This specialization is usually based on attribute values within an entity instance. Subtypes can also be created using roles or classification to separate instances of an entity into groups by function. An example is Party, which can have subtypes of Individual and Organization. Order 11611 by LEXIE MAY on August 25, 2017 152 D MBO K 2 The subtyping relationship implies that all of the properties from the supertype are inherited by the subtype. In the relational example shown in Figure 52, University and High School are subtypes of School. Contain Submit Figure 52 Supertype and Subtype Relationships Subtyping reduces redundancy on a data model. It also makes it easier to communicate similarities across what otherwise would appear to be distinct and separate entities. 2. Activities This section will briefly cover the steps for building conceptual, logical, and physical data models, as well as maintaining and reviewing data models. Both forward engineering and reverse engineering will be discussed. 2.1 Plan for Data Modeling A plan for data modeling contains tasks such as evaluating organizational requirements, creating standards, and determining data model storage. The deliverables of the data modeling process include: Diagram: A data model contains one or more diagrams. The diagram is the visual that captures the requirements in a precise form. It depicts a level of detail (e.g., conceptual, logical, or physical), a scheme (relational, dimensional, object-oriented, fact-based, time-based, or NoSQL), and a notation within that scheme (e.g., information engineering, unified modeling language, object-role modeling). Definitions: Definitions for entities, attributes, and relationships are essential to maintaining the precision on a data model. Issues and outstanding questions: Frequently the data modeling process raises issues and questions that may not be addressed during the data modeling phase. In addition, often the people or groups responsible for resolving these issues or answering these questions reside outside of the group building the data model. Therefore, often a document is delivered that contains the current set of issues and Order 11611 by LEXIE MAY on August 25, 2017 D A T A MO D E LI NG A ND D E SI GN 153 Student leaves and then returns, are they assigned a different Student Number or do they keep their original Student Number Lineage: For physical and so

Chapter5- Data Modelling & Design.pdf

Document Details

Tags

Related

Full Transcript

Upgrade to continue