ITM900 - Midterm Exam Notes - Book 3.pdf
Week 3 - Chapter 1 - Database Systems

Data Versus Information
Data: Raw facts, or facts that have not yet been processed to reveal their meaning to the end user.
Information: The result of processing raw data to reveal its meaning. Information consists of transformed data and facilitates decision making.
Knowledge: The body of information and facts about a specific subject. Knowledge implies familiarity, awareness, and understanding of information as it applies to an environment. A key characteristic of knowledge is that new knowledge can be derived from old knowledge.
Let's summarize some key points:
○ Data constitutes the building blocks of information.
○ Information is produced by processing data.
○ Information is used to reveal the meaning of data.
○ Accurate, relevant, and timely information is the key to good decision making.
○ Good decision making is the key to organizational survival in a global environment.
Data management: A process that focuses on data collection, storage, and retrieval. Common data management functions include addition, deletion, modification, and listing.

Introducing the Database
Database: A shared, integrated computer structure that houses a collection of related data. A database contains two types of data: end-user data (raw facts) and metadata.
○ End-user data: Raw facts that are of interest to the end user.
○ Metadata: Data about data; that is, data about data characteristics and relationships.
Database management system (DBMS): A collection of programs that manages the database structure and controls access to the data stored in the database.

Roles and Advantages of the DBMS
○ Role
Intermediary Function: The DBMS acts as a go-between for the user and the database. Users access the database through the DBMS, which translates their requests into the necessary operations to fulfill them.
Integration and Abstraction: It offers an integrated view of the database's data, hiding the complexity of its internal operations from users and application programs.
○ Advantages
1. Improved Data Sharing: By facilitating better access to data across different users and applications, a DBMS enhances the availability and management of data, leading to quicker responses to changing conditions.
2. Improved Data Security: As data access increases, so do the potential risks of breaches. A DBMS supports the enforcement of data privacy and security measures more effectively.
3. Better Data Integration: Access to well-managed data encourages a unified perspective on organizational operations, clarifying the overall picture and interconnections within the company.
4. Minimized Data Inconsistency: The DBMS reduces the likelihood of encountering discrepancies in data across different parts of the organization, enhancing data reliability.
5. Improved Data Access: The DBMS makes it possible to produce quick answers to ad hoc queries. From a database perspective, a query is a specific request issued to the DBMS for data manipulation, for example, to read or update the data. Simply put, a query is a question, and an ad hoc query is a spur-of-the-moment question. The DBMS sends back an answer (called the query result set) to the application. For example, when dealing with large amounts of sales data, end users might want quick answers to questions (ad hoc queries); a sample ad hoc query appears after this list.
6. Improved Decision Making: Access to high-quality data and improved data management support better decision making based on more accurate and timely information.
7. Increased End-user Productivity: The combination of available data and analytical tools boosts end-user efficiency, aiding in faster and more informed decisions critical in a competitive market environment.
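A minimal sketch of what such an ad hoc query could look like in SQL. The SALES table and its columns are hypothetical, not from the text; the rows the DBMS returns are the query result set.

    -- Hypothetical ad hoc query: "What were total sales by region last March?"
    SELECT   REGION, SUM(SALE_AMOUNT) AS TOTAL_SALES
    FROM     SALES
    WHERE    SALE_DATE BETWEEN '2024-03-01' AND '2024-03-31'
    GROUP BY REGION;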
○ Data inconsistency: A condition in which different versions of the same data yield different (inconsistent) results.
○ Query: A question or task asked by an end user of a database in the form of SQL code. A specific request for data manipulation issued by the end user or the application to the DBMS.
○ Ad hoc query: A "spur-of-the-moment" question.
○ Query result set: The collection of data rows returned by a query.
○ Data quality: A comprehensive approach to ensuring the accuracy, validity, and timeliness of data.

Types of Databases
○ Single-user database: A database that supports only one user at a time.
○ Desktop database: A single-user database that runs on a personal computer.
○ Multiuser database: A database that supports multiple concurrent users.
○ Workgroup database: A multiuser database that usually supports fewer than 50 users or is used for a specific department in an organization.
○ Enterprise database: The overall company data representation, which provides support for present and expected future needs.
○ Centralized database: A database located at a single site.
○ Distributed database: A logically related database that is stored in two or more physically independent sites.
○ Cloud database: A database that is created and maintained using cloud services, such as Microsoft Azure or Amazon AWS.
○ General-purpose database: A database that contains a wide variety of data used in multiple disciplines.
○ Discipline-specific database: A database that contains data focused on specific subject areas.
○ Operational database: A database designed primarily to support a company's day-to-day operations. Also known as a transactional database, OLTP database, or production database.
○ Online transaction processing (OLTP) database, transactional database, or production database: See operational database.
○ Analytical database: A database focused primarily on storing historical data and business metrics used for tactical or strategic decision making.
○ XML database: A database system that stores and manages semistructured XML data.

Data warehouse: A specialized database that stores historical and aggregated data in a format optimized for decision support.
Online analytical processing (OLAP): A set of tools that provide advanced data analysis for retrieving, processing, and modeling data from the data warehouse.
Business intelligence: A set of tools and processes used to capture, collect, integrate, store, and analyze data to support business decision making.
Unstructured data: Data that exists in its original, raw state; that is, in the format in which it was collected.
Structured data: Data that has been formatted to facilitate storage, use, and information generation.
Semistructured data: Data that has already been processed to some extent.
Extensible Markup Language (XML): A metalanguage used to represent and manipulate data elements. Unlike other markup languages, XML permits the manipulation of a document's data elements.
Social media: Web and mobile technologies that enable "anywhere, anytime, always on" human interactions.
NoSQL: A new generation of DBMS that is not based on the traditional relational database model.

Why Database Design is Important
Database design: The process that yields the description of the database structure and determines the database components. It is the second phase of the database life cycle.
The passage highlights several critical points regarding the importance of proper data modeling and database design skills, especially when dealing with personal productivity tools like spreadsheets and desktop database programs. Here's what is essential to understand:
○ Challenges with Personal Productivity Tools
Lack of Proper Skills: Users often do not possess adequate data-modeling and database design skills, leading to inefficient and ineffective data organization.
Narrow View of Data: Individuals typically have a limited perspective on how data should be structured, mistaking a collection of related data items for a cohesive unit, which can lead to poor database design when translated into a single table structure.
○ Importance of Good Database Design
Foundation for Data Management: A well-designed database is crucial for efficiently and effectively managing data. It affects how data is stored, accessed, and utilized.
Impact on DBMS Performance: Even with a good DBMS, a poorly designed database can result in poor performance, including slow data retrieval and processing.
○ Consequences of Poor Design
The text illustrates the pitfalls of inadequate database design with an example where data about employees and their skills are improperly stored, leading to:
Difficulty in performing simple queries and generating reports.
Increased risk of data inconsistency and redundancy.
The need for structural changes to accommodate new data, making the database difficult to maintain.
○ Advantages of Proper Design
Flexibility and Scalability: A well-designed database, as shown in the improved structure, allows for easy manipulation and expansion. Simple SQL commands can perform complex queries efficiently (see the sketch after this section).
Consistency and Accuracy: Proper database design ensures that data is stored consistently, reducing the risk of errors and discrepancies.
○ Key Takeaways
Understanding Use Case: Effective database design requires a clear understanding of the database's intended use, whether for transactional purposes, data warehousing, or other applications.
Design Complexity: The design process involves decomposing data into its constituent parts and carefully establishing relationships between tables to support the accurate recreation of the integrated view of the data for end users.
Importance of Education and Expertise: Due to the complexities and critical nature of database design, it is an area of focus in academic and professional training. Expertise in database design is valuable and sought after in the industry.
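The text's employee-skills tables are not reproduced here, but a sketch of the kind of improved structure it describes might look like the following. All table and column names are illustrative assumptions.

    -- Each entity gets its own table; skills are no longer repeated
    -- inside employee records.
    CREATE TABLE EMPLOYEE (
        EMP_NUM   INTEGER PRIMARY KEY,
        EMP_LNAME VARCHAR(30) NOT NULL
    );

    CREATE TABLE SKILL (
        SKILL_CODE VARCHAR(10) PRIMARY KEY,
        SKILL_DESC VARCHAR(50)
    );

    -- One row per employee-skill pairing; adding a new skill requires
    -- no structural change to EMPLOYEE.
    CREATE TABLE EMP_SKILL (
        EMP_NUM    INTEGER     REFERENCES EMPLOYEE (EMP_NUM),
        SKILL_CODE VARCHAR(10) REFERENCES SKILL (SKILL_CODE),
        PRIMARY KEY (EMP_NUM, SKILL_CODE)
    );

    -- A simple SQL command now answers a query the flawed single-table
    -- design made difficult: which employees hold a given skill?
    SELECT E.EMP_LNAME
    FROM   EMPLOYEE E
           JOIN EMP_SKILL ES ON E.EMP_NUM = ES.EMP_NUM
    WHERE  ES.SKILL_CODE = 'DB-DESIGN';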
Evolution of File System Data Processing
Data processing (DP) specialist: The person responsible for developing and managing a computerized file processing system.
Data: Raw facts, such as a telephone number, a birth date, a customer name, and a year-to-date (YTD) sales value. Data has little meaning unless it has been organized in some logical manner.
Field: A character or group of characters (alphabetic or numeric) that has a specific meaning. A field is used to define and store data.
Record: A logically connected set of one or more fields that describes a person, place, or thing. For example, the fields that constitute a record for a customer might consist of the customer's name, address, phone number, date of birth, credit limit, and unpaid balance.
File: A collection of related records. For example, a file might contain data about the students currently enrolled at Gigantic University.

The evolution of file system data processing is a narrative that underscores the transition from manual to computerized systems, and how this shift has both advanced and complicated data management practices. Understanding this evolution is crucial for grasping the challenges and limitations that databases are designed to address. Here are the key points regarding this evolution:
Manual File Systems
○ Early Organization: Initially, data was managed manually using paper-and-pencil systems, organized in file folders and filing cabinets. This was adequate for small data sets and simple reporting requirements.
○ Limitations with Growth: As organizations expanded and reporting needs became more complex, manual systems became cumbersome and inefficient, prompting a shift towards computerized solutions.
Computerized File Systems
○ Transition to Computers: To overcome the limitations of manual systems, organizations adopted computerized systems, hiring data processing (DP) specialists to manage data and generate reports.
○ Early Computerized Files: Initially, these systems mimicked manual filing systems, storing data in files similar to physical folders. Each file was managed by specific application programs, developed to store, retrieve, and modify the data.
○ Specialized Vocabulary: The transition to computerized systems introduced a specialized vocabulary for file structures, including terms like records, fields, and files, to facilitate clear communication among practitioners.
○ Challenges with Expansion: As more data was computerized, issues such as data redundancy, inconsistency, and lack of integrated management emerged, due to the isolated nature of file systems developed for specific departmental needs.
Schism Between Users and Data
○ End User Frustration: The separation between end users and their data, caused by the reliance on DP specialists for data manipulation and reporting, led to delays and frustrations, especially as the need for rapid decision making increased.
○ Desire for Direct Access: The limitations of early computerized file systems fueled the adoption of personal computers and productivity tools, as users sought direct access to data for analysis and decision making.
Modern End-User Productivity Tools
○ Adoption and Misuse of Spreadsheets: Tools like Microsoft Excel enabled sophisticated data analysis, but their widespread use also led to their misuse as substitutes for databases, reintroducing problems such as data redundancy and inconsistency.
○ Similarities to Early Systems: The use of spreadsheets and other personal productivity tools often replicates the limitations of early computerized file systems, highlighting the ongoing challenge of managing data efficiently and accurately.
Conclusion
○ Understanding the evolution from manual to computerized file systems, and the subsequent challenges, is essential for database designers and developers. It emphasizes the importance of avoiding past pitfalls through proper database design and underscores the need for databases that offer integrated, consistent, and efficient data management solutions. This historical context illustrates that while technological advancements have significantly improved data processing capabilities, they also necessitate a thorough understanding of data management principles to avoid repeating the limitations of earlier systems.
Problems with File System Data Processing
A critique of the file system method serves two major purposes:
○ Understanding the shortcomings of the file system enables you to understand the development of modern databases.
○ Failure to understand such problems is likely to lead to their duplication in a database environment, even though database technology makes them easy to avoid.
The following problems associated with file systems, whether created by DP specialists or through a series of spreadsheets, severely challenge the types of information that can be created from the data as well as the accuracy of the information:
○ Lengthy development times
○ Difficulty of getting quick answers
○ Complex system administration
○ Lack of security and limited data sharing
○ Extensive programming

Structural and Data Dependence
○ Structural dependence: A data characteristic in which a change in the database schema affects data access, thus requiring changes in all access programs.
○ Structural independence: A data characteristic in which changes in the database schema do not affect data access.
○ Data type: Defines the kind of values that can be used or stored. Also used in programming languages and database systems to determine the operations that can be applied to such data.
○ Data dependence: A data condition in which data representation and manipulation are dependent on the physical data storage characteristics.
○ Data independence: A condition in which data access is unaffected by changes in the physical data storage characteristics.
○ Logical data format: The way a person views data within the context of a problem domain.
○ Physical data format: The way a computer "sees" (stores) data.

Data Redundancy
○ Data redundancy: Occurs when identical pieces of data are stored in multiple places within a file system or across different file systems, such as databases and spreadsheets. This issue is often exacerbated by the lack of integrated data management and the independent creation and maintenance of spreadsheets by individual users.
○ Consequences of Data Redundancy
Poor Data Security: Multiple data copies increase the risk of unauthorized access. With data spread across different locations, ensuring comprehensive security measures becomes more challenging, leaving data vulnerable to breaches.
Data Inconsistency: When the same data exists in more than one place, it is likely to be updated in one location but not in others. This leads to conflicting versions of data, causing confusion and errors in reporting and decision making.
Data-entry Errors: The likelihood of errors increases with the number of times data is entered or re-entered across various files. Errors can range from minor inaccuracies to significant discrepancies that can affect operations and decision making.
Data Integrity Problems: Redundant data storage can result in the inclusion of inaccurate or fictitious information, such as the details of non-existent sales agents. This undermines the reliability of the database and can have real-world consequences for the organization and its stakeholders.
○ Implications for Database Design and Management
The presence of "islands of information," or scattered, redundant data locations, underlines the importance of centralized data management systems like databases that can reduce redundancy, enhance security, and ensure consistency. Effective database design and strict data management policies are crucial to preventing redundancy-related issues. These include implementing a single source of truth for each data item and ensuring that updates are propagated throughout the system.
Addressing data redundancy requires a comprehensive understanding of how data is used, stored, and accessed within an organization. This involves not just technological solutions but also training and guidelines for users on how to manage data efficiently and securely.
○ In summary, data redundancy is a pervasive issue that can lead to security vulnerabilities, inconsistency, errors, and integrity problems. Addressing it requires a coordinated approach to database design, management practices, and user education to ensure data remains accurate, consistent, and secure.

Data Anomalies
○ The passage discusses the concept of data anomalies resulting from data redundancy in database systems, using a hypothetical CUSTOMER file to illustrate the issues. Data anomalies refer to inconsistencies and errors that arise when data is duplicated across multiple locations within a database. These anomalies are categorized into three types:
Update Anomalies: Occur when a change made to data in one location requires the same change to be made in multiple other locations where the data is duplicated. For example, if an agent's contact information changes, this update must be replicated across all records where the agent's information appears, increasing the risk of inconsistencies.
Insertion Anomalies: Happen when new data cannot be added to the database without adding additional, potentially unnecessary data. For instance, inserting a new agent into the CUSTOMER file might require creating a dummy customer entry, leading to potential inconsistencies.
Deletion Anomalies: Arise when deleting data in one area inadvertently affects or removes data in another area. Deleting customers associated with a particular agent, for example, might unintentionally delete the agent's data if not handled correctly.
○ The passage emphasizes that data redundancy fosters these anomalies by necessitating field value changes in multiple locations, thereby complicating database maintenance and integrity. However, it concludes on a positive note, suggesting that the book will equip readers with the skills necessary to design and model databases that avoid such problems, thereby maintaining data integrity and consistency. A sketch of a design that avoids these anomalies follows.
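A minimal sketch of the anomaly-free alternative: agent data lives in its own table and CUSTOMER only references it. Table and column names are assumptions, not the book's exact figure.

    -- Agent data is stored exactly once.
    CREATE TABLE AGENT (
        AGENT_CODE  INTEGER PRIMARY KEY,
        AGENT_PHONE VARCHAR(15)
    );

    CREATE TABLE CUSTOMER (
        CUS_CODE   INTEGER PRIMARY KEY,
        CUS_LNAME  VARCHAR(30),
        AGENT_CODE INTEGER REFERENCES AGENT (AGENT_CODE)
    );

    -- Update anomaly avoided: one UPDATE changes the phone number
    -- everywhere it is seen.
    UPDATE AGENT SET AGENT_PHONE = '615-555-0199' WHERE AGENT_CODE = 501;

    -- Insertion anomaly avoided: an agent can exist with no customers.
    INSERT INTO AGENT VALUES (502, '615-555-0134');

    -- Deletion anomaly avoided: removing a customer leaves AGENT intact.
    DELETE FROM CUSTOMER WHERE CUS_CODE = 10011;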
Database Systems
Data anomaly: A data abnormality in which inconsistent changes have been made to a database. For example, an employee moves, but the address change is not corrected in all files in the database.
Database system: An organization of components that defines and regulates the collection, storage, management, and use of data in a database environment.
Data dictionary: A DBMS component that stores metadata (data about data). The data dictionary contains data definitions as well as data characteristics and relationships. It may also include data that is external to the DBMS.
Performance tuning: Activities that make a database perform more efficiently in terms of storage and access speed.
Query language: A nonprocedural language that is used by a DBMS to manipulate its data. An example of a query language is SQL.
Structured Query Language (SQL): A powerful and flexible relational database language composed of commands that enable users to create database and table structures, perform various types of data manipulation and data administration, and query the database to extract useful information.

The passage outlines the comprehensive structure and functionality of a database system, emphasizing its five key components, the roles of various users, and the multifaceted functions performed by a database management system (DBMS). Here's a concise overview:
○ Components of a Database System
Hardware: Includes all physical devices like computers, storage devices, network devices, and others essential for the database system's operation.
Software: Encompasses operating system software, DBMS software, and application programs and utilities necessary for managing the database and its environment.
People: Involves various user types, including system administrators, database administrators (DBAs), database designers, system analysts and programmers, and end users, each playing critical roles in the database ecosystem.
Procedures: Refer to the instructions and rules that govern the database system's design and use, ensuring standardized operations and data management.
Data: Represents the collected facts stored in the database, forming the core around which the database system is built.
○ Functions of a DBMS
Data Dictionary Management: Manages metadata and ensures programs access data correctly.
Data Storage Management: Handles complex data storage structures and performance tuning.
Data Transformation and Presentation: Converts data to match user expectations and formats.
Security Management: Enforces user security and data privacy.
Multiuser Access Control: Ensures data integrity and consistency with concurrent database access.
Backup and Recovery Management: Provides data safety and integrity through backup and recovery mechanisms.
Data Integrity Management: Enforces integrity rules to minimize redundancy and maximize consistency.
Database Access Languages and APIs: Facilitates data access through query languages like SQL and application programming interfaces (a short DDL/DML sketch follows this section).
Database Communication Interfaces: Supports database access via various network environments and interfaces.
○ Challenges and Considerations
Increased Costs: Sophisticated hardware, software, and skilled personnel lead to substantial maintenance costs.
Management Complexity: The integration with various technologies and significant impacts on resources and culture require careful management.
Maintaining Currency: Frequent updates and the application of the latest security measures are necessary to keep the system efficient.
Vendor Dependence: Investment in technology and training may lead to reluctance to change vendors, affecting pricing and choices.
Frequent Upgrade/Replacement Cycles: Upgrades often require additional hardware and training investments.
This summary encapsulates the intricate dynamics of database systems, highlighting the essential components, the DBMS's pivotal roles, and the challenges inherent in managing and optimizing these systems for organizational benefit.
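A minimal sketch of the two SQL sublanguages the access-language function refers to: DDL defines structures, DML manipulates their contents. The PRODUCT table is a hypothetical example, not from the text.

    -- DDL: define the database structure.
    CREATE TABLE PRODUCT (
        PROD_CODE  VARCHAR(10) PRIMARY KEY,
        PROD_DESC  VARCHAR(50),
        PROD_PRICE DECIMAL(8,2)
    );

    -- DML: manipulate the data inside that structure.
    INSERT INTO PRODUCT VALUES ('P-101', 'Claw hammer', 12.95);
    UPDATE PRODUCT SET PROD_PRICE = 13.50 WHERE PROD_CODE = 'P-101';
    SELECT PROD_DESC, PROD_PRICE FROM PRODUCT WHERE PROD_PRICE < 20.00;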
Summary
Data consists of raw facts. Information is the result of processing data to reveal its meaning. Accurate, relevant, and timely information is the key to good decision making, and good decision making is the key to organizational survival in a global environment.
Data is usually stored in a database. To implement a database and to manage its contents, you need a database management system (DBMS). The DBMS serves as the intermediary between the user and the database. The database contains the data you have collected and "data about data," known as metadata.
Database design defines the database structure. A well-designed database facilitates data management and generates accurate and valuable information. A poorly designed database can lead to poor decision making, and poor decision making can lead to the failure of an organization.
Databases can be classified according to the number of users supported, where the data is located, the type of data stored, the intended data usage, and the degree to which the data is structured.
Databases evolved from manual and then computerized file systems. In a file system, data is stored in independent files, each requiring its own data management programs. Although this method of data management is largely outmoded, understanding its characteristics makes database design easier to comprehend.
Some limitations of file system data management are that it requires extensive programming, system administration can be complex and difficult, making changes to existing structures is difficult, and security features are likely to be inadequate. Also, independent files tend to contain redundant data, leading to problems of structural and data dependence.
DBMSs were developed to address the file system's inherent weaknesses. Rather than depositing data in independent files, a DBMS presents the database to the end user as a single data repository. This arrangement promotes data sharing, thus eliminating the potential problem of islands of information. In addition, the DBMS enforces data integrity, eliminates redundancy, and promotes data security.
Knowledge of database technologies leads to many career opportunities in the ever-expanding IT industry. There is a variety of specializations within the database arena for a wide range of skills and expertise.

Week 3 - Chapter 2 - Data Models

Data Modeling and Data Models
Data modeling: The process of creating a specific data model for a determined problem domain.
Data model: A representation, usually graphic, of a complex "real-world" data structure. Data models are used in the database design phase of the Database Life Cycle. Data modeling is an iterative, progressive process.
The terms data model and database model are often used interchangeably. In this book, the term database model is used to refer to the implementation of a data model in a specific database system.
An implementation-ready data model should contain at least the following components:
○ A description of the data structure that will store the end-user data
○ A set of enforceable rules to guarantee the integrity of the data
○ A data manipulation methodology to support the real-world data transformations

The Importance of Data Models
The passage emphasizes the critical importance of data modeling in the development and understanding of database systems. Here are the key takeaways:
○ Communication Tool
Facilitates Interaction: Data models serve as a vital communication tool among database designers, application programmers, and end users, enhancing mutual understanding of the database structure and its functionality.
Improves Organizational Understanding: A well-developed data model can improve stakeholders' understanding of how various parts of an organization fit together, as illustrated by the client's realization about their business operations.
○ Foundation of Database Design
Essential for Applications: Applications are designed to manage data and convert it into useful information. Data modeling is crucial because data is perceived differently by various stakeholders within an organization.
Varied Perspectives: Different roles within an organization view data through different lenses. For instance, a manager might have an enterprise-wide perspective, whereas clerks and departmental managers might focus on subsets of data relevant to their specific functions.
○ Importance of a Unified View
Different Views, One Objective: Despite the varying perspectives on data among users and developers, a comprehensive and coherent data model ensures that everyone works towards a unified goal without conflicts or redundancies.
Analogy to Building a House: Just as a house requires blueprints before construction, a robust database system needs a well-considered data model. This model acts as a blueprint, guiding the development and integration of various data-related elements within the system.
○ Consequences of Lack of Planning
Potential for Problems: Without a clear and coherent data model, database systems are prone to issues such as conflicting data schemes, which can lead to significant financial losses and operational inefficiencies.
Conclusion
○ Data modeling is a fundamental process that underpins the successful design, implementation, and use of database systems. It facilitates effective communication between different stakeholders, ensures a unified understanding of data across an organization, and provides a crucial blueprint that guides the entire database development process. By abstracting complex data relationships into a comprehensible structure, data models enable the creation of efficient, reliable, and scalable databases that meet the diverse needs of users and align with organizational goals.

Data Model Basic Building Blocks
Attribute: A characteristic of an entity or object. An attribute has a name and a data type.
Relationship: An association between entities.
One-to-many (1:M or 1..*) relationship: Associations among two or more entities that are used by data models. In a 1:M relationship, one entity instance is associated with many instances of the related entity.
Many-to-many (M:N or *..*) relationship: An association among two or more entities in which one occurrence of an entity is associated with many occurrences of a related entity, and one occurrence of the related entity is associated with many occurrences of the first entity.
One-to-one (1:1 or 1..1) relationship: Associations among two or more entities that are used by data models. In a 1:1 relationship, one entity instance is associated with only one instance of the related entity.
Constraint: A restriction placed on data, usually expressed in the form of rules. For example, "A student's GPA must be between 0.00 and 4.00."
The passage highlights the foundational concepts of data modeling, which are essential for designing and understanding database systems. Here are the key elements:
Entities
○ Definition: An entity is a person, place, thing, or event about which data is collected and stored. Entities are unique and distinguishable instances in the model, representing types of objects in the real world, whether physical or abstract.
○ Example: Customers like John Smith or Pedro Dinamita are instances of the CUSTOMER entity.
Attributes
○ Definition: Attributes are characteristics or properties of an entity. They describe various aspects of the entity.
○ Example: For a CUSTOMER entity, attributes might include last name, first name, phone number, address, and credit limit.
Relationships
○ Definition: Relationships describe how entities associate with each other within the database. They can be one-to-many (1:M), many-to-many (M:N), or one-to-one (1:1).
○ Examples:
One-to-Many (1:M): A painter creates many paintings, but each painting is created by only one painter.
Many-to-Many (M:N): A student can take many classes, and each class can have many students enrolled.
One-to-One (1:1): Each retail store is managed by a single employee, and each store manager manages only one store.
Constraints
○ Definition: Constraints are restrictions placed on the data, ensuring the accuracy and reliability of the database by enforcing rules.
○ Examples:
An employee's salary must be between $6,000 and $350,000.
A student's GPA must range from 0.00 to 4.00.
Each class must have exactly one teacher.
Identifying Components
○ Business Rules: The identification of entities, attributes, relationships, and constraints begins with a clear understanding of the business rules relevant to the domain being modeled.
Importance
○ This structured approach to data modeling facilitates the accurate representation of real-world entities and their interactions within a database. It serves as a critical step in ensuring that databases are designed to accurately capture and organize the necessary data for an organization, supporting efficient data retrieval, storage, and integrity. By adhering to these foundational concepts, database designers can create robust models that serve the needs of applications and end users while adhering to business rules and requirements.

Business Rules
Business rule: A description of a policy, procedure, or principle within an organization. For example, a pilot cannot be on duty for more than 10 hours during a 24-hour period, or a professor may teach up to four classes during a semester.
The passage provides a comprehensive guide on the significance of business rules in data modeling and the steps involved in translating these rules into a coherent data model for database design. Here are the key points:
○ Understanding and Utilizing Business Rules
Definition and Importance: Business rules are concise descriptions of policies, procedures, or principles that govern the organization's operations; they are crucial for creating a meaningful data model that reflects the organization's operational environment.
Source of Business Rules: They can be derived from company managers, policy documentation, and direct interviews with end users, although verifying end-user perceptions is crucial due to possible differences in understanding.
○ Process of Identifying Business Rules
Standardization and Communication: Documenting business rules standardizes the company's view of data, serves as a communication tool, and helps designers understand the nature, role, and scope of the data, as well as the business processes.
Challenges: Not all business rules can be directly modeled in a database, but they can be enforced through application software.
○ Translating Business Rules into Data Model Components
Entities, Attributes, and Relationships: Nouns in business rules typically translate into entities, and verbs indicate the relationships between these entities. This process helps in identifying what objects (entities) and associations (relationships) are important based on how the organization operates.
Determining Relationship Types: Relationships are bidirectional, and understanding the nature of these relationships (e.g., one-to-many, many-to-many) is crucial for accurately representing the data structure. (A schema sketch of the relationship types and a constraint follows this section.)
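A minimal sketch of how these relationship types and constraints might be carried into SQL table definitions; all names are illustrative assumptions.

    -- Constraint as a business rule: GPA must stay between 0.00 and 4.00.
    CREATE TABLE STUDENT (
        STU_NUM INTEGER PRIMARY KEY,
        STU_GPA DECIMAL(3,2) CHECK (STU_GPA BETWEEN 0.00 AND 4.00)
    );

    CREATE TABLE CLASS (
        CLASS_CODE VARCHAR(10) PRIMARY KEY
    );

    -- 1:M relationship: a painter creates many paintings; the foreign
    -- key sits on the "many" side.
    CREATE TABLE PAINTER (
        PAINTER_NUM INTEGER PRIMARY KEY
    );

    CREATE TABLE PAINTING (
        PAINTING_NUM INTEGER PRIMARY KEY,
        PAINTER_NUM  INTEGER NOT NULL REFERENCES PAINTER (PAINTER_NUM)
    );

    -- M:N relationship: a student takes many classes and a class holds
    -- many students; relational designs implement this with a bridge
    -- table whose composite key pairs the two entities.
    CREATE TABLE ENROLL (
        STU_NUM    INTEGER     REFERENCES STUDENT (STU_NUM),
        CLASS_CODE VARCHAR(10) REFERENCES CLASS (CLASS_CODE),
        PRIMARY KEY (STU_NUM, CLASS_CODE)
    );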
○ Naming Conventions
Importance of Proper Naming: Naming entities, attributes, and relationships correctly enhances the clarity and usability of the data model, making it easier for all stakeholders to understand and use the database system.
Best Practices: Entity names should describe the business objects they represent, and attribute names should be descriptive and, if possible, prefixed with an abbreviation of the entity name to indicate their association.
Conclusion
○ Business rules are foundational to the data modeling process, guiding the identification of entities, attributes, relationships, and constraints that define how data is structured and interrelated in a database. Properly identifying and documenting these rules, coupled with thoughtful naming conventions, ensures the development of a coherent, functional, and effective database system that aligns with organizational objectives and operations.

The Evolution of Data Models
This comprehensive overview details the evolution of data models in database management, highlighting how each model has attempted to address the limitations of its predecessors and adapt to changing data management needs. Here's a summary of the key points:
○ Hierarchical and Network Models
Hierarchical Model: Developed in the 1960s for managing large data for complex projects. It organizes data in a tree-like structure with one-to-many relationships, but each child has only one parent.
Network Model: Introduced to handle more complex data relationships and improve performance with a standard allowing records to have more than one parent, overcoming a limitation of the hierarchical model.
○ Relational Model
Introduced by E. F. Codd in 1970, it revolutionized data management with its simplicity. It organizes data into tables (relations), making it easier for users and designers. The relational model gained practical feasibility with advancements in computer power, becoming the foundation for relational database management systems (RDBMSs) that manage data as a collection of tables.
○ Entity-Relationship Model
Developed to complement the relational model with a graphical representation of entities and their relationships, making it a standard for data modeling. It enhances conceptual simplicity and is widely used in database design.
○ Object-Oriented Model
Addresses complex real-world data representation by encapsulating data and its relationships within objects, supporting attributes, methods, and inheritance. This model is semantically rich, representing data more closely to real-world entities and behaviors.
○ Object/Relational and XML
The extended relational data model (ERDM) incorporates features from the object-oriented model, supporting objects, extensible data types, and inheritance within a relational framework. XML emerged as a standard for exchanging structured and unstructured data, leading to XML databases and support for XML in object/relational databases.
○ Big Data and NoSQL
The explosion of web- and sensor-generated data has led to Big Data, requiring new management approaches. NoSQL databases address these needs by supporting large volumes of unstructured data, high scalability, and fault tolerance, diverging from traditional relational databases.
Emerging Data Models: Summary
○ The continuous evolution of data models reflects the ongoing search for better ways to model and manage complex data. Each model brings new capabilities and addresses specific data management challenges, from the hierarchical and network models to relational, object-oriented, and NoSQL databases for Big Data. This evolution underscores the dynamic nature of data management technologies as they adapt to handle increasingly complex and voluminous data in various formats, driven by the needs of modern applications and the ever-growing data landscape.

Hierarchical model: An early database model whose basic concepts and characteristics formed the basis for subsequent database development. This model is based on an upside-down tree structure in which each record is called a segment. The top record is the root segment. Each segment has a 1:M relationship to the segment directly below it.
Segment: In the hierarchical data model, the equivalent of a file system's record type.
Network model: An early data model that represented data as a collection of record types in 1:M relationships.
Schema: A logical grouping of database objects, such as tables, indexes, views, and queries, that are related to each other.
Subschema: The portion of the database that interacts with application programs.
Data manipulation language (DML): The set of commands that allows an end user to manipulate the data in the database, such as SELECT, INSERT, UPDATE, DELETE, COMMIT, and ROLLBACK.
Data definition language (DDL): The language that allows a database administrator to define the database structure, schema, and subschema.
Relational model: Developed by E. F. Codd of IBM in 1970, the relational model is based on mathematical set theory and represents data as independent relations. Each relation (table) is conceptually represented as a two-dimensional structure of intersecting rows and columns. The relations are related to each other through the sharing of common entity characteristics (values in columns).
Relational diagram: A graphical representation of a relational database's entities, the attributes within those entities, and the relationships among the entities.
Table (relation): A logical construct perceived to be a two-dimensional structure composed of intersecting rows (entities) and columns (attributes) that represents an entity set in the relational model.
Tuple: In the relational model, a table row.
Relational database management system (RDBMS): A collection of programs that manages a relational database. The RDBMS software translates a user's logical requests (queries) into commands that physically locate and retrieve the requested data.
Entity relationship (ER) model (ERM): A data model that describes relationships (1:1, 1:M, and M:N) among entities at the conceptual level with the help of ER diagrams.
Entity relationship diagram (ERD): A diagram that depicts an entity relationship model's entities, attributes, and relations.
Entity instance (entity occurrence): A row in a relational table.
Entity set: A collection of like entities.
Connectivity: The type of relationship between entities. Classifications include 1:1, 1:M, and M:N.
Chen notation: See entity relationship (ER) model.
Crow's Foot notation: A representation of the entity relationship diagram that uses a three-pronged symbol to represent the "many" side of the relationship.
Class diagram notation: The set of symbols used in the creation of class diagrams.
Object-oriented data model (OODM): A data model whose basic modeling structure is an object.
Object: An abstract representation of a real-world entity that has a unique identity, embedded properties, and the ability to interact with other objects and itself.
Object-oriented database management system (OODBMS): Data management software used to manage data in an object-oriented database model.
Semantic data model: The first of a series of data models that models both data and their relationships in a single structure known as an object.
Class: A collection of similar objects with shared structure (attributes) and behavior (methods). A class encapsulates an object's data representation and a method's implementation.
Method: In the object-oriented data model, a named set of instructions to perform an action. Methods represent real-world actions.
Class hierarchy: The organization of classes in a hierarchical tree in which each parent class is a superclass and each child class is a subclass. See also inheritance.
Inheritance: In the object-oriented data model, the ability of an object to inherit the data structure and methods of the classes above it in the class hierarchy. See also class hierarchy.
Unified Modeling Language (UML): A language based on object-oriented concepts that provides tools such as diagrams and symbols to graphically model a system.
Class diagram: A diagram used to represent data and their relationships in UML object notation.
Extended relational data model (ERDM): A model that includes the object-oriented model's best features in an inherently simpler relational database structural environment. See also extended entity relationship model (EERM).
Object/relational database management system (O/R DBMS): A DBMS based on the extended relational model (ERDM). The ERDM, championed by many relational database researchers, constitutes the relational model's response to the OODM. This model includes many of the object-oriented model's best features within an inherently simpler relational database structure.
Extensible Markup Language (XML): A metalanguage used to represent and manipulate data elements. Unlike other markup languages, XML permits the manipulation of a document's data elements. XML facilitates the exchange of structured documents such as orders and invoices over the Internet.
Big Data: A movement to find new and better ways to manage large amounts of web-generated data and derive business insight from it, while simultaneously providing high performance and scalability at a reasonable cost.
3 Vs: Three basic characteristics of Big Data databases: volume, velocity, and variety.
Hadoop: A Java-based, open-source, high-speed, fault-tolerant distributed storage and computational framework. Hadoop uses low-cost hardware to create clusters of thousands of computer nodes to store and process data.
Hadoop Distributed File System (HDFS): A highly distributed, fault-tolerant file storage system designed to manage large amounts of data at high speeds.
Name node: One of three types of nodes used in the Hadoop Distributed File System (HDFS). The name node stores all the metadata about the file system. See also client node and data node.
Data node: One of three types of nodes used in the Hadoop Distributed File System (HDFS). The data node stores fixed-size data blocks (which could be replicated to other data nodes). See also client node and name node.
Client node: One of three types of nodes used in the Hadoop Distributed File System (HDFS). The client node acts as the interface between the user application and the HDFS. See also name node and data node.
MapReduce: An open-source application programming interface (API) that provides fast data analytics services; one of the main Big Data technologies that allows organizations to process massive data stores.

Degrees of Data Abstraction
American National Standards Institute (ANSI): The group that accepted the DBTG recommendations and augmented database standards in 1975 through its SPARC committee.
External model: The application programmer's view of the data environment. Given its business focus, an external model works with a data subset of the global database schema.
External schema: The specific representation of an external view; the end user's view of the data environment.
Conceptual model: The output of the conceptual design process. The conceptual model provides a global view of an entire database and describes the main data objects, avoiding details.
Conceptual schema: A representation of the conceptual model, usually expressed graphically. See also conceptual model.
Software independence: A property of any model or application that does not depend on the software used to implement it.
Hardware independence: A condition in which a model does not depend on the hardware used in the model's implementation. Therefore, changes in the hardware will have no effect on the database design at the conceptual level.
Logical design: A stage in the design phase that matches the conceptual design to the requirements of the selected DBMS and is therefore software-dependent. Logical design is used to translate the conceptual design into the internal model for a selected database management system, such as DB2, SQL Server, Oracle, IMS, Informix, Access, or Ingres.
Internal model: In database modeling, a level of data abstraction that adapts the conceptual model to a specific DBMS model for implementation. The internal model is the representation of a database as "seen" by the DBMS. In other words, the internal model requires a designer to match the conceptual model's characteristics and constraints to those of the selected implementation model.
Internal schema: A representation of an internal model using the database constructs supported by the chosen database.
Logical independence: A condition in which the internal model can be changed without affecting the conceptual model. (The internal model is hardware-independent because it is unaffected by the computer on which the software is installed. Therefore, a change in storage devices or operating systems will not affect the internal model.)
Physical model: A model in which physical characteristics such as location, path, and format are described for the data. The physical model is both hardware- and software-dependent. See also physical design.
Physical independence: A condition in which the physical model can be changed without affecting the internal model.
The passage illustrates the concept of data abstraction in database design, paralleling it with the process of automotive design, from conceptual sketches to detailed engineering drawings and finally to production specifications. This analogy underscores the progression from abstract concepts to detailed implementation in database development. Here are the essential points:
○ ANSI/SPARC Framework of Data Abstraction
External Model: Represents the end users' view of the data environment, focusing on how different business units within an organization view their data subsets. External schemas are used to represent these views through entity-relationship (ER) diagrams, highlighting specific entities, relationships, processes, and constraints relevant to each business unit.
Conceptual Model: Offers a global view of the entire database, integrating all external views into a unified schema that outlines the main data objects of the organization without delving into database-specific details. It is depicted through the ER model and serves as the database blueprint, ensuring software and hardware independence.
Internal Model: Maps the conceptual model to the chosen DBMS, detailing how the database is structured within the specific DBMS environment. This model is software-dependent, requiring adjustments if the DBMS software changes, but remains hardware-independent.
Physical Model: Describes the actual storage of data on physical media, detailing the physical storage devices and access methods. It requires a deep understanding of both the hardware and software used for database implementation and is both software- and hardware-dependent.
○ Key Advantages of Data Abstraction Levels
External Model Benefits: Facilitates identification of data required by different business units. Simplifies database design by providing clear operational requirements and security constraints. (A view sketch follows this section.)
Conceptual Model Benefits: Provides a comprehensive overview of the data environment, useful for understanding and communication. Ensures independence from specific technologies, allowing for flexibility in database implementation.
Internal and Physical Models: Tailor the conceptual design to specific DBMS requirements and physical storage conditions, respectively, allowing for optimized database performance and management.
○ Importance of Data Abstraction in Database Design
The structured approach to database design, moving from abstract to detailed levels, enables designers to create databases that are well-tuned to organizational needs, ensuring that all user requirements are met and that the database can be efficiently implemented and managed. This process also allows for the integration of diverse data views within an organization, ensuring consistency and security across different business units.
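In relational practice, one common way an external schema is realized is as a SQL view over the global schema. A minimal sketch, assuming a hypothetical CUSTOMER table with a balance column:

    -- An external schema for the billing department: a subset of the
    -- conceptual (global) schema, exposed as a view.
    CREATE VIEW BILLING_CUSTOMER AS
        SELECT CUS_CODE, CUS_LNAME, CUS_BALANCE
        FROM   CUSTOMER
        WHERE  CUS_BALANCE > 0;

    -- The billing application queries the view, not the base table.
    SELECT * FROM BILLING_CUSTOMER;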
Summary
A data model is an abstraction of a complex real-world data environment. Database designers use data models to communicate with programmers and end users. The basic data-modeling components are entities, attributes, relationships, and constraints. Business rules are used to identify and define the basic modeling components within a specific real-world environment.
The hierarchical and network data models were early models that are no longer used, but some of their concepts are found in current data models.
The relational model is the current database implementation standard. In the relational model, the end user perceives the data as being stored in tables. Tables are related to each other by means of common values in common attributes.
The entity relationship (ER) model is a popular graphical tool for data modeling that complements the relational model. The ER model allows database designers to visually present different views of the data, as seen by database designers, programmers, and end users, and to integrate the data into a common framework.
The object-oriented data model (OODM) uses objects as the basic modeling structure. Like the relational model's entity, an object is described by its factual content. Unlike an entity, however, the object also includes information about relationships between the facts, as well as relationships with other objects, thus giving its data more meaning.
The relational model has adopted many object-oriented (OO) extensions to become the extended relational data model (ERDM). Object/relational database management systems (O/R DBMSs) were developed to implement the ERDM. At this point, the OODM is largely used in specialized engineering and scientific applications, while the ERDM is primarily geared to business applications.
Big Data technologies such as Hadoop, MapReduce, and NoSQL provide distributed, fault-tolerant, and cost-efficient support for Big Data analytics. NoSQL databases are a new generation of databases that do not use the relational model and are geared to support the very specific needs of Big Data organizations. NoSQL databases offer distributed data stores that provide high scalability, availability, and fault tolerance by sacrificing data consistency and shifting the burden of maintaining relationships and data integrity to the program code.
Data-modeling requirements are a function of different data views (global versus local) and the level of data abstraction. The American National Standards Institute Standards Planning and Requirements Committee (ANSI/SPARC) describes three levels of data abstraction: external, conceptual, and internal. The fourth and lowest level of data abstraction, called the physical level, is concerned exclusively with physical storage methods.

Week 3 - Chapter 3 - The Relational Database Model

A Logical View of Data
Definition: A logical view of data in a database focuses on organizing and understanding data relationships without delving into physical storage details. It abstracts users from the complexities of data storage and enables a structured approach to data manipulation and organization.
Key Points:
○ Table Structure: Data is organized into tables consisting of rows and columns, resembling mathematical relations. Each row represents a record or entity, while each column represents attributes or characteristics of the entities.
○ Conceptualization: Users perceive tables as two-dimensional structures, facilitating a clear representation of data relationships. This abstraction simplifies data management and enhances query operations.
○ Predicate Logic: Tables in a relational database model are based on predicate logic, a mathematical framework in which assertions can be verified as true or false. This logical approach ensures data consistency and accuracy.
Advantages:
○ Structural Independence: Users focus on logical data representation rather than physical storage details, promoting data independence and flexibility.
○ Ease of Understanding: The relational model's table-based structure makes it easier to comprehend data relationships compared to hierarchical or network models.
○ Simplicity in Design: Logical simplicity leads to effective database design methodologies, improving overall database management.
Significance of Logical View:
○ Data Representation: Enables users to view and interact with data logically, enhancing data manipulation and retrieval processes.
○ Table Concept: Tables play a central role in the relational model, providing a structured framework for organizing and accessing data efficiently.
Conclusion: A logical view of data in a relational database model simplifies data management, promotes efficient data organization, and enhances the overall user experience by focusing on logical data relationships rather than physical storage intricacies.

Predicate logic: Used extensively in mathematics to provide a framework in which an assertion (statement of fact) can be verified as either true or false.
Set theory: A part of mathematical science that deals with sets, or groups of things, and is used as the basis for data manipulation in the relational model.
Tuple: In the relational model, a table row.
Attribute domain: In data modeling, the construct used to organize and describe an attribute's set of possible values.
Primary key (PK): In the relational model, an identifier composed of one or more attributes that uniquely identifies a row. Also, a candidate key selected as a unique entity identifier. See also key.

Keys
The role of a key depends on the concept of determination.
Key: One or more attributes that determine other attributes. See also candidate key, foreign key, primary key (PK), secondary key, and superkey.
Determination: The role of a key. In the context of a database table, the statement "A determines B" indicates that knowing the value of attribute A means that the value of attribute B can be looked up.
Functional dependence: Within a relation R, an attribute B is functionally dependent on an attribute A if and only if a given value of attribute A determines exactly one value of attribute B. The relationship "B is dependent on A" is equivalent to "A determines B" and is written as A → B.
Determinant: Any attribute in a specific row whose value directly determines other values in that row. See also Boyce-Codd normal form (BCNF).
Dependent: An attribute whose value is determined by another attribute.
Uniqueness: Keys in databases must be unique within their respective tables to ensure each record can be uniquely identified. This uniqueness is crucial for maintaining data integrity and avoiding duplicate entries.
Full functional dependence: A condition in which an attribute is functionally dependent on a composite key but not on any subset of the key. (Illustrated in the sketch after this list.)
Composite key: A multiple-attribute key.
Key attribute: An attribute that is part of a primary key. See also prime attribute.
Superkey: An attribute or attributes that uniquely identify each entity in a table. See also key.
Candidate key: A minimal superkey; that is, a key that does not contain a subset of attributes that is itself a superkey. See also key.
Entity integrity: The property of a relational table that guarantees each entity has a unique value in a primary key and that the key has no null values.
Null: The absence of an attribute value. Note that a null is not a blank.
Foreign key (FK): An attribute or attributes in one table whose values must match the primary key in another table or whose values must be null. See also key.
Referential integrity: A condition by which a dependent table's foreign key must have either a null entry or a matching entry in the related table.
Secondary key: A key used strictly for data retrieval purposes. For example, customers are not likely to know their customer number (primary key), but the combination of last name, first name, middle initial, and telephone number will probably match the appropriate table row. See also key.
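A brief sketch of determination and full functional dependence, using hypothetical student-enrollment tables:

    -- In STUDENT, STU_NUM determines STU_LNAME: knowing the student
    -- number lets you look up exactly one last name.
    CREATE TABLE STUDENT (
        STU_NUM   INTEGER PRIMARY KEY,
        STU_LNAME VARCHAR(30)
    );

    -- In ENROLL, the grade is fully functionally dependent on the
    -- composite key (STU_NUM, CLASS_CODE): neither attribute alone
    -- determines the grade.
    CREATE TABLE ENROLL (
        STU_NUM      INTEGER REFERENCES STUDENT (STU_NUM),
        CLASS_CODE   VARCHAR(10),
        ENROLL_GRADE CHAR(1),
        PRIMARY KEY (STU_NUM, CLASS_CODE)
    );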
Primary Key (PK):
○ The primary key is a fundamental key in a table that uniquely identifies each record.
○ It serves as the main identifier for data retrieval, indexing, and establishing relationships between tables.
○ The primary key cannot contain null values and must be unique for each record.
Candidate Key:
○ Candidate keys are potential keys that could serve as the primary key.
○ They are evaluated based on criteria such as uniqueness, stability, and simplicity before selecting the primary key.
○ Candidate keys provide alternative options for uniquely identifying records in a table.
Foreign Key (FK):
○ Foreign keys establish relationships between tables by referencing the primary key of another table.
○ They enforce referential integrity, ensuring that data references remain consistent and valid.
○ Foreign keys play a crucial role in maintaining data consistency across related tables.
Superkey:
○ A superkey is a set of attributes that uniquely identifies each row in a table.
○ It may contain more attributes than necessary for uniqueness and serves as a basis for defining candidate keys and primary keys.
○ Superkeys provide a broader identification mechanism compared to primary keys.
Secondary Key: 62
○ Secondary keys are used for data retrieval purposes and do not necessarily enforce uniqueness.
○ They help optimize query performance by providing additional indexing options for efficient data access.
○ Secondary keys are valuable for enhancing data retrieval operations and improving query efficiency.
Data Integrity:
○ Keys play a vital role in maintaining data integrity by enforcing uniqueness constraints and relational consistency.
○ They prevent duplicate entries, ensure data accuracy, and support the reliability of the database.
Relationship Establishment:
○ Keys are essential for establishing relationships between tables in a relational database.
○ Primary keys and foreign keys create links between related data entities, enabling data retrieval and manipulation across tables.
Indexing and Performance:
○ Properly defined keys support indexing mechanisms, improving query performance and overall database efficiency.
○ Efficient key design enhances data retrieval speed, query optimization, and overall system performance.
Understanding the significance of keys in databases is crucial for designing well-structured database schemas, ensuring data integrity, establishing relationships between tables, and optimizing query performance for efficient data management.
Integrity Rules:
Definition: Integrity rules are constraints that ensure the accuracy, consistency, and reliability of data within a database. 63
Flags: Special codes implemented by designers to trigger a required response, alert end users to specified conditions, or encode values. Flags may be used to prevent nulls by bringing attention to the absence of a value in a table.
Types of Integrity Rules:
○ Entity Integrity: Ensures each row in a table has a unique identity, typically enforced through primary keys.
○ Referential Integrity: Maintains consistency of relationships between tables by enforcing valid references.
○ Entity Integrity Requirements: All primary key entries must be unique. No part of a primary key can be null.
○ Referential Integrity Requirements: Foreign keys must match the primary key in another table or be null.
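The entity integrity requirements just listed can be demonstrated with Python's built-in sqlite3 module. In this sketch the CUSTOMER table and its rows are invented, and NOT NULL is declared explicitly because SQLite, as a long-standing quirk, does not otherwise forbid nulls in a non-integer primary key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE CUSTOMER (
        CUS_CODE  TEXT PRIMARY KEY NOT NULL,   -- unique and never null
        CUS_LNAME TEXT
    )
""")
cur.execute("INSERT INTO CUSTOMER VALUES ('10010', 'Ramas')")

# Both entity integrity violations are rejected by the DBMS itself.
for bad_row in [("10010", "Dunne"),   # duplicate primary key value
                (None,    "Smith")]:  # null primary key value
    try:
        cur.execute("INSERT INTO CUSTOMER VALUES (?, ?)", bad_row)
    except sqlite3.IntegrityError as e:
        print("Rejected:", e)   # UNIQUE / NOT NULL constraint failed
conn.close()
```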
Importance of Integrity Rules:
○ Ensures data consistency, accuracy, and reliability.
○ Prevents invalid or inconsistent data entries.
○ Maintains data integrity across related tables.
Enforcement Mechanisms:
○ Database Management Systems (DBMS) automatically enforce integrity rules.
○ Application developers should adhere to integrity rules during database design.
Consequences of Violations:
○ Violating integrity rules can lead to data inconsistencies and inaccuracies.
○ Data corruption may occur, affecting system reliability.
Data Integrity Maintenance:
○ Regular monitoring, validation, and enforcement of integrity rules are essential.
○ Adhering to entity and referential integrity rules is fundamental for effective database management.
Role in Data Quality Assurance:
○ Integrity rules play a crucial role in maintaining data quality, consistency, and reliability.
○ They ensure the integrity of the database system and the accuracy of stored information.
Overall Significance:
○ Understanding and implementing integrity rules are critical for effective database design and management.
○ They are essential for ensuring data integrity, relational consistency, and the overall reliability of the database system.
By adhering to integrity rules, database administrators and developers can maintain data quality, prevent data corruption, and ensure the accuracy and reliability of the database, ultimately leading to a robust and trustworthy data management system. 64
Relational Algebra:
Relational algebra: A set of mathematical principles that form the basis for manipulating relational table contents; the eight main functions are SELECT, PROJECT, JOIN, INTERSECT, UNION, DIFFERENCE, PRODUCT, and DIVIDE.
Relvar: Short for relation variable, a variable that holds a relation. A relvar is a container (variable) for holding relation data, not the relation itself.
Closure: A property of relational operators that permits the use of relational algebra operators on existing tables (relations) to produce new relations.
SELECT: In relational algebra, an operator used to select a subset of rows. Also known as RESTRICT.
RESTRICT: See SELECT.
PROJECT: In relational algebra, an operator used to select a subset of columns.
UNION: In relational algebra, an operator used to merge (append) two tables into a new table, dropping the duplicate rows. The tables must be union-compatible.
Union-compatible: Two or more tables that have the same number of columns and the corresponding columns have compatible domains.
INTERSECT: In relational algebra, an operator used to yield only the rows that are common to two union-compatible tables.
DIFFERENCE: In relational algebra, an operator used to yield all rows from one table that are not found in another union-compatible table.
PRODUCT: In relational algebra, an operator used to yield all possible pairs of rows from two tables. Also known as the Cartesian product. 65
JOIN: In relational algebra, a type of operator used to yield rows from two tables based on criteria. There are many types of joins, such as natural join, theta join, equijoin, and outer join.
natural join: A relational operation that yields a new table composed of only the rows with common values in their common attribute(s).
join columns: Columns that are used in the criteria of join operations. The join columns generally share similar values.
equijoin: A join operator that links tables based on an equality condition that compares specified columns of the tables.
theta join: A join operator that links tables using an inequality comparison operator (<, >, <=, >=) in the join condition.
inner join: A join operation in which only rows that meet a given criterion are selected. The criterion can be an equality condition (natural join or equijoin) or an inequality condition (theta join). The most commonly used type of join.
outer join: A join operation that produces a table in which all unmatched pairs are retained; unmatched values in the related table are left null.
left outer join: A join operation that yields all the rows in the left table, including those that have no matching values in the other table.
right outer join: A join operation that yields all of the rows in the right table, including the ones with no matching values in the other table.
DIVIDE: In relational algebra, an operator that answers queries about one set of data being associated with all values of data in another set of data.
Definition: Relational algebra is a formal system for manipulating relational table contents using relational operators.
Basic Operations:
○ Relational algebra includes fundamental operations such as SELECT, PROJECT, JOIN, INTERSECT, UNION, DIFFERENCE, PRODUCT, and DIVIDE.
Manipulation of Data:
○ Relational algebra provides a theoretical framework for performing data manipulation operations on relational databases.
Mathematical Basis:
○ Relational algebra is based on mathematical principles and provides a structured approach to working with relational data.
Relational Completeness:
○ The extent to which a database management system supports relational algebra operators determines its relational completeness.
○ Minimally relational systems must support key operators like SELECT, PROJECT, and JOIN.
Relvar Concept: 66
○ Relational algebra distinguishes between relations (data in tables) and relation variables (variables holding relations).
○ A relvar is a container for holding relation data, not the data itself.
Formal Definitions and Terminology:
○ Relational algebra uses formal mathematical representations for data operations, ensuring precision and clarity in database manipulation.
SQL Implementation:
○ SQL commands can be used to perform relational algebra operations in practical database management.
○ Understanding relational algebra principles can enhance the efficiency and effectiveness of writing SQL queries.
Data Manipulation Capabilities:
○ Relational algebra defines the basic data manipulation capabilities of the relational model, enabling users to retrieve, filter, and combine data effectively.
Data Integrity and Consistency:
○ Relational algebra operations help maintain data integrity and consistency by providing a structured approach to data manipulation.
Design Considerations:
○ Good database design involves understanding relational algebra concepts to create efficient and effective database structures.
○ ERDs (Entity-Relationship Diagrams) can be used to represent relationships among entities and attributes in a database design.
Role in Database Management:
○ Relational algebra is fundamental to database management and plays a crucial role in data retrieval, manipulation, and query optimization.
Understanding relational algebra principles is essential for database professionals as it forms the basis for manipulating relational data effectively, ensuring data integrity, consistency, and reliability in database systems.
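As a rough illustration of how these operators behave, here is a pure-Python sketch that treats a relation as a set of rows; the helper names (row, select_, project_) are invented for this example, and SELECT, PROJECT, UNION, INTERSECT, and DIFFERENCE map directly onto Python set operations:

```python
# A relation is modeled as a set of rows; each row is a hashable,
# sorted tuple of (attribute, value) pairs.

def row(**attrs):
    return tuple(sorted(attrs.items()))

def select_(relation, predicate):
    """SELECT (RESTRICT): the subset of rows satisfying the predicate."""
    return {r for r in relation if predicate(dict(r))}

def project_(relation, attributes):
    """PROJECT: a subset of columns; duplicate rows vanish automatically."""
    return {tuple((a, dict(r)[a]) for a in attributes) for r in relation}

emp = {
    row(num=100, lname="Kolmycz", dept="ACCT"),
    row(num=101, lname="Lewis",   dept="INFS"),
    row(num=102, lname="Vandam",  dept="ACCT"),
}
infs = {row(num=101, lname="Lewis", dept="INFS")}

print(select_(emp, lambda r: r["dept"] == "ACCT"))  # the two ACCT rows
print(project_(emp, ["dept"]))                      # only {ACCT, INFS} remain
print(emp | infs)   # UNION: duplicate rows are dropped
print(emp & infs)   # INTERSECT: rows common to both relations
print(emp - infs)   # DIFFERENCE: rows in emp but not in infs
```

Note that every operator returns another set of rows, that is, another relation, which is exactly the closure property defined above.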
67 68
The Data Dictionary and the System Catalog:
data dictionary: A DBMS component that stores metadata—data about data. Thus, the data dictionary contains the data definitions as well as their characteristics and relationships. A data dictionary may also include data that are external to the DBMS. Also known as an information resource dictionary. See also active data dictionary, metadata, and passive data dictionary.
system catalog: A detailed system data dictionary that describes all objects in a database. 69
homonym: The use of the same name to label different attributes. Homonyms generally should be avoided. See also synonym.
Definition: The data dictionary and system catalog are essential components of a database management system that store metadata about the database structure and contents.
Data Dictionary:
○ Provides a detailed description of all tables created by users and designers in the database.
○ Contains information about attribute names, characteristics, data types, and relationships within the database.
Metadata Management:
○ The data dictionary serves as a repository for metadata, which includes data about data, such as table structures, attributes, and constraints.
Design Documentation:
○ Often referred to as "the database designer's database," the data dictionary records design decisions about tables and their structures.
○ Helps maintain consistency and clarity in database design and management.
System Catalog:
○ Contains detailed system metadata about all objects within the database, including table names, creators, creation dates, column information, data types, indexes, users, and access privileges.
○ Described as a system data dictionary that provides comprehensive information about the database schema.
Interchangeability:
○ The terms "data dictionary" and "system catalog" are sometimes used interchangeably, as both store metadata about the database structure and contents.
○ Current relational database software typically provides a system catalog that serves as a system-created database for storing metadata.
Database Documentation:
○ The system catalog automatically generates database documentation, allowing users to query and retrieve information about database objects and characteristics.
○ Facilitates database maintenance, management, and troubleshooting by providing a centralized repository of system information.
Data Quality Assurance:
○ The data dictionary and system catalog play a crucial role in ensuring data quality by maintaining accurate and consistent metadata.
○ Help prevent data inconsistencies, homonyms, and synonyms within the database.
Querying Capabilities:
○ Both the data dictionary and system catalog can be queried like regular database tables, enabling users to access and retrieve metadata for analysis and management purposes. 70
Database Integrity:
○ By storing comprehensive metadata, the data dictionary and system catalog contribute to maintaining database integrity, consistency, and reliability.
○ They support effective database design, development, and administration by providing essential information about the database schema and structure.
Understanding the roles and significance of the data dictionary and system catalog is essential for database administrators and designers to effectively manage database metadata, ensure data quality, and support database operations and maintenance.
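SQLite offers a convenient way to see a system catalog in action: its sqlite_master table and the table_info pragma are real, queryable metadata facilities (the PRODUCT table below is invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE PRODUCT (P_CODE TEXT PRIMARY KEY, P_PRICE REAL)")

# The catalog is itself queryable, just like an ordinary table.
for name, sql in cur.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table'"):
    print(name)   # PRODUCT
    print(sql)    # the stored CREATE TABLE statement (metadata)

# Column-level metadata: position, name, declared type, NOT NULL, default, PK.
for col in cur.execute("PRAGMA table_info(PRODUCT)"):
    print(col)    # e.g., (0, 'P_CODE', 'TEXT', 0, None, 1)
conn.close()
```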
Relationships with the Relational Database:
Synonym: The use of different names to identify the same object, such as an entity, an attribute, or a relationship; synonyms should generally be avoided. See also homonym.
Composite entity: An entity designed to transform an M:N relationship into two 1:M relationships. The composite entity's primary key comprises at least the primary keys of the entities that it connects. Also known as a bridge entity or associative entity. See also linking table.
bridge entity: See composite entity.
associative entity: See composite entity.
linking table: In the relational model, a table that implements an M:N relationship. See also composite entity.
Definition: Relationships in a relational database establish connections between tables based on common fields, enabling data retrieval and manipulation across related entities.
Types of Relationships:
○ One-to-One (1:1): Each record in one table is related to only one record in another table. 71
○ One-to-Many (1:M): A record in one table can be related to multiple records in another table.
○ Many-to-Many (M:N): Multiple records in one table can be related to multiple records in another table.
Foreign Keys:
○ Foreign keys are used to establish relationships between tables by linking a column in one table to the primary key in another table.
○ They enforce referential integrity, ensuring data consistency and accuracy across related tables.
Referential Integrity:
○ Referential integrity ensures that relationships between tables are maintained, preventing orphaned records and data inconsistencies.
○ It enforces rules that require values in foreign key columns to match values in the primary key columns of related tables.
Normalization:
○ Normalization is the process of organizing data in a database to reduce redundancy and dependency by dividing tables into smaller, related tables.
○ By establishing relationships between normalized tables, data integrity and efficiency are improved.
Cascade Operations:
○ Cascade operations, such as CASCADE DELETE and CASCADE UPDATE, allow changes made to parent records to automatically propagate to related child records.
○ These operations help maintain data consistency and integrity across interconnected tables.
Role in Querying:
○ Relationships enable complex queries that involve joining multiple tables to retrieve related data.
○ JOIN operations, such as INNER JOIN, LEFT JOIN, and RIGHT JOIN, are used to combine data from related tables based on specified conditions.
Entity-Relationship Diagrams (ERDs):
○ ERDs visually represent relationships between entities in a database, illustrating how tables are connected through primary and foreign keys.
○ ERDs help database designers and administrators understand and communicate the database structure effectively.
Data Integrity and Consistency:
○ Well-defined relationships contribute to data integrity by ensuring that data is accurately linked and maintained across tables.
○ They help prevent data anomalies, such as insertion, update, and deletion anomalies, by structuring data relationships appropriately. 72
Importance of Understanding Relationships:
○ Understanding and managing relationships within a relational database is crucial for designing efficient database schemas, ensuring data consistency, and optimizing query performance.
○ Properly defined relationships support data integrity, facilitate data retrieval, and enhance the overall functionality and usability of the database system.
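To tie these ideas together, here is a hedged sqlite3 sketch of an M:N relationship implemented as two 1:M relationships through a linking (composite) table; STUDENT, CLASS, and ENROLL are invented names, and the final insert shows referential integrity being enforced:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only on request
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE STUDENT (STU_NUM    INTEGER PRIMARY KEY, STU_LNAME  TEXT);
    CREATE TABLE CLASS   (CLASS_CODE TEXT    PRIMARY KEY, CLASS_NAME TEXT);
    CREATE TABLE ENROLL (                  -- the linking / composite table
        STU_NUM      INTEGER REFERENCES STUDENT (STU_NUM),
        CLASS_CODE   TEXT    REFERENCES CLASS (CLASS_CODE),
        ENROLL_GRADE TEXT,
        PRIMARY KEY (STU_NUM, CLASS_CODE)  -- PKs of both parents combined
    );
""")
cur.execute("INSERT INTO STUDENT VALUES (321452, 'Bowser')")
cur.execute("INSERT INTO CLASS VALUES ('10014', 'Acct II')")
cur.execute("INSERT INTO ENROLL VALUES (321452, '10014', 'C')")

# Referential integrity: an ENROLL row cannot point at a missing student.
try:
    cur.execute("INSERT INTO ENROLL VALUES (999999, '10014', 'A')")
except sqlite3.IntegrityError as e:
    print("Rejected:", e)   # FOREIGN KEY constraint failed
conn.close()
```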
By comprehensively understanding and effectively managing relationships within a relational database, database professionals can design robust and efficient database structures that support data integrity, consistency, and reliability, ultimately enhancing the usability and performance of the database system.
Data Redundancy Revisited:
Data Redundancy Definition: Data redundancy refers to the duplication of data within a database, which can lead to inconsistencies, inefficiencies, and data anomalies if not properly managed.
Data Anomalies: Data redundancy can result in various anomalies, including insertion, update, and deletion anomalies, which can compromise data integrity and accuracy.
Controlled Redundancy: In the relational database model, controlled redundancy is used strategically to link tables through common attributes, such as foreign keys, to establish relationships and ensure data consistency.
Foreign Keys: Foreign keys play a crucial role in controlling data redundancy by linking tables and enforcing referential integrity, which helps maintain data consistency across related tables.
Minimizing Data Redundancy: While foreign keys may lead to some level of data redundancy, their proper use minimizes redundancies and reduces the likelihood of data anomalies and inconsistencies. 73
Database Design Considerations: Database designers must balance the need to control data redundancy with requirements for design elegance, processing speed, and information retrieval efficiency.
Data Warehousing: In data warehousing design, carefully defined and controlled data redundancies are sometimes necessary to support analytical processes and reporting requirements effectively.
Historical Accuracy: Data redundancies may be intentionally introduced to preserve historical accuracy and ensure that past data remains consistent and accessible for analysis and decision-making.
Database Maintenance: Proper implementation and careful control of data redundancies are essential for maintaining database integrity, consistency, and reliability over time.
Data Redundancy in Database Systems: While data redundancy can pose challenges, it can also serve important purposes in database systems, such as improving data retrieval performance and supporting specific information needs.
Understanding the implications of data redundancy, the role of controlled redundancy in relational databases, and the importance of managing data redundancies effectively are critical for database designers, administrators, and developers to ensure data quality, integrity, and usability in database systems.
Indexes:
index: An ordered array of index key values and row ID values (pointers). Indexes are generally used to speed up and facilitate data retrieval. Also known as an index key.
Index key: See index.
Definition: Indexes in databases are data structures that improve the speed of data retrieval operations by providing quick access to specific rows in a table based on the indexed columns.
Index Key: An index key is a set of values that uniquely identify rows in a table and serve as reference points for locating data efficiently.
Types of Indexes:
○ Primary Index: Automatically created when defining a table's primary key, ensuring uniqueness and fast retrieval of primary key values.
○ Unique Index: Ensures the uniqueness of values in the indexed column(s), allowing only one row with a specific value.
○ Secondary Index: Created on columns other than the primary key for faster data retrieval based on specific attributes.
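A short sqlite3 sketch of a secondary index at work (the CUSTOMER table and index name are invented; EXPLAIN QUERY PLAN is a real SQLite facility, though its exact output wording varies by version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE CUSTOMER (CUS_CODE INTEGER PRIMARY KEY, CUS_LNAME TEXT)")
cur.executemany("INSERT INTO CUSTOMER VALUES (?, ?)",
                [(10010, "Ramas"), (10011, "Dunne"), (10012, "Smith")])

query = "SELECT * FROM CUSTOMER WHERE CUS_LNAME = 'Dunne'"

# Without an index on CUS_LNAME, the optimizer must scan the whole table.
print(cur.execute("EXPLAIN QUERY PLAN " + query).fetchall())  # ...SCAN CUSTOMER

# A secondary index on the search column changes the access path.
cur.execute("CREATE INDEX CUS_LNAME_DX ON CUSTOMER (CUS_LNAME)")
print(cur.execute("EXPLAIN QUERY PLAN " + query).fetchall())
# ...SEARCH CUSTOMER USING INDEX CUS_LNAME_DX (CUS_LNAME=?)
conn.close()
```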
Index Structure: Indexes consist of index keys and pointers to the actual data rows, facilitating quick lookup and retrieval of data based on the indexed values.
Indexing Benefits:
○ Improves query performance by reducing the number of rows that need to be scanned during data retrieval operations. 74
○ Speeds up data retrieval for SELECT, JOIN, and WHERE clauses by directly accessing indexed values.
○ Enhances database efficiency by optimizing data access paths and reducing disk I/O operations.
Index Creation: Database administrators can create indexes on columns that are frequently used in search conditions or involved in JOIN operations to enhance query performance.
Index Maintenance: Regular maintenance of indexes, such as rebuilding or reorganizing fragmented indexes, is essential to ensure optimal performance and data access efficiency.
Query Optimization: Query execution plans generated by the database optimizer often utilize indexes to access data efficiently and minimize query processing time.
Index Usage: Indexes are utilized by database management systems for various purposes, including enforcing data integrity, supporting primary key constraints, and facilitating data retrieval operations.
Considerations: While indexes improve data retrieval speed, they also consume storage space and require maintenance, so it's important to carefully plan and optimize index usage based on the specific database workload and access patterns.
Understanding the role of indexes in databases, their impact on query performance, and best practices for index creation and maintenance is crucial for optimizing database performance, enhancing data access efficiency, and improving overall system responsiveness.
Codd's Relational Database Rules:
unique index: An index in which the index key can have only one associated pointer value (row).
Origin: In 1985, Dr. E. F. Codd, the inventor of the relational model, published a set of 12 rules to define the characteristics of a truly relational database system.
Purpose: Codd's rules were introduced to establish a standard for evaluating the relational capabilities of database management systems and to ensure adherence to fundamental principles of relational database design.
Minimum Standards: The rules serve as a benchmark to determine whether a database system can be considered truly relational based on its adherence to specific criteria outlined by Codd.
Relational Integrity: Codd's rules emphasize the importance of maintaining relational integrity through the systematic representation and treatment of data within tables, ensuring consistency and accuracy. 75
Data Access: The rules guarantee that every value in a relational database can be accessed through a combination of table name, primary key value, and column name, providing a systematic approach to data retrieval.
Metadata Management: Codd's rules require that metadata, such as data definitions and constraints, be stored and managed within the database as ordinary data accessible through standard relational language commands.
Data Manipulation: A relational database system must support a comprehensive data sublanguage for defining, manipulating, and managing data, including data definition, view definition, and integrity constraints.
Data Independence: Codd's rules advocate for both physical and logical data independence, ensuring that changes to storage structures or table designs do not impact application programs and data access methods.
Integrity Constraints: All relational integrity constraints must be definable within the relational language and stored in the system catalog to enforce data consistency and accuracy at the database level.
Distribution Independence: End users and applications should be unaware of the physical location of data, whether stored in a distributed or local database, to maintain transparency and consistency in data access.
Nonsubversion: Users should not be able to bypass the integrity rules of the database through low-level access, ensuring that data integrity is preserved and enforced at all levels of data manipulation.
Rule Zero: Codd's rules are based on the principle that a relational database system should exclusively use its relational facilities for data management, emphasizing the core principles of the relational model.
Understanding Codd's relational database rules provides a framework for evaluating the relational capabilities of database systems, ensuring data integrity, consistency, and adherence to fundamental principles of relational database design. 76
Summary
Tables are the basic building blocks of a relational database. A grouping of related entities, known as an entity set, is stored in a table. Conceptually speaking, the relational table is composed of intersecting rows (tuples) and columns. Each row represents a single entity, and each column represents the characteristics (attributes) of the entities. 77
Keys are central to the use of relational tables. Keys define functional dependencies; that is, other attributes are dependent on the key and can therefore be found if the key value is known. A key can be classified as a superkey, a candidate key, a primary key, a secondary key, or a foreign key.
Each table row must have a primary key. The primary key is an attribute or combination of attributes that uniquely identifies all remaining attributes found in any given row. Because a primary key must be unique, no null values are allowed if entity integrity is to be maintained.
Although tables are independent, they can be linked by common attributes. Thus, the primary key of one table can appear as the foreign key in another table to which it is linked. Referential integrity dictates that the foreign key must contain values that match the primary key in the related table or must contain nulls.
The relational model supports several relational algebra functions, including SELECT, PROJECT, JOIN, INTERSECT, UNION, DIFFERENCE, PRODUCT, and DIVIDE. Understanding the basic mathematical forms of these functions gives a broader understanding of the data manipulation options.
A relational database performs much of the data manipulation work behind the scenes. For example, when you create a database, the RDBMS automatically produces a structure to house a data dictionary for your database. Each time you create a new table within the database, the RDBMS updates the data dictionary, thereby providing the database documentation.
Once you know the basics of relational databases, you can concentrate on design. Good design begins by identifying appropriate entities and their attributes and then the relationships among the entities. Those relationships (1:1, 1:M, and M:N) can be represented using ERDs. The use of ERDs allows you to create and evaluate simple logical design.
The 1:M relationship is most easily incorporated in a good design; just make sure that the primary key of the "1" is included in the table of the "many."
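As a hedged sketch of that 1:M rule (VENDOR and PRODUCT are invented illustrative tables), the primary key of the "1" side simply reappears as a foreign key column on the "many" side:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE VENDOR  (V_CODE INTEGER PRIMARY KEY, V_NAME TEXT);
    CREATE TABLE PRODUCT (
        P_CODE     TEXT PRIMARY KEY,
        P_DESCRIPT TEXT,
        V_CODE     INTEGER REFERENCES VENDOR (V_CODE)  -- FK to the "1" side
    );
""")
cur.execute("INSERT INTO VENDOR VALUES (21344, 'Gomez Bros.')")
# One vendor, many products: the whole 1:M link lives in the FK column.
cur.executemany("INSERT INTO PRODUCT VALUES (?, ?, ?)",
                [("11QER/31", "Power drill", 21344),
                 ("13-Q2/P2", "Saw blade", 21344)])
for r in cur.execute("""SELECT V_NAME, P_DESCRIPT
                        FROM VENDOR JOIN PRODUCT USING (V_CODE)"""):
    print(r)   # both products join back to the same vendor row
conn.close()
```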
78
Week 3 - Chapter 4 - Entity Relationship (ER) Modeling
The Entity Relationship Model
The Entity Relationship Model (ERM) forms the basis of an Entity Relationship Diagram (ERD).
ERDs represent the conceptual database as viewed by the end user, depicting entities, attributes, and relationships.
Entities in an ERD represent real-world objects and are often used interchangeably with the term "objects".
The ERD components include entities, attributes, and relationships.
The ERD modeling tools are essential for successful database design and implementation.
The original Chen notation and the newer Crow's Foot and UML notations are used for ER modeling concepts.
The Crow's Foot notation is implementation-oriented and can represent what could be implemented.
The Chen notation favors conceptual modeling, while the UML notation can be used for both conceptual and implementation modeling.
Entities in ER modeling refer to entity sets rather than single entity occurrences.
These points highlight the significance of the Entity Relationship Model in database design and emphasize the importance of understanding entities, attributes, and relationships in developing successful database models.
Domain: The possible set of values for a given attribute.
Required attribute: In ER modeling, an attribute that must have a value. In other words, it cannot be left empty.
optional attribute: In ER modeling, an attribute that does not require a value; therefore, it can be left empty.
identifier: One or more attributes that uniquely identify each entity instance.
relational schema: The organization of a relational database as described by the database administrator.
composite identifier: In ER modeling, a key composed of more than one attribute.
composite attribute: An attribute that can be further subdivided to yield additional attributes. For example, a phone number such as 615-898-2368 may be divided into an area code (615), an exchange number (898), and a four-digit code (2368). Compare to simple attribute.
simple attribute: An attribute that cannot be subdivided into meaningful components. Compare to composite attribute.
single-valued attribute: An attribute that can have only one value.
multivalued attribute: An attribute that can have many values for a single entity occurrence. For example, an EMP_DEGREE attribute might store the string "BBA, MBA, PHD" to indicate three different degrees held. 79
derived attribute: An attribute that does not physically exist within the entity and is derived via an algorithm. For example, the Age attribute might be derived by subtracting the birth date from the current date.
participants: An ER term for entities that participate in a relationship. For example, in the relationship "PROFESSOR teaches CLASS," the teaches relationship is based on the participants PROFESSOR and CLASS.
connectivity: The classification of the relationship between entities. Classifications include 1:1, 1:M, and M:N.
cardinality: A property that assigns a specific value to connectivity and expresses the range of allowed entity occurrences associated with a single occurrence of the related entity.
existence-dependent: A property of an entity whose existence depends on one or more other entities.
In such an environment, the existence-independent table must be created and loaded first because the existence-dependent key cannot reference a table that does not yet exist.
existence-independent: A property of an entity that can exist apart from one or more related entities. Such a table must be created first when referencing an existence-dependent table.
strong entity: An entity that is existence-independent, that is, it can exist apart from all of its related entities.
regular entity: See strong entity.
weak (nonidentifying) relationship: A relationship in which the primary key of the related entity does not contain a primary key component of the parent entity.
strong (identifying) relationship: A relationship that occurs when two entities are existence dependent; from a database design perspective, this relationship exists whenever the primary key of the related entity contains the primary key of the parent entity.
weak entity: An entity that displays existence dependence and inherits the primary key of its parent entity. For example, a DEPENDENT requires the existence of an EMPLOYEE.
optional participation: In ER modeling, a condition in which one entity occurrence does not require a corresponding entity occurrence in a particular relationship.
mandatory participation: A relationship in which one entity occurrence must have a corresponding occurrence in another entity. For example, an EMPLOYEE works in a DIVISION. (A person cannot be an employee without being assigned to a company's division.)
relationship degree: The number of entities or participants associated with a relationship. A relationship degree can be unary, binary, ternary, or higher.
unary relationship: An ER term used to describe an association within an entity. For example, an EMPLOYEE might manage another EMPLOYEE. 80
binary relationship: An ER term for an association (relationship) between two entities. For example, PROFESSOR teaches CLASS.
ternary relationship: An ER term used to describe an association (relationship) between three entities. For example, a DOCTOR prescribes a DRUG for a PATIENT.
recursive relationship: A relationship found within a single entity type. For example, an EMPLOYEE is married to an EMPLOYEE or a PART is a component of another PART.
Entities
○ An entity is an object of interest to the end user in the Entity Relationship Model (ERM).
○ Entities represent real-world objects and are often used interchangeably with the term "objects".
○ Entities in ER modeling refer to entity sets, not single entity occurrences.
○ The order in which ERD components are covered is dictated by the way modeling tools are used to develop ERDs for successful database design and implementation.
○ Entities are represented by rectangles containing the entity name, usually written in all capital letters.
○ The emphasis on design and implementation of databases leads to the use of Crow's Foot and UML notations for ER modeling.
○ The Crow's Foot notation favors an implementation-oriented approach in ER modeling.
Attributes
○ Attributes are characteristics of entities in the Entity Relationship Model (ERM).
○ Attributes are represented by ovals connected to the entity rectangle with a line in the original Chen notation.
○ Attributes have domains, which are the set of possible values for a given attribute.
○ Required attributes must have a value and are indicated in boldface in the Crow's Foot notation.
○ Optional attributes do not require a value and can be left empty.
○ Attributes may share a domain, and newly declared attributes can inherit characteristics of existing attributes if the same name is used.
○ Single-valued attributes can have only a single value, while composite attributes can be further subdivided into additional attributes.
○ The database designer must be aware of composite attributes and decompose them into simple attributes for detailed queries.
○ Identifiers, or primary keys, uniquely identify each entity instance in the ERM and are underlined in the ERD.
○ Identifiers ideally consist of a single attribute; when a single attribute will not do, a composite identifier (a key composed of more than one attribute) is used.
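To illustrate several of the terms above at once, here is a hedged sqlite3 sketch of a weak entity: DEPENDENT is existence-dependent on EMPLOYEE, participates in a strong (identifying) relationship, and uses a composite identifier that inherits the parent's primary key (all names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE EMPLOYEE (EMP_NUM INTEGER PRIMARY KEY, EMP_LNAME TEXT);
    CREATE TABLE DEPENDENT (
        EMP_NUM  INTEGER REFERENCES EMPLOYEE (EMP_NUM), -- inherited key part
        DEP_NUM  INTEGER,                               -- distinguishes dependents
        DEP_NAME TEXT,
        PRIMARY KEY (EMP_NUM, DEP_NUM)                  -- composite identifier
    );
""")
cur.execute("INSERT INTO EMPLOYEE VALUES (100, 'Kolmycz')")
cur.execute("INSERT INTO DEPENDENT VALUES (100, 1, 'Anne')")

# Existence dependence: a DEPENDENT row requires its EMPLOYEE to exist first.
try:
    cur.execute("INSERT INTO DEPENDENT VALUES (200, 1, 'Orphan')")
except sqlite3.IntegrityError as e:
    print("Rejected:", e)   # FOREIGN KEY constraint failed
conn.close()
```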
81 82
Developing an ER Diagram
iterative process: A process based on repetition of steps and procedures.
The process of database design is iterative rather than linear, based on repetition of processes and procedures.
Building an Entity-Relationship Diagram (ERD) involves creating a detailed narrative of the organization's operations, identifying business rules, entities, relationships, and attributes, and developing the initial ERD.
During the ERD development process, identifying attributes and primary keys that adequately describe the entities is crucial. 83
The ERD is revised and reviewed iteratively, incorporating newly discovered components and clarifications based on feedback from end users and designers.
The ERD should be a fair representation of the organization's activities and functions, meeting the end users' requirements.
The ERD components include entities, relationships, and attributes, which are essential for database design and implementation.
The ERD uses connectivity and cardinality notations to show relationship strength, participation, and degree of relationship.
The iterative nature of the ERD development process allows for adjustments and enhancements based on operational requirements and end-user feedback.
These points highlight the key aspects of developing an Entity-Relationship Diagram (ERD) as part of the database design process, emphasizing the iterative nature of the design process and the importance of incorporating feedback and refining the diagram to accurately represent the organization's activities and functions.
Database Design Challenges: Conflicting Goals
Conflicting Goals in Database Design:
○ Database designers often encounter conflicting goals during the design process.
○ These conflicts typically involve balancing design standards, processing speed, and information requirements.
Design Standards:
○ Design standards play a crucial role in guiding the development of logical structures in a database design.
○ The aim of design standards is to minimize data redundancies and avoid null values as much as possible.
Challenges Faced:
○ Designers must navigate conflicting goals such as design elegance, processing speed, and information requirements.
○ Balancing these goals is essential to create a database design that meets both technical standards and end-user needs.
Consideration of End-User Requirements:
○ Designers need to consider various end-user requirements, including performance, security, shared access, and data integrity.
○ These considerations are crucial alongside the focus on entities, attributes, relationships, and constraints in the design process.
Meeting Logical Requirements vs. User Needs:
○ While meeting logical requirements and design conventions is important, failing to address customer transaction speed and information requirements can render a design ineffective.
84
○ Designers must ensure that the database design aligns with both technical standards and user expectations to deliver a successful solution.
Compromises in Design:
○ Compromises are inevitable in real-world database design scenarios.
○ Designers must be prepared to make trade-offs and compromises to balance conflicting goals and create a functional database design.
Essential Role of ER Modeling:
○ Entity-Relationship (ER) modeling is essential in developing a robust database design that can adapt to changes and growth.
○ ER modeling provides a deep understanding of how an organization operates, aiding in the creation of effective database structures.
Complex Design and Implementation Choices:
○ Design and implementation problems may not always have straightforward solutions.
○ Designers may need to make complex choices based on specific project requirements and constraints to address conflicting goals effectively.
Summary
The ERM uses ERDs to represent the conceptual database as viewed by the end user. The ERM's main components are entities, relationships, and attributes. The ERD includes connectivity and cardinality notations, and can also show relationship strength, relationship participation (optional or mandatory), and degree of relationship (such as unary, binary, or ternary).
Connectivity describes the relationship classification (1:1, 1:M, or M:N). Cardinality expresses the specific number of entity occurrences associated with an occurrence of a related entity. Connectivities and cardinalities are usually based on business rules.
In the ERM, an M:N relationship is valid at the conceptual level. However, when implementing the ERM in a relational database, the M:N relationship must be mapped to a set of 1:M relationships through a composite entity.
ERDs may be based on many different ERMs. However, regardless of which model is selected, the modeling logic remains the same. Because no ERM can accurately portray all real-world data and action constraints, application software must be used to augment the implementation of at least some of the business rules.
Unified Modeling Language (UML) class diagrams are used to represent the static data structures in a data model. The symbols used in the UML class and ER diagrams are very similar. The UML class diagrams can be used to depict data models at the conceptual or implementation abstraction levels.
Database designers, no matter how well they can produce designs that conform to all applicable modeling conventions, are often forced to make design compromises.