BI Architecture PDF
Document Details
ISEG - Lisbon School of Economics and Management
2023
false
Eduardo Rodrigues
Tags
Summary
This document describes Business Intelligence (BI) Architecture. It covers various aspects such as Business vision, different Business Processes, operational aspects, Information Strategic Planning, etc.The author seems to be from Lisbon School of Economics & Management, University of Lisboa.
Full Transcript
Data Platforms for Analytics BI Architecture @Eduardo Rodrigues BI Architecture Business vision Business Processes @Eduardo Rodrigues BI Architecture Business vision...
Data Platforms for Analytics BI Architecture @Eduardo Rodrigues BI Architecture Business vision Business Processes @Eduardo Rodrigues BI Architecture Business vision Business Processes Presentation APPLICATION Business Logic Data @Eduardo Rodrigues BI Architecture Business vision Business Processes Presentation Business Logic Data @Eduardo Rodrigues BI Architecture Reality Business Processes Operational Presentation Business Logic Operational Data @Eduardo Rodrigues BI Architecture Reality Business Processes Operational Presentation Business Logic Operational Data @Eduardo Rodrigues BI Architecture We´ve ended up with a huge “mess” Business Processes Operational Presentation Business Logic Operational Data @Eduardo Rodrigues BI Architecture We should have started with a planning process to avoid that “mess” Business Processes Business Processes Operational Operational Presentation Presentation Business Logic Business Logic Data Operational Data @Eduardo Rodrigues BI Architecture Information Strategic Planning Business Services / Information Applications Processes Technology Blueprint Processes Information Functional Analysis To long to wait @Eduardo Rodrigues BI Architecture So, this is really what we´ve got! Let’s work with it! Business Processes Operational Presentation Business Logic Operational Data @Eduardo Rodrigues BI Architecture Business Processes Operational Operational Digital Integrated Presentations Presentation Services Platform Business Logic Operational Data @Eduardo Rodrigues BI Architecture Business Processes Operational Operational Digital Integrated Presentations Presentation Services Platform External Services Services (SOA & API) Business Logic Operational Data Operational Data Store @Eduardo Rodrigues BI Architecture Business Processes Decision-making Operational Operational Digital Integrated Presentations Presentation Services Platform Consolidated Information Business Logic Operational Data @Eduardo Rodrigues BI Architecture Business Processes Decision-making Operational Operational Digital Integrated Presentations Presentation Services Platform Consolidated Information Business Logic Plataforma de Dados Operational Data @Eduardo Rodrigues BI Architecture Business Processes Decision-making Operational Operational Digital Apresentações integradas Presentation BI/OLAP Tools Plataforma de Serviços Consolidated Information Business Logic Plataforma de Dados External Data Data Warehouse Operational Data @Eduardo Rodrigues BI Architecture Business Processes Decision-making Operational Operational Digital Presentation Integration Presentation OLAP Tools SOA / API Platform (OnLine Analytical Processing) (Service Oriented Architecture / Application Program Interface) Business Logic Data Analytics Operational Data Platform Platform (Data Warehouse) (ODS) Operational Data @Eduardo Rodrigues BI Architecture “Classic” key layers and components Data Sources Operational Systems @Eduardo Rodrigues BI Architecture “Classic” key layers and components Data Data Sources Integration Operational Systems ETL @Eduardo Rodrigues BI Architecture “Classic” key layers and components Data Data Data Sources Integration Stores Operational Systems ETL Data Warehouse Data Marts @Eduardo Rodrigues BI Architecture “Classic” key layers and components Data Data Data Access Sources Integration Stores Layer Query Operational Systems ETL Data Warehouse Data Marts @Eduardo Rodrigues BI Architecture “Classic” key layers and components Data Data Data Access Analysis Sources Integration Stores Layer Query Dashboard Operational Systems ETL Report BI Tools Data Warehouse Data Mining Data Marts @Eduardo Rodrigues BI Architecture “Classic” key layers and components Data Data Data Access Analysis Consume Sources Integration Stores Layer Query Dashboard Operational Systems ETL Business Users Report BI Tools Data Warehouse Business Analysts Self BI Users Data Mining Data Marts @Eduardo Rodrigues BI Architecture “Classic” key layers and components Data Data Data Access Analysis Consume Sources Integration Stores Layer Query Dashboard Operational Systems ETL Business Users Report BI Tools Data Warehouse Business Analysts Self BI Users Data Mining Data Marts Data Governance (Architecture, Metadata, Data Quality, Security,…) & Infrastructure (BI Tools, DBMS, Cloud …) @Eduardo Rodrigues BI Architecture Data Data Data Access Analysis Consume Sources Integration Stores Layer Query Dashboard Operational Systems ETL Business Users Report BI Tools Data Warehouse Business Analysts Self BI Users Data Mining Data Marts Data Governance (Architecture, Metadata, Data Quality, …) & Infrastructure (BI Tools, DBMS, …) @Eduardo Rodrigues BI Architecture @Eduardo Rodrigues BI Architecture Number of social media users worldwide from 2017 to 2027 (in billions) @Eduardo Rodrigues BI Architecture Most popular social networks worldwide as of April 2023, ranked Number of social media users worldwide from 2017 to 2027 (in billions) by number of active users @Eduardo Rodrigues BI Architecture Amount of data actually stored in data centers worldwide from 2015 to 2021 Byte B = 8bits Kilobyte KB = 210 bytes ≈ 103 bytes Megabyte MB = 220 bytes ≈ 106 bytes Gigabyte GB = 230 bytes ≈ 109 bytes Terabyte TB = 240 bytes ≈ 1012 bytes Petabyte PB = 250 bytes ≈ 1015 bytes Exabyte EB = 260 bytes ≈ 1018 bytes Zettabyte ZB = 270 bytes ≈ 1021 bytes Yottabyte YB = 280 bytes ≈ 1024 bytes @Eduardo Rodrigues BI Architecture @Eduardo Rodrigues BI Architecture Business Processes Decision-making Operational Operational Operational Presentation Integration Presentation External Services OLAP Tools SOA / API Platform (OnLine Analytical Processing) (Service Oriented Architecture / Application Program Interface) External, Business Logic Complex, Huge, Unstructured Data Data Analytics Platform Operational Data (Warehouse / Lakes /…) Platform (ODS) Operational Data @Eduardo Rodrigues BI Architecture So, let’s start all over again - “Classic” key layers and components Data Data Data Access Analysis Consume Sources Integration Stores Layer Query Dashboard Operational Systems ETL Business Users Report BI Tools Data Warehouse Business Analysts Self BI Users Data Mining Data Marts Data Governance (Architecture, Metadata, Data Quality, Security,…) & Infrastructure (BI Tools, DBMS, Cloud …) @Eduardo Rodrigues BI Architecture “Modern” Key layers and components Data Data Data Access Analysis Consume Sources Integration Stores Layer Query Dashboard Operational Systems ETL Business Users Report SaaS BI Tools Data Warehouse Business Analysts Self BI Users WEB Data Mining Sensors IOT Data Marts Audio Video Social Media Data Governance (Architecture, Metadata, Data Quality, Security …) & Infrastructure (BI Tools, Cloud, DBMS, …) @Eduardo Rodrigues BI Architecture “Modern” Key layers and components Data Data Data Access Analysis Consume Sources Integration Stores Layer Query Dashboard Operational Systems ETL Business Users Report SaaS BI Tools Replicate Data Warehouse Business Analysts Self BI Users WEB Data Mining Sensors IOT Stream Data Marts Audio Video Staging Social Media Data Governance (Architecture, Metadata, Data Quality, Security …) & Infrastructure (BI Tools, Cloud, DBMS, …) @Eduardo Rodrigues BI Architecture “Modern” Key layers and components Data Data Data Access Analysis Consume Sources Integration Stores Layer Query Dashboard Operational Data Lake Systems ETL Business Users Report SaaS BI Tools Replicate Data Warehouse Business Analysts Self BI Users WEB Data Mining Sensors IOT Stream Data Marts Audio SandBox Video Staging Social Media ODS Data Governance (Architecture, Metadata, Data Quality, Security …) & Infrastructure (BI Tools, Cloud, DBMS, …) @Eduardo Rodrigues BI Architecture “Modern” Key layers and components Data Data Data Access Analysis Consume Sources Integration Stores Layer Query Dashboard Operational Data Lake Systems ETL Business Users Report SaaS BI Tools Virtualization Replicate Data Warehouse Business Analysts Self BI Users WEB Data Mining Sensors IOT Stream Data Marts Data Service Audio SandBox Video Staging Social Media ODS Data Governance (Architecture, Metadata, Data Quality, Security …) & Infrastructure (BI Tools, Cloud, DBMS, …) @Eduardo Rodrigues BI Architecture “Modern” Key layers and components Data Data Data Access Analysis Consume Sources Integration Stores Layer Query Dashboard Operational Data Lake Systems ETL Business Users Report SaaS BI Tools Virtualization Replicate Data Warehouse Business Analysts Self BI Users WEB Data Mining Sensors IOT Stream Data Marts Data Service Data Audio Science SandBox Alerts Video Staging Social Real Time Media ODS Decisions Data Governance (Architecture, Metadata, Data Quality, Security …) & Infrastructure (BI Tools, Cloud, DBMS, …) @Eduardo Rodrigues BI Architecture “Modern” Key layers and components Data Data Data Access Analysis Consume Sources Integration Stores Layer Query Dashboard Operational Data Lake Systems ETL Business Users Report SaaS BI Tools Virtualization Replicate Data Warehouse Business Analysts Self BI Users WEB Data Mining Sensors IOT Stream Data Marts Data Service Data Audio Science Power Users SandBox Alerts Video Staging Social Real Time Media ODS Decisions Applications Data Governance (Architecture, Metadata, Data Quality, Security …) & Infrastructure (BI Tools, Cloud, DBMS, …) @Eduardo Rodrigues BI Architecture Key layers and components Data Data Data Access Sources Analysis Consume Integration Stores Layer Query Dashboard Operational Data Lake Systems ETL Business Users Report SaaS BI Tools Virtualization Replicate Data Warehouse Business Analysts Self BI Users WEB Data Mining Sensors IOT Stream Data Marts Data Service Data Audio Science Power Users SandBox Alerts Video Staging Social Real Time Media ODS Decisions Applications Data Governance (Architecture, Metadata, Data Quality, Security …) & Infrastructure (BI Tools, Cloud, DBMS, …) @Eduardo Rodrigues BI Architecture Month Week Day Planned Remarks 24 1 October 25 BI Architecture 31 2 Data warehouse & Data Mart Concepts 1 (Holiday) Big Data 7 3 Metadata 8 Master Data Management 14 4 & Continuous Assessment November 15 Exercises 21 5 (Data Modelling) 22 28 6 29 Date of test and delivery of work to students 1 Deadline for students to hand in the work solution December 5 Date for publishing students' grades 9 Regular exam Exams January 7 Resit exam @Eduardo Rodrigues BI Architecture Questions @Eduardo Rodrigues Data Platforms for Analytics Data Warehousing @Eduardo Rodrigues BI Architecture Basic architecture Data Data Data Access Analysis Consume Sources Integration Stores Layer Query Dashboard Operational Systems ETL Business Users Report BI Tools Data Warehouse Business Analysts Self BI Users Data Mining Data Marts Staging Data Governance (Architecture, Metadata, Data Quality, …) & Infrastructure (BI Tools, DBMS, …) @Eduardo Rodrigues Data Warehousing The problem… @Eduardo Rodrigues Data Warehousing Why do we have this problem? @Eduardo Rodrigues Data Warehousing Root causes Data resides in different applications. Semantic/definition issues / lack of catalogs Inconsistent data values / lack of data creation/entry control Data with inconsistent structures (types, encodings, modelling,…) Free-form v structured fields Lack of historical data Data models optimized for transactional purposes … (many more reasons) @Eduardo Rodrigues Data Warehousing OLAP OLTP Business Processes Decision-making Operational Operational Presentation Informational Systems Operational Systems OLAP Tools SOA / API Platform (OnLine Analytical Processing) (Service Oriented Architecture / Application Program Interface) Business Logic ETL Data Analytics Platforms Operational Data Platforms Operational Data @Eduardo Rodrigues Data Warehousing Operational system (OLTP – OnLine Transactional Processing) – a system that is used to run a business in real time, based on current data; also called a system of record Informational system (OLAP- OnLine Analytical Processing) – a system designed to support decision making based on historical point-in-time and prediction data for complex queries and analysis @Eduardo Rodrigues Data Warehousing OLAP OLTP Business Processes Decision-making Operational Operational Presentation Informational Systems Operational Systems OLAP Tools SOA / API Platform (OnLine Analytical Processing) (Service Oriented Architecture / Application Program Interface) Business Logic Data Analytics Platform (Data Warehouse) Operational Data Platforms Operational Data @Eduardo Rodrigues Data Warehousing Data warehouse history The concept of Data Warehousing dates back to 1988 when IBM researchers, Barry Devlin and Paul Murphy, published the article An architecture for a business information system Considered by many to be the Father of Data Warehousing, Bill Inmon first began to discuss the principles around the Data Warehouse and even coined the term in the 1980s. – In 1992, Inmon published Building the Data Warehouse, one of the seminal volumes of the industry Ralph Kimball’s book, The Data Warehouse Toolkit, first published in 1996, that included a host of industry-honed, practical examples for OLAP-style modeling. @Eduardo Rodrigues Data Warehousing What is a Data Warehouse Is a subject-oriented, integrated, time-variant, non-volatile collection of data used in support of management decision making processes ▪ Subject-oriented: focused in customers, products, markets, sales,..., and not in company’s ongoing operations ▪ Integrated: consistent naming conventions, formats, encoding structures; from multiple data sources ▪ Time-variant: data associated with time periods ▪ Non-volatile: read-only, periodically refreshed, never removed data @Eduardo Rodrigues Business Processes Operational Decision-making Data Warehousing Operational Presentation Integration Presentation Subject-Oriented Informational Systems Operational Systems SOA / API Platform OLAP Tools – The operational applications are (OnLine Analytical Processing) (Service Oriented Architecture / Application Program Interface) Business Logic organized around processes/functions ETL such as: Data Analytics Platform (Warehouse/Lakes /…) Operational Data Platform (ODS) Operational Data ▪ Offer product ▪ Book seat ▪ Deliver service ▪ Invoice client – A data warehouse is organized around the major subjects of the business, like: ▪ Customer ▪ Sales ▪ Product ▪ or “strange subjects” such as delays, cancelations, incidents, complaints … @Eduardo Rodrigues Data Warehousing Operational Applications orientation The travel process is supported by a large set of operational systems Simplified Generic Travel Process from End-to–End Develop Park/ Enter departure Dis - Enter Travel Judge awareness Make Travel to pick up Make lounge/ Board Fly embark Passport Reclaim Customs arrival from and reservation airport baggage airplane control baggage the Check–in duty–free airplane hall airport experience need trolley shopping Departure NPS System WebSite Control System Miles Accounting System Reservation Notification System System Lost & Found Ancillaries System System @Eduardo Rodrigues Data Warehousing Operational Applications orientation WHERE IS THE CUSTOMER INFORMATION? Simplified Generic Travel Process from End-to–End Develop Park/ Enter departure Dis - Enter Travel Judge awareness Make Travel to pick up Make lounge/ Board Fly embark Passport Reclaim Customs arrival from and reservation airport baggage airplane control baggage the Check–in duty–free airplane hall airport experience need trolley shopping Departure NPS System WebSite Control System Miles Accounting System Reservation Notification System System Lost & Found Ancillaries System System @Eduardo Rodrigues Data Warehousing Subject orientation (Customer) will allow to support the travel process with another piece of important information Simplified Generic Travel Process from End-to–End Develop Park/ Enter departure Dis - Enter Travel Judge awareness Make Travel to pick up Make lounge/ Board Fly embark Passport Reclaim Customs arrival from and reservation airport baggage airplane control baggage the Check–in duty–free airplane hall airport experience need trolley shopping Target customer and Collect, record and process improve cycle experience customer experience Customer Airlines are supporting now the development of Customer-Centric Information that will allow the creation of integrated customer profiles and implementation of customer relationship processes, neglected during all these years in favor of internal process automation (Operational Systems) @Eduardo Rodrigues Data Warehousing Integrated – Data contained in the warehouse are integrated – Integration involves different aspects, such as: ▪ Consistent definitions ▪ Consistent naming conventions ▪ Consistent measures of variables ▪ Consistent encoding structures ▪ Consistent attributes of data @Eduardo Rodrigues Data Warehousing Reservation System Departure Control System John Smith; M; Id 823654; Gold; ; DOB 2nd of March 2000; Single Lisbon – London - Sydney Smith, John; PN 234123;1 Bag ; Window Seat Lis – Lon 5Dec2020 20/12/05 Frequent Flyer Program John Herbert Smith; Male; PN 234123; Lisbon – London 8Apr2020; OLTP Special Meal DW John Herbert Smith; Male; PN 234123; ID 567876; Gold; Single; DOB 02Mar2000; Lisbon – London – Sydney: 8Apr2020; Special Meal; Window Seat Lisbon – London – Sydney: 5Dec2020; 1 Bag; @Eduardo Rodrigues Data Warehousing Non-Volatile – Operations in operational environment ▪ Insert ▪ Delete ▪ Update ▪ Select – Operational data is volatile … @Eduardo Rodrigues Data Warehousing OLTP Reservations System Book Pax John; Lisbon – London 5Dec2020 @Eduardo Rodrigues Data Warehousing OLTP Reservations System Book Pax John; Lisbon – London 5Dec2020 ReBook Pax John; Lisbon – London 10Dec2020 ReBook Pax John; Lisbon – London 12Dec2020 @Eduardo Rodrigues Data Warehousing OLTP Reservations System Pax John; Lisbon – London 12Dec2020 @Eduardo Rodrigues Data Warehousing OLTP DW Reservations System Book Pax John; Lisbon – London Pax John; Lisbon – London 5Dec2020 5Dec2020 Pax John; Lisbon – London ReBook 10Dec2020 Pax John; Lisbon – London 12Dec2020 Pax John; Lisbon – London 10Dec2020 ReBook Pax John; Lisbon – London 12Dec2020 @Eduardo Rodrigues Data Warehousing Non-Volatile – Operations in operational environment ▪ Insert ▪ Delete ▪ Update ▪ Select – Operations in Data Warehouse environment ▪ Insert (Initial loading) ▪ Select ▪ No updates and deletes (except for technical purposes) @Eduardo Rodrigues Data Warehousing Time-Variant – All data in a Data Warehouse is accurate at some moment in time ▪ Associated time-stamp @Eduardo Rodrigues Data Warehousing OLTP DW Reservations System Pax John; Lisbon – London 22Nov2020 – 16:25 5Dec2020 Pax John; Lisbon – London Pax John; Lisbon – London 26Nov2020 – 09:47 12Dec2020 10Dec2020 Pax John; Lisbon – London 28Nov2020 – 22:12 12Dec2020 @Eduardo Rodrigues Data Warehousing Time-Variant – All data in a Data Warehouse is accurate at some moment in time ▪ Associated time-stamp – Once data is correctly stored in the data warehouse, it cannot be updated or deleted ▪ Data Warehouse data is a long series of snapshots @Eduardo Rodrigues Data Warehousing Need for Data Warehousing Separate of operational and informational systems and data (for improved performance) Focus business discussions on business subjects Define and document business subjects Clean, integrate and provide company-wide view of high-quality information Retain important time stamped, historical information Be the basis for the OLAP environment (On Line Analytical Processing) @Eduardo Rodrigues Data Warehousing Operational system (OLTP – OnLine Transactional Processing) – a system that is used to run a business in real time, based on current data; also called a system of record Informational system (OLAP- OnLine Analytical Processing) – a system designed to support decision making based on historical point-in-time and prediction data for complex queries and analysis But where is the difference? @Eduardo Rodrigues Data Warehousing OLTP data (operational data) OLAP data (data warehouse) Process/Application Oriented Subject Oriented Used to run the business Used to analyse the business Detailed data Detailed, summarized & refined data Current up to date (weeks / months) Snapshot data (years) Isolated data Integrated data Clerical user Knowledge user Few records accessed at a time Large volumes accessed at a time CRUD Access Mostly Read No data redundancy Redundancy inevitable Supports only predefined operations Needs to accommodate ad hoc queries Updates done by users Updates done by ETL batch processing Use fully normalizaed schemas to optimize CRUD Use denormalized schemas to optimize query and guarantee consistency performance Complex data structures Multidimensional data structures Few indexes Many indexes Rarely aggregated Common aggregation @Eduardo Rodrigues Data Warehousing Data Modelling Review Entity – Relationship Diagrams Conceptual, Logical and Physical Models Normalization How to build a Data Warehouse structure? @Eduardo Rodrigues Data Modelling Conceptual model vs Logical model vs Physical model Notation? @Eduardo Rodrigues Data Modelling Entity-Relationship Diagrams (ERD) Entity-relationship model components – Entities – Attributes – Relationships There are many notations and symbols that are unique to ERDs that you need to know in order to read and create them: – UML (Class diagram) – Arrow notation – Barker’s notation The major differences are in the relationship notation – Chen notation – Crow notation –… @Eduardo Rodrigues Data Modelling ERD - Crow notation symbols In a relationship, there are two marks that indicate multiplicity. ▪ The first mark, closest to the end of the line, represents the maximum number of times an instance of an entity can be associated with other instances One time – A short, perpendicular line Many – A three-pronged line ▪ The second mark, behind the first, represents the minimum number of times an instance of an entity can be associated with other instances. The minimum can only be zero or one, and they are referred to as‘optional’ or ‘mandatory’, respectively Optional – An open circle Mandatory – A short, perpendicular line Cardinalities 1..1 0..1 Mandatory One Optional One 1..* 0..* Mandatory Many Optional Many @Eduardo Rodrigues Data Modelling ERD - Crow notation symbols In a relationship, there are two marks that indicate multiplicity. ▪ The first mark, closest to the end of the line, represents the maximum number of times an instance of an entity can be associated with other instances One time – A short, perpendicular line Many – A three-pronged line ▪ The second mark, behind the first, represents the minimum number of times an instance of an entity can be associated with other instances. The minimum can only be zero or one, and they are referred to as‘optional’ or ‘mandatory’, respectively Optional – An open circle Mandatory – A short, perpendicular line @Eduardo Rodrigues Data Modelling E-R Model Constructs Order conceptual model example @Eduardo Rodrigues Data Warehousing Data Modelling Review Entity – Relationship Diagrams Conceptual, Logical and Physical Models Normalization How to build a Data Warehouse structure? @Eduardo Rodrigues Data Modelling Levels of Data Models @Eduardo Rodrigues Data Modelling Levels of Data Models Defines WHAT the system contains. This model is typically created by Business stakeholders and Data Architects. The purpose is to organize, scope and define business concepts and rules. @Eduardo Rodrigues Data Warehousing Business Processes Operational Decision-making Operational Operational Presentation Integration Presentation Informational Systems Operational Systems SOA / API Platform OLAP Tools (OnLine Analytical Processing) (Service Oriented Architecture / Application Program Interface) Business Logic ETL Data Warehouse Operational Data Platform (ODS) Operational Data @Eduardo Rodrigues Business Processes Data Warehousing Decision-making Operational Presentation Integration Operational Presentation Informational Systems Operational Systems ▪ Subject-oriented: focused in customers, products, markets, sales,..., and not in company’s ongoing OLAP Tools (OnLine Analytical Processing) SOA / API Platform (Service Oriented Architecture / operations Business Logic Application Program Interface) ▪ Integrated: consistent naming conventions, formats, encoding structures; from multiple data sources ETL Data Operational Data Platform ▪ Time-variant: data associated with time periods Warehouse Operational Data (ODS) ▪ Non-volatile: read-only, periodically refreshed, never removed data @Eduardo Rodrigues Business Processes Data Warehousing Decision-making Operational Presentation Integration Operational Presentation Informational Systems Operational Systems ▪ Subject-oriented: focused in customers, products, markets, sales,..., and not in company’s ongoing OLAP Tools (OnLine Analytical Processing) SOA / API Platform (Service Oriented Architecture / operations Business Logic Application Program Interface) ▪ Integrated: consistent naming conventions, formats, encoding structures; from multiple data sources ETL Data Operational Data Platform ▪ Time-variant: data associated with time periods Warehouse Operational Data (ODS) ▪ Non-volatile: read-only, periodically refreshed, never removed data Operational data and Informational data are completely different. @Eduardo Rodrigues Business Processes Data Warehousing Decision-making Operational Presentation Integration Operational Presentation Informational Systems Operational Systems ▪ Subject-oriented: focused in customers, products, markets, sales,..., and not in company’s ongoing OLAP Tools (OnLine Analytical Processing) SOA / API Platform (Service Oriented Architecture / operations Business Logic Application Program Interface) ▪ Integrated: consistent naming conventions, formats, encoding structures; from multiple data sources ETL Data Operational Data Platform ▪ Time-variant: data associated with time periods Warehouse Operational Data (ODS) ▪ Non-volatile: read-only, periodically refreshed, never removed data Operational data and Informational data are completely different. Data Modelling Review Entity – Relationship Diagrams Conceptual, Logical and Physical Models Normalization @Eduardo Rodrigues Data Modelling Levels of Data Models Defines WHAT the system contains. This model is typically created by Business stakeholders and Data Architects. The purpose is to organize, scope and define business concepts and rules. @Eduardo Rodrigues Data Modelling Conceptual model vs Logical model vs Physical model Conceptual Data Model is an organized view of database concepts and their relationships. The purpose of creating a conceptual data model is to establish entities, their main attributes, and their relationships. In this data modeling level, there is hardly any detail available on the actual database structure. Business stakeholders and data architects typically create a conceptual data model. The 3 basic tenants of Conceptual Data Model are: Entity: A real-world thing Attribute: Main characteristics or properties of an entity Relationship: Dependency or association between two entities Characteristics of a conceptual data model: Offers organisational-wide coverage of the business concepts. Designed for a business audience. Independent of technology or vendor specifications The focus is to represent data as a user will see it in the “real world.” @Eduardo Rodrigues Data Modelling Conceptual model vs Logical model vs Physical model ERD feature Conceptual Logical Physical Entity Relevant ones Yes Yes (concepts) Relationship Yes (cardinality is Yes Yes optional) Attributes Optional / Yes Yes Relevant ones Attributes’s Type Optional Yes 3NF Yes Yes Resolve many-to-many relationships Yes / Optional Yes Primary Key Yes Yes Foreign Key Optional Yes Technical details (views, indexes, Yes default values, DBMS dependent…) @Eduardo Rodrigues Data Modelling Let's draw a simple conceptual model that relates some very basic entities such as Students, Classes, Teachers, Class Subjects: - Class: Represents a lesson that is taught at a certain time, by a certain teacher to a certain audience of students and with a certain subject matter - Student: Represents students enrolled in classes. - Teacher: Represents teachers who may teach one or multiple classes. - Subject: Represents academic subjects which can be taught in one or more classes. @Eduardo Rodrigues Data Modelling Let's draw a simple conceptual model that relates some very basic entities such as Students, Classes, Teachers, Class Subjects: - Class: Represents a lesson that is taught at a certain time, by a certain teacher to a certain audience of students and with a certain subject matter - Student: Represents students enrolled in classes (mandatory to be enrolled in at least one class). - Teacher: Represents teachers who may teach one or multiple classes. - Subject: Represents academic subjects which can be taught in one or more classes. 09:00 10:30 12:00 Class - Teacher: A many-to-one relationship: a Tea1, Sub1 Tea1, Sub1 class has one teacher, but a teacher can teach Class Stud 1 Stud 4 Stud 2 Stud 5 multiple classes. Class - Subject: A many-to-one relationship: a Tea2, Sub2 Tea2, Sub1 class corresponds to one subject, but a subject can Stud 1 Stud 1 be presented in multiple classes. Stud 4 Stud 6 Class - Student: A many-to-many relationship: a class can have multiple students, and a student can be enrolled in multiple classes (at the same time?). Tea3, Sub2 Tea3, Sub3 Subject - Teacher: A many-to-many relationship: Stud 2 Stud 2 a subject can be taught by several teachers and a Stud 6 Stud 3 teacher can teach several subjects. @Eduardo Rodrigues Data Modelling Let's draw a simple conceptual model that relates some very basic entities such as Students, Classes, Teachers, Class Subjects Conceptual model vs Logical model vs Physical model - Class: Represents a lesson that is taught at a certain time, by a certain teacher to a certain audience of students and with a certain subject matter - Student: Represents students enrolled in classes. - Teacher: Represents teachers who may teach one or multiple classes. - Subject: Represents academic subjects which can be taught in one or more classes. @Eduardo Rodrigues Data Modelling Levels of Data Models Defines WHAT the system contains. This model is typically created by Business stakeholders and Data Architects. The purpose Defines is tothe HOW organize, systemscope shouldand bedefine implemented business concepts regardless and rules. of the RDBMS. This model is typically created by Data Architects and Business Analysts. The purpose is to add detail to business definitions and to develop technical map of rules and data structures. @Eduardo Rodrigues Data Modelling Conceptual model vs Logical model vs Physical model Logical Data Model adds further information to the conceptual data model elements and introduces important aspects of data elements structure, providing a foundation to form the base for the Physical model, as well. However, the modeling structure remains generic. Characteristics of a Logical data model Describes data needs for a single project but could integrate with other logical data models based on the scope of the project. Data attributes may already have datatypes with precisions and length. Normalization processes to the model is applied typically till 3NF. @Eduardo Rodrigues Data Modelling Conceptual model vs Logical model vs Physical model ERD feature Conceptual Logical Physical Entity Relevant ones Yes Yes (concepts) Relationship Yes (cardinality is Yes Yes optional) Attributes Optional / Yes Yes Relevant ones Attributes’s Type Optional Yes 3NF Yes Yes Resolve many-to-many relationships Yes / Optional Yes Primary Key Yes Yes Foreign Key Yes Yes Technical details (views, indexes, Yes default values, DBMS dependent…) @Eduardo Rodrigues Data Modelling Conceptual model vs Logical model vs Physical model @Eduardo Rodrigues Data Modelling Levels of Data Models Defines WHAT the system contains. This model is typically created by Business stakeholders and Data Architects. The purpose is to organize, scope and define Defines HOW business the system concepts should be and rules. implemented regardless of the DBMS. This model is typically created by Data Architects and Business Analysts. The purpose is to developed technical map of rules and data Describes structures.HOW the system will be implemented using a specific DBMS system. This model is typically created by DBA and developers. The purpose is actual implementation of the database. @Eduardo Rodrigues Data Modelling Conceptual model vs Logical model vs Physical model Physical Data Model describes a database-specific implementation of the data model. Helps in visualizing database structure by replicating database column keys, constraints, indexes, triggers, and other RDBMS features. Characteristics of a physical data model: Developed for a specific version of a RDBMS, location, data storage or technology to be used in the project. Columns should have exact datatypes, lengths assigned and default values. Primary and Foreign keys, views, indexes, access profiles, and authorizations, etc. are defined. ….. @Eduardo Rodrigues Data Modelling Conceptual model vs Logical model vs Physical model ERD feature Conceptual Logical Physical Entity Relevant ones Yes Yes (concepts) Relationship Yes (cardinality is Yes Yes optional) Attributes Optional / Yes Yes Relevant ones Attributes’s Type Optional Yes 3NF Yes Yes Resolve many-to-many relationships Yes / Optional Yes Primary Key Yes Yes Foreign Key Yes Yes Technical details (views, indexes, Yes default values, DBMS dependent…) @Eduardo Rodrigues Data Modelling Conceptual model vs Logical model vs Physical model @Eduardo Rodrigues Data Modelling Data Modelling Review Conceptual, Logical and Physical Models Entity – Relationship Diagrams Normalization @Eduardo Rodrigues Normalization Normalization vs Denormalization Database normalization is the process of removing redundant data from the tables to improve storage efficiency, data integrity and scalability. Normalization generally involves splitting/constructing tables and setting up relations according with normalization rules Normalization will minimize the insertion, deletion and update anomalies through eliminating the redundant data and inconsistencies Operational/Transactional Systems use, generally, normalized DBs Database denormalization is the inverse process where redundancy is added to the data to improve performance of specific data queries and analysis BI Systems use, in most cases, denormalized DBs MIS 57 Normalization Normalized vs Denormalized data Imagine we want to analyze orders placed by customers Informational Data @Eduardo Rodrigues Normalization Normalized vs Denormalized data Imagine we want to analyze orders placed by customers Transactional Data Informational Data @Eduardo Rodrigues Normalization Normalized Example Teacher_Nr Name Category Salary Discipline Associate 10234 José Silva Professor 3.000 € MIS Auxiliar 10345 Tomás Ribeiro Professor 2.500 € MIS Associate 11908 Maria Marques Professor 3.000 € Mathematics 16000 Carlos Cesar Assistant 1.500 € MIS Auxiliar 17120 Ana Lopes Professor 2.500 € Mathematics @Eduardo Rodrigues Normalization Normalized Example Teacher_Nr Name Category Salary Discipline Associate 10234 José Silva Professor 3.000 € MIS Auxiliar 10345 Tomás Ribeiro Professor 2.500 € MIS Associate 11908 Maria Marques Professor 3.000 € Mathematics 16000 Carlos Cesar Assistant 1.500 € MIS Auxiliar Denormalized Example 17120 Ana Lopes Professor 2.500 € Mathematics Teacher_ Category_Id Category Salary Nr Name Category_Id Discipline_Id Discipline_Id Discipline Associate 10234 José Silva 110 12 12 MIS 110 Professor 3.000 € 10345 Tomás Ribeiro 230 12 38 Maths 180 Assistant 1.500 € Auxiliar 11908 Maria Marques 110 38 230 Professor 2.500 € 16000 Carlos Cesar 180 12 17120 Ana Lopes 230 38 MIS 61 Normalization Normalization Example Insert problems: If we insert a Teacher_Nr Name Category Salary Discipline new Teacher and by error we Associate introduce the Category Asociate 10234 José Silva Professor 3.000 € MIS Professor we will end up with a Auxiliar 10345 Tomás Ribeiro Professor 2.500 € MIS new non-existent category. Associate 11908 Maria Marques Professor 3.000 € Mathematics Update problems: To change the 16000 Carlos Cesar Assistant 1.500 € MIS salary of a category we have to Auxiliar change several lines (all the 17120 Ana Lopes Professor 2.500 € Mathematics Teachers with that category) Deletion problems: If we delete Carlos Cesar’s record we will no longer have information regarding the salary of an Assistant MIS 62 Normalization Normalization Example Insert problems: If we insert a new Teacher and by error we introduce the Category Teacher_Nr Name Category Salary Discipline Asociate Professor we will end up with a new Associate non-existent category. 10234 José Silva Professor 3.000 € MIS Auxiliar Update problems: To change the salary of a 10345 Tomás Ribeiro Professor 2.500 € MIS category we have to change several lines (all Associate the Teachers with that category) 11908 Maria Marques Professor 3.000 € Mathematics Deletion problems: If we delete Carlos 16000 Carlos Cesar Assistant 1.500 € MIS Cesar’s record we will no longer have Auxiliar information regarding the salary of an 17120 Ana Lopes Professor 2.500 € Mathematics Assistant Teacher_ Category_Id Category Salary Nr Name Category_Id Discipline_Id Discipline_Id Discipline Associate 10234 José Silva 110 12 12 MIS 110 Professor 3.000 € 10345 Tomás Ribeiro 230 12 38 Maths 180 Assistant 1.500 € Auxiliar 11908 Maria Marques 110 38 230 Professor 2.500 € 16000 Carlos Cesar 180 12 17120 Ana Lopes 230 38 MIS 63 Normalization Normalized vs Denormalized data MIS 64 Normalization Normal Forms 1NF Edgar Codd originally established 2NF 3 normal forms designated 1NF, 3NF 2NF and 3NF. Boyce-Codd NF There are now other NF that are 4NF generally accepted but the first 3 5NF ones are widely considered to be sufficient for most applications. Most tables when reaching 3NF are also in BCNF (Boyce-Codd Normal Form) MIS 65 Data Modelling Normalization vs Denormalization BASIS FOR COMPARISON NORMALIZATION DENORMALIZATION Basic Normalization is the process Denormalization is the process of creating a set schema to of combining the data so that store non-redundant and it can be queried speedily. consistent data. Purpose To reduce the data redundancy To achieve the faster and inconsistency. execution of the queries through introducing redundancy. Used in OLTP system, where the OLAP system, where the emphasize is on making the emphasis is on making the insert, delete and update search and analysis faster. anomalies faster and storing the quality data. Data integrity Maintained May not retain Redundancy Eliminated Added Number of tables and joins Increases Decreases Disk space Optimized usage Wastage @Eduardo Rodrigues Data Warehousing Data Modelling Review Conceptual, Logical and Physical Models Entity – Relationship Diagrams Normalization How to build a Data Warehouse structure? @Eduardo Rodrigues Data Warehousing BI Architecture Data Data Data Access Analysis Consume Sources Integration Stores Layer Query Dashboard Operational Systems ETL Business Users Report BI Tools Data Warehouse Business Analysts Self BI Users Data Mining Data Marts Staging Data Governance (Architecture, Metadata, Data Quality, …) & Infrastructure (BI Tools, DBMS, …) @Eduardo Rodrigues Data Warehousing What is a Data Mart – A subset of the data warehouse which is usually oriented to a specific business area or subject like a business line, team, market, etc, containing all important related facts and dimensions. – Data marts make specific data available to a defined group of users, which allows those users to quickly access critical insights without wasting time searching through an entire data warehouse. – In general, companies develop data marts that align with specific departments in the business, such as finance, sales or marketing. @Eduardo Rodrigues Data Warehousing Sales Data Mart @Eduardo Rodrigues Data Warehousing Data Warehouse versus Data Mart Data Warehouse Data Mart Informational application independency Specific of an informational application or purpose Centralized, enterprise-wide Decentralized by business/user area or subject Planned Possibly not planned Data is historical (detailed & Data has history needed (detailed & summarized) summarized) Data it lightly denormalized Data is highly denormalized Multiple subjects One central subject or concern to users @Eduardo Rodrigues Data Warehousing Data Warehouse Architectures Independent Data Mart Dependent Data Mart and Operational Data Store Mixed Approach @Eduardo Rodrigues Data Warehousing Dependent Data Mart Data Warehousing Architecture Centralized enterprise-wide data warehouse (EDW), and the data marts are loaded from this EDW, that serves as the control point and single source of all data @Eduardo Rodrigues Data Warehousing Independent Data Mart Data Warehousing Architecture Data marts are mini-warehouses, limited in scope. Separate E T L for each independent data mart. Data access complexity due to multiple data marts @Eduardo Rodrigues Data Warehousing Independent vs Dependent Data Mart Data Warehousing Architecture @Eduardo Rodrigues Data Warehousing Independent vs Dependent Data Mart Data Warehousing Architecture There is a central integrated DW with the overall picture of the entire organization DW serves as the control point and single source of all data made available to end users The dependent data mart approach is called a “hub and spoke” architecture More costly and complex DW structure Bill Inmon designated this approach as “Corporate Information Factory” @Eduardo Rodrigues Data Warehousing Independent vs Dependent Data Mart Data Warehousing Architecture Although Kimball defends this approach, others including Bill Inmon find it faulty. There is no central DW with the overall picture of the entire organization Separate ETL process for each data mart leads to redundant data and processing efforts Inconsistency between data marts Difficult to drill down for related facts between data marts Excessive scaling costs the more applications are built High cost for obtaining consistency between marts @Eduardo Rodrigues Rent-a-car Company Data Warehouse Example @Eduardo Rodrigues Rent-a-car Company Data Mart Example - Sales At which location does the company generate most revenue? How’ the revenue changing through time? Which car or type of cars generate more revenue? @Eduardo Rodrigues Rent-a-car Company Data Mart Example - Damages Which cars and equipment are damaged the most? Which damages are most costly? Do more damages occur in certain locations? @Eduardo Rodrigues Rent-a-car Company Data Mart Example - Overbookings Which cars are overbooked? How much did the overbooking cost? @Eduardo Rodrigues Hotel Chain Logical Data Model Example @Eduardo Rodrigues Hotel Chain Data Mart Example @Eduardo Rodrigues Hotel Chain Data Mart Example @Eduardo Rodrigues Data Warehousing Derived Data Data Marts are oriented to business subjects, simple and