Data Warehousing Concepts and Models PDF
Document Details
Uploaded by CaptivatingConnotation7819
Prasad V. Potluri Siddhartha Institute of Technology
Tags
Summary
This document provides a detailed introduction to data warehousing, explaining its characteristics, types of data warehouses (enterprise warehouse, data mart, virtual), and various models (star, snowflake, and fact constellations). It also compares OLTP and OLAP systems in terms of their functionalities, data structures, and performance.
Full Transcript
UNIT-2 Data Warehousing: Introduction, Definition, three-tier data warehousing architecture, OLTP vs OLAP Systems, Data Warehouse Models(Enterprise Warehouse, Data Mart, and Virtual), Warehouse Schemas for Multidimensional Data Models (Stars, Snowflakes, and Fact...
UNIT-2 Data Warehousing: Introduction, Definition, three-tier data warehousing architecture, OLTP vs OLAP Systems, Data Warehouse Models(Enterprise Warehouse, Data Mart, and Virtual), Warehouse Schemas for Multidimensional Data Models (Stars, Snowflakes, and Fact Constellations), Typical OLAP Operations. Data Warehouse: A data warehouse is an enterprise system used for the analysis and reporting of structured and semi-structured data from multiple sources, such as point-of-sale transactions, marketing automation, customer relationship management, and more. A data warehouse is suited for ad hoc analysis as well custom reporting. A data warehouse can store both current and historical data in one place and is designed to give a long-range view of data over time, making it a primary component of business intelligence. The term "Data Warehouse" was first coined by Bill Inmon in 1990. According to William H.Inmon, a leading architect in the construction of data warehouse systems, “A data warehouse is a subject-oriented, integrated, time- variant, and nonvolatile collection of data in support of management’s decision making process”. Characteristics of Data ware house/terms of data warehouse Subject-oriented: A data warehouse is organized around major subjects such as customer, supplier, product, and sales. Rather than concentrating on the day-to-day operations and transaction processing of an organization, a data warehouse focuses on the modeling and analysis of data for decision makers. Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and online transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, and so on. Time-variant: Data are stored to provide information from an historic perspective (e.g., the past 5–10 years). Every key structure in the data warehouse contains, either implicitly or explicitly, a time element. Nonvolatile: A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment. Due to this separation, a data warehouse does not require transaction processing, recovery, and concurrency control mechanisms. It usually requires only two operations in data accessing: initial loading of data and access of data. Operational Database Systems and Data Warehouse Operational Database: The Operational Database is the source of data for the information distribution center. It incorporates point by point data utilized to run the day to day operations of the trade. The information as often as possible changes as upgrades are made and reflect the current esteem of the final transactions. Operational Database Administration Frameworks too called as OLTP (Online Transactions Processing Databases) Data Warehouse: A Data Warehouse is a system that is used by the users or knowledge managers for data analysis and decision-making. It can construct and present the data in a certain structure to fulfill the diverse requirements of several users. Data warehouses are also known as Online Analytical Processing (OLAP) Systems. In a data warehouse or OLAP system, the data is saved in a format that allows the effective creation of data mining documents. The data structure in a data warehousing has demoralized schema. Performance-wise, data warehouses are quite fast when it comes to analyzing queries. Data warehouse systems do the integration of several application systems. These systems then provide data processing by supporting a solid platform of consolidated historical data for analysis. Differences between operational database systems and data warehouses operational database data warehouse systems Purpose Operational database systems data warehouses are are used to support day-to- used to support day operations of an decision-making and organization analysis activities. Data Structure Operational database systems Data warehouses, on the typically have a normalized other hand, typically data structure, which means have a de normalized that the data is organized into data structure, which many related tables to reduce means that the data is data redundancy and improve organized into fewer data consistency. tables optimized for reporting and analysis. Data Volume Operational database systems which may store years of typically store a smaller historical data(Large volume of data compared to amount of data will data warehouses store) Performance Operational database systems ata warehouses, on the are optimized for transaction other hand, are processing and are designed optimized for querying to support high-volume, high- and reporting and are speed transaction processing. designed to support complex analytical queries that may involve large volumes of data. Differences between OLAP and OLTP Criteria OLAP OLTP Purpose OLAP helps you analyze OLTP helps you manage large volumes of data to and process real-time support decision- transactions. making. Data source OLAP uses historical and OLTP uses real-time and aggregated data from transactional data from multiple sources. a single source. Data structure OLAP uses OLTP uses relational multidimensional (cubes) databases. or relational databases. Data model OLAP uses star schema, OLTP uses normalized or snowflake schema, or de normalized models. other analytical models. Volume of data OLAP has large storage OLTP has comparatively requirements. Think smaller storage terabytes (TB) and peta requirements. Think bytes (PB). gigabytes (GB). Response time OLAP has longer OLTP has shorter response times, typically response times, typically in seconds or minutes. in milliseconds Example applications OLAP is good for OLTP is good for analyzing trends, processing payments, predicting customer customer data behavior, and identifying management, and order profitability. processing. Three-tier architecture of data warehouse The Three-Tier Data Warehouse Architecture is the commonly used Data Warehouse design in order to build a Data Warehouse by including the required Data Warehouse Schema Model, the required OLAP server type, and the required front-end tools for Reporting or Analysis purposes, which as the name suggests contains three tiers such as Top tier, Bottom Tier and the Middle Tier that are procedurally linked with one another from Bottom tier(data sources) through Middle tier(OLAP servers) to the Top tier(Front-end tools). Data Warehouse Architecture is the design based on which a Data Warehouse is built, to accommodate the desired type of Data Warehouse Schema, user interface application and database management system, for data organization and repository structure. The type of Architecture is chosen based on the requirement provided by the project team. Three-tier Data Warehouse Architecture is the commonly used choice, due to its detailing in the structure. The three different tiers here are termed as: Top-Tier Middle-Tier Bottom-Tier Each Tier can have different components based on the prerequisites presented by the decision-makers of the project but are subject to the novelty of their respective tier. 1. Bottom Tier The Bottom Tier in the three-tier architecture of a data warehouse consists of the Data Repository. Data Repository is the storage space for the data extracted from various data sources, which undergoes a series of activities as a part of the ETL process. ETL stands for Extract, Transform and Load. As a preliminary process, before the data is loaded into the repository, all the data relevant and required are identified from several sources of the system. These data are then cleaned up, to avoid repeating or junk data from its current storage units. The next step is to transform all these data into a single format of storage. The final step of ETL is to Load the data on the repository. Few commonly used ETL tools are: Informatica Microsoft SSIS Snaplogic The storage type of the repository can be a relational database management system or a multidimensional database management system. A relational database system can hold simple relational data, whereas a multidimensional database system can hold data that more than one dimension. Whenever the Repository includes both relational and multidimensional database management systems, there exists a metadata unit. As the name suggests, the metadata unit consists of all the metadata fetched from both the relational database and multidimensional database systems. Meta data: This Metadata unit provides incoming data to the next tier, that is, the middle tier. From the user’s standpoint, the data from the bottom tier can be accessed only with the use of SQL queries. The complexity of the queries depends on the type of database. Data from the relational database system can be retrieved using simple queries, whereas the multidimensional database system demands complex queries with multiple joins and conditional statements. 2. Middle Tier The Middle tier here is the tier with the OLAP servers. The Data Warehouse can have more than one OLAP server, and it can have more than one type of OLAP server model as well, which depends on the volume of the data to be processed and the type of data held in the bottom tier. There are three types of OLAP server models, such as: ROLAP Relational online analytical processing is a model of online analytical processing which carries out an active multidimensional breakdown of data stored in a relational database, instead of redesigning a relational database into a multidimensional database. This is applied when the repository consists of only the relational database system in it. MOLAP Multidimensional online analytical processing is another model of online analytical processing that catalogs and comprises of directories directly on its multidimensional database system. This is applied when the repository consists of only the multidimensional database system in it. HOLAP Hybrid online analytical processing is a hybrid of both relational and multidimensional online analytical processing models. When the repository contains both the relational database management system and the multidimensional database management system, HOLAP is the best solution for a smooth functional flow between the database systems. HOLAP allows storing data in both the relational and the multidimensional formats. The Middle Tier acts as an intermediary component between the top tier and the data repository, that is, the top tier and the bottom tier respectively. From the user’s standpoint, the middle tier gives an idea about the conceptual outlook of the database. 3. Top Tier The Top Tier is a front-end layer, that is, the user interface that allows the user to connect with the database systems. This user interface is usually a tool or an API call, which is used to fetch the required data for Reporting, Analysis, and Data Mining purposes. The type of tool depends purely on the form of outcome expected. It could be a Reporting tool, an Analysis tool, a Query tool or a Data mining tool.It is essential that the Top Tier should be uncomplicated in terms of usability. Only user-friendly tools can give effective outcomes. Even when the bottom tier and middle tier are designed with at most cautiousness and clarity, if the Top tier is enabled with a bungling front-end tool, then the whole Data Warehouse Architecture can become an utter failure. This makes the selection of the user interface/ front-end tool as the Top Tier, which will serve as the face of the Data Warehouse system, a very significant part of the Three-Tier Data Warehouse Architecture designing process. Multidimensional Data Model The multi-Dimensional Data Model is a method which is used for ordering data in the database along with good arrangement and assembling of the contents in the database. Data warehouses and OLAP tools are based on a multidimensional data model. This model views data in the form of a data cube Data cube Data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts. Dimensions Dimensions are the perspectives or entities with respect to which an organization wants to keep records. Dimension table These dimensions allow the store to keep track of things like monthly sales of items and the branches and locations at which the items were sold. Each dimension may have a table associated with it, called a dimension table, which further describes the dimension. For example, a dimension table for item may contain the attributes item name, brand, and type. Dimension tables can be specified by users or experts, or automatically generated and adjusted based on data distributions. Fact table: A multidimensional data model is typically organized around a central theme, such as sales. This theme is represented by a fact table. Facts are numeric measures. Think of them as the quantities by which we want to analyze relationships between dimensions. Examples of facts for a sales data warehouse include dollars sold (sales amount in dollars), units sold (number of units sold), and amount budgeted. The fact table contains the names of the facts, or measures, as well as keys to each of the related dimension tables. Example: 2-D data cube that is, in fact, a table or spreadsheet for sales data from All Electronics. In particular, we will look at the All Electronics sales data for items sold per quarter in the city of Vancouver. These data are shown in Table 4.2. In this 2-D representation, the sales for Vancouver are shown with respect to the time dimension (organized in quarters) and the item dimension (organized according to the types of items sold). The fact or measure displayed is dollars sold (in thousands). Cuboid: A data cube is a data structure that, contrary to tables and spreadsheets, can store data in more than 2 dimensions. They are mainly used for fast retrieval of aggregated data. The key elements of a data cube are dimensions, attributes, facts and measures. The cuboid that holds the lowest level of summarization is called the base cuboid. For example, the 4-D cuboid in Figure 4.4 is the base cuboid for the given time, item, location, and supplier dimensions. Figure 4.3 is a 3-D (nonbase) cuboid for time, item. Working on a Multidimensional Data Model Multidimensional Data Model works. The following stages should be followed by every project for building a Multi Dimensional Data Model : Stage 1 : Assembling data from the client : In first stage, a Multi Dimensional Data Model collects correct data from the client. Mostly, software professionals provide simplicity to the client about the range of data which can be gained with the selected technology and collect the complete data in detail. Stage 2 : Grouping different segments of the system : In the second stage, the Multi Dimensional Data Model recognizes and classifies all the data to the respective section they belong to and also builds it problem-free to apply step by step. Stage 3 : Noticing the different proportions : In the third stage, it is the basis on which the design of the system is based. In this stage, the main factors are recognized according to the user’s point of view. These factors are also known as “Dimensions”. Stage 4 : Preparing the actual-time factors and their respective qualities : In the fourth stage, the factors which are recognized in the previous step are used further for identifying the related qualities. These qualities are also known as “attributes” in the database. Stage 5 : Finding the actuality of factors which are listed previously and their qualities : In the fifth stage, A Multi Dimensional Data Model separates and differentiates the actuality from the factors which are collected by it. These actually play a significant role in the arrangement of a Multi Dimensional Data Model. Stage 6 : Building the Schema to place the data, with respect to the information collected from the steps above : In the sixth stage, on the basis of the data which was collected previously, a Schema is built. Let us consider the data according to item, time and location (like Kolkata, Delhi, Mumbai). Here is the table : Features of multidimensional data models: Measures: Measures are numerical data that can be analyzed and compared, such as sales or revenue. They are typically stored in fact tables in a multidimensional data model. Dimensions: Dimensions are attributes that describe the measures, such as time, location, or product. They are typically stored in dimension tables in a multidimensional data model. Cubes: Cubes are structures that represent the multidimensional relationships between measures and dimensions in a data model. They provide a fast and efficient way to retrieve and analyze data. Aggregation: Aggregation is the process of summarizing data across dimensions and levels of detail. This is a key feature of multidimensional data models, as it enables users to quickly analyze data at different levels of granularity. Drill-down and roll-up: Drill-down is the process of moving from a higher- level summary of data to a lower level of detail, while roll-up is the opposite process of moving from a lower-level detail to a higher-level summary. These features enable users to explore data in greater detail and gain insights into the underlying patterns. Hierarchies: Hierarchies are a way of organizing dimensions into levels of detail. For example, a time dimension might be organized into years, quarters, months, and days. Hierarchies provide a way to navigate the data and perform drill-down and roll-up operations. OLAP (Online Analytical Processing): OLAP is a type of multidimensional data model that supports fast and efficient querying of large datasets. OLAP systems are designed to handle complex queries and provide fast response times. Advantages of Multi Dimensional Data Model The following are the advantages of a multi-dimensional data model : A multi-dimensional data model is easy to handle. It is easy to maintain. Its performance is better than that of normal databases (e.g. relational databases). The representation of data is better than traditional databases. That is because the multi-dimensional databases are multi-viewed and carry different types of factors. It is workable on complex systems and applications, contrary to the simple one-dimensional database systems. The compatibility in this type of database is an upliftment for projects having lower bandwidth for maintenance staff. Disadvantages of Multi Dimensional Data Model The following are the disadvantages of a Multi Dimensional Data Model : The multi-dimensional Data Model is slightly complicated in nature and it requires professionals to recognize and examine the data in the database. During the work of a Multi-Dimensional Data Model, when the system caches, there is a great effect on the working of the system. It is complicated in nature due to which the databases are generally dynamic in design. The path to achieving the end product is complicated most of the time. As the Multi Dimensional Data Model has complicated systems, databases have a large number of databases due to which the system is very insecure when there is a security break. Data Warehouse Modeling Data warehouse modeling is the process of designing the schemas of the detailed and summarized information of the data warehouse. The goal of data warehouse modeling is to develop a schema describing the reality, or at least a part of the fact, which the data warehouse is needed to support. Data warehouse modeling is an essential stage of building a data warehouse for two main reasons. Firstly, through the schema, data warehouse clients can visualize the relationships among the warehouse data, to use them with greater ease. Secondly, a well-designed schema allows an effective data warehouse structure to emerge, to help decrease the cost of implementing the warehouse and improve the efficiency of using it. Data modeling in data warehouses is different from data modeling in operational database systems. The primary function of data warehouses is to support DSS processes. Thus, the objective of data warehouse modeling is to make the data warehouse efficiently support complex queries on long term information. Data Warehouse Models: 1. Enterprise Warehouse 2. Data Mart 3. Virtual Warehouse. 1.Enterprise warehouse: An enterprise warehouse collects all of the information about subjects spanning the entire organization. It provides corporate-wide data integration, usually from one or more operational systems or external information providers, and is cross-functional in scope. It typically contains detailed data as well as summarized data, and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond. An enterprise data warehouse may be implemented on traditional mainframes, computer super servers, or parallel architecture platforms. It requires extensive business modeling and may take years to design and build. 2.Data mart: A data mart is a specialized subset of a data warehouse focused on a specific functional area or department within an organization. It provides a simplified and targeted view of data, addressing specific reporting and analytical needs. Data marts are smaller in scale and scope, typically holding relevant data for a specific group of users, such as sales, marketing, or finance. They are organized around specific subjects, such as sales, customer data, or product information, and are structured, transformed, and optimized for efficient querying and analysis within the domain. However, it may involve complex integration in the long run if its design and planning were not enterprise-wide. Data marts are three types. They are 1. Independent data mart 2. Dependent data mart 3. Hybrid data mart 1.Independent data marts An independent data mart is created and maintained separately from the data warehouse. It is created to satisfy the particular needs of a specific business unit or department. Independent data marts are typically smaller in size and more rapidly and readily set up. They offer flexibility and agility since they are not constrained by the challenges of the centralized data warehouse. Independent Data Mart is created directly from external sources instead of data warehouse. 2. Dependent data marts A dependent data mart is generated right out of a data warehouse. It takes some of the data from the data warehouse and arranges it to meet the needs of a specific industry. Dependent data marts, which profit from the data integration, data quality, and consistency provided by the data warehouse, allow for the centralization and preservation of all data in a single source of truth. They are often developed to serve particular reporting and analytical needs, and they are frequently updated from the data warehouse. Dependent data marts offer data consistency and prevent data duplication because they rely on the data warehouse as their main source of data. 3.Hybrid Data mart: This type of Data Mart is created by extracting data from operational source or from data warehouse. 1Path reflects accessing data directly from external sources and 2Path reflects dependent data model of data mart. Both independent and dependent data mart components can be found in a hybrid data mart. As well as combining additional data sources particular to a given business unit or department, it makes use of the centralized data warehouse for the integration and consistency of the core data. By offering flexibility and agility for department-specific needs while keeping the integrity and consistency of shared data from the data warehouse, hybrid data marts offer the benefits of both strategies. This strategy creates a balance between localized data management and centralized control. 3. Virtual warehouse: A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the possible summary views may be materialized. A virtual warehouse is easy to build but requires excess capacity on operational database servers. Virtual warehousing is an innovative approach to inventory management, where businesses can efficiently track and monitor their stock across multiple physical warehouses using digital documentation and asset-tracking technology. At its core, virtual warehousing involves creating a digital replica of physical space storage facilities that accurately represents all essential information about products housed within them. This method enables companies to control their inventory completely while reducing costs and improving fulfillment performance. Schema in Data Ware house/ Data warehouse schema Schema is a logical description of the entire database. It includes the name and description of records of all record types including all associated data-items and aggregates. Same like a database, a data warehouse also requires to main tain a schema. A database uses relational model, while a data warehouse uses Star, Snowflake, and Fact Constellation schema. Star Schema in Data Warehouse modeling A star schema is a type of data modeling technique used in data warehousing to represent data in a structured and intuitive way. In a star schema, data is organized into a central fact table that contains the measures of interest, surrounded by dimension tables that describe the attributes of the measures. The fact table in a star schema contains the measures or metrics that are of interest to the user or organization. A star schema is a relational schema where a relational schema whose design represents a multidimensional data model. The star schema is the explicit data warehouse schema. It is known as star schema because the entity-relationship diagram of this schemas simulates a star, with points, diverge from a central table. The center of the schema consists of a large fact table, and the points of the star are the dimension tables. Fact Tables A table in a star schema which contains facts and connected to dimensions. A fact table has two types of columns: those that include fact and those that are foreign keys to the dimension table. The primary key of the fact tables is generally a composite key that is made up of all of its foreign keys. A fact table might involve either detail level fact or fact that have been aggregated (fact tables that include aggregated fact are often instead called summary tables). A fact table generally contains facts with the same level of aggregation. Dimension Tables A dimension is an architecture usually composed of one or more hierarchies that categorize data. If a dimension has not got hierarchies and levels, it is called a flat dimension or list. The primary keys of each of the dimensions table are part of the composite primary keys of the fact table. Dimensional attributes help to define the dimensional value. They are generally descriptive, textual values. Dimensional tables are usually small in size than fact table. Fact tables store data about sales while dimension tables data about the geographic region (markets, cities), clients, products, times, channels. with two measures: dollars sold and units sold. To minimize the size of the fact table, dimension identifiers (e.g., time key and item key) are system- generated identifiers. In the star schema, each dimension is represented by only one table, and each table contains a set of attributes. For example, the location dimension table contains the attribute set {location key, street, city, province or state, country}. This constraint may introduce some redundancy. For example, “Urbana” and “Chicago” are both cities in the state of Illinois, USA. Entries for such cities in the location dimension table will create redundancy among the attributes province or state and country; that is, (..., Urbana, IL, USA) and (..., Chicago, IL, USA). Moreover, the attributes within a dimension table may form either a hierarchy (total order) or a lattice (partial order). Characteristics of Star Schema o It creates a DE-normalized database that can quickly provide query responses. o It provides a flexible design that can be changed easily or added to throughout the development cycle, and as the database grows. o It provides a parallel in design to how end-users typically think of and use the data. o It reduces the complexity of metadata for both developers and end-users. Advantages of Star Schema Star Schemas are easy for end-users and application to understand and navigate. With a well-designed schema, the customer can instantly analyze large, multidimensional data sets. The main advantage of star schemas in a decision-support environment are: 1.Query Performance A star schema database has a limited number of table and clear join paths, the query run faster than they do against OLTP systems. Small single-table queries, frequently of a dimension table, are almost instantaneous. Large join queries that contain multiple tables takes only seconds or minutes to run. In a star schema database design, the dimension is connected only through the central fact table. When the two-dimension table is used in a query, only one join path, intersecting the fact tables, exist between those two tables. This design feature enforces authentic and consistent query results. 2.Load performance and administration Structural simplicity also decreases the time required to load large batches of record into a star schema database. By describing facts and dimensions and separating them into the various table, the impact of a load structure is reduced. Dimension table can be populated once and occasionally refreshed. We can add new facts regularly and selectively by appending records to a fact table. 3.Built-in referential integrity A star schema has referential integrity built-in when information is loaded. Referential integrity is enforced because each data in dimensional tables has a unique primary key, and all keys in the fact table are legitimate foreign keys drawn from the dimension table. A record in the fact table which is not related correctly to a dimension cannot be given the correct key value to be retrieved. 4.Easily Understood A star schema is simple to understand and navigate, with dimensions joined only through the fact table. These joins are more significant to the end-user because they represent the fundamental relationship between parts of the underlying business. Customer can also browse dimension table attributes before constructing a query. Disadvantage of Star Schema There is some condition which cannot be meet by star schemas like the relationship between the user, and bank account cannot describe as star schema as the relationship between them is many to many. Snowflake schema A snowflake schema is a multi-dimensional data model that is an extension of a star schema, where dimension tables are broken down into sub dimensions. Snowflake schemas are commonly used for business intelligence and reporting in OLAP data warehouses, data marts, and relational databases. In a snowflake schema, engineers break down individual dimension tables into logical sub dimensions. This makes the data model more complex, but it can be easier for analysts to work with, especially for certain data types. It's called a snowflake schema because its entity-relationship diagram (ERD) looks like a snowflake, as seen below. A snowflake schema for All Electronics sales is given in Figure The main difference between the two schemas is in the definition of dimension tables. The single dimension table for item in the star schema is normalized in the snowflake schema, resulting in new item and supplier tables. For example, the item dimension table now contains the attributes item key, item name, brand, type, and supplier key, where supplier key is linked to the supplier dimension table, containing supplier key and supplier type information. Similarly, the single dimension table for location in the star schema can be normalized into two new tables: location and city. The city key in the new location table links to the city dimension. Notice that, when desirable, further normalization can be performed on province or state and country in the snowflake schema shown in Figure. Benefits of snowflake schemas Fast data retrieval Enforces data quality Simple, common data model for data warehousing Drawbacks of snowflake schemas Lots of overhead upon initial setup Rigid data model High maintenance costs Fact Constellation is a schema for representing multidimensional model. It is a collection of multiple fact tables having some common dimension tables. It can be viewed as a collection of several star schemas and hence, also known as Galaxy schema. It is one of the widely used schema for Data warehouse designing and it is much more complex than star and snowflake schema. For complex systems, we require fact constellations. In Figure : This schema specifies two fact tables, sales and shipping. The sales table definition is identical to that of the star schema (Figure 4.6). The shipping table has five dimensions, or keys—item key, time key, shipper key, from location, and to location—and two measures— dollars cost and units shipped. A fact constellation schema allows dimension tables to be shared between fact tables. For example, the dimensions tables for time, item, and location are shared between the sales and shipping fact tables. Advantage: Provides a flexible schema. Disadvantage: It is much more complex and hence, hard to implement and maintain. Difference between Star and Snowflake Schema: S.NO Star Schema Snowflake Schema In star schema, The fact tables While in snowflake schema, The fact tables, 1. and the dimension tables are dimension tables as well as sub dimension contained. tables are contained. Star schema is a top-down 2. While it is a bottom-up model. model. 3. Star schema uses more space. While it uses less space. It takes less time for the While it takes more time than star schema for 4. execution of queries. the execution of queries. In star schema, Normalization is While in this, Both normalization and de 5. not used. normalization are used. 6. It’s design is very simple. While it’s design is complex. The query complexity of star While the query complexity of snowflake 7. schema is low. schema is higher than star schema. 8. It’s understanding is very simple. While it’s understanding is difficult. It has less number of foreign 9. While it has more number of foreign keys. keys. 10. It has high data redundancy. While it has low data redundancy. Typical OLAP Operations: In the multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple levels of abstraction defined by concept hierarchies. This organization provides users with the flexibility to view data from different perspectives. A number of OLAP data cube operations exist to materialize these different views, allowing interactive querying and analysis of the data at hand. Hence, OLAP provides a user-friendly environment for interactive data analysis. OLAP Operations OLAP stands for Online Analytical Processing Server. It is a software technology that allows users to analyze information from multiple database systems at the same time. It is based on multidimensional data model and allows the user to query on multi-dimensional data OLAP databases are divided into one or more cubes and these cubes are known as Hyper-cubes. OLAP Cube The OLAP cube or Hypercube is a special kind of data structure that is optimized for very quick multidimensional data analysis and storage. It is a screenshot of data at a specific point in time. As seen in the figure, using certain OLAP operations, a user can request a specified view of the hypercube. Hence, OLAP cubes allow users to perform multidimensional analytical querying on the data. Types of OLAP Servers There exist mainly three types of OLAP systems: Relational OLAP (ROLAP): These systems work directly with relational databases and use complex SQL queries to retrieve information from the database. It can handle large volumes of data but provides slower data processing. Multidimensional OLAP (MOLAP): MOLAP is also known as the classic form of OLAP. It uses an optimized multi-dimensional array storage system for data storage. It makes use of positional techniques to access the data physically stored in multidimensional arrays. Hybrid OLAP (HOLAP): It uses a best-of-both-worlds approach and is a combination of ROLAP and MOLAP. It provides the high scalability feature of ROLAP systems along with the fast computation functionality of MOLAP systems. The list of OLAP operations: Roll-up Drill-down Slice and dice Pivot (rotate) Drill Down Drill down operation allows a user to zoom in on the data cube i.e., the less detailed data is converted into highly detailed data. It can be implemented by either stepping down a concept hierarchy for a dimension or adding additional dimensions to the hypercube. Example: Consider a cube that represents the annual sales (4 Quarters: Q1, Q2, Q3, Q4) of various kinds of clothes (Shirt, Pant, Shorts, Tees) of a company in 4 cities (Delhi, Mumbai, Las Vegas, New York) as shown below: Here, the drill-down operation is applied on the time dimension and the quarter Q1 is drilled down to January, February, and March. Hence, by applying the drill-down operation, we can move down from quarterly sales in a year to monthly or weekly records. Roll up It is the opposite of the drill-down operation and is also known as a drill-up or aggregation operation. It is a dimension-reduction technique that performs aggregation on a data cube. It makes the data less detailed and it can be performed by combining similar dimensions across any axis. Example: Considering the above-mentioned clothing company sales example: Here, we are performing the Roll-up operation on the given data cube by combining and categorizing the sales based on the countries instead of cities. Dice Dice operation is used to generate a new sub-cube from the existing hypercube. It selects two or more dimensions from the hypercube to generate a new sub- cube for the given data. Example: Considering our clothing company sales example: Here, we are using the dice operation to retrieve the sales done by the company in the first half of the year i.e., the sales in the first two quarters. Slice Slice operation is used to select a single dimension from the given cube to generate a new sub-cube. It represents the information from another point of view. Example: Considering our clothing company sales example: Here, the sales done by the company during the first quarter are retrieved by performing the slice operation on the given hypercube. Pivot It is used to provide an alternate view of the data available to the users. It is also known as Rotate operation as it rotates the cube’s orientation to view the data from different perspectives. Example: Considering our clothing company sales example: Here, we are using the Pivot operation to view the sub-cube from a different perspective. Other OLAP operations: Some OLAP systems offer additional drilling operations. For example, drill-across executes queries involving (i.e., across) more than one fact table. The drill-through operation uses relational SQL facilities to drill through the bottom level of a data cube down to its back-end relational tables. Other OLAP operations may include ranking the top N or bottom N items in lists, as well as computing moving averages, growth rates, interests, internal return rates, depreciation, currency conversions, and statistical functions. OLAP offers analytical modeling capabilities, including a calculation engine for deriving ratios, variance, and so on, and for computing measures across multiple dimensions. OLAP also supports functional models for forecasting, trend analysis, and statistical analysis. OLAP Systems versus Statistical Databases Many OLAP systems’ characteristics (e.g., the use of a multidimensional data model and concept hierarchies, the association of measures with dimensions, and the notions of roll-up and drill-down) also exist in earlier work on statistical databases (SDBs). A statistical database is a database system that is designed to support statistical applications. Similarities between the two types of systems are rarely discussed, mainly due to differences in terminology and application domains. OLAP and SDB systems, however, have distinguishing differences. While SDBs tend to focus on socioeconomic applications, OLAP has been targeted for business applications. Privacy issues regarding concept hierarchies are a major concern for SDBs. For example, given summarized socioeconomic data, it is controversial to allow users to view the corresponding low-level data. Finally, unlike SDBs, OLAP systems are designed for efficiently handling huge amounts of data