Introduction to Data Warehouses and ETL & Data Manipulation (1)

Document Details

Uploaded by ToughestIrrational4641

Faculty of Computer Science and Engineering

A. Deshpande, E. Franconi, and W. Lang

Tags

data warehousing, data management, ETL, database systems

Summary

This document provides an introduction to data warehouses and ETL processes (Extract, Transform, Load). It explains the purpose of data warehouses and the main aspects of designing and implementing them. It also reviews database management systems (DBMS) and the rationale for using data warehouses instead of traditional query-driven approaches to data integration.

Full Transcript

Introduction to Data Warehouses and ETL and Data Manipulation (1)
Integrated Systems
Based on slides from: A. Deshpande, E. Franconi, and W. Lang

DBMS
- Used to maintain and access data efficiently.
- Mostly applied for On-Line Transaction Processing (OLTP): designed and optimized for fast transactions, inserts of new records, updates, and deletes.
- Typically contains current, "active" data; old data are archived.
- Includes mechanisms that ensure data security and integrity.
- Has reporting capabilities: query the database via a query language (e.g., Structured Query Language: SQL).
- Example: list all student IDs and the corresponding grades of students enrolled in MIS 372 whose grade is at least B. The query result is:

  ID    Grade
  A121  B
  A124  A-
  A126  B+
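The slides give this query only in English plus a result table; a minimal SQL rendering might look as follows, assuming a hypothetical Enrollment(StudentID, ClassID, Grade) table (the table and column names are illustrative, not given in the slides):

    -- Hypothetical schema: Enrollment(StudentID, ClassID, Grade).
    -- "Grade is at least B" is expressed by enumerating the qualifying letter grades.
    SELECT StudentID, Grade
    FROM Enrollment
    WHERE ClassID = 'MIS 372'
      AND Grade IN ('A', 'A-', 'B+', 'B');

Run against the sample data, this returns exactly the three rows shown above.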

Relational Model
- Normalized tables: follow a set of steps to create multiple tables in order to minimize redundancy and facilitate data management. Each table is associated with an entity, e.g., a student table, a class table, etc.
- Attributes: characteristics of the entity (student's name, address, etc.).
- Primary key: a unique identifier for each entry in the table (e.g., student ID in the Student table).
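To make the normalization idea concrete, here is a minimal sketch of normalized Student, Class, and Enrollment tables; the column names and types are illustrative assumptions, not taken from the slides:

    -- Each table models one entity; the primary key uniquely identifies each row.
    CREATE TABLE Student (
        StudentID CHAR(4) PRIMARY KEY,   -- unique identifier, e.g., 'A121'
        Name      VARCHAR(100),
        Address   VARCHAR(200)
    );

    CREATE TABLE Class (
        ClassID   VARCHAR(10) PRIMARY KEY,  -- e.g., 'MIS 372'
        Title     VARCHAR(100)
    );

    -- Enrollment links the two entities, so student and class details are
    -- each stored once: the redundancy that normalization removes.
    CREATE TABLE Enrollment (
        StudentID CHAR(4)     REFERENCES Student(StudentID),
        ClassID   VARCHAR(10) REFERENCES Class(ClassID),
        Grade     VARCHAR(2),
        PRIMARY KEY (StudentID, ClassID)
    );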
Problem: Heterogeneous Information Sources
- "Heterogeneities are everywhere": personal databases, scientific databases, the World Wide Web, digital libraries, ...
- Different interfaces
- Different data representations
- Duplicate and inconsistent information

Problem: Data Management in Large Enterprises
- Vertical fragmentation of informational systems (vertical stovepipes)
- The result of application (user)-driven development of operational systems: sales planning, suppliers, numerical control, stock management, debt management, inventory, and so on, spread across sales, administration, finance, manufacturing, ...

Goal: Unified Access to Data
- An integration system collects and combines information from sources such as personal databases, the World Wide Web, digital libraries, and scientific databases.
- Provides an integrated view and a uniform user interface.
- Supports sharing.

What is a Data Warehouse? A Practitioner's Viewpoint
"A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context." -- Barry Devlin, IBM Consultant

Data Warehouse
- Supports information processing by providing a solid platform of consolidated, historical data for analysis; designed to get data out and analyze it quickly.
- A data warehouse is:
  - Subject-oriented: contains information regarding objects of interest for decision support: sales by region, by product, etc.
  - Integrated: data are typically extracted from multiple, heterogeneous data sources (e.g., sales, inventory, and billing DBs) via ETL (see later).
  - Time-variant: contains historical data, with a longer horizon than an operational system.
  - Nonvolatile: a physically separate store; data is not (or only rarely) directly updated.
-- W. H. Inmon, Building the Data Warehouse, 1992

A Data Warehouse is...
- A stored collection of diverse data: a solution to the data integration problem, a single repository of information.
- Subject-oriented: organized by subject, not by application.
- Used for analysis, data mining, etc.
- Optimized differently from a transaction-oriented DB.
- User interface aimed at executives.

A Data Warehouse is... (continued)
- A large volume of data (GB, TB).
- Non-volatile and historical: time attributes are important.
- Updates are infrequent; may be append-only.
- Examples: all transactions ever at Sainsbury's; complete client histories at an insurance firm; LSE financial information and portfolios.

Why a Warehouse?
Two approaches:
- Query-driven (lazy)
- Warehouse (eager)

The Traditional Research Approach
- Query-driven (lazy, on-demand): clients pose queries to an integration system (with metadata), which translates them through wrappers to the individual sources at query time.

Disadvantages of the Query-Driven Approach
- Delay in query processing
- Slow or unavailable information sources
- Complex filtering and integration
- Inefficient and potentially expensive for frequent queries
- Competes with local processing at the sources
- Hasn't caught on in industry

The Warehousing Approach
- Information is integrated in advance and stored in the warehouse for direct querying and analysis.
- Extractor/monitor components feed data from each source through the integration system (with metadata) into the data warehouse.

Advantages of the Warehousing Approach
- High query performance (but not necessarily the most current information)
- Doesn't interfere with local processing at the sources: complex queries run at the warehouse, OLTP stays at the information sources
- Information is copied at the warehouse: it can be modified, annotated, summarized, and restructured, and historical information can be stored
- Security, no auditing
- Has caught on in industry

Not an Either-Or Decision
The query-driven approach is still better for:
- Rapidly changing information
- Rapidly changing information sources
- Truly vast amounts of data from large numbers of sources
- Clients with unpredictable needs

Generic Warehouse Architecture
- Clients run queries and analysis against the warehouse and its metadata.
- An integrator combines data delivered by extractor/monitor components on each source.
- Lifecycle concerns: design phase, loading, maintenance, optimization.

Data Warehouse Architectures: Conceptual View
- Single-layer: every data element is stored once only; a virtual warehouse over the real-time data sits between the operational and informational systems.
- Two-layer: real-time data plus derived data; the most commonly used approach in industry today.

Three-Layer Architecture: Conceptual View
- Transforming real-time data into derived data really requires two steps: real-time data is first consolidated into reconciled data (the physical implementation of the data warehouse), from which derived data is built to serve particular informational needs at the view level.

Issues in Data Warehousing
- Warehouse design
- Extraction: wrappers, monitors (change detectors)
- Integration: cleansing and merging
- Warehousing specification and maintenance
- Optimizations
- Miscellaneous (e.g., evolution)

Data Warehouse vs OLTP: Summary
[Summary table; see the OLTP vs. OLAP comparison later in the transcript.]

Multi-Dimensional Data
- Measures: the numerical data being tracked.
- Dimensions: the business parameters that define a transaction.
- Example: an analyst may want to view sales data (measure) by geography, by time, and by product (dimensions).
- Dimensional modeling is a technique for structuring data around business concepts: ER models describe "entities" and "relationships"; dimensional models describe "measures" and "dimensions".

The Dimensional Model
- Answers questions such as "Sales by product line over the past six months" and "Sales by store between 1990 and 1995".
- A fact table holds the numerical measures (e.g., Sales Qty) and the key columns (Prod Code, Time Code, Store Code) that join it to the dimension tables (Product Info, Store Info, Time Info).
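A minimal SQL sketch of such a star schema, reusing the key columns named above; the exact table layouts and data types are assumptions for illustration:

    -- Dimension tables hold descriptive attributes of each business perspective.
    CREATE TABLE product_info (
        prod_code    INT PRIMARY KEY,
        product      VARCHAR(50),
        product_line VARCHAR(50),
        brand        VARCHAR(50)
    );

    CREATE TABLE store_info (
        store_code INT PRIMARY KEY,
        store_name VARCHAR(50),
        state      VARCHAR(50),
        region     VARCHAR(50),
        country    VARCHAR(50)
    );

    CREATE TABLE time_info (
        time_code INT PRIMARY KEY,
        day       DATE,
        week      INT,
        quarter   INT
    );

    -- The fact table holds the numerical measure (sales_qty); its composite
    -- primary key is made up of foreign keys into the dimension tables.
    CREATE TABLE sales_fact (
        prod_code  INT REFERENCES product_info(prod_code),
        time_code  INT REFERENCES time_info(time_code),
        store_code INT REFERENCES store_info(store_code),
        sales_qty  INT,
        PRIMARY KEY (prod_code, time_code, store_code)
    );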
Dimensional Modeling
- Dimensions are organized into hierarchies. E.g., Time dimension: days → weeks → quarters; Product dimension: product → product line → brand.
- Dimensions have attributes.

Basic Building Blocks: Star Schema
- Dimension tables plus fact table(s).
- Makes multi-dimensional database functionality possible using a traditional RDB.

Dimension Tables
Dimension tables have the following characteristics:
- Hold descriptive information about a particular business perspective
- Contain relatively static data
- Are joined to a fact table through a foreign key reference

Fact Tables
Fact tables have the following characteristics:
- Contain the numeric metrics of the business
- Can contain one or more calculated measures
- Records are updated, not deleted
- May contain summarized (aggregated) data
- Often contain date-stamped data
- The key value is a composite key made up of the primary keys of the dimensions
- Are joined to dimension tables through foreign keys that reference primary keys in the dimension tables

Star Schema Example
[Figures: an example star schema, and the same star schema populated with sample data.]

Data Warehouse Usage
Three kinds of data warehouse applications:
- Information processing: supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts, and graphs.
- Analytical processing: multidimensional analysis of data warehouse data; supports basic OLAP operations such as slice-dice, drilling, and pivoting.
- Data mining: knowledge discovery from hidden patterns; supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.

The Big Picture: DBs, Warehouse, and OLAP
- Operational DBs and other sources → Extract, Transform, Load, Refresh → data warehouse (data storage) → Serve → OLAP server (OLAP engine) → front-end tools for analysis, queries/reports, and data mining.

On-Line Analytical Processing (OLAP)
- The front-end to the data warehouse, allowing easy data manipulation.
- Allows conducting inquiries over the data at various levels of abstraction.
- Fast and easy because some aggregations are computed in advance; there is no need to formulate an entire query.

OLAP: Data Cube
- OLAP uses data in a multidimensional format (e.g., data cubes) to improve query and response time.
[Figure: a 3D data cube with dimensions Date (1Qtr-4Qtr, plus sums), Product (TV, PC, VCR, plus sums), and Country (U.S.A., Canada, Mexico, plus sums); one highlighted cell is the overall sales of TVs in the U.S.A. in the 3rd quarter.]

Typical OLAP Operations
- Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction.
- Drill down (roll down): the reverse of roll-up; move from a higher-level summary to a lower-level summary or detailed data, or introduce new dimensions.
- Slice and dice: project and select.
- Pivot (rotate): reorient the cube; visualization; 3D to a series of 2D planes.
- Other operations: drill across (involving more than one fact table) and drill through (through the bottom level of the cube to its back-end relational tables, using SQL).

OLAP: Data Cube Operations
- Slicing: selecting the dimensions of the cube to be viewed. Example: view "Sales volume" as a function of "Product" by "Country" by "Quarter".
- Dicing: specifying the values along one or more dimensions. Example: view "Sales volume" for "Product = PC" by "Country" by "Quarter".
- Drilling down: from a higher-level aggregation to a lower-level aggregation or detailed data (e.g., viewing by "state" after viewing by "region").
- Rolling up: summarize data by climbing up a hierarchy or by dimension reduction (e.g., viewing by "region" instead of by "state").
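In SQL terms, dicing and rolling up map naturally onto WHERE and GROUP BY clauses; the sketch below reuses the hypothetical star schema tables introduced earlier and is an illustration, not part of the slides:

    -- Dicing: fix Product = 'PC' and view sales volume by country and quarter.
    SELECT s.country, t.quarter, SUM(f.sales_qty) AS sales_volume
    FROM sales_fact f
    JOIN product_info p ON p.prod_code  = f.prod_code
    JOIN store_info   s ON s.store_code = f.store_code
    JOIN time_info    t ON t.time_code  = f.time_code
    WHERE p.product = 'PC'          -- the dice
    GROUP BY s.country, t.quarter;

    -- Rolling up: summarize by region instead of by state.
    -- (Drilling down is the reverse: GROUP BY s.state instead.)
    SELECT s.region, SUM(f.sales_qty) AS sales_volume
    FROM sales_fact f
    JOIN store_info s ON s.store_code = f.store_code
    GROUP BY s.region;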
OLTP vs. OLAP

                     OLTP                           OLAP
  User               Clerk, IT professional         Knowledge worker
  Function           Day-to-day operations          Decision support
  DB design          Application-oriented           Subject-oriented
  Data               Current, up-to-date,           Historical, summarized,
                     detailed, flat relational,     multidimensional,
                     isolated                       consolidated
  Usage              Repetitive                     Ad hoc
  Unit of work       Short, simple transactions     Complex queries
  Evaluation metric  Transaction throughput         Query throughput, response time

Warehouse is a Specialized DB

  Standard DB (OLTP)                     Warehouse (OLAP)
  Mostly updates                         Mostly reads
  Many small transactions                Queries are long and complex
  MB-GB of data                          GB-TB of data
  Current snapshot                       History
  Index/hash on primary key              Lots of scans
  Raw data                               Summarized, reconciled data
  Thousands of users                     Hundreds of users
  (e.g., clerical users)                 (e.g., decision-makers, analysts)

Data Integration Issues
- Same data / different name
- Same name / different data
- Data found in only one store (and nowhere else)
- Same data / different keys

Data Integrity Problems
- Same person / different spellings (e.g., Giraud-Carrier, GirodCarier, etc.)
- Multiple ways to denote a company name (e.g., Brigham Young University, BYU, etc.)
- Use of different names (e.g., Saigon / Ho Chi Minh City, Mumbai / Bombay, etc.)
- Different account numbers generated by different applications for the same customer
- Required fields left blank
- Invalid product codes collected at POS

Factors Driving the DW Explosion
- Moore's law: $100/TB storage, $1000 servers, commodity networking.
- Increasing volumes of "dark" data: data collected but never analyzed.
- The widening analysis gap of "traditional" solutions, due to their cost, complexity, scalability, and rigidity.

10,000 km View: Complexity vs. Cost
Deployment and operational complexity plotted against cost (capex + opex):
- Use a SAAS DW in the cloud (AWS Redshift, MSFT SQL DW, Snowflake, BigQuery, Athena (Presto), ...): low complexity; no capex, low opex.
- Buy an appliance (Teradata, Microsoft APS, Netezza): low complexity; high capex, low opex; the gold standard for performance.
- Roll-your-own (RYO): buy and install a cluster of servers, then buy, install, and configure software (Vertica, Asterdata, Greenplum, ...): high complexity; medium capex and opex.
- Roll-your-own-cloud (RYOC): rent a cluster of cloud servers, then buy, install, and configure software (Spark, Hive, Vertica, Asterdata, Greenplum, ...): medium to high complexity; low capex, medium opex.

Instant Gratification: Time to Insights
[Chart: time to insights vs. cost. Appliance and RYO take months (significant due diligence through multiple vendor POCs, physical delivery and setup of the cluster, IT deploying a complex parallel system, high capex cost); RYOC takes weeks; a cloud DW takes minutes.]

Scalability and the Price of Agility
[Chart: time and cost to make an adjustment. Appliance: months; RYO and RYOC: weeks; cloud DW: minutes.]

Unfortunately, No "Free Lunch"
[Chart: absolute performance vs. cost, ranking appliances highest in absolute performance, with cloud DW, RYOC, and RYO below.]

Why Cloud DW?
- No capex and low opex.
- Go from conception to insight in hours.
- Rock-bottom storage prices (Azure, AWS S3, GFS).
- Flexibility to scale compute capacity up and down.
- Simple upgrade process.

Shared Disk/Storage
- Commodity servers connected to each other and to storage using commodity networking.
- Local disks are used for caching DB pages; DB files and temp space are stored on "remote storage" (e.g., a SAN, S3, Azure Storage).
- Each node has its own CPU and memory; all nodes share a storage area network.
- The network can limit scaling, as it must carry I/O traffic.

Shared-Nothing
- Commodity servers connected via commodity networking.
- DB storage is strictly local to each node: co-located compute and storage, joined by an interconnection network.
- This design scales extremely well.

Table Partitioning (Shared Storage)
- What? Distribute the rows of each table across multiple storage devices.
- Why? To spread the I/O load and facilitate parallel query execution.
- How? Hash, round-robin, or range partitioning, as sketched below.
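For a concrete sense of the "how", here is a sketch of hash and range partitioning using PostgreSQL-style declarative partitioning syntax; the slides do not name a specific system, and the table names are illustrative:

    -- Hash partitioning: rows are spread across partitions by a hash of the key.
    CREATE TABLE clicks (
        user_id INT,
        url     TEXT,
        ts      TIMESTAMP
    ) PARTITION BY HASH (user_id);

    CREATE TABLE clicks_p0 PARTITION OF clicks FOR VALUES WITH (MODULUS 4, REMAINDER 0);
    CREATE TABLE clicks_p1 PARTITION OF clicks FOR VALUES WITH (MODULUS 4, REMAINDER 1);
    CREATE TABLE clicks_p2 PARTITION OF clicks FOR VALUES WITH (MODULUS 4, REMAINDER 2);
    CREATE TABLE clicks_p3 PARTITION OF clicks FOR VALUES WITH (MODULUS 4, REMAINDER 3);

    -- Range partitioning: rows are placed by key ranges (natural for dates).
    CREATE TABLE orders (
        order_id   INT,
        order_date DATE
    ) PARTITION BY RANGE (order_date);

    CREATE TABLE orders_2023 PARTITION OF orders
        FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');

    -- Round-robin has no declarative form in PostgreSQL; systems that support
    -- it simply assign incoming rows to partitions in rotation.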
Partitioned Parallelism
- Used to parallelize the execution of relational operators (selects, joins, aggregates, ...).
- Used by both shared-storage and shared-nothing systems.
- Pipelining is used between operators to avoid unnecessary disk I/Os.

Amazon (AWS) Redshift
- Classic shared-nothing design with locally attached storage.
- Leverages AWS services: EC2 compute instances, the S3 storage system, and Virtual Private Cloud (VPC).
- The leader in market adoption.

A Redshift Instance
- A single leader node (holding the catalogs) sits between the application and the compute nodes.
- One or more compute nodes (EC2 instances), with one slice per core; memory, storage, and data are partitioned among the slices.
- Hash and round-robin table partitioning. [Diagram: a Customers (ID, Name, AmtDue) table hash-partitioned on ID across four slices on two nodes.]

Redshift Summary
- A highly successful cloud SAAS DW service.
- Classic shared-nothing design; leverages S3 to handle node and disk failures.
- Key strength: performance through the use of local storage.
- Key weakness: compute cannot be scaled independently of storage (and vice versa).
- The new "Redshift Spectrum" offering can query data in S3 without loading.
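As a hedged sketch of how the Customers table from the diagram might be declared in Redshift SQL (a plausible illustration, not taken from the slides), the distribution key tells Redshift how to hash-partition rows across slices:

    -- Hash-distribute Customers on ID so rows spread evenly across slices.
    CREATE TABLE customers (
        id      INT,
        name    VARCHAR(64),
        amt_due DECIMAL(12, 2)
    )
    DISTSTYLE KEY
    DISTKEY (id);

    -- DISTSTYLE EVEN would instead distribute rows in round-robin fashion.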
Snowflake Elastic DW
- Shared-storage design: compute is decoupled from storage; highly elastic.
- Leverages AWS: tables are stored in S3 but dynamically cached on local storage; clusters of EC2 instances execute queries.
- Rich data model: schema-less ingestion of JSON documents.

Snowflake Architecture
- Cloud services layer: authentication and access control, shared infrastructure manager, query optimizer, transaction manager, security, and metadata storage.
- Compute layer: virtual warehouses, each a cluster of EC2 instances; their local disks are strictly used as caches.
- Data storage layer: database tables are stored in S3.

Snowflake Summary
- Designed for the cloud from conception.
- Can directly query unstructured data (JSON) without loading.
- Compute and storage are independently scalable.
- AWS S3 for table storage; virtual warehouses composed of clusters of AWS EC2 instances.
- Queries can be given exactly the compute resources they need.
- No management knobs: no indices, no create/update stats, no distribution keys, ...

Microsoft Azure SQL Data Warehouse
- Shared-storage design, based on the SQL Server PDW appliance software.
- Leverages Azure Storage.
- Elastic design.
- Query without loading (PolyBase).

A SQL DW Instance
- A single control node (which parses and optimizes queries) holds the catalogs.
- One or more DW nodes, each running one SQL Azure "standard" instance and one Data Movement Service (DMS) process.
- Hash and round-robin table partitioning, plus replicated tables; tables are stored in Azure Premium Storage. [Diagram: a Customers (ID, Name, AmtDue) table hash-partitioned on ID across four DW nodes.]

SQL DW Wrap-Up
- Nearly full SQL Server T-SQL surface compatibility.
- Ability to adjust resources elastically; ability to pause and resume.
- Big data, no load: query through PolyBase.

Google BigQuery
- Separate storage and compute.
- Leverages Google's internal storage and execution stacks: the Colossus distributed file system, the DremelX query executor, the Jupiter networking stack, and the Borg resource allocator.
- No knobs, no indices, ...

BigQuery Tables
- Stored in the Colossus FS, optionally partitioned by day.
- Columnar storage (Capacitor) with RLE compression; sampling is used to pick the sort order; columns are partitioned across multiple disks.
- Also "external" tables: JSON, CSV, and Avro formats, on Google Drive and Cloud Storage.

Elasticity
- Redshift: co-located storage and compute constrains elasticity.
- SQL DW: DB-level adjustment of DWU capacity.
- Snowflake: query-level control through the virtual warehouse mechanism.
- BigQuery: Google decides for you, based on input table sizes.

What about Spark, Impala, Hive, Presto, ...?
- Not exclusively SAAS offerings.
- All use shared storage (HDFS).

Questions?
