Real-time Data Warehousing Lecture Notes PDF

Summary

This lecture covers Real-Time Data Warehousing, including real-time data processing challenges, real-time ETL approaches, and stream-based join algorithms. It outlines the principles behind designing effective real-time data warehousing systems.

Full Transcript

Lecture 16: Real-time Data Warehousing

This week:
– Real-time Data Warehousing
– Real-time Data Processing Challenges
– Real-time Join Algorithms

Acknowledgment: thanks to Wolf-Tilo Balke and Silviu Homoceanu (TU Braunschweig) for the slides.

Real-time DW – Why?
Customer business scenario: a utility company owns plants generating energy. The existing DW supports planning by recommending:
– the production capacity
– the reserve capacity
– when to buy supplemental energy, as needed

Each day is pre-planned based on historical behavior, since peak demand periods are somewhat predictable. Good planning is important because it is expensive to have unused capacity and cheaper to buy energy ahead of time. Planning on last week's average is not enough.

Getting more in-time accuracy enhances operational business:
– compare today's plant output and customer consumption volumes to yesterday's or last week's average
– know when to purchase additional options or supplies
Customer target: have the actual data from the operational environment available for analytics within a 5-minute lag. Note that real-time ≠ fast: a real-time DW has the capability to enforce time constraints.

The most difficult part of complying with the 5-minute constraint is the ETL process:
– ETL tools usually operate in batch mode on a fixed schedule (nightly, weekly, or monthly)
– ETL typically involves downtime for the DW during the loading step (usually overnight)
– data is extracted into flat files in the staging area outside the DBMS, where it is not available for querying
– ETL may take hours to finish
The ETL process therefore needs to be redesigned to meet real-time constraints.

Real-time ETL
Solutions enabling real-time ETL:
– Microbatch
– Direct Trickle-Feed
– Trickle & Flip
– External Real-time Data Cache (ERDC)

Microbatches
Microbatching relies on the classical batch-ETL solution to provide near-real-time ETL:
– reduce the time interval between consecutive loads (typical intervals are 3-4 hours)
– transformation tasks have to be fully automated (no human intervention)
– the time interval, however, cannot be reduced to minutes
– the interval depends on the data volume and on the operational-system and OLAP load during the microbatch process

Direct Trickle-Feed
Direct Trickle-Feed (also called continuous feed) continuously feeds the DW with new data from the operational environment, directly propagating changes from the operational data store as inserts into the DW fact table.
– Constantly updating the same tables being queried by a reporting or OLAP tool can cause the DW's query performance to degrade (query contention).
– Under moderate to heavy usage from either OLAP queries or the incoming data, most RDBMSs will block the incoming data (the Walmart scenario).

Solution: a separate real-time fact partition. The idea is to use a separate table subject to special rules for updates and queries, which:
– contains all the activity that has occurred since the last update of the "static" fact table (itself updated each night)
– is linked as seamlessly as possible to the content of the "static" fact table
– is indexed only lightly so it can absorb incoming data
– supports highly responsive queries

Real-time partition architecture: [diagram: operational systems and flat files feed a data-cleaning pipeline; on each transaction, new data is inserted into the real-time partition, which is periodically merged into the "static" fact table]

Real-Time Partition in Practice
Consider the Amazon DW:
– an estimated 55 GB of sales transactions a year; for a single day this means about 150 MB of raw data
– the "static" fact table just for the sales cube, covering the last 3 years, would be about 150 GB; for fast OLAP querying it is heavily indexed and supported by aggregates
– the small 150 MB real-time partition can be pinned in memory for rapid loading and querying

Consider the query: how have today's sales evolved so far (Tuesday the 6th of December, at 10:30) compared with the last 3 Tuesdays?
– The query processor detects that fresh data is required by the query.
– It sends the same query to both the "static" fact table and the real-time partition.
– Data from the real-time partition is aggregated to the necessary granularity, in accordance with the classification schema.
– The system benefits from the indexes on the "static" fact table and from the small size of the in-memory real-time partition.
– As a downside, this approach may still suffer from query contention if the daily data volume is high.
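To make the two-table split concrete, here is a minimal sketch using SQLite in memory. All table and column names (sales_fact, sales_fact_rt, sale_date, amount) are hypothetical illustrations, not from the lecture; the point is only that the query processor can merge both tables with a UNION ALL so that fresh trickle-fed rows appear seamlessly next to historical data.

```python
# Minimal sketch of the real-time partition idea, with hypothetical names.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# "Static" fact table: heavily indexed, reloaded each night.
cur.execute("CREATE TABLE sales_fact (sale_date TEXT, amount REAL)")
cur.execute("CREATE INDEX idx_sales_date ON sales_fact (sale_date)")

# Real-time partition: same schema, deliberately left unindexed here so
# that trickle-feed inserts stay cheap.
cur.execute("CREATE TABLE sales_fact_rt (sale_date TEXT, amount REAL)")

cur.executemany("INSERT INTO sales_fact VALUES (?, ?)",
                [("2023-11-15", 120.0), ("2023-11-22", 95.0)])
cur.executemany("INSERT INTO sales_fact_rt VALUES (?, ?)",
                [("2023-11-29", 40.0)])  # today's trickle-fed activity

# The query processor sends the same query to both tables and merges
# the results, so fresh data appears next to historical data.
cur.execute("""
    SELECT sale_date, SUM(amount)
    FROM (SELECT * FROM sales_fact
          UNION ALL
          SELECT * FROM sales_fact_rt) AS all_sales
    GROUP BY sale_date
""")
print(cur.fetchall())
```

In practice the union would typically be hidden behind a view, so OLAP tools keep querying a single logical fact table.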
Trickle & Flip
– Instead of continuously feeding the data into the DW, the data is fed into staging tables that have the exact same format as the target fact tables. The staging tables are thus copies of the fact tables, but with up-to-date data.
– Periodically the trickle is halted, the staging tables are copied, and the copies are renamed to the active fact table names (just a pointer-switch operation). The old active fact table is deleted, and the process begins anew.

Trickle & Flip architecture: [diagram: operational systems and flat files trickle into a staging copy of the fact table in the warehouse; copying the fact table is slow, but OLAP queries can continue, since only the ETL is paused; the pointer change itself is fast, a matter of microseconds, so analysis stops only very briefly during the flip]

The main problem is scalability:
– the staging tables must contain the complete fact table
– if the fact table is too large, the copying process is slow
– the method is only useful if the refresh frequency corresponds to the time it takes to perform the flip
On the positive side, there is no query contention between complex OLAP queries and loading inserts; during the flip, queries are briefly paused.
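A minimal sketch of the flip itself, again with SQLite and hypothetical table names; the renames are metadata-only operations, which is why the flip takes only an instant even though building the staging copy is slow.

```python
# Sketch of the Trickle & Flip pointer switch (hypothetical table names).
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE sales_fact (k INTEGER)")           # active fact table
cur.execute("CREATE TABLE sales_fact_stage (k INTEGER)")     # trickle-fed copy
cur.execute("INSERT INTO sales_fact_stage VALUES (1), (2)")  # up-to-date data

def flip(cur):
    """Swap the staging table into place: a metadata-only operation."""
    cur.execute("ALTER TABLE sales_fact RENAME TO sales_fact_old")
    cur.execute("ALTER TABLE sales_fact_stage RENAME TO sales_fact")
    cur.execute("DROP TABLE sales_fact_old")  # old copy is discarded
    # Afterwards a fresh staging copy of sales_fact would be rebuilt
    # and trickle feeding resumes against it.

flip(cur)
print(cur.execute("SELECT * FROM sales_fact").fetchall())  # [(1,), (2,)]
```

In a real warehouse the two renames would run inside one short transaction, so queries pause only for the flip itself.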
External Real-Time Data Cache (ERDC)
– Leave the DW largely as-is.
– Store the incoming real-time data outside the traditional DW, in order to avoid performance problems entirely.
– The data cache can simply be a dedicated database server for loading, storing, and processing real-time data.

ERDC architecture: [diagram: operational systems and flat files load the warehouse fact table nightly via a staging table, while real-time data flows into the ERDC; a query processor answers queries against both the fact table and the ERDC]

Real-time ETL
Real-time transformation is an important phase in an ETL layer, where incoming source updates are transformed into warehouse format in an online fashion. [diagram: data sources send transactions as a stream into the ETL layer (extraction, then transformation with a join module against master data, then loading); the transformed data lands in the data warehouse, which serves end users' OLAP analysis, reporting, and data mining]

Transformation examples: key replacement and enrichment of data. A join operator replaces the source keys with surrogate warehouse keys by looking them up in the master data:

Source DS1 (id, name): (01, Sugar), (02, Milk)
Source DS2 (id, name): (01, Milk), (02, Butter)
Master data (id, source, warehouse key): (01, DS1, 101), (02, DS1, 102), (01, DS2, 102), (02, DS2, 103)
Data warehouse (w_key, name): (101, Sugar), (102, Milk), (103, Butter)
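A minimal sketch of this key-replacement join in plain Python, mirroring the example tables above; the master data maps a (source, source id) pair to the surrogate warehouse key.

```python
# Key replacement: the lookup against master data is the join.
master_data = {
    ("DS1", "01"): 101,  # Sugar
    ("DS1", "02"): 102,  # Milk
    ("DS2", "01"): 102,  # Milk (same product, different source id)
    ("DS2", "02"): 103,  # Butter
}

def replace_key(source, tuple_):
    """Join one incoming source tuple with the master data."""
    source_id, name = tuple_
    w_key = master_data[(source, source_id)]
    return (w_key, name)

print(replace_key("DS1", ("01", "Sugar")))   # (101, 'Sugar')
print(replace_key("DS2", ("02", "Butter")))  # (103, 'Butter')
```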
What is a Stream-based Join?
A stream-based join is an operation that combines information coming from two or more data sources. The sources can be streams or persistent data: a full stream join Θ(s1, s2) combines two streams, while a semi-stream join Θ(s, r) combines a stream S with a disk-based relation R.

Applications of stream-based joins:
– enrichment of stream data with master data
– key replacement in a data warehouse
– identification of duplicate records
– merging of two or more data streams

Research Challenges
Challenge 1: the two inputs of the join operator have different arrival rates. The stream input is fast, high-volume, and intermittent in nature, while the master data input is comparatively slow due to the disk I/O cost. This creates a bottleneck in join processing; the challenge is to eliminate it.
Challenge 2: stream data is non-uniform, so an efficient approach is required to retrieve the master data.

Existing Approaches
Stream-based joins fall into two families:
– full stream joins: Symmetric Hash Join (SHJ), Early Hash Join (EHJ), X-Join, Double Pipeline Hash Join (DPHJ), Hash Merge Join (HMJ), MJoin
– semi-stream joins: Index Nested Loop Join (INLJ), Mesh Join (MESHJOIN), Partition-based Join, Semi-Stream Index Join (SSIJ), Reduced Mesh Join (R-MESHJOIN), Hybrid Join (HYBRIDJOIN), Cache Join (CACHEJOIN)

Index Nested Loop Join (INLJ)
INLJ joins each stream tuple si against the master data through an index lookup Θ(si, ri). Issues in INLJ:
– it processes only one stream tuple per iteration, so it is not practical for fast, high-volume stream data
– it does not amortize the expensive disk-reading cost over the stream
– it does not take into account common characteristics of stream data, e.g. skew

MESHJOIN (Mesh Join)
Features:
– proposed for processing stream data together with master data
– designed for joining a stream S with a disk-based relation R
– uses a limited memory budget
– does not need an index
– works for any equijoin

MESHJOIN components:
– queue: keeps a record of the tuples expiring from the hash table
– stream buffer: stores some of the incoming stream tuples for a while
– hash table: stores all the stream tuples currently being processed
– disk buffer: stores the group of disk pages being probed against the hash table while the join executes

MESHJOIN operation: [diagram: in each iteration, a window of w stream tuples from the stream buffer is hashed into the hash table and appended to the queue; the disk-based relation R is read cyclically, b pages at a time, into the disk buffer, and the loaded chunk (R1 through R4 in the example) is probed against the hash table; k iterations are required to bring all of R into memory (k = 4 in the example), after which the oldest w tuples have met every chunk of R and are expired from the queue and the hash table]
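A highly simplified, single-threaded sketch of the MESHJOIN loop, under the assumptions that R is held as k in-memory partitions standing in for disk pages, that the join is an equijoin on the first attribute, and that the stream is a Python iterator; buffer management and actual disk I/O are elided, and the names (meshjoin, r_partitions, w) are mine.

```python
from collections import deque

def meshjoin(stream, r_partitions, w):
    """stream: iterator of (key, payload); r_partitions: list of k lists
    of (key, payload) standing in for disk pages; w: stream tuples
    admitted per iteration."""
    k = len(r_partitions)
    hash_table = {}   # stream key -> list of stream tuples
    queue = deque()   # batches of stream tuples, oldest first
    part = 0          # index of the next partition of R to load
    while True:
        # 1. Read up to w new stream tuples into hash table and queue.
        batch = []
        for _ in range(w):
            t = next(stream, None)
            if t is None:
                break
            hash_table.setdefault(t[0], []).append(t)
            batch.append(t)
        if not batch and not hash_table:
            return  # stream exhausted and all tuples expired
        queue.append(batch)
        # 2. Load the next chunk of R and probe the hash table with it.
        for r_key, r_payload in r_partitions[part]:
            for s_key, s_payload in hash_table.get(r_key, []):
                yield (s_key, s_payload, r_payload)
        part = (part + 1) % k
        # 3. The oldest batch has now met all k chunks of R: expire it.
        if len(queue) == k:
            for t in queue.popleft():
                hash_table[t[0]].remove(t)
                if not hash_table[t[0]]:
                    del hash_table[t[0]]

stream = iter([("a", 1), ("b", 2), ("a", 3)])
R = [[("a", "A-row")], [("b", "B-row")]]   # k = 2 partitions
print(list(meshjoin(stream, R, w=2)))
```

Note how every stream tuple waits in the hash table for exactly k iterations, so one cyclic scan of R serves w tuples at once; this is how MESHJOIN amortizes the disk cost that INLJ pays per tuple.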
Problems in MESHJOIN
– Problem 1: due to an unnecessary dependency, the memory distribution among the join components is not optimal.
– Problem 2: the performance of the algorithm is inversely proportional to the size of the master data.
– Problem 3: it cannot deal with intermittency in the stream.
– Problem 4: typical characteristics of stream data, such as non-uniform distributions, are not considered.

Solutions
We propose the following novel algorithms to solve the highlighted problems:
– Reduced Mesh Join (R-MESHJOIN): clarifies the dependencies among the components more appropriately
– Hybrid Join (HYBRIDJOIN): introduces an index-based strategy to access the master data
– Cache Join (CACHEJOIN): accounts for the non-uniform characteristics of stream data

Solution 1: R-MESHJOIN (Reduced Mesh Join)
[diagram: as in MESHJOIN, the stream feeds a stream buffer, a hash table, and a queue of w tuples with k partitions; the disk buffer reading the disk-based master data R is divided into l logical partitions of bp pages each (b pages in total), and k iterations are required to load the entire disk-based relation into memory]

Solution 2: HYBRIDJOIN (Hybrid Join)
[diagram: stream tuples t1, t2, ..., tm pass through a stream buffer and a queue into a hash table HS containing the stream; the disk buffer reads from the master data R] Two key points: (1) R is assumed to be sorted in order of access frequency; (2) an index-based strategy is used to access R.

Solution 3: CACHEJOIN (Cache Join)
[diagram: each incoming stream tuple is first probed, via a hash function, against a hash table HR that contains frequent disk tuples (the stream-probing phase); if it matches, join output is produced immediately; if not, the tuple passes through the stream buffer and queue into hash table HS and is handled by the disk-probing phase, which reads the master data R through the disk buffer; a frequency-detection component promotes disk tuples whose access frequency reaches a threshold value into HR]
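A heavily abstracted sketch of CACHEJOIN's two phases in Python: the dictionary HR stands in for the cache of frequent disk tuples, R is reduced to a plain lookup table standing in for the disk-based master data, and the queue and HS machinery of the disk-probing phase is elided. The threshold value and the promotion rule shown here are illustrative assumptions, not the paper's exact policy.

```python
def cachejoin_step(stream_tuple, HR, R, freq, threshold):
    """Process one stream tuple (key, payload). HR: cached frequent
    R-tuples keyed by join key; R: dict standing in for the disk-based
    master data; freq: access counts used for frequency detection."""
    key, payload = stream_tuple
    # Stream-probing (cache) phase: frequent disk tuples match here
    # without any disk access.
    if key in HR:
        return (key, payload, HR[key])
    # Disk-probing phase: look the tuple up in the master data.
    r_payload = R.get(key)
    if r_payload is None:
        return None  # no match in R
    # Frequency detection: promote hot disk tuples into the cache.
    freq[key] = freq.get(key, 0) + 1
    if freq[key] >= threshold:
        HR[key] = r_payload
    return (key, payload, r_payload)

HR, freq = {}, {}
R = {"a": "A-row", "b": "B-row"}
for t in [("a", 1), ("a", 2), ("a", 3), ("b", 4)]:
    print(cachejoin_step(t, HR, R, freq, threshold=2))
# After two probes for key "a", it is served from the cache; this is
# why skewed (non-uniform) streams benefit most from CACHEJOIN.
```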
Experimental Setup
Hardware specifications: Pentium Core i5, 8 GB main memory, 500 GB HDD.
Synthetic dataset:
– size of master data R: 100 million tuples (~11.18 GB)
– size of each disk tuple: 120 bytes
– size of each stream tuple: 20 bytes
– size of each pointer in the queue: 12 bytes
– stream keys follow Zipf's law, with skew varying from 0 to 1
TPC-H dataset:
– generated with a scale factor of 100
– table Customer serves as the master data table, with each tuple of 223 bytes
– table Order serves as the stream data table, with each tuple of 138 bytes
Real dataset:
– size of master data: 20 million tuples
– each master data tuple, as well as each stream tuple, is 128 bytes
– source URL: http://cdiac.ornl.gov/ftp/ndp026b/
Evaluation metric: service rate (tuples/sec), reported with 95% confidence intervals.

Performance Comparisons
[figure: service rate of CACHEJOIN, HYBRIDJOIN, R-MESHJOIN, and MESHJOIN, with 95% confidence intervals, while R = 100 million tuples and the allocated memory M varies from 1% to 10% of R]
[figure: service rate while M ≈ 1.2 GB and the size of R varies from 20 to 100 million tuples]
[figure: service rate while M ≈ 1.2 GB, R = 100 million tuples, and the skew varies from 0 to 1]
[figures: service rate on the TPC-H dataset and on the real dataset, while the allocated memory varies from 1% to 10% of R]
[figures: processing time in minutes as the size of R grows to 100 million tuples, and waiting time in milliseconds as the stream arrival rate varies from 125 to 16,000 tuples/sec]

Role of the Cache Module in Performance
[figure: total number of stream tuples processed through the cache module in 4,000 iterations, as total memory varies from 50 to 250 MB; about 43% of all stream tuples are processed through the cache module]

Summary
Stream processing has become a novel field in the area of data management due to the unbounded nature of streams. Stream-based join operators play a key role in processing stream data. A number of algorithms have been designed to process semi-stream data; however, they suffer from some limitations. We addressed these limitations in our research by presenting three novel algorithms, and our experimental evaluation demonstrated the contribution of each algorithm in terms of performance.

Summary
Real-time DW:
– Real-time ETL
– Joins to process stream data: MESHJOIN, its problems, R-MESHJOIN, HYBRIDJOIN, CACHEJOIN
– Performance comparisons (my research)

Next Lecture
Big Data:
– Big Data introduction
– Characteristics of Big Data
– Challenges