Full Transcript


Big Data Processing and Analysis
Minos Garofalakis
www.softnet.tuc.gr/~minos

Big Data is Big News (& Big Business…)
Rapid growth due to several information-generating technologies, such as mobile computing, sensornets, and social networks. How can we cost-effectively manage and analyze all this data?

Big Data Challenges: The Four V's
– Volume: Scaling from Terabytes to Exa/Zettabytes
– Velocity: Processing massive amounts of streaming data
– Variety: Managing the complexity of multiple relational and non-relational data types and schemas
– Veracity: Handling the inherent uncertainty and noise in the data
– Distribution: Dealing with massively distributed information

Exciting Times for Data Management!
Many fields converge on data: Machine Learning/Statistics, Networking, Databases, Algorithms/Theory, Bioinformatics, Web/IR. NOT "database people" -- "data people"!!

Course Topics & Outline
– Approximate Query Processing
– Parallelism & Data in the Cloud: Map-Reduce/Hadoop, Spark, Flink, …
– Data Stream Processing: Centralized & Distributed
– Projects (25-50%) -- Potential topics TBA: Literature/paper survey, Presentation, Implementation (??)

Approximate QP -- Outline
– Intro & Approximate Query Answering Overview: Synopses, System architectures, Commercial offerings
– One-Dimensional Synopses: Histograms, Samples, Wavelets
– Multi-Dimensional Synopses and Joins: Multi-D Histograms, Join synopses, Wavelets
– Set-Valued Queries: Using Histograms, Samples, Wavelets
– Discussion & Comparisons
– Advanced Techniques & Future Directions: Dependency-based, Workload-tuned, Streaming data

Decision Support Systems
– Data Warehousing: Consolidate data from many sources in one large repository. Loading, periodic synchronization of replicas. Semantic integration.
– OLAP: Complex SQL queries and views; queries based on spreadsheet-style operations and a "multidimensional" view of data; interactive and "online" queries.
– Data Mining: Exploratory search for interesting trends and anomalies. (Another lecture!)
Introduction & Motivation
Decision Support Systems (DSS) pose SQL queries to the warehouse and get back exact answers -- with long response times! Exact answers are NOT always required:
– DSS applications are usually exploratory: early feedback helps identify "interesting" regions
– Aggregate queries: precision to the "last decimal" is not needed, e.g., "What percentage of the US sales are in NJ?" (display as bar graph)
– Preview answers while waiting; trial queries
– Base data can be remote or unavailable: approximate processing using locally-cached data synopses is the only option

Fast Approximate Answers
Primarily for aggregate queries. The goal is to quickly report the leading digits of answers:
– In seconds instead of minutes or hours
– Most useful if error guarantees can be provided, e.g., average salary $59,000 +/- $500 (with 95% confidence) in 10 seconds vs. $59,152.25 in 10 minutes
Achieved by answering the query based on samples or other synopses of the data; the speed-up comes from synopses being orders of magnitude smaller than the original data.

Approximate Query Answering -- Basic Approach 1: Online Query Processing
– e.g., Control Project [HHW97, HH99, HAR00]
– Sampling at query time
– Answers continually improve, under user control

Approximate Query Answering -- Basic Approach 2: Precomputed Synopses
– Construct & store synopses prior to query time
– At query time, use the synopses to answer the query
– Like estimation in query optimizers, but reported to the user (so higher accuracy is needed) and for more general queries
– Synopses must be kept up-to-date
– Most work in the area is based on the precomputed approach, e.g., Sample Views [OR92, Olk93], Aqua Project [GMP97a, AGP99, etc.]

The Aqua Architecture
Picture without Aqua: the user (from a browser, via HTML/XML or Excel) poses a query Q; the data warehouse (e.g., Oracle) executes Q over the network and returns the result; the warehouse is periodically updated with new data.
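Basic Approach 1 above (online query processing, with answers improving under user control) can be illustrated by a running aggregate whose confidence interval shrinks as more sampled rows are scanned. This is a minimal sketch with made-up salary data and a simple CLT-based error bound, not the Control project's actual implementation:

```python
import math
import random

def online_avg(rows, report_every=1000, z=1.96):
    """Scan rows in random order, periodically yielding the running mean
    together with an approximate 95% confidence half-width (CLT-based)."""
    rows = list(rows)
    random.shuffle(rows)          # randomize scan order = sampling at query time
    n, total, total_sq = 0, 0.0, 0.0
    for x in rows:
        n += 1
        total += x
        total_sq += x * x
        if n % report_every == 0:
            mean = total / n
            var = max(total_sq / n - mean * mean, 0.0)
            yield n, mean, z * math.sqrt(var / n)   # half-width shrinks as n grows

# Usage: the user can watch the estimate refine and stop at any time.
salaries = [random.gauss(59000, 8000) for _ in range(10_000)]
for n, mean, hw in online_avg(salaries):
    pass  # e.g., display "avg salary = mean +/- hw (95% confidence)"
```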
The Aqua Architecture [GMP97a, AGP99]
Picture with Aqua:
– Aqua is middleware, between the user and the warehouse
– Aqua synopses are stored in the warehouse
– Aqua intercepts the user query Q and rewrites it to be a query Q' on the synopses; the data warehouse returns an approximate answer with error bounds (the AQUA Tracker component handles warehouse data updates)

Online vs. Precomputed
Online:
+ Continuous refinement of answers (online aggregation)
+ User control: what to refine, when to stop
+ Seeing the query is very helpful for fast approximate results
+ No maintenance overheads
+ See the [HH01] Online Query Processing tutorial for details
Precomputed:
+ Seeing the entire data is very helpful (provably & in practice), but synopses must be constructed for a family of queries
+ Often faster: better access patterns, and small synopses can reside in memory or cache
+ Middleware: can be used with any DBMS, no special index striding
+ Also effective for remote or streaming data

Commercial DBMS (circa 2002)
– Oracle, IBM Informix: sampling operator (online)
– IBM DB2: "IBM Almaden is working on a prototype version of DB2 that supports sampling. The user specifies a priori the amount of sampling to be done."
– Microsoft SQL Server: "New auto statistics extract statistics [e.g., histograms] using fast sampling, enabling the Query Optimizer to use the latest information." The index tuning wizard uses sampling to build statistics; see [CN97, CMN98, CN98]
In summary, not much announced yet.

Approximate Query Processing Using Data Synopses
Instead of running the SQL query against GB/TB of base data for an exact answer (long response times!), a "transformed" query runs against compact KB/MB data synopses and returns an approximate answer -- FAST!! How to construct effective data synopses??
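The simplest synopsis of the kind asked for above is a precomputed uniform row sample: an aggregate query is "transformed" to run over the sample and its result scaled up by the sampling fraction. A minimal sketch over a hypothetical sales table (table layout, column names, and numbers are all invented for illustration):

```python
import random

def build_sample_synopsis(table, fraction, seed=42):
    """Precompute a uniform Bernoulli sample of the table (the 'synopsis')."""
    rng = random.Random(seed)
    return [row for row in table if rng.random() < fraction]

def approx_sum(synopsis, fraction, predicate, column):
    """Answer SUM(column) WHERE predicate(row) on the synopsis, scaled up."""
    return sum(row[column] for row in synopsis if predicate(row)) / fraction

# Hypothetical sales table: every 10th row is an NJ sale of amount 100,
# so the exact answer to SUM(amount) WHERE state = 'NJ' is 1,000,000.
table = [{"state": "NJ" if i % 10 == 0 else "CA", "amount": 100}
         for i in range(100_000)]
synopsis = build_sample_synopsis(table, fraction=0.1)   # ~10% of the rows
est = approx_sum(synopsis, 0.1, lambda r: r["state"] == "NJ", "amount")
# est is close to the exact 1,000,000 but touches only the small synopsis.
```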
Outline
– Intro & Approximate Query Answering Overview
– One-Dimensional Synopses
  – Histograms: Equi-depth, Compressed, V-optimal, Incremental maintenance, Self-tuning
  – Samples: Basics, Sampling from DBs, Reservoir Sampling
  – Wavelets: 1-D Haar-wavelet histogram construction & maintenance
– Multi-Dimensional Synopses and Joins
– Set-Valued Queries
– Discussion & Comparisons
– Advanced Techniques & Future Directions

Relations as Frequency Distributions

name  age  salary  sales
MG    34   100K    25K
JG    33   90K     30K
RR    40   190K    55K
JH    36   110K    45K
MF    39   150K    50K
DD    45   150K    50K
JN    43   140K    45K
AP    32   70K     20K
EM    24   50K     18K
DW    24   50K     28K

A relation can be viewed as a frequency distribution: a one-dimensional distribution gives tuple counts per attribute domain value (e.g., per Age), and a three-dimensional distribution gives tuple counts per (age, salary, sales) combination. [Figure: 1-D and 3-D tuple-count plots]

Histograms
Partition the attribute value domain(s) into a set of buckets. Issues:
– How to partition
– What to store for each bucket
– How to estimate an answer using the histogram
Long history of use for selectivity estimation within a query optimizer [Koo80], [PSC84], etc.; [PIH96] and [Poo97] introduced a taxonomy, algorithms, etc.
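One standard answer to the "how to partition" question above is equi-depth bucketing: sort the data and cut it into B buckets of (roughly) equal row counts. A minimal sketch, storing (low, high, count) per bucket:

```python
def equi_depth_histogram(values, num_buckets):
    """Sort the values, then take B-1 equally-spaced splits so every bucket
    holds (roughly) the same number of rows. Returns (low, high, count) triples."""
    data = sorted(values)
    n = len(data)
    buckets = []
    for b in range(num_buckets):
        lo = b * n // num_buckets          # first row index of this bucket
        hi = (b + 1) * n // num_buckets    # one past the last row index
        if lo < hi:
            buckets.append((data[lo], data[hi - 1], hi - lo))
    return buckets

# Example: the 24 sorted values from the slide, with B = 4 buckets of 6 rows each.
data = [1, 2, 2, 3, 4, 7, 8, 9, 10, 10, 10, 10,
        11, 11, 12, 12, 14, 16, 16, 18, 19, 20, 20, 20]
print(equi_depth_histogram(data, 4))
# → [(1, 7, 6), (8, 10, 6), (11, 16, 6), (16, 20, 6)]
```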
1-D Histograms: Equi-Depth
[Figure: equi-depth bucket counts over domain values 1..20]
Goal: an equal number of rows per bucket (B buckets in all). Can construct by first sorting, then taking B-1 equally-spaced splits, e.g., over the sorted data:
1 2 2 3 4 7 8 9 10 10 10 10 11 11 12 12 14 16 16 18 19 20 20 20
Faster construction: sample & take equally-spaced splits in the sample
– Gives nearly equal buckets
– Can also use one-pass quantile algorithms (e.g., [GK01])

1-D Histograms: Equi-Depth (Maintenance)
Can maintain using one-pass algorithms (insertions only), or use a backing sample [GMP97b]: maintain a larger sample on disk in support of histogram maintenance
– Keep histogram bucket counts up-to-date by incrementing on row insertion and decrementing on row deletion
– Merge adjacent buckets with small counts
– Split any bucket with a large count (e.g., bucket 4 5 6 7 8 9 splits into 4 5 and 6 7 8 9), using the sample to select a split value, i.e., take the median of the sample points in the bucket range
This keeps counts within a factor of 2; for more nearly equal buckets, recompute from the sample.

1-D Histograms: Compressed [PIH96]
Create singleton buckets for the largest values, equi-depth over the rest
– Improvement over equi-depth, since exact info is kept for the largest values; e.g., join estimation in DB2 compares the largest values in the relations
– Construction: sorting + O(B log B) + one pass; can use a sample
– Maintenance: split & merge approach, as with equi-depth, but must also decide when to create and remove singleton buckets [GMP97b]

1-D Histograms: V-Optimal
[IP95] defined the V-optimal histogram & showed it minimizes the average selectivity estimation error for equality-joins & selections
– Idea: select buckets to minimize the frequency variance within buckets
[JKM98] gave an O(B*N^2)-time dynamic programming algorithm:
– F[k] = freq. of value k
– AVGF[i:j] = avg. freq. for values i..j
– SSE[i:j] = sum_{k=i..j} F[k]^2 - (j-i+1)*AVGF[i:j]^2
– For i = 1..N, precompute P[i] = sum_{k=1..i} F[k] and Q[i] = sum_{k=1..i} F[k]^2; then any SSE[i:j] can be computed in constant time
– Let SSEP(i,k) = min SSE for F[1]..F[i] using k buckets
– Then SSEP(i,k) = min_{j=1..i-1} (SSEP(j,k-1) + SSE[j+1:i]), i.e., it suffices to consider all possible left boundaries for the k-th bucket
– [JKM98] also gave faster approximation algorithms

Answering Queries: Histograms
Answering queries from 1-D histograms (in general): (implicitly) map the histogram back to an approximate relation, & apply the query to the approximate relation
– Continuous value mapping [SAC79]: the bucket count is spread evenly among the bucket's values
– Uniform spread mapping [PIH96]: needs the number of distinct values in each bucket
[Figure: the two mappings illustrated over domain values 1..20]

Self-Tuning 1-D Histograms
1. Tune bucket frequencies [AC99]:
– Compare the actual selectivity of a query range to the histogram estimate, and use the error to adjust bucket frequencies
– e.g., Actual = 60, Estimate = 40, Error = +20
– Divide d*Error proportionately among the buckets in the query range, where d is a dampening factor; with d = ½, the +10 is divided +4, +3, +3
2. Restructure:
– Merge buckets of near-equal frequencies
– Split high-frequency buckets
Also extends to multi-D.
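The [JKM98] dynamic program described above (prefix sums P and Q for constant-time SSE, then minimizing over all left boundaries of the k-th bucket) can be sketched as follows; this is a straightforward O(B*N^2) implementation under the slide's definitions, not the paper's code:

```python
def v_optimal_buckets(freq, num_buckets):
    """V-optimal bucketing by dynamic programming: minimize total within-bucket
    SSE of frequencies. Returns (min total SSE, list of (start, end) buckets,
    1-indexed inclusive)."""
    n = len(freq)
    # Prefix sums: P[i] = sum F[1..i], Q[i] = sum F[1..i]^2 (1-indexed).
    P = [0.0] * (n + 1)
    Q = [0.0] * (n + 1)
    for i, f in enumerate(freq, 1):
        P[i] = P[i - 1] + f
        Q[i] = Q[i - 1] + f * f

    def sse(i, j):
        # SSE[i:j] in O(1): sum F[k]^2 - (sum F[k])^2 / (j-i+1)
        s = P[j] - P[i - 1]
        return Q[j] - Q[i - 1] - s * s / (j - i + 1)

    INF = float("inf")
    # cost[k][i] = SSEP(i, k): min SSE for F[1]..F[i] using k buckets.
    cost = [[INF] * (n + 1) for _ in range(num_buckets + 1)]
    split = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    cost[0][0] = 0.0
    for k in range(1, num_buckets + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):   # left boundary of the k-th bucket
                c = cost[k - 1][j] + sse(j + 1, i)
                if c < cost[k][i]:
                    cost[k][i], split[k][i] = c, j
    # Recover bucket boundaries by walking the split table backwards.
    bounds, i = [], n
    for k in range(num_buckets, 0, -1):
        bounds.append((split[k][i] + 1, i))
        i = split[k][i]
    return cost[num_buckets][n], bounds[::-1]

# Example: two flat frequency regions, which 2 buckets separate perfectly.
err, buckets = v_optimal_buckets([2, 2, 2, 10, 10, 10], 2)
# err == 0.0, buckets == [(1, 3), (4, 6)]
```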
