Big Data Processing and Analysis Lecture Notes PDF
Document Details
Technical University of Crete
Minos Garofalakis
Summary
These notes cover Big Data Processing and Analysis, focusing on approximate query processing techniques. They describe how compact data synopses, including histograms, samples, and wavelets, support fast approximate query answering, and they cover the construction and dynamic maintenance of wavelet-based histograms.
Full Transcript
Big Data Processing and Analysis
Minos Garofalakis

Course Topics & Outline
- Approximate Query Processing
- Data Stream Processing
- Distributed Data Streams
- Parallelism & Data in the Cloud – Map-Reduce, Hadoop
- Projects (50%) -- potential topics TBA – Literature Survey, Implementation, Presentation

Approximate Query Processing using Data Synopses
[Figure: an SQL query against GB/TB of data in a Decision Support System (DSS) returns an exact answer, but with long response times; a "transformed" query against KB/MB of compact data synopses returns an approximate answer FAST!!]
How to construct effective data synopses??

Outline
- Intro & Approximate Query Answering Overview
- One-Dimensional Synopses
  – Histograms: Equi-depth, Compressed, V-optimal, Incremental maintenance, Self-tuning
  – Samples: Basics, Sampling from DBs, Reservoir Sampling
  – Wavelets: 1-D Haar-wavelet histogram construction & maintenance
- Multi-Dimensional Synopses and Joins
- Set-Valued Queries
- Discussion & Comparisons
- Advanced Techniques & Future Directions

Relations as Frequency Distributions

  name  age  salary  sales
  MG    34   100K    25K
  JG    33    90K    30K
  RR    40   190K    55K
  JH    36   110K    45K
  MF    39   150K    50K
  DD    45   150K    50K
  JN    43   140K    45K
  AP    32    70K    20K
  EM    24    50K    18K
  DW    24    50K    28K

[Figures: a one-dimensional distribution plotting tuple counts over the Age domain values, and a three-dimensional distribution of tuple counts over (age, salary, sales) cells.]

Previous Lecture: Histograms
- Partition the attribute value(s) domain into a set of buckets
- Issues: how to partition, what to store for each bucket, how to estimate an answer using the histogram
- [PIH96] introduced a taxonomy, algorithms, etc.
- Partitioning:
  – Equi-width -- dumb…
  – Equi-depth (quantiles)
  – V-Optimal -- minimize sum-squared error [JKMP98]

Outline
(repeated, unchanged)

Sampling: Basics
- Idea: a small random sample S of the data often well-represents all the data
  – For a fast approximate answer, apply the query to S & "scale" the result
  – E.g., R.a is {0,1} and S is a 20% sample:
      select count(*) from R where R.a = 0
      becomes  select 5 * count(*) from S where S.a = 0
    [Figure: the 30 values of R.a, with the six sampled ones shown in red; the sample contains two 0s.]
    Est. count = 5*2 = 10; exact count = 10
- Unbiased: for expressions involving count, sum, avg, the estimator is unbiased, i.e., the expected value of the answer is the actual answer, even for (most) queries with predicates!
- Leverage the extensive literature on confidence intervals for sampling: the actual answer is within the interval [a, b] with a given probability, e.g., 54,000 ± 600 with prob ≥ 90%
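A quick illustration of the scaled-count estimator (my own sketch, not part of the original slides): the code below draws a 20% uniform sample of the slide's 30 R.a values and scales the sample count by 5. The helper name estimate_count and the rough CLT-style interval at the end (using the 1.65 factor from the confidence-interval table a few slides below) are illustrative assumptions.

```python
import math
import random

# The 30 zero/one values of R.a from the slide's figure (10 zeros in total).
R_a = [int(b) for b in "1101" "1111000" "01111101" "1101011" "0110"]

def estimate_count(data, frac=0.20, seed=None):
    """Estimate `count(*) ... where a = 0` from a uniform sample.

    Draws a uniform sample of round(frac * len(data)) rows and scales
    the sample count by 1/frac (a 5x scale-up for a 20% sample).
    """
    rng = random.Random(seed)
    m = round(frac * len(data))       # sample size: 6 of 30 here
    sample = rng.sample(data, m)      # uniform sampling without replacement
    scale = 1.0 / frac
    return scale * sum(1 for v in sample if v == 0), sample

est, sample = estimate_count(R_a, seed=1)
exact = sum(1 for v in R_a if v == 0)
print(f"estimated count = {est:.0f}, exact count = {exact}")

# Rough CLT-style 90% interval for the scaled count: the per-row 0/1
# indicator has sample s.d. s, the scaled count is N * mean, so the
# half-width is 1.65 * N * s / sqrt(m).
ind = [1 if v == 0 else 0 for v in sample]
mean = sum(ind) / len(ind)
s = math.sqrt(sum((x - mean) ** 2 for x in ind) / (len(ind) - 1))
half = 1.65 * len(R_a) * s / math.sqrt(len(ind))
print(f"90% CI (CLT approx): {est:.0f} +/- {half:.1f}")
```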
Confidence Intervals & Probabilistic Guarantees
- With Replacement (SWR) vs. Without Replacement (SWOR): a technical distinction, not that important in practice…
- Example: the actual answer is within 10 ± 1 with prob ≥ 0.9
- Use tail inequalities to give probabilistic bounds on the returned answer:
  – Markov Inequality
  – Chebyshev's Inequality
  – Chernoff Bound
  – Hoeffding Bound

Basic Tools: Tail Inequalities
- General bounds on the tail probability of a random variable (that is, the probability that a random variable deviates far from its expectation)
[Figure: a probability distribution with mean µ; the shaded tail probability lies beyond µ ± µε.]
- Basic Inequalities: Let X be a random variable with expectation µ and variance Var[X]. Then, for any ε > 0:
    Markov (X non-negative):  Pr(X ≥ ε) ≤ µ / ε
    Chebyshev:                Pr(|X − µ| ≥ µε) ≤ Var[X] / (µ²ε²)

Tail Inequalities for Sums
- Possible to derive stronger bounds on tail probabilities for the sum of independent random variables
- Hoeffding's Inequality: Let X1, …, Xm be independent random variables with 0 ≤ Xi ≤ 1, let X̄ = (1/m) Σi Xi, and let µ = E[X̄]. Then, for any ε > 0:
    Pr(|X̄ − µ| ≥ ε) ≤ 2 exp(−2mε²)
- Chernoff Bound: For independent Bernoulli (0/1) trials with Pr(Xi = 1) = p, X = Σi Xi, and µ = mp, for any ε > 0:
    Pr(|X − µ| ≥ µε) ≤ 2 exp(−µε²/2)
- Application to count queries:
  – m is the size of the sample S (6 in the example)
  – p is the fraction of zeros in the data (unknown)
- Remark: the Chernoff bound results in tighter bounds for count queries compared to Hoeffding's inequality

Sampling: Confidence Intervals

  Method                    90% Confidence Interval (±)       Guarantees?
  Central Limit Theorem     1.65 · σ(S) / sqrt(|S|)           as |S| → ∞
  Hoeffding                 1.22 · (MAX − MIN) / sqrt(|S|)    always
  Chebyshev (known σ(R))    3.16 · σ(R) / sqrt(|S|)           always
  Chebyshev (est. σ(R))     3.16 · σ(S) / sqrt(|S|)           as σ(S) → σ(R)

- Confidence intervals for Average: select avg(R.A) from R (can replace R.A with any arithmetic expression on the attributes in R)
- σ(R) = standard deviation of the values of R.A; σ(S) = standard deviation for S.A
- With predicates, S above is the subset of the sample that satisfies the predicate
- The quality of the estimate depends only on the variance in R & |S| after the predicate: so a 10K-row sample may suffice for a 10B-row relation!
  – Advantage of larger samples: can handle more selective predicates

Sampling from Databases
- Sampling disk-resident data is slow
  – Row-level sampling has high I/O cost: must bring in an entire disk block to get one row
  – Block-level sampling: rows may be highly correlated
  – Random access pattern, possibly via an index
  – Need Acceptance/Rejection (A-R) sampling to account for the variable number of rows in a page, children in an index node, etc.
- Alternatives
  – Random physical clustering: destroys "natural" clustering
  – Precomputed samples: must incrementally maintain (at a specified size); fast to use: packed in disk blocks, can sequentially scan, can store as a relation and leverage full DBMS query support, can store in main memory

One-Pass (Streaming) Uniform Sampling
- Reservoir Sampling [Vit85]: maintains a sample R of fixed size M over a stream …, x(j), x(j−1), …, x(2), x(1) [Figure: a reservoir of M locations]
  – Add each new element to R with probability M/j, where j is the current number of stream elements
  – If an element is added, evict a random element from R (see the sketch below)
  – [Vitter] Instead of flipping a coin for each element, determine the number of elements to skip before the next one to be added to R
- Concise sampling [GM98]: duplicates in sample R stored as (value, count) pairs (thus potentially boosting the actual sample size)
  – Sampling probability dynamically reduced as the reservoir fills up
  – Subsample elements to evict when the sampling probability changes
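The reservoir scheme just described fits in a few lines of code. Below is a minimal Algorithm-R-style sketch (my own illustration, assuming Python's random module; Vitter's skip-ahead optimization is only noted in a comment, not implemented):

```python
import random

class Reservoir:
    """Fixed-size uniform sample over a stream (reservoir sampling [Vit85]).

    After j elements, each of them is in the reservoir with probability
    M/j, which is the invariant proved on the analysis slides that follow.
    """

    def __init__(self, M, seed=None):
        self.M = M                 # reservoir capacity (M locations)
        self.sample = []           # current contents R(j)
        self.j = 0                 # number of stream elements seen so far
        self.rng = random.Random(seed)

    def insert(self, x):
        self.j += 1
        if self.j <= self.M:
            self.sample.append(x)            # first M elements always kept
        elif self.rng.random() < self.M / self.j:
            # Insert with probability M/j, evicting a random victim.
            victim = self.rng.randrange(self.M)
            self.sample[victim] = x
        # Vitter's optimization: instead of one coin flip per element,
        # draw the number of elements to *skip* before the next insert.

r = Reservoir(M=5, seed=42)
for x in range(1, 101):       # a stream x(1), ..., x(100)
    r.insert(x)
print(r.sample)               # a uniform random 5-subset of 1..100
```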
Reservoir Sampling Analysis
- Let U(j) = {x(1), …, x(j)}, and let R(j) be the contents of the reservoir after x(j). We prove that R(j) is a uniform random sample (URS) of U(j), i.e., every M-subset is equally likely, with probability 1/C(j, M) (writing C(n, k) for "n choose k").
- By induction: the claim obviously holds for j ≤ M; assume it holds for U(j−1). Let A be any M-subset of U(j). Two cases:
- Case 1, x(j) is NOT in A:
    Pr[R(j) = A] = Pr[R(j−1) = A and x(j) is not inserted]
                 = (1 / C(j−1, M)) · (1 − M/j)
                 = 1 / C(j, M)
- Case 2, x(j) IS in A: define A(k) = A − {x(j)} + {x(k)}, where x(k) is any element of U(j−1) − A. Note that there are exactly (j − M) such sets A(k). Thus
    Pr[R(j) = A] = Σ over A(k) of Pr[R(j−1) = A(k) and x(j) is inserted and x(k) is evicted]
                 = (j − M) · (1 / C(j−1, M)) · (M/j) · (1/M)
                 = 1 / C(j, M)

Biased Sampling
- Often it is advantageous to sample different data at different rates (stratified sampling)
  – E.g., outliers can be sampled at a higher rate to ensure they are accounted for; better accuracy for small groups in group-by queries
- Each tuple j in the relation is selected for the sample S with some probability Pj (which can depend on the values in tuple j)
- If selected, it is added to S along with its scale factor sf = 1/Pj
- Answering queries from S, e.g.:

    R.a    10    10    10    50    50
    Pj    1/3   1/3   1/3   1/2   1/2
    S.sf   --     3    --    --     2      (only the second and fifth tuples were sampled)

    select sum(R.a) from R where R.b < 5   →   select sum(S.a * S.sf) from S where S.b < 5
    Sum(R.a) = 130;  Sum(S.a * S.sf) = 10·3 + 50·2 = 130, an unbiased answer
- A good choice of the Pj's results in tighter confidence intervals

Sampling: Summary
- Probably the oldest summarization tool (statistics, survey sampling)
- Commercial acceptance (together with histograms): most commercial systems have a SAMPLE operator (create & store sample views of tables)
- The only technique allowing for principled confidence bounds, based on statistics and tail inequalities
- Naturally multi-dimensional (all attributes come along for a sampled tuple)
- Many, many variants around: Uniform, Biased, Stratified, Acceptance/Rejection, Bernoulli, Cluster, …

Outline
(repeated, now with the sub-bullet "Synopses, System architectures, Commercial offerings" under Intro & Approximate Query Answering Overview)

Relations as Frequency Distributions
(repeat of the earlier slide: the example relation and its one- and three-dimensional frequency distributions)

One-Dimensional Haar Wavelets
- Wavelets: a mathematical tool for the hierarchical decomposition of functions/signals
- Haar wavelets: the simplest wavelet basis; easy to understand and implement
- Recursive pairwise averaging and differencing at different resolutions:

    Resolution   Averages                    Detail Coefficients
    3            [2, 2, 0, 2, 3, 5, 4, 4]    ----
    2            [2, 1, 4, 4]                [0, -1, -1, 0]
    1            [1.5, 4]                    [0.5, 0]
    0            [2.75]                      [-1.25]

- Haar wavelet decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]

Haar Wavelet Coefficients
- Hierarchical decomposition structure (a.k.a. "error tree")
[Figure: the error tree over the original data [2, 2, 0, 2, 3, 5, 4, 4], with root coefficient 2.75, below it -1.25, then 0.5 and 0, and 0, -1, -1, 0 above the data values; alongside, the "+/-" support regions of each coefficient.]

Wavelet-based Histograms [MVW98]
- Problem: range-query selectivity estimation
- Key idea: use a compact subset of Haar/linear wavelet coefficients to approximate the data distribution
  – Connection to traditional histograms (buckets ~ coefficients)…?
- Steps:
  – compute the (cumulative) data distribution C
  – compute the Haar (or linear) wavelet transform of C
  – coefficient thresholding: only b…
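As a companion to the "One-Dimensional Haar Wavelets" slide, here is a minimal sketch (my own illustration, not code from the notes) of the recursive pairwise averaging and differencing; on the slide's example data it reproduces the decomposition [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]:

```python
def haar_decompose(data):
    """1-D Haar wavelet decomposition by pairwise averaging/differencing.

    Input length must be a power of two. Returns [overall average,
    detail coefficients from coarsest to finest], matching the slide:
    [2, 2, 0, 2, 3, 5, 4, 4] -> [2.75, -1.25, 0.5, 0, 0, -1, -1, 0].
    """
    averages = list(data)
    details = []                        # coarser levels end up in front
    while len(averages) > 1:
        avgs, dets = [], []
        for a, b in zip(averages[0::2], averages[1::2]):
            avgs.append((a + b) / 2)    # pairwise average
            dets.append((a - b) / 2)    # pairwise (semi-)difference
        details = dets + details        # prepend this (coarser) level
        averages = avgs
    return averages + details           # [c0, coarse ..., fine details]

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```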