New Jersey Data Reduction Report PDF
Summary
This paper discusses data reduction techniques for large databases, covering both the internal needs of query optimizers and more recent data analysis applications. The report examines parametric and non-parametric methods, including histograms, clustering, and index trees. It also provides a taxonomy of data set types, covering distance-only and multi-dimensional-space data and characteristics such as dimensionality, sparseness, and skew.
Full Transcript
The New Jersey Data Reduction Report

Daniel Barbara, William DuMouchel, Christos Faloutsos, Peter J. Haas, Joseph M. Hellerstein, Yannis Ioannidis, H. V. Jagadish, Theodore Johnson, Raymond Ng, Viswanath Poosala, Kenneth A. Ross, Kenneth C. Sevcik

1 Introduction

There is often a need to get quick approximate answers from large databases. This leads to a need for data reduction. There are many different approaches to this problem, some of them not traditionally posed as solutions to a data reduction problem. In this paper we describe and evaluate several popular techniques for data reduction.

Historically, the primary need for data reduction has been internal to a database system, in a cost-based query optimizer. The need is for the query optimizer to estimate the cost of alternative query plans cheaply; clearly the effort required to do so must be much smaller than the effort of actually executing the query, and yet the cost of executing any query plan depends strongly upon the numerosity of specified attribute values and the selectivities of specified predicates. To address these query optimizer needs, many databases keep summary statistics. Sampling techniques have also been proposed.

More recently, there has been an explosion of interest in the analysis of data in warehouses. Data warehouses can be extremely large, yet obtaining answers quickly is important. Often, it is quite acceptable to sacrifice the accuracy of the answer for speed. Particularly in the early, more exploratory, stages of data analysis, interactive response times are critical, while tolerance for approximation errors is quite high. Data reduction, thus, becomes a pressing need.

The query optimizer need for estimates was completely internal to the database, and the quality of the estimates used was observable by a user only very indirectly, in terms of the performance of the database system. On the other hand, the more recent data analysis needs for approximate answers directly expose the user to the estimates obtained. Therefore the nature and quality of these estimates becomes more salient. Moreover, to the extent that these estimates are being used as part of a data analysis task, there may often be "by-products" such as, say, a hierarchical clustering of data, that are of value to the analyst in and of themselves.

Copyright 1997 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

Bulletin of the IEEE Computer Society Technical Committee on Data Engineering

Email addresses in order: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected].

1.1 The Techniques

For many in the database community, particularly with the recent prominence of data cubes, data reduction is closely associated with aggregation. Further, since histograms aggregate information in each bucket, and since histograms have been popularly used to record data statistics for query optimizers, one may naturally be inclined to think only of histograms when data reduction is suggested. A significant point of this report is to show that this is not warranted.
While histograms have many good properties, and may indeed be the data reduction technique of choice in many circumstances, there is a wealth of alternative techniques that are worth considering, and many of these are described below.

Following standard statistical nomenclature, we divide data reduction techniques into two broad classes: parametric techniques that assume a model for the data, and then estimate the parameters of this model, and non-parametric techniques that do not assume any model for the data. The former are likely, when well-chosen, to result in substantial data reduction. However, choosing an appropriate model is an art, and a parametric technique may not always do well with any given data set. In this paper we consider the singular value decomposition and the discrete wavelet transform as transform-based parametric techniques. We also consider linear regression models and log-linear models as direct, rather than transform-based, parametric techniques.

A histogram is a non-parametric representation of data. So is a cluster-based reduction of data, where each data item is identified by means of its cluster representative. Perhaps a more surprising inclusion is the notion of an index tree as a data reduction device. The central observation here is that a typical index partitions the data into buckets recursively, and stores some information regarding the data contained in the bucket. With minimal augmentation, it becomes possible to answer queries approximately based upon an examination of only the top levels of an index tree. If these top levels are cached in memory, as is typically the case, then one can view these top levels of the tree as a reduced form of data eminently suited for approximate query answering.

Finally, one way of reducing data is to bypass the data representation problem addressed in all the techniques above. Instead, one could just sample the given data set to produce a smaller reduced data set, and then operate on the reduced data set to obtain quick but approximate answers. This technique, even though not directly supported by any database system to our knowledge, is widely used by data analysts who develop and test hypotheses on small data samples first and only then do a major run on the full data set.

1.2 The Data Set

The appropriateness of any data reduction technique is centrally dependent upon the nature of the data set to be reduced. Based upon the foregoing discussion, it should be evident that there is a wide variety of data sets, used for a wide variety of analysis applications. Moreover, multi-dimensionality is a given, in most cases. To enable at least a qualitative discussion regarding the suitability of different techniques, we devised a taxonomy of data set types, described below.

1.2.1 Distance Only

For some data sets, all we have is a distance metric between data points, without any embedding of the data points into any multi-dimensional space. We call these distance only data sets. Many data reduction (and indexing) techniques do not apply to such data sets. However, an embedding in a multi-dimensional space can often be obtained through the use of multi-dimensional scaling, or other similar techniques.

1.2.2 Multi-dimensional Space

The bulk of our concern is with data sets where individual data points can be embedded into an appropriate multi-dimensional attribute space.
We consider various characteristics, in two main categories: intrinsic characteristics of each individual attribute, such as whether it is ordered or nominal, discrete or continuous; and extrinsic characteristics, such as sparseness and skew, which may apply to individual attributes or may be used to characterize the data set as a whole. We also consider dimensionality of the attribute space, which is a characteristic of the data set as a whole rather than that of any individual attribute.

1.2.3 Intrinsic Characteristics

We seem to divide the world strongly between ordered and unordered (or nominal) attributes. Unordered attributes can always be ordered by defining a hash label and sorting on this label. So the question is not as much whether the attribute is ordered by definition as whether it is ordered in spirit, that is, with useful semantics to the order. For example, a list of (customer) names sorted alphabetically is ordered by definition. However, for many reasonable applications, there is unlikely to be any pattern based on occurrence of name in the dictionary, and it is not very likely that queries will specify ranges of names. Therefore, for the purposes of data representation, such an attribute is effectively unordered. Similar arguments hold for account numbers, sorted numerically.

Ordered Attributes have values drawn from a finite or an infinite interval. The points in the data set may take discrete or continuous values. None of these details matter as far as data reduction techniques are concerned. There is a difference in language (and formal notation) between discrete and continuous domains. For convenience, we will use only the language of the discrete domain in what follows. The notation and language for a continuous attribute follows analogously, and is not given explicitly in this document. Some attributes, such as rank, may be ordered but have no metric associated with the order. We do not consider such attributes in this report.

Unordered Attributes can have values that are drawn from a flat or a hierarchical name space. The name space is said to be flat if there is no particular structure to the range of attribute values. A name space is said to be hierarchical if values within a sub-category are in some sense "closer" than values within the next higher level category, and so on, as in a file system hierarchy. Item stock unit numbers, library book classification numbers, and classification systems in general, all have this sort of property.

1.2.4 Extrinsic Attributes

Sparse Data Set: A data set is said to be sparse if most points in the defined attribute space have no corresponding data points. Conversely, a data set is dense if most coordinate points in the attribute space have at least one data point defined. At least for ordered and hierarchical attributes, one can aggregate "ranges" of attribute values to change a sparse space into a less sparse (or more dense) data space. While such manipulations may be common in practice, we work with the data set as given to the data representation/reduction process, paying no heed to any pre-processing steps that may have been involved.

Skewed Data Set: A data set is said to be skewed if the number of data points per attribute point has a high variance across the entire data space, but has a substantially lower variance in appropriately defined "local" regions. Note that this definition of skew applies only to ordered attributes.
Note also that an over-fine disaggregation of attribute values will make it hard to observe skew; aggregation into appropriate size ranges is required. Finally, note that this definition is for "skew in frequency". There is a different notion of skew, "skew in value", where the attribute value for a small number of data points differs substantially from the attribute value for the bulk of the population. Skew in value implies skew in frequency over the attribute value range. However, the converse is not necessarily true.

1.2.5 Dimensionality

By default, we assume all data sets to be low dimensional, that is, represented in ten or fewer dimensions. A data set with more dimensions (attributes) is said to be high dimensional. There are many tricks that can be used to reduce the dimensionality of a given data set. We look at the data set after any such techniques have been applied.

1.3 Metrics

In this paper we focus purely on data reduction; the value of a hierarchical clustering of data, for example, to a data analyst or data miner, is not considered, except in so far as it results in less data storage and quick approximate answers to queries. Thus, the primary metric applied to a data reduction technique is how accurate it can be in response to queries as a function of the storage space consumed or as a function of the time taken to respond. In most cases, the time to respond is closely related to the storage consumed.

There are secondary metrics as well. Data often changes, and we may care how easy it is to maintain the reduced storage structure incrementally in the face of additions and deletions. Some data reduction techniques may require a complete recomputation, and this is clearly not desirable. Finally, progressive resolution refinement may sometimes be useful. We may want to produce an approximate answer very rapidly, and then progressively improve the approximation with time. A few data reduction techniques may permit this sort of refinement.

1.4 Outline of Paper and Acknowledgment of Contribution

The paper is organized into sections, one per technique. In each section, the technique is first described and then its applicability to the different types of data sets is explored. Finally, our summary conclusions are presented in Section 10. Section 2 on the Singular Value Decomposition and Section 3 on the discrete Wavelet transform were primarily written by Christos Faloutsos. Section 4 on Linear Regression was primarily written by Daniel Barbara. Section 5 on Log-Linear Models was primarily written by Bill DuMouchel. Section 6 on Histograms was primarily written by Vishy Poosala and Yannis Ioannidis. Section 7 on Clustering was primarily written by Raymond Ng. Section 8 on Index Trees was primarily written by Ken Sevcik and Joe Hellerstein. Section 9 was primarily written by Peter Haas. The introduction and conclusion sections were primarily written by H. V. Jagadish. The paper as a whole was edited, smoothed, and formatted by Joe Hellerstein and Ken Ross.

2 Singular Value Decomposition (SVD)

The first proposed method is based on the so-called Singular Value Decomposition (SVD) of the data matrix. SVD is a popular and powerful operation, and it has been used in numerous applications, such as statistical analysis (as the driving engine behind Principal Component Analysis [Jol86]), text retrieval under the name of Latent Semantic Indexing [Dum94], pattern recognition and dimensionality reduction as the Karhunen-Loeve (KL) transform [DH73], and face recognition [TP91].
SVD is particularly useful in settings that involve least-squares optimization, such as linear regression, dimensionality reduction, and matrix approximation. See [Str80] or [PTVF96] for more details. The latter citation also gives 'C' code.

Example 1: Table 1 provides an example of the kind of matrix that is typical in warehousing applications, where rows are customers, columns are days, and the values are the dollar amounts spent on phone calls each day. Alternatively, rows could correspond to patients, with hourly recordings of their temperature for the past 48 hours, or companies, with stock closing prices over the past 365 days. Such a setting also appears in other contexts. In information retrieval systems rows could be text documents, columns could be vocabulary terms, with the (i, j) entry showing the importance of the j-th term for the i-th document.

customer    We 7/10/96   Th 7/11/96   Fr 7/12/96   Sa 7/13/96   Su 7/14/96
ABC Inc.    1            1            1            0            0
DEF Ltd.    2            2            2            0            0
GHI Inc.    1            1            1            0            0
KLM Co.     5            5            5            0            0
Smith       0            0            0            2            2
Johnson     0            0            0            3            3
Thompson    0            0            0            1            1

Table 1: Example of a (customer-day) matrix

To make our discussion more concrete, we will refer to rows as "customers" and to columns as "days". The mathematical machinery is applicable to many different applications, such as those mentioned in the preceding paragraph, including ones where there is no notion of a customer or a day, as long as the problem involves a set of vectors or, equivalently, an N × M matrix X.

2.1 Description

2.1.1 Preliminaries

We shall use the following notational conventions from linear algebra: bold capital letters denote matrices, e.g., U, X; bold lower-case letters denote column vectors, e.g., u, v; the "×" symbol indicates matrix multiplication. The SVD is based on the concepts of eigenvalues and eigenvectors:

Definition 2.1: For a square n × n matrix S, the unit vector u and the scalar λ that satisfy

$S \times u = \lambda \times u$   (1)

are called an eigenvector and its corresponding eigenvalue of the matrix S.

2.1.2 Intuition behind SVD

Before we give the definition of SVD, it is best that we try to give the intuition behind it. Consider a set of points as before, represented as an N × M matrix X. In our running example, such a matrix would represent, for N customers and M days, the dollar amount spent by each customer on each day. It would be desirable to group similar customers together, as well as similar days together. This is exactly what SVD does, automatically! Each group corresponds to a "pattern" or a "principal component", i.e., an important grouping of days that is a "good feature" to use, because it has a high discriminatory power and is orthogonal to the other such groups. Figure 1 illustrates the rotation of axes that SVD implies: suppose that we have M = 2 dimensions; then our customers are 2-d points, as in Figure 1. The corresponding 2 directions (x' and y') that SVD suggests are shown. The meaning is that, if we are allowed only k = 1, the best direction to project on is the direction of x'; the next best is y', etc. See Example 2 for more details and explanations.

[Figure 1: Illustration of the rotation of axes that SVD implies: the "best" axis to project on is x'.]

2.1.3 Definition of SVD

The formal definition for SVD follows:

Theorem 2.1 (SVD): Given an N × M real matrix X we can express it as

$X = U \times \Lambda \times V^t$   (2)

where U is a column-orthonormal N × r matrix, r is the rank of the matrix X, Λ is a diagonal r × r matrix and V is a column-orthonormal M × r matrix.

Proof: See [PTVF96, p. 59].
Recall that a matrix U is called column-orthonormal if its columns u_i are mutually orthogonal unit vectors. Equivalently: $U^t \times U = I$, where I is the identity matrix. Also, recall that the rank of a matrix is the highest number of linearly independent rows (or columns).

Eq. 2 equivalently states that a matrix X can be brought into the following form, the so-called spectral decomposition [Jol86, p. 11]:

$X = \lambda_1 u_1 v_1^t + \lambda_2 u_2 v_2^t + \dots + \lambda_r u_r v_r^t$   (3)

where u_i and v_i are column vectors of the U and V matrices respectively, and λ_i the diagonal elements of the matrix Λ. Without loss of generality, we can assume that the eigenvalues λ_i are sorted in decreasing order. Returning to Figure 1, v_1 is exactly the unit vector of the best x' axis; v_2 is the unit vector of the second best axis, y', and so on.

Geometrically, Λ gives the strengths of the dimensions (as eigenvalues), V gives the respective directions, and U gives the locations along these dimensions where the points occur.

In addition to axis rotation, another intuitive way of thinking about SVD is that it tries to identify "rectangular blobs" of related values in the X matrix. This is best illustrated through an example.

Example 2: For the above "toy" matrix of Table 1, we have two "blobs" of values, while the rest of the entries are zero. This is confirmed by the SVD, which identifies them both:

$X = \begin{bmatrix} 0.18 & 0 \\ 0.36 & 0 \\ 0.18 & 0 \\ 0.90 & 0 \\ 0 & 0.53 \\ 0 & 0.80 \\ 0 & 0.27 \end{bmatrix} \times \begin{bmatrix} 9.64 & 0 \\ 0 & 5.29 \end{bmatrix} \times \begin{bmatrix} 0.58 & 0.58 & 0.58 & 0 & 0 \\ 0 & 0 & 0 & 0.71 & 0.71 \end{bmatrix}$   (4)

or, in "spectral decomposition" form:

$X = 9.64 \times \begin{bmatrix} 0.18 \\ 0.36 \\ 0.18 \\ 0.90 \\ 0 \\ 0 \\ 0 \end{bmatrix} \times [0.58, 0.58, 0.58, 0, 0] \; + \; 5.29 \times \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0.53 \\ 0.80 \\ 0.27 \end{bmatrix} \times [0, 0, 0, 0.71, 0.71]$

Notice that the rank of the X matrix is r = 2: there are effectively 2 types of customers, weekday (business) and weekend (residential) callers, and two patterns (i.e., groups-of-days): the "weekday pattern" (that is, the group {'We', 'Th', 'Fr'}) and the "weekend pattern" (that is, the group {'Sa', 'Su'}).

The intuitive meaning of the U and V matrices is as follows:

Observation 2.1: U can be thought of as the customer-to-pattern similarity matrix.

Observation 2.2: Symmetrically, V is the day-to-pattern similarity matrix. For example, v_{1,2} = 0 means that the first day ('We') has zero similarity with the 2nd pattern (the "weekend pattern").

Observation 2.3: The column vectors v_j (j = 1, 2, ...) of V are unit vectors that correspond to the directions for optimal projection of the given set of points. For example, in Figure 1, v_1 and v_2 are the unit vectors on the directions x' and y', respectively.

Observation 2.4: The i-th row vector of U gives the coordinates of the i-th data vector ("customer"), when it is projected in the new space dictated by SVD.

For more details and additional properties of the SVD, see [KJF97] or [Fal96].

2.2 Distance-Only Data

SVD can be applied to any attribute types, including unordered ones, like 'car-type' or 'customer-name', as we saw earlier. It will naturally group together similar 'customer-names' into customer groups with similar behavior.

2.3 Multi-Dimensional Data

As described, SVD is tailored to 2-d matrices. Higher dimensionalities can be handled by reducing the problem to 2 dimensions. For example, for the DataCube ('product', 'customer', 'date')('dollars-spent') we could create two attributes, such as 'product' and ('customer' × 'date'). Direct extension to 3-dimensional SVD has been studied, under the name of 3-mode PCA [KD80].
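Not part of the original report: the following is a minimal numpy sketch of the decomposition in Example 2 and of rank-k truncation as a data reduction step. The matrix X is the customer-day table of Table 1; the variable names are mine.

```python
# Sketch only: SVD of the Table 1 customer-day matrix, and a rank-k
# truncation that keeps just the k strongest "patterns".
import numpy as np

X = np.array([[1, 1, 1, 0, 0],    # ABC Inc.
              [2, 2, 2, 0, 0],    # DEF Ltd.
              [1, 1, 1, 0, 0],    # GHI Inc.
              [5, 5, 5, 0, 0],    # KLM Co.
              [0, 0, 0, 2, 2],    # Smith
              [0, 0, 0, 3, 3],    # Johnson
              [0, 0, 0, 1, 1]],   # Thompson
             dtype=float)

# Full decomposition X = U * diag(lam) * Vt (Eq. 2).
U, lam, Vt = np.linalg.svd(X, full_matrices=False)
print(np.round(lam, 2))            # strengths: roughly [9.64, 5.29, 0, 0, 0]

# Reduced representation: store only U[:, :k], lam[:k], Vt[:k, :].
k = 2
X_k = U[:, :k] @ np.diag(lam[:k]) @ Vt[:k, :]
print(np.round(np.abs(X - X_k).max(), 6))   # ~0: rank 2 captures Table 1
```

Note that numpy may return the columns of U and V with flipped signs relative to Eq. 4; the reconstructed product is unaffected.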
2.3.1 Ordered and Unordered Attributes

SVD can handle them all, as mentioned under the "Distance-Only" subsection above.

2.3.2 Sparse Data

SVD can handle sparse data. For example, in the Latent Semantic Indexing (LSI) method, SVD is used on very sparse document-term matrices [FD92]. Fast sparse-matrix SVD algorithms have recently been developed [Ber92].

2.3.3 Skewed Data

SVD can handle skewed data. In fact, the more skewed the data values, the fewer eigenvalues SVD will need to achieve a small error.

2.3.4 High-Dimensional Data

As mentioned, SVD is geared towards 2-dimensional matrices.

3 Wavelets

3.1 Description

The Discrete Wavelet Transform (DWT) is a signal processing technique that is well suited for data reduction. A k-d signal is a k-dimensional matrix (or, technically, tensor, or DataCube, in our terminology). For example, a 1-d signal is a vector (like a time-sequence); a 2-d signal is a matrix (like a grayscale image); etc. The DWT is closely related to the popular Discrete Fourier Transform (DFT), with the difference that it typically achieves better lossy compression: for the same number of coefficients retained, DWT shows smaller error, on real signals. Thus, given a collection of time sequences, we can encode each one of them with its few strongest coefficients, suffering little error. Similarly, given a k-d DataCube, we can use the k-d DWT and keep a small fraction of the strongest coefficients, to derive a compressed approximation of it. We focus first on 1-dimensional signals; the DWT can be applied to signals of any dimensionality, by applying it first on the first dimension, then the second, etc. [PTVF96].

Contrary to the DFT, there is more than one wavelet transform. The simplest to describe and code is the Haar transform. Ignoring temporarily some proportionality constants, the Haar transform operates on the whole signal (e.g., time-sequence), giving the sum and the difference of the left and right part; then it focuses recursively on each of the halves, and computes the difference of their two sub-halves, etc., until it reaches an interval with only one sample in it. It is instructive to consider the equivalent, bottom-up procedure. The input signal x must have a length n that is a power of 2, by appropriate zero-padding if necessary.

1. Level 0: take the first two sample points x_0 and x_1, and compute their sum s_{0,0} and difference d_{0,0}; do the same for all the other pairs of points (x_{2i}, x_{2i+1}). Thus, s_{0,i} = C × (x_{2i} + x_{2i+1}) and d_{0,i} = C × (x_{2i} - x_{2i+1}), where C is a proportionality constant, to be discussed soon. The values s_{0,i} (0 ≤ i < n/2) constitute a "smooth" (= low frequency) version of the signal, while the values d_{0,i} represent the high-frequency content of it.

2. Level 1: consider the "smooth" s_{0,i} values; repeat the previous step for them, giving the even-smoother version of the signal s_{1,i} and the smooth-differences d_{1,i} (0 ≤ i < n/4).

3. ... and so on recursively, until we have a smooth signal of length 2.

The Haar transform of the original signal x is the collection of all the "difference" values d_{l,i} at every level l and offset i, plus the smooth component s_{L,0} at the last level L (L = log_2(n) - 1). Following the literature, the appropriate value for the constant C is 1/√2, because it makes the transformation matrix orthonormal (e.g., see Eq. 8). An orthonormal matrix is a matrix which has columns that are unit vectors and that are mutually orthogonal.
Adapting the notation (e.g., from [Cra94] [VM]), the Haar transform is defined as follows:

$d_{l,i} = 1/\sqrt{2} \; (s_{l-1,2i} - s_{l-1,2i+1}) \qquad l = 0, \dots, L; \quad i = 0, \dots, n/2^{l+1} - 1$   (5)

with

$s_{l,i} = 1/\sqrt{2} \; (s_{l-1,2i} + s_{l-1,2i+1}) \qquad l = 0, \dots, L; \quad i = 0, \dots, n/2^{l+1} - 1$   (6)

with the initial condition:

$s_{-1,i} = x_i$   (7)

For example, the 4-point Haar transform is as follows. Envisioning the input signal x as a column vector, and its Haar transform w as another column vector (w = [s_{1,0}, d_{1,0}, d_{0,0}, d_{0,1}]^t, the superscript t denoting transposition), the Haar transform is equivalent to a matrix multiplication, as follows:

$\begin{bmatrix} s_{1,0} \\ d_{1,0} \\ d_{0,0} \\ d_{0,1} \end{bmatrix} = \begin{bmatrix} 1/2 & 1/2 & 1/2 & 1/2 \\ 1/2 & 1/2 & -1/2 & -1/2 \\ 1/\sqrt{2} & -1/\sqrt{2} & 0 & 0 \\ 0 & 0 & 1/\sqrt{2} & -1/\sqrt{2} \end{bmatrix} \times \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}$   (8)

The above procedure is shared among all the wavelet transforms: we start at the lowest level, applying two functions at successive windows of the signal: the first function does some smoothing, like a weighted average, while the second function does a weighted differencing; the smooth (and, notice, shorter: halved in length) version of the signal is recursively fed back into the loop, until the resulting signal is too short. There are numerous wavelet transforms [PTVF96], some popular ones being the so-called Daubechies-4 and Daubechies-6 transforms [Dau92].

3.1.1 Discussion

The computational complexity of the above transforms is O(n), as can be verified from Eqs. 5-7. In addition to their computational speed, there is a fascinating relationship between wavelets, multiresolution methods (like quadtrees or the pyramid structures in machine vision), and fractals. The reason is that wavelets, like quadtrees, will need only a few non-zero coefficients for regions of the image (or the time sequence) that are smooth (i.e., homogeneous), while they will spend more effort on the "high activity" areas. It is believed [Fie93] that the mammalian retina consists of neurons which are each tuned to a different wavelet. Naturally occurring scenes tend to excite only few of the neurons, implying that a wavelet transform will achieve excellent compression for such images. Similarly, the human ear seems to use a wavelet transform to analyze a sound, at least in the very first stage [Dau92, p. 6] [WS93].

In conclusion, the Discrete Wavelet Transform (DWT) achieves even better energy concentration than the DFT and Discrete Cosine (DCT) transforms, for natural signals [PTVF96, p. 604]. It uses multiresolution analysis, and it models well the early signal processing operations of the human eye and human ear.

3.2 Distance-Only Data

In this case, DWT can only be applied after the data have been mapped to a k-dimensional space, with, e.g., multidimensional scaling, or FastMap [FL95].

3.3 Multi-Dimensional Data

As mentioned, the DWT can be applied to a k-dimensional hyper-cube. In fact, it has been very successful in image compression [PTVF96], where a grayscale image is treated as a 2-d matrix.

3.3.1 Ordered and Unordered Attributes

DWT will give good results for ordered attributes, when successive values tend to be correlated (which is typically the case in real datasets). For unordered attributes (like "car-type"), DWT can still be applied, but it won't give the good compression we would like.

3.3.2 Sparse Data

DWT will work fine on sparse data; it will just have zero coefficients in the deserted regions of the address space.
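Not from the report: a short sketch of the orthonormal Haar transform of Eqs. 5-7, together with the "keep the few strongest coefficients" reduction described above. The input values and the function name haar are illustrative.

```python
# Sketch only: bottom-up Haar DWT (Eqs. 5-7) plus coefficient thresholding.
import numpy as np

def haar(x):
    """Orthonormal Haar transform of a signal whose length is a power of 2."""
    s = np.asarray(x, dtype=float)
    details = []
    while len(s) > 1:
        sums  = (s[0::2] + s[1::2]) / np.sqrt(2.0)   # smooth part, Eq. 6
        diffs = (s[0::2] - s[1::2]) / np.sqrt(2.0)   # detail part, Eq. 5
        details.append(diffs)
        s = sums                                     # recurse on the smooth signal
    details.append(s)                                # final smooth component s_{L,0}
    return np.concatenate(details[::-1])             # [s_{L,0}, d_{L,0}, ..., d_{0,*}]

x = np.array([3, 1, 0, 4, 8, 6, 9, 9], dtype=float)
w = haar(x)
print(np.round(w, 3))

# Lossy reduction: zero out everything except the k strongest coefficients.
k = 3
weak = np.argsort(np.abs(w))[:-k]
w_reduced = w.copy()
w_reduced[weak] = 0.0
print(np.round(w_reduced, 3))
```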
3.3.3 Skewed Data

DWT should work well on skewed data because it is adaptable: it will have many non-zero coefficients for the portion of the address space that has large values, and near-zero coefficients for the rest.

3.3.4 High-Dimensional Data

As mentioned several times before, the definition of DWT can be trivially extended to arbitrary dimensionalities. However, although linear on the number of cells of the k-d matrix, notice that the number of cells itself grows exponentially with the number of dimensions k. This is the only point that may create efficiency problems. However, most of the competitors will run into similar problems, too (and, probably, sooner than DWT).

4 Regression

Regression is a popular technique that attempts to model data as a function of the values of a multi-dimensional vector. The simplest form of regression is that of Linear Regression [WW85], in which a variable Y is modeled as a linear function of another variable X, using Equation 9.

$Y = \alpha + \beta X$   (9)

The parameters α and β specify the line and are to be estimated by using the data at hand. To do this, one should apply the least squares criterion to the known values Y_1, Y_2, ..., X_1, X_2, .... The least squares formulas for Equation 9 yield the values of β and α as shown in Equations 10 and 11 respectively.

$\beta = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}$   (10)

$\alpha = \bar{Y} - \beta \bar{X}$   (11)

where X̄ and Ȳ are the average values for the data points X_1, X_2, ... and Y_1, Y_2, ... respectively.

The extension of Linear Regression, called Multiple Regression, takes account of more than one independent variable X, allowing us to model Y as a linear function of a multidimensional vector. An example of a Multiple Regression model based on two dimensions is shown in Equation 12.

$Y = b_0 + b_1 X_1 + b_2 X_2$   (12)

Again, b_0, b_1 and b_2 must be estimated using the values at hand. The general procedure to do least squares fitting for Multiple Regression can be found in [PTVF96].

It is also possible to use nonlinear functions to perform data regression. Equation 13 shows an example of a nonlinear regression between variables Y and X.

$Y = b_0 + b_1 X + b_2 X^2 + b_3 X^3$   (13)

To estimate the parameters of (13), we could simply define the new variables:

$X_1 = X \qquad X_2 = X^2 \qquad X_3 = X^3$   (14)

By the substitutions shown in Equation 14, Equation 13 becomes a linear model:

$Y = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3$   (15)

that can be solved with the standard least squares techniques. The method of redefining variables to make the model linear is quite general. For instance, terms like reciprocals (1/X) and cosines (cos(X)) can be easily redefined as linear terms. This technique does not work, however, if the nonlinearity is present in the parameters to be estimated. A notorious case of nonlinearity that can be easily removed by taking logarithms is shown in Equation 16:

$Y = b_0 X^{b_1}$   (16)

The following substitutions

$Y_1 = \log(Y) \qquad \alpha = \log(b_0) \qquad \beta = b_1 \qquad X_1 = \log(X)$   (17)

transform Equation 16 into the linear model shown in Equation 18.

$Y_1 = \alpha + \beta X_1$   (18)

Other models are intractably nonlinear and cannot be subject to any transformation that renders them linear (e.g., the sum of exponential terms). For these models, it is sometimes possible to obtain least-squares estimates by performing a lot of calculations on more complex formulae.

Let us describe now in which cases regression can be used to compress or characterize data.
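Not from the report: a brief sketch of the least-squares fit of Eqs. 10-11 and of the cubic model of Eq. 13 fitted as a linear model after the substitutions of Eq. 14. The data values are made up for illustration.

```python
# Sketch only: simple linear regression and a "linearized" polynomial fit.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1])

# Simple linear regression, Eqs. 10-11.
beta  = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha = y.mean() - beta * x.mean()
print(alpha, beta)                         # Y is approximated by alpha + beta * X

# Cubic model of Eq. 13 after the substitutions X1 = X, X2 = X^2, X3 = X^3
# (Eq. 14): it is linear in b0..b3, so ordinary least squares applies.
A = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])
b, *_ = np.linalg.lstsq(A, y, rcond=None)
print(b)                                   # estimates of b0, b1, b2, b3
```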
4.1 Distance-Only Data

Clearly regression is useless with this kind of data, since the data is not embedded in any multi-dimensional space.

4.2 Multi-Dimensional Data

1. Ordered: This is the class of data for which regression applies most naturally. A simple example would be the case of modeling the amount of sales in a store as a function of the date.

2. Unordered (Flat/Hierarchical): Although unordered data can always be nominally ordered and thus subjected to regression, the model obtained by doing this may not be very meaningful. Alternatively, one could do the regression using the range of a function whose domain is formed by the attribute values. An example of such a function is one to compute marginal values. Consider a dataset that maps "amount of sales" to the variables "store location" and "date". Clearly, the variable "store location" is not ordered. However, it is possible to create a variable X whose domain is formed by the cumulative sales for each store location, and use X as an ordered variable for the regression model.

4.2.1 Sparse Data

The sparseness of the data does not affect the applicability of regression. It is sometimes advisable to perform the regression only for the multidimensional points that are non-zero, to make the model fit the non-zero data better. For example, stores with amount of sales = 0 may be exceptional and fit a regression model poorly.

4.2.2 Skewed Data

Skewness is actually a good feature for regression. Skewed data is more likely to fit better in a model, given that the proper model is found.

4.2.3 High-Dimensional Data

High dimensionality forces the usage of a multiregression model. The price of using a multiregression model of a high degree is performance. Given a model, the dataset that needs to be modeled might not fit in memory, forcing the estimation of the regression parameters to perform several passes over the data, thus slowing down the process. To alleviate this problem, one might choose to model portions of the dataset, by fixing the values of one or several of the dimensions in each portion. Each of the models will have a smaller degree and the corresponding datasets will also be smaller, perhaps small enough to fit in main memory, making the estimation a faster process. By doing this, however, one increases the number of estimations that need to be performed (one for each portion of the original dataset). Therefore there exists a tradeoff between the size of the portions modeled and the performance of the overall modeling process. If the portions of the dataset are too small, each individual modeling process runs fast, but there may be too many of them to be performed, thus offsetting the gains obtained by the individual runs. On the other hand, if the portions are too big, each individual modeling effort will run slowly, but there will be few models to be run. The tradeoff depends heavily on how sparse the data is: if only a few multidimensional cells are non-zero, then even a high degree portion of the data set (one with only a few of the dimensions fixed) may be small enough to fit in memory.

In the rest of this section we discuss three important aspects of the use of regression as a data reduction technique.

Accuracy: Accuracy, of course, depends on how well the chosen model fits the real data. In practice, however, even with simple models such as linear regression (and multiregression) one can obtain a reasonable approximation to the dataset. One way of getting better accuracy progressively is by reducing the influence that outliers have on the model by giving them less weight in the least squares regression.
This method is known as biweight regression or robust regression; an example of this is the use of weighted least squares [WW85]. The first thing to do is determine whether a data value is an outlier. That is usually done by measuring the difference between the real value and its estimate, as in Equation 19.

$d = Y - \hat{Y}$   (19)

It is customary to normalize d, dividing it by some overall measure of spread S, as shown in Equation 20. An example of S is the interquartile range of all deviations (i.e., the difference between the 25th percentile and the 75th percentile of deviations).

$Z = \frac{Y - \hat{Y}}{3S}$   (20)

With Z, we can compute the biweights, shown in Equation 21.

$w = \begin{cases} (1 - Z^2)^2 & \text{if } |Z| \le 1 \\ 0 & \text{otherwise} \end{cases}$   (21)

Equation 21 effectively makes the weight of an outlier equal to 0. These weights are now used in the estimation formulas shown in Equations 22 and 23.

$\beta = \frac{\sum w (X - \hat{X})(Y - \hat{Y})}{\sum w (X - \hat{X})^2}$   (22)

$\alpha = \hat{Y} - \beta \hat{X}$   (23)

where the averages Ŷ and X̂ are found by using Equations 24 and 25.

$\hat{Y} = \frac{\sum w Y}{\sum w}$   (24)

$\hat{X} = \frac{\sum w X}{\sum w}$   (25)

The new estimates of α and β support a more robust model. We can improve on them by recalculating the deviations using Equation 20 and estimating new values of α and β using new weights. That should give an even better line. This process can be repeated until no substantial improvement can be obtained.

Progressive resolution refinement: A way of obtaining progressively refined answers is to store the outliers of the model. A first cut of the answer consists of the estimated values for all the points requested. That answer can be polished by retrieving the real values of the outliers, progressively replacing the estimated values for those data points. A technique similar to this has been successfully used in [BS97].

Incremental maintenance: As new data gets incorporated in the dataset, the relevant model(s) need to be updated to reflect the effect of this data. The updating of the model can be achieved by using techniques similar to those described in [CR94] to update polynomial models for selectivity estimation. The techniques use a method called recursive least-square-error [You84] to avoid a lot of expensive recomputation.
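Not from the report: a hedged sketch of the iterative biweight procedure of Eqs. 19-25 (fit, compute deviations, derive weights, refit). The spread S is taken as the interquartile range of the deviations, as suggested above; the sample data and the fixed number of iterations are mine.

```python
# Sketch only: iteratively reweighted least squares with biweights.
import numpy as np

def biweight_fit(x, y, iterations=5):
    """Robust line fit: least squares reweighted by the biweights of Eq. 21."""
    w = np.ones_like(y)                                    # first pass: all weights 1
    for _ in range(iterations):
        # Weighted least-squares estimates, Eqs. 22-25.
        x_hat = np.sum(w * x) / np.sum(w)
        y_hat = np.sum(w * y) / np.sum(w)
        beta  = np.sum(w * (x - x_hat) * (y - y_hat)) / np.sum(w * (x - x_hat) ** 2)
        alpha = y_hat - beta * x_hat
        # Deviations, spread, and new biweights, Eqs. 19-21.
        d = y - (alpha + beta * x)
        s = np.percentile(d, 75) - np.percentile(d, 25)    # interquartile range as S
        z = d / (3 * s) if s > 0 else np.zeros_like(d)
        w = np.where(np.abs(z) <= 1, (1 - z ** 2) ** 2, 0.0)
    return alpha, beta

x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.0, 6.1, 7.0, 30.0])    # last point is an outlier
print(biweight_fit(x, y))                                  # outlier is down-weighted
```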
5 Log-Linear Models

Log-linear modeling is a methodology for approximating discrete multidimensional probability distributions. The multi-way table of joint probabilities is approximated by a product of lower-order tables. For example, suppose the four categorical attributes A, B, C, and D can respectively assume the values a = 1, ..., K_A; b = 1, ..., K_B; c = 1, ..., K_C; and d = 1, ..., K_D. Then, if p(a, b, c, d) = Prob(A = a, B = b, C = c, D = d), one might assume a model of the form:

$p(a, b, c, d) = \alpha_{ab} \, \beta_{ac} \, \gamma_{ad} \, \delta_{bcd}$   (26)

for given matrices α, β, γ and three-dimensional array δ. The simplest log-linear model is that of independence, which in this example becomes p(a, b, c, d) = α_a β_b γ_c δ_d. The presence of multiple subscripts in the same array allows for greater dependency within the distributions of the associated attributes. The name "log-linear" is used for these models in the statistics literature because log p is assumed to be a linear combination of unknown parameters. The phrase "multiplicative model" is more common in the computer science literature. Such models have been discussed and used since the 1940s or earlier, but especially since the 1970s, when computer algorithms to fit them became widely available. Many textbook treatments of log-linear modeling are available, for example [Agr90] and [BFH75]. Sample references from the Computer Science literature are [KK69], [Pea88], and [Mal91].

Log-linear models use only categorical variables; continuous variables must be discretized first, and even then the modeling will not make use of the ordinal nature of the categories. The purpose of using this technique can be either data compression (since the several small arrays will take up less storage than the full multidimensional array) or data smoothing (since estimates of the small arrays will be less subject to sampling variation than elements of the full array), assuming that the full array was computed as observed proportions from a sample.

Using log-linear models involves two steps: choosing a general form (how many factors to use and what sets of attributes are associated with each factor) and then estimating the numerical values of the array elements for each factor (parameter estimation). An important result due to [Bir63] is that, given the results of step one, the parameter estimation problem only requires as input the marginal proportions corresponding to the combinations of attributes making up the factor arrays. By marginal proportion we mean the sum of the values of elements in the datacube corresponding to appropriate specified attributes, with all other attributes projected out. In the example of Equation 26, the parameters α, β, γ, δ can all be estimated using just the marginals p(a, b, +, +), p(a, +, c, +), p(a, +, +, d) and p(+, b, c, d), where "+" denotes summation over the appropriate range. In statistical terms, the indicated marginals are sufficient statistics for the parameters, and no more information is needed to estimate the parameters efficiently, assuming that the model of Equation 26 holds. In addition, the computed approximation will fit the input marginal distributions exactly. Another application of the methodology occurs when only certain marginal tables are available, and it is required to extend the probability distribution to the complete array, as in [Mal89].

The estimation of the parameter arrays can sometimes, for certain assumptions of factor combinations called decomposable models or graphical models, be quite simple, involving just simple arithmetic products and ratios of the given marginal probabilities. In general, however, an iterative method will be required to obtain the maximum likelihood (maximum entropy) estimates for scaling factors to be applied to the marginals. The most common such method is iterative proportional scaling, generally attributed to [DS40], which is guaranteed to converge to a unique solution whenever the marginal arrays have all positive elements and are consistent with each other. One drawback of the standard iterative proportional scaling algorithm is that it requires storage and computation over the complete estimated probability array, which could be quite large. For example, if there are 20 attributes, the complete array could have 10^10 cells, depending on the number of values each variable takes on. Even if the data base represents millions of entities, the vast majority of the cells will have zero count. In such situations, there is obviously a great advantage to choosing a decomposable model. Among others, [Mal91] discusses how to search the set of decomposable models for a good-fitting model. As in all such model choice problems, one must consider the usual tradeoff between parsimony and variance reduction on the one hand, and adequacy of representation on the other.
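Not from the report: a small sketch of iterative proportional scaling for a three-attribute model of the form p(a, b, c) = α_ab β_bc, i.e., one whose sufficient statistics are the (A, B) and (B, C) marginals. The observed table is randomly generated for illustration; a real fit would start from observed proportions in the warehouse.

```python
# Sketch only: iterative proportional scaling (fitting) of two marginals.
import numpy as np

rng = np.random.default_rng(0)
observed = rng.integers(1, 20, size=(2, 3, 2)).astype(float)
observed /= observed.sum()                      # observed joint proportions

target_ab = observed.sum(axis=2)                # sufficient statistics: the
target_bc = observed.sum(axis=0)                # AB and BC marginal tables

p = np.full_like(observed, 1.0 / observed.size) # start from the uniform table
for _ in range(50):
    p *= (target_ab / p.sum(axis=2))[:, :, None]   # scale to match the AB marginal
    p *= (target_bc / p.sum(axis=0))[None, :, :]   # scale to match the BC marginal

print(np.abs(p.sum(axis=2) - target_ab).max())  # ~0: AB marginal reproduced
print(np.abs(p.sum(axis=0) - target_bc).max())  # ~0: BC marginal reproduced
```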
On the whole, log-linear modeling is a powerful and flexible technique that scales up well to many dimensions and has many favorable and well understood statistical properties. The user can specify that arbitrary marginal distributions be fit exactly, and be assured that all estimated probabilities remain within the unit interval. At the cost of discretizing continuous variables, it can be applied to any data type.

5.1 Distance-Only Data

Assuming that the distances can be rounded to a discrete number of values, log-linear models might be useful for data reduction, depending on the complexity of the attributes for labeling endpoints.

5.2 Ordered Data

There is no problem with using ordered categories with log-linear models, and some extensions of these models have been proposed to explicitly use the ordinal information, as in [Agr90] (Chapter 8).

5.3 Unordered Data

This type of data is the primary application for log-linear models.

5.4 Sparse Data

The log-linear methodology does not require dense data. However, as mentioned above, a very high-dimensional sample will usually have many cells with zero count in its multiway frequency table. Thus one may be limited for computational reasons to decomposable log-linear models. This may limit the adequacy of the representation of a sparse data set, depending on the complexity of the distributional dependencies.

5.5 Skewed Data

Since the user of log-linear models is free to choose discrete values to match the distribution of observed values, skewness of data values is not a problem. Skewness of frequencies (presence of some very large counts) is also not necessarily a problem, although such data may mean that simple log-linear models fit poorly.

5.6 High-Dimensional Data

As mentioned above, log-linear models scale up fairly well to ten or so dimensions for arbitrary models. Above that number, it may be necessary to restrict consideration to decomposable models, which have fewer and weaker dependency relations.

5.7 Accuracy

The ability to choose more complex log-linear models allows the user to tune the accuracy of the fit as desired. There is also a well-developed statistical theory providing measures of goodness of fit of models, hypothesis tests comparing models, and confidence limits for the estimated parameters.

5.8 Progressive Resolution Refinement

Once a log-linear model has been constructed, computing the answer to any point query is rather easy, involving simply the multiplication of a few numbers, so progressive resolution refinement is not of much value. The method of iterative proportional scaling allows the result of fitting one log-linear model to be used as a starting point when fitting a more complex log-linear model, that is, a model inputting higher-dimensional marginals than the first model. This provides good control over the resolution of the model.

5.9 Incremental Maintenance

As new data are collected, the values of the marginal proportions used as input to the model fitting will change. As in the previous item, the fit from the previously computed data cube can be used to begin the iterative proportional scaling and hasten convergence compared to default initial values. However, in practice, the savings in computation will probably not be large in either situation, perhaps in the range of 10-50%.
6 Histograms

Histograms approximate the data in one or more attributes of a relation by grouping attribute values into "buckets" (subsets) and approximating true attribute values and their frequencies in the data based on summary statistics maintained in each bucket. For most real-world databases, there exist histograms that produce low-error estimates while occupying reasonably small space (of the order of 500 bytes in a catalog).(1) Hence, they are the most commonly used form of statistics in practice (e.g., they are used in DB2, Informix, Ingres, Oracle, Microsoft SQL Server, Sybase, and Teradata). They are used mainly for selectivity estimation purposes within a query optimizer. They have also been used in query execution (e.g., for parallel-join load balancing [PI96]) and there is work in progress on using them for approximate query answering.

6.1 Definitions

In what follows, histograms are defined in the context of a single attribute. The extensions to multiple attributes can be found elsewhere [PI97].

The domain D of an attribute X in relation R is the set of all possible values of X, and the (finite) value set V (⊆ D) is the set of values of X that are actually present in R. Let V = {v_i : 1 ≤ i ≤ D}, where v_i < v_j when i < j. The spread s_i of v_i is defined as s_i = v_{i+1} - v_i, for 1 ≤ i < D. (We take s_0 = v_1 and s_D = 1.) In this section we only consider numerical attributes. A commonly used technique for constructing histograms on non-numerical attributes (such as string fields, etc.) is to use a function that converts these data types into floating point numbers before constructing a histogram.(2) The frequency f_i of v_i is the number of tuples t ∈ R with t.X = v_i. Finally, the area a_i of v_i is equal to v_i × f_i. The data distribution of X (in R) is the set of pairs T = {(v_1, f_1), (v_2, f_2), ..., (v_D, f_D)}.

Typically, several real-life attributes tend to have skewed or highly non-uniform data distributions. The main characteristics of such distributions are unequal frequencies and/or unequal spreads. A histogram on attribute X is constructed by partitioning the data distribution T into disjoint subsets called buckets and approximating the frequencies and values in each bucket in some common fashion. Typically, the frequencies are approximated by their average. The value domain is most often approximated by the entire set of values in the bucket's range (the continuous value assumption). A much more accurate approximation, however, and the one that has been used in recent research, is one assuming that all the values in a bucket are separated by the same amount from their next neighbor (the uniform spread assumption). In all cases, histograms can be viewed as approximate data distributions of the underlying attributes and used in any estimation problem requiring those distributions. These definitions are illustrated in the following example.

Example 1: Consider a relation with schema EMP(ename, salary). The following table shows how each parameter defined above is instantiated for this relation.

Quantity                Set of values
Attribute value {v_i}   10   60   70   120  140  190
Frequency {f_i}         110  90   20   30   70   80
Spread {s_i}            50   10   50   20   50   1

Figure 2 plots the data distribution, with attribute values on the x-axis and frequencies on the y-axis. Figure 3 corresponds to the approximate data distribution arising from a histogram using three buckets and making the uniform spread assumption.

[Figure 2: Data Distribution. Figure 3: Approximated distribution (three buckets, Bkt 1-3, over the same attribute-value axis).]

(1) Nevertheless, one can construct data distributions that cannot be approximated well using a small number of buckets.
(2) For example, this technique is used in IBM's DB2-6000 system.
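Not from the report: a short sketch that derives the quantities of Example 1 (value set, frequencies, spreads, and areas as defined above) from a raw salary column.

```python
# Sketch only: building the data distribution T and related quantities.
import numpy as np

# Raw column: each salary value repeated according to its frequency in Example 1.
salary = np.repeat([10, 60, 70, 120, 140, 190], [110, 90, 20, 30, 70, 80])

values, freqs = np.unique(salary, return_counts=True)  # the pairs (v_i, f_i), i.e. T
spreads = np.append(np.diff(values), 1)                # s_i = v_{i+1} - v_i, s_D = 1
areas = values * freqs                                 # a_i as defined above

print(values)    # [ 10  60  70 120 140 190]
print(freqs)     # [110  90  20  30  70  80]
print(spreads)   # [ 50  10  50  20  50   1]
```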
One of the key factors affecting the accuracy of histograms is the partitioning rule employed in determining the buckets. In order to illustrate this, two well-known classes of histograms, the equi-width and equi-depth histograms, are described next. Both these histograms group contiguous ranges of attribute values into buckets and assume that all attribute values within the range corresponding to a given bucket occur with equal probability. Their difference lies in the exact choice of bucket boundaries. In an equi-width histogram, the widths of all buckets' ranges are the same; in an equi-depth (or equi-height) histogram, the total number of tuples having the attribute values associated with each bucket is the same.

In [PIHS96], a set of key properties that characterize histograms has been identified, forming the basis for a taxonomy of histograms. These properties essentially determine the effectiveness of histograms in approximating data distributions and are the following: the sort parameter, which determines the order in which the attribute-value/frequency pairs of the data distribution are grouped in the histogram; the histogram class, which determines the sizes of buckets allowed in the histogram (e.g., are singleton buckets mandatory?); the source parameter, which represents the quantity that the histogram should try to capture accurately; and the partition constraint, which is the mathematical rule that determines where exactly the histogram boundaries will fall based on the source parameter. Both the sort and the source parameters are functions of the attribute-value/frequency pairs in the data distribution. Examples include the attribute value itself, the frequency itself, and the area. The partition constraints include the following.

Equi-sum: In an equi-sum histogram (with β buckets), the sum of the source values in each bucket is approximately the same and equal to 1/β times the sum of all the source values in the histogram.

V-Optimal: The V-Optimal histogram on an attribute is the histogram with the least variance among all the histograms using the same number of buckets. Here, the variance of a histogram is the weighted sum of the variances of its source parameter values in each bucket, with the weights being equal to the number of values in that bucket.

MaxDiff: In a MaxDiff histogram, there is a bucket boundary between two source parameter values that are adjacent (in sort parameter order) if the difference between these values is one of the β-1 largest such differences.

Compressed: In a Compressed histogram, the n highest source values are stored separately in n singleton buckets; the rest are partitioned as in an equi-sum histogram. Often n is the number of source values that (a) exceed the sum of all source values divided by the number of buckets and (b) can be accommodated in a histogram with β buckets.

Spline-based: In a spline-based histogram, the maximum absolute difference between a source value and the average of the source values in its bucket is minimized.

We refer to a histogram with c, u, and s as the partition constraint, source parameter, and sort parameter as the c(s, u) histogram.
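Not from the report: a minimal sketch of a MaxDiff partitioning on the data of Example 1, using the attribute value as the sort parameter and, for simplicity, the frequency as the source parameter (the report's preferred variants use the area). Bucket summaries follow the usual average-frequency approximation.

```python
# Sketch only: MaxDiff-style bucket boundaries at the beta-1 largest
# differences between adjacent source values.
import numpy as np

values = np.array([10, 60, 70, 120, 140, 190])
freqs  = np.array([110, 90, 20, 30, 70, 80])
beta   = 3                                        # number of buckets

diffs = np.abs(np.diff(freqs))                    # differences of adjacent source values
cuts  = np.sort(np.argsort(diffs)[-(beta - 1):])  # positions of the beta-1 largest ones

buckets = np.split(np.arange(len(values)), cuts + 1)
for idx in buckets:
    lo, hi = values[idx[0]], values[idx[-1]]
    print(f"[{lo}, {hi}]  values={len(idx)}  avg_freq={freqs[idx].mean():.1f}")
```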
Figure 4 provides an overview of the new combinations that were introduced in [PIHS96] together with the traditional combinations. Ecient sampling-based techniques exist for computing all classes of histograms and are given in [PIHS96]. SORT SOURCE PARAMETER PARAMETER SPREAD (S) FREQUENCY (F) AREA (A) CUM. FREQ (C) EQUI−SUM EQUI−SUM VALUE (V) V−OPTIMAL V−OPTIMAL V−OPTIMAL MAX−DIFF MAX−DIFF SPLINE BASED COMPRESSED COMPRESSED V−OPTIMAL FREQUENCY (F) MAX−DIFF V−OPTIMAL AREA (A) MAX−DIFF Figure 4: Augmented Histogram Taxonomy. Most of the work on histograms is in the context of evaluating their accuracy in estimating the result sizes of queries containing selections [Koo80, PSC84] and joins [IC93, Ioa93, IP95a]. Multi-dimensional histograms have also been studied in detail [MD88, PI97]. By building histograms on multiple attributes together, these techniques are able to capture dependencies between those attributes. Incremental maintenance techniques for histograms and samples have also been investigated [GMP97], as has the use of histograms in parallel-join load balancing [PI96]. Finally, there are several sources where one may nd extensive discussions of histogram-based estimation techniques [Koo80, IP95b, MCS88, Poo97]. In the following sections, the e ectiveness of histograms in approximating di erent kinds of data is studied. 6.2 Distance-Only Data The current histogram techniques cannot approximate such data, because they rely on information about the placement of data in a multi-dimensional space. A possible solution would be to identify new choices for the parameters of the taxonomy that are based on distance and a new value domain approximation technique that does not rely on data placement. 6.3 Multi-Dimensional Data 1. Ordered: Histograms are well suited for approximating ordered data (discrete or continuous domains, nite or in nite intervals). Most of the research so far on the accuracy of histograms has focussed on ordered data and has shown that the most accurate classes of histograms for ordered data in fact preserve the order in grouping values into buckets. 21 2. Unordered: Histograms that do not use the attribute value as the sort parameter assume that there is no inherent ordering among the attribute values. Hence, these histograms can also be used to approximate unordered at data. On the other hand, all techniques for approximating the value domain within a bucket rely on an inherent order among those values. As a result, a histogram on unordered data needs to keep track of all values falling within each bucket, which is clearly impractical for large value domains. In summary, it is unclear how histograms can be used to eciently approximate unordered data. This discussion applies to hierarchical unordered data as well, with an important exception. Often the hierarchy structure can be used to group values into buckets (e.g., bucket per each week), in which case the values inside a bucket (often) need not be stored (e.g., days of the week) and the above problem disappears. 6.3.1 Sparse Data Histograms have been shown in earlier work to be highly e ective in approximating sparse and dense data [PIHS96]. Speci cally, histograms making the uniform spread assumption work for both kinds of data while the older continuous value assumption works well only for dense data. 6.3.2 Skewed Data Histograms have been shown to be most e ective in approximating highly skewed data (frequency and value domain skews) as well as nearly uniform data. 
For high skews, there are a few significant attribute values (or frequencies) that can be captured accurately by the histogram using an appropriate choice of sort and source parameters. For nearly uniform data, most histograms are likely to be highly accurate because the uniformity assumptions within the bucket will not result in high errors. Surprisingly, histograms perform relatively the poorest on data that is moderately skewed. For example, when several values have high but dissimilar frequencies, grouping them into a bucket will incur high errors because of the dissimilarities, and hence one needs more buckets to be accurate in this case.

6.3.3 High-Dimensional Data

Research has shown that multi-dimensional histograms are highly effective in approximating data in multiple (scalar) attributes of a relation. In all these studies, however, the number of attributes has been less than 5. More work needs to be done to effectively use histograms for very high dimensions. The same applies to approximating multi-dimensional data within a single attribute as well (e.g., polygons).

6.4 Aspects of histogram usage

Accuracy: Although histograms are used in many systems, many of the histograms proposed in earlier works are not always effective or practical. For example, equi-depth histograms [Koo80, MD88, PSC84] work well for range queries only when the data distribution has low skew, while V-Optimal(F,F) histograms [IC93, Ioa93, IP95a] have only been proven optimal for equality joins and selections when a list of all the attribute values in each bucket is maintained. Earlier work has shown that the most accurate and practical histograms belong to the V-Optimal(V,A) and MaxDiff(V,A) classes [Poo97]. Briefly, these histograms group contiguous ranges of values into buckets and avoid grouping attribute values with highly different areas. These histograms have been shown to be highly accurate for both join and selection queries [Poo97].

Progressive resolution refinement: A histogram on flat (non-hierarchical) data cannot be used to provide different levels of resolution. On the other hand, histograms built on hierarchical data can be used at various levels of the hierarchy (if the buckets were also constructed based on the hierarchy) to provide progressive resolution refinement.

Incremental maintenance: The common approach taken by all commercial systems is to periodically recompute the histogram from the updated data. This approach leads to inaccurate estimates from outdated histograms and can be quite expensive when used on a database with a very large number of relations. Recent work has shown that some classes of histograms can be maintained efficiently and accurately using incremental techniques [GMP97]. These techniques make use of a (possibly disk-resident) backing sample, which is also incrementally maintained as a uniformly random representative of the underlying data. The histograms are kept in main memory and are updated frequently to preserve their accuracy, while the sample is accessed very infrequently, basically when the histogram becomes too inaccurate.

6.5 End-biased histograms (Outliers)

End-biased histograms are a special case of histograms that have only singleton buckets, i.e., each bucket has a single attribute-value/frequency pair. As a result, the value and its frequency are accurately captured. Obviously, due to limited space, not all values in the relation can be stored in this manner.
6.5 End-biased histograms (Outliers)

End-biased histograms are a special case of histograms that have only singleton buckets, i.e., each bucket holds a single attribute-value/frequency pair. As a result, the value and its frequency are captured accurately. Obviously, due to limited space, not all values in the relation can be stored in this manner. Even this limited number of buckets, however, often provides highly accurate estimates, either directly or by supplementing other statistics. Hence, these outlier isolation techniques can be employed alongside any of the data reduction techniques described in this report.

There has been a lot of work on storing and using outlier information in databases. Some of the research on histogram-based join result size estimation has shown the benefits of storing values with extreme frequencies. The class of end-biased histograms contains a few high-frequency values and a few low-frequency values in singleton buckets and the rest in a single large bucket [Ioa93]. (Storing the lowest-frequency values in singleton buckets is useful when the distribution has several equal high frequencies and a few much smaller frequencies; often, however, only the high-frequency values are stored in singleton buckets.) These histograms are less expensive to construct than the general class of histograms, occupy less space, and often offer equally high accuracy for join queries.

A few commercial systems also employ singleton buckets for selectivity estimation purposes. For example, DB2 stores a small number of attribute values with the highest frequencies in the relation. For a highly skewed relation with a few very high frequencies, this information by itself may be enough to provide accurate estimates. When enhanced with a usual histogram on the remaining data, the combined set of statistics has been shown to be highly accurate. These combined statistics are in fact also used in DB2 and are known as Compressed histograms [PIHS96].

In the following sections, the effectiveness of using singleton buckets is discussed. In practice, these statistics are almost always used in conjunction with other forms of statistics, which eliminates most of the deficiencies noted below.

6.5.1 Distance-Only, Ordered, Unordered Data

End-biased histograms are not affected by these properties of the data because they store each value individually in a singleton bucket and do not assume anything about the relationships between data values.

6.5.2 Sparse Data

End-biased histograms are very well suited for sparse data because there are fewer values that need to be captured. On the other hand, they are incapable of approximating dense data because of the limited number of (singleton) buckets.

6.5.3 Skewed Data

End-biased histograms are most effective in approximating highly skewed data (both frequency and value domain skews). The reason is that skewed data often contains very few values that cause the skew (e.g., values with extreme frequencies), and these can be captured accurately using the singleton buckets. These histograms, however, fail to capture low to medium skews because of the large number of significant values that would have to be captured.

6.5.4 High-Dimensional Data

End-biased histograms function the same way in capturing significant values in data of any dimensionality.
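Before turning to how end-biased histograms are used, here is a minimal sketch of the structure described at the start of this section: the k most frequent values are kept exactly in singleton buckets, and all remaining values are summarized by a single residual bucket. The function names and the residual-bucket statistics are illustrative assumptions, not taken from the systems or papers cited above.

```python
from collections import Counter

def build_end_biased(data, k):
    """Keep the k most frequent values exactly; lump the rest into one bucket."""
    counts = Counter(data)
    singletons = dict(counts.most_common(k))          # value -> exact frequency
    rest = {v: c for v, c in counts.items() if v not in singletons}
    residual = {
        "count": sum(rest.values()),                  # total frequency of the rest
        "distinct": len(rest),                        # number of remaining values
    }
    return singletons, residual

def estimate_eq_selectivity(value, singletons, residual):
    """Equality-predicate estimate: exact for singleton values, uniform otherwise."""
    if value in singletons:
        return singletons[value]
    if residual["distinct"] == 0:
        return 0.0
    return residual["count"] / residual["distinct"]   # uniformity assumption

data = ["a"] * 500 + ["b"] * 300 + ["c"] * 5 + ["d"] * 3 + ["e"] * 2
singletons, residual = build_end_biased(data, 2)
print(estimate_eq_selectivity("a", singletons, residual))  # 500 (exact)
print(estimate_eq_selectivity("d", singletons, residual))  # ~3.3 (approximate)
```

The example also shows why end-biased histograms shine on highly skewed data: the few values responsible for the skew sit in singleton buckets and are answered exactly, while the uniformity assumption over the residual bucket does little damage.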
6.6 Aspects of end-biased histogram usage

Accuracy: Although singleton buckets are used in some commercial systems, their accuracy has not been studied much in the literature. It has been shown that end-biased histograms are quite accurate in estimating join result sizes, particularly when the data is skewed [IP95a]. On the other hand, since they do not approximate the entire data distribution, they cannot be used effectively for estimating the result sizes of selection predicates. There are two ways to increase the accuracy of end-biased histograms: adding more singleton buckets, and carefully choosing the values to be preserved in singleton buckets (which need not always be the high-frequency values).

Progressive resolution refinement: End-biased histograms directly provide the finest partitioning of the data (into individual values) and hence cannot provide further resolution.

Incremental maintenance: Gibbons and Matias have designed efficient techniques, with theoretical bounds on accuracy, to incrementally maintain the highest-frequency values in a database relation [GM96].

7 Clustering Techniques

In the past 30 years, cluster analysis has been widely studied in statistics. The objective is to identify clusters embedded in the data. Intuitively, a cluster is a collection of data objects that are "similar" to one another, thus legitimizing the treatment of all the objects collectively as one group. Similarity is expressed in terms of a distance function, which is typically, though not necessarily, a metric. In other words, for each pair of data objects p1, p2, the distance D(p1, p2) is known. In addition to a distance function, there is a separate "quality" function that measures the "goodness" of a cluster. One example of a quality function is the centroid distance, i.e., the average over {D(p1, c) | p1 ∈ Cl}, where c is the centroid of all the objects in cluster Cl. Another example is the diameter, i.e., the maximum over {D(p1, p2) | p1, p2 ∈ Cl}.

Even though similarity between objects and goodness of clusters can be defined, it is much harder to define "similar enough" and "good enough". The fundamental question is how many natural clusters there are in a given dataset. The answer to this question is typically highly subjective and remains an open issue in cluster analysis [KR90]. Existing clustering algorithms deal with this issue in one of two ways.

7.1 Overview of Existing Algorithms

7.1.1 Statistical Methods

The first way is to avoid answering the question entirely by giving a complete clustering of the dataset. That is, if there are n objects in the dataset, a clustering algorithm of this type specifies how to group the objects into 1, 2, ..., n clusters. Clustering algorithms of this type are called hierarchical methods. They are either agglomerative (i.e., "bottom-up" in computer science jargon) or divisive (i.e., "top-down"). Given n objects to be clustered, an agglomerative method begins with n clusters (i.e., all objects are apart). In each step, based on the distance and quality functions, it chooses two clusters to merge. This process continues until it puts all objects into one group. Conversely, a divisive method begins by putting all objects in one cluster. In each step, it picks a cluster to split into two. This process continues until it produces n clusters.

While hierarchical methods have been successfully applied to many biological applications (e.g., for producing taxonomies of animals and plants, and for classification of diseases [KR90]), they are well known to suffer from the weakness that they can never undo what was done previously. Once an agglomerative method merges two objects, those objects are always in one cluster; and once a divisive method separates two objects, they are never re-grouped into the same cluster. More importantly, for large datasets, producing all n clusterings is excessive and computationally prohibitive.
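As a minimal illustration of the agglomerative scheme just described (not the algorithm of any particular system), the sketch below starts from singleton clusters and repeatedly merges the pair of clusters whose union has the smallest diameter, the quality function defined above, until a target number of clusters remains.

```python
def diameter(cluster, dist):
    """Quality function from above: the maximum pairwise distance in a cluster."""
    return max((dist(p, q) for p in cluster for q in cluster), default=0.0)

def agglomerative(points, dist, target_k):
    """Bottom-up clustering: start with singletons, repeatedly merge the pair of
    clusters whose merged diameter is smallest, until target_k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = diameter(clusters[i] + clusters[j], dist)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters

one_d_distance = lambda a, b: abs(a - b)
print(agglomerative([1, 2, 3, 10, 11, 50], one_d_distance, 3))
# three clusters: {1, 2, 3}, {10, 11}, {50} (list order may vary)
```

Note that once two objects have been merged they stay together for the rest of the run, which is precisely the "cannot undo" weakness mentioned above.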
The second way to answer the question of how many natural clusters there are in the dataset is to ask the human user, not the clustering algorithm, to determine that number and to feed the number as input to the algorithm. Given the number, denoted k, a partitioning clustering method tries to find the best k partitions of the n objects, i.e., each object is assigned to exactly one group. (A few methods tolerate a limited degree of overlap between clusters; see [KR90] for more details.) However, the task of finding the best k partitions amounts to solving a nonconvex discrete optimization problem. Exhaustive enumeration of all partitions appears to be the only way to find the globally optimal solution. Thus, the development of partitioning methods focuses on heuristics that try to strike as good a balance as possible between efficiency and finding solutions close to the global optimum. The k-means and k-medoids algorithms are well-known examples of partitioning methods. They have found many successful application areas, including social studies (e.g., classification of statistical findings), manufacturing (e.g., garment), and chemical analysis.

7.1.2 Database Methods

In recent years, some database researchers have revisited the clustering problem. For them, there is the additional concern that the datasets may be a lot larger than the typical sizes used for statistical clustering (i.e., hundreds of thousands, if not millions, of objects versus only hundreds or thousands). Furthermore, because the data may be mainly disk-resident, there is also an emphasis on minimizing I/O cost.

Based on randomized search, CLARANS can be viewed as an extension of the k-medoids algorithm [NH94, KR90]. It is highly tunable, depending on how much CPU time the user can afford. Focusing techniques based on spatial access methods (e.g., R*-trees, Voronoi diagrams) were later developed to reduce the I/O cost required by CLARANS [EKX95]. By employing a balanced tree structure called the CF tree, BIRCH makes explicit and takes full advantage of the amount of available buffering space [ZRL96]: a single scan of the dataset gives a basic clustering, and additional scans can be used to improve the quality further. Relying on two parameters, the size of the neighborhood and the minimum number of data points in the neighborhood, DBSCAN connects regions of sufficiently high density into clusters [EKXS96]. As such, it does a better job of finding elongated clusters than most of the algorithms mentioned above; it uses an R*-tree to achieve good performance. Finally, STING is a hierarchical cell structure that stores statistical information (e.g., density) about the objects in the cells [WYM97]. Thus, with only one scan of the dataset, clustering can be achieved using the stored information, without recourse to the individual objects.

7.1.3 Machine Learning Methods

A few clustering methods have been developed in the machine learning community. They are mostly probability-based approaches [Fis87], and they typically assume that the probability distributions on different attributes are independent of each other. In practice, this is often too strong an assumption, because correlations may exist between attributes. In fact, as far as data reduction is concerned, correlation is exactly what is being searched for.
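To make the partitioning approach described above concrete, here is a minimal k-means sketch for points in a multi-dimensional space, assuming the usual Euclidean distance; it is an illustration only, not the variant used by any of the systems above.

```python
import math, random

def kmeans(points, k, iterations=20, seed=0):
    """Minimal k-means: assign each point to its nearest centroid, then
    recompute centroids as the mean of each group; repeat a fixed number of times."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[i].append(p)
        centroids = [
            tuple(sum(x) / len(g) for x in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids, groups

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, groups = kmeans(pts, 2)
print(centroids)  # roughly [(1.33, 1.33), (8.33, 8.33)] (order may vary)
```

k-medoids differs mainly in that each cluster is represented by one of its actual objects (a medoid) rather than by a mean, which is what allows it, and extensions such as CLARANS, to work with arbitrary, even non-metric, distance functions.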
7.2 Distance-Only Data

As discussed above, all clustering methods require the specification of a distance function. Distance-only datasets thus pose no additional problem to clustering methods. Many clustering methods (e.g., k-medoids, CLARANS) can even handle non-metric distance functions.

In our evaluation of all data reduction methods, we measure (i) the accuracy and space/time tradeoff, and check whether (ii) progressive resolution refinement and (iii) incremental maintenance are supported. As far as clustering methods are concerned, the first criterion is the key one. Once clustering methods are considered to be applicable and provide good accuracy, the other two criteria of progressive resolution refinement and incremental maintenance follow automatically. For example, CLARANS, BIRCH and DBSCAN all provide various parameters for tuning and incremental maintenance. Thus, in the sequel, we focus only on the first criterion.

7.3 Multi-Dimensional Data

7.3.1 Ordered and Unordered Attributes

1. Ordered: For ordered datasets, clustering algorithms should work well. This is because the underlying order provides a natural distance function for the clustering algorithms to use. For instance, if age is the ordered attribute under consideration, then the distance between A and B is the difference between the ages of A and B.

2. Unordered (Flat/Hierarchical): For flat datasets, clustering algorithms do not work. This is because clustering algorithms require the existence of a distance function. If equality is the only meaningful comparison operation for the dataset, there is not enough discrimination among the objects for the algorithms to reason with. Consequently, there is only one trivial clustering structure: the equivalence classes of the objects, i.e., objects are in the same cluster if and only if they are equal to each other. For hierarchical datasets, the answer to the question of whether clustering algorithms work well is not as clear-cut as in the ordered and flat cases. On the one hand, the underlying domain is unordered and provides no distance function for the clustering algorithms. On the other hand, the structure of the hierarchy can be used to provide candidate distance functions. For example, the distance between two objects p1 and p2 can be defined in terms of the path length between p1 and anc and the path length between p2 and anc (e.g., their sum), where anc is the lowest common ancestor of p1 and p2, i.e., the common ancestor farthest from the root; a sketch of one such distance function is given below, after Section 7.3.2. For some applications, candidate distance functions of this nature provide reasonable clustering quality; for others, they do not.

7.3.2 Sparse Data

For sparse datasets, clustering algorithms should work well. The sparsity translates into discrimination between objects. Clustering algorithms should be most effective when the sparsity is localized, corresponding to distinct clustering structures. In this case, clustering algorithms that can be tuned are preferred, because the great discrimination between some objects renders a coarse-grained but efficient analysis sufficient. For dense datasets, the embedded clustering structures are less distinct and crisp. Clustering algorithms still work, but their effectiveness is weakened. One of the reasons is that most clustering algorithms produce groups that do not overlap; for dense datasets, this requirement is restrictive. Algorithms that allow overlap perform better under this circumstance.
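Returning to the hierarchical case of Section 7.3.1, the following is a minimal sketch of a hierarchy-based candidate distance, taking the distance to be the sum of the two path lengths to the lowest common ancestor; the child-to-parent representation and the toy hierarchy are illustrative assumptions.

```python
def path_to_root(node, parent):
    """List of nodes from `node` up to the root, following child -> parent links."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def hierarchy_distance(p1, p2, parent):
    """Path length from p1 to the lowest common ancestor plus the path length
    from p2 to that ancestor (the common ancestor farthest from the root)."""
    up1 = path_to_root(p1, parent)
    up2 = path_to_root(p2, parent)
    ancestors2 = set(up2)
    for steps, node in enumerate(up1):
        if node in ancestors2:            # first shared node = lowest common ancestor
            return steps + up2.index(node)
    raise ValueError("nodes are not in the same hierarchy")

# A toy product hierarchy: child -> parent.
parent = {"cola": "soft drinks", "lemonade": "soft drinks",
          "soft drinks": "beverages", "coffee": "beverages"}
print(hierarchy_distance("cola", "lemonade", parent))  # 2 (via "soft drinks")
print(hierarchy_distance("cola", "coffee", parent))    # 3 (via "beverages")
```

Whether such a distance yields good clusters is, as noted above, application-dependent.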
7.3.3 Skewed Data

For datasets that are skewed in the value domain, clustering algorithms should work very well. This situation is very similar to the localized sparsity scenario discussed above. As for datasets that are skewed in the frequency domain, clustering algorithms are not affected. This is because all data objects having the same attribute value behave identically as far as clustering is concerned. Thus, it is sufficient to pick one representative per attribute value to participate in the clustering.

7.3.4 High-Dimensional Data

For high-dimensional datasets, the situation for clustering algorithms is mixed. First, from an effectiveness point of view, the algorithms are not affected by the dimensionality, so long as the distance function has already captured the relationships that may exist between the dimensions. From an efficiency standpoint, in theory, a larger number of dimensions only implies a larger cost in computing the distance function; thus, clustering algorithms should scale linearly with the number of dimensions. In practice, however, the situation is not as rosy, particularly for those algorithms that rely on various kinds of indexing to facilitate processing. For instance, for algorithms relying on trees (e.g., BIRCH [ZRL96] and DBSCAN [EKXS96]), the O(log n) factor degrades to O(n) as the dimensionality increases. Similarly, for algorithms using a grid structure (e.g., STING [WYM97]), processing is exponential in the number of dimensions.

8 Index Trees

8.1 Descriptions and References

Index trees of various types are widely used to organize and access large data sets. Typically they are used to speed up selection queries on one-dimensional data sets ordered on a single key attribute. B+-trees are the most common and significant form of index tree for disk-resident one-dimensional data [BM72, Com79]. For data sets of higher dimension (i.e., those organized and accessed on the basis of the values of two or more attributes in combination), a variety of other types of disk-based index trees have been developed and used. The most common example is the R-tree [Gut84] and its variants, the R*-tree [BKSS90] and R+-tree [SRF87]. Other multidimensional search trees include quad-trees [FB74], k-D-B-trees [Rob81], hB-trees [LS90], and TV-trees [LJF94]. Multidimensional data can also be transformed into unidimensional data using a space-filling curve [Jag90]; after transformation, a B+-tree can be used to index the resulting unidimensional data. A survey of multidimensional indexes is given by Gaede and Gunther [GG97].

Figure 5: Sketch of a database index tree. Internal nodes (the directory) contain keys (key1, key2, ...) and pointers; leaf nodes are kept in a linked list.

8.2 A Generalized Picture of Index Trees

The canonical rough picture of a database index tree appears in Figure 5. It is typically a balanced tree with high fanout. The internal nodes are used as a directory. The leaf nodes contain pointers to the actual data, and are stored as a linked list to allow for partial or complete scanning. Within each internal node is a series of keys and pointers. In the typical application of index trees, the keys are used to guide tree traversal for finding all data items satisfying a selection predicate q. Traversal starts at the root node, and for each pointer on the node, if the associated key is found to be consistent with q (i.e., the key does not rule out the possibility that data stored below the pointer may match q), then traversal continues in the subtree below the pointer. This is repeated recursively down all consistent subtrees until all the matching data are found. In B+-trees, queries are in the form of range predicates (e.g., "find all i such that c1 ≤ i ≤ c2"), and keys logically delineate a range in which the data below a pointer is contained.
If the query range and a pointer's key range overlap, then the two are consistent and the pointer is traversed. In R-trees, queries are in the form of region predicates (e.g., "find all i such that (x1, y1, x2, y2) overlaps i"), and keys delineate the bounding box in which the data below a pointer is contained. If the query region and the pointer's key box overlap, the pointer is traversed.

Note that in the above description the only restriction on a key is that it must logically describe the set of data stored below it, so that the consistency check does not miss any valid data. In B+-trees and R-trees, keys are essentially "containment" predicates: they describe a contiguous region in which all the data below a pointer are contained. Containment predicates are not the only possible key constructs, however. For example, the key "purple, cardinality = 516" is perfectly acceptable, indicating that there are 516 data items below the pointer, all of which are purple. In general, keys on a node may "overlap", i.e., two keys on the same node may hold simultaneously for some data item in the data set.

In essence, a database index tree is a hierarchy of clusterings of a dataset, in which each cluster has a label that holds for the data contained in the cluster. The grouping of data into clusters may be controlled by a variety of tree insertion and node splitting algorithms. The main variations among index trees are the way they represent keys and the algorithms for insertion and node splitting. Although index trees have typically been used for accelerating selection queries, their structure is amenable to use in data reduction, as we shall see below.

Figure 6: The root of a B+-tree, containing the keys 731, 1623, 1942, 2978, 3258, 9643.

8.3 Data vs. Space Partitioning

Multidimensional index trees recursively subdivide an underlying k-dimensional space. The root node corresponds to the entire space. Each internal node corresponds to a portion of the space of its parent, and that portion is further subdivided among the children of the node. The leaf nodes consist of pointers to individual records (or the records themselves) that lie in the region of k-space corresponding to the node. Data partitioning index trees divide the space based on the specific records or groups of records that are loaded or inserted into the tree; examples include R-trees, TV-trees and hB-trees. Space partitioning trees divide the space along predetermined lines of division that are independent of the particular records represented in the trees (e.g., recursive binary splitting of the attribute ranges in each dimension); examples include the various versions of quad-trees and k-D-B-trees. For both data and space partitioning trees, the splitting is propagated to a sufficiently deep level in each part of the tree that the leaf nodes can hold all the data, or pointers to the data, assigned to them. Most space-partitioning trees are not balanced, which renders them less useful for disk-based storage; typically they are mapped to another representation when saved to disk.

Though there are many variations of index trees for both one-dimensional and multi-dimensional data, they all have shared properties. The Generalized Search Tree (GiST) [HNP95] is a template index tree that provides a common basis for describing and easily implementing a variety of index trees. In our discussion here, we focus on properties that tend to be shared by many if not all index tree variants. For one-dimensional data, we assume B+-trees as the prototypical index tree; for multidimensional data, we choose R-trees (or the very similar R*-trees).
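The following is a minimal sketch of the generic, consistency-driven traversal described in Section 8.2, in the spirit of GiST but not taken from [HNP95]; the node layout and the range-overlap consistency predicate are illustrative assumptions.

```python
class Node:
    """An index node: `entries` is a list of (key, child) pairs for internal
    nodes, or a list of data items for leaf nodes."""
    def __init__(self, entries, is_leaf=False):
        self.entries = entries
        self.is_leaf = is_leaf

def range_consistent(key, query):
    """B+-tree-style check: does the key range (lo, hi) overlap the query range?"""
    return key[0] <= query[1] and query[0] <= key[1]

def search(node, query, consistent=range_consistent):
    """Generic traversal: descend into every subtree whose key is consistent
    with the query; collect matching items at the leaves."""
    if node.is_leaf:
        return [item for item in node.entries if query[0] <= item <= query[1]]
    results = []
    for key, child in node.entries:
        if consistent(key, query):
            results.extend(search(child, query, consistent))
    return results

leaves = [Node([5, 17, 42], is_leaf=True), Node([88, 120, 300], is_leaf=True)]
root = Node([((0, 50), leaves[0]), ((51, 400), leaves[1])])
print(search(root, (10, 100)))   # [17, 42, 88]
```

Swapping in a bounding-box overlap test for the consistency predicate (and boxes for keys) turns the same traversal into an R-tree search; this decoupling of traversal from key semantics is exactly what GiST exploits.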
8.4 Using Index Trees for Representing Aggregate or Summary Data

Index trees are used primarily for their benefits in organizing and supporting access to data. However, it is also possible to make direct use of the nodes of index trees to obtain and exploit aggregate or summary information about a data set; this has been observed by numerous researchers and implemented in some database products [ACD+88, Ant93b, Aok97]. The information stored in an index node can be used for such purposes as providing approximate answers to queries or making choices in query optimization.

As an example, consider looking only at the upper levels of an existing hierarchical index tree. They reveal a great deal of distributional information about the data. Typically, assuming branching factors found in practice in modern database systems, the root index node alone provides information equivalent to an approximately equi-depth histogram of one hundred to two hundred buckets. To see this, consider the small example pictured in Figure 6. Assume that a B+-tree contains 10,000 records with keys in the range 1 to 99,999. Assuming that the root node is as shown in Figure 6, we can conclude that the data in the tree can be approximated by an equi-depth histogram of seven buckets, each containing approximately 10,000/7 items, split at the values 1, 731, 1623, 1942, 2978, 3258, 9643. More and more detailed aggregate and summary information can be obtained by examining lower and lower levels of the index tree, which involves reading more and more index tree nodes. In essence, an index tree can be thought of as a hierarchical histogram. This is complicated somewhat if keys are allowed to overlap, as they are in R-trees; the structure then becomes analogous to a hierarchical histogram in which the "buckets" overlap.

A great deal of distribution information can be obtained from an index tree without traversing it very deeply. During query processing, it is common for the blocks corresponding to the uppermost nodes in the tree to be consistently found in main memory, while those at lower levels usually require one block read from disk each for access.

Note that much summary and aggregate information is available from index trees with no modification beyond how they are used to organize and access data. However, some minor extensions that involve only modest overhead can increase the accuracy and precision of the summary and aggregate information that can be extracted from the index tree structure.

One can store a count with each pointer that represents the precise number of data records in all the descendants of the node pointed to. Trees with such counts are said to be ranked [Knu73]. Ranked trees with non-overlapping keys truly are hierarchical histograms, and allow for arbitrary refinement of buckets by traversing pointers. The advantages of ranking do not come without cost: the space requirement for the counters associated with each pointer can reduce tree fanout by a significant factor, and maintenance of counters on a complete path from root to leaf is required on each record insertion or deletion. The overhead of handling insertions and deletions can be ameliorated by using pseudo-ranking [Ant93b], which allows counters to diverge a certain amount from the accurate values.
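As a small illustration of the Figure 6 example, the sketch below treats the root's separator keys as the boundaries of an approximately equi-depth histogram and uses it to estimate a range count; the function names are illustrative. With per-pointer counts from a ranked tree, the equal-share assumption could be replaced by the exact counts.

```python
def root_as_equi_depth_histogram(separator_keys, total_records, domain_max):
    """Treat the root's separator keys as bucket boundaries of an equi-depth
    histogram; each resulting bucket is assumed to hold an equal share."""
    bounds = separator_keys + [domain_max]
    depth = total_records / len(separator_keys)      # equal share per bucket
    return [(bounds[i], bounds[i + 1], depth) for i in range(len(separator_keys))]

def estimate_range_count(histogram, q_lo, q_hi):
    """Estimate how many keys fall in [q_lo, q_hi], assuming keys are spread
    uniformly within each bucket."""
    total = 0.0
    for lo, hi, depth in histogram:
        overlap = max(0, min(hi, q_hi) - max(lo, q_lo))
        total += depth * overlap / (hi - lo)
    return total

# Root keys from Figure 6; 10,000 records with keys in 1..99,999.
hist = root_as_equi_depth_histogram([1, 731, 1623, 1942, 2978, 3258, 9643],
                                    total_records=10_000, domain_max=99_999)
print(round(estimate_range_count(hist, 1000, 3000)))   # about 3967 keys estimated
```

Reading one more level of the tree refines each bucket in the same way, which is the hierarchical-histogram view described above.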
While counters are the most natural "decoration" that one can add to index entries, there is no reason one cannot store additional statistics in the entries, though additional statistics consume space and further reduce the tree fanout. Generalized Search Trees [HNP95] support arbitrary keys of this nature, and Aoki's extensions to them [Aok97] extend the idea of pseudo-ranking to support extensible "divergence control" for arbitrary statistics.

8.5 Indexes and Histograms

It was asserted above that indexes can be viewed as hierarchical histograms, but not all flavors of histograms can be conveniently supported in index trees. Typically, index trees are balanced, meaning they are a hierarchy of roughly equi-depth histograms. This may or may not be appropriate for data reduction. In order to coerce index trees to serve as non-equi-depth histograms, one must tolerate index trees that are unbalanced either in height or in node occupancy.

8.6 Distance-Only Data

Index trees provide no special advantage over other methods for dealing with distance-only data. Distance-only data can be stored using a multidimensional index tree only after the data is converted to positional data by embedding the points in a space of sufficiently high dimension.

8.7 Multi-Dimensional Data

8.7.1 Ordered and Unordered Attributes

Multidimensional index trees rely on the ordering of the attribute values in each dimension. Unordered attribute domains must have some ordering (perhaps an arbitrary one) imposed on them before they can be represented in a multidimensional index tree structure.

1. Ordered: Given an ordered attribute domain, the domain is usually normalized or mapped (preserving order) onto a canonical range, such as the unit interval. Some types of index trees rely on recursive binary splitting of the unit interval in order to keep the occupancy of each multi-dimensional sub-range of the space below some prespecified limit (e.g., reflecting the capacity of a single page of storage). In a manner similar to that used for B+-trees, occupancy information (either exact or approximate) can be associated with each pointer to a hyperrectangular area in the multi-dimensional index tree structure; see, e.g., [WYM97].

2. Unordered (Flat/Hierarchical): The Generalized Search Tree was designed expressly to handle "flat" or "encapsulated" data. In GiSTs, user-defined operations may be implemented to choose an insertion location for new data, to partition data when nodes fill, and to label pointers with keys. A domain expert need not impose an ordering or hierarchical structure on the data per se to implement these operations, and the index itself can remain oblivious to the domain semantics. Moreover, it may be possible for the tree to automatically organize itself by observing queries, to see which data items are often fetched together. Clustering such items into subtrees increases the efficiency of selection and (more interestingly for our purposes here) provides a query-driven notion of data reduction that requires no semantic understanding of the data set. The application of GiSTs to such encapsulated flat data is an active area of research, but as yet the results are preliminary. In the remainder of this section we focus on the more traditional multi-dimensional search trees, whose properties are currently better understood.

Hierarchical relationships of values in each dimension of a multi-dimensional space are not conventionally represented within index trees.
However, for index trees in which split points can be selected flexibly, placing the split points at major breaks in the hierarchy of values would have the advantage of tending to localize, in storage, elements that are more closely related in the hierarchy.

8.7.2 Sparse Data

Index trees generally handle sparse data well, in that the structure of a tree for a specific set of data adapts automatically to the distributional characteristics of the data. In some cases, problems arise if two or more attributes are so highly correlated that the effective dimensionality of the embedding space is reduced.

8.7.3 Skewed Data

As with sparse data, index trees generally adapt well to skewed data, either by allowing deeper development of the tree in locally dense regions of the space, or by placing split boundaries in the dense areas to retain some balance in the populations associated with the various tree nodes.

8.7.4 High-Dimensional Data

Some index trees have been developed explicitly to address high-dimensional problems (e.g., TV-trees [LJF94] and X-trees [BKK96]). The efficacy of these structures remains in doubt, especially in light of recent results on the hardness of indexing high-dimensional space [HKP97]. Most known successful approaches involve projecting (based on some heuristics) onto a space of lower, more manageable dimensionality.

9 Sampling

The notion that a large set of data can be represented by a small random sample of the data elements goes back to the end of the nineteenth century and has led to the development of a large body of survey-sampling techniques [Coc77, SSW92, Sud76]. Over the past fifteen years, there has been increasing interest in the application of sampling ideas to database management systems (DBMS's). Some existing and proposed applications of sampling include the following.

Query optimization: Given a query in an object-relational DBMS, the query optimizer estimates the cost of alternative query execution plans and attempts to select a low-cost plan. Current optimizers estimate costs based on summary statistics about the base relations; these statistics are stored in the system catalog. Specifically, the catalog statistics are used to estimate the sizes of intermediate query results via "selectivity formulas," and the resulting "selectivity estimates" are then substituted into cost formulas to yield the final cost estimates. Sampling is currently used in DBMS's such as DB2 V2 and Oracle 7 SQL Server to estimate a variety of catalog statistics from samples of the base relations, and there is ongoing research into sampling-based methods for estimating such key catalog statistics as quantiles and "column cardinality" (the number of distinct values of an attribute in a relation); see [GMP97, HNSS95, PIHS96]. Sampling is much less expensive than exact computation of catalog statistics from entire relations; such cost reduction is important since catalog statistics must be recomputed periodically as the database changes over time. Even whe