I just finished reading this cool paper about providing approximate answers to queries, with bounded expected error for individual queries, using wavelet-based representations of the data. The problem solved is very clear and the proposed techniques are supported by a rich background theory, especially regarding wavelets, probability theory, optimization, and dynamic programming. The paper was published at the ACM SIGMOD Conference in 2002, when the authors were working at Bell Labs. There is also a journal version, published in ACM TODS two years later, that is essentially this paper extended with additional details (e.g., proofs, explanations, and results). I've lost track of how many hours I spent on this paper, and I still think I should understand it more deeply, but I decided to let it go for now.
In several scenarios, providing a fast approximate answer to a query is more effective than taking a long time to give the exact answer. Previous work has applied wavelets, a harmonic analysis technique for function decomposition, to represent a database as a compact set of coefficients, also known as a synopsis. Given a good set of coefficients, it is possible to retrieve specific values from the database with small overall error. However, existing wavelet-based strategies for building synopses did not give any guarantee on the error of individual queries. In other words, given a specific query, there was no guarantee regarding the expected error of its answer. This is the problem addressed in this paper.
Wavelet synopses are based on the Haar wavelet function. The idea is simple: given a sequence of values, we keep pairwise averages and detail coefficients at different levels (or resolutions). Each detail coefficient gives half the difference within a pair of values. We start from pairs of values in the database and stop when there is a single average (i.e., the average of all values in the database) and a single detail coefficient. These averages and detail coefficients enable the full reconstruction of the values in the database. Starting from the root of the hierarchy (the single average and coefficient), we generate the values at each level by adding (left child) and subtracting (right child) the detail coefficient to/from the average value. The original values are obtained at the last level of the hierarchy. The sketch below illustrates how the averages and coefficients are computed for an example sequence of values.
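Here is a minimal Python sketch of the decomposition (my own illustration; the input array is just an example, not necessarily the one used in the paper):

```python
def haar_decompose(values):
    """Return (overall_average, detail_coefficients) for a power-of-two array.
    Details are returned coarsest-first, so prefixing the overall average gives
    the coefficients in heap order: c[1] is the root detail, c[2i] and c[2i+1]
    are the children of c[i]."""
    details = []
    current = list(values)
    while len(current) > 1:
        pairs = list(zip(current[::2], current[1::2]))
        averages = [(a + b) / 2 for a, b in pairs]
        # Detail coefficient: half the difference within each pair.
        details = [(a - b) / 2 for a, b in pairs] + details
        current = averages
    return current[0], details

avg, coeffs = haar_decompose([2, 2, 0, 2, 3, 5, 4, 4])
print(avg, coeffs)  # 2.75 [-1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```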
The error tree shows the coefficients (c_0, c_1, ..., c_k) and how they are associated with the data values.
Any value can be reconstructed from the coefficients on the path from the root to it in the error tree. In the same fashion, range queries can be answered based on the paths to the values involved. In previous work, the coefficients chosen to compose the synopsis were those with the largest normalized values (c_i/sqrt(2^L), where L is the level of c_i in the error tree). This selection of coefficients is known to minimize the L^2 error of the reconstruction, but it does not provide any guarantee on individual answers. In fact, the authors show an example where the optimal choice of coefficients leads to very bad answers for some values. This kind of result is mainly due to an unbalanced selection of coefficients in the error tree.
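To make this concrete, here is a sketch of point reconstruction over the error tree and of the conventional deterministic selection (my own code, assuming heap-order indexing with c[0] as the overall average and c[1] as the root detail; this is not the paper's implementation):

```python
import math

def point_reconstruct(c, k, n):
    """Reconstruct data value d_k from Haar coefficients in heap order."""
    value, j = c[0], 1
    for bit in range(int(math.log2(n)) - 1, -1, -1):
        # Going to a left child adds the detail coefficient; right subtracts it.
        left = (k >> bit) & 1 == 0
        value += c[j] if left else -c[j]
        j = 2 * j + ((k >> bit) & 1)
    return value

def top_b_normalized(c, B):
    """Conventional deterministic synopsis: keep the B detail coefficients
    with the largest normalized values c_i/sqrt(2^level), zero out the rest.
    The overall average c[0] is always kept (my assumption)."""
    scored = sorted(range(1, len(c)),
                    key=lambda i: abs(c[i]) / math.sqrt(2 ** int(math.log2(i))),
                    reverse=True)
    keep = set(scored[:B])
    return [v if (i == 0 or i in keep) else 0.0 for i, v in enumerate(c)]

c = [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
print(point_reconstruct(c, 4, 8))   # 3.0, the original d_4
print(top_b_normalized(c, 2))       # only the 2 largest normalized details survive
```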
In order to solve this problem, the authors apply the idea of randomized rounding. Each coefficient c_i is associated with a random variable C_i that takes the value gamma_i with probability c_i/gamma_i, and 0 with probability 1 - c_i/gamma_i, where 0 < c_i/gamma_i <= 1. This means that we replace the coefficients by variables that can be rounded to values larger than c_i or set to 0, while still keeping the same expected value c_i as the original coefficients. Moreover, the variance of each rounded coefficient is (gamma_i - c_i).c_i. The authors show that estimations of values and range sums using coefficients generated with randomized rounding are unbiased, which means that their expected values are the correct ones for any value of gamma_i. However, gamma_i affects both the variance of the estimations (larger gamma_i means larger variance) and the expected number of coefficients kept (larger gamma_i means a lower retention probability). The focus of the paper is how to set these gamma_i values in a way that minimizes the expected error of the estimations for a fixed expected number of coefficients (or budget).
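Here is a sketch of the rounding step, with a quick empirical check of unbiasedness (my own sanity test, not from the paper):

```python
import random

def round_coefficient(c_i, y_i, rng=random):
    """Randomized rounding: return gamma_i = c_i / y_i with probability y_i,
    and 0 otherwise. E[C_i] = (c_i / y_i) * y_i = c_i, so estimates are unbiased;
    Var(C_i) = c_i**2 * (1 / y_i - 1) = (gamma_i - c_i) * c_i."""
    gamma_i = c_i / y_i
    return gamma_i if rng.random() < y_i else 0.0

c_i, y_i = -1.25, 0.4
samples = [round_coefficient(c_i, y_i) for _ in range(100_000)]
print(sum(samples) / len(samples))  # should be close to -1.25
```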
First, the authors show how the expected mean squared (L^2) error can be minimized. Due to specific properties of the minimization problem, they are able to provide an O(n log n) algorithm for it, where n is the number of values in the database. Writing y_i = c_i/gamma_i and k_i = c_i^2/2^level(c_i), the optimal values of y_i without constraining them to the interval (0,1] are given by

y_i = B . sqrt(k_i) / sum_j sqrt(k_j),

where B is the budget (the target expected number of coefficients). In order to obtain optimal values satisfying the constraint, any y_i greater than 1 is set to one and the leftover budget is redistributed among the remaining coefficients.
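A sketch of this allocation, under my assumption that the clamping is applied iteratively with the leftover budget re-solved for the remaining coefficients (the paper's O(n log n) algorithm is more careful than this loop):

```python
import math

def min_l2_allocation(c, level, B):
    """Fractional y_i minimizing the expected L^2 error subject to sum(y_i) = B,
    clamped to (0, 1]. k_i = c_i^2 / 2^level(i)."""
    n = len(c)
    k = [c[i] ** 2 / 2 ** level[i] for i in range(n)]
    y = [0.0] * n
    free = [i for i in range(n) if k[i] > 0]
    b = float(B)
    while free and b > 0:
        total = sum(math.sqrt(k[i]) for i in free)
        trial = {i: b * math.sqrt(k[i]) / total for i in free}
        over = [i for i in free if trial[i] >= 1.0]
        if not over:                 # unconstrained optimum is feasible
            for i in free:
                y[i] = trial[i]
            break
        for i in over:               # clamp violators and re-solve for the rest
            y[i] = 1.0
        b -= len(over)
        free = [i for i in free if i not in over]
    return y
```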
But the main contribution of the paper is not a bound on the L^2 error; it is a bound on the relative error of individual queries. The approach to achieve such a bound is similar: an optimal selection of the gamma_i values. Because we are now talking about relative errors, which can be highly influenced by small absolute values, the authors apply a sanity bound S in the computation of errors: whenever an absolute value is lower than S, the error to be minimized is relative to S and not to the specific value anymore (i.e., the relative error of an estimate est_i of d_i is |est_i - d_i| / max(|d_i|, S)). Similar to the mean squared error, the relative error is also related to the variance of the estimates, but now we are interested in improving the worst estimate, which is the one with the highest normalized variance. The authors propose a dynamic programming algorithm for setting the values of y_i, which is equivalent to setting the corresponding values of gamma_i (y_i = c_i / gamma_i). The algorithm finds the optimal solution based on the fact that, in an optimal setting for an error sub-tree, the highest error is associated with the minimum data value. However, some perturbation of the coefficients may be required in order to make this property hold.
The idea of the dynamic programming algorithm is to minimize the relative error of the maximum-error path by trying different values of y_i at the root of an error sub-tree and then dividing the remaining budget (coefficients) optimally between its left and right sub-trees.
In the paper's recurrence, node indices j greater than N correspond to the leaves of the error tree (i.e., the data values). One issue with this strategy is that the value of y_i is continuous, so we cannot check all possible values in the range. The solution the authors offer for this problem is to consider a set of q discrete values in the given interval; the possible values of y_i are therefore 1/q, 2/q, ..., 1. The proposed algorithm has time complexity O(Nq^2B) and space complexity O(qB log N).
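Below is a loose Python reconstruction of this DP as I understand it. The exact recurrence shape, the free handling of zero coefficients, and the exclusion of the overall average are all my assumptions, not the paper's formulation (and I skip the perturbation step entirely):

```python
from functools import lru_cache
import math

def min_rel_var(d, c, S, q, B):
    """d: data values (power-of-two length); c: coefficients in heap order
    (c[1] is the root detail; c[0], the overall average, is ignored here);
    S: sanity bound; q: number of discrete y choices; B: coefficient budget.
    Returns the minimum achievable maximum squared normalized error."""
    N = len(d)

    def dmin(i):  # smallest sanity-bounded |value| under node i
        if i >= N:
            return max(abs(d[i - N]), S)
        return min(dmin(2 * i), dmin(2 * i + 1))

    @lru_cache(maxsize=None)
    def M(i, b):  # min over allocations of the max normalized variance, budget b/q
        if i >= N:  # leaf: a data value, no coefficient to round
            return 0.0
        best = math.inf
        choices = range(1, q + 1) if c[i] != 0 else [0]
        for j in choices:  # y_i = j/q; a zero coefficient costs nothing
            var = c[i] ** 2 * (q / j - 1) if j else 0.0
            for bl in range(0, b - j + 1):  # split leftover budget among children
                left = M(2 * i, bl) + var / dmin(2 * i) ** 2
                right = M(2 * i + 1, b - j - bl) + var / dmin(2 * i + 1) ** 2
                best = min(best, max(left, right))
        return best

    return M(1, int(q * B))

# e.g. min_rel_var([2, 2, 0, 2, 3, 5, 4, 4],
#                  [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0], S=1.0, q=4, B=3)
```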
An alternative scheme is also proposed that does not apply randomized rounding, instead deterministically retaining or discarding coefficients so as to minimize the error. This approach may introduce bias (the expected value of C_i is not c_i anymore), but the choice of gamma_i values can minimize this bias as well.
The overall algorithm is sketched below.
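This is a hedged sketch of the outer loop, reusing the helpers sketched above; the trial count T and the always-kept overall average c[0] are my placeholders, not details from the paper:

```python
import random

def build_synopsis(d, c, y, S, T=10, rng=random.Random(0)):
    """Run T independent rounding trials over the allocated y_i values and
    keep the synopsis with the lowest mean relative error on the data."""
    best, best_err = None, float("inf")
    for _ in range(T):
        trial = [c[0]] + [round_coefficient(c_i, y_i, rng)
                          if c_i != 0 and y_i > 0 else 0.0
                          for c_i, y_i in zip(c[1:], y[1:])]
        approx = [point_reconstruct(trial, k, len(d)) for k in range(len(d))]
        err = sum(abs(a - v) / max(abs(v), S)
                  for a, v in zip(approx, d)) / len(d)
        if err < best_err:
            best, best_err = trial, err
    return best, best_err
```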
Notice that they generate several trials and always select the synopsis that minimizes the mean relative error.
In the experimental evaluation, they compare MinL2 (probabilistic synopses minimizing the L^2 error), MinRelVar (probabilistic synopses minimizing the relative error), MinRelBias (probabilistic synopses without randomized rounding), and the deterministic approach. Synthetic data was generated from a Zipf distribution over value-count pairs, and different permutation strategies were then applied to determine which values were assigned to which frequencies. A real dataset (Cover Type), with 581K tuples, was also used in the evaluation; in this case, they selected an attribute with cardinality at least 256. The evaluation metrics are the mean relative error, the maximum relative error, and the 25th percentile of the relative error. I won't show any plots here, but the main conclusions of the evaluation are:
- MinL2 achieves worse relative error than all the other strategies.
- MinRelVar and MinRelBias outperform the deterministic approach and MinL2, especially for small synopses. Using only 10 coefficients (4% of the data), they reduce the relative error of the deterministic strategy from 1.5 to 0.5.
- The more skewed the data, the greater the gains of MinRelBias and MinRelVar.
- Similar results were found for the real dataset.
My general view of this paper is very positive. I don't think the writing is very good, but the proposed strategies are really clever. One thing still missing for me is a good intuition for why the idea of randomized rounding enables the guarantees on the relative error. I believe it works, but I don't have an intuitive reasoning for its use in this particular scenario. Another thing that is not clear to me is whether using randomized rounding to minimize the L^2 error makes sense, since the deterministic strategy already minimizes the L^2 error and is much simpler. Another missing point is whether solving the original optimization problem for the relative error using a general optimization approach would give much better results, and at what cost. Finally, the meaning of 'error guarantees' in the paper is not very clear to me. I think that a specific probabilistic synopsis can still give a very poor answer for a given query, though it is unlikely. Therefore, maybe the authors are making poor answers unlikely, but they are still not providing guarantees regarding them.
Link:
http://dl.acm.org/citation.cfm?id=564746