terça-feira, 27 de novembro de 2012

Wavelet Synopses with Error Guarantees

I just finished reading this cool paper about providing approximate answers to queries with bounded expected error for individual queries using wavelet-based representations of the data. The problem solved is very clear and the techniques proposed are supported by a rich background theory, specially regarding wavelets, probability theory, optimization, and dynamic programming. The paper was published in the ACM SIGMOD Conference 2002, when the authors were working at Bell Labs. There is also a journal version of the paper, published in the ACM TODS two years later, that is basically a copy of this paper with some details (e.g., proofs, explanations, and results). I've lost track of how many hours I spent on this paper and I still think I should understand it deeper, but I decided to let it go for now.

In several scenarios, providing a fast approximate answer to a query is more effective than taking a long time to give the exact answer. Previous work has applied wavelets, which is a harmonic analysis technique for function decomposition, in order to represent a database as a compact set of coefficients, as known as synopses. Based on a good set of coefficients, it is possible to retrieve specific values in the database with small overall error. However, existing wavelet-based strategies for providing synopses did not give any guarantee on the the error of individual queries. In other words, given a specific query, there was no guarantees regarding the expected error of an answer for such query. This is the problem addressed in this paper.

Wavelets synopses are based on the Haar wavelet function. The idea is very simple, given a sequence of values, we keep pairwise averages and detail coefficients for different levels (or resolutions). Detail coefficients give the difference between pairs of values. We start from pairs of values in the database and stop when there is a single average (i.e., the average of the values in the database) and detail coefficient. These averages and detail coefficients enable the full reconstruction of the values in the database. Starting from the root of the hierarchy (single average and coefficient) we generate the values for level L by adding (left child) and subtracting (right child) the detail coefficient from the average value. The original values will be obtained in the last level of the hierarchy. The following table shows an example of how we compute the averages and coefficients for a set of values shown in the first line.

The error tree shows the coefficients (c_0, c_1, ..., c_k) and how they are associated to the data values.

Any value can be reconstructed given the coefficients from the root to it in the error tree. In the same fashion, range queries can be answered based on the paths to the values involved. In previous coefficients to compose the synopses were those with largest normalized value (c_i/sqrt(2^L), where L is the level in the error tree). This selection of coefficients is known to minimize the L^2 error of the reconstruction, but does not provide any guarantee on individual answers. In fact, the authors show an example where the optimal choice of coefficients leads to some very bad answers for some values. This kind of results is mainly due to an unbalanced selection of coefficients in the error tree.

In order to solve this problem, the authors apply the idea of randomized rounding. Each coefficient c_i is associated to a random variable C_i that can be gamma_i with probability c_i/gamma_i, or 0 with probability 1 - c_i/gamma_i, where 0 < c_i/gamma_i <= 1. This means that we are replacing the coefficients by these variables that can be rounded to values larger than c_i or set to 0 but they still keep the same expected value c_i of the original coefficients. Moreover, the variance of the coefficients is (gamma_i-c_i).c_i. The authors show that the estimations for values and range sums using coefficients generated with randomized rounding are unbiased, which means that their expected values are the correct ones for any value of gamma_i. However, gamma_i affects the variance of the estimations (positively) and the expected number of coefficients kept (negatively). The focus of the paper is how we can set these gamma_i values in a way that minimizes the expected error of the estimations for a fixed expected number of coefficients (or budget).

First, it is shown how the expected mean squared error (L^2) can be minimized. Due to specific properties of the minimization problem, the authors are able to provide a O(n log n) algorithm, where n is the number of values in the database, for the problem. The following formula gives the optimal values of gamma_i without constraining the values of c_i/gamma_i to the interval (0,1]:

where y_i = c_i/gamma_i and k_i = c_i^2/2^level(c_i). In order to get optimal values of gamma_i following the given constraint, they set any value of gamma_i that is greater than 1 to one.

But the main contribution of the paper is not giving a bound on the L^2 error but on the relative error of individual queries. The approach to achieve such bound is similar, an optimal selection of the gamma_i values. Because we are talking about relative errors, which can be highly influenced by small absolute values, the authors apply a sanity bound S in the computation of errors. Whenever an absolute value is lower than this sanity parameter S, the error to be minimized is relative to S and not to the specific value anymore. Similar to the mean error, the relative error is also related to the variance of the estimates, but now we are interesting in improving the worst estimate, which is the one with highest variance. The authors propose dynamic programming algorithm for setting the values of y_i, which is equivalent to setting the corresponding values of gamma_i (y_i = c_i / gamma_i). Based on the fact that in an optimal setting for an error sub-tree, the highest error will be associated to the minimum data value, the algorithm finds the optimal solution. However, some perturbation of coefficients may be required in order to make this property hold.

The idea of the dynamic programming algorithm is minimizing the relative error of the maximum error path by setting different values of y_i to the root of an error sub-tree and then dividing the remaining budget (coefficients) optimally between its left and right sub-trees.

The value of j is greater than N for the leaves of the error tree (i.e., data values). One issue of this strategy is that the value of y_i is continuous, and thus we cannot check all possible values in the range. The solution the authors offer for this problem is considering a set of q discrete values in the given interval. Therefore, the possible values of y_i are 1/q, 2/q, ..., and 1. The proposed algorithm has time complexity O(Nq^2B) and space complexity O(qB.logN).

It is also proposed an alternate scheme that does not apply randomized rounding, retaining or discarding coefficients minimizing the error instead. This approach may lead to bias (the expected value of C_i is not c_i anymore) but the choice of gamma_i values can minimize this bias as well.

The overall algorithm is the following:

Notice that they generate several trials and always select the synopses that minimize the mean relative error.

In the experimental evaluation, they compare MinL2 (probabilistic synopses minimizing L^2), MinRelVar (probabilistic synopses minimizing relative error), MinRelBias (probabilistic synopses without randomized rounding), and the deterministic approach. Synthetic data was generated based on Zipf distribution for pairs data-counts and then different permutation strategies were applied as means to set which values were set to each frequencies. A real database (Cover Type) with 581K tuples was also applied in the evaluation. In the case of this real dataset, they selected a given attribute with cardinality at least 256. Evaluation metrics applied are the mean relative error, maximum relative error, and 25% percentile of the relative error.  I won't show any plot here but the main conclusions of the evaluation are:

  • MinL2 achieves relative error worse than all the other strategies;
  • MinRelVar and MinRelBias outperform the deterministic approach and MinL2, specially for small synopses. Using only 10 coefficients (4% of the data) they reduce the relative error of the deterministic strategy from 1.5 to 0.5. 
  • The more skewed the data, the greater are the gains of MinRelBias and MinRelVar. 
  • Similar results were found for the real dataset.
My general view on this paper is very positive. I don't think the writing was very good but the proposed  strategies are really clever. One thing still missing for me is a good intuition why the idea of randomized rounding enables the guarantees on the relative error. I believe it works, but I don't have an intuitive reasoning for its use in this particular scenario. Another thing that is not clear to me is whether using randomized rounding for minimizing the L^2 error makes sense. The deterministic strategy already minimizes the L^2 error and is much simpler. Another missing point is whether solving the original optimization problem regarding the relative error using a general optimization approach would be get much better results and at which cost. Finally, the meaning of 'error guarantees' in the paper is not very clear to me. I think that one specific probabilistic synopses can still get a very poor answer for a given query, though it is unlikely. Therefore, maybe you are making poor answers unlikely but are still not providing guarantees regarding them.

Link: http://dl.acm.org/citation.cfm?id=564746

domingo, 18 de novembro de 2012

ciForager: Incrementally Discovering Regions of Correlated Change in Evolving Graphs

I'm back to reading! This paper is authored by some researchers from University of Melbourne (Australia) and the National Institute for Informatics (Japan). It was published in the ACM TKDD journal this year. The problem studied by the authors is the discovery of subgraphs (or regions) that change in a similar (or correlated) fashion in an evolving graph (or network). This graph is given as a sequence of snapshots and changes can represent edge activity. Patterns of interest correspond to sets of edge that are similar in both the temporal and spatial (topological) dimensions. The authors call these patterns regions of correlated change. One interesting application  for this patterns is the analysis of a brain network, where vertices are brain regions and edges connect regions are close to each other in a 3D view of the brain. As the brain is exposed to a sequence of stimuli, some of its regions (i.e., subgraphs) are expected to present correlated activity levels along time.

The above figure shows an illustrative example of the discovery of regions of correlated change, which are edges that are close in the graph topology and also are active in the same snapshots. It is easier to see these regions, and their specific changes, based on waveforms, as shown in the following figure.

More formally, the problem consists of, given a sequence of graph snapshots <G_1, ...G_T> with a set of changed edges E_C^{1,T}, which are edges that changes their status in the interval 1-T, partitioning E_C^{1,T} into regions R = {R_1^{ts1,te1}, ... R_L^{tsL,teL}}, such that the temporal and spatial distances of edges inside the same region are minimized. Temporal distance is measured using an extended version of the Euclidean distance and spatial distance is measured as the shortest path distance in the graph that is the union of all snapshots.

The problem of identifying these regions of correlated change was previously proposed by the same authors. In fact, the two previous papers on this subject were both authored by the same group (curious). The main contribution of this particular paper is the design of efficient algorithmic techniques that enable the discovery of these patterns in large graphs. these techniques are combined into a new algorithm called ciForager, which builds upon a previous algorithm proposed by the authors for this same problem named cSTAG.

The general idea of cSTAG is dividing the snapshots into a sequence of consecutive and overlapping windows, discovering the regions of correlated change for each window, and then merging them. Regions for each window are discovered using a bi-objective (time+space) clustering algorithm. There are several ways to combine the temporal and spatial optimization functions, they list some variations but I won't describe them here. Regions for different windows are merged whenever they have an average temporal and spatial distance below a user-defined threshold.

The above figure shows the process diagram of cSTAG (left) and ciForager (right). The main difference between these two algorithms, which constitute the main contributions of this paper, is the use of the following strategies for the efficient discovery of the regions of correlated change:

  • Equivalence classes and synchronized connected components: An equivalent class is defined by a unique waveform and a synchronized connected component (synCC) is a maximal set of connected edges that present the same waveform. The idea is instead of computing distances between isolated edges, computing distances between synCCs. SynCCs are updated efficiently through a set of 4 operations (extend, split, shrink, and merge). The worst case complexity of the update of the set of synCCs from a window W_k to a W_{k+1} is O(|E_C^k|) + O(|V_{S_{kAvg}}|.|S_k|) + O(|E_C^{k+1}|), where E_C^k is the set of changed edges in W_k, V_{S_{kAvg}} is the average size of a synCC in S_k, and E_C^{k+1} is the set of changed edges in W_{k+1}. The number of synCCs is expected to be much smaller than the number of changed edges, which leads to performance gains of ciForager in comparison to cSTAG.
  • Voronoi diagrams for shortest path computation: A Voronoi diagram divides the space into cells such that points in each cell are closer to the center of its cell than to the center of other cells in the diagram. In this paper, the authors apply a graph Voronoi diagram in order to compute approximate shortest paths efficiently. The center of a  cell is a synchronized connected component and a cell is composed by edges and vertices that are closer to the center of their respective center than to any other center in the graph. Distances between synCCs are computed based on a metagraph, where each vertex is a synCC, edge weights are minimum distances between vertices from two synCCs, and vertex weights are average distances between vertices in a given synCC. The distance between two synCCs is approximated as the sum of the edges and vertices weights in the path from one to the other. The authors show how this metagraph can be efficiently updated through merging, splitting, creating, and deleting operations. The idea is that distances involving a limited number of cells need to be updated by these operations whenever the synCCs change. For instance, if two synCCs are merged, only the vertices and edges in the cells whose centers represent these synCCs are updated. Each of these operations is performed in O(E + VLOGV) time, where E is the set of edges and V is the set of vertices in the metagraph. This complexity comes directly from the fact that these operations are based on Dijkstra's algorithm.

The overall complexity of ciForager for a transition from window W_k to W_{k+1} is:

The first three parts are due to the updating of equivalence classes and synCCs. The fourth part comes from the update of the graph Voronoi diagrams (U_k is the number of synCC updates). The fifth part is due to the pairwise temporal distance computations between pairs of equivalence classes. The sixth part is related to the computation of shortest path distances in the metagraph, and the final component is the cost of running a clustering algorithm.

In the evaluation section, the authors show that ciForager is much faster than cSTAG maintaining similar accuracy. These gains rely on localized changes and increase with the size of the graphs considered. The authors argue that ciForager enables the analysis of the global BGP (Border Gateway Protocol) network, something that was not feasible before. They also support their case using synthetic datasets and the 1998 World Cup Web site access data.

The evaluation metrics applied in the evaluation can be divided into two groups:

  1. External measures: Based on a set of regions generated (ground truth). Given the known regions and the regions discovered by the algorithms, the problem consists of matching these regions. The authors solve this matching problem as a mass transportation linear programming problem.
  2. Internal measures: Average intra-region and inter-region distances for both the spatial and temporal dimensions. 
Eight different clustering methods are evaluated (remember that ciForager does not depend on a specific  clustering method) and ciForager outperforms cSTAG (100-800 times faster) for all of them. However, cSTAG is more accurate than ciForager and the authors argue that this is due to the fact that some datasets have regions with small diameter, which leads to mistakenly grouped edges in the same regions. The performance gap between cSTAG and ciForager is shown to increase as the number of snapshots increase, without loss in accuracy. As expected, the larger the number of synCCs, the smaller are the performance gains of ciForager compared to cSTAG, the opposite happens when the number of edges changed increase because of the number of synCCs is expected to increase slower than the number of changed edges. It is interesting to notice that ciForager does not achieve similar gains on the BGP data, being only about 40 times faster than cSTAG. The authors justify such results based on the occurrence of randomly distributed changes in the BGP network. Two specific regions related to the damages caused by the Hurricane Katrina not only over the US network but also in Europe are discussed. 

Concluding, this paper studies an interesting problem and the performance gains shown are good. However, I don't usually like papers that try to describe the result of several combined but somehow uncorrelated strategies to improve algorithmic performance. The fact that the algorithm opens possibilities for analyzing new and larger data is amazing, but I think the authors could do a better job describing these results. But I understand that the focus of the paper were the performance gains of ciForager compared to cSTAG. Finally, I found some typos and other small errors that sometimes disrupted my reading, I don't expect finding this kind of issues in a TKDD paper. 

Link: http://dl.acm.org/citation.cfm?id=2362385