domingo, 18 de novembro de 2012

ciForager: Incrementally Discovering Regions of Correlated Change in Evolving Graphs

I'm back to reading! This paper is authored by some researchers from University of Melbourne (Australia) and the National Institute for Informatics (Japan). It was published in the ACM TKDD journal this year. The problem studied by the authors is the discovery of subgraphs (or regions) that change in a similar (or correlated) fashion in an evolving graph (or network). This graph is given as a sequence of snapshots and changes can represent edge activity. Patterns of interest correspond to sets of edge that are similar in both the temporal and spatial (topological) dimensions. The authors call these patterns regions of correlated change. One interesting application  for this patterns is the analysis of a brain network, where vertices are brain regions and edges connect regions are close to each other in a 3D view of the brain. As the brain is exposed to a sequence of stimuli, some of its regions (i.e., subgraphs) are expected to present correlated activity levels along time.

The above figure shows an illustrative example of the discovery of regions of correlated change, which are edges that are close in the graph topology and also are active in the same snapshots. It is easier to see these regions, and their specific changes, based on waveforms, as shown in the following figure.

More formally, the problem consists of, given a sequence of graph snapshots <G_1, ...G_T> with a set of changed edges E_C^{1,T}, which are edges that changes their status in the interval 1-T, partitioning E_C^{1,T} into regions R = {R_1^{ts1,te1}, ... R_L^{tsL,teL}}, such that the temporal and spatial distances of edges inside the same region are minimized. Temporal distance is measured using an extended version of the Euclidean distance and spatial distance is measured as the shortest path distance in the graph that is the union of all snapshots.

The problem of identifying these regions of correlated change was previously proposed by the same authors. In fact, the two previous papers on this subject were both authored by the same group (curious). The main contribution of this particular paper is the design of efficient algorithmic techniques that enable the discovery of these patterns in large graphs. these techniques are combined into a new algorithm called ciForager, which builds upon a previous algorithm proposed by the authors for this same problem named cSTAG.

The general idea of cSTAG is dividing the snapshots into a sequence of consecutive and overlapping windows, discovering the regions of correlated change for each window, and then merging them. Regions for each window are discovered using a bi-objective (time+space) clustering algorithm. There are several ways to combine the temporal and spatial optimization functions, they list some variations but I won't describe them here. Regions for different windows are merged whenever they have an average temporal and spatial distance below a user-defined threshold.

The above figure shows the process diagram of cSTAG (left) and ciForager (right). The main difference between these two algorithms, which constitute the main contributions of this paper, is the use of the following strategies for the efficient discovery of the regions of correlated change:

  • Equivalence classes and synchronized connected components: An equivalent class is defined by a unique waveform and a synchronized connected component (synCC) is a maximal set of connected edges that present the same waveform. The idea is instead of computing distances between isolated edges, computing distances between synCCs. SynCCs are updated efficiently through a set of 4 operations (extend, split, shrink, and merge). The worst case complexity of the update of the set of synCCs from a window W_k to a W_{k+1} is O(|E_C^k|) + O(|V_{S_{kAvg}}|.|S_k|) + O(|E_C^{k+1}|), where E_C^k is the set of changed edges in W_k, V_{S_{kAvg}} is the average size of a synCC in S_k, and E_C^{k+1} is the set of changed edges in W_{k+1}. The number of synCCs is expected to be much smaller than the number of changed edges, which leads to performance gains of ciForager in comparison to cSTAG.
  • Voronoi diagrams for shortest path computation: A Voronoi diagram divides the space into cells such that points in each cell are closer to the center of its cell than to the center of other cells in the diagram. In this paper, the authors apply a graph Voronoi diagram in order to compute approximate shortest paths efficiently. The center of a  cell is a synchronized connected component and a cell is composed by edges and vertices that are closer to the center of their respective center than to any other center in the graph. Distances between synCCs are computed based on a metagraph, where each vertex is a synCC, edge weights are minimum distances between vertices from two synCCs, and vertex weights are average distances between vertices in a given synCC. The distance between two synCCs is approximated as the sum of the edges and vertices weights in the path from one to the other. The authors show how this metagraph can be efficiently updated through merging, splitting, creating, and deleting operations. The idea is that distances involving a limited number of cells need to be updated by these operations whenever the synCCs change. For instance, if two synCCs are merged, only the vertices and edges in the cells whose centers represent these synCCs are updated. Each of these operations is performed in O(E + VLOGV) time, where E is the set of edges and V is the set of vertices in the metagraph. This complexity comes directly from the fact that these operations are based on Dijkstra's algorithm.

The overall complexity of ciForager for a transition from window W_k to W_{k+1} is:

The first three parts are due to the updating of equivalence classes and synCCs. The fourth part comes from the update of the graph Voronoi diagrams (U_k is the number of synCC updates). The fifth part is due to the pairwise temporal distance computations between pairs of equivalence classes. The sixth part is related to the computation of shortest path distances in the metagraph, and the final component is the cost of running a clustering algorithm.

In the evaluation section, the authors show that ciForager is much faster than cSTAG maintaining similar accuracy. These gains rely on localized changes and increase with the size of the graphs considered. The authors argue that ciForager enables the analysis of the global BGP (Border Gateway Protocol) network, something that was not feasible before. They also support their case using synthetic datasets and the 1998 World Cup Web site access data.

The evaluation metrics applied in the evaluation can be divided into two groups:

  1. External measures: Based on a set of regions generated (ground truth). Given the known regions and the regions discovered by the algorithms, the problem consists of matching these regions. The authors solve this matching problem as a mass transportation linear programming problem.
  2. Internal measures: Average intra-region and inter-region distances for both the spatial and temporal dimensions. 
Eight different clustering methods are evaluated (remember that ciForager does not depend on a specific  clustering method) and ciForager outperforms cSTAG (100-800 times faster) for all of them. However, cSTAG is more accurate than ciForager and the authors argue that this is due to the fact that some datasets have regions with small diameter, which leads to mistakenly grouped edges in the same regions. The performance gap between cSTAG and ciForager is shown to increase as the number of snapshots increase, without loss in accuracy. As expected, the larger the number of synCCs, the smaller are the performance gains of ciForager compared to cSTAG, the opposite happens when the number of edges changed increase because of the number of synCCs is expected to increase slower than the number of changed edges. It is interesting to notice that ciForager does not achieve similar gains on the BGP data, being only about 40 times faster than cSTAG. The authors justify such results based on the occurrence of randomly distributed changes in the BGP network. Two specific regions related to the damages caused by the Hurricane Katrina not only over the US network but also in Europe are discussed. 

Concluding, this paper studies an interesting problem and the performance gains shown are good. However, I don't usually like papers that try to describe the result of several combined but somehow uncorrelated strategies to improve algorithmic performance. The fact that the algorithm opens possibilities for analyzing new and larger data is amazing, but I think the authors could do a better job describing these results. But I understand that the focus of the paper were the performance gains of ciForager compared to cSTAG. Finally, I found some typos and other small errors that sometimes disrupted my reading, I don't expect finding this kind of issues in a TKDD paper. 


Nenhum comentário:

Postar um comentário