segunda-feira, 17 de dezembro de 2012

On Compressing Weighted Time-evolving Graphs

Just finished reading this short paper published in the CIKM conference this year. It is authored by several researchers from University of Melbourne, Australia, and some of them are also co-authors of the ciForager paper I've summarized here one month ago.  Jian Pei, from Simon Fraser University, is also one of the authors of this paper.

The problem studied by the authors is compressing a network where edge weights are dynamic. One example they mention is the DBLP co-authorship network, where pairs authors may have different co-authorship degrees along time. This setting differs from the traditional graph compressing problem studied in the literature, which is focused on compressing the structure of a static graph. In general, we compress data to save storage space and communication cost when dealing with large databases, such as some real-life graphs that arise from social networks, the Web, etc.

The approach described in the paper is based on a 3D tensor representation of a dynamic graph. Tensors are basically multi-dimensional arrays and they have been applied to several graph mining problems in the recent years. Here, the graph is a cube (vertices x vertices x time) with weights in the cells. In practice, because we will be dealing with sparse graphs, we can store the sparse version of this tensor as a set of locations (indices) and non-zero values (weights). The idea is reducing the entropy of this tensor representation by merging subsets of weights across dimensions. For instance, edge weights 4, 5, and 6 can all be mapped to the value 6. The above figure shows an example of an illustrative dynamic graph and how edge weights can be merged as means to reduce the entropy of the edge values. The challenge is selecting the appropriate subsets of values to be merged while keeping some relevant properties of the dynamic graph (e.g. weighted shortest path distances for snapshots).  The authors propose the use of a hierarchical cluster tree to merge edge values that lead to an optimal compression given an error threshold.

Non-zero edge weights are always rounded to their nearest integer. The locations are represented as a sequence of integers where for each position we have the number of zeros before the next non-zero position. These integers are concatenated with the sequence of values into a single string LocVal, which is then encoded using an entropy encoding method (e.g., arithmetic coding, Huffman coding). Integers are represented as variable length quantities (VLQ's), which are representations of large integers as sequences of binary octets. In particular, the first bit of the octet tells whether there is a next octet that is part of the the number being represented. As a consequence, an integer x can be represented using 8.ceil(log2(x+1)/7) bits. The graph is represented as a sequence |V||T|Meta(LocVal)H(LocVal), where the first two values are the dimensions (number of vertices and snapshots) of the dynamic graph and Meta(LocVal) contains unique edge weights (that have been rounded to integers) and their frequencies. H(LocVal) is the entropy of LocVal, but I have no idea why it appears like that in the representation. The total length of the graph representation depends on the lengths of |V|, |T|, and Meta(LocVal), and also on the entropy of LocVal. Therefore, we can compress this representation by reducing the entropy (or heterogeneity) of LocVal. The maximal compression can be achieved by averaging all the edge values across snapshots, but this approach would certainly remove important properties of the original dynamic graph.

The solution presented by the authors (in the form of the MaxErrComp algorithm) is the use of a cluster tree to merge values. Two values are merged whenever they differ by less than a cutoff value. Merged values are set to their average. This cutoff is decreased gradually while the error of the compression, measured as the maximum error of the weighted shorted path for all the snapshots, is lower than a threshold epsilon.

They compare the proposed approach against two baselines. The first, called MaxErrRandom, merges two values randomly (i.e. the hierarchical clustering method is not applied). The second is a technique for compressing static graphs called StatComp. In order to apply StatComp to a dynamic setting, they compress each snapshot of the input graph.

The datasets applied in the evaluation are a portion of the DBLP co-authorship network (3K vertices, 56K edges, 12 years) and the Enron email network (2K vertices, 100K edges, 28 months). The following figure shows the encoding cost achieved by the 3 different techniques for varying error thresholds and the datasets considered. MaxErrComp outperforms the baseline approaches. In particular, StatComp achieves the worst results. MaxErrRandom and MaxErrComp are only on extreme cases, which are 0 error (no compression) or when the error is so high that the compressed graphs degenerate to the average graphs.

In another experiment discussed in the paper, the compression rate was fixed and then the error was analyzed (it is the opposite of what was done in the previous experiment). It is shown that MaxErrComp significantly outperforms the baseline techniques using a statistical test.

This a curious paper. First, I was surprised that it is not a full-paper, because both the problem studied and the solution proposed revealed potential for a greater contribution that what we often find in short papers. Then, when I started reading the paper I increased my expectations as I found some nice background topics (e.g, information theory, coding, tensors). But I got frustrated by the technical and experimental sections of the paper. The MaxErrComp sounds like a naive approach to me. The idea of merging values using a greedy approach and then evaluating the compressed graph using an overall error metric is not very challenging. Maybe, the problem would become more interesting if a relative error metric was considered instead. I also missed some discussion about how hard the original problem is. Finally, the evaluation was poor and the results achieved by MaxErrComp does not seem to be very impressive even compared against the random baseline.