segunda-feira, 20 de outubro de 2014

Hierarchical In-Network Attribute Compression via Importance Sampling

Our most recent work entitled "Hierarchical In-Network Attribute Compression via Importance Sampling" just got accepted at ICDE'15. This research was done in collaboration with my advisor Ambuj Singh and Petko Bogdanov, who is now at SUNY (Albany). 

This paper focused on the problem of compressing real values over a network structure. Such values can represent average speeds in a road network (see Figure), opinions in a social network, gene expression in a PPI network, to name a few.

Speeds at different locations in the highway network from CA, US. Large/red means low speed, medium/yellow means average speed and small/green means high speed.
When managing such network data, specially in scenarios involving large dynamic networks, processing, storing or communicating a given state (or snapshot) of the network might not be feasible. For instance, consider the individual opinions of all Twitter users regarding a polemical topic (e.g. politics, religion) over time. While managing the individual opinions might be costly, we know that some knowledge about the network structure might explain a lot of these opinions due to patterns such as homophily and influence. Therefore, one way to approximate the user opinions given the structure would be partitioning the network into groups that have similar opinions. As long as these partitions can be described in a compact form, this lossy compression is a powerful way to compress values embedded in a network. We have proposed a compression model called Slice Tree, which is a hierarchical partition of a network that minimizes the compression error. 

Computing slice trees is a computationally hard problem, so we spend a lot of time trying to scale its construction to large networks. One of the main contributions of the paper was showing that slice trees can be computed using importance sampling. In particular, we have shown that by biasing the sampling process towards values the deviate from the mean of the network, we can quickly spot the best partition/slice. Therefore, an efficient way to build slice trees is by recursively sampling values as means to extend the tree in a way that the partitions generated are as homogeneous as possible.

We applied this idea to compress several real networks to see how it performs compared to some existing compression approaches, such as the graph Fourier (GF). The results show that slice tree can achieve up to 2-fold gains over the best baseline. We have also presented several results supporting the scalability and compression power of our approach for million-node networks.

Compression quality of Slice Tree (ST) against three baseline approaches.
We are now investigating the temporal aspect of the problem (when values are updated), which brings some interesting challenges and was one of the main issues the reviewers of the current paper raised. More on this subject soon!