Monday, October 20, 2014

Hierarchical In-Network Attribute Compression via Importance Sampling

Our most recent paper, "Hierarchical In-Network Attribute Compression via Importance Sampling", has just been accepted at ICDE'15. This research was done in collaboration with my advisor, Ambuj Singh, and Petko Bogdanov, who is now at SUNY Albany.

The paper focuses on the problem of compressing real values over a network structure. Such values can represent average speeds in a road network (see the figure below), opinions in a social network, or gene expression levels in a PPI network, to name a few.

Speeds at different locations in the California, US highway network. Large/red markers indicate low speed, medium/yellow average speed, and small/green high speed.
When managing such network data, especially in scenarios involving large dynamic networks, processing, storing, or communicating a given state (or snapshot) of the network might not be feasible. For instance, consider the individual opinions of all Twitter users regarding a polarizing topic (e.g., politics or religion) over time. While managing the individual opinions might be costly, we know that the network structure can explain many of these opinions due to patterns such as homophily and influence. Therefore, one way to approximate the user opinions given the structure is to partition the network into groups with similar opinions. As long as these partitions can be described in a compact form, this lossy compression is a powerful way to compress values embedded in a network. We propose a compression model called Slice Tree, a hierarchical partition of a network that minimizes the compression error.
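To make the lossy compression idea concrete, here is a minimal sketch (my own illustration, not the paper's code) of what it means to represent each partition by its average: reconstruct every node's value as its partition mean, and measure the resulting squared error. Homogeneous partitions give a small error.

```python
# Sketch (not the paper's implementation): compress node values by
# replacing each partition's values with the partition mean.

def compress_by_partitions(values, partitions):
    """values: dict node -> float; partitions: list of disjoint node sets
    covering all nodes. Returns (reconstructed values, squared error)."""
    recon = {}
    for part in partitions:
        mean = sum(values[v] for v in part) / len(part)
        for v in part:
            recon[v] = mean  # every node in the partition gets the mean
    error = sum((values[v] - recon[v]) ** 2 for v in values)
    return recon, error

# Toy example: two well-separated value groups compress almost losslessly
# when the partition boundary matches them, but poorly when merged.
vals = {0: 1.0, 1: 1.1, 2: 0.9, 3: 5.0, 4: 5.2}
_, err_good = compress_by_partitions(vals, [{0, 1, 2}, {3, 4}])
_, err_bad = compress_by_partitions(vals, [{0, 1, 2, 3, 4}])
```

Here `err_good` is tiny while `err_bad` is large, which is exactly the objective a good partition hierarchy should minimize.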

Computing slice trees is a computationally hard problem, so we spent a lot of time trying to scale their construction to large networks. One of the main contributions of the paper is showing that slice trees can be computed using importance sampling. In particular, we show that by biasing the sampling process towards values that deviate from the mean of the network, we can quickly spot the best partition (slice). Therefore, an efficient way to build slice trees is to recursively sample values as a means of extending the tree so that the partitions generated are as homogeneous as possible.
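The biased sampling step can be sketched as follows (a simplified illustration with names of my own choosing, not the paper's algorithm): nodes are drawn with probability proportional to how far their value deviates from the network mean, so the regions that matter most for the compression error are found after only a few samples.

```python
# Sketch of the biasing idea: sample nodes with probability proportional
# to |value - network mean|, so strongly deviating nodes surface quickly.
import random

def importance_sample(values, k, rng=random):
    """values: dict node -> float. Returns k nodes drawn with replacement,
    biased toward nodes whose values deviate from the global mean."""
    nodes = list(values)
    mean = sum(values.values()) / len(values)
    # Small epsilon keeps zero-deviation nodes sampleable.
    weights = [abs(values[v] - mean) + 1e-12 for v in nodes]
    return rng.choices(nodes, weights=weights, k=k)

# 100 nodes with value 0, one outlier: uniform sampling would need ~100
# draws on average to hit node 7; the biased sampler finds it almost surely.
rng = random.Random(0)
vals = {v: 0.0 for v in range(100)}
vals[7] = 50.0
sample = importance_sample(vals, k=20, rng=rng)
```

With these numbers the outlier carries about half the total weight, so it appears in a sample of 20 with overwhelming probability.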

We applied this idea to compress several real networks and compared its performance against existing compression approaches, such as the graph Fourier transform (GF). The results show that Slice Tree achieves up to 2-fold gains over the best baseline. We also present several results supporting the scalability and compression power of our approach on million-node networks.

Compression quality of Slice Tree (ST) against three baseline approaches.
We are now investigating the temporal aspect of the problem (i.e., when values are updated over time), which brings some interesting challenges and was one of the main issues raised by the reviewers of the current paper. More on this subject soon!

Monday, September 29, 2014

Back to blogging!

After more than a year of silence, I've decided to resuscitate this blog. Life as a PhD student has been busy, and I've tried to use my free time for activities other than blogging about research. But I'm glad that a few people have read some of my posts, especially the one about grad school.

This is a summary of my PhD so far:

1) I've worked for more than a year on the problem of compressing values over a graph. After trying lots of different approaches, we came up with an algorithm called Slice Tree, which decomposes the network into center-radius hierarchical partitions and compresses the values inside each partition to their average.
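A center-radius partition is just the set of nodes within a given hop distance of a chosen center. Here is a minimal sketch (my own, not the paper's code) of how such a slice can be extracted with a breadth-first search:

```python
# Sketch: a center-radius slice is the BFS ball of a given hop radius
# around a center node.
from collections import deque

def ball(adj, center, radius):
    """adj: dict node -> iterable of neighbors (unweighted graph).
    Returns the set of nodes within `radius` hops of `center`."""
    seen = {center}
    frontier = deque([(center, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == radius:
            continue  # don't expand past the radius
        for nb in adj[node]:
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, dist + 1))
    return seen

# Path graph 0-1-2-3-4: the radius-1 ball around node 2 is {1, 2, 3}.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(sorted(ball(adj, 2, 1)))  # [1, 2, 3]
```

In the algorithm, each such ball carves one partition out of its parent, and the values inside it are replaced by their average.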



I should write a post about this project here soon. We have submitted a paper on this work, entitled "Hierarchical In-Network Attribute Compression via Importance Sampling", and are now waiting for the reviews. A previous version of this paper was rejected by VLDB a few months ago.

2) I've also taken some courses towards fulfilling the requirements for my degree at UC Santa Barbara: Computational Geometry, Combinatorial Algorithms, Advanced Topics on Databases, Searching over Big Data, and Advanced Topics in Data Mining.

3) Last summer, I was an intern at the IBM Thomas J. Watson Research Center, working on spatio-temporal data management. I'm still working on this project, and I should also summarize it here at some point. Working at IBM was really cool, and I hope I can get similar internship opportunities in the coming years.

More posts soon.