Sunday, January 8, 2012

Sampling for Large Graphs

During the past months, I've been thinking a lot about topological properties (e.g., degree distribution, clustering coefficient, diameter) of random subgraphs, which can be seen as samples of graphs. So, I decided to read this paper again. In statistics, sampling is the technique of selecting a limited number of instances to represent a larger population. This paper studies sampling from graphs, i.e., the selection of a smaller, but still representative, subgraph from an original graph. Such samples are useful in the analysis of very large graphs, for which full processing is impractical.

The authors study the problem of sampling large graphs from two perspectives: scale-down and back-in-time. Scale-down sampling is defined as the selection of a sample S of size n' from a graph G with n vertices such that n' < n and S has properties similar to those of G. In back-in-time sampling, the sample S is expected to have the properties that G had when it had n' vertices. Samples are evaluated in terms of several static (e.g., in- and out-degree distributions, clustering coefficient distribution) and temporal (e.g., densification power law, shrinking diameter) properties of the original graph. Two distributions are compared using the Kolmogorov-Smirnov D-statistic. The sampling techniques studied in the paper are the following (a small sketch of a few of them follows the list):


  • Random Node: Vertices are selected at random.
  • Random PageRank Node: Vertices are selected with a probability that is proportional to their PageRank weight.
  • Random Degree Node: Vertices are selected with a probability that is proportional to their degrees.
  • Random Edge: Edges are selected at random.
  • Random Node-edge: A vertex is selected at random and then one of its edges is selected at random as well.
  • Hybrid: Applies random node-edge sampling with probability p and random edge sampling with probability 1 - p (p = 0.8).
  • Random Node Neighbor: A vertex is selected at random together with all its neighbors.
  • Random Walk: Picks a vertex at random and simulates a random walk starting from it; at each step, the walk returns to the starting vertex with probability c (c = 0.15).
  • Random Jump: Similar to the random walk, but jumps to a random vertex instead of the same starting vertex.
  • Forest Fire: Randomly picks a vertex and then selects x of its out-going neighbors, where x is geometrically distributed with mean p_f/(1 - p_f) and p_f is a parameter; the process is then applied recursively to the selected neighbors.
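
To make a few of these strategies concrete, here is a minimal sketch of the Random Node, Random Degree Node, and Forest Fire strategies in Python with networkx (my own illustration under simplifying assumptions, not the authors' code; it treats the graph as undirected, and the names n_prime and p_f just mirror the text above):

    import random
    import numpy as np
    import networkx as nx

    def random_node_sample(G, n_prime):
        # Random Node: pick n' vertices uniformly at random, keep the induced subgraph.
        nodes = random.sample(list(G.nodes()), n_prime)
        return G.subgraph(nodes).copy()

    def random_degree_node_sample(G, n_prime):
        # Random Degree Node: pick vertices with probability proportional to their degree.
        nodes = list(G.nodes())
        weights = [G.degree(v) for v in nodes]
        chosen = set()
        while len(chosen) < n_prime:
            chosen.add(random.choices(nodes, weights=weights, k=1)[0])
        return G.subgraph(chosen).copy()

    def forest_fire_sample(G, n_prime, p_f=0.7):
        # Forest Fire (simplified): burn outward from a random seed; the number of
        # neighbors burned at each step is geometric with mean p_f / (1 - p_f).
        burned = set()
        while len(burned) < n_prime:
            seed = random.choice(list(G.nodes()))
            burned.add(seed)
            queue = [seed]
            while queue and len(burned) < n_prime:
                v = queue.pop(0)
                x = np.random.geometric(1 - p_f) - 1  # mean p_f / (1 - p_f)
                for u in [u for u in G.neighbors(v) if u not in burned][:x]:
                    burned.add(u)
                    queue.append(u)
        return G.subgraph(burned).copy()

All three sketches return the subgraph induced on the selected vertices; whether to keep all induced edges or only the traversed ones is a design choice I'm glossing over here.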


The graph datasets used in the evaluation are a citation network from arXiv, an autonomous systems network, a bipartite affiliation network from arXiv, and a network of trust from Epinions. The results show that, although there is no single best overall strategy, the Random Walk and the Forest Fire sampling achieved the best results in the scale-down sampling task. Regarding the back-in-time task, the Random PageRank Node and the Forest Fire strategies obtained the best results in terms of the Kolmogorov-Smirnov D-statistic on average. The paper also presents an evaluation of the impact of the sample size on the quality of the sampling. For small samples (< 40%), the Random Degree Node, Random Jump, and Random Walk strategies achieved the best results in the scale-down task, and the Forest Fire, Random Node, and Random PageRank Node strategies achieved the best results in the back-in-time task. Moreover, methods based on edge selection did not perform well in general.

My first conclusion from this paper is that it is not exactly what I'm looking for. It seems that the paper Statistical properties of sampled networks is much more related to the properties of random subgraphs. However, I enjoyed reading this paper, especially because of its experimental section. The evaluation metric applied (the Kolmogorov-Smirnov D-statistic) may be handy in the future.
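
Since I may want to use it, here is a minimal sketch of the D-statistic, computed as the maximum absolute difference between two empirical CDFs (my own illustration, not code from the paper; G and S in the usage comment are hypothetical graphs, e.g. an original graph and its sample):

    import numpy as np

    def ks_d_statistic(values_a, values_b):
        # Kolmogorov-Smirnov D-statistic: maximum absolute difference between
        # the empirical CDFs of the two value lists.
        a = np.sort(np.asarray(values_a, dtype=float))
        b = np.sort(np.asarray(values_b, dtype=float))
        xs = np.union1d(a, b)
        cdf_a = np.searchsorted(a, xs, side="right") / a.size
        cdf_b = np.searchsorted(b, xs, side="right") / b.size
        return np.abs(cdf_a - cdf_b).max()

    # Usage: compare the degree distribution of a graph G with that of a sample S.
    # D = ks_d_statistic([d for _, d in G.degree()], [d for _, d in S.degree()])

In practice, scipy.stats.ks_2samp computes the same statistic (plus a p-value).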

Link: http://cs.stanford.edu/people/jure/pubs/sampling-kdd06.pdf
