The authors study the problem of sampling large graphs from two perspectives: scale-down and back-in-time. The scale-down sampling is defined as the the selection of a sample S of size n' from a graph G that has n vertices such that n' < n and that S has similar properties as G. In the back-in-time sampling, the sample S is expected to have the same properties of the graph G when it had n' vertices. Samples are evaluated in terms of several static (e.g., in and out-degree, clustering coefficient distribution) and temporal (e.g., densification power-law, shrinking diameter) properties of the original graph. Two distributions are compared using the Komolgorov-Smirnov D-statistic. The sampling techniques studied in the paper are:
- Random Node: Vertices are selected at random.
- Random Walk Page-rank Node: Vertices are selected with a probability that is proportional to their page-rank weight.
- Random Degree Node: Vertices are selected with a probability that is proportional to their degrees.
- Random Edge: Vertices are selected at random.
- Random Node-edge: Vertices are selected at random and then a random edge from them is selected randomly as well.
- Hybrid: Applies the random node-edge sampling with a probability p and the random edge sampling with a probability 1 - p (p = 0.8).
- Random Node Neighbor: A vertex is selected at random together with all its neighbors.
- Random Walk: Picks a vertex at random and simulates a random walk starting from from it, the algorithm goes back to the starting vertex with a probability c (c = 0.15).
- Random Jump: Similar to the random walk, but jumps to a random vertex instead of the same starting vertex.
- Forest Fire: Randomly picks a vertex and then selects x of its out-going neighbors where x is geometrically distributed with mean pf/(1-pf) and pf is a parameter.
The graph datasets used in the evaluation are a citation network from arXiv, an autonomous system network, a bipartite affiliation network from arXiv and a network of trust from epinions. The results show that though there is not a best overall strategy, the Random Walk Node and the Forest Fire sampling achieved the best results in the scale-down sampling task. Regarding the back-in-time task, the Random Page Rank Node and the Forest Fire strategies obtained the best results in terms of Komolgorov-Smirnov D-statistic on average. The paper also presents an evaluation of the impact of the sample size over the quality of the sampling. For small samples (< 40%), the Random Degree Node, Random Jump, and Random Walk Node achieved the best results in the scale-down task and the Forest Fire, Random Node and Random Page-Rank Node achieved the best resutls in the back-in-time task. Moreover, methods based on edge selection did not performed well in general.
My first conclusion from this paper is that it is not exactly what I'm looking for. I seems that the paper Statistical properties of sampled networks is much more related to the properties of random subgraphs. However, I enjoyed reading this paper, specially because of its experimental section. The evaluation metric applied (the Komolgorov-Smirnov D-statistical) may be handy in the future.
Link: http://cs.stanford.edu/people/jure/pubs/sampling-kdd06.pdf
Nenhum comentário:
Postar um comentário