Random sampling is probably the most widespread sampling technique in the literature. Nevertheless, while random sampling works very well in several scenarios, networks have an important property that affects the performance of random sampling, which is the existence of two distinct entities: nodes and links. As an example, the degree is not an individual characteristic of a node. The authors also discuss the reverse relationship between sampling and attacks (or breakdowns) in networks, something that is very interesting.
There are several sampling techniques in the literature. In this paper, they consider three techniques: node, link, and snowball sampling. In node sampling, a certain number of nodes are chosen randomly and the links among these nodes are kept. Link sampling can be seen as the opposite, in the sense that links are selected at random and nodes connected by such links are kept. Snowball sampling starts from a single node and recursivelly picks nodes connected to already selected nodes until the desired number of nodes is reached. Partially crawled networks are usually a result of snowball sampling.
In order to study the sampling techniques described, the authors consider four networks: a synthetic network generated using the Barabasi-Albert model, a protein interaction network, a network of autonomous systems, and co-authorship network. The topological properties of interest are: degree distribution, betweeness centrality, average shortest path, assortativity and clustering coefficient.
Degree distribution: The authors present the following analytical formulation for the degree distribution p' of a sampled network based on the degree distribution p of the original network:
p'(k) = Sum_{k_0=k to inf} p(k_0) C(k_0,k) a^k(1-a)^k_0-k
Where a is the probability of selecting a node from the original network (size of sample / size of network). This formulation works for both node and link sampling strategies. As part of my master's thesis research project, it took me one week to devise a similar formulation by myself. I should have read this paper one year ago. Something that I did not know is that this formulation may not work well for small values of a. This property has important implications for the normalization approach I proposed in my research project. The authors found that, while the node and link sampling overestimate the exponent of the degree distribution, the snowball sampling underestimates it, due to its bias towards high degree nodes.
Average path length: For node and link sampling, the average path length is larger than the one from the original network. On the other hand, the snowball sampling gives an average path that is lower than the original value.
Betweenness centrality distribution: The betweenness centrality of a node is the ratio of the number of shortest paths running through it to the number of shortest paths between pairs of nodes from the network. The betweenness centrality is known to follow a power-law distribution in several real graphs. The betweenness centrality of nodes from sampled networks are similar to those from the original network in terms of degree distribution: node and link sampling overestimate the exponent, while snowball sampling underestimates it.
Assortativity: Assortativity is the correlation between the degrees of connected nodes. It is known to be positive in social networks and negative in biological and technological networks. The authors found that node and link sampling conserve the assortativity of networks. However, sampled networks from snowball sampling are more disassortative than the orginal networks. The authors argue that this result is due to the links to the nodes in the last layer, which are lost in the snowball sampling process.
Clustering coefficient: The clustering coefficient of a node is given by:
C_i = 2y / k_i(k_i-1)
Where y is the number of links between neighbors of i and k_i is the degree of i. The clustering coefficient of a network is the average clustering coefficient of its nodes. For node and snowball sampling, the clustering coefficient of sampled networks are close to those from the original ones. Link sampling underestimates the clustering coefficient of networks.
This is certainly one of the most interesting papers I've ever read. It is very related the research I've done in my master's degree project. Moreover, it provides the reader with plenty of theoretical and empirical results. I've been thinking about comparing several properties of random samples and graphs induced by homophily in attributed graphs. I think this kind of analysis may give a broader view of the correlation between node attributes and the network topology in real graphs.
Link: http://arxiv.org/abs/cond-mat/0505232
Nenhum comentário:
Postar um comentário