segunda-feira, 2 de abril de 2012

Sequential Influence Models in Social Networks

This paper is authored by some researchers from Cornell University, including Jon Kleinberg, and Yahoo! Research. It studies two influence models, one based on snapshots of a network and another based on detailed dynamics. In this models, vertices are activated w.r.t. communities and the metric of interest is how the number of active neighbors affects the probability of a node to be activated. The two models are compared and a technique for estimating the detailed temporal dynamics based on snapshot data is proposed and evaluated using a dataset from Wikipedia.

Previous work have studied social influence by measuring how the number of active neighbors of a node affects the likelihood of this node become active. In fact, this idea is the basis of a standard epidemic model called Linear Threshold Model. This paper consider two definitions of influence according to the network information available:

Snapshot definition: Based on information from snapshots of network at different points in time. Given the number of nodes with k infected neighbors at time t_1 (d_s(k)) and the number of nodes with k infected neighbors at time t_1 which are infected at time t_2 (n_s(k)), the metric of interest p_s(k) is defined as p_s(k) = n_s(k) / d_s(k). 

Ordinal-time definition: The complete information, which includes the link formation and the activation of each node along time is available. In this case, d_o(k) is the number of nodes exposed to k infected neighbors at some point in time and n_o(k) is the number of nodes that where k-exposed and became active before becoming k+1 exposed. The influence ration is defined as p_o(k) = n_o(k) / d_o(k).

These two definitions are evaluated using a dataset from Wikipedia. An undirected social network is defined based on links between pairs of editors for which at least one of them has written on the other's user-talk page. Users become active by editing an article. Therefore, each article has a community of editors that can spread over the network. A dataset that contains approximately 510K users of the English Wikipedia is used in the study. Moreover, for some analysis, results using the German and French versions of Wikipedia are also presented.

The first important result of the paper is that the influence definitions produce different results. In the following figure, the authors show the values of p_o, and p_s for different values of k and considering the ordinal-time and the snapshot definitions.


These results come direct from the values of n_o, d_o, n_s and n_d, which are shown in the following Figure. These functions seem to follow a power-law distribution.



Further, the authors study the relation between the two models based on the following sets:

  • B(t_1) = {(u,C,k_1) | u joined C before t_1 an u had k_1 neighbors in C at t_1}. These events do not affect p_s but do affect p_o. As a consequence, the values of  n_o and d_o are shifted upwards with respect to n_s and d_s.
  • J(t_1,t_2) = {(u,C,k_1,k_2)  | u had k_1 neighbors in C at t_1, u joined C between t_1 and t_2, u had k_2 neighbors in C at t_2}. These events contribute to both p_s and p_o. However, the contribution over p_o is stretched towards higher values of k.
  • N(t_2) = {(u,C,k_2) | u did not join C before t_2 and u had k_2 neighbors in C at t_2}. Such tuples contribute only to p_o.

The analysis of the slope and the y-intercepts of the fit of the functions n_o, d_o, n_s, and d_s to a power-law distribution shows the upward effect of B and N. The stretching effect of J is small.

The authors also present a very simple technique for simulating ordinal time data from snapshots. For the events in B, it is assumed that u joined C at a uniformly random time j between 0 and t_1. The same is done for J (u is assumed to have joined C at a uniformly random time j between t_1 and t_2). The events in N are not considered. It is shown that this technique works well in practice, producing values of p_o that are similar to those obtained from ordinal-time data. The accuracy of the technique improves with the number of snapshots and as the time between snapshots gets longer.

The problem studied in this paper is very specific, but it is certainly relevant. I missed more datasets in the evaluation. In several points, I wondered whether the results discussed were general enough. In particular, the arguments of the authors were too complex to me. I think these arguments should be better supported with more experimental results. Finally, although they argued their technique for simulating ordinal-time influence based on snapshots worked well, they do not present any quantitative measure to support it. Moreover, I did not see any reason why such a technique model is actually effective.

Link: http://www.cs.cornell.edu/home/kleinber/icwsm10-seq.pdf

Nenhum comentário:

Postar um comentário