segunda-feira, 6 de fevereiro de 2012

A Data-Based Approach to Social Influence Maximization

These papers with so much theory are really holding me back from my reading plans. I may get more used to them with practice. This paper studies the social influence maximization problem, which is basically the problem of selecting a small set of nodes that maximize the influence in a social network, and is co-authored by researchers from University of British Columbia and Yahoo! Research. Different from previous work that have considered a version of this problem where a real network is used but a propagation model is used to simulate the influence, instead of real influence traces. One funny thing is that I have criticized this approach 2 years ago while looking for a new research project.

The independent cascade model (IC), where each active node has a given probability of influencing its neighbor, and the linear threshold model (LT), where each node has a uniformly distributed threshold that must be reached by the total weight from its active neighbors in order to it be activated too, are the propagation models applied in the influence maximization problem. The parameters of these models for a given network are usually assumed to be given, although only some recent work have focused on this important aspect of the problem. Given a seed set S, the standard approach is to simulate the spread of influence using one of the defined propagation models and then evaluate the influence of S. The objective is the identification of a seed set of size k that maximizes the influence, a problem known to be #P-hard. In order to solve this problem for large datasets, previous work have found that the expected influence function is monotone and submodular, concepts defined as follows.

Monotonicity: A function f is said to be monotone if f(S) <= f(T) whenever S is a subset of T

Submodularity: A function f is said to be submodular if f(S+w)-f(S) >= f(T+w)-f(T).

An important property of monotone submodular functions is that the a greedy algorithm that selects items  to be part of S based on marginal gain achieves a result that approximates the optimal result by a factor of (1-1/e)

The first important result of the paper is that existing ad hoc propagation models are not able to learn influence probabilities that fit those found in real data. They show it by comparing the edge probabilities learned by an EM-based method against other existing methods. Two real datasets one from, which is a movie rating website, and the other from are used. On flixter, a user u_1 influences u_2 if u_2 rates a movie already rated by u_1 and after they became friends. Similarly, on flickr, u_1 influences u_2 whenever u_2 joins a group of which u_1 was a member before and they were friends before it happened. It is important to notice that this definition of influence is a heuristics. The datasets were split into train and test sets. In the first experiment, the authors show that the seeds selected using the propagation models are very different from those based on the EM-based method, i.e. they do not fit the seeds selected using the propagation data. For all the methods a greedy algorithm was applied in the identification of the seed sets, I assumed that the marginal gains were computed using Monte Carlo simulation.  It is also shown that the expected influence computed using the propagation models differs significantly from the actual influence in the test set. The procedure used in this evaluation is selecting the initiators of spreading processes in the test set and taking as their propagation size the number of users that performed such actions. EM achieves the best results among the methods evaluated.

Further, the authors propose the Credibility Distribution Model (CD), which is the greatest contribution of the paper. The CD model uses propagation traces, which are tuples <user,action,time>, to measure the expected influence of a set of nodes S. The model is based on the effect of a given seed set S over each node from the network:

sigma_cd(S) = Sum_{u in V} k_S_u

where k_S_u is the credit relative to u and is defined as:

k_S_u = (1/A_u) * Sum_{a in A} Big_Gamma_S_u(a)

where Big_Gamma_S_u(a) is given as:

1,                               if u is part of S
Sum_{w in N_in(u,a)}Big_Gamma_S_w(a).small_gamma_w_u(a)

where small_gamma_w_u(a) = 1 / d_in(u,a), N_in(u,a) is the set of vertices that have u as neighbors and have the action a, and d_in(u,a) is the indegree of u in the graph induced by the action a. The authors propose also a better alternative for computing small_gamma_w_a that depends on the time in between w and u performed a, the average time that u takes to copy an action from v, and the influenceability of u. The intuition of the model is just starting from u in the reverse direction of edges until finding a node from S always multiplying the result by small_gamma in the graph induced by a given action. The authors show that the influence maximization problem under the CD model is #P-hard (the vertex covering problem can be reduced to it) using induction on path lengths. I will not discuss the proof here. They also show that the CD function is monotone and submodular using induction.

The last theoretical contribution of the paper is showing that the marginal gain of a given vertex, which is the basis of the greedy algorithm for the identification of seed sets, can be computed efficiently without simulation or even scanning the action log many times. Marginal gains can be computed using the following expression instead:

sigma_cd(S+x) - sigma_cd(S) = Sum_{a in A}((1-Big_Gamma_S_x(a)) . Sum_{u in V} (1/A_u) . Big_Gamma_x_u^V-S(a))

The proof of this equation takes 1 entire page of the paper. The intuition of this expression is that the marginal gain can be computed using the credit of the current seed set S for each action from the set of actions and also the credit of x over the vertices that are not in the seed set. This comes from the fact that the score of each vertex from the seed set can be computed independently by removing any other vertex in the seed set from the set of nodes considered, something that I understood but just does not seem very right to me. It important to notice that the marginal gain must consider both the removal of x from the set of nodes that may be activated and its addition to the seed set as an initializer. In particular, the removal of x requires subtracting the scores of all paths between vertices from the seed set and any other vertex passing through x. I've followed the proof and I think it is really impressive, very good work.

The algorithms that apply the CD model in the identification of the seed set S are straightforward. First, the action log is scanned and the credit for each combination of vertices and actions are computed. This data is given as an input to a greed algorithm that applies the smart lazy strategy proposed by Leskovec, which updates the marginal gain of the top candidate vertex since the marginal gain of vertices always decreases along the iterations. Marginal gains are computed using the expression presented above. 

In the experimental evaluation section, the authors apply the two datasets already described here. The CD model is compared with the IC and the LT model. It is shown that, in general, the CD model predicts the propagation sizes better than the other models. The seed sets identified uing the IC and the LT models have no superposition with the set found using CD and they achieve higher influence spread, but not as higher as I expected. Moreover, CD is at least one order of magnitude faster, although it may require a large memory space in the worst case. A pruning threshold is proposed to solve this issue. Finally, the authors show that the seed sets and their influence spread found using CD converges using a small part of the action log.

This paper is a really great reading. It certainly takes some courage to go through its theorems, definitions and proofs, but I considered it worthy. However, I consider that there is a lot of room for improving the proposed model, specially because it seems to consider the effect of each node from the seed set independently, what is counterintuitive. I should be better at math in order to read a paper like this faster and also to devise this kind of technique. Let's keep working!


Nenhum comentário:

Postar um comentário