Thursday, July 12, 2012

Everyone's an influencer: Quantifying Influence on Twitter

Another paper on influence in social networks using Twitter data. The first author of this paper wrote it while he was a PhD student at the University of Michigan. The other authors, including the legendary Duncan Watts, were at Yahoo! Research at the time, though most of them may have already left Y!R by now. In this paper the authors conduct several experiments in order to understand and model the role played by users, especially the so-called influentials, and by content in the generation of information cascades on Twitter.

An information cascade is a popular representation of the diffusion of a given piece of content. It describes the chain of propagation from the seed to the leaf users. A directed edge (u,v) in the cascade means that there is a path from v to u (e.g., v follows u). The definition of influence used in the paper is the logarithm of the sum of the sizes of the cascades seeded by a given user. The content considered is URLs, more specifically those URLs shortened using bit.ly.
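As a minimal sketch of this definition of influence (the data layout and the function name are my own assumptions, and I assume that the size of a cascade counts the seed itself, so that the logarithm is always defined):

```python
import math
from collections import defaultdict


def seed_influence(cascades):
    """cascades: list of (seed, cascade_size) pairs, one pair per cascade.

    Returns {seed: log of the summed size of the cascades that seed started}.
    Assumption: each size counts the seed too, so every total is >= 1.
    """
    totals = defaultdict(int)
    for seed, size in cascades:
        totals[seed] += size
    return {seed: math.log(total) for seed, total in totals.items()}


# Made-up toy data: alice seeded two cascades, bob seeded one that never spread.
cascades = [("alice", 1), ("alice", 4), ("bob", 1)]
influence = seed_influence(cascades)
```

A seed whose URLs are never reposted gets influence log(1) = 0 under this assumption.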

All the analyses are based on a Twitter dataset containing 87M tweets and 74M diffusion events (I suppose this means the number of (URL, seed) pairs) crawled over a two-month period, together with a follower network collected using the users that took part in at least one event as seeds.

In case the cascade reaches an ambiguous step - a user that published a given content has two neighbors who also published it - the authors considered three possibilities:

  • First influence: The first publisher takes the credit
  • Last influence: The most recent (last) publisher takes the credit
  • Split credit: The credit is divided equally between the publishers.

In practice, the three strategies achieved similar results, but it is good to know them anyway.
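The three attribution rules can be sketched as follows. This is only an illustration under my own assumptions: the input format (`post_times`, `follows`) and the function name are hypothetical, and the paper does not publish code.

```python
def assign_credit(post_times, follows, strategy="first"):
    """post_times: {user: time at which the user posted the URL}.
    follows: {user: set of users that the user follows}.

    Returns {user: {parent: weight}}; a seed maps to an empty dict.
    """
    credit = {}
    for user, t in post_times.items():
        # Candidate parents: followed users who posted the URL strictly earlier.
        earlier = [f for f in follows.get(user, ())
                   if f in post_times and post_times[f] < t]
        if not earlier:
            credit[user] = {}  # nobody to credit, so this user is a seed
        elif strategy == "first":
            credit[user] = {min(earlier, key=post_times.get): 1.0}
        elif strategy == "last":
            credit[user] = {max(earlier, key=post_times.get): 1.0}
        elif strategy == "split":
            credit[user] = {p: 1.0 / len(earlier) for p in earlier}
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return credit


# Made-up example: c follows both a and b, and both posted before c.
post_times = {"a": 1, "b": 2, "c": 3}
follows = {"b": {"a"}, "c": {"a", "b"}}
first = assign_credit(post_times, follows, "first")
last = assign_credit(post_times, follows, "last")
split = assign_credit(post_times, follows, "split")
```

Here user c is the ambiguous step: "first" credits a, "last" credits b, and "split" gives half of the credit to each.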

It is interesting to note that most of these papers on propagation/diffusion on Twitter consider URLs and hashtags as the type of content to be propagated. The authors discuss some of the implications of this choice. In particular, URLs are more inclusive than other information items such as tweets. The same can be said about hashtags. They also discuss the potential effect of other sources of correlation, especially homophily, on the occurrence of cascades. In fact, the first author worked on this problem later and I have already summarized his paper here.

Examples of cascades are shown in the following figure:


Cascade sizes seem to follow a power-law distribution, and the cascade depth distribution looks like an exponential.

In order to model influence, the authors propose the use of a regression tree model. Regression tree models are an elegant way to integrate several simple predictors into a tree whose branches are based on the values of variables. This property is expected to handle cascades of varied sizes well. The attributes of the regression model are the following:

  • User attributes (log-transformed)
    • # followers
    • # friends
    • # tweets
    • date of joining
  • Past influence (in the first month):
    • avg, min, and max total influence
    • avg, min, and max local (immediate followers) influence.

Data from the first month are used in the generation of features to predict influence in the second month. Here is the regression tree model found. Conditions indicate partitions of the feature space (left if the condition holds and right otherwise) and leaf nodes give the logarithm of the predicted mean cascade size.
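To make the idea of a regression tree concrete, here is a small pure-Python sketch. It is not the authors' implementation: the greedy squared-error splitting, the function names, and the toy numbers are all my own assumptions. It only follows the convention described above, going left when the condition holds and storing the mean target in each leaf.

```python
def sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def fit_tree(X, y, depth=2, min_leaf=2):
    """Greedy regression tree. X is a list of feature rows, y the targets.

    Internal nodes test row[feature] < threshold (left if it holds);
    leaves store the mean target of the rows that reach them.
    """
    best = None
    if depth > 0 and len(y) >= 2 * min_leaf:
        for j in range(len(X[0])):
            for thr in sorted({row[j] for row in X}):
                left = [i for i, row in enumerate(X) if row[j] < thr]
                right = [i for i, row in enumerate(X) if row[j] >= thr]
                if len(left) < min_leaf or len(right) < min_leaf:
                    continue
                error = sse([y[i] for i in left]) + sse([y[i] for i in right])
                if best is None or error < best[0]:
                    best = (error, j, thr, left, right)
    if best is None:
        return {"leaf": sum(y) / len(y)}
    _, j, thr, left, right = best
    return {
        "feature": j,
        "threshold": thr,
        "left": fit_tree([X[i] for i in left], [y[i] for i in left], depth - 1, min_leaf),
        "right": fit_tree([X[i] for i in right], [y[i] for i in right], depth - 1, min_leaf),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's mean target."""
    while "leaf" not in tree:
        go_left = row[tree["feature"]] < tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return tree["leaf"]


# Made-up toy data: [log #followers, mean past log influence] per user (month 1),
# and the log influence to be predicted (month 2).
X = [[1.0, 0.0], [2.0, 0.1], [3.0, 0.0], [6.0, 1.0], [7.0, 1.2], [8.0, 0.9]]
y = [0.0, 0.1, 0.0, 1.0, 1.1, 0.9]
tree = fit_tree(X, y, depth=1, min_leaf=2)
low = predict(tree, [2.5, 0.05])
high = predict(tree, [7.5, 1.1])
```

With a single split, every user falls into one of two groups and receives that group's mean. This also hints at why such a model can be right on average while missing the size of individual cascades.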


Past influence and number of followers are the only relevant features. Moreover, while the model works well for average values, which are very low in general, it does not predict specific cascade sizes well.

Further, the authors study the effect of content on information diffusion. URLs were stratified into groups according to their cascade sizes and a sample from each group was selected uniformly at random. These samples were evaluated in terms of several criteria using humans from Mechanical Turk (it sounds funny). This figure shows average cascade sizes for different URL types and categories:


They also found that URLs classified as interesting and associated with positive feelings tend to generate larger cascades. However, adding the content attributes to the regression model did not improve it.

In the final (and most interesting) part of the evaluation, the authors tried to answer a very challenging question: If you are a marketing professional and want to hire some Twitter users to spread your content, which users would you choose? Intuitively, two opposing strategies would be selecting many "low rank" users or a few "star" users to do the job. In fact, the solution is likely to be a trade-off between these strategies. The authors defined a cost function as follows:

c_i = c_a + f_i*c_f

where c_a is a fixed cost per user, c_f is a cost per follower and f_i is the number of followers of the user u_i. The value of c_a is considered to be in the form alpha*c_f. Therefore, the value of alpha defines the trade-off between selecting many less influential users or a few more influential ones. If alpha is small, hiring many less influential users becomes more attractive. On the other hand, if alpha is large, then selecting a few influential users tends to be a better strategy. They binned users according to their predicted influence given by the regression model and computed the mean actual influence / cost for each group. The results found are the following:
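The effect of alpha is easy to check with a small sketch of the cost function. The value of c_f, the follower counts, and the influence values below are made-up numbers for illustration, and the function names are mine.

```python
def user_cost(followers, alpha, c_f=0.01):
    """c_i = c_a + f_i * c_f, with the fixed cost c_a = alpha * c_f."""
    c_a = alpha * c_f
    return c_a + followers * c_f


def mean_influence_per_cost(users, alpha, c_f=0.01):
    """users: list of (followers, influence) pairs belonging to one bin.

    Note: a user with 0 followers and alpha = 0 would have zero cost,
    so such users must be filtered out before calling this function.
    """
    ratios = [influence / user_cost(followers, alpha, c_f)
              for followers, influence in users]
    return sum(ratios) / len(ratios)


# Made-up bins: "ordinary" users versus "star" users.
ordinary = [(100, 1.0), (200, 2.0)]
stars = [(100_000, 200.0), (200_000, 300.0)]

cheap_ordinary = mean_influence_per_cost(ordinary, alpha=0)
cheap_stars = mean_influence_per_cost(stars, alpha=0)
costly_ordinary = mean_influence_per_cost(ordinary, alpha=100_000)
costly_stars = mean_influence_per_cost(stars, alpha=100_000)
```

In this toy example the ordinary users give more influence per unit of cost when alpha = 0, while the stars win once the fixed cost per user dominates (alpha = 100,000).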


If alpha = 0, then selecting users with small mean predicted influence is the best choice. As alpha is increased, selecting influential users becomes more attractive, as expected. However, as the authors pointed out, the transition between these two regimes is slow. This evidence calls into question the widely accepted belief that selecting a small set of influentials is a good strategy for viral marketing campaigns.

This is the second paper authored by Eytan Bakshy that has been summarized on this blog. Both papers tried to answer fundamental questions about social influence using simple but sound empirical analysis and comprehensive datasets. In particular, I liked this paper more than the other one because the question here is clearer. Moreover, reading the introduction of this paper was delightful (great work!). One question about the last part of the evaluation (ordinary users vs. influentials) is: If the model does not work and the overall average size of cascades is small, does this analysis make sense? I mean, after thinking about this analysis a little bit I realized that the results obtained were highly expected. However, a better model may support a more conclusive analysis. I would like to see the results of an optimal model - one that always predicts the actual influence - in order to fill this gap.

Link: research.yahoo.com/files/wsdm333w-bakshy.pdf
