Sunday, May 13, 2012

TwitterRank: Finding Topic-sensitive Influential Twitterers

This paper is not really new to me, but it became more important because of a project I've been working on over the last few weeks, so I decided to read it carefully this time. It took me less than an hour to read it and grasp its fundamental contributions. Most of the authors are from Singapore Management University, and one is from Pennsylvania State University. I had never heard of any of them, but the paper is fairly well known because it was one of the first academic works on Twitter.

The problem studied in the paper is identifying influential users on Twitter. The authors apply a strategy very similar to the PageRank algorithm, but the proposed algorithm also considers topics extracted from tweets. The idea is to take users' interests into account when computing influence. The use of topics is motivated by topic homophily on Twitter (i.e., users who are connected through following or reciprocal interactions have more similar topic distributions than randomly selected users who do not interact in these ways).

The study is based on a small dataset (6k users, 1M tweets) crawled in 2009. All crawled users are from Singapore, and the selection criterion is the number of followers. Even the authors acknowledge that their sample is biased, which has probably affected their conclusions. As I understood it, they call a reciprocal following relationship on Twitter a friendship. They show that the distributions of the number of tweets, followers, and friends per user are heavy-tailed. Moreover, there is a very strong correlation between a user's number of friends and number of followers.

One of the main contributions of the paper is being the first to show that homophily occurs on Twitter. Basically, they compute each user's topic distribution from tweet content using LDA. I may read a more detailed paper on LDA soon, so I will not describe it here. Given two distributions, they apply the Jensen-Shannon divergence as a comparison metric, which is a symmetrized variant of the Kullback-Leibler divergence. It compares each distribution against the average of the two, summing the probability-weighted logarithms of the ratios. They evaluate homophily between followers and friends using statistical tests, showing that the topics of connected users are significantly correlated.
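To make the comparison metric concrete, here is a minimal sketch of the Jensen-Shannon divergence between two topic distributions. The function names and the choice of base-2 logarithms are my own assumptions for illustration; this is the textbook definition, not the authors' code.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.

    Terms where p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL divergence of p and q to their mean."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical topic distributions for two users over three topics.
user_a = [0.7, 0.2, 0.1]
user_b = [0.1, 0.3, 0.6]
print(js_divergence(user_a, user_b))
```

Unlike the plain Kullback-Leibler divergence, this measure is symmetric and stays finite even when one user never tweets about a topic, because the average distribution is positive wherever either input is.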

Next, they propose TwitterRank, a ranking algorithm based on both network structure and topics. TwitterRank builds a transition matrix for each topic. Transitions between users for a given topic are based on topic similarity and also on the number of tweets. More precisely, TwitterRank gives more weight to transitions to users who generate a higher fraction of the tweets a given user has access to (i.e., tweets that appear in that user's timeline). They consider this a feature, which I disagree with. Further, they admit that this "feature" makes the method more sensitive to malicious behavior. As in PageRank, they also make use of a teleportation matrix and a damping factor. Topic-specific scores can be aggregated based on topic probability (general influence) or on a user's topic probability (perceived influence).
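The topic-specific random walk described above can be sketched roughly as follows. This is my reading of the idea, not the authors' implementation: the similarity formula (one minus the absolute difference in topic share), the damping value 0.85, the fixed iteration count, and all names and toy data are assumptions made for illustration.

```python
def transition_matrix(follows, tweet_counts, topic_share):
    """Build a topic-specific transition matrix P, where P[i][j] is the weight
    of the step from follower i to followee j.

    follows[i]      -- set of user indices that user i follows
    tweet_counts[j] -- number of tweets published by user j
    topic_share[i]  -- fraction of user i's tweets about the topic (0..1)
    """
    n = len(tweet_counts)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # All tweets that reach user i's timeline.
        total = sum(tweet_counts[a] for a in follows[i])
        if total == 0:
            continue
        for j in follows[i]:
            # Assumed similarity: close topic shares give a value near 1.
            sim = 1.0 - abs(topic_share[i] - topic_share[j])
            P[i][j] = tweet_counts[j] / total * sim
    return P


def twitterrank(P, teleport, gamma=0.85, iterations=100):
    """Iterate rank = gamma * P^T rank + (1 - gamma) * teleport."""
    n = len(P)
    rank = list(teleport)
    for _ in range(iterations):
        rank = [
            gamma * sum(P[i][j] * rank[i] for i in range(n))
            + (1 - gamma) * teleport[j]
            for j in range(n)
        ]
    return rank


# Toy network: users 0 and 1 follow user 2, and user 2 follows user 0.
follows = [{2}, {2}, {0}]
tweet_counts = [10, 5, 50]
topic_share = [0.5, 0.1, 0.6]
# Teleportation vector: topic shares normalized to sum to 1.
teleport = [s / sum(topic_share) for s in topic_share]

P = transition_matrix(follows, tweet_counts, topic_share)
rank = twitterrank(P, teleport)
print(rank)
```

Note how the tweet-count term works: a followee who floods the timeline gets a larger share of each follower's weight, which is exactly the sensitivity to malicious behavior mentioned above.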

The evaluation section begins with an overview of the most influential users for the most popular topics. The discussion is highly subjective and does not really add much to the study. They also compare TwitterRank scores against in-degree, PageRank, and Topic-sensitive PageRank (which, to my surprise, only biases the teleportation probability according to the topics). The rankings produced by TwitterRank differ from those of the other methods, which is shown using Kendall's rank correlation. Finally, they use TwitterRank in a followee recommendation task. The results are not very conclusive, since TwitterRank is sometimes outperformed by simpler methods. Also, this section is not very clear because of the many different user selection criteria employed.
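For reference, Kendall's rank correlation between two sets of scores can be computed as in the sketch below. This is the simple tau-a variant, which ignores ties; I am not sure which variant the paper uses, and the function name and example scores are hypothetical.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        # Positive when both methods order users i and j the same way.
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(scores_a) * (len(scores_a) - 1) / 2
    return (concordant - discordant) / pairs


# Hypothetical scores assigned to the same four users by two methods.
indegree_scores = [120, 45, 300, 80]
topical_scores = [0.30, 0.25, 0.20, 0.10]
print(kendall_tau(indegree_scores, topical_scores))
```

A value near 1 means the two methods rank users almost identically, a value near -1 means the rankings are reversed, and a value near 0 means they are largely unrelated.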

I've never been a big fan of this paper. Apart from using Twitter, it is not very innovative. Moreover, I think the flow of ideas is not very clear in general. Finally, the dataset is not representative, as I discussed above. One good point is that the authors seem to be aware of most of the issues I've pointed out here.

Link: http://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=1503&context=sis_research
