Monday, January 16, 2012

Does Bad News Go Away Faster?

Today I read this short paper on the correlation between persistence and content in information diffusion. The work comes from a group at Cornell University that includes Jon Kleinberg. The general idea is that content features associated with a piece of information make it possible to identify whether it will persist. A piece of information is said to be persistent if it still attracts attention some time after its peak of popularity. Non-persistent information, on the other hand, fades away very quickly after its peak.

The authors use a dataset of 21K URLs from Twitter as a case study. The decay time of a URL is defined as the time it takes for the URL to reach 75% of its mentions after its peak of popularity. They show that the decay time of URLs seems to follow a power-law distribution. Moreover, the mean decay time is 217 hours and the median decay time is 19 hours. A URL is classified as persistent if its decay time is longer than 24 hours, and as non-persistent if its decay time is shorter than 6 hours. It is interesting to see that the time series representing these two categories are clearly different.
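To make the thresholds concrete, here is a minimal sketch of how a decay time could be computed and labeled. It assumes one particular reading of the definition above: the number of hours after the peak hour until 75% of all post-peak mentions have accumulated. The function names and the hourly-count input format are my own and do not come from the paper.

```python
PERSISTENT_MIN_HOURS = 24  # from the post: persistent if decay time > 24 hours
FADING_MAX_HOURS = 6       # from the post: non-persistent if decay time < 6 hours


def decay_time(hourly_mentions, fraction=0.75):
    """Hours after the peak until `fraction` of the post-peak mentions occurred.

    Assumption: `hourly_mentions` is a non-empty list of mention counts, one
    per hour. Returns None when there are no mentions after the peak.
    """
    peak = hourly_mentions.index(max(hourly_mentions))
    tail = hourly_mentions[peak + 1:]
    total = sum(tail)
    if total == 0:
        return None
    target = fraction * total
    running = 0
    for hours_after_peak, count in enumerate(tail, start=1):
        running += count
        if running >= target:
            return hours_after_peak
    return len(tail)


def classify(hours):
    """Label a decay time using the two thresholds quoted in the post."""
    if hours is None:
        return "undefined"
    if hours > PERSISTENT_MIN_HOURS:
        return "persistent"
    if hours < FADING_MAX_HOURS:
        return "non-persistent"
    return "unlabeled"  # between 6 and 24 hours: in neither class


fast = [1, 50, 10, 5, 1, 0, 0, 0]  # sharp spike, quick fade
slow = [2, 40] + [1] * 40          # spike followed by a long, flat tail
print(classify(decay_time(fast)), classify(decay_time(slow)))
```

Note that URLs whose decay time falls between 6 and 24 hours belong to neither class, which is why the sketch returns a separate label for them.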

The content features used in the study are extracted from the HTML page that the URL points to. The authors consider combinations of the header, the body and the URL links. These features are used to train an SVM classifier, which achieves around 70% accuracy in identifying persistent URLs. In particular, the more data (i.e., features) the classifier is given, the more effective it becomes.

In a further analysis, the authors study linguistic properties of the content associated with the two categories of URLs. Using LIWC (Linguistic Inquiry and Word Count), a taxonomy composed of 60 pre-defined categories of words, they show that affective content is related to persistence. Moreover, they find that content associated with cognitive processes (e.g., thinking, knowing) is more viral (i.e., less persistent), and that rapidly fading URLs point to content with more words related to action and time (e.g., news). As for the words that best represent each class, the authors show that persistent URLs link to pages containing words related to entertainment, positive emotions, advertisement and marketing, while non-persistent URLs point to pages with words associated with news, blogs, and names.
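The kind of category-based word counting described above can be sketched in a few lines. The lexicon below is a toy stand-in that I made up for illustration; it is not the LIWC dictionary, which is much larger and is not reproduced here. The category names and the `category_rates` function are likewise hypothetical.

```python
import re
from collections import Counter

# Toy lexicon invented for this example; NOT the real LIWC word lists.
TOY_LEXICON = {
    "affect": {"love", "happy", "great", "sad"},
    "cognitive": {"think", "know", "because", "consider"},
    "time": {"today", "now", "yesterday", "soon"},
}


def category_rates(text, lexicon=TOY_LEXICON):
    """Return, per category, the fraction of words in `text` that belong to it."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {category: 0.0 for category in lexicon}
    counts = Counter()
    for word in words:
        for category, vocabulary in lexicon.items():
            if word in vocabulary:
                counts[category] += 1
    return {category: counts[category] / len(words) for category in lexicon}


print(category_rates("I love this great song"))
print(category_rates("We think the vote happens today"))
```

Per-category rates like these could then be compared between the pages behind persistent and non-persistent URLs, which is the spirit of the analysis summarized above.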

I consider the problem studied in the paper to be very interesting to a broad range of professionals. The idea that the information itself affects the way it propagates is very intuitive, but I had never seen a scientific study about it. However, I also think the paper does not deliver any unexpected result. In fact, its conclusions, while interesting, are not surprising at all.

Link: http://www.cs.cornell.edu/home/kleinber/icwsm11-longevity.pdf
