Saturday, January 14, 2012

Predicting the popularity of online content

This was certainly the busiest week of the year, so this is the only paper I've read this week. It is about the problem of predicting the popularity of content on the web, a work by Gabor Szabo and Bernardo Huberman, from the HP Social Computing Lab. A shorter version of this paper was published in Communications of the ACM.

The basic premise of the paper is that we can predict the popularity of a given piece of content from early measurements of its popularity. The authors study statistical techniques to solve this problem using data from Digg (story popularity) and YouTube (video popularity). To reduce the effect of daily and weekly popularity variations, the authors apply a logarithmic transformation to the popularity counts. They also propose the digg hour, a measure of time that is less affected by such variations: a digg hour takes longer during low-activity periods.

Given an indicator time t_i and a reference time t_r, with t_i < t_r, the problem consists of predicting the popularity of a piece of content c at t_r given its popularity at t_i. The authors show that there are significant linear correlations between popularity counts at t_i and t_r, especially when a logarithmic transformation is applied. As a consequence, they propose the following linear model for popularity prediction:

ln N_s(t_r) = ln[r(t_i,t_r)N_s(t_i)] + noise(t_i,t_r)

where N_s is the popularity at a given time, r(t_i,t_r) is the linear relationship between popularities at different times, and the noise term describes the randomness in the data. The authors show that the noise is normally distributed in log-scale. Moreover, the log-normal popularity distribution found for Digg stories is a further motivation for the proposed linear model. Three prediction models, each with an error function to be minimized, are proposed: a linear model, a constant scaling model, and a growth profile model (from the literature).

Linear model: Described by the equation given above. The error function is the least-squares error. However, instead of minimizing the error of the log-transformed counts, the objective is to minimize the error on the linear scale; therefore, the authors apply a transformation that recovers the linear popularity at t_r from the log-scale popularity at t_i.
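As a rough sketch of this idea (the function names and the simple slope-one fit in log-space are my own assumptions, not the paper's exact estimator), one could fit the offset ln r and the noise variance on training pairs, and then apply the standard log-normal mean correction exp(sigma^2/2) when converting the log-scale prediction back to the linear scale:

```python
import math

def fit_linear_model(train):
    """Fit ln N(t_r) = ln r + ln N(t_i) + noise on (n_ti, n_tr) pairs.

    Returns (ln_r, sigma2): the log-scale offset and the variance of
    the log-scale noise around it.
    """
    residuals = [math.log(n_tr) - math.log(n_ti) for n_ti, n_tr in train]
    ln_r = sum(residuals) / len(residuals)
    sigma2 = sum((res - ln_r) ** 2 for res in residuals) / len(residuals)
    return ln_r, sigma2

def predict_linear(n_ti, ln_r, sigma2):
    """Predict popularity at t_r on the linear scale.

    Because the noise is normal in log-scale, the expected linear-scale
    popularity picks up the log-normal correction factor exp(sigma^2/2).
    """
    return math.exp(math.log(n_ti) + ln_r + sigma2 / 2.0)
```

The correction factor is what makes the difference between minimizing the log-scale error and minimizing the linear-scale error: without it, exponentiating the log-scale prediction systematically underestimates the expected popularity.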

Constant scaling model: The model is defined by the following equation:

N_s(t_i,t_r) = alpha(t_i,t_r) N_s(t_i)

The relative squared error is the function to be minimized, and the value of alpha that minimizes it is found using training data. The relative squared error is given by:

RSE = Sum_c [N_c(t_i,t_r)/N_c(t_r) - 1]^2
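Since the RSE is quadratic in alpha, setting its derivative to zero yields a closed-form optimum: with x_c = N_c(t_i)/N_c(t_r), we get alpha* = sum(x_c) / sum(x_c^2). A minimal sketch (function names are my own) that fits alpha on training pairs:

```python
def fit_constant_scaling(train):
    """Find the alpha minimizing sum_c [alpha * N_c(t_i)/N_c(t_r) - 1]^2.

    Setting the derivative with respect to alpha to zero gives the
    closed form alpha* = sum(x_c) / sum(x_c^2), x_c = N_c(t_i)/N_c(t_r).
    """
    x = [n_ti / n_tr for n_ti, n_tr in train]
    return sum(x) / sum(xc * xc for xc in x)

def predict_constant_scaling(n_ti, alpha):
    """Scale the early popularity count by the fitted constant."""
    return alpha * n_ti
```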

Growth profile model: This model is based on average growth profiles computed from the training set. The growth profile is the average popularity of content at time t_0, normalized by its popularity at t_1, as defined by the following equation:

P(t_0,t_1) = <N_c(t_0)/N_c(t_1)>_c

where <·>_c denotes the average of the ratio over the content in the training set. Given the profile P, the prediction model is defined as follows:

N_s(t_r) = N_s(t_i)/P(t_i,t_r)
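For a fixed pair (t_i, t_r), the profile reduces to a single average ratio, so the model can be sketched in a few lines (again, the function names are my own):

```python
def growth_profile(train):
    """Average ratio <N_c(t_i)/N_c(t_r)>_c over the training content,
    i.e. P(t_i, t_r) for one fixed (t_i, t_r) pair."""
    ratios = [n_ti / n_tr for n_ti, n_tr in train]
    return sum(ratios) / len(ratios)

def predict_growth_profile(n_ti, profile):
    """N_s(t_r) = N_s(t_i) / P(t_i, t_r)."""
    return n_ti / profile
```

Note the contrast with the constant scaling model: both multiply the early count by a constant, but the growth profile constant 1/P averages the ratios directly instead of minimizing the RSE.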

The three proposed models are evaluated using the least-squares error (LSE) and the relative squared error (RSE) functions. The datasets are divided into training and test sets of equal size. While the linear model achieves the best results in terms of LSE for small values of t_i on the Digg data, the three models perform very similarly as t_i approaches t_r. For the YouTube data, it is not possible to draw significant conclusions about which model is best; moreover, the variations in the value of LSE are higher than for the Digg dataset. In terms of the RSE function, the constant scaling model achieves the best results, as expected, since it directly minimizes this error function.
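Assuming LSE is the sum of squared differences between predicted and observed counts, and RSE is the relative squared error defined earlier (the function names here are mine), the two evaluation metrics can be sketched as:

```python
def least_squares_error(predicted, observed):
    """Sum of squared differences between predictions and observations."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))

def relative_squared_error(predicted, observed):
    """Sum of squared relative deviations, matching the RSE formula above."""
    return sum((p / o - 1.0) ** 2 for p, o in zip(predicted, observed))
```

RSE weights every item equally regardless of its absolute popularity, while LSE is dominated by the most popular items, which helps explain why the two metrics can rank the models differently.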

The authors argue that the different results found for the Digg and YouTube datasets can be explained by the usage patterns of the two applications. While the popularity of Digg stories saturates very early, YouTube videos are accessed over a much longer span of time. Therefore, while the average relative error for Digg content converges to 0, with small variations, as t_i approaches the reference time, for YouTube videos this error does not decrease monotonically and shows high variations until t_i is very close to t_r.

I appreciated reading this paper for several reasons. First of all, the problem studied is very relevant and also challenging. The techniques presented are theoretically sound and their experimental evaluation presented several interesting insights about the properties of both the techniques and the datasets. I will keep reading about more work done by Bernardo Huberman and other members of his lab.

Link: http://arxiv.org/abs/0811.0405
