During the past years I've studied how URLs become popular on Twitter and this is very strong related work. In this paper, Jaewon Yang and Jure Leskovec, both from Stanford University, study patterns that describe how the popularity of a given content (e.g., URLs, hashtags, memes) evolves along time, which they call patterns of temporal variation. Each content produces a time series that describes its popularity during a given time interval and the authors propose a technique for the identification of clusters of contents that present a similar variation pattern. This problem has several relevant applications, such as understanding how online content appears, gets popular, and fades over time, and also predicting content that is likely to become popular.
The clustering algorithm for time series proposed in the paper, which is called K-Spectral Centroid (k-SC) is very similar to the k-means algorithm. However, the authors define a new distance metric that is more suitable for the particular scenario of temporal variation of online content. The proposed metric is defined as an optimization problem that finds the best match between two time series both in terms of the x and y scales. The objective is to shift and scale the time series in order to minimize the effects of the absolute popularity values and the particular time in which such series peaked. Based on this metric, they define an overall evaluation function to be minimized and also an optimization problem for the identification of a cluster's centroid in each iteration of the algorithm. K-SC receives the number of clusters as an input parameter and initial centroids are assigned at random. The algorithm iterates assigning each time series to its closest centroid and then updating the centroid of each cluster until the assignments converge. The identification of the centroids requires the computation of the eigenvectors of a matrix of size L, which takes O(L^3). Moreover, the performance of the algorithm is very sensitive to the initial assignments. In order to make K-SC more scalable, the authors propose an incremental version of the algorithm using the Discrete Haar Wavelet Transform.
The K-SC algorithm is applied in the analysis of 2 datasets. The first one is a set of phrases extracted from blogs and news articles through the MemeTracker methodology. Each time series is truncated to 128 hours and the number of clusters is chosen empirically. Experiments show that K-SC outperforms two variations of K-means in terms of two evaluation metrics: one the measures the distance between points and their respective centroids and another that evaluates the distance between centroids. While the first variation of the K-means algorithm aligns the peaks of the time series, the second also scales peak volumes. They also show that the incremental version of K-SC is more scalable than the naive one.
The 6 clusters identified by K-SC are very interesting. They represent intuitive and distinct patterns, such as fast and short popularization and popularity rebounding. The authors argue that these patterns are related to the time in which different types of media mention the phrases and that this information may be useful to classify phrases into each of the 6 categories. They show that these temporal features are more effective than volume features (i.e., the fraction of the total volume of citations corresponding to a particular website) and also tf-idf features based on the citations of a particular website to a given phrase and the overall citations of this website. Similar results are found in the application of K-SC to the analysis of URLs on Twitter.
This paper studied a very interesting problem and presents several contributions both in terms of the techniques proposed and the case studies presented. I think that, at some points, the authors make assertions that too strong, specially regarding the classification accuracy of the techniques, which seems quite similar to me. However, my overall evaluation of the paper is very positive.
Link: http://cs.stanford.edu/people/jure/pubs/memeshapes-wsdm11.pdf
Nenhum comentário:
Postar um comentário