This paper is closely related to a previous paper I read on how search volume data can be used to monitor influenza epidemics. Here, the authors study a similar problem: predicting consumer behavior (e.g., movie revenues, game sales, and song ranks) from search data. The researchers involved in the work are from the Yahoo! Research lab in New York City and include Duncan Watts, who is an authority on social network analysis.
In recent years, several papers have shown that search volume is correlated with collective behavior, which is very intuitive. This is probably one of the most interesting of these papers because the authors do not assume that such a method will work. In fact, they show that search volume is not very useful in several important scenarios.
Using search data from the Yahoo! search engine and from Yahoo! Music, the authors study how effective this data is at predicting: (1) opening-weekend box-office movie revenues, (2) first-month sales of video games, and (3) weekly ranks of songs on the Billboard Hot 100. The prediction model applied is a simple linear model. Movie revenues are obtained from the IMDb website and game sales from vgchartz.com. A query is associated with a given movie if the IMDb page of this movie appears as one of the top results for this query on Yahoo! search. In the case of games, specific identifiers from URLs were used to match a query to a particular game. Songs were linked to queries through normalized versions of their titles.
Results show that the accuracy of the models varies significantly depending on the application scenario. The predictions were evaluated in terms of how well they correlated with the actual outcomes, and achieved correlation coefficients of 0.85, 0.76, and 0.56 for the movie, game, and music prediction tasks, respectively. It is interesting to see that, in some cases, good predictions can be made even using data from one week in the past.
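To make the evaluation concrete, here is a minimal sketch of how such a model could be fit and scored. It is not the authors' code: the numbers are made up, the log transform of both variables is my own assumption, and the helper names `fit_line` and `pearson` are hypothetical. For brevity, the correlation is computed on the same data used for fitting.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up illustration data: pre-release search counts and opening revenues.
search_counts = [1200, 5400, 800, 22000, 3100, 9700]
revenues = [2.1e6, 9.8e6, 1.5e6, 41.0e6, 4.2e6, 15.5e6]

# Assumption: work on a log scale, since both quantities are heavy-tailed.
log_search = [math.log(s) for s in search_counts]
log_revenue = [math.log(r) for r in revenues]

a, b = fit_line(log_search, log_revenue)
predicted = [a + b * x for x in log_search]

# Score the model by correlating predictions with actual outcomes.
r = pearson(predicted, log_revenue)
print(f"slope={b:.3f} correlation={r:.3f}")
```

A real evaluation would score predictions on data the model was not fit on; the in-sample number above is only meant to show the mechanics.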
The authors also evaluate baseline models on the proposed prediction tasks. For box-office revenue prediction, they use the budget, the number of opening screens, and box-office predictions from hsx.com. To predict game sales, they use critic ratings from vgchartz.com and revenues of previous editions of the game, if available. Music ranks for a given time t are predicted using ranks from two weeks before t. It is interesting to see that such baselines outperform the models based on search counts. When hybrid models are built, search volume data improves the results of the baselines only for non-sequel games (from 0.44 to 0.50) and music (from 0.70 to 0.87).
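The baseline versus hybrid comparison can be sketched as two nested linear models, one with the baseline feature only and one that also includes search volume. Everything below is illustrative: the data is invented, and `solve`, `ols`, `predict`, and `pearson` are hypothetical helpers of mine rather than anything from the paper.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def ols(rows, y):
    """Least squares with an intercept, via the normal equations."""
    X = [[1.0] + list(row) for row in rows]
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

def predict(coefs, rows):
    return [coefs[0] + sum(c * f for c, f in zip(coefs[1:], row)) for row in rows]

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs)
                           * sum((y - my) ** 2 for y in ys))

# Made-up data: one baseline feature, one search feature, one outcome.
baseline = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
search = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
outcome = [1.6, 1.9, 3.7, 3.8, 5.9, 5.6, 7.8, 7.9]

base_rows = [[b] for b in baseline]
hybrid_rows = [[b, s] for b, s in zip(baseline, search)]

r_base = pearson(predict(ols(base_rows, outcome), base_rows), outcome)
r_hybrid = pearson(predict(ols(hybrid_rows, outcome), hybrid_rows), outcome)
print(f"baseline r={r_base:.3f} hybrid r={r_hybrid:.3f}")
```

Note that when both models are scored on their own training data, as here, adding a feature can never lower the correlation, so a fair comparison of baseline and hybrid models needs held-out data.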
Not content with the previous results, the authors also discuss the use of search data to monitor influenza. They show that an auto-regressive model, which uses previous reports on influenza caseloads, achieves an accuracy of 0.86, while Google Flu Trends achieves an accuracy of 0.94. The results based on search volume are better, but the gain over the auto-regressive model is relatively small.
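For readers unfamiliar with auto-regressive models, the idea is simply to predict the current value of a series from its own past values. The sketch below fits the simplest version, with a single lagged term. The weekly caseload series is invented, the choice of a one-week lag is my assumption, and `fit_ar1` is a hypothetical helper, so this should not be read as the model used in the paper.

```python
import math

def fit_ar1(series, lag=1):
    """Fit y_t = a + b * y_(t - lag) by ordinary least squares."""
    xs = series[:-lag]   # lagged values
    ys = series[lag:]    # values to be predicted
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs)
                           * sum((y - my) ** 2 for y in ys))

# Made-up weekly caseloads for one flu season (arbitrary units).
caseloads = [1.0, 1.2, 1.5, 2.1, 3.0, 4.2, 5.1, 5.4, 4.8, 3.9, 3.0, 2.2, 1.6, 1.3]

lag = 1  # assumption: last week's report is available when predicting
a, b = fit_ar1(caseloads, lag)

# Predict each week from the report `lag` weeks earlier.
predicted = [a + b * y for y in caseloads[:-lag]]
actual = caseloads[lag:]

r = pearson(predicted, actual)
print(f"a={a:.3f} b={b:.3f} correlation={r:.3f}")
```

Even this very simple model tracks a smooth seasonal curve fairly well, which gives some intuition for why search data adds only a modest gain on top of it.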
I enjoyed reading this paper mainly because it is controversial. While several studies point toward the potential of search data for a variety of tasks, the authors challenge this hypothesis, and their results are very interesting. After reading so many papers that use Hadoop for data analysis, I am starting to wonder whether I should jump into Hadoop as well.
Link: http://www.pnas.org/content/early/2010/09/20/1005962107