Thursday, February 23, 2012

The Pulse of News in Social Media: Forecasting Popularity

This is another paper in the series "Predicting popularity of Online Content", written by researchers from UCLA and HP Labs, including Bernardo Huberman. Unlike a previous paper I have read, here the authors study the problem of predicting the popularity of news articles based on properties of the articles themselves, not on early popularity records. This problem is very relevant for journalists, content providers, advertisers, and news recommendation systems because of the convergence of news and social media in the dissemination of content on the Web.

The dataset used in this study comes mainly from two sources. The first is Feedzilla, a news aggregator, and the second is Twitter. For a given news article from Feedzilla, all tweets that mentioned its URL were crawled. The properties of an article are: source, category, subjectivity, and named entities mentioned. For each of these properties, the authors define a score called t-density. The score of a category is the number of tweets per article in that category. Subjectivity is a binary attribute defined using a subjectivity classifier. This classifier was trained with TV and radio show transcripts as the subjective class, and articles from C-SPAN (politics and history) and First Monday as the objective class. Named entities were identified using the Stanford NER tool. For a given article, the score is the average score of the entities it mentions. The source score is based on the success of the given source in another sample of tweets (different from the main dataset). Scores were smoothed and normalized to reduce the effect of temporal variations.
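To make the category score concrete, here is a minimal sketch in Python of a tweets-per-article t-density computation. The data layout (a list of dicts with "category" and "num_tweets" fields) is my own assumption for illustration; the paper only specifies that the score is the number of tweets per article.

    from collections import defaultdict

    def category_t_density(articles):
        # t-density of a category: total tweets divided by the number of
        # articles in that category (field names are hypothetical).
        tweet_sums = defaultdict(int)
        article_counts = defaultdict(int)
        for a in articles:
            tweet_sums[a["category"]] += a["num_tweets"]
            article_counts[a["category"]] += 1
        return {c: tweet_sums[c] / article_counts[c] for c in article_counts}

    sample = [
        {"category": "Technology", "num_tweets": 30},
        {"category": "Technology", "num_tweets": 10},
        {"category": "Politics", "num_tweets": 5},
    ]
    print(category_t_density(sample))  # {'Technology': 20.0, 'Politics': 5.0}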

Based on information about Google News made available by KnewsKnife (a news-site rating service), the paper investigates whether the top sources found using their score measure match the top sources on Google News. The conclusion is that they do not. Although the numbers of tweets and articles from a given source are somewhat positively correlated with the Google News rating, the proposed score is not correlated with the ratings. In fact, the top sources on Google News are traditional media websites (e.g., WSJ, Reuters), while the top sources identified in the study are web marketing blogs (e.g., Mashable, Google Blog).
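To illustrate the kind of comparison being made, a rank correlation between the site ratings and the t-density source scores could be computed as below; the numbers are invented purely for illustration, and scipy is assumed to be available.

    from scipy.stats import spearmanr

    # Hypothetical ratings and t-density scores for five sources (made-up
    # values): traditional outlets rate highly but attract few tweets per
    # article, while tech/marketing blogs show the opposite pattern.
    ratings = [9.1, 8.7, 8.2, 6.5, 5.9]
    source_scores = [3.0, 2.5, 4.1, 35.0, 28.0]

    rho, p = spearmanr(ratings, source_scores)
    print(f"Spearman rho = {rho:.2f}, p-value = {p:.2f}")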

Two versions of the popularity prediction problem are considered in the paper: regression and classification. In the regression problem, the goal is to predict the exact number of tweets mentioning a given article using the features just described. Three models (linear regression, KNN, and SVM) were applied, and the linear model achieved the best results (R^2 = 0.34). The result is even better when only one category (technology) is considered (R^2 = 0.43). The authors argue that this is due to the overlap across categories on Feedzilla. In general, the most important evidence is the source of the article. In the classification problem, three classes were defined (1-20, 20-100, and 100-2400 tweets). Four classification models were considered (SVM, decision tree, bagging, and naive Bayes). The bagging and decision tree algorithms achieved the best result (84% accuracy). Again, the most important feature is the source.
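The paper does not provide code, but a rough scikit-learn sketch of the two setups might look like the following. The feature matrix here is synthetic random data, the four columns are my guesses at the scores described above, and the model choices simply mirror the ones listed in the paper, so the printed numbers mean nothing beyond showing the pipeline.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import BaggingClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-in for the real features: source, category,
    # subjectivity, and named-entity t-density scores per article.
    rng = np.random.default_rng(0)
    X = rng.random((500, 4))
    tweets = rng.integers(1, 2400, size=500)

    # Regression: predict the (log) number of tweets with a linear model.
    r2 = cross_val_score(LinearRegression(), X, np.log(tweets), cv=5,
                         scoring="r2").mean()
    print("linear regression R^2:", round(r2, 3))

    # Classification: bucket articles into the three classes used in the
    # paper (1-20, 20-100, 100-2400 tweets) and compare four classifiers.
    y = np.digitize(tweets, bins=[20, 100])
    for clf in (SVC(), DecisionTreeClassifier(),
                BaggingClassifier(DecisionTreeClassifier()), GaussianNB()):
        acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
        print(type(clf).__name__, round(acc, 3))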

I did not like this paper very much. I think it is due to the lack of theoretical background and the excess of ad-hoc decisions w.r.t. the datasets and the scoring schemes. The results show that the source determines the popularity of news articles, which was somewhat expected. Maybe a more important question would be: what makes an article from an ordinary blog or news website become popular? Or what makes an article from a famous blog or news website fail?

Link: http://arxiv.org/abs/1202.0332
