Monday, March 5, 2012

How (not) to predict elections

This is a very simple but still interesting paper that brings some skepticism to the current hype about making predictions, especially about elections, using social media data. I would say it is somewhat similar to the paper 'Predicting Consumer Behavior with Web Search', summarized here 2 months ago. The authors are from Wellesley College (US) and Universidad de Oviedo (Spain), and I confess I had never heard of any of them.

Several recent papers have reported positive results on the use of social media for the real-time prediction of many outcomes. Similar results have also been reported for search engine data. These studies have promptly attracted the interest of the general public. In this paper, the authors revisit some of these studies and show that the results achieved are actually disappointing or misleading. The authors emphasize that fair baselines, such as the incumbency (re-election) rate, are not considered, and also that the proposed methods perform much worse than professional pollsters or even chance.

In the experimental section, two Twitter datasets are used: one related to the Senate election in Massachusetts (about 234K tweets) and the other a uniform sample of tweets related to 6 races for the US Senate (about 13K tweets). Two published strategies for predicting election results are evaluated, one based on the number of mentions of each candidate and the other based on the sentiment associated with terms contained in the tweets. Results show that the number of mentions predicted the victory of the wrong candidate in Massachusetts, while sentiment analysis predicted the right one. In the other dataset, the number of mentions predicted 3 out of 6 results correctly, the same performance achieved by sentiment analysis. In a further analysis, the authors show that the automatic sentiment analysis strategy used achieves very low accuracy (37%). A campaign against a given candidate had most of its tweets classified as positive or neutral towards her. Finally, the aggregate opinion of all tweets from users is only weakly correlated with the average ADA (Americans for Democratic Action) score of the candidates they follow.
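To make the mention-count strategy concrete, here is a minimal sketch of how such a predictor could look. It is not the authors' code: the function name `predict_by_mentions`, the placeholder candidate names, and the toy tweets are all made up for illustration, and matching is a plain case-insensitive substring test.

```python
from collections import Counter


def predict_by_mentions(tweets, candidates):
    """Pick the candidate named in the most tweets (hypothetical helper).

    Each tweet counts at most once per candidate. Sentiment is ignored,
    so a tweet attacking a candidate still counts in her favor.
    """
    counts = Counter({name: 0 for name in candidates})
    for text in tweets:
        lowered = text.lower()
        for name in candidates:
            if name.lower() in lowered:
                counts[name] += 1
    winner = max(candidates, key=lambda name: counts[name])
    return winner, dict(counts)


# Toy, made-up tweets with placeholder candidate names (assumptions).
toy_tweets = [
    "Candidate_A spoke at the rally today",
    "I will not vote for candidate_a",
    "Candidate_B has my vote",
]
winner, counts = predict_by_mentions(toy_tweets, ["Candidate_A", "Candidate_B"])
print(winner, counts)
```

Note that the second toy tweet is against Candidate_A and still adds to her count, which is the kind of weakness the paper points out.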

Based on these results, the authors propose a set of guidelines for the study of election prediction techniques: (1) Adjustments should be determined beforehand (they always seem obvious afterwards), (2) Social media data should not be treated as a natural phenomenon, since it can be manipulated, (3) We need to understand why a given method works, (4) Proper baselines (e.g., incumbency) should be considered, and (5) Existing polling strategies, based on statistically significant samples, should be considered.
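Guideline (4) can be illustrated with a small sketch that scores a prediction method against an "incumbent always wins" baseline. Everything here is an assumption for illustration: the helper names, the race identifiers, and the toy outcomes are invented and are not data from the paper.

```python
def accuracy(predictions, outcomes):
    """Fraction of races where the predicted winner matches the real one."""
    correct = sum(1 for race, won in outcomes.items() if predictions.get(race) == won)
    return correct / len(outcomes)


def incumbency_baseline(incumbents):
    """Baseline that simply predicts every incumbent is re-elected."""
    return dict(incumbents)


# Made-up toy races; "A" and "B" are placeholder candidate labels.
outcomes = {"r1": "A", "r2": "B", "r3": "A", "r4": "B"}
incumbents = {"r1": "A", "r2": "B", "r3": "A", "r4": "A"}
method_predictions = {"r1": "A", "r2": "A", "r3": "B", "r4": "B"}

baseline_acc = accuracy(incumbency_baseline(incumbents), outcomes)
method_acc = accuracy(method_predictions, outcomes)
print(f"baseline={baseline_acc:.2f} method={method_acc:.2f}")
```

In this toy setup the baseline beats the method, so reporting the method's accuracy alone would look better than it deserves.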

This kind of paper makes an interesting contribution to any research area, especially when it criticizes research lines that attract a lot of publicity without the proper scientific rigor. I would be really concerned if I were cited by a paper like this. Publishing misleading results on purpose is unethical. Although seeing a researcher on NYT or CNN is always inspiring, there is no space for flawed results in serious research.

Link: http://bit.ly/qUHJyv
