Saturday, January 21, 2012

The Party is Over Here: Structure and Content in the 2010 Election

In this paper, researchers from the University of Michigan study the use of Twitter by candidates in the 2010 U.S. Senate and gubernatorial elections. The work focuses on content and structural features associated with Democratic, Republican, and Tea Party candidates. The objective is to understand the impact of such features on the election results. In other words, the general goal is to assess the potential of Twitter for predicting election outcomes, which I consider a quite challenging task.

The dataset used in the paper contains 460,038 tweets and 137,376 pages posted by 687 candidates during the 3.5 years preceding election day. The authors assume that web pages mentioned by the candidates express their opinions. Regarding the topology of the network, the paper shows that it reflects the party structure very well, with many edges among candidates from the same party and the Tea Party appearing as a subset of the Republican network. In terms of density, the Democrats induce the sparsest network and the Republicans the densest one among the parties. Moreover, Tea Party members are more active than Republicans and Democrats on Twitter.
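Just to make the notion of network density concrete, here is a minimal sketch of mine (not from the paper) that compares the density of each party's follower subgraph with networkx; the candidates, edges, and party labels are hypothetical placeholders.

import networkx as nx

# Candidate -> party affiliation (hypothetical sample)
party = {"a": "D", "b": "D", "c": "R", "d": "R", "e": "TP"}

# Directed follower edges among candidates (hypothetical sample)
G = nx.DiGraph([("a", "b"), ("c", "d"), ("d", "c"), ("e", "c")])
G.add_nodes_from(party)

for p in ("D", "R", "TP"):
    members = [u for u, lbl in party.items() if lbl == p]
    sub = G.subgraph(members)
    # Density of a directed graph: edges / (n * (n - 1))
    print(p, nx.density(sub))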

The content of tweets and pages is modeled with language models, which represent a body of text as a probability distribution over terms.

Weight of a term t for a user u:
w(t,u) = TF(t,D_u).udf(t,D_u).idf(t,D)

where D_u is the set of documents posted by candidate u, D is the whole collection of documents (pages + tweets), t is a term from the vocabulary V, udf(t,D_u) = df(t,D_u) / |D_u|, idf(t,D) = log(1 + |D| / df(t,D)), and TF(t,D_u) = Sum_{d in D_u} tf(t,d) / |D_u|.
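As a rough illustration of this weighting scheme, the sketch below computes w(t,u) assuming each document is a plain list of tokens; the function and variable names are mine, not from the paper.

import math

def term_weight(t, user_docs, all_docs):
    # w(t,u) = TF(t,D_u).udf(t,D_u).idf(t,D); documents are lists of tokens
    n_u = len(user_docs)
    # TF: average raw count of t across the user's documents
    TF = sum(doc.count(t) for doc in user_docs) / n_u
    # udf: fraction of the user's documents containing t
    udf = sum(1 for doc in user_docs if t in doc) / n_u
    # idf: computed over the whole corpus D (tweets + pages)
    df = sum(1 for doc in all_docs if t in doc)
    idf = math.log(1 + len(all_docs) / df) if df else 0.0
    return TF * udf * idf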

Probability of a term t in the corpus D:
P(t|D) = TF(t,D).udf(t,D)

Normalized probabilities:
w(t,u) = w(t,u) / (Sum_{t in V} w(t,u))

P(t|D) = P(t|D) / (Sum_{t in V} P(t|D))

Smoothing:
P(t|u) = (1 - y).w(t,u) + y.P(t|D)

where y is a smoothing parameter that controls the interpolation with the corpus model.

Final (normalized again) probability of a term for a user:
P(t|u) = P(t|u) / (Sum_{t in V}P(t|u))

In order to compute the probability of a term for a given party, it suffices to replace the documents of user u with those of all candidates from that party.
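Continuing the sketch above (it reuses the term_weight function), the snippet below strings together the corpus probability, normalization, and smoothing to build the final per-user distribution; the interpolation weight y = 0.5 is an arbitrary choice of mine, and the party-level model is obtained by pooling documents as noted in the comment.

def corpus_prob(t, all_docs):
    # Unnormalized P(t|D) = TF(t,D).udf(t,D)
    n = len(all_docs)
    TF = sum(doc.count(t) for doc in all_docs) / n
    udf = sum(1 for doc in all_docs if t in doc) / n
    return TF * udf

def normalize(dist):
    # Rescale so the values sum to 1
    total = sum(dist.values())
    return {t: v / total for t, v in dist.items()} if total else dist

def user_language_model(user_docs, all_docs, vocab, y=0.5):
    # y is the smoothing (interpolation) weight; 0.5 is arbitrary here
    w = normalize({t: term_weight(t, user_docs, all_docs) for t in vocab})
    p_D = normalize({t: corpus_prob(t, all_docs) for t in vocab})
    smoothed = {t: (1 - y) * w[t] + y * p_D[t] for t in vocab}
    return normalize(smoothed)

# For a party-level model, pool the documents of all its candidates:
# party_docs = [doc for u in party_members for doc in docs_of[u]]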

Two distributions are compared through the Kullback-Leibler (KL) divergence. The non-symmetric version of the KL divergence is expressed as:

D(P_1,P_2) = Sum_{t in V} P_1(t).log(P_1(t) / P_2(t))

Moreover, a symmetric version of the KL divergence is defined as follows:

D_sym(P_1,P_2) = Sum_{t in V} P_1(t).log(P_1(t) / P_2(t)) + Sum_{t in V} P_2(t).log(P_2(t) / P_1(t))
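For concreteness, the two divergences translate directly into code; this sketch assumes both distributions are dicts over the same vocabulary with no zero probabilities, which is exactly what the smoothing step above is meant to guarantee.

import math

def kl(p1, p2):
    # Non-symmetric KL divergence D(P1, P2)
    return sum(p1[t] * math.log(p1[t] / p2[t]) for t in p1)

def symmetric_kl(p1, p2):
    # Symmetrized version: D(P1, P2) + D(P2, P1)
    return kl(p1, p2) + kl(p2, p1)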

The KL divergence is applied to compare candidates with each other, candidates with parties, and candidates with the corpus. Moreover, the non-symmetric version is used to compare a single term against a distribution. The paper presents the top terms that define each party (i.e., terms with the highest marginal KL compared to the corpus): education (Democrats), spending (Republicans), and barney_frank (Tea Party). Barney Frank is a very prominent Democratic congressman who is also gay, but I do not have enough information to understand what Tea Party members are saying about him. By comparing the profiles of each pair of candidates from the same party, it is shown that the Tea Party is the most cohesive party, while the Democrats are the least cohesive. Moreover, content and network distance are correlated in the dataset (i.e., candidates that tweet similar content are close in the network).

The final part of the paper studies the problem of predicting the election results. The method applied is logistic regression with the following variables: closeness (incoming, outgoing, and all paths), HITS, PageRank, in/out degree, KL of the party, party, same-party (i.e., whether the candidate belongs to the party that held the seat in the previous term), incumbent (i.e., whether the candidate held the seat in the previous term), and Twitter statistics (tweets, retweets, hashtags, etc.). The single variables that achieve the best results are, in this order: same-party, incumbent, in-degree, and closeness (all paths). An interesting finding is that being close to the party and to the corpus, in terms of content, is better than being different, which means that focusing on mainstream issues is a good strategy. Moreover, Twitter statistics are uninformative for the prediction task. The results considering all variables show that the incumbent, party, and same-party variables achieve an accuracy of 81.5%, and the inclusion of content and structure variables leads to an accuracy of 88% in predicting election winners.
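To make the setup concrete, here is a hedged sketch of a logistic-regression winner predictor over a handful of the features listed above; the feature columns and values are purely illustrative and do not reproduce the paper's feature engineering or results.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: [incumbent, same_party, in_degree, kl_to_party]; values are made up.
X = np.array([
    [1, 1, 120, 0.35],
    [0, 0,  15, 0.80],
    [0, 1,  60, 0.50],
    [1, 0,  90, 0.40],
])
y = np.array([1, 0, 1, 1])  # 1 = candidate won the race

model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.predict(X), round(model.score(X, y), 2))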

I confess that I expected more from this paper. It was good to learn more about language models, but the results presented are somewhat disappointing. While the content and structure analysis is limited to simple statistics and expected findings, the prediction task is not very conclusive. In particular, the proposed models seem to explain the 2010 election results well rather than serve as generally effective prediction models.

Link: misc.si.umich.edu/publications/69
