Tuesday, March 27, 2012

Content-based Modeling and Prediction of Information Dissemination

This paper was written by two researchers from the University of California, Santa Barbara, and addresses the problem of predicting links that express direct communication, based on content information. It can be seen as a variation of the traditional link-prediction problem in which extra information about communication flows (a.k.a. threads) is available. The idea is to apply Latent Dirichlet Allocation (LDA) to learn topics from content features and then use the resulting topic distributions to predict the graph structure.

Link prediction is a very interesting problem, and the scenario presented in the paper is novel. To illustrate it, consider a network extracted from email communication. In this network, each node is an address and each link is a message sent between addresses. The content of a message may therefore be useful in discovering its list of recipients. To make use of such information, the authors propose the GC-model (I don't know what GC stands for). The content (words) of the messages in a thread is represented as a "word document", and the set of size-limited subgraphs of the graph describing the thread's messages is represented as a "graph document". Each thread thus produces two documents, and this nomenclature matches the concepts involved in LDA and other topic models.
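To make the two-document construction concrete, here is a minimal sketch in Python (my own illustration, not the authors' code; the thread representation and the subgraph encoding are assumptions):

    from itertools import combinations

    # A toy email thread: each message is (sender, [recipients], text).
    thread = [
        ("alice", ["bob", "carol"], "quarterly budget review meeting"),
        ("bob",   ["alice"],        "budget numbers attached"),
    ]

    # "Word document": the bag of content words of the whole thread.
    word_doc = [w for _, _, text in thread for w in text.split()]

    # Communication graph: one directed edge per (sender, recipient) pair.
    edges = {(s, r) for s, rcpts, _ in thread for r in rcpts}

    # "Graph document": every subgraph of up to k edges, each encoded as a
    # token (k = 2 matches the graph-word length the authors pick empirically).
    def graph_document(edges, k=2):
        tokens = []
        for size in range(1, k + 1):
            for sub in combinations(sorted(edges), size):
                tokens.append("|".join(f"{s}->{r}" for s, r in sub))
        return tokens

    print(word_doc)
    print(graph_document(edges))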

The GC-model works as follows. Topic distributions are learned from content using LDA (I won't try to explain LDA here, but it is basically a generative model that identifies topics, i.e., distributions over words, from documents). This is the content part. The probability of a graph word appearing in a document and the topic mixture of that document are related as follows:

P(word) = Sum_{i=1 to t} P(word | T_i).Q(T_i)

Where Q(T_i) is the topic probability in the given document and P(word | T_i) is defined using Bayes' theorem:

P(word | T_i) = P(T_i | word).P(word) / P(T_i)

Where T_i is a topic. Each component of the formula is computed as follows:

Topic distribution: P(T_i) = Sum_{j=1 to n} P(T_i | Thread_j).P(Thread_j)

Thread probability: P(Thread_j) is computed based on the time between threads and their percentage of link overlap, which are binned and fitted to an exponential. Roughly, the probability of a thread structure occurring again decreases exponentially with the time since its most recent occurrence.

Word probability: P(word) = Sum_{j=1 to n} P(word|Thread_j).P(Thread_j)

Topic probability given the word: Estimated using sampling.
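Pulling these components together, here is a minimal sketch of the scoring computation, reconstructed purely from the formulas above (the data structures, the decay rate, and the placeholder estimate of P(T_i | word) are all my assumptions):

    import math

    def thread_prob(age, rate=1.0):
        # P(Thread_j): the probability of a thread structure recurring decays
        # exponentially with the time since its last occurrence. The rate
        # would be fitted to the binned data; 1.0 is a placeholder.
        return math.exp(-rate * age)

    def topic_given_word(word, topics):
        # Placeholder for the sampling-based estimate of P(T_i | word);
        # uniform here so the sketch runs end to end.
        return {i: 1.0 / len(topics) for i in topics}

    def score_graph_word(word, threads, q):
        # P(word) = Sum_i P(word | T_i).Q(T_i) for a new thread with topic
        # mixture q. Each training thread is a dict with (assumed) keys:
        #   "p_word"  : {graph word: P(word | Thread_j)}, from word counts
        #   "p_topic" : {T_i: P(T_i | Thread_j)}, from the LDA posterior
        #   "age"     : time since the thread last occurred
        priors = [thread_prob(t["age"]) for t in threads]
        z = sum(priors)
        priors = [p / z for p in priors]  # normalized P(Thread_j)

        # Marginalize over training threads to get P(word) and P(T_i).
        p_word = sum(t["p_word"].get(word, 0.0) * pr
                     for t, pr in zip(threads, priors))
        p_topic = {i: sum(t["p_topic"].get(i, 0.0) * pr
                          for t, pr in zip(threads, priors)) for i in q}

        p_t_given_w = topic_given_word(word, q)

        # Bayes: P(word | T_i) = P(T_i | word).P(word) / P(T_i), then mix
        # the per-topic probabilities with Q(T_i).
        return sum((p_t_given_w[i] * p_word / p_topic[i]) * q[i]
                   for i in q if p_topic[i] > 0)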

Based on this formulation, the probability of a graph word can be computed from the thread's topic distribution. Therefore, given training data (i.e., a set of training threads), the model can predict communication structures in new threads from their content.
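Under those assumptions, prediction for a new thread would amount to ranking candidate graph words by this score and keeping the top-scoring ones, e.g. (reusing score_graph_word from the sketch above, with made-up numbers):

    # Topic mixture Q(T_i) inferred for the new thread from its content.
    q = {0: 0.7, 1: 0.3}

    # Two hypothetical training threads.
    threads = [
        {"p_word": {"alice->bob": 0.4, "bob->carol": 0.1},
         "p_topic": {0: 0.9, 1: 0.1}, "age": 2.0},
        {"p_word": {"bob->carol": 0.5},
         "p_topic": {0: 0.2, 1: 0.8}, "age": 7.0},
    ]

    candidates = ["alice->bob", "bob->carol"]
    ranked = sorted(candidates, reverse=True,
                    key=lambda w: score_graph_word(w, threads, q))
    print(ranked)  # predicted structure = top-scoring graph words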

The authors apply the proposed method to three real datasets: the Enron email network (threads defined by email subjects, 5k threads) and two Twitter datasets (threads defined by replies and retweets, 60k and 400k threads). The graph-word length, selected empirically, is 2, and the evaluation metric is the F1 score. The baselines are standard link-prediction metrics (e.g., common neighbors), a naive technique that always predicts the most frequent link, another based on text distance, and the PropFlow algorithm (unsupervised and somewhat similar to PageRank). The GC-model outperforms all the baseline methods.

The general idea of this paper is good, but I think it should be developed a little further. Regarding the presentation, the paper would be greatly improved by removing some of the pseudo-code and adding more explanatory text. The experimental section is not very strong, especially because it seems unfair to compare PropFlow and the other baselines against the GC-model, which is the only supervised method. Moreover, there is not much evidence supporting why the GC-model performs well.

Link: http://www.cs.ucsb.edu/~kpm/ASONAM2011_Macropol_Singh_final.pdf
