Link prediction is a very interesting problem and the scenario presented in the paper is novel. To illustrate it better, lets consider a network extracted from email communication as example. In this network, each node is an address and links are messages sent. Therefore, the content of the message may be useful in the discovery of its list of recipients. In order to make use of such information, they propose the GC-Model (don't know what GC means). The content (words) of messages are represented as a "word document" and the list of possible subgraphs, with limited size, of the graph that describes the messages in thread as a "graph document". Therefore, each thread produces two documents. This nomenclature matches the concepts involved in LDA and other topic models.
The GC-model works as follows. Topic distributions are learned from content using LDA (I won't try to explain LDA here, but it is basically a generative model for the identification of topic, which are sets of words, based on documents). This is the content part. The probability of a graph word appearing in a document and the topic mixture of this document are related as follows:
P(word) = Sum_{i=1 to t} P(word | T_i).Q(T_i)
P(word | T_i) = P(T_i|word).P(word) / P(T_i)
Topic distribution: P(T_i) = Sum_{j=1 to n} P(T | Thread_j).P(Thread_j)
Thread probability: P(Thread_j) is computed based on the time between threads and their percentage of link overlap, which are binned and fitted to an exponential. Roughly, the probability of a thread structure occurring again decreases exponentially with the time since its most recent occurrence.
Word probability: P(word) = Sum_{j=1 to n} P(word|Thread_j).P(Thread_j)
Topic probability given the word: Estimated using sampling.
Based on this formulation, the probability of a graph word can be computed based on the thread topic distribution. Therefore, given training data (i.e., a set of training thread), this model can be used to predict communication structures in new threads based on their content.
The authors apply the proposed method to three real datasets: The Enron email network (threads are based on email subjects, 5k threads), and two Twitter datasets (threads defined based on replies and retweets, 60k and 400k threads). The length of graph words selected (empirically) is 2 and the evaluation metric used is F1 score. Baselines are standard metrics applied in link prediction (e.g., common neighbors), a naive technique that always predicts the most frequent link and other based on text distance, and the PropFlow (unsupervised and somewhat similar to PageRank) algorithm. The GC-model outperforms all the baseline methods.
The general idea of this paper is good, but I think it should be developed a little further. Regarding the presentation, I think it would be greatly improved by removing some pseudo-codes and adding more text to it. The experimental section is not very strong, specially because it seems like unfair comparing PropFlow and the other baselines against the GC-model, which is the only one that is supervised. Moreover, there is no much evidence that supports why GC-model is good.
Link: http://www.cs.ucsb.edu/~kpm/ASONAM2011_Macropol_Singh_final.pdf
Nenhum comentário:
Postar um comentário