Tuesday, May 8, 2012

Proximity Tracking on Time-Evolving Bipartite Graphs

This paper studies the problem of computing proximity and centrality in bipartite graphs that evolve over time. The researchers involved in this work are from Carnegie Mellon University (including Christos Faloutsos), the IBM T.J. Watson Lab, and the University of Illinois at Chicago (Philip Yu). It won the best paper award at the 2008 edition of the SIAM Data Mining Conference.

It is easy to think of application scenarios where entities can be modeled as a bipartite graph and the proximity/centrality of these entities is somehow relevant. The example used throughout the paper is a graph that connects authors to conferences. In this case, it is useful to compute the proximity between two authors (i.e., how related their work is in terms of the conferences they publish in), between an author and a conference (i.e., how important a conference is to an author), between a conference and an author (i.e., how important an author is to a conference), and between two conferences (i.e., how close the research communities that publish in them are). Based on proximity, it is easy to derive the centrality of authors and conferences as the average proximity from all conferences and authors to a specific author or conference. One of the figures in the paper shows the centrality of the top-centrality authors in NIPS and Philip Yu's top conferences in 5-year windows.

The authors define proximity in terms of random walk with restart. In a static setting, the proximity matrix Q is defined as follows:

Q = (1-c).(I_{(n+l)x(n+l)} - c.D^-1.W)^-1

where (1-c) is the restart probability, I is the (n+l)x(n+l) identity matrix, D is the degree matrix (i.e., a diagonal matrix containing the degree of each vertex in the graph), and W is the adjacency matrix (the graph is undirected and weighted).
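
To make this concrete, here is a minimal sketch in Python/NumPy that computes Q directly for a toy author-conference graph (the matrix M and the value of c below are mine, just for illustration):

    import numpy as np

    # Toy author-conference graph: 3 authors, 2 conferences.
    # M(i,j) = weight of the edge between author i and conference j.
    M = np.array([[2.0, 0.0],
                  [1.0, 1.0],
                  [0.0, 3.0]])
    n, l = M.shape

    # Full (n+l)x(n+l) adjacency matrix of the undirected bipartite graph.
    W = np.block([[np.zeros((n, n)), M],
                  [M.T, np.zeros((l, l))]])
    D = np.diag(W.sum(axis=1))   # degree matrix
    c = 0.95                     # so the restart probability 1-c is 0.05

    Q = (1 - c) * np.linalg.inv(np.eye(n + l) - c * np.linalg.inv(D) @ W)
    print(Q[0, n])               # proximity from author 0 to conference 0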

In a dynamic setting, which is the focus of the paper, the matrices D and W change over time. The authors propose three updating schemes for W: global aggregation (i.e., new edges are simply added), a sliding window (i.e., only edges inside the current time window are considered), and exponential weighting (i.e., more recent edges have their weights amplified exponentially). Regarding the matrix D, the authors propose the use of a constant matrix, a times the degree matrix of the first time step, in order to guarantee the monotonicity of the proximity measure.
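
A rough sketch of the three aggregation schemes (the function names and the decay parameter beta are mine, not the paper's notation, and the paper's exact weighting may differ):

    import numpy as np

    # slices[t] is the matrix of edge weights arriving at time step t.
    def global_aggregation(slices):
        # all edges seen so far are simply added up
        return sum(slices)

    def sliding_window(slices, w):
        # only edges inside the current window of w time steps count
        return sum(slices[-w:])

    def exponential_weighting(slices, beta):
        # recent edges are amplified exponentially: a slice that is "age"
        # steps old is scaled by beta^age, for some 0 < beta < 1
        T = len(slices)
        return sum(beta ** (T - 1 - t) * S for t, S in enumerate(slices))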

In a previous work, some of the authors of this paper have shown how proximity can be computed efficiently for general graphs that present linear correlations and block-wise community structure. The basis of this idea is the Sherman-Morrison lemma, which gives a formulation for the inverse of a sum of matrices. Instead of inverting an (n+l)x(n+l) matrix, the formulation allows the pre-computation of the inverse of an lxl matrix A (l is supposed to be much smaller than n), defined as:

A = (I - c^2.Mc.Mr)^-1

where Mr = D1^-1.M and Mc = D2^-1.M^T, with D1 and D2 the degree blocks of the type 1 and type 2 nodes, respectively.

Given the matrix A, proximity can be computed as follows:

if i <= n and j <= n (e.g., author-author):
q(i,j) = 1(i=j) + c^2.Mr(i,:).A.Mc(:,j)

if i <= n and j > n (e.g., author-conference):
q(i,j) = c.Mr(i,:).A(:,j-n)

if i > n and j <= n (e.g., conference-author):
q(i,j) = c.A(i-n,:).Mc(:,j)

if i > n and j > n (e.g., conference-conference):
q(i,j) = A(i-n,j-n)
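
Putting the pieces together, here is a minimal sketch that pre-computes the lxl matrix A for the same toy graph as above and evaluates the four cases, checking them against the direct inversion (the (1-c) factor is omitted on both sides, as in the formulas above):

    import numpy as np

    # Same toy graph as above: 3 authors (type 1), 2 conferences (type 2).
    M = np.array([[2.0, 0.0],
                  [1.0, 1.0],
                  [0.0, 3.0]])
    n, l = M.shape
    c = 0.95

    D1 = np.diag(M.sum(axis=1))    # degrees of type 1 nodes (authors)
    D2 = np.diag(M.sum(axis=0))    # degrees of type 2 nodes (conferences)
    Mr = np.linalg.inv(D1) @ M     # n x l
    Mc = np.linalg.inv(D2) @ M.T   # l x n

    # Only an l x l matrix is inverted.
    A = np.linalg.inv(np.eye(l) - c**2 * Mc @ Mr)

    def q(i, j):
        if i < n and j < n:        # author-author (0-based indices)
            return float(i == j) + c**2 * Mr[i, :] @ A @ Mc[:, j]
        if i < n and j >= n:       # author-conference
            return c * Mr[i, :] @ A[:, j - n]
        if i >= n and j < n:       # conference-author
            return c * A[i - n, :] @ Mc[:, j]
        return A[i - n, j - n]     # conference-conference

    # Sanity check against the direct (n+l)x(n+l) inversion.
    W = np.block([[np.zeros((n, n)), M], [M.T, np.zeros((l, l))]])
    Q = np.linalg.inv(np.eye(n + l) - c * np.linalg.inv(np.diag(W.sum(axis=1))) @ W)
    assert np.allclose(q(0, 0), Q[0, 0]) and np.allclose(q(1, n), Q[1, n])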

Recomputing the matrix A from scratch for every update of the bipartite graph, which costs O(l^3 + m.l), is prohibitive in a dynamic setting. So, the authors propose two solutions:

1) Fast-single-update: The matrix A is updated based on a single edge update (a rough sketch of the idea is given after this list).
2) Batch update: For a set of edge updates, instead of performing several single updates, the authors propose an alternative strategy. Since the updates are expected to involve a small set of records (authors and conferences), the idea is to exploit the low-rank property of the update matrices. I did not go into the details of the method.
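
Here is my own Woodbury-based reconstruction of the single-update idea, not the paper's exact Fast-single-update formulas: assuming the degree matrix is kept fixed (as described above), adding weight dw to edge (i0, j0) touches a single entry of Mr and of Mc, so A^-1 changes by a rank-2 correction and A can be refreshed with O(l^2) work instead of a fresh O(l^3) inversion:

    import numpy as np

    def fast_single_update(A, Mr, Mc, i0, j0, dw, d1_i0, d2_j0, c):
        # dw is the weight added to edge (author i0, conference j0);
        # d1_i0 and d2_j0 are the *fixed* degrees used to normalize
        # row i0 of Mr and row j0 of Mc.
        l = A.shape[0]
        dr = dw / d1_i0                      # only Mr[i0, j0] changes...
        dc = dw / d2_j0                      # ...and only Mc[j0, i0] changes
        Mr_new, Mc_new = Mr.copy(), Mc.copy()
        Mr_new[i0, j0] += dr
        Mc_new[j0, i0] += dc

        # The change to Mc @ Mr is rank-2: Mc_new @ Mr_new - Mc @ Mr = U @ V.
        e = np.zeros(l); e[j0] = 1.0
        U = np.column_stack([dr * Mc_new[:, i0], dc * e])   # l x 2
        V = np.vstack([e, Mr[i0, :]])                       # 2 x l

        # Sherman-Morrison-Woodbury: invert a 2x2 system instead of an lxl one.
        AU = c**2 * (A @ U)
        A_new = A + AU @ np.linalg.inv(np.eye(2) - V @ AU) @ (V @ A)
        return A_new, Mr_new, Mc_new

With the variables from the previous sketch (and the fixed degrees taken from D1 and D2), the returned A_new matches recomputing the inverse from the updated Mr and Mc directly, up to floating-point error.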

The proposed techniques are applied in two scenarios:
1) Track-centrality: Tracking the top-10 most important type 1 or type 2 nodes over time.
2) Track-proximity: Tracking the top-k most relevant type 1 or type 2 objects for a particular query object over time (both applications are sketched below).
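
Both boil down to ranking nodes by proximity scores at each time step. A minimal sketch on top of a proximity function q(i, j) like the one above (the helper names and the candidate-set argument are mine; at every time step one would update A and re-run the rankings):

    import numpy as np

    def track_centrality(q, n, l, k=10):
        # centrality of a node = average proximity from all nodes to it;
        # return the indices of the k most central nodes
        cent = [np.mean([q(i, j) for i in range(n + l)]) for j in range(n + l)]
        return list(np.argsort(cent)[::-1][:k])

    def track_proximity(q, src, candidates, k=5):
        # the k nodes in `candidates` with the highest proximity from `src`
        return sorted(candidates, key=lambda j: -q(src, j))[:k]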

They apply track-centrality and track-proximity to 5 datasets. NIPS is based on the proceedings of the NIPS conference; type 1 nodes are authors and type 2 nodes are papers. DM, AC, and ACPost were extracted from DBLP: DM is an author-paper graph for a set of data mining conferences; AC is an author-conference graph from the entire DBLP; ACPost is a similar graph, but with timestamps of finer granularity (the date on which the paper was entered into the database). Netflix is a user-movie graph from the Netflix prize and has a single timestamp.

The applications described in this paper are much more interesting than those from the previous paper (Fast Random Walk with Restart and its Applications). They show the top-10 most influential authors for NIPS over time and the top-5 most related authors for Jian Pei (from SFU). Moreover, they present the proximity rank of KDD w.r.t. VLDB over time, which has increased steadily since 1996 (KDD was the 5th closest conference to VLDB in 2007, behind SIGMOD, ICDE, PODS, and EDBT). I will not discuss all the case studies here.

In terms of performance, they show that Fast-single-update is up to 176X faster than straight update (i.e., re-computing A for every update). They also show that Fast-batch-update is up to 32X faster than straight update. One thing I missed was a direct comparison between Fast-single-update and Fast-batch-update.

This paper is very good. The authors were able to navigate from a well-defined problem, to a complex but efficient algorithm, and then to cool application scenarios. I have not seen many successful cases like this so far. I think that some examples and general intuitions could have helped me understand the math applied in the paper. Also, I had to read a previous paper to get a better idea of the contributions of this one, i.e., it is not self-contained.

Link: www.siam.org/proceedings/datamining/2008/dm08_64_Tong.pdf
