Monday, May 7, 2012

Fast Random Walk with Restart and its Applications

After the longest hiatus in the short history of this blog, I'm back! April was not exactly what I would call a following-the-reading-plans month. First, I got stuck in the middle of a paper and, then, I had two deadlines in a row. One of the lessons I've learned so far is that, considering the kind of papers I've been trying to read, one paper a week may be too much for me. Reading one (non-trivial) paper a day is just insane. To bring my guilt back to recommended levels, I will try to read at least 10 papers this month. Let's see how it goes.

This is not the first time I've read this paper. The first was at the very beginning of my M.Sc. studies, when I was trying to find a research problem. This time, I think I've got a much better picture of the contributions of this work. The authors were affiliated with Carnegie Mellon University at the time (2006). The problem addressed is performing random walks in a graph efficiently, especially in terms of query time. Random walk with restart gives a very interesting measure of proximity between two vertices. Nevertheless, existing methods may not satisfy some real-life constraints. With this in mind, the authors propose matrix manipulations that are the basis for two algorithms for random walks in graphs. The solution is not exact, but the authors show that it offers a good compromise between accuracy and execution time in real application scenarios.

Random walk with restart (RWR) works as follows. Given a normalized matrix W that represents a weighted graph and a probability 1-c that, at each step, the walker jumps back to its starting vertex, the steady-state probability of the random walk visiting each vertex of the graph is given by:

r_i = cWr_i + (1-c)e_i

where e_i is an n×1 vector with 1 in the i-th position and 0 elsewhere. The vector r_i (n×1) gives the relevance scores of the n vertices in the graph with respect to vertex i. Solving the above equation for r_i yields:

r_i = (1-c)(I-cW)^(-1)e_i

The straightforward way to solve this system is by inverting the n×n matrix I - cW. The inversion is only required once, and many proximity queries can be answered at small cost afterwards. However, this operation takes O(n^3) time, which may be prohibitive. The most common alternative is to use an iterative method (the power method) to compute an approximate r_i for each query (vertex i). The authors call the first and the second methods PreCompute and OnTheFly, respectively. A third alternative is approximating W using a low-rank approximation, i.e., a simpler representation of W based on its linear correlations. This method is called Blk.
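To make the trade-off concrete, here is a minimal numpy sketch of the two baseline strategies on a toy graph (my own illustration, not the authors' code), using c as the walk-continuation probability to match the equations above:

```python
import numpy as np

# Toy undirected graph as an adjacency matrix; W is its column-normalized
# version, so each column of W sums to 1 (a valid transition matrix).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = A / A.sum(axis=0)

c = 0.9                 # probability of continuing the walk (restart prob. is 1-c)
n = W.shape[0]
i = 0                   # query vertex
e = np.zeros(n)
e[i] = 1.0

# PreCompute-style: solve (I - cW) r = (1-c) e directly. Inverting/factorizing
# I - cW once allows cheap queries later, but costs O(n^3) up front.
r_exact = (1 - c) * np.linalg.solve(np.eye(n) - c * W, e)

# OnTheFly-style: power iteration r <- cWr + (1-c)e until convergence,
# repeated from scratch for every query.
r = e.copy()
for _ in range(1000):
    r_new = c * W @ r + (1 - c) * e
    if np.abs(r_new - r).sum() < 1e-12:
        r = r_new
        break
    r = r_new

# Both give the same relevance vector, which sums to 1.
assert np.allclose(r, r_exact, atol=1e-8)
```

The iteration converges because c < 1 makes the update a contraction, so the per-query cost is a handful of sparse matrix-vector products instead of a dense inversion.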

The proposed method is based on the idea of decomposing W into two matrices according to a graph partitioning. The authors apply the METIS algorithm, but other partitioning techniques could be used as well. The first component of W (W_1) contains the edges inside partitions and the second component (W_2) contains the edges between partitions. The method thus exploits the community structure that is expected to appear in real graphs. W_1 is block-diagonal and can be represented by many small matrices (one per partition), since the entries for edges that cross partitions are zero. W_2 is decomposed into three matrices (W_2 = USV) using a low-rank approximation. Given W_1 and W_2, we can reformulate the random walk expression as:

r_i = (1-c)(I - cW_1 - c USV)^(-1)e_i

According to the Sherman-Morrison lemma, we have:


r_i = (1-c)((I - cW_1)^(-1)e_i + c(I - cW_1)^(-1)U(S^(-1) - cV(I - cW_1)^(-1)U)^(-1)V(I - cW_1)^(-1)e_i)

Simplifying, if Q_1^(-1) = (I - cW_1)^(-1) and A = (S^(-1) - cVQ_1^(-1)U)^(-1), then:

r_i = (1-c)(Q_1^(-1)e_i + cQ_1^(-1)AVQ_1^(-1)e_i)

In other (non-mathematical) words, r_i can now be computed based on the inversion of A (a t×t matrix, where t is the rank of the low-rank approximation) and a few other cheap operations.
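As a sketch of how this formula is used, here is a numpy toy version of the degenerate case W_1 = 0 (so Q_1 = I), with a truncated SVD as the low-rank approximation; this is my own illustration under those assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
M = rng.random((n, n))
W = M / M.sum(axis=0)        # column-normalized transition matrix

c = 0.5
e = np.zeros(n)
e[0] = 1.0                   # query vertex 0

# Low-rank approximation W ~ U S V via truncated SVD (numpy returns W = U S Vh).
t = 5                        # target rank; in real graphs t << n
Uf, s, Vf = np.linalg.svd(W)
U, S, V = Uf[:, :t], np.diag(s[:t]), Vf[:t, :]

# Precompute the t×t core A = (S^(-1) - cVU)^(-1); this replaces
# the n×n inversion of I - cW.
A_core = np.linalg.inv(np.linalg.inv(S) - c * V @ U)

# Per query: r_i ~ (1-c)(e_i + c U A V e_i), only small mat-vec products.
r_approx = (1 - c) * (e + c * U @ (A_core @ (V @ e)))

# Sanity check: with the full-rank factorization (t = n) the
# Sherman-Morrison identity is exact and matches the direct solve.
A_full = np.linalg.inv(np.linalg.inv(np.diag(s)) - c * Vf @ Uf)
r_full = (1 - c) * (e + c * Uf @ (A_full @ (Vf @ e)))
r_exact = (1 - c) * np.linalg.solve(np.eye(n) - c * W, e)
assert np.allclose(r_full, r_exact)
```

With t << n, the truncated version trades accuracy for speed: the quality of r_approx depends on how well the top-t singular triplets capture W, which is exactly the accuracy/time compromise the paper measures.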

Based on this formulation, two algorithms are proposed: B_LIN and NB_LIN. NB_LIN is a simplified version of B_LIN in which each vertex of W is treated as its own partition (i.e., the low-rank approximation is applied directly to W).

Jumping straight to the results, the proposed algorithms are applied to two image retrieval tasks and one neighborhood formulation task. I found the results section boring, so I did not really try to understand what the tasks mean. Basically, the accuracy of the proposed methods is at least 90% of the optimal one. Moreover, they achieve up to 150x speedup in response time compared to OnTheFly, and up to 479x speedup in pre-computation time compared to PreCompute.

My general view of this paper is that it is theoretically strong for a CS paper, but maybe not for a mathematics paper (I would like to hear an expert's opinion on this). Also, the paper lacks elegance and didactic quality, though I know these are very subjective matters. The results section is difficult to understand, and I missed a comparison against other fast random walk computation alternatives. Anyway, the paper was worth reading, and it gave me a historical perspective on my understanding of more theoretical data mining papers. I spent some hours following the mathematical formulations throughout the paper, which I consider to be its biggest contribution.

Link: www.cs.cmu.edu/~htong/pdf/ICDM06_tong.pdf
