This is not the first time I've read this paper. The first was at the very beginning of my M.Sc. studies, when I was trying to find a research problem. This time, I think I have a much better picture of the contributions of this work. The authors were affiliated with Carnegie Mellon University at the time (2006). The problem addressed is performing a random walk on a graph efficiently, especially in terms of query time. Random walk with restart gives a very interesting measure of proximity between two vertices. Nevertheless, existing methods may not satisfy some real-life constraints. With this in mind, the authors propose matrix manipulations that are the basis for two algorithms for random walks on graphs. The solution is not exact, but the authors show that it offers a good compromise between accuracy and execution time in real application scenarios.
Random walk with restart (RWR) works as follows. Given a normalized matrix W that represents a weighted graph and a restart probability (1-c) (at each step the walker follows an edge with probability c or jumps back to its starting vertex with probability 1-c), the probability of the random walk visiting each vertex of the graph is given by:
r_i = cWr_i + (1-c)e_i

Solving for r_i gives the closed form:

r_i = (1-c)(I-cW)^(-1)e_i
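To make the two baseline strategies concrete, here is a minimal NumPy sketch (my own illustration, not the paper's code; the toy graph and function names are made up): the direct solve corresponds to the PreCompute-style approach, and the fixed-point iteration to the OnTheFly-style approach.

```python
import numpy as np

def rwr_direct(W, c, i):
    """PreCompute-style: solve (I - cW) r = (1-c) e_i directly."""
    n = W.shape[0]
    e = np.zeros(n)
    e[i] = 1.0
    return (1 - c) * np.linalg.solve(np.eye(n) - c * W, e)

def rwr_iterative(W, c, i, tol=1e-10, max_iter=10000):
    """OnTheFly-style: iterate r <- cWr + (1-c)e_i until convergence."""
    n = W.shape[0]
    e = np.zeros(n)
    e[i] = 1.0
    r = e.copy()
    for _ in range(max_iter):
        r_new = c * (W @ r) + (1 - c) * e
        if np.abs(r_new - r).max() < tol:
            return r_new
        r = r_new
    return r

# Toy 4-node undirected graph, column-normalized adjacency matrix.
Adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
W = Adj / Adj.sum(axis=0)   # normalize columns so W is column-stochastic
r = rwr_direct(W, 0.9, 0)   # relevance of every vertex w.r.t. vertex 0
```

Since W is column-stochastic, the entries of r sum to 1, and both strategies return the same vector (the iteration converges because the spectral radius of cW is at most c < 1).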
The proposed method is based on the idea of decomposing W into two matrices according to a partitioning of the graph. The authors apply the METIS algorithm, but other partitioning techniques may be used as well. The first component of W (W_1) contains the edges inside partitions, and the second component (W_2) contains the edges between partitions. The method thus exploits the community structure that is expected to appear in real graphs. W_1 can be represented as a block-diagonal collection of many small matrices (one per partition), with zeros everywhere an edge would cross partitions. W_2 is decomposed into three matrices (W_2 ≈ USV) using a low-rank approximation. Given W_1 and W_2, we can reformulate the random walk expression as:
r_i = (1-c)(I - cW_1 - c USV)^(-1)e_i
According to the Sherman-Morrison lemma, we have:
r_i = (1-c)( ( I - cW_1 )^(-1) e_i + c ( I - cW_1 )^(-1) U (S^(-1) -cV( I - cW_1 )^(-1)U)^(-1) V ( I - cW_1 )^(-1) e_i)
Simplifying, if we write Q_1 = I - cW_1 and A = (S^(-1) - cVQ_1^(-1)U)^(-1), then:
r_i = (1-c)( Q_1^(-1) e_i + c Q_1^(-1) A V Q_1^(-1) e_i)
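Putting the pieces together, here is a hedged NumPy sketch of the whole pipeline on a toy graph: build the block-diagonal W_1 from a given partitioning, take a rank-k SVD of W_2 = W - W_1, and answer queries with the formula above. The partitioning, toy graph, and names are my own for illustration; the paper uses METIS and stores the per-partition inverses separately rather than one dense Q_1^(-1).

```python
import numpy as np

def blin_precompute(W, partitions, c, k):
    """Precomputation stage: Q1^{-1}, the rank-k SVD factors of W_2, and A."""
    n = W.shape[0]
    W1 = np.zeros_like(W)
    for part in partitions:                 # keep only within-partition edges
        W1[np.ix_(part, part)] = W[np.ix_(part, part)]
    W2 = W - W1                             # cross-partition edges
    U, s, V = np.linalg.svd(W2)
    U, S, V = U[:, :k], np.diag(s[:k]), V[:k, :]   # rank-k approximation
    Q1_inv = np.linalg.inv(np.eye(n) - c * W1)
    A = np.linalg.inv(np.linalg.inv(S) - c * V @ Q1_inv @ U)
    return Q1_inv, U, A, V

def blin_query(Q1_inv, U, A, V, c, i):
    """Query stage: r_i = (1-c)(Q1^{-1}e_i + c Q1^{-1} U A V Q1^{-1} e_i)."""
    e = np.zeros(Q1_inv.shape[0])
    e[i] = 1.0
    q = Q1_inv @ e
    return (1 - c) * (q + c * Q1_inv @ (U @ (A @ (V @ q))))

# Toy graph: partitions {0, 1} and {2, 3}, one cross-partition edge 1-2.
Adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
W = Adj / Adj.sum(axis=0)
c = 0.9
Q1_inv, U, A, V = blin_precompute(W, [[0, 1], [2, 3]], c, k=2)
r = blin_query(Q1_inv, U, A, V, c, 0)
# Here W_2 has rank 2, so with k = 2 the SVD is exact and r matches
# the direct solve; with smaller k the result becomes an approximation.
r_exact = (1 - c) * np.linalg.solve(np.eye(4) - c * W, np.eye(4)[:, 0])
```

The speedup comes from the query stage touching only small, precomputed factors: Q_1^(-1) applications decompose per partition, and A is a k-by-k matrix.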
Based on this formulation, two algorithms are proposed: B_LIN and NB_LIN. NB_LIN is a simplified version of B_LIN in which each vertex of W is treated as its own partition (i.e., the low-rank approximation is applied directly to W).
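In the NB_LIN case W_1 vanishes, so Q_1 = I and a query needs only the SVD factors. A minimal sketch under the same illustrative assumptions (toy graph and names are mine; with full-rank SVD the answer is exact, with smaller k it is approximate):

```python
import numpy as np

def nblin_precompute(W, c, k):
    """Low-rank approximation applied directly to W (so Q_1 = I)."""
    U, s, V = np.linalg.svd(W)
    U, S, V = U[:, :k], np.diag(s[:k]), V[:k, :]
    A = np.linalg.inv(np.linalg.inv(S) - c * V @ U)
    return U, A, V

def nblin_query(U, A, V, c, i):
    """r_i = (1-c)(e_i + c U A V e_i), i.e. the B_LIN formula with Q_1 = I."""
    e = np.zeros(U.shape[0])
    e[i] = 1.0
    return (1 - c) * (e + c * U @ (A @ (V @ e)))

# Triangle graph; with k = n the SVD is exact and so is the answer.
W = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
c = 0.9
U, A, V = nblin_precompute(W, c, k=3)
r = nblin_query(U, A, V, c, 0)
r_exact = (1 - c) * np.linalg.solve(np.eye(3) - c * W, np.eye(3)[:, 0])
```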
Jumping straight to the results, the proposed algorithms are applied to two image retrieval tasks and one neighborhood formation task. I found the results section boring, so I did not really try to understand what the tasks mean. Basically, the accuracy of the proposed methods is at least 90% of the exact ones. Moreover, they achieve up to a 150x speedup in response time compared to OnTheFly, and up to a 479x speedup in pre-computation time compared to PreCompute.
My general view of this paper is that it is theoretically strong for a CS paper but maybe not for a Mathematics paper (I would like to hear an expert's opinion on this). Also, the paper lacks elegance and didactic clarity, though I know these are very subjective matters. The results section is difficult to understand, and I missed a comparison against other fast random walk computation alternatives. Anyway, the paper was worth reading to give me a historical perspective on my understanding of more theoretical data mining papers. I spent some hours following the mathematical formulations throughout the paper, which I consider to be its biggest contributions.
Link: www.cs.cmu.edu/~htong/pdf/ICDM06_tong.pdf