quinta-feira, 15 de março de 2012

PageRank: Standing on the shoulders of giants

This paper presents the PageRank algorithm from a historical perspective. PageRank is an algorithm for ranking web pages in terms of relevance based on the structure of the web graph. It was proposed by Sergey Brin and Larry Page in 1998 and has been applied by the Google search engine. This paper is authored by Massimo Fraceschet, from University of Udine, Italy, and was published by the Communications of the ACM magazine.

The HTML protocol, which is the basis for the Web as we know, was invented in 1989 by Berners-Lee while he was working at CERN. PageRank was one of the first efforts on the use of the web topology (that emerges from links between pages) to measure the importance of a page. The basic idea of PageRank is that a web page is important if it is pointed to by other important links. The formulation of the score is as follows:

PI = alpha.PI.S + (1-alpha).u

Where alpha is the damping factor, S is the adjacency matrix of the graph after dangling pages (i.e., without output links) have their output probabilities set as uniform, and u is a uniform probability vector.

The PageRank equation is known to have a unique solution. The existence of at least one solution comes from the fact that the transformed adjacency matrix is stochastic (i.e., all rows sum up to 1). Moreover, according to the Perron-Frobenius theorem,  the equation has a single solution.

Perron-Frobenius theorem: If A is irreducible (i.e., its associated directed graph is strongly connected) nonnegative square matrix, then there exists a unique vector x such that xA = rx, x > 0, and Sum_{i} x_i = 1, where r is the maximum eigenvalue of A in absolute value.

The solution to the PageRank equation can be obtained using the power method, which computes the eigenvector and eigenvalue of a matrix through an iterative process. An initial vector PI^{0} is set to u (uniform vector) and PI^{k+1} = PI^{k}. G, where G is the google matrix. The result converges after a limited number of iterations that depends on the damping factor alpha.

After presenting PageRank, the author discusses a list of very similar methods in Computer Science, Bibliometrics, Sociology and Economics. One of these methods is HITS (Hypertext Induced Topic Search), developed by Jon Kleinberg. HITS is based on the concept of authorities and hubs. Good authorities are pages that are pointed to by good hubs and good hubs are pages that point to good authorities. Given an adjacency matrix L, the authority (x) and hub (y) vectors are defined as follows:

x^k = L^T.y^{k-1}
y^k = L.x^k

These equations can be solved using the power method, like was discussed for PageRank.

I wont discuss the other methods in detail. The point is that ideas very close to PageRank and HITS were developed decades before in other areas.

Bibliometrics (1976): A journal is influential if it is cited by other influential journals.

Sociometry (1949): A person is prestigious if he is endorsed by prestigious people.

Econometrics (1941): Highly remunerated industries are those that receive substantial inputs from highly remunerated industries.

This is a great paper. It is funny how recurring ideas appear somehow independently in some different areas. In the case of PageRank and HITS, the research was developed almost in parallel. Something that I ask to myself sometimes is whether I should know Math a lot or just be able to model the a given problem and then search for the mathematical solution. I've heard somewhere that the second approach is better.

 Link: http://arxiv.org/abs/1002.2858

Nenhum comentário:

Postar um comentário