Tuesday, March 20, 2012

Output Space Sampling for Graph Patterns

No doubt that VLDB has published some great papers recently. This one is certainly part of my top list. It was written by some colleagues from RPI. The problem studied is subgraph pattern mining. The authors propose an output sampling algorithm based on Metropolis-Hastings that produces graph patterns according to a general-purpose interestingness metric (e.g., support) and threshold (e.g., minimum support). Moreover, you can also bias the sampling towards a given distribution. The benefits are: (1) performance improvement over traditional graph mining algorithms, and (2) reduction of the output volume. Graph mining, especially frequent graph mining, is a problem that is not well solved in the literature (the input graphs considered are too small), and sampling seems to be an interesting solution for it. The work is theoretically sound and has the potential to affect data mining significantly in the long term.

The (generic) subgraph mining problem is: given a graph database (a multiset of graphs), an interestingness function F that maps subgraphs to real numbers, and a threshold, return all subgraphs from the database whose F value satisfies the threshold. The proposed method samples the output space of a subgraph mining algorithm without generating the whole search space. This approach is completely different from what has been done in the literature so far.

I will skip the formal definitions of graphs, subgraphs, isomorphism, etc. and go straight to the point. The idea of output sampling is to perform a random walk on the search space of subgraphs, which can be seen as a Partial Order Graph (POG). However, this random walk must be smart. This is where the Metropolis-Hastings (MH) algorithm comes in. MH enables the generation of samples following a given target distribution. It is a Markov chain Monte Carlo (MCMC) method developed in 1970. One important characteristic of the method is that each step depends only on the current state (the Markov property).

MH applies an arbitrary jumping function Q which, given a current value x_t, proposes a new value x_{t+1}. Q must be symmetric, i.e., Q(x_{t+1} | x_t) = Q(x_t | x_{t+1}). The procedure starts with an initial value x_0 and then enters a loop where candidate values are generated from the current value x_t using Q. Given a = P(x_{t+1})/P(x_t), where P is the target distribution: if a >= 1, then x_{t+1} is accepted; otherwise, x_{t+1} is accepted with probability a. The method works better if Q is similar to P. MH converges under some conditions; the authors rely on P being a stationary distribution of the chain (i.e., the chain converges to P after enough iterations of the Markov process).
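As a concrete illustration (not from the paper), here is a minimal Metropolis loop in Python. The ring of ten states and the target weights are invented for the example; the proposal (move one step left or right on the ring) is symmetric, so the simple acceptance ratio P(x_new)/P(x) applies:

```python
import random

def metropolis(target, propose, x0, n_iters, seed=42):
    """Minimal Metropolis loop with a symmetric proposal Q.

    target(x)       -- unnormalized target density P (only ratios matter)
    propose(x, rng) -- draws a candidate from Q(. | x); Q is assumed symmetric
    """
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_iters):
        x_new = propose(x, rng)
        a = target(x_new) / target(x)      # acceptance ratio P(x_new)/P(x)
        if a >= 1 or rng.random() < a:     # accept with probability min(1, a)
            x = x_new
        samples.append(x)
    return samples

# Toy example: states 0..9 on a ring, target P(x) proportional to x + 1.
# Stepping +1 or -1 (mod 10) is a symmetric proposal.
samples = metropolis(
    target=lambda x: x + 1,
    propose=lambda x, rng: (x + rng.choice([-1, 1])) % 10,
    x0=0,
    n_iters=50_000,
)
mean = sum(samples) / len(samples)  # should approach E[x] = 6.0 under P
```

Note that only ratios of P are ever computed, which is exactly why MH is attractive here: the normalizing constant (the total number of frequent patterns) is unknown.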

The state space of the MH algorithm for sampling the output space of subgraph patterns is the Partial Order Graph (POG). This graph is built only locally, and the authors prove that a random walk on the POG converges to a stationary distribution (the chain is finite, irreducible, and aperiodic).

First, it is shown how to obtain a uniform sample of frequent patterns, which means that every frequent pattern has the same probability of being returned. In order to do so, it is necessary to create proper transition probabilities in the POG. In fact, they show that if the transition matrix is symmetric, then the stationary distribution is uniform. P(u,v), where u and v are states (or vertices in the POG), is defined as follows:

P(u,v) = 1/(2*max(d_u, d_v))             if u != v and v in adj(u)
         1 - Sum_{x in adj(u)} P(u,x)    if u = v
         0                               otherwise
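A quick sanity check of this construction on a hypothetical four-pattern POG fragment (empty <-> {A, B} <-> AB; the patterns are made up): the matrix comes out symmetric, every row sums to 1, and therefore the uniform vector is stationary.

```python
# Hypothetical four-pattern POG fragment: empty <-> {A, B} <-> AB.
adj = {
    'empty': ['A', 'B'],
    'A': ['empty', 'AB'],
    'B': ['empty', 'AB'],
    'AB': ['A', 'B'],
}
deg = {u: len(vs) for u, vs in adj.items()}

def P(u, v):
    """Transition probability from the symmetric construction above."""
    if u != v and v in adj[u]:
        return 1.0 / (2 * max(deg[u], deg[v]))
    if u == v:
        # Self-loop takes whatever probability mass is left in the row.
        return 1.0 - sum(P(u, x) for x in adj[u])
    return 0.0

nodes = list(adj)
row_sums = {u: sum(P(u, v) for v in nodes) for u in nodes}
symmetric = all(P(u, v) == P(v, u) for u in nodes for v in nodes)
```

On this fragment every vertex has degree 2, so each neighbor transition gets probability 1/4 and each self-loop gets 1/2; the self-loops are also what make the chain aperiodic.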

The MH algorithm is adapted using P(u,v). The algorithm receives a pattern (in practice the empty pattern, which is part of the POG), the minimum support threshold, and the number of iterations to be performed. The acceptance of a transition is based on the degrees of the current vertex and the new vertex. The number of transitions depends on the number of iterations, and the final pattern reached is returned as output. In this process, only the neighbors of the current pattern are generated (i.e., its sub- and super-patterns), and support counting for these patterns is performed using standard isomorphism checking in a pass over the database.
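This is my reading of that walk, sketched over a precomputed toy graph (in the real algorithm the neighbor sets are generated on the fly from the database and filtered by minimum support, and the graph below is a hypothetical star, not a POG): propose a uniformly chosen neighbor and accept it with probability min(1, d_u/d_v), so that high-degree vertices are not oversampled and the stationary distribution is uniform.

```python
import random

def uniform_pattern_walk(adj, start, n_iters, rng):
    """Degree-corrected random walk: propose a uniform neighbor,
    accept with probability min(1, d_u / d_v). Rejections become
    self-loops, and the stationary distribution is uniform."""
    u = start
    visits = {v: 0 for v in adj}
    for _ in range(n_iters):
        v = rng.choice(adj[u])
        if rng.random() < min(1.0, len(adj[u]) / len(adj[v])):
            u = v
        visits[u] += 1
    return u, visits

# Hypothetical star graph: the hub has degree 3, the leaves degree 1,
# yet the corrected walk visits all four vertices about equally often.
adj = {'hub': ['a', 'b', 'c'], 'a': ['hub'], 'b': ['hub'], 'c': ['hub']}
_, visits = uniform_pattern_walk(adj, 'hub', 40_000, random.Random(7))
```

An uncorrected random walk on this star would spend half its time at the hub; the degree-based acceptance is what flattens the visit distribution.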

The authors also show how to perform support-proportional sampling (i.e., sampling biased towards the most frequent subgraphs), which requires a modification of the transition function P(u,v). However, since they are still working on the POG, they must relate this interestingness function to the degrees of its vertices. Basically, they propose a transition function where an up movement (towards the empty pattern) is, to some extent, more likely than a down movement. Changing the algorithm to consider a discriminatory metric is easy, since it requires only a small change in how new vertices are accepted.
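The paper's up/down construction is specific to the POG, but the general MH recipe for a weight-proportional target is easy to sketch. The version below is the textbook MH acceptance with a uniform-neighbor proposal, not the authors' exact transition function, and the graph and support weights are invented:

```python
import random

def biased_walk(adj, weight, start, n_iters, rng):
    """MH walk whose stationary distribution is proportional to weight(v).
    Proposal: a uniform neighbor; the acceptance ratio corrects for both
    the target ratio and the asymmetric proposal (1/d_u vs. 1/d_v)."""
    u = start
    visits = {v: 0 for v in adj}
    for _ in range(n_iters):
        v = rng.choice(adj[u])
        a = (weight(v) * len(adj[u])) / (weight(u) * len(adj[v]))
        if rng.random() < min(1.0, a):
            u = v
        visits[u] += 1
    return visits

# Invented supports: the hub pattern is four times as frequent as leaf 'a'.
adj = {'hub': ['a', 'b', 'c'], 'a': ['hub'], 'b': ['hub'], 'c': ['hub']}
w = {'hub': 4, 'a': 1, 'b': 1, 'c': 2}
visits = biased_walk(adj, w.get, 'hub', 80_000, random.Random(1))
# Visit fractions should approach w[v] / sum(w) = 0.5, 0.125, 0.125, 0.25.
```

Swapping support for a discriminatory score only changes the weight function, which matches the authors' remark that the discriminatory variant needs just a small change in the acceptance step.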

The results are divided into two parts. First, an analysis of the patterns using a small dataset (the AIDS antiviral screening dataset) is presented. It is shown that the method generates a uniform distribution of patterns if desired and that it converges after 20k iterations. The authors also show that the distribution of patterns is still a bit skewed towards high-degree vertices in the POG. Using a transaction dataset, for which the POG is denser, they show that their method achieves a more uniform distribution. In terms of support-biased sampling, the method also works well, but it is necessary to select a good value for the up/down transition probabilities. Discriminatory subgraph sampling is evaluated using a dataset of chemical graphs. The results in this case are not very conclusive: although many high delta-score patterns are found, the correlation between delta score and visit count does not seem as high as in the support-biased case. In the evaluation using large graphs, a protein-interaction and a cell-graph database are used. Most existing algorithms take more than 2 days to process such datasets, and the output space sampling algorithm is shown to be a good alternative.

The technique proposed in this paper is very innovative and relevant. In light of what has been done so far in graph mining, this is a big step towards the analysis of real graph databases. One thing that I did not understand very well is why the approach proposed for support-biased sampling differs from the one that considers the significance of patterns. In fact, after finding out that the significance-biased technique does not work very well, I started to wonder how general the technique is in practice. I would like to know the line of thought that brought the authors from the existing work to this sampling idea; it is really different from everything I've seen in this area.

Link: www.vldb.org/pvldb/2/vldb09-713.pdf
