Monday, January 30, 2012

Mining Top-K Large Structural Patterns in a Massive Network

After being away for a while, I finally found time to write about this paper. It is certainly on my top list, which is why it took me a long time to read it carefully. The researchers involved in this work are from Singapore Management University, Peking University, UCSB, UIUC, and UIC; Xifeng Yan, Jiawei Han, and Philip Yu are among them. The problem studied in the paper is finding frequent patterns in a single labeled graph. A frequent pattern is a subgraph whose structure and labels occur frequently; more formally, the frequency of a subgraph is defined in terms of its isomorphic embeddings in the graph.
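To make this frequency definition concrete, here is a minimal sketch (mine, not the paper's) that counts the embeddings of a small labeled pattern using networkx:

import networkx as nx
from networkx.algorithms import isomorphism

# A labeled path A-B-A-B-A.
G = nx.Graph()
G.add_nodes_from([(1, {"label": "A"}), (2, {"label": "B"}), (3, {"label": "A"}),
                  (4, {"label": "B"}), (5, {"label": "A"})])
G.add_edges_from([(1, 2), (2, 3), (3, 4), (4, 5)])

# The pattern: a single A-B edge.
P = nx.Graph()
P.add_nodes_from([("u", {"label": "A"}), ("v", {"label": "B"})])
P.add_edge("u", "v")

# Each label-preserving subgraph isomorphism of P in G is one embedding.
gm = isomorphism.GraphMatcher(G, P, node_match=lambda a, b: a["label"] == b["label"])
print(len(list(gm.subgraph_isomorphisms_iter())))  # 4 embeddings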

The basic idea of the work is that finding frequent subgraphs in a single graph is very expensive, as previous work has shown. In practice, however, we are mainly interested in the largest subgraphs. Is it possible to mine only these large subgraphs efficiently? The authors propose an elegant approximate algorithm for this problem, called SpiderMine.

SpiderMine is based on the notion of a spider. A spider is a subgraph in which the distance between any vertex and a given head vertex is bounded by a parameter r. SpiderMine grows a random sample of spiders in order to find the largest patterns with high probability. In fact, the authors determine the number of spiders required to find the K largest patterns with probability 1-e, where e is a user-specified parameter. It is important to distinguish a subgraph from its embeddings: a subgraph is a pattern (labels + structure), while an embedding is a particular occurrence of that pattern in the graph.
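In my reading, the spider headed at a vertex is just the subgraph induced by the vertices within distance r of the head. A minimal sketch (the function name and the use of networkx are mine):

import networkx as nx

def spider(G, head, r):
    # Vertices within distance r of the head, then the induced subgraph.
    ball = nx.single_source_shortest_path_length(G, head, cutoff=r)
    return G.subgraph(ball.keys())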

SpiderMine works as follows. First, all frequent spiders are identified. In the second step, M randomly selected frequent spiders are grown over D_max/2r iterations (D_max is a parameter). A pattern is grown by attaching spiders to its boundary vertices, so that at each iteration its radius increases by r. Patterns are merged whenever they overlap during this process and the merged pattern is also frequent. Merged patterns are potential large patterns and are grown further in a third step. If a large pattern contains at least two of the sampled spiders as subgraphs, it will be found by SpiderMine. Spiders are also useful to speed up isomorphism checking, since isomorphic subgraphs have the same spider set.
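The following toy sketch conveys the flavor of the growth-and-merge phase. The simplifications are mine, not the paper's: it grows single vertex sets (embeddings) rather than frequent patterns, and it skips the frequency and isomorphism checks entirely:

import random
import networkx as nx

def toy_growth(G, M, r, D_max, seed=0):
    rng = random.Random(seed)
    heads = rng.sample(list(G.nodes()), M)
    # Each candidate starts as one spider: the r-ball around its head.
    patterns = [set(nx.single_source_shortest_path_length(G, h, cutoff=r))
                for h in heads]
    for _ in range(D_max // (2 * r)):
        # Grow each candidate by r hops from all of its vertices.
        patterns = [set().union(*(nx.single_source_shortest_path_length(G, v, cutoff=r)
                                  for v in p)) for p in patterns]
        # Merge overlapping candidates (one pass; a real implementation would
        # merge to a fixed point and keep only frequent merged patterns).
        merged = []
        for p in patterns:
            for q in merged:
                if p & q:
                    q |= p
                    break
            else:
                merged.append(p)
        patterns = merged
    return patterns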

One of my favorite parts of the paper is where the value of M is derived. Basically, the probability of a pattern P being hit by a spider (i.e., the spider has one of the vertices of P as its head) is given by:

P_hit(P) = Sum_{v in V(P)} |Spider(v)| / |S_all|

where Spider(v) is the set of spiders that have at least one embedding with v as head (I understood v as a label and Spider(v) as the set of spiders with v as head) and S_all is the complete set of spiders. The expected value of |Spider(v)| is |S_all|/|V(G)|, i.e., assuming that all vertices are equally likely to head a spider, this ratio gives the expected number of spiders headed at a given vertex. As a consequence, the probability of a pattern P being hit is lower bounded by |V(P)|/|V(G)|. The failure probability P_fail, i.e., the probability that at most one of the M sampled spiders hits P (recall that a pattern must be hit by at least two spiders to be found), is the binomial probability of zero hits plus that of exactly one hit:

P_fail(P) = (1-P_hit(P))^M + M.P_hit(P)(1-P_hit(P))^(M-1)

If P_hit(P) <= 1/2, then P_fail(P) is upper bounded by (M+1)(1-P_hit(P))^M, and the probability of hitting all K largest patterns is lower bounded by (1 - (M+1)(1-V_min/|V(G)|)^M)^K, where V_min is a parameter giving the minimum number of vertices of a large pattern. Requiring this last expression to be at least 1-e makes it possible to set the value of M based on e. Though this math seems fairly intuitive, I do not think I would be able to deduce it by myself.
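As a sanity check, a small script can search for the smallest M satisfying this bound (the parameter values below are made up for illustration, not taken from the paper):

def smallest_M(e, V_min, V_G, K, M_max=10**6):
    p = V_min / V_G  # lower bound on P_hit for any large pattern; assumes p <= 1/2
    for M in range(1, M_max):
        fail = (M + 1) * (1 - p) ** M  # upper bound on P_fail
        # fail < 1 guards against the bound being vacuous for small M.
        if fail < 1 and (1 - fail) ** K >= 1 - e:
            return M
    raise ValueError("no M found below M_max")

# e.g., 100,000 vertices, large patterns with at least 1,000 vertices:
print(smallest_M(e=0.1, V_min=1000, V_G=100000, K=10))  # roughly 1,150 spiders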

The authors compare SpiderMine against existing algorithms for mining frequent subgraphs from a single graph using synthetic (ER and scale-free) and real datasets (DBLP and Jeti). They show that SpiderMine is more effective at finding the largest patterns and also more scalable than the competing techniques. In particular, they compare SpiderMine against ORIGAMI, an algorithm for identifying representative graph patterns in the transactional setting, and show that SpiderMine is able to find larger patterns. While the execution time of the proposed algorithm grows exponentially with r, it works well for small values of r. Finally, they show some interesting patterns found in real settings. In DBLP, they found collaboration patterns common to several research groups (e.g., senior and prolific authors organized in a particular way). In Jeti, which is a piece of software, they found frequent relationships among methods from different classes.

My general view of this paper is very positive. The authors propose a very interesting and powerful algorithm for a relevant problem. The technical contributions are sound and well described, except for some small details. One point that is still not very clear to me is whether the complete set of spiders is used in the growth process. It seems that the complete set, and not only the M selected spiders, must be considered in order to generate large patterns.

Link: http://www.vldb.org/pvldb/vol4/p807-zhu.pdf
