The basic idea of the work is that finding frequent subgraphs in a single graph is very expensive, as has been shown by previous work. Nevertheless, in practice, we are mainly interested in the largest subgraphs. Is it possible to mine only these large subgraphs efficiently? The authors propose an elegant approximate algorithm, called SpiderMine, to solve this problem.

SpiderMine is based on the notion of a spider. A spider is a subgraph in which the distance between any vertex and a designated head vertex is bounded by *r* (a parameter). SpiderMine grows a random sample of spiders in order to find the largest patterns with high probability. In fact, the authors determine the number of spiders required to find the k largest patterns with probability *1-e*, where *e* is a user-specified parameter. It is important to distinguish a subgraph from its embeddings: a subgraph is a pattern (labels + structure), while an embedding is a particular occurrence of that pattern in the graph.
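To make the pattern/embedding distinction concrete, here is a toy illustration of my own (not from the paper): a single-edge labeled pattern with several embeddings in a small labeled graph.

```python
# A subgraph pattern is (labels + structure); an embedding is one
# concrete occurrence of it in the data graph. Toy labeled path A-B-A-B:
labels = {0: "A", 1: "B", 2: "A", 3: "B"}
edges = [(0, 1), (1, 2), (2, 3)]  # undirected edges

def embeddings_of_edge(la, lb):
    """All embeddings of the single-edge pattern la-lb."""
    occ = []
    for u, v in edges:
        for a, b in ((u, v), (v, u)):  # try both orientations
            if labels[a] == la and labels[b] == lb:
                occ.append((a, b))
    return occ

print(embeddings_of_edge("A", "B"))  # → [(0, 1), (2, 1), (2, 3)]
```

One pattern (an A-B edge), three embeddings: this is why support counting in a single graph is subtler than in the transaction setting.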

SpiderMine works as follows. First, all frequent spiders are identified. In the second step, *M* randomly selected frequent spiders are grown over *D_max/2r* iterations (*D_max* is a parameter). A pattern is grown by attaching spiders to its boundary vertices, so that at each iteration the pattern's radius increases by *r*. Patterns that overlap during this process, and whose merged result is also frequent, are merged; merged patterns are potential large subgraphs and are grown further in a third step. If a large pattern contains at least two of the selected spiders as subgraphs, it will be found by SpiderMine. Spiders are also useful to speed up isomorphism checking, since isomorphic subgraphs have the same spider set.
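As I understand it, the spider headed at a vertex is essentially the subgraph induced by all vertices within distance *r* of the head. A minimal BFS sketch of that extraction step (my own illustration, not the authors' implementation):

```python
from collections import deque

def spider(adj, head, r):
    """Vertex set of the spider headed at `head`: all vertices within
    distance r of `head`, found by breadth-first search.
    `adj` maps each vertex to a list of its neighbors."""
    dist = {head: 0}
    queue = deque([head])
    while queue:
        u = queue.popleft()
        if dist[u] == r:        # boundary reached; do not expand further
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

# Toy graph: the path 0-1-2-3-4
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(sorted(spider(adj, 2, 1)))  # → [1, 2, 3]
```

Growing a pattern by one iteration then amounts to attaching such spiders at its boundary vertices, increasing its radius by *r* each time.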

One of my favorite parts of the paper is the derivation of the value of *M*. Basically, the probability of a pattern P being hit by a spider (i.e., the spider has one of the vertices of P as its head) is given by:

*P_hit(P) = Sum_{v in V(P)} |Spider(v)| / |S_all|*

where *Spider(v)* is the set of spiders that have at least one embedding with *v* as head (I understood *v* as a label and *Spider(v)* as the set of spiders with *v* as head), and *S_all* is the complete set of spiders. The expected value of *|Spider(v)|* is *|S_all|/|V(G)|*; i.e., assuming that all vertices in *V(G)* are equally likely to be part of a spider, this ratio gives the expected number of spiders headed at a given vertex. As a consequence, the probability of a pattern P being hit is lower bounded by *|V(P)|/|V(G)|*. The probability of a pattern P not being hit, P_fail, i.e. the probability that at most one vertex of P is the head of a selected spider, is given by:

*P_fail(P) = (1-P_hit(P))^M + M.P_hit(P).(1-P_hit(P))^(M-1) <= (M+1)(1-P_hit(P))^M*

Hence the probability of hitting all of the K largest patterns is lower bounded by *(1 - (M+1)(1-V_min/|V(G)|)^M)^K*, where *V_min* is a parameter that gives the minimum size (in vertices) of a large pattern. Using this last expression, it is possible to set the value of *M* based on *e*. Though this math seems fairly intuitive, I do not think I would be able to deduce it by myself.

The authors compare SpiderMine against existing algorithms for mining frequent subgraphs from single graphs using synthetic (Erdős–Rényi and scale-free) and real datasets (DBLP and Jeti). They show that SpiderMine is more effective at finding the largest patterns and is also more scalable than competing techniques. In particular, they compare SpiderMine with ORIGAMI, an algorithm for the identification of representative graph patterns in the transaction setting, and show that SpiderMine is able to find larger patterns. While the execution time of the proposed algorithm grows exponentially with *r*, it works well for small values of *r*. Finally, they show some interesting patterns found in real settings. In DBLP, they found collaboration patterns that are common to several research groups (e.g., senior and prolific authors organized in a particular way). In Jeti, which is a software project, they found some frequent relationships among methods from different classes.

My general view of this paper is very positive. The authors propose a very interesting and powerful algorithm for a relevant problem. The technical contributions are sound and well described, except for some small details. One point that is still not entirely clear to me is whether the complete set of spiders is used in the growth process; it seems that the complete set, rather than only the M selected spiders, is needed in order to generate the large patterns.

Link: http://www.vldb.org/pvldb/vol4/p807-zhu.pdf