The basic idea behind kernels is shown in the above figure. Kernels are transformations from a non-linear input space to a linear space. In this new space, a linear classifier may recognize a separating surface. Considering a classification task, a linear classifier can be defined as follows:

Where <u_i,u> is the inner-product and the prediction is represented by the sign of f(u). The so called kernel trick consists of applying a linear method to a transformation of the data:

The inner-product is replaced by a kernel function k(u,v) for which the following properties hold.

Using the kernel approach, the linear classifier can be formulated as:

The kernel method consists of two fundamental steps:

- Computing the kernel function k (focus of this work)
- Computing the optimal manifold in the feature space (for the classification problem, it is the separating surface).

A

*convolution kernel*is the application of two different kernels to parts (substructures) of the the input instances and a

*spectral kernel*is a special case of a convolution kernel where feature vectors are derived by counting substructures in a given structure. Graph kernels are examples of convolution kernels.

In this paper, the authors are interested in graph kernels for chemical data. These graphs are 2D representations of molecules. Vertices and edges are labeled and edges are undirected.

In particular, the paper focus on kernels based on paths in molecular graphs. A path is a sequence of vertices such that there exists and edge between each consecutive pair of vertices. The label of path is composed of the labels of vertices and edges involved in the path according to the sequence defined by such path. Edges cannot be visited twice but a path can contain cycles.

The kernels introduced in this work are based on the concept of

*molecular fingerprinting*, which are feature vector representations of molecules. A fingerprint is a bit-vector of limited size where bits are set according to limited depth-first searches on the graph (i.e. substructures are labeled paths). A hash value is computed based on each path and this value is used to initialize a random number generator. A number b of numbers are generated and then reduced modulo the size of the the bit-vector. Finally, the corresponding positions of the bit-vector are set as 1. The length of the paths is usually small and the cost of extracting all the paths is

*O(nm)*where n is the number of vertices and m is the number of edges. Another similar fingerprinting method can consider all possible paths in case the size of the bit-vector is the number of possible paths in the graph. Also, path counts can be used instead of binary variables.

The proposed kernes are defined as follows:

where phi is the counting feature map.

Therefore, the hybrid kernel considers both common and the missing paths. The authors show that the three proposed kernels are Mercer kernels.

Fingerprints are generated efficiently using suffix trees. The proposed kernels are evaluated using the

*Voted Perceptron algorithm*, but any other linear classification algorithm could be used as well.

All datasets used in the evaluation of the proposed kernels are sets of of chemical compounds. I will not give much detail on the particular results. The objectives of the classification tasks are predicting mutagenicity, toxicity, and anti-cancer activity on three datasets. The evaluation metrics applied are accuracy and Area under the ROC curve (AUC). The evaluation procedure is the leave-one-out. The Tanimoto and the MinMax kernels achieved the best or similar performance when compared against a pattern discovery algorithm, marginalized kernels, and other techniques proposed in the literature.

I think this paper is way more instructive than the papers I'm used to read. The concepts and formulations are easy to follow and the text is full of relevant details. I've learned the main concepts related to graph kernels, which was my primary goal. One thing that is not very clear to me is how you can separate the effect of kernel function and the linear classification algorithm in the study. In other words, what would be the implications of the use of another classification method?

Link: http://cbio.ensmp.fr/~jvert/svn/bibli/local/Ralaivola2005Graph.pdf