sexta-feira, 6 de janeiro de 2012

It's Who you Know: Graph Mining using Recursive Structural Features

This paper studies the problem of extracting features for nodes. Node features are useful in many graph mining tasks, such as classification, de-anonymization, and outlier detection. The focus of this work is the discovery of structural features, i.e. feature that are related to the topology of the graph. Structural features are specially interesting for transfer learning tasks, where the knowledge discovered from one graph is applied to another one. The authors cite as an example of a transfer learning task the problem of classifying IPs from a given network in terms of traffic type using labelled IPs from another network.

Node features are classified as follows. Local features are vertex degrees. Egonet features are extracted from the graph induced by a given node and its neighborhood (e.g., the number of edges that connect at least one node in this induced graph). Neighborhood features combine both the local and egonet features. Recursive features, which definition are the most important contribution of the paper, are recursive aggregations (sums and means) of neighborhood and even other recursive features. Finally, the set of regional features comprises the neighborhood and recursive features.

Recursive features have their range of values reduced to small integers through a vertical logarithm binning procedure. According to the authors, this procedure is effective in discriminating values that follow a power-law distribution. A feature graph containing edges between every pair of feature values that never disagree by more than s (input parameter) is used to prune the number of features. Each connected component of this graph is replaced by the simplest feature (i.e., the one generated with the minimum number of recursions).

The authors apply regional features in three graph mining tasks: within network classification, across network classification, and node de-anonymization. Such features are given as input to a log-forest model, which is a bagging of logistic regression classifiers. For within network classification, they use a dataset where nodes are IPs, edges represent communication, and classes are related to the type of traffic (e.g., DNS, Web, P2P). Their technique outperforms the weighted vote relational neighbor classifier, which makes use of neighbor's labels, in cases where the labels are sparse and achieve competitive results when 90% of nodes are labelled. The same dataset is used for across-network classification, in which different networks are generated using data from distinct periods of time and a completely labelled network is used to classify nodes from another completely unlabelled network. The results show that regional features achieve accuracies varying from 70% to 90% in this scenario.

The de-anonymization task consists of, given two networks with shared nodes, identifying corresponding nodes (i.e., those that represent the same identity). They solve this problem in IM, communication and SMS networks, generated in different periods of time. Moreover, they also apply features from a who-follows-whom network to de-anonymize a who-mentions-whom network from Twitter. Regional features outperform local and neighborhood features and also a baseline strategy that makes random choices in most of the datasets. While such features are able to set top-1% scores to the target node in up to 78% of the cases in the IP network and up to 55% in the IM network, the results on Twitter are poor. Moreover, regional features does not achieve the best results for the SMS data, unless it is able to select the nodes to be classified.

In general, the paper is not very clear. The authors should give at least one example of the generation of recursive features. Furthermore, although the proposed technique is shown to be effective, I found it to be arbitrary in several points, specially the pruning strategy. One of the strengths of the paper is its thorough experimental evaluation. Nevertheless, the results would be more easily understood if a standard evaluation procedure was employed. In particular, I was not able to deduce in which extent the regional features were more effective in the de-anonymization of nodes in the IP than in the Twitter case study.

Nenhum comentário:

Postar um comentário