Sunday, January 8, 2012

Sampling for Large Graphs

During the past months, I've been thinking a lot about topological properties (e.g., degree distribution, clustering coefficient, diameter) of random subgraphs, which can be seen as samples of graphs. So, I decided to read this paper again. In statistics, sampling is basically the technique of selecting a limited number of instances to represent a larger population. This paper studies sampling from graphs, i.e., the selection of a smaller, but still representative, subgraph from an original graph. Such samples are useful in the analysis of very large graphs, for which full processing is impractical.

The authors study the problem of sampling large graphs from two perspectives: scale-down and back-in-time. Scale-down sampling is defined as the selection of a sample S with n' vertices from a graph G with n vertices, such that n' < n and S has properties similar to those of G. In back-in-time sampling, the sample S is expected to have the same properties the graph G had when it contained n' vertices. Samples are evaluated in terms of several static (e.g., in- and out-degree, clustering coefficient distribution) and temporal (e.g., densification power law, shrinking diameter) properties of the original graph. Two distributions are compared using the Kolmogorov-Smirnov D-statistic. The sampling techniques studied in the paper are:


  • Random Node: Vertices are selected at random.
  • Random PageRank Node: Vertices are selected with a probability proportional to their PageRank weight.
  • Random Degree Node: Vertices are selected with a probability that is proportional to their degrees.
  • Random Edge: Edges are selected at random.
  • Random Node-edge: A vertex is selected at random and then one of its edges is selected at random as well.
  • Hybrid: Applies the random node-edge sampling with a probability p and the random edge sampling with a probability 1 - p (p = 0.8).
  • Random Node Neighbor: A vertex is selected at random together with all its neighbors.
  • Random Walk: Picks a vertex at random and simulates a random walk starting from it; at each step, the walk returns to the starting vertex with probability c (c = 0.15). See the sketch after this list.
  • Random Jump: Similar to the random walk, but jumps to a random vertex instead of the same starting vertex.
  • Forest Fire: Randomly picks a vertex and then selects x of its outgoing neighbors, where x is geometrically distributed with mean pf/(1-pf) and pf is a parameter; the process is then applied recursively to the selected neighbors.
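
Just to make the random walk variant concrete, here is a minimal sketch in Python, assuming an undirected networkx graph; the parameter names and the stopping rule (collect vertices until the sample reaches the target size) are my own choices, not taken from the paper.

```python
# Minimal sketch of random walk sampling with restarts (c = 0.15), assuming an
# undirected, connected networkx graph; not the paper's implementation.
import random
import networkx as nx

def random_walk_sample(G, target_size, c=0.15, seed=None):
    rng = random.Random(seed)
    start = rng.choice(list(G.nodes))
    sampled = {start}
    current = start
    # Walk until we have collected the desired number of vertices.
    while len(sampled) < target_size:
        if rng.random() < c or G.degree(current) == 0:
            current = start                          # fly back to the starting vertex
        else:
            current = rng.choice(list(G.neighbors(current)))
        sampled.add(current)
    return G.subgraph(sampled).copy()                # induced subgraph on the sampled vertices

if __name__ == "__main__":
    G = nx.barabasi_albert_graph(10_000, 3)
    S = random_walk_sample(G, target_size=1_500, seed=42)
    print(S.number_of_nodes(), S.number_of_edges())
```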


The graph datasets used in the evaluation are a citation network from arXiv, an autonomous systems network, a bipartite affiliation network from arXiv, and a network of trust from Epinions. The results show that, although there is no single best strategy, the Random Walk and the Forest Fire sampling achieved the best results in the scale-down task. Regarding the back-in-time task, the Random PageRank Node and the Forest Fire strategies obtained the best results in terms of the Kolmogorov-Smirnov D-statistic on average. The paper also presents an evaluation of the impact of the sample size on the quality of the sampling. For small samples (< 40%), the Random Degree Node, Random Jump, and Random Walk methods achieved the best results in the scale-down task, while the Forest Fire, Random Node, and Random PageRank Node methods achieved the best results in the back-in-time task. Moreover, methods based on edge selection did not perform well in general.

My first conclusion from this paper is that it is not exactly what I'm looking for. It seems that the paper Statistical properties of sampled networks is much more related to the properties of random subgraphs. However, I enjoyed reading this paper, especially because of its experimental section. The evaluation metric applied (the Kolmogorov-Smirnov D-statistic) may be handy in the future.
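
For reference, here is a minimal sketch of how such a comparison could be computed with scipy, using the degree distributions of a graph and of a random node sample; the graphs are synthetic and the setup is mine, not the paper's.

```python
# Hedged sketch: comparing the degree distributions of a graph and a sample via
# the Kolmogorov-Smirnov D-statistic, using scipy; not the authors' own code.
import random
import networkx as nx
from scipy.stats import ks_2samp

G = nx.barabasi_albert_graph(10_000, 3)                # original (synthetic) graph
S = G.subgraph(random.sample(list(G.nodes), 1_500))    # random node sample

degrees_G = [d for _, d in G.degree()]
degrees_S = [d for _, d in S.degree()]

# ks_2samp returns the D-statistic (the maximum distance between the two
# empirical CDFs) and a p-value; the paper reports only the D-statistic.
D, p_value = ks_2samp(degrees_G, degrees_S)
print(f"Kolmogorov-Smirnov D = {D:.3f}")
```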

Link: http://cs.stanford.edu/people/jure/pubs/sampling-kdd06.pdf

Patterns of Temporal Variation in Online Media

During the past years I've studied how URLs become popular on Twitter, and this paper is closely related work. In this paper, Jaewon Yang and Jure Leskovec, both from Stanford University, study patterns that describe how the popularity of a given piece of content (e.g., URLs, hashtags, memes) evolves over time, which they call patterns of temporal variation. Each piece of content produces a time series that describes its popularity during a given time interval, and the authors propose a technique for identifying clusters of contents that present a similar variation pattern. This problem has several relevant applications, such as understanding how online content appears, gets popular, and fades over time, and also predicting which content is likely to become popular.

The clustering algorithm for time series proposed in the paper, called K-Spectral Centroid (K-SC), is very similar to the k-means algorithm. However, the authors define a new distance metric that is more suitable for the particular scenario of temporal variation of online content. The proposed metric is defined as an optimization problem that finds the best match between two time series in terms of both time shifting and scaling. The objective is to shift and scale the time series in order to remove the effects of the absolute popularity values and of the particular time at which each series peaked. Based on this metric, they define an overall evaluation function to be minimized and also an optimization problem for identifying a cluster's centroid in each iteration of the algorithm. K-SC receives the number of clusters as an input parameter and the initial centroids are assigned at random. The algorithm iterates, assigning each time series to its closest centroid and then updating the centroid of each cluster, until the assignments converge. The identification of the centroids requires computing the eigenvectors of an L x L matrix, where L is the length of the time series, which takes O(L^3). Moreover, the performance of the algorithm is very sensitive to the initial assignments. In order to make K-SC more scalable, the authors propose an incremental version of the algorithm based on the Discrete Haar Wavelet Transform.
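
As I understand it, the distance has the form d(x, y) = min over a shift q and a scale a of ||x - a * shift(y, q)|| / ||x||, where the optimal scale for a fixed shift has a closed form. Here is a rough sketch under that reading (it uses a circular shift for simplicity, which differs from the paper's zero-padded shift at the boundaries).

```python
# Hedged sketch of the K-SC shift-and-scale distance as I understand it from
# the paper; variable names and the max_shift parameter are my own.
import numpy as np

def ksc_distance(x, y, max_shift=10):
    x = np.asarray(x, dtype=float)
    best = np.inf
    for q in range(-max_shift, max_shift + 1):
        y_q = np.roll(y, q).astype(float)       # shifted copy of y (circular shift)
        denom = np.dot(y_q, y_q)
        if denom == 0:
            continue
        a = np.dot(x, y_q) / denom              # optimal scaling for this shift
        d = np.linalg.norm(x - a * y_q) / np.linalg.norm(x)
        best = min(best, d)
    return best

x = np.array([0, 1, 4, 9, 4, 1, 0, 0], dtype=float)
y = np.array([0, 0, 2, 8, 18, 8, 2, 0], dtype=float)   # same shape, shifted and scaled
print(ksc_distance(x, y))                               # close to 0: the series match after alignment
```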

The K-SC algorithm is applied to the analysis of two datasets. The first one is a set of phrases extracted from blogs and news articles through the MemeTracker methodology. Each time series is truncated to 128 hours and the number of clusters is chosen empirically. Experiments show that K-SC outperforms two variations of k-means in terms of two evaluation metrics: one that measures the distance between points and their respective centroids and another that evaluates the distance between centroids. While the first variation of the k-means algorithm aligns the peaks of the time series, the second also scales peak volumes. They also show that the incremental version of K-SC is more scalable than the naive one.

The 6 clusters identified by K-SC are very interesting. They represent intuitive and distinct patterns, such as a fast and short burst of popularity and popularity rebounding. The authors argue that these patterns are related to the times at which different types of media mention the phrases and that this information may be useful to classify phrases into one of the 6 categories. They show that these temporal features are more effective than volume features (i.e., the fraction of the total volume of citations corresponding to a particular website) and also than tf-idf features based on the citations of a particular website to a given phrase and the overall citations of this website. Similar results are found when K-SC is applied to the analysis of URLs on Twitter.

This paper studies a very interesting problem and presents several contributions, both in terms of the techniques proposed and the case studies presented. I think that, at some points, the authors make assertions that are too strong, especially regarding the classification accuracy of the techniques, which seems quite similar across them to me. However, my overall evaluation of the paper is very positive.

Link: http://cs.stanford.edu/people/jure/pubs/memeshapes-wsdm11.pdf

Friday, January 6, 2012

It's Who You Know: Graph Mining Using Recursive Structural Features

This paper studies the problem of extracting features for nodes. Node features are useful in many graph mining tasks, such as classification, de-anonymization, and outlier detection. The focus of this work is the discovery of structural features, i.e., features that are related to the topology of the graph. Structural features are especially interesting for transfer learning tasks, where the knowledge discovered from one graph is applied to another one. As an example of a transfer learning task, the authors cite the problem of classifying IPs from a given network in terms of traffic type using labelled IPs from another network.

Node features are classified as follows. Local features are vertex degrees. Egonet features are extracted from the subgraph induced by a given node and its neighborhood (e.g., the number of edges with at least one endpoint inside this induced subgraph). Neighborhood features combine the local and egonet features. Recursive features, whose definition is the most important contribution of the paper, are recursive aggregations (sums and means) of neighborhood features and even of other recursive features. Finally, the set of regional features comprises the neighborhood and recursive features.
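
Since the paper itself gives no worked example (see my complaint below), here is a rough sketch of what I understand by recursive feature generation: start from a base feature (the degree) and repeatedly append, for each vertex, the sum and the mean of every existing feature over its neighbors. This is only an illustration, not the authors' implementation.

```python
# Hedged sketch of recursive feature generation in the spirit of the paper.
import networkx as nx
import numpy as np

def recursive_features(G, iterations=2):
    nodes = list(G.nodes)
    index = {v: i for i, v in enumerate(nodes)}
    feats = np.array([[G.degree(v)] for v in nodes], dtype=float)  # local feature: degree
    for _ in range(iterations):
        new_cols = []
        for col in feats.T:                                        # aggregate every existing feature
            sums, means = [], []
            for v in nodes:
                neigh = [index[u] for u in G.neighbors(v)]
                vals = col[neigh] if neigh else np.array([0.0])
                sums.append(vals.sum())
                means.append(vals.mean())
            new_cols.extend([sums, means])
        feats = np.hstack([feats, np.array(new_cols).T])           # append sums and means
    return nodes, feats

G = nx.karate_club_graph()
nodes, F = recursive_features(G, iterations=2)
print(F.shape)   # 1 column -> 3 columns -> 9 columns after two recursions
```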

Recursive features have their range of values reduced to small integers through a vertical logarithmic binning procedure. According to the authors, this procedure is effective in discriminating values that follow a power-law distribution. To prune the feature set, the authors build a feature graph containing an edge between every pair of features whose binned values never disagree by more than s (an input parameter). Each connected component of this graph is then replaced by its simplest feature (i.e., the one generated with the minimum number of recursions).
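
My understanding of the vertical logarithmic binning is that the fraction p of vertices with the smallest value gets bin 0, the fraction p of the remaining vertices gets bin 1, and so on, so that the long tail of a power-law distribution is spread over many bins. A small sketch under that assumption (the parameter p is my own choice):

```python
# Hedged sketch of vertical logarithmic binning as I understand it; not the
# authors' code, and the binning direction is my reading of the paper.
import numpy as np

def vertical_log_binning(values, p=0.5):
    values = np.asarray(values, dtype=float)
    bins = np.zeros(len(values), dtype=int)
    order = np.argsort(values)                  # vertices sorted by feature value
    remaining = len(values)
    start, bin_id = 0, 0
    while start < len(values):
        take = max(1, int(np.ceil(p * remaining)))   # p fraction of what is left
        bins[order[start:start + take]] = bin_id
        start += take
        remaining -= take
        bin_id += 1
    return bins

degrees = np.array([1, 1, 1, 2, 2, 3, 5, 8, 20, 100])
print(vertical_log_binning(degrees))            # [0 0 0 0 0 1 1 1 2 3]
```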

The authors apply regional features to three graph mining tasks: within-network classification, across-network classification, and node de-anonymization. The features are given as input to a log-forest model, which is a bagging of logistic regression classifiers. For within-network classification, they use a dataset where nodes are IP addresses, edges represent communication, and classes are related to the type of traffic (e.g., DNS, Web, P2P). Their technique outperforms the weighted-vote relational neighbor classifier, which makes use of neighbors' labels, in cases where the labels are sparse, and achieves competitive results when 90% of the nodes are labelled. The same dataset is used for across-network classification, in which different networks are generated using data from distinct periods of time and a completely labelled network is used to classify nodes from another, completely unlabelled, network. The results show that regional features achieve accuracies varying from 70% to 90% in this scenario.
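
A minimal sketch of what a log-forest-style classifier could look like with scikit-learn, under the assumption that it is simply a bagged ensemble of logistic regressions over random feature subsets; the data here is synthetic and stands in for the regional features, and the hyperparameters are my own.

```python
# Hedged sketch of a "log-forest"-style model: bagging of logistic regressions.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, n_features=30, random_state=0)  # stand-in for regional features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

log_forest = BaggingClassifier(
    LogisticRegression(max_iter=1000),   # base learner
    n_estimators=50,                     # number of bagged models (my choice)
    max_features=0.5,                    # each model sees a random half of the features
    random_state=0,
)
log_forest.fit(X_tr, y_tr)
print(f"accuracy: {log_forest.score(X_te, y_te):.3f}")
```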

The de-anonymization task consists of, given two networks with shared nodes, identifying the corresponding nodes (i.e., those that represent the same identity). They solve this problem for IM, communication, and SMS networks generated in different periods of time. Moreover, they also apply features from a who-follows-whom network to de-anonymize a who-mentions-whom network from Twitter. Regional features outperform local and neighborhood features, and also a baseline strategy that makes random choices, in most of the datasets. While such features are able to rank the target node within the top 1% of candidates in up to 78% of the cases in the IP network and up to 55% in the IM network, the results on Twitter are poor. Moreover, regional features do not achieve the best results for the SMS data unless the method is allowed to select which nodes to classify.

In general, the paper is not very clear. The authors should give at least one example of the generation of recursive features. Furthermore, although the proposed technique is shown to be effective, I found it to be arbitrary at several points, especially the pruning strategy. One of the strengths of the paper is its thorough experimental evaluation. Nevertheless, the results would be more easily understood if a standard evaluation procedure were employed. In particular, I was not able to deduce to what extent the regional features were more effective in the de-anonymization of nodes in the IP case study than in the Twitter one.

Tuesday, January 3, 2012

Detecting Influenza Epidemics using Search Engine Query Data

This paper was in my TO-READ stack for a long time. It is a work developed by researchers from Google and the Centers for Disease Control and Prevention and was published in Nature in 2009. The paper shows how query data is correlated with influenza epidemics in a given region, being useful for monitoring purposes. More specifically, a linear regression using the relative frequency of some queries submitted to the Google search engine accurately estimates the percentage of physician visits in which a patient presents influenza-like symptoms. Official reports were used to identify the queries and also to validate the models.

An interesting property of the proposed strategy is the fact that it is completely automatic. Instead of applying machine learning techniques to classify a set of 50 million queries, they used the official data to rank and select the queries that produced the most accurate estimates. The weekly counts of each query were normalized for each state over a five-year period. Although the study considered data from each region of the United States, the selected model (i.e., set of queries + linear regression) was the one that obtained the best global results. The final set of queries is related to influenza complications, cold/flu remedies, symptoms, etc.
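
If I understood the model correctly, it is a simple linear fit on logit-transformed values, logit(P) = b0 + b1 * logit(Q), where P is the percentage of influenza-like-illness physician visits and Q is the relative query frequency. A tiny sketch on synthetic data (the coefficients and noise level below are made up):

```python
# Hedged sketch of a logit-linear fit of ILI visit percentages from query
# fractions; synthetic data, not the paper's CDC data.
import numpy as np
from scipy.special import logit, expit
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
Q = rng.uniform(0.001, 0.02, size=200)                         # weekly query fractions (synthetic)
P = expit(0.9 * logit(Q) + 0.5 + rng.normal(0, 0.1, 200))      # synthetic ILI percentages

model = LinearRegression().fit(logit(Q).reshape(-1, 1), logit(P))
P_hat = expit(model.predict(logit(Q).reshape(-1, 1)))
print(f"correlation: {np.corrcoef(P, P_hat)[0, 1]:.2f}")
```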

The mean correlation of the model with the official reports was 0.90. The model achieved a mean correlation of 0.97 when applied to untested data (i.e., not used during the fitting phase) and a further evaluation using data from the state of Utah obtained a mean correlation of 0.90. Moreover, the authors found that the proposed model is able to provide estimates 1-2 weeks ahead of the publication of the reports.

An interesting aspect of the paper is that it combines a very simple model (linear regression) with a huge dataset. In academia, we are accustomed to the opposite approach, which is the application of a complex model to fit a small dataset. I like to call the strategy used in the paper the Google approach. According to the paper, hundreds of machines running MapReduce were used to process the models. Pretty cool, isn't it? Searching for related papers, I found one that argues that the query data is not that useful for this kind of monitoring. Nevertheless, using web data to monitor epidemics has become a hot topic.

Link: http://research.google.com/archive/papers/detecting-influenza-epidemics.pdf

Echoes of Power: Language Effects and Power Differences in Social Interaction

Today I've read a very interesting paper entitled Echoes of Power: Language Effects and Power Differences in Social Interaction. In this paper, researchers from Cornell University and Yahoo! study the effect of people's position (or power) on the way they influence each other in terms of language usage. The basic idea of the paper is that, in a conversation, a higher-powered person (e.g., a justice) is more likely to affect the language of others than a less-powered person (e.g., a lawyer). Moreover, a higher-powered person has his or her language less affected by others than a less-powered one. The authors show these patterns using the concept of linguistic coordination, which is, basically, the tendency of people to mimic the usage of function words (e.g., articles, pronouns) of those with whom they are communicating.

Based on the exchange theory framework, the authors consider two types of power in social interactions: status and dependence. Measures of coordination between two groups are defined for a specific linguistic style marker and also for aggregated markers. Such measures are, basically, a function of the increase in the probability that a speaker 'a' from a group A uses a given marker in a reply to a target 'b' from a group B, given that 'b' has used this marker while communicating with 'a'. The authors say that, in this case, 'a' coordinated to 'b'.
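
A toy sketch of how I read this measure: for a marker m and a speaker-target pair, take the probability that the speaker's reply contains m when the target's utterance did, minus the baseline probability that the speaker's reply contains m at all. The exchanges below are made up for illustration; this is not the authors' exact estimator.

```python
# Hedged sketch of a per-marker coordination score for one speaker-target pair.
def coordination(exchanges, marker):
    """exchanges: list of (target_utterance, speaker_reply) pairs."""
    used_given_trigger = [marker in reply for target, reply in exchanges if marker in target]
    used_overall = [marker in reply for _, reply in exchanges]
    if not used_given_trigger or not used_overall:
        return None
    p_conditional = sum(used_given_trigger) / len(used_given_trigger)   # P(reply has m | target used m)
    p_baseline = sum(used_overall) / len(used_overall)                  # P(reply has m)
    return p_conditional - p_baseline

exchanges = [
    ("I think that the article should stay", "I agree that it should"),
    ("Please revert the edit", "Done"),
    ("We believe that this is wrong", "Not sure that I follow"),
]
print(coordination(exchanges, marker=" that "))   # positive: the speaker coordinates on "that"
```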

They study these language patterns in two real-life scenarios: Wikipedia and Supreme Court arguments. On Wikipedia, admins (higher-powered) and non-admins (lower-powered) communicate about article edits. In terms of power, the authors found that users coordinate more to admins than to non-admins, as expected. However, the results also show that admins coordinate more to other users than non-admins do, which is counterintuitive. A further analysis revealed that admins reduce their levels of coordination after they are promoted. Considering dependence, they found that a user coordinates more to those users with whom he/she disagrees. In the Supreme Court arguments, justices (higher-powered) and lawyers (lower-powered) argue about cases. As expected, they found that lawyers coordinate more to justices than the opposite. This coordination is even more pronounced in cases where the lawyer and the justice disagreed. Moreover, favorable justices coordinate more.

The authors also show how coordination features (i.e., features that state whether a user x coordinates more to y than the opposite) are useful for predicting the most powerful user in social interactions. In fact, while other features, such as textual content, may be more effective in this task, coordination features work in cross-domain settings. Finally, the paper also studies activity and gender bias in coordination.

Link: http://www.cs.cornell.edu/~cristian/Echoes_of_power.html

Journals to watch

After listing a few conferences that may publish papers of my interest, I thought it could be a good idea to do the same for journals as well. I must confess that I do not read journal papers as often as I should. My current strategy has been to find interesting journal papers through conference papers, but it seems that many people are using the same strategy. As a consequence, some journal papers that are relevant to my research are not cited by relevant conference papers. I'm aware that many journal papers in computer science are just extended versions of conference papers. Nevertheless, even in these cases the journal paper may be worth reading because it gives more detail on a given study. To my surprise, there are a lot of journals out there. While some of them focus on a specific topic, such as Data Mining, others cover Computer Science in general. Journals about science in general (e.g., Nature, Science) were not included because I already check them occasionally.

Journal Publisher
Transactions on the Web ACM
AI Magazine AAAI
Communications of the ACM ACM
Computing Surveys ACM
Data & Knowledge Engineering Elsevier
Data Mining and Knowledge Discovery Springer
Decision Support Systems Elsevier
IEEE Computer IEEE
Information Processing Letters Elsevier
Information Systems Elsevier
Intelligent Systems IEEE
Internet Computing IEEE
Internet Mathematics Taylor & Francis
Journal of Intelligent Information Systems Springer
Journal of Machine Learning Research SPARC
Journal on Very Large Data Bases VLDB
Knowledge and Information Systems Springer
Machine Learning Springer
Parallel Computing Elsevier
Pattern Recognition Letters Elsevier
Proceedings of the National Academy of Sciences National Academy of Sciences
SIGKDD Explorations ACM
SIGMOD Record ACM
Social Network Analysis and Mining Springer
Statistical Analysis and Data Mining Wiley
Transactions on Database Systems ACM
Transactions on Intelligent Systems and Technology ACM
Transactions on Knowledge and Data Engineering IEEE
Transactions on Knowledge Discovery from Data ACM
Transactions on Management Information Systems ACM
Transactions on Pattern Analysis and Machine Intelligence IEEE

Sunday, January 1, 2012

Conferences to watch in 2012

2012 has just begun and this year I'm supposed to read a lot. So, I decided to make a list of conferences to watch in 2012. The list covers conferences on Data Mining, the Web, Information Retrieval, Databases, Artificial Intelligence, Machine Learning, e-Commerce, Recommender Systems, Digital Libraries, and Parallel and Distributed Computing. The plan is to read at least one paper every business day and to post small summaries of them here.


Conference Date
ACM Conference on Web Search and Data Mining Feb 8-12
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming Feb 25-29
International Conference on Extending Database Technology Mar 26-30
Conference on Uncertainty in Artificial Intelligence Aug 15-17
IEEE International Conference on Data Engineering Apr 1-5
World Wide Web Conference Apr 16-20
ACM SIGMOD/PODS Conference May 20-25
SIAM Conference on Data Mining Apr 26-28
IEEE Parallel and Distributed Processing Symposium May 21-25
Pacific-Asia Conference on Knowledge Discovery and Data Mining May 29-Jun 1
AAAI Conference on Weblogs and Social Media Jun 4-8
ACM Conference on Electronic Commerce Jun 4-8
ACM/IEEE Joint Conference on Digital Libraries Jun 10-14
ACM Web Science Jun 22-24
AAAI Conference on Artificial Intelligence Jul 22-26
International Conference on Machine Learning Jun 26-Jul 1
ACM SIGKDD Conference on Knowledge Discovery and Data Mining Aug 12-16
ACM SIGIR Conference Aug 12-16
Conference on Advances in Social Networks Analysis and Mining Aug 26-29
Conference on Very Large Databases Aug 27-31
Conference on Social Computing Sept 3-6
Conference on Practice of Knowledge Discovery in Databases Sept 24-28
Internet Measurement Conference Nov 14-16
Conference on Neural Information Processing Systems Dec 3-8
IEEE Conference on Data Mining Dec 10-13
ACM Conference on Information and Knowledge Management TBD
International Conference on Recommender Systems TBD