domingo, 20 de maio de 2012

Graph Kernels for Chemical Informatics

This is the first paper I've read about graph kernels, or kernels in general. The authors were affiliated to UCI at the time. The problem studied is mapping graph structures into a linear space. After reviewing the literature on graph kernels, the paper introduces three new ones (Tanimoto, MinMax, and Hybrid). An extensive evaluation shows that the proposed kernel perform well on chemical classification problems.

The basic idea behind kernels is shown in the above figure. Kernels are transformations from a non-linear input space to a linear space. In this new space, a linear classifier may recognize a separating surface. Considering a classification task, a linear classifier can be defined as follows:

Where <u_i,u> is the inner-product and the prediction is represented by the sign of f(u). The so called kernel trick consists of applying a linear method to a transformation of the data:

The inner-product is replaced by a kernel function k(u,v) for which the following properties hold.

Using the kernel approach, the linear classifier can be formulated as:

The kernel method consists of two fundamental steps:

  1. Computing the kernel function k (focus of this work)
  2. Computing the optimal manifold in the feature space (for the classification problem, it is the separating surface).

A convolution kernel is the application of two different kernels to parts (substructures) of the the input instances and a spectral kernel is a special case of a convolution kernel where feature vectors are derived by counting substructures in a given structure. Graph kernels are examples of convolution kernels.

In this paper, the authors are interested in graph kernels for chemical data. These graphs are 2D representations of molecules. Vertices and edges are labeled and edges are undirected.

In particular, the paper focus on kernels based on paths in molecular graphs. A path is a sequence of vertices such that there exists and edge between each consecutive pair of vertices. The label of path is composed of the labels of vertices and edges involved in the path according to the sequence defined by such path. Edges cannot be visited twice but a path can contain cycles.

The kernels introduced in this work are based on the concept of molecular fingerprinting, which are feature vector representations of molecules. A fingerprint is a bit-vector of limited size where bits are set according to limited depth-first searches on the graph (i.e. substructures are labeled paths). A hash value is computed based on each path and this value is used to initialize a random number generator. A number b of numbers are generated and then reduced modulo the size of the the bit-vector. Finally, the corresponding positions of the bit-vector are set as 1. The length of the paths is usually small and the cost of extracting all the paths is O(nm) where n is the number of vertices and m is the number of edges. Another similar fingerprinting method can consider all possible paths in case the size of the bit-vector is the number of possible paths in the graph. Also, path counts can be used instead of binary variables.

The proposed kernes are defined as follows:

where phi is the counting feature map.

Therefore, the hybrid kernel considers both common and the missing paths. The authors show that the three proposed kernels are Mercer kernels.

Fingerprints are generated efficiently using suffix trees. The proposed kernels are evaluated using the Voted Perceptron algorithm, but any other linear classification algorithm could be used as well.

All datasets used in the evaluation of the proposed kernels are sets of of chemical compounds. I will not give much detail on the particular results. The objectives of the classification tasks are predicting mutagenicity, toxicity, and anti-cancer activity on three datasets. The evaluation metrics applied are accuracy and Area under the ROC curve (AUC). The evaluation procedure is the leave-one-out. The Tanimoto and the MinMax kernels achieved the best or similar performance when compared against a pattern discovery algorithm, marginalized kernels, and other techniques proposed in the literature.

I think this paper is way more instructive than the papers I'm used to read. The concepts and formulations are easy to follow and the text is full of relevant details. I've learned the main concepts related to graph kernels, which was my primary goal. One thing that is not very clear to me is how you can separate the effect of kernel function and the linear classification algorithm in the study. In other words, what would be the implications of the use of another classification method?


sexta-feira, 18 de maio de 2012

Influence and Passivity in Social Media

Identifying influential users is the problem studied in this work. The authors are from Cornell University, EPFL (Switzerland), and HP Social Computing Labs, including Bernardo Huberman. The concept of passivity, which materializes the intuitive idea that some users are harder to influence than others is applied as means to provide a more accurate measurement of user influence in social media applications.

The study is based on a dataset from Twitter containing URLs posted during a 300 hours time interval ('http' was used as a keyword in the crawling process). They crawled 15M URLs and also metadata (e.g., followers and followees) about users that posted them.

According to the authors, passivity is a barrier to propagation. They show, as an evidence of passivity on Twitter, the different levels of retweet activity of users (see figure above). In other words, while some users retweet a lot, others do not retweet very often. Moreover, it is relevant to evaluate the relative influence of users on the whole network, instead of only considering direct influence (e.g., retweets). More specifically, this work is based on the following assumptions:

1. A user's influence score depends on the number of people she influences as well as their passivity.
2. A user's influence score depends on how dedicated the people she influences are.
3. A user's passivity score depends on the influence of those who she's exposed to by not influenced by.
4. A user's passivity score depends on how much she rejects other user's influence compared to everyone else.

An algorithm for computing user's influence score called IP (influence-passivity) is introduced. IP receives a graph G = (N,E,W) as input, where N is a set of nodes, E is a set of arcs, and W is a set of arc weights. The weight of an edge e = (i,j) is defined as the following ratio:

w(i,j) = influence i exerts on j / influence i attempted to exert on j

This influence can be based on several evidences, such as retweets and other co-occurrences of content in general. An acceptance rate (u_i,j and a rejection rate (v_i,j) are defined for arcs as follows:

The output of IP are the influence (I) and passivity (P) functions:

The IP algorithm solves this recursive formulation through an iterative process.

Therefore, according to IP, an influential user is someone able to get content propagated by passive users and a passive user is someone who rejects content from influential users.

They considered as baselines for IP the PageRank over a version of G where all edges are inverted, the Hirch Index (H-index), the number of followers of users, and the average number of retweets of users. Using data from, they evaluated how the scores were correlated with the number of clicks a URL gets. Results show that IP is better at predicting users' attention (i.e., number of clicks) then the other metrics for both graphs based on retweets and co-mentions. It is also shown that influence is not signficantly correlated with the number of followers (popularity) a user has.

In a case, study, the authors present the most influential and passive users in the dataset. The most influential user is mashable (Social Media Blogger). In general, influential users are somehow semantically right, although this kind of result is always prone to subjectivity.  Passive users are mostly aggregators and spammers (i.e., robots).

This paper is well-written and its contributions are very clear. I think the idea of passivity is interesting, but requires a deeper study. In fact, an open question is: How better is it to influence a passive than a non-passive user? In other words, let's say that one user has influenced 100 non-passive users while other has influenced 1 very passive one. What is better in practice? Also, is the number of clicks a URL gets a good measure of influence? I think it works just as the number of retweets.


quarta-feira, 16 de maio de 2012

Video Suggestion and Discovery for Youtube: Taking Random Walks Through the View Graph

I just finished reading this paper about video recommendation written by some googlers back in 2008. It seems like the authors were somehow involved with Youtube at the time, because it is used as both application scenario and main motivation. The idea of the paper is recommending videos based on a random walk through user-video (i.e., bipartite) graph.

Video recommendation is a key problem for Youtube and other video distribution systems. However, the volume of data that such systems have to deal with together with a highly skewed popularity distribution and the dynamic nature (i.e., new users and content all the time) of the interactions involved produce specific requirements, some of which are addressed in this paper. The basic idea is recommending videos based on co-views (i.e., the way in which videos are viewed by the same users). As an example, because videos 1 and 2 were co-viewed by users A and B, we may recommend video 1 to users who have watched video 2, and vice-versa.

The family of recommendation algorithms introduced by the authors are based on a general adsorption algorithm. They argue that this algorithm does not have certain undesired properties that nearest neighbor algorithms may have due to their particular distance metric (e.g., shortest path, commute time). Some of these metrics may produce potentially bad recommendations because of high-degree nodes - several paths are expected to pass through these vertices -, may be too expensive to compute or not admit incremental updates. They describe three views of this general adsorption algorithm: the first is based on averaging, the second on random walks and the last one is a linear system formulation.

Adsorption via averaging: The input graph is G = (V,E,w), where V are nodes from the bipartite graph, edges represent views and w are weights. Shadow vertices V' are created for each video in V. Moreover, we associate a probability distribution over the set of videos for each vertex. Vertices from V have no initial distribution defined. Instead, an edge from v' in V' to its respective vertex in v is inserted in the graph. The algorithm works as follows:

Adsorption via random walks: If we reverse the edges from G, the above algorithm is equivalent to a random walk algorithm over the bipartite graph. Shadow vertices become absorbing states. Simple as that.

Adsorption via Linear Systems: This part is very interesting. The authors formulate a version of the averaging algorithm through a linear system. Pretty cool.

The authors also propose the use of injection and dummy probabilities as means to improve the proposed recommendation model. The injection probability is basically a way to reduce the probability of reaching a shadow node corresponding to a popular video. Other properties of the video may also be considered in order to set a proper injection probability. Dummy probabilities allow a random walk to be abandoned in case a node corresponding to a popular video or a very active users are reached.

The proposed algorithms are evaluated using a dataset with 1.3 million videos and 1.1 million users from Youtube. Training and test sets (50/50%) were defined according to view time. Evaluation metrics are based on average precision and recall along ranks of videos for each user. Three baselines are considered:

Global popularity (GP): Recommends videos based on popularity;
Local popularity (LP): Recommends videos based on popularity and co-views;
Local rare (LR): Similar to LP, but weights co-views using Jaccard coefficient of the shared users.

Two variations of the proposed algorithm, both implemented as the averaging version, are evaluated:

ADSORPTION-N: Without dummy probability
ADSORPTION-D: dummy probability set as 1/4 (don't follow, is it constant?).

Results show that LP is the best baseline algorithm in general. ADSORPTION-N and ADSORPTION-D seem to achieve similar recall and precision values. However, a deeper analysis show that, in fact, ADSORPTION-D is more effective than ADSORPTION-N for more active users (the more active, the better). For users with a single view, GP makes the best recommendations. For a small number of views, LP is the best algorithm. Because the distribution of the number of views per user is skewed, these characteristics were not noticed at first.

This is a good paper. Well-defined problem, intuitive solution, conclusive experiments. The authors sometimes focused on details that are not essential to understand the paper and this particular parts are kind of boring. Also, I'm not sure whether the evaluation procedures are the standard or not.


domingo, 13 de maio de 2012

TwitterRank: Finding Topic-sensitive Influential Twitters

This paper is not really new to me. But it became more important because of a project I've been working on during the last weeks, so I decided to read it carefully this time. It did not take me more than one hour to read it and get its fundamental contributions. Most of the authors of the paper are from Singapure Management University. There is also an author from Pennsylvania State University. I've never heard about any of them, but the particular paper is somehow famous because it was one of the first academic works on Twitter.

The problem studied in the paper is identifying influential users on Twitter. The authors apply a strategy very similar to the PageRank algorithm. However, the proposed algorithm considers also topics extracted from tweets. The idea is taking into account user's interests in the computation of influence. The use of topics is motivated by the topic homophily on Twitter (i.e., users that are connected through following or reciprocal interactions have a more similar distribution of topics than randomly selected users that do not interact in these ways).

The study is based on a small dataset (6k users, 1M tweets) crawled in 2009. All users crawled are from Singapore and the selection criteria is the number of followers. Even the authors agreed that their sample is biased, what probably have affected their conclusions. I understood that they call friendship a reciprocal following interaction on Twitter. They show that the distributions of number of tweets, number of followers and number of friends per user are heavy-tailed. Moreover, there is a very strong correlation between the number of friends and followers of users.

One of the main contribution of the paper is being the first paper to show that homophily occurs on Twitter. Basically, they compute each user's topic distribution based on tweet content using LDA. I may read a more detailed paper on LDA soon, so I will not describe it here. Given two distributions, they apply Jensen-Shannon divergence as a comparison metric, which is an extension of the Kullback-Leibler divergence. The divergence is based on the average of the distributions, and on weighted logarithm of their ratios. They evaluate homophily between followers and friends using statistical tests, showing that topics of connected users are correlated significantly.

Next, they propose TwitterRank, which is a ranking algorithm based on both network structure and topics. TwitterRank builds a transition matrix for each topic. Transitions between users for a given topic are based on topic similarity and also number of tweets. More precisely, TwitterRank gives more weight to transitions to users that generate a higher fraction of the tweets a given user has access to (i.e., tweets that appear in the timeline). They consider it to be a feature, what I disagree. Further, they say/confess that this "feature" makes them more sensitive to malicious behavior. Similar to PageRank, they also make use of a teleportation matrix and a damping factor. Topic-specific transitions can be aggregated based on topic probability (general influence) and user topic probability (perceived influence).

The evaluation section begins with an overview of the most influential users regarding the most popular topics. The discussion is highly subjective and does not really add much to the study. They also compare TwitterRank scores against indegree, PageRank, and Topic-sensitive PageRank (to my surprise it only biases the teleportation probability according to the topics). TwitterRank scores are different from other methods, what is shown using Kendall's rank correlation. Finally, they use TwitterRank in a followee recommendation task. The results are not very conclusive, since TwitterRank is sometimes outperformed by other simpler methods. Also, this section is not very clear due to the many different selection criteria for users employed.

I've never been a big fan of this paper. Apart from using Twitter, they are not very innovative. Moreover, I think the flow of ideas is not very clear in general. Finally, the dataset is not representative, as I've discussed before. One good point, is that the authors seem to be aware of most of the issues I've pointed out here.


terça-feira, 8 de maio de 2012

Proximity Tracking on Time-Evolving Bipartite Graphs

This paper studies the problem of computing proximity and centrality in bipartite graphs that evolve over time. The researchers involved in this work are from Carnegie Mellon University, including Christos Faloutsos, IBM T.J. Watson Lab, and University of Illinois at Chicago (Philip Yu). It won the best paper award in the 2009 edition of the SIAM Data Mining Conference.

It is easy to think about application scenarios where entities can be modeled through a bipartite graph and the proximity/centrality of these entities is somehow relevant. The example used through the paper is a graph that connects authors to conferences. In this case, it is useful to compute the proximity between authors (i.e., how related their work are in terms of the conferences the publish in), authors and conferences (i.e., how important is a conference to an author), conferences and authors (i.e., how important is an author to a conference) and conferences (i.e., how close are the research communities that publish in two conferences). Based on proximity, it is easy to leverage the centrality of users and conferences as the average proximity from all conferences and authors to a specific author or conference. The following Figure shows the centrality of top-centrality authors in NIPS and Philip Yu's top conferences in a 5-year window.

The authors define proximity in terms of random walk with restart. In a static setting, the proximity matrix Q is defined as follows:

Q = (1-c).(I_{(n+l)x(n+l)}-cD^-1W)^-1

where (1-c) is the restart probability, I is an identity matrix, D is a degree matrix (i.e., it contains the degree of the vertices in the graph), and W is an adjacency matrix (the graph is undirected and weighted).

In a dynamic setting, which is the focus of the paper, the matrices D and W are dynamic. The authors propose three updating schemes for W: a global (i.e., new edges are simply added), a sliding window (i.e., only edges inside the current time window are considered) and an exponential weighting (i.e., more recent edges have their weights amplified exponentially). Regarding the matrix D, the authors propose the use of a constant matrix that is a times the first matrix in order to guarantee the monotonicity of the proximity measure.

In a previous work, some of the authors of this paper have shown how the proximity can be computed for general graphs efficiently if these graphs present linear correlations and block-wise community structure. The basis of this idea is the Sherman-Morrison lemma, which gives a formulation for the inverse of a sum of matrices. Instead of inverting a (n+l)x(n+l) matrix, the formulation allows the pre-computation of the inversion of a lxl matrix A (l is supposed to be much smaller than m), defined as:

A = (I - c^2.Mc.Mr)^-1

where Mc = D^-1.M and Mr = D^-1.M^T.

Given the matrix A, proximity can be computed as follows:

if i <= n and j <= n (e.g. author-author):
q(i,j) = 1(i=j) + c^2.Mr(i,:).A.Mc(:,j)

if i <= n and j > n (e.g. author-conference):
q(i,j) =  c.Mr(i,:).A(:,j-n)

if i > n and j <= n (e.g. conference-author):

q(i,j) =  c.A(i-n,:).Mc(:,j)

if i > n and j > n (e.g. conference-conference):

q(i,j) =  A(i-n,j-n)

Computing the matrix A for every update of the bipartite graph in a dynamic setting, which is O(l^3+m.l), is prohibitive. So, the authors propose two solutions:

1) Fast-single update: The matrix A is updated based on the a single edge update.
2) Batch update: For a set of edge updates, instead of performing several single updates, the authors propose an alternative strategy. Since the updates are expected to involve a small set set of records (authors and conferences), the idea is to explore the low-rank property of the update matrices. I did not go into the details of method.

The proposed techniques are applied in two scenarios:
1) Track-centrality: Tracking the top-10 most important type 1 our type 2 nodes over time.
2) Track-proximity: Tracking the top-k most relevant type 1 or type 2 objects for a particular query object A over time.

They apply track-centrality and track-proximity to 5 datasets. NIPS is based on the proceedings of the NIPS conference. Type 1 nodes are authors and type 2 nodes are papers. DM, AC, and ACPost were extracted from DBLP. DM is an author-paper graph for a set of data mining conferences. AC is an author-conference graph from the entire DBLP. ACPost is a similar graph, but the timestamps have a finer granularity (date in which the paper was entered into the database). Netflix is a user-movie graph from the Netflix prize and has a single timestamp.

The applications described in this paper are much more interesting than those from the previous paper (Fast Random Walk with Restart and its Applications). They show the top-10 most influential authors for NIPS over time and the top-5 most related authors for Jian Pei (from SFU). Moreover, they present the proximity rank of KDD w.r.t. VLDB over time, which have been increasing steadily since 1996 (KDD was the 5th closest conference to VLDB in 2007, behind SIGMOD, ICDE, PODS, and EDBT). I will not discuss all case studies here.

In terms of performance, they show that Fast-single-update is up to 176X faster than straight update (i.e., re-computing A for every update). They also show that Fast-batch-update is up to 32X faster than straight update. I missed a comparison between Fast-single-update and Fast-batch-update.

This paper is very good. The authors were able to navigate from a well-defined problem, to a complex but efficient algorithm, and then to cool application scenarios. I have not seem many successful cases like this so far. I think that some examples and general intuitions could help me in understanding the math applied in the paper. Also, I had to read a previous paper to have a better idea of the contributions of this one, i.e. it is not self-contained.


segunda-feira, 7 de maio de 2012

Fast Random Walk with Restart and its Applications

After the longest hiatus in the short history of this blog, I'm back! April was not exactly what I would call a following-the-reading-plans month. First, I got stuck in the middle of a paper and , then, I had two deadlines in a row. One of the lessons I've learned so far is that, considering the kind of papers I've been trying to read, one paper a week may be too much for me. Reading one (non-trivial) paper a day is just insane. To bring my guilty back to recommended levels, I will try to read at least 10 papers this month. Let's see how it works.

This is not the first time I've read this paper. The first one was in the very beginning of my M.Sc studies, when I was trying to find a research problem. This time, I think I've got I much better picture of the the contributions of this work. The authors were affiliated to the Carnegie Mellon University at the time (2006). The problem addressed is performing a random walk in a graph efficiently, specially in terms of query time. Random walk with restart gives a very interesting measure of proximity between two vertices. Nevertheless, existing methods may not be applicable to some real-life constraints. With this in mind, the authors propose some matrix manipulations that are the basis for two algorithms for random walk in graphs. The solution is not exact, but the authors show that it gives a good compromise between accuracy and execution time in real application scenarios.

Random walk with restart (RWR) works as follows. Given a normalized matrix W that represents a weighted graph and a probability c that the vertex will return to its original node, the probability of a random walk to pass through each vertex of the graph is given by:

r_i = cWr_i + (1-c)e_i

where e_i is a nX1 vector with 1 in the i-nth position and 0 otherwise. The vector r_i (nX1) gives the relevance scores for the n vertices in the graph. We can re-arrange the above equation as the following linear system:

r_i = (1-c)(I-cW)^(-1)e_i

The straightforward way to solve this system is by inverting the nXn matrix I-cW. The inversion is only required once, and many proximity queries can be computed with small cost afterwards. However, this operation requires O(n^3) time, which may be prohibitive. The most common solution in this case is using an iterative method (power method) to get an approximate  r_i for each query (vertex i). The authors call the first and the second aforementioned methods as PreCompute and OnTheFly, respectively. A third alternative is approximating W using low-rank approximation, which means applying a simpler representation of W based on its linear correlations. This method is called Blk.

The proposed method is based on the idea of decomposing W into two matrices based on its partitions. The authors apply the METIS algorithm, but other techniques may be applied as well. The first component of W (W_1) is composed of edges inside partitions and the second component (W_2) contains edges between clusters. Therefore, the method is making use of the community structure that is expected to appear in real graphs. W_1 can be represented by many small matrices (1 for each partition)   with a lot of 0's representing edges that cross different partitions. W_2 is decomposed into three matrices (W_2 = USV) using low-rank approximation. Given W_1 and W_2, we can re-formulate the random walk expression as:

r_i = (1-c)(I - cW_1 - c USV)^(-1)e_i

According to the Serman-Morrison lemma, we have:

r_i = (1-c)( ( I - cW_1 )^(-1) e_i + c ( I - cW_1 )^(-1) U (S^(-1) -cV( I - cW_1 )^(-1)U)^(-1) V ( I - cW_1 )^(-1) e_i)

Simplifying, if Q_1^(-1) = ( I - cW_1 )^(-1) and A = (S^(-1) -cVQ_1^(-1)U)^(-1), then:

r_i = (1-c)( Q_1^(-1)  e_i + c Q_1^(-1)  A V Q_1^(-1)  e_i)

In other (non-mathematical) words, now r_i can be approximated based on the inversion of A (a tXt matrix, where t is the rank of the low-rank approximation) and some other simple operations.

Based on this formulation, two algorithms are proposed: B_LIN and NB_LIN. NB_LIN is a simplified version of B_LIN where each vertex in W is considered as a single partition (i.e., low-rank approximation is directly applied to W).

Jumping straight to the results, the proposed algorithms are applied in two image retrieval tasks and one neighborhood formulation task. I found the results section boring, so I did not really tried to understand what the tasks really mean. Basically, the accuracy of the methods are at least 90% of the optimal ones. Moreover, they achieved up to 150x speed up in response time compared to OnTheFly, and up to 479x speedup in pre-computation time compared to PreCompute.

My general view on this paper is that it is theoretically strong for a CS but maybe not for a Mathematics paper (I would like to hear an expert's opinion on this). Also, the paper lacks elegance and didactic, though I know these are very subjective matters. The results section is difficult to understand and I missed a comparison against other fast random walk computation alternatives. Anyway, the paper was worth reading to give me a historical perspective on my understanding of more theoretical data mining papers. I've spent some hours following the mathematical formulations throughout the paper, which I consider to be its biggest contributions.