terça-feira, 27 de março de 2012

Content-based Modeling and Prediction of Information Dissemination

This paper was written by two researchers from University of California, Santa Barbara, and is about the problem of predicting direct links that express communication using content information. It can be seen as a variation of the traditional link-prediction problem where extra information w.r.t. communication flows (aka threads) is available. The idea is to apply Latent Dirichlet Allocation (LDA) in order to learn topics based on content features and then use topic distribution to predict the graph structure.

Link prediction is a very interesting problem and the scenario presented in the paper is novel. To illustrate it better, lets consider a network extracted from email communication as example. In this network, each node is an address and links are messages sent. Therefore, the content of the message may be useful in the discovery of its list of recipients. In order to make use of such information, they propose the GC-Model (don't know what GC means). The content (words) of messages are represented as a "word document" and the list of possible subgraphs, with limited size, of the graph that describes the messages in thread as a "graph document". Therefore, each thread produces two documents. This nomenclature matches the concepts involved in LDA and other topic models.

The GC-model works as follows. Topic distributions are learned from content using LDA (I won't try to explain LDA here, but it is basically a generative model for the identification of topic, which are sets of words, based on documents). This is the content part. The probability of a graph word appearing in a document and the topic mixture of this document are related as follows:

P(word) = Sum_{i=1 to t} P(word | T_i).Q(T_i)

Where Q(T_i) is the topic probability in the given document and P(word | T_i) is defined using Bayes' theorem:

P(word | T_i) = P(T_i|word).P(word) / P(T_i) 

Where T_i is a topic. Each component of the formula is computed as follows:

Topic distribution: P(T_i) = Sum_{j=1 to n} P(T | Thread_j).P(Thread_j)

Thread probability: P(Thread_j) is computed based on the time between threads and their percentage of link overlap, which are binned and fitted to an exponential. Roughly, the probability of a thread structure occurring again decreases exponentially with the time since its most recent occurrence.

Word probability: P(word) = Sum_{j=1 to n} P(word|Thread_j).P(Thread_j)

Topic probability given the word: Estimated using sampling.

Based on this formulation, the probability of a graph word can be computed based on the thread topic distribution. Therefore, given training data (i.e., a set of training thread), this model can be used to predict communication structures in new threads based on their content.

The authors apply the proposed method to three real datasets: The Enron email network (threads are based on email subjects, 5k threads), and two Twitter datasets (threads defined based on replies and retweets, 60k and 400k threads). The length of graph words selected (empirically) is 2 and the evaluation metric used is F1 score. Baselines are standard metrics applied in link prediction (e.g., common neighbors), a naive technique that always predicts the most frequent link and other based on text distance, and the PropFlow (unsupervised and somewhat similar to PageRank) algorithm. The GC-model outperforms all the baseline methods.

The general idea of this paper is good, but I think it should be developed a little further. Regarding the presentation, I think it would be greatly improved by removing some pseudo-codes and adding more text to it. The experimental section is not very strong, specially because it seems like unfair comparing PropFlow and the other baselines against the GC-model, which is the only one that is supervised. Moreover, there is no much evidence that supports why GC-model is good.

Link: http://www.cs.ucsb.edu/~kpm/ASONAM2011_Macropol_Singh_final.pdf

quinta-feira, 22 de março de 2012

2012 ICWSM Conference: Accepted Papers

Here is a list of potentially interesting papers to be published in the proceedings of the ICWSM conference this year:

Learning the Nature of Information in Social Networks
Event Diffusion Patterns in Social Media
Virality and Susceptibility in Information Diffusions
Temporal Motifs Reveal the Dynamics of Editor Interactions in Wikipedia
Modeling Spread of Disease from Social Interactions
Modeling Destructive Group Dynamics in On-line Gaming Communities
What Were the Tweets About? Topical Associations between Public Events and Twitter Feeds
Cross-Community Influence in Discussion Fora
Have You Heard?: How Gossip Flows through Workplace Email
The Emergence of Conventions in Online Social Networks
Modeling Diffusion in Social Networks Using Network Properties
The YouTube Social Network
On the Study of Social Interactions in Twitter
Modeling Polarizing Topics: When Do Different Political Communities Respond Differently to the Same News?
SearchBuddies: Bringing Search Engines into the Conversation

Twitter is still hot (but for how long?). Many papers on social influence/propagation/diffusion. The fact that most of the papers are about online social networks is something kind of scaring, seems like an academic bubble.

Link: http://icwsm.org/2012/program/program/

terça-feira, 20 de março de 2012

Output Space Sampling for Graph Patterns

No doubt that VLDB has published some great papers recently. This one is certainly part of my top list. It was written by some colleagues from RPI. The problem studied is subgraph pattern mining. The authors propose an output sampling algorithm based on Metropolis-Hastings that is able to produce graph patterns according to a general purpose interestingness metric (e.g., support) and threshold (e.g., minimum  support). Moreover, you can also bias the sampling method towards a given distribution. The benefits are: (1) Performance improvement over traditional graph mining algorithms, and (2) reduction of the output volume. Graph mining, specially frequent graph mining, is a problem that is not well-solved in the literature (the input graphs considered are too small) and sampling seems to be an interesting solution for it. The work is theoretically sound and has potential to affect data mining significantly in the long term.

The (generic) subgraph mining problem consists of, given a graph database (multiset of graphs), and an interestingness function F->R, return all subgraphs from the database that satisfy the given threshold. The proposed method samples the output space of a subgraph mining algorithm without generating the whole search-space. This approach is completely different from what has been done in the literature so far.

I will skip the formal definition of what is a graph, a subgraph, isomorphism, etc. and go straight to the point. The idea of the output sampling is performing a random-walk on the search-space of subgraphs, which can be seen as a Partial Order Graph (POG). However, this random walk must be smart. This is when the Metropolis-Hastings (MH) algorithm arrives. MH enables the generation of samples following a given target distribution. MH is a Markov chain Monte Carlo (MCMC) method developed in 1970. One important characteristic of this method is that it depends only on the current state (Markov property).

MH applies an arbitrary jumping function Q, which, given a current value x_t, returns a new value x_{t+1}. Q must be symmetric, i.e. Q(x_{t+1} | x_t) = Q(x_t | x_{t+1}).  The procedure starts with an initial value x_0 and then enters into a loop where new values are generated based on the current value x_t and Q.  Given a = P(x_{t+1})/P(x_t), if a >= 1, then x_{t+1} is accepted, otherwise, x_{t+1} is accepted with probability a. P is the target distribution. The method works better if Q is similar to P. There are some conditions under which MH works, the authors consider the condition that P must be stationary (i.e., P will converge after some iterations in the Markov process).

The state space of the MH algorithm for sampling the output space of subgraph patterns is the Partial Order Graph (POG). This graph is build only locally and the authors prove that a random walk on POG converges to a stationary distribution (it is finite, irreducible, and aperiodic).

First, it is shown how to obtain a uniform sample of frequent patterns, what means that any frequent pattern has the same probability to be returned. In order to do so, it is necessary to create proper transition probabilities in the POG. In fact, they show that if the transition matrix is symmetric, then the stationary distribution is uniform. P(u,v), where u and v are states (or vertices in the POG) are defined as follows:

P(u,v)        =                          1/2(*max(d_u,d_v))                                       if u != v and v in adj(u)
                      3/2- Sum_{x in adj(u)} P(u,x)                        if u = v
                        0                                                                    otherwise

The MH algorithm is adapted using P(u,v). The algorithm receives a pattern (in practice an empty pattern, which is part of the POG), the minimum support threshold, and the number of iterations to be performed. The acceptance of a transition is based on the degree of the current vertex and the degree of the new vertex. The number of transitions depend on the number of iterations and the final pattern found is returned as output. In this process, only neighbors of the current pattern are generated (i.e., sub- and super-patterns) and the support counting for these patterns is performed using standard isomorphism checking in a pass over the database.

The authors show also how to perform a support proportional sampling (i.e., a sampling biased towards the most frequent subgraphs), which requires a modification of the transition function P(u,v). However, since they are still working with the POG, they must relate this interestingness function with the degree of its vertices. Basically, they propose a transition function where choosing an up movement (towards the empty pattern) is more likely than choosing a down movement in some extent. Changing the algorithm to consider a discriminatory metric is easy, since it requires only a small change in how new vertices are accepted.

The results are divided into two parts. First an analysis of the patterns using a small dataset (AIDS antiviral screening dataset) is presented. It is shown that the method generates a uniform distribution of patterns if desired and it converges after 20k iterations. The authors also show that the distribution of patterns is still a bit skewed towards high degree vertices in the POG. Using an transaction dataset, for which the POG is denser, they show that their method achieves a more uniform distribution. In terms of support-biased sampling, the method also works well, but it is necessary to select a good value for up/down transition probabilities. Discriminatory subgraph sampling is evaluated using a dataset of chemical graphs. The results in this case are not very conclusive. Although many high delta score patterns are found, the correlation between delta score and visit count does not seem as high as for the support-biased case. In the evaluation using large graphs, a protein-interaction and a cell-graph database are applied. Most of the existing algorithms take more than 2 days to process such datasets and the output space sampling algorithm is shown to be a good alternative.

The technique proposed in this paper is very innovative and relevant. In the light of what has been done so far in graph mining, this is a big step towards the analysis of real graph databases. One thing that I did not understand very well is why the approach proposed for support-biased sampling is different from the one that considered the significance of patterns. In fact, after finding out that the significance-biased technique does not work very well I start to wonder how general the technique is in practice. I would like to know the line of thought that brought the authors from the existing work to this sampling idea, it is really different from everything I've seen in this area.

Link: www.vldb.org/pvldb/2/vldb09-713.pdf

quinta-feira, 15 de março de 2012

PageRank: Standing on the shoulders of giants

This paper presents the PageRank algorithm from a historical perspective. PageRank is an algorithm for ranking web pages in terms of relevance based on the structure of the web graph. It was proposed by Sergey Brin and Larry Page in 1998 and has been applied by the Google search engine. This paper is authored by Massimo Fraceschet, from University of Udine, Italy, and was published by the Communications of the ACM magazine.

The HTML protocol, which is the basis for the Web as we know, was invented in 1989 by Berners-Lee while he was working at CERN. PageRank was one of the first efforts on the use of the web topology (that emerges from links between pages) to measure the importance of a page. The basic idea of PageRank is that a web page is important if it is pointed to by other important links. The formulation of the score is as follows:

PI = alpha.PI.S + (1-alpha).u

Where alpha is the damping factor, S is the adjacency matrix of the graph after dangling pages (i.e., without output links) have their output probabilities set as uniform, and u is a uniform probability vector.

The PageRank equation is known to have a unique solution. The existence of at least one solution comes from the fact that the transformed adjacency matrix is stochastic (i.e., all rows sum up to 1). Moreover, according to the Perron-Frobenius theorem,  the equation has a single solution.

Perron-Frobenius theorem: If A is irreducible (i.e., its associated directed graph is strongly connected) nonnegative square matrix, then there exists a unique vector x such that xA = rx, x > 0, and Sum_{i} x_i = 1, where r is the maximum eigenvalue of A in absolute value.

The solution to the PageRank equation can be obtained using the power method, which computes the eigenvector and eigenvalue of a matrix through an iterative process. An initial vector PI^{0} is set to u (uniform vector) and PI^{k+1} = PI^{k}. G, where G is the google matrix. The result converges after a limited number of iterations that depends on the damping factor alpha.

After presenting PageRank, the author discusses a list of very similar methods in Computer Science, Bibliometrics, Sociology and Economics. One of these methods is HITS (Hypertext Induced Topic Search), developed by Jon Kleinberg. HITS is based on the concept of authorities and hubs. Good authorities are pages that are pointed to by good hubs and good hubs are pages that point to good authorities. Given an adjacency matrix L, the authority (x) and hub (y) vectors are defined as follows:

x^k = L^T.y^{k-1}
y^k = L.x^k

These equations can be solved using the power method, like was discussed for PageRank.

I wont discuss the other methods in detail. The point is that ideas very close to PageRank and HITS were developed decades before in other areas.

Bibliometrics (1976): A journal is influential if it is cited by other influential journals.

Sociometry (1949): A person is prestigious if he is endorsed by prestigious people.

Econometrics (1941): Highly remunerated industries are those that receive substantial inputs from highly remunerated industries.

This is a great paper. It is funny how recurring ideas appear somehow independently in some different areas. In the case of PageRank and HITS, the research was developed almost in parallel. Something that I ask to myself sometimes is whether I should know Math a lot or just be able to model the a given problem and then search for the mathematical solution. I've heard somewhere that the second approach is better.

 Link: http://arxiv.org/abs/1002.2858

quarta-feira, 14 de março de 2012

Trust Transitivity in Social Networks

Here goes another paper written by Physicists. This paper studies trust in a social network where trust relationships are transitive. Understanding trust is a very relevant problem in society. In particular, several web applications, such as e-markets and P2P file systems, rely on trustful users in order to provide quality service. Trust transitive plays a special role in non-centralized scenarios, i.e. when there is not a central source of trust available.

Two trust metrics based on transitivity are proposed. The first one, called best trust, is defined as follows:

s_{u,v} = max{PROD_{e_i} c_{e_{i}}}, for all {e_i} in P{u=>v}

According to this metric, the trust of u towards v is the maximum trust product of the direct trusts over the path P{u=>v}is the set of all paths between u and v, {e_i} is the set of edges in a given path, and c_e is a value of direct trust associated to an edge. This metric can be computed using a shortest path algorithm, but the authors argue that it is biased towards high values of trust, so they propose another metric, called pervasive trust:

t_{u,v}=(Sum_w A_{w,v}.(s_{u,v}^{G\{v}})^2.c_{w,v})/(Sum_w A_{w,v}.s_{u,w}^{G\{v}})

Where A is the adjacency matrix, and s_{u,v}^{G\{v}} is the best trust from u to a neighbor of v (v is excluded). In this case, several paths are considered, instead of a single one. Moreover, this metric does not tend to 0 when N approaches to infinite.

The authors give a short discussion on the Eigentrust and the trustWebRank, which are other trust metrics in the literature. These metrics are based on the concept of feedback centrality and are calculated by solving some linear system. Eigentrust requires a stochastic matrix (i.e., the sum of the trust values of the out-edges of all vertices must sum to 1) and the inferred trust values are given by the steady state distribution of the corresponding Markov chain (i.e., the left eigenvector of the stochastic matrix must have an unity eigenvalue). Therefore, the trust values are not personalized and the absolute values of trust are lost. TrustWebRank is based on the PageRank algorithm and also requires a stochastic matrix but is personalized. It also depends on a damping factor that eliminates the contribution of longer paths.

Properties of the two trust metrics presented are studied theoretically. The authors consider a random direct network with arbitrary degree distribution and direct trust values between c and c+dc given by an arbitrary distribution. In this scenario <t> = <s><c> in the limit of large network size. The discussion seems to be heavily based on the paper "Random graphs with arbitrary degree distributions and their applications" that I've not read yet. Therefore, I will not cover the math here. Basically, the authors give a formulation for <s> and compute it through numerical simulations and an analytical solution. In- and out-degree distributions following a Poisson and a Zipf are studied. The authors found a first-order transition from vanishing to positive trust values that correspond to the formation of a giant component in the network when the degree distributions are Poisson. For a Zipf distribution, the transition is of second order and the average trust is smaller than for the Poisson distribution.

Trust transitivity is also studied using data from the PGP (Pretty Good Privacy) network, which is a mechanism for users to attach a signature to the public key of another user for cryptography purposes.  The largest connected component of this network has 40k vertices. It is shown that this network does not grow in a purely random fashion because the power-law distribution of the waiting time between new keys and signatures. Degree distributions seem to follow a power-law. Moreover, the in- and out-degree of  vertices are strongly correlated and the reciprocity is very high. The degree correlation is assortative for intermediary degree values and dissortative for very high and very low degrees. Two paradigms of trust organization are studied: Authority (trust relies on a few universally trusted vertices) vs. Community-based (densely connected components provide trust to each other). The authority-based trust is more efficient but also more prone to forgery. The community-based trust has opposite characteristics. These two mechanisms seem to work together in the PGP network. In order to identify the communities, the authors a modularity maximization algorithm is applied. Two types of communities were found: one that represents the vicinity of an authority vertex and another that represents a "real" community. As means to understand trust transitivity in the PGP network, the authors assign a direct trust c=1/2 to all edges except for a fraction gamma of edges to which c=1. Three scenarios are considered: 1) Edges are selected at random (Random); 2) Edges with the largest betweenness are selected (Authority-centered); and 3) Edges with largest edge clustering are chosen (Community-centered). The authority-centered trust leads to higher values of average trust (<s> and <t>). On the other hand, the community-centered trust leads to lower values of average trust.  Moreover, for random trust distributions, vertices with large degree achieve higher <s> but have uniformly distributed <t>, as desired. For authority-based trust, both small and large degree vertices achieve high trust values. For community-based trust, vertices with intermediary degree achieve higher trust than the small and large degree ones.

This paper is very interesting and relevant. It is similar to another Physics paper I've read some weeks ago. This pattern of studying complex features of simple models seem to be a standard for these kind of paper. In the case of this particular paper, a real dataset (the PGP network) is also analyzed. I found some minor typos that were probably corrected in the "official version" of the paper (this one is from ArXiv). I think that the formulation for <s> is not clear. I will read the paper "Random graphs with arbitrary degree distributions and their applications" in the near future in order to understand the formulation in detail.

Link: http://arxiv.org/abs/1012.1358

terça-feira, 6 de março de 2012

The Role of Social Networks in Information Diffusion

This paper was accepted in the next WWW conference, to take place in Lyon, France. Its authors are from Facebook and University of Michigan (Lada Adamic). The idea of the paper is covering a gap in the study of social influence and information propagation in social networks, which is the fact that most of the existing studies do not distinguish influence/propagation from other sources of correlation (e.g., homophily). Although this problem has been known for years, previous studies had no access to a large dataset that could enable this kind of analysis. Thanks to Facebook, these guys performed an experiment in order to quantify the effect of social relationships in the diffusion of information.

Given a pair of users (A,B) that are friends, the researchers considered two scenarios in which A posts a URL on Facebook: (1) the URL does not appear in B's news feed, and (2) the URL appears in B's news feed. For a large set of pairs subject-URL selected at time of display, the researchers were able to randomly select whom were going to be exposed to the URL and whom were not going not. The number of subjects, URLs and subject-URL pairs considered were 253M, 75M, and 1B, respectively. The collection period ranged from August 14th to October 4th.

Social exposure x diffusion: In order to account for the possible effect of bias towards frequent URLs or active users, the authors considered the bootstrap average by URL to compute the likelihood of sharing for the feed and no feed group. Bootstrapping is a statistical technique for assessing the variation in estimate based on sampling. They found that The likelihood of sharing is 0.191% for the feed and 0.025% for the no feed group. Individuals in the feed condition are 7.37 times more likely to share.

Temporal clustering: Subjects in the feed group share the information sooner than those from the no feed one. The median sharing latency is 6 hours for the feed and 20 hours for the no feed group.

Multiple sharing friends: One important aspect in the literature on social influence and information diffusion is the effect of the number of active/affected/sharing friends.  The authors found that sharing probability increases with the number of sharing friends in both the feed and no feed groups. However, the ratio between the shared probability  in the feed and no feed group decreases with the number of sharing friends.

Tie strength: The frequency of private communication, frequency of public online interaction, number of appearances of users in the same photo, and the number of times users comment on the same posts were considered as tie strength measures. Data from three month prior to the experiment was applied in this evaluation. The goal is to bring back the concepts of strong and weak ties, from a seminal work developed in 1973 called The Strength of Weak Ties, in which was found that weak ties play an important role in social networks. The results presented in this paper also showed that the strength of ties is positively correlated with the probability of sharing for the feed and no feed conditions, but this effect diminishes with tie strength.

Collective impact of ties: On the same topic of the previous section, here the authors compute the effect of exposure to information shared by friends with strength k, which is computed by the following equation:

ATET(k) = p(k,feed) - p(k,n feed)

Where p is the probability of sharing. This value is multiplied by f(k), which is the fraction of links displayed in all users' feeds posted by friends with tie strength k. Ties for which k = 0 were classified as weak and those with k > 0 were classified as strong. The results show that most of the diffusion occurs through weak ties.

This paper is not bad. Handling a dataset like the one used in this paper is an once in a lifetime opportunity for an ordinary researcher. I'm just a little bit disappointed because when I read the tittle and the list of authors, I pictured a better paper in my head. I was expecting a more rich statistical analysis, a bunch of cool visualizations, and also some kind of influence/propagation model based on the data. I did not buy the sections about tie strength, which were probably intended to be the high point of the paper.

Link: http://arxiv.org/abs/1201.4145

segunda-feira, 5 de março de 2012

How (not) to predict elections

This paper is a very simple, but still interesting paper, that brings skepticism on the current hype about making predictions, specially wrt elections, using social media data. I would say that it is somehow similar to the paper 'Predicting Consumer Behavior with Web Search', summarized here 2 months ago.  The authors are from Wellesley College (US) and Universidad de Oviedo (Spain), and I confess I have never heard about any of them.

Several recent papers have reported positive results of the use of social media in the real-time prediction of many outcomes. Similar results have also been reported w.r.t. the use of search engine data. This studies have promptly attracted the interest of the general public. In this paper, the authors revisit some of these studies and show that the results achieved are actually disappointing or misleading. The authors emphasize that fair baselines, such as the incumbency (re-election) rate are not considered and also that the proposed methods perform much worse than professional pollsters or even chance.

In the experimental section, two datasets from Twitter are applied, one related to the Senate elections in Massachusetts (about 234K tweets) and other from an uniform sample of tweets related to 6 races for the US Senate (about 13K tweets). Two published strategies for predicting election results are evaluated, one based on the number of mentions of the candidate and the other based on sentiment associated to terms contained in tweets.  Results show that the number of mentions predicted the victory of the wrong candidate in Massachusetts but sentiment analysis worked. In the other dataset, the number of mentions predicted 3 out of 6 results correctly, same performance achieved by the sentiment analysis. In a further analysis, the authors show that the automatic strategy for sentiment analysis applied achieves very low accuracy (37%). A champaign against a given candidate had most its tweets classified as positive or neutral towards her. Finally, the aggregate opinion of all tweets from users is weakly correlated with the average ADA (Americans for Democratic Action) score of the candidates they follow.

Based on these results, the authors propose set of guidelines for the study of techniques for election prediction: (1) Adjustments should be determined before hand (they always seem obvious afterwards), (2) Social media data should not be considered as from a natural phenomena, since it can be manipulated, (3) We need to understand why a given method works, (4) Proper baselines (e.g., incumbency) should be considered, and (5) Existing polling strategies, based on statistically significant samples, should be considered.

This kind of paper have an interesting contribution to any research area, specially when it comes to criticizing research lines that attract too much publicity without the proper scientific rigor. I would be really concerned if I was cited by one paper like this. Publishing misleading results on purpose is unethical. Although seeing a researcher on NYT or CNN is always inspiring, there is no space for flawed results in  serious research.

Link: http://bit.ly/qUHJyv

domingo, 4 de março de 2012

Education of a Model Student

This is another theoretical paper. Its authors are from Cornell University and include Jon Kleinberg and Steven Strogatz. The subject of the paper is the study of mathematical models for scheduling of educational material. Many asymptotic analysis of the relationship between learning rate and spacing constraints are given for different families of constraints and educational goals.

The authors consider a simplified educational model that is based on a set of educational units u_1, u_2, ... u_n which have spacing constraints defined by sequences {a_k} and {b_k}. A pair (a_i,b_i) tells us that the (i+1)-th ideal time to the student see a given unit is between a_k and b_k time steps after seeing it for the i-th time. In general, the spacing constraints are the same for all units and can be defined by functions. Moreover, the paper studies two educational goals: the infinite perfect learning and the finite sequence (cramming). Infinite perfect learning consists of never forgetting what have been learned so far. Cramming is a method of learning that has a specific temporal target (e.g., a test) and does not assure that the learned material will be remembered after this target.

From now, I will summarize some of the most important results presented in the paper. Proofs will be discussed without much detail.

The Recap Method: Based on the idea of infinite perfect learning and a quick learning rate (i.e., rate at which new content is introduced). The learning rate t_n (i.e., the time stamp when the content n is introduced) grows as theta(n log n) and a_k = 2^k and b_k=2^(k-1)(k+1). An algorithm for generating this schedule can be defined in terms of a depth-first post-order traversal of a binary tree. An example for 4 units is: u_1,u_2,u_1,u_2,u_3,u_4,u_3,u_4,u_1,u_2,u_3,u_4,...

The Superlinearity of the Introduction Time Function: For schedules that exhibit infinite perfect learning, the introduction time function t_n must be superlinear. This is my favorite result of the paper and it makes me think a lot about my own lousy educational strategy. They prove it by contradiction. A linear learning rate means that there is a c such that t_n <= c.n for all n. Let n_0 be an integer such that n_0 > Sum_{j=1}^k b_j. At least c.n_0 units has been introduced at n_0. Moreover, at least n_0  units will have occurred c+1 times by time step c.n_0+Sum_{j=1}^k b_j. So, c.n_0+Sum_{j=1}^k b_j >= (c_1).n_0 and Sum_{j=1}^k b_j >= n_0, which is a contradiction.

The Slow Flashcard Schedule: This schedule is based on a deck of flashcards, where the kth card corresponds to the unit u_k. A card is taken from the top and inserted into the k+1 position, where k is the number of times the card has already been taken. This generates a schedule such as: u_1,u_2,u_1,u_2,u_3,u_1,u_3,u_2,u_4,u_3... It is shown that this schedule exhibits infinite perfect learning with spacing constraints (a_k,b_k) = (k,k^2). Simulation results suggest that this schedule may even exhibit infinite perfect learning with spacing constraints (k,2k).

Flexible Students: For flexible students a_k = 1 for all k. In this scenario, the authors propose a sequencing for which b_1>=2 , b_k => inf and in which the constraints are met and new material is quickly introduced. The idea is increasing the level of the sequences as follows:

Level 1: u_1,u_2,u_1,u_2,...
Level 2: u_3,u_1,u_3,u_2,u_3...
Level 3: u_4,u_1,u_4,u_2,u_4,u_3...

This schedule exhibits infinite perfect learning and b_k can grow arbitrarily slowly, depending on the value of b_k.

The Finicky Slow Student: For this family of constraints, where a_k = b_k = k it is not possible to achieve infinite perfect learning. It is proved by contradiction by showing for two or more units, it is not possible to meet the constraints without conflicts. Unit u_1 will occur at times t_1,t_2, ... where t_i = Sum_{k=1}^i a_k. Lets state that u_2 starts at s_0 and also occurs at s_1,s_2, ... where s_i=s_0+Sum_{k=1}^i a_k. We know that s_i - t_i = s_0 and s_{i+1}-s_i = t_{i+1}-t_i=a_{i+1}. Choose k large enough so that t_{k+1}-t_k>s_0, then t_{k+1}>t_{k}+s_0=s_k and t_{k+1}>s_k. Let m be the smallest number such that t_m+1>s_m. The authors show that t_{m}=s_{m-1}.

Cramming: The authors show that for every positive integer n and every set of spacing constraints with b_k approaching to infinite, there exists a sequence that achieves bounded learning of order n. In other words, if you have a given content to study for a test, it is possible to organize a schedule for this test for any valid spacing constraints with enough time. They prove it by induction, using the idea of inserting a new unit into a sequence S_n. They also present limits on the amount of content that can be crammed in a limited amount of time. I confess that I did not understand this part very well, I believe an example would help.

This paper is very interesting. Based on a real complex problem, the authors propose a very simple model and apply asymptotic analysis to study it from several perspectives. During the reading, I discovered that asymptotic analysis was borrowed/stolen by computer scientists, having many (maybe more) applications outside it. I should study this subject more in the future, specially from a broader perspective (e.g., a book about asymptotic analysis in general).

Link: http://www.pnas.org/content/early/2012/01/13/1109863109.full.pdf+html

quinta-feira, 1 de março de 2012

The Power of Sampling in Knowledge Discovery

This is the first completely theoretical paper summarized here. And as completely theoretical I mean that it has no empirical result at all. It took me roughly a week to read this paper carefully, and I still do not think I  understood it fully. The idea is to keep trying papers like these and get used to more theoretical studies with time.

The title of the paper misled me. I was expecting a more general theory to explain why sampling is a good strategy in data mining. However, this paper proves that universal and existential sentences of tuple relational calculus can be verified approximately using a sample of a relation (table). Lower bounds on sample size based on the expected accuracy of this verification are the main contribution of this work. Sampling is expected to enable the analysis of large databases efficiently.

Tuple relational calculus is a query language for the definition of logical conditions and has its origins rooted with the relational model, defined by Edgard Codd in 1969. In this language, tuples are variables and conditions are defined as atoms. The result of a formula can the set of tuples that satisfy it or simply TRUE or FALSE. An example of a formula is:

for all t (R(t) => ((t[A] < t[B]) ^ (t[B] < t[C])))"

This paper focuses on two classes of formulas: universal, like the one above, and existential, which verifies the existence of a tuple with a given property. Moreover, they consider approximate verifications, which associate an error value to a formula. This error is close to 0 if the sentence almost holds  or to 1 if it does not hold. An error threshold epsilon is defined in order to verify sentences in this scenario. In case sampling is applied, a confidence parameter delta gives the probability of a sentence with error greater than epsilon in the relation to remain undetected in a sample. The first error expression is defined as follows:

g_1(sentence theta,relation M) = | {a_1...a_k} in M, such that theta does not hold in M | / |M|^k

This error measure gives the probability that theta does not hold for a random k-tuple. The second error expression is defined as:

g_3(sentence theta,relation M) = |M| - |max subset N of M for which theta holds| / |M| 

In other words, g_3 is the fraction of tuples from M that must be removed in order to make theta hold in the remaining of M.

Relation between g_1 and g_3(Prop. 3.1): For a sentence theta = for all t_1 ... for all t_k phi(t_1...t_k), consider g_3(theta,M)=epsilon, then

epsilon/|M|^(k-1) <= g_1(theta,M) <= 1 - (1-epsilon)^k = k.epsilon + O(epsilon^3)

If N is the maximal subset of M such that theta holds and L = M - N, then epsilon = |L| / |M| and g_3(theta,M) = epsilon. If V = {(a_1...a_k) in M^k such that theta does not hold}, then g_1(theta,M) = |V| / |M|^k. Because |V| >= |L|, g_1(theta,M) >= epsilon/|M|^(k-1). Moreover, |V| <= |M|^k - |N|^k = |M|^k(1-(1-epsilon)^k), therefore, g_1(theta,M) <= 1 - (1-epsilon)^k.

Upper bound on sample size, one sentence, g_1: Let theta = for all t_1 ... for all t_k phi(t_1...t_k), if g_1(theta,M) >= epsilon, then with probability at least 1-delta theta does not satisfies a sample S of M with size m >= k.ceil((1/epsilon).ln(1/delta)).

The sample of size m can be divided into l = ceil((1/epsilon).ln(1/delta)) k-tuples. The sentence theta holds for a k-tuple with probability 1-epsilon, and holds for all these k-tuples with probability (1-epsilon)^k <= e^(-l.epsilon) <= delta.

Upper bound on sample size, K sentences, g_1: For a set of K sentences, m >= k.ceil((1/epsilon).ln(K/delta)).

The probability that one of the K formula holds is delta/K.

Upper bound on sample size, one sentence, g_3: Let theta = for all t_1 ... for all t_k phi(t_1...t_k), k>= 2, there is a constant C_k such that if |M| >= C_k and g_3(theta,M) >= epsilon, then with probability at least 1-delta theta does not satisfies a sample S of M with size:

 m = max{(8/epsilon).ln(2/delta) , (2/epsilon).ceil(7.k^k.ln(2/delta)).ceil(|M|^(1-1/k))}.

 m = O((|M|^(1-1/k)/epsilon).log(1/delta)).

The proof for this theorem is not straightforward. It is based on the idea of a set P which is the maximal subset a set T composed of the tuples that does not satisfy theta such that the elements of P are pairwise disjoint. Given the union of the tuples in P (UP), we know that theta holds for N = M - (UP) and |UP| >= epsilon.|M|. For a sample of m elements from M, the expected number of elements from UP is at least m.epsilon and the probability of having at least m.epsilon/2 elements from UP is at most e^(-me/8) according to the Chernoff bound. Replacing one of the upper bounds on the value of the m in this expression gives that the probability of having at least m.epsilon/2 elements from UP is at most delta/2. Assuming that the sample has at least m.epsilon/2 elements from UP, we can extract from the sample h independent subsamples of size l, where h = ceil(7.k^k.ln(2/delta)) and l = ceil(|M|^(1-1/k)). The authors show that with this values of h and l, the probability of a subsample having all the elements of a set in P is at least 1/(7.k^k), this part was omitted. Therefore, the probability that all subsamples fail to contain all the elements from a set from P is at most (1-1/(7.k^k))^k <= delta/2. The value C_k comes from lower bounds on the size of samples in order to guarantee the given probability of a subsample having all the elements of a set in P is at least 1/(7.k^k). The formula theta holds in the sample in case it fails to contain at least m.epsilon/2 elements from UP or it does not contain all the elements from a set S in P. Therefore, theta does not hold with probability 1-(delta/2 + delta/2) = 1-delta.

Upper bound on sample size, K sentences, g_3: For a set of K sentences:

 m = max{(8/epsilon).ln(2K/delta) , (2/epsilon).ceil(7.k^k.ln(2K/delta)).ceil(|M|^(1-1/k))}.

 m = O((|M|^(1-1/k)/epsilon).log(K/delta)).

It is also shown that these upper bounds can not be improved because their are dependent on epsilon (g_1) and M (g_2). Moreover, the authors give examples of scenarios where each of the error measures provide better bounds and also propose a similar formulation for existential formulas.

I felt disappointed with myself many times while reading this paper. As a computer scientist I was supposed to understand, with a little effort, any paper on computer science. I should try to read papers with a more theoretical focus, like this one. On the other hand, this paper does not seem well-written to me. The authors give no intuition about how they defined some bounds. These intuitions are more important than just proving such bounds. I'm aware that research papers are not supposed to teach, but, at the same time, the worth of a paper is a function of the number of people who can actually read it.

Link: www.cs.helsinki.fi/u/jkivinen/publications/km-pskd-94.ps.gz