Thursday, March 1, 2012

The Power of Sampling in Knowledge Discovery

This is the first completely theoretical paper summarized here. By completely theoretical I mean that it has no empirical results at all. It took me roughly a week to read this paper carefully, and I still do not think I understood it fully. The idea is to keep trying papers like these and get used to more theoretical studies over time.

The title of the paper misled me. I was expecting a more general theory explaining why sampling is a good strategy in data mining. Instead, this paper proves that universal and existential sentences of tuple relational calculus can be verified approximately using a sample of a relation (table). Upper bounds on the sample size needed for a given verification accuracy are the main contribution of this work. Sampling is expected to enable the efficient analysis of large databases.

Tuple relational calculus is a query language for the definition of logical conditions and has its origins in the relational model, defined by Edgar F. Codd in 1969. In this language, tuples are variables and conditions are defined as atoms. The result of a formula can be the set of tuples that satisfy it or simply TRUE or FALSE. An example of a formula is:

for all t (R(t) => ((t[A] < t[B]) ^ (t[B] < t[C])))
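
To make the notation concrete, here is a small Python sketch of my own (not from the paper) that evaluates this universal sentence over a toy relation R with attributes A, B, and C:

# Toy relation R; each tuple is a dict over the attributes A, B, C.
R = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 0, "B": 5, "C": 9},
    {"A": 2, "B": 4, "C": 1},  # violates t[B] < t[C]
]

def phi(t):
    # The condition (t[A] < t[B]) ^ (t[B] < t[C]).
    return t["A"] < t["B"] and t["B"] < t["C"]

# The universal sentence: for all t (R(t) => phi(t)).
print(all(phi(t) for t in R))  # False, because of the third tuple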

This paper focuses on two classes of formulas: universal, like the one above, and existential, which assert the existence of a tuple with a given property. Moreover, the authors consider approximate verification, which associates an error value with a formula. This error is close to 0 if the sentence almost holds and close to 1 if it is far from holding. An error threshold epsilon is defined in order to verify sentences in this scenario. When sampling is applied, a confidence parameter delta bounds the probability that a sentence with error greater than epsilon in the relation remains undetected in the sample. The first error expression is defined as follows:

g_1(theta, M) = |{(a_1, ..., a_k) in M^k such that phi(a_1, ..., a_k) does not hold}| / |M|^k

This error measure gives the probability that phi fails for a random k-tuple drawn from M. The second error expression is defined as:

g_3(theta, M) = (|M| - max{|N| : N subset of M, theta holds in N}) / |M|

In other words, g_3 is the fraction of tuples that must be removed from M in order to make theta hold in the remainder.
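
As a sanity check of my own (not from the paper), both error measures can be computed by brute force on a toy relation. Note that computing g_3 exactly requires a search over subsets, so this only works for very small M; the functional-dependency-like condition phi below is also just an illustration of mine:

from itertools import product, combinations

def g1(phi, M, k):
    # Fraction of k-tuples (a_1, ..., a_k) in M^k on which phi fails.
    violations = sum(1 for t in product(M, repeat=k) if not phi(*t))
    return violations / len(M) ** k

def g3(phi, M, k):
    # Fraction of rows to remove so that phi holds on all k-tuples of
    # the remaining rows (brute force over subsets: toy-sized M only).
    n = len(M)
    for size in range(n, -1, -1):          # try the largest subsets first
        for N in combinations(M, size):
            if all(phi(*t) for t in product(N, repeat=k)):
                return (n - size) / n
    return 1.0

M = [(1, 10), (1, 10), (2, 20), (2, 21)]   # last two rows conflict on A -> B
phi = lambda t1, t2: t1[0] != t2[0] or t1[1] == t2[1]
print(g1(phi, M, 2))  # 2/16 = 0.125
print(g3(phi, M, 2))  # 1/4  = 0.25: removing one conflicting row suffices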

Relation between g_1 and g_3 (Prop. 3.1): For a sentence theta = for all t_1 ... for all t_k phi(t_1, ..., t_k), if g_3(theta, M) = epsilon, then

epsilon/|M|^(k-1) <= g_1(theta, M) <= 1 - (1-epsilon)^k = k.epsilon + O(epsilon^2)

If N is a maximal subset of M on which theta holds and L = M - N, then epsilon = |L| / |M| and g_3(theta, M) = epsilon. If V = {(a_1, ..., a_k) in M^k such that phi(a_1, ..., a_k) does not hold}, then g_1(theta, M) = |V| / |M|^k. Each tuple a in L yields at least one violating k-tuple over N plus a (otherwise N would not be maximal), and these k-tuples are distinct for distinct a, so |V| >= |L| and g_1(theta, M) >= epsilon/|M|^(k-1). Moreover, every k-tuple drawn entirely from N satisfies phi, so |V| <= |M|^k - |N|^k = |M|^k.(1 - (1-epsilon)^k) and therefore g_1(theta, M) <= 1 - (1-epsilon)^k.
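
Plugging the toy relation from the sketch above into Prop. 3.1 (a numeric check of my own): k = 2, |M| = 4 and epsilon = g_3 = 0.25 give

epsilon/|M|^(k-1) = 0.25/4 = 0.0625 <= g_1 = 0.125 <= 1 - (1-0.25)^2 = 0.4375,

so both inequalities hold in this small case.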

Upper bound on sample size, one sentence, g_1: Let theta = for all t_1 ... for all t_k phi(t_1, ..., t_k). If g_1(theta, M) >= epsilon, then with probability at least 1-delta theta does not hold in a sample S of M of size m >= k.ceil((1/epsilon).ln(1/delta)).

The sample of size m can be divided into l = ceil((1/epsilon).ln(1/delta)) k-tuples. The sentence theta holds for a random k-tuple with probability at most 1-epsilon, and thus holds for all l of these k-tuples with probability at most (1-epsilon)^l <= e^(-l.epsilon) <= delta.
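
This detection probability is easy to check by simulation. The sketch below (my own, reusing the toy relation from before, where g_1 = 0.125 >= epsilon = 0.1) estimates how often a sample of size k.l contains a violating k-tuple:

import math, random

def detects(phi, M, k, eps, delta, trials=2000):
    # Empirical probability that a sample of size m = k*l contains a
    # violating k-tuple, given g_1(theta, M) >= eps.
    l = math.ceil((1 / eps) * math.log(1 / delta))
    hits = 0
    for _ in range(trials):
        sample = [random.choice(M) for _ in range(k * l)]  # with replacement
        # Split the sample into l disjoint k-tuples, as in the proof.
        ktuples = [tuple(sample[i * k:(i + 1) * k]) for i in range(l)]
        if any(not phi(*t) for t in ktuples):
            hits += 1
    return hits / trials

M = [(1, 10), (1, 10), (2, 20), (2, 21)]
phi = lambda t1, t2: t1[0] != t2[0] or t1[1] == t2[1]
print(detects(phi, M, 2, eps=0.1, delta=0.05))  # roughly 0.98 >= 1-delta = 0.95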

Upper bound on sample size, K sentences, g_1: For a set of K sentences, m >= k.ceil((1/epsilon).ln(K/delta)).

By the union bound, the probability that any single sentence with error at least epsilon holds in the sample is at most delta/K, so the probability that at least one of the K sentences goes undetected is at most delta.
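
For example (numbers of my own): with epsilon = 0.01 and delta = 0.05, a single sentence needs l = ceil(100.ln(20)) = 300 k-tuples, while K = 100 sentences need l = ceil(100.ln(2000)) = 761, so the sample grows only logarithmically with the number of sentences.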

Upper bound on sample size, one sentence, g_3: Let theta = for all t_1 ... for all t_k phi(t_1, ..., t_k), k >= 2. There is a constant C_k such that if |M| >= C_k and g_3(theta, M) >= epsilon, then with probability at least 1-delta theta does not hold in a sample S of M of size:

m = max{(8/epsilon).ln(2/delta), (2/epsilon).ceil(7.k^k.ln(2/delta)).ceil(|M|^(1-1/k))}

That is, m = O((|M|^(1-1/k)/epsilon).log(1/delta)).

The proof of this theorem is not straightforward. It is based on a set P, a maximal collection of pairwise disjoint k-tuples taken from the set T of k-tuples that do not satisfy phi. Given the union UP of the tuples in P, we know that theta holds in N = M - UP and that |UP| >= epsilon.|M| (since g_3(theta, M) >= epsilon). For a sample of m elements from M, the expected number of elements from UP is at least m.epsilon, and by the Chernoff bound the probability of drawing fewer than m.epsilon/2 elements from UP is at most e^(-m.epsilon/8). Substituting the first term of the maximum for m in this expression shows that this probability is at most delta/2.

Assuming the sample has at least m.epsilon/2 elements from UP, we can extract from it h independent subsamples of size l, where h = ceil(7.k^k.ln(2/delta)) and l = ceil(|M|^(1-1/k)). The authors show that, with these values of h and l, the probability of a subsample containing all the elements of some set in P is at least 1/(7.k^k) (this part of the proof is omitted here). Therefore, the probability that all h subsamples fail to contain a complete set from P is at most (1 - 1/(7.k^k))^h <= delta/2. The constant C_k comes from lower bounds on |M| needed to guarantee this 1/(7.k^k) probability.

The sentence theta can hold in the sample only if the sample contains fewer than m.epsilon/2 elements from UP or if no subsample contains a complete set from P. Each of these events has probability at most delta/2, so theta fails in the sample with probability at least 1 - (delta/2 + delta/2) = 1 - delta.
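
The Chernoff step in the first part can also be checked numerically. A small sketch of my own, assuming the sampled elements fall in UP independently with probability epsilon:

import math, random

# If a fraction eps of M lies in UP, a sample of size m contains fewer than
# m*eps/2 elements of UP with probability at most e^(-m*eps/8).
eps, m, trials = 0.2, 100, 20000
bad = sum(
    1 for _ in range(trials)
    # Number of sampled elements landing in UP ~ Binomial(m, eps).
    if sum(random.random() < eps for _ in range(m)) < m * eps / 2
)
print(bad / trials, "<=", math.exp(-m * eps / 8))  # e.g. ~0.004 <= ~0.082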


Upper bound on sample size, K sentences, g_3: For a set of K sentences:

m = max{(8/epsilon).ln(2K/delta), (2/epsilon).ceil(7.k^k.ln(2K/delta)).ceil(|M|^(1-1/k))}

That is, m = O((|M|^(1-1/k)/epsilon).log(K/delta)).
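
To see how the two bounds behave, here is a quick comparison of my own using the formulas above (treating them as functions of |M|):

import math

def m_g1(k, eps, delta, K=1):
    # Sample size from the g_1 bound.
    return k * math.ceil((1 / eps) * math.log(K / delta))

def m_g3(k, eps, delta, M_size, K=1):
    # Sample size from the g_3 bound.
    return max(
        (8 / eps) * math.log(2 * K / delta),
        (2 / eps)
        * math.ceil(7 * k ** k * math.log(2 * K / delta))
        * math.ceil(M_size ** (1 - 1 / k)),
    )

# With k = 2, eps = 0.1, delta = 0.05 the g_1 bound is constant in |M|,
# while the g_3 bound grows roughly as sqrt(|M|):
for n in (10 ** 4, 10 ** 6, 10 ** 8):
    print(n, m_g1(2, 0.1, 0.05), int(m_g3(2, 0.1, 0.05, n)))

Of course, the two bounds control different error measures, so they are not directly interchangeable.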

It is also shown that these upper bounds cannot be substantially improved, because of their dependence on epsilon (for g_1) and on |M| (for g_3). Moreover, the authors give examples of scenarios where each of the error measures provides better bounds, and they also propose a similar formulation for existential formulas.


I felt disappointed with myself many times while reading this paper. As a computer scientist, I was supposed to understand, with a little effort, any paper in computer science. I should try to read papers with a more theoretical focus, like this one. On the other hand, this paper does not seem well-written to me. The authors give no intuition about how they arrived at some of the bounds, and such intuitions are more important than the proofs alone. I'm aware that research papers are not supposed to teach, but, at the same time, the worth of a paper is a function of the number of people who can actually read it.

Link: www.cs.helsinki.fi/u/jkivinen/publications/km-pskd-94.ps.gz
