Thursday, June 28, 2012

Expectation Maximization

I'm really behind on my reading plans this month. I've been working hard on a recent project and also spending a lot of time on bureaucratic stuff related to my move in September. Some weeks ago I started reading a paper about content recommendation on Twitter (it is going to be summarized here soon) and I realized that I had no idea how Expectation Maximization (EM) works. So, I decided to spend some time learning EM. Interestingly, this is my first post here covering a specific topic instead of a paper. Also, I've applied the learning method described here to learn EM. Let's see how this concurrent experiment works.

Expectation Maximization (EM) is a statistical technique for estimating parameters for statistical models when there are latent (or hidden) variables. The parameters found are maximum likelihood estimates (MLE) based on both the data and the latent variables.

But what is a statistical model? Very simple. A statistical model is an equation that describes relationships between variables through probability distributions. These distributions make the relationships stochastic instead of deterministic.

What about hidden variables? Hidden (or latent) variables are variables that are not directly observed in the data. Therefore, such variables must be inferred from the observed ones. The "magic" of EM is that it deals with both the hidden variables and the parameters to be estimated together. In practice, modeling some quantities as hidden variables is often useful.

Maximum likelihood estimate: The likelihood of a set of parameters is, basically, the probability of the observed outcomes given those parameters. MLE is a method for finding the parameters with maximum likelihood. EM is a generalization of MLE to scenarios where there are hidden variables. We usually maximize the log-likelihood because the logarithm is a monotonically increasing function, so the parameters that maximize the likelihood also maximize the log-likelihood.

An example of a problem that can be solved using EM is: Given two coins, each with an unknown bias, suppose one of the coins is selected uniformly at random and thrown N times. This experiment is performed M times (each time, one of the coins is selected at random). What are the biases of the coins?

Parameters of the model: The biases of the coins.

Hidden variables: The coin selected in each experiment.

What if the choice of coins were known? Then we would have a traditional MLE scenario. The parameters (biases) could be estimated as:

estimated bias coin A = # tails coin A / # times coin A was thrown
estimated bias coin B = # tails coin B / # times coin B was thrown
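
As a minimal Python sketch of these two estimates, with made-up flip records (1 = tails, 0 = heads):

```python
# Hypothetical flip records with known coin labels (1 = tails, 0 = heads).
flips = {
    "A": [1, 0, 1, 1, 0, 1, 0, 1, 1, 1],
    "B": [0, 0, 1, 0, 1, 0, 0, 0, 1, 0],
}
# MLE of each bias: fraction of tails among the throws of that coin.
bias = {coin: sum(tosses) / len(tosses) for coin, tosses in flips.items()}
print(bias)  # {'A': 0.7, 'B': 0.3}
```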

EM is an iterative procedure that alternates between two main steps: Expectation (E) and Maximization (M). The parameters are first set to initial values. The Expectation step estimates the hidden variables based on the current parameter values. The Maximization step computes the MLE of the parameters given the estimated hidden variables. Each iteration never decreases the likelihood, and in practice the method converges after a finite number of iterations.

How would EM work for the aforementioned example? 

  1. Set initial values to the biases of the coins.
  2. A probability distribution over the unknown coin choices is computed based on the current parameters (biases).
  3. New parameters are estimated using MLE, based on the probability distribution computed in the previous step.
  4. Go back to step 2 (until convergence). 
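
Here is a minimal Python sketch of this loop. The tail counts, the initial biases, and the fixed number of iterations are assumptions for illustration; the uniform prior over coins comes from the problem statement:

```python
import math

def binom_pmf(k, n, p):
    """Probability of k tails in n throws of a coin with tail probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def em_two_coins(tail_counts, n, iters=50, theta_a=0.6, theta_b=0.5):
    """tail_counts[m] = number of tails in the m-th experiment of n throws."""
    for _ in range(iters):
        # E-step: posterior probability that each experiment used coin A,
        # given the current biases and a uniform prior over the coins.
        tails_a = heads_a = tails_b = heads_b = 0.0
        for k in tail_counts:
            like_a = binom_pmf(k, n, theta_a)
            like_b = binom_pmf(k, n, theta_b)
            w_a = like_a / (like_a + like_b)
            # Accumulate the expected tail/head counts for each coin.
            tails_a += w_a * k
            heads_a += w_a * (n - k)
            tails_b += (1 - w_a) * k
            heads_b += (1 - w_a) * (n - k)
        # M-step: MLE of the biases from the expected counts.
        theta_a = tails_a / (tails_a + heads_a)
        theta_b = tails_b / (tails_b + heads_b)
    return theta_a, theta_b

# Five experiments of 10 throws each (tail counts are made up).
print(em_two_coins([5, 9, 8, 4, 7], n=10))
```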

I took this example from the article What is the expectation maximization algorithm?, by Do and Batzoglou (see the references below). The article walks through a complete running example, which I found very clever.



If you are familiar with the K-means algorithm (I am), you will find this view of K-means as EM very instructive: the cluster assignment of each point is a latent variable and the centroid of each cluster is a parameter to be estimated. Therefore, the cluster assignments are updated in the expectation step and the centroids are recomputed in the maximization step. However, in K-means the expectation step makes a hard assignment of points to clusters instead of computing a probability distribution over assignments.
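
Here is a minimal K-means sketch written to make that E/M structure explicit (random initialization from the data points and a fixed iteration count are simplifying assumptions):

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Hard-assignment EM: the E-step assigns each point to its nearest
    centroid; the M-step recomputes each centroid as the mean of its points."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # E-step: the latent variable is the cluster of each point -- a hard
        # choice here, not a probability distribution over clusters.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: the parameters (centroids) are the MLE given the assignments.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

# Toy usage: two obvious clusters in the plane.
print(kmeans([[0, 0], [0, 1], [5, 5], [5, 6]], k=2))
```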

Another example of the application of EM is the following: Suppose that you have two Normal distributions, A and B, and that you are given instances sampled from these distributions, but you do not know which distribution each instance was sampled from. How can you compute the means mu_A and mu_B of the distributions if both distributions have the same known variance? Surprisingly, the answer is EM.



What if there is a single distribution instead of two? Then estimating the mean is straightforward: it is just the sample mean of all instances.

What if the selected distribution were known? In this case, mu_A and mu_B are simply the means of the instances from each distribution. Still pretty easy.

How do we solve this problem using EM? The means are the parameters, and the distribution from which each instance was sampled is the latent variable.


  1. Define initial values for mu_A and mu_B.
  2. Expectation step: Estimate the probability that each instance was generated by A or by B.
  3. Maximization step: Compute the weighted means of the instances for A and B.
  4. Go back to step 2 (until the estimates converge).

In the expectation step, the probabilities are computed as follows:

Probability that the instance x_i is from A:
 P(x = x_i | mu = mu_A) / (P(x = x_i | mu = mu_A) + P(x = x_i | mu = mu_B))

Probability that the instance x_i is from B:
 P(x = x_i | mu = mu_B) / (P(x = x_i | mu = mu_A) + P(x = x_i | mu = mu_B))


Since we know that the distributions are Gaussian and the variance is also known, we can compute these probabilities using the probability density function of the Gaussian distribution.
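
Concretely, for a Gaussian with mean mu and known variance sigma^2:

P(x = x_i | mu) = (1 / sqrt(2 * pi * sigma^2)) * exp(-(x_i - mu)^2 / (2 * sigma^2))

Note that the constant factor 1 / sqrt(2 * pi * sigma^2) cancels in the ratios above, so only the exponential terms actually matter.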

In the maximization step, the means are updated as follows:

mu_A = Sum_{x_i} P(x_i is from A) * x_i / Sum_{x_i} P(x_i is from A)

mu_B = Sum_{x_i} P(x_i is from B) * x_i / Sum_{x_i} P(x_i is from B)
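
Putting the two steps together, here is a minimal Python sketch of this mixture example. The equal mixing proportions, the toy data, and the initial means are assumptions for illustration:

```python
import math
import random

def em_two_gaussians(xs, sigma, iters=100, mu_a=0.0, mu_b=1.0):
    """EM for a 50/50 mixture of two Gaussians with the same known
    variance sigma^2; estimates the two means."""
    for _ in range(iters):
        # E-step: responsibility of A for each instance x_i; the Gaussian
        # normalizing constants cancel in the ratio, so only exponentials remain.
        w = []
        for x in xs:
            pa = math.exp(-(x - mu_a) ** 2 / (2 * sigma ** 2))
            pb = math.exp(-(x - mu_b) ** 2 / (2 * sigma ** 2))
            w.append(pa / (pa + pb))
        # M-step: the weighted-mean updates given above.
        mu_a = sum(wi * x for wi, x in zip(w, xs)) / sum(w)
        mu_b = sum((1 - wi) * x for wi, x in zip(w, xs)) / sum(1 - wi for wi in w)
    return mu_a, mu_b

# Toy data: samples from N(-2, 1) and N(3, 1) with the labels thrown away.
random.seed(42)
data = [random.gauss(-2, 1) for _ in range(100)] + \
       [random.gauss(3, 1) for _ in range(100)]
print(em_two_gaussians(data, sigma=1.0))  # should approach (-2, 3)
```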

The EM method is known to converge to locally optimal parameters that may not be globally optimal. At each iteration, the parameters improve until a local optimum is reached. In practice, the method is run from several different initial parameter values in order to avoid poor local maxima of the likelihood.


Alternative approaches to EM include gradient descent and the Newton-Raphson method.


References:

Chuong B. Do and Serafim Batzoglou. What is the expectation maximization algorithm? Nature Biotechnology, 26(8):897-899, 2008.

Geoffrey J. McLachlan and Thriyambakam Krishnan. The EM Algorithm and Extensions. Wiley, New York, 1997.

http://www.cs.berkeley.edu/~russell/classes/cs194/f11/lectures/CS194%20Fall%202011%20Lecture%2020-EM-proof.pdf

Monday, June 11, 2012

Recommending Twitter Users to Follow Using Content and Collaborative Filtering Approaches

This paper was published in RecSys 2010 and is authored by researchers from University College Dublin, in Ireland. The problem studied in the paper is recommending Twitter followees (information sources). The authors propose several profiling strategies - based on tweets, followers, and followees - in order to represent users' interests, enabling accurate recommendations.

The authors of this paper were working on a real system, called Twittomender, for user recommendation based on user profiles and query terms. Therefore, they give a lot of detail on the architecture of the system, which is not actually my focus as a reader. So, I will skip most of these details here.

The profiling strategies proposed are the following:

1) User's tweets;
2) Tweets of the user's followees;
3) Tweets of the user's followers;
4) User's followees (ids);
5) User's followers (ids).

Profiles were indexed using the Lucene platform, which is a traditional platform for indexing and querying text documents. Each user is mapped to a document and the features (words or ids) are represented as terms. Recommendations are made using Lucene's query system: a user or a set of keywords is given to Lucene as a query. Terms are weighted based on TF-IDF.
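
The paper does this with Lucene; as a rough Python sketch of the same idea (each user profile is a document, and recommendation is nearest-neighbor search under TF-IDF weighting, with cosine similarity standing in for Lucene's scoring), using made-up users and profile text:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical profiles: one "document" per user, e.g. the concatenation
# of the user's own tweets (strategy S1 below).
profiles = {
    "alice": "machine learning gradient descent neural networks",
    "bob":   "football world cup goals premier league",
    "carol": "deep learning neural networks backpropagation",
}
users = list(profiles)
X = TfidfVectorizer().fit_transform([profiles[u] for u in users])

# Recommend followees for a target user: rank the other users by the
# cosine similarity of their TF-IDF profile vectors.
target = users.index("alice")
scores = cosine_similarity(X[target], X).ravel()
ranking = sorted((u for u in users if u != "alice"),
                 key=lambda u: -scores[users.index(u)])
print(ranking)  # carol shares terms with alice, so she ranks first
```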

The profiling strategies are evaluated in two experiments, an offline one and an online one. In the offline experiment, a set of 20,000 users is divided into a training set (19,000 users) and a test set (1,000 users). The profiling strategies are evaluated in 9 different forms:

S1: Users are represented by their tweets;
S2: Users are represented by the tweets of their followees;
S3: Users are represented by the tweets of their followers;
S4: Users are represented as a combination of their own tweets, the tweets of their followees and the tweets of their followers;
S5: Users are represented by the IDs of their followees;
S6: Users are represented by the IDs of their followers;
S7: Users are represented by a combination of the IDs of their followees and followers;
S8: Ensemble that combines strategies S1 and S6;
S9: Ensemble that combines strategies S1-S7. The position of a user in the resulting list is a function of its rank in the results generated by the set of strategies.

The following figure shows the recommendation precision of each of these strategies:



The evaluation is based on the overlap between the recommended lists and the actual followee lists of the users. The authors do not explicitly state that test data was kept out of training, but I assume it was. The best strategies in terms of precision are S6, S7, and S9. One interesting result is that the profiling strategy based on the followee list does not achieve results as good as the strategy based on the follower list.

Another quality measure considered in the offline evaluation is the average position of relevant documents for different recommendation list sizes (see the figure below).



This evaluation gives evidence about the quality of the strategies that is the opposite of the previous one. However, I would expect a good method to bring more relevant users into the top-20 ranking (evaluations are based on the top-20 users returned). As a consequence, good methods would have a higher average relevant user position, unless only the position of the first relevant user is considered. Yet it seems that the authors expected good strategies to have lower average values, which is counter-intuitive.

The online experiment was based on evaluations provided by 34 volunteer Twittomender users. Only strategy S6 was applied. The quality metric considered was the number of relevant recommendations along the rankings. Users were invited to test both the profile-based and the keyword-based interfaces of the system. In general, the results show that relevant users were positioned at the top of the ranking for both interfaces. Moreover, the query interface achieved worse results than the interface based on user profiles.

I see this paper as an application paper, not a research paper. It is well written in general, but the profiling and recommendation strategies are very simplistic and the results are not very conclusive. Both the offline and the online experiments were based on a small number of users. As an application paper, it should focus more on the architectural and performance aspects of Twittomender than on user profiling and recommendation accuracy.

Link: http://irserver.ucd.ie/dspace/bitstream/10197/2524/1/john-hannon-recsys-v4-April-25.pdf