Thursday, June 28, 2012

Expectation Maximization

I'm really behind on my reading plans this month. I've been working hard on a recent project and also spending a lot of time on bureaucratic stuff related to my move in September. Some weeks ago I started reading a paper about content recommendation on Twitter (it will be summarized here soon) and discovered that I had no idea how Expectation Maximization (EM) works. So, I decided to spend some time learning EM. Interestingly, this is my first post here covering a specific topic instead of a paper. Also, I've applied the learning method described here to learn EM. Let's see how this concurrent experiment works.

Expectation Maximization (EM) is a statistical technique for estimating the parameters of a statistical model when there are latent (or hidden) variables. The parameters found are maximum likelihood estimates (MLE) based on both the observed data and the latent variables.

But what is a statistical model? Very simple: a statistical model is an equation that describes relationships between variables through probability distributions. These distributions make the relationships stochastic instead of deterministic.

What about hidden variables? Hidden (or latent) variables are variables that are not directly observed in the data. Therefore, such variables must be inferred from the observed variables. The "magic" of EM is that it is able to deal with the hidden variables and the parameters to be estimated together. Introducing hidden variables into a model often turns out to be useful in practice.

Maximum likelihood estimate: The likelihood of a set of parameters is, basically, the probability that those parameters generate the observed outcomes. MLE is a method for finding the parameters with maximum likelihood. EM is a generalization of MLE to scenarios with hidden variables. We usually maximize the log-likelihood instead, because the logarithm is a monotone (increasing) transformation: the parameters that maximize the likelihood also maximize the log-likelihood.

An example of a problem that can be solved using EM is: Given two coins, each with an unknown bias, suppose that one of the coins is selected uniformly at random and thrown N times. This experiment is repeated M times (each time, one of the coins is selected at random). What are the biases of the coins?

Parameters of the model: The biases of the coins.

Unknown variables: The selection of coins.

What if the choice of coins were known? Then we would have a traditional MLE scenario, and the parameters (biases) could be estimated as:

estimated bias of coin A = (# tails from coin A) / (# times coin A was thrown)
estimated bias of coin B = (# tails from coin B) / (# times coin B was thrown)
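
As a quick illustration, here is a minimal Python sketch of that counting, assuming the flips are recorded as made-up (coin, outcome) pairs with 1 meaning tails:

# Hypothetical flip records: (coin, outcome), where outcome 1 = tails, 0 = heads.
flips = [("A", 1), ("A", 0), ("B", 1), ("A", 1), ("B", 0), ("B", 0)]

def mle_bias(flips, coin):
    # Estimated bias = (# tails from this coin) / (# times this coin was thrown).
    outcomes = [o for c, o in flips if c == coin]
    return sum(outcomes) / len(outcomes)

print(mle_bias(flips, "A"), mle_bias(flips, "B"))  # 0.666..., 0.333...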

EM is an iterative procedure that alternates between two main steps: Expectation (E) and Maximization (M). The parameters are first set to some initial values. The Expectation step estimates the hidden variables based on the current parameter values. The Maximization step computes the MLE of the parameters given the estimated hidden variables. The two steps are repeated until the estimates converge.
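
In code, the overall loop looks roughly like the sketch below. This is only a skeleton: e_step, m_step and the convergence test are placeholders that depend entirely on the model, so here they are simply passed in as functions.

def em(data, params, e_step, m_step, max_iters=100, tol=1e-9):
    # Generic EM loop; the model-specific E and M steps are supplied as functions.
    # Assumes params is a sequence of numbers, just to keep the convergence test simple.
    for _ in range(max_iters):
        hidden = e_step(data, params)       # E-step: estimate the hidden variables.
        new_params = m_step(data, hidden)   # M-step: MLE of the parameters.
        if all(abs(a - b) < tol for a, b in zip(new_params, params)):
            return new_params               # estimates stopped changing
        params = new_params
    return params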

How would EM work for the aforementioned example? 

  1. Set initial values for the biases of the coins.
  2. A probability distribution over the unknown coin choices is computed based on the current parameters (biases).
  3. New parameters are estimated using MLE based on the probability distribution found in the previous step.
  4. Go back to step 2 (until convergence). 

I took this example from the article "What is the expectation maximization algorithm?", which walks through a complete running example (Figure 1) that I found very clever.
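
To make those steps concrete, here is a minimal Python sketch of the iterations (my own sketch, not the paper's code; the tail counts below are made up, and only the number of tails per experiment is observed, not which coin was used):

def em_two_coins(tail_counts, n, bias_a=0.6, bias_b=0.5, iters=20):
    for _ in range(iters):
        # E-step: probability that each experiment used coin A or coin B
        # (the binomial coefficient and the uniform coin prior cancel out).
        exp_tails = {"A": 0.0, "B": 0.0}
        exp_flips = {"A": 0.0, "B": 0.0}
        for t in tail_counts:
            like_a = bias_a ** t * (1 - bias_a) ** (n - t)
            like_b = bias_b ** t * (1 - bias_b) ** (n - t)
            p_a = like_a / (like_a + like_b)
            p_b = 1.0 - p_a
            exp_tails["A"] += p_a * t
            exp_flips["A"] += p_a * n
            exp_tails["B"] += p_b * t
            exp_flips["B"] += p_b * n
        # M-step: weighted version of the counting formulas above.
        bias_a = exp_tails["A"] / exp_flips["A"]
        bias_b = exp_tails["B"] / exp_flips["B"]
    return bias_a, bias_b

# Made-up data: 5 experiments of 10 flips each, number of tails per experiment.
print(em_two_coins([5, 9, 8, 4, 7], n=10))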



If you are familiar with the K-means algorithm (I am), you will find this explanation of K-means as EM very instructive: the cluster of each point is a latent variable and the centroids of the clusters are the parameters to be estimated. Therefore, the cluster assignments are updated in the expectation step and the centroids are recomputed in the maximization step. However, in K-means the expectation step actually assigns each point to a single cluster instead of computing a probability distribution over assignments.
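
For reference, a minimal one-dimensional K-means sketch along those lines (hard assignments in the "E-step", centroid means in the "M-step"; the data points below are made up):

import random

def kmeans(points, k, iters=50):
    # "E-step": assign each point to its nearest centroid (a hard assignment,
    # not a probability distribution). "M-step": recompute each centroid as the
    # mean of the points assigned to it.
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: (p - centroids[j]) ** 2)
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids

# Made-up one-dimensional points forming two clusters.
print(kmeans([1.0, 1.2, 0.8, 5.0, 5.3, 4.9], k=2))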

Another example of the application of EM is the following: Suppose that you have two Normal distributions, A and B, and that you are given instances sampled from them, but you do not know which distribution each instance was sampled from. How can you estimate the means mu_A and mu_B of the distributions if both distributions have the same known variance? Surprisingly, the answer is EM.



What if there is a single distribution instead of two? Then estimating the mean is straightforward: it is just the sample mean of the instances.

What if the selected distribution were known for each instance? In this case, mu_A and mu_B are simply the mean values of the instances from each distribution. Still pretty easy.

How do we solve this problem using EM? The means are the parameters, and the distribution from which each instance was sampled is a latent variable.


  1. Define initial values for mu_A and mu_B.
  2. Expectation step: Estimate the probability that each instance was generated by A and by B.
  3. Maximization step: Compute the weighted means of the instances for A and B, using these probabilities as weights.
  4. Go back to step 2 (until the estimates converge).

In the expectation step, the probabilities are computed as follows:

Probability that the instance x_i is from A:
 P(x = x_i | mu = mu_A) / (P(x = x_i | mu = mu_A) + P(x = x_i | mu = mu_B))

Probability that the instance x_i is from B:
 P(x = x_i | mu = mu_B) / (P(x = x_i | mu = mu_A) + P(x = x_i | mu = mu_B))


Since we know that the distributions are Gaussian and the variance is also known, we can compute these probabilities using the probability density function of the Gaussian distribution.

In the maximization step, the means are updated as follows:

mu_A = Sum_{x_i} P(x_i is from A) * x_i / Sum_{x_i} P(x_i is from A)

mu_B = Sum_{x_i} P(x_i is from B) * x_i / Sum_{x_i} P(x_i is from B)
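
Putting the two steps together, here is a minimal Python sketch of this procedure, assuming equal mixing weights and a known shared variance (the instances below are made up):

import math

def normal_pdf(x, mu, sigma):
    # Probability density of a Normal(mu, sigma^2) distribution at x.
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def em_two_means(xs, sigma, mu_a=-1.0, mu_b=1.0, iters=50):
    for _ in range(iters):
        # E-step: P(x_i is from A), using the formula above (equal priors for A and B).
        w_a = []
        for x in xs:
            pa = normal_pdf(x, mu_a, sigma)
            pb = normal_pdf(x, mu_b, sigma)
            w_a.append(pa / (pa + pb))
        # M-step: weighted means, as in the update formulas above.
        mu_a = sum(w * x for w, x in zip(w_a, xs)) / sum(w_a)
        mu_b = sum((1 - w) * x for w, x in zip(w_a, xs)) / sum(1 - w for w in w_a)
    return mu_a, mu_b

# Made-up instances drawn from two Gaussians with means near 0 and 4.
xs = [-0.2, 0.1, 0.3, -0.1, 3.8, 4.1, 4.3, 3.9]
print(em_two_means(xs, sigma=1.0))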

The EM method is known to converge to locally optimal parameters that may not be the global optimum. At each iteration the parameters are improved, until a local optimum is reached. In practice, several different initializations are tried in order to avoid poor local maxima of the likelihood, and the solution with the highest likelihood is kept.
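
Building on the previous sketch (reusing normal_pdf, em_two_means and xs from it), a few random restarts could look like this, keeping the estimate with the best log-likelihood under the equal-weights assumption:

import random

def log_likelihood(xs, sigma, mu_a, mu_b):
    # Log-likelihood of the data under the two-Gaussian model with equal weights.
    return sum(math.log(0.5 * normal_pdf(x, mu_a, sigma) +
                        0.5 * normal_pdf(x, mu_b, sigma)) for x in xs)

# Run EM from several random initializations and keep the best solution.
candidates = [em_two_means(xs, sigma=1.0,
                           mu_a=random.uniform(-5, 5),
                           mu_b=random.uniform(-5, 5)) for _ in range(10)]
best = max(candidates, key=lambda mus: log_likelihood(xs, 1.0, *mus))
print(best)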


Alternative approaches to EM include gradient descent and the Newton-Raphson method.


References:

Do, C. B., & Batzoglou, S. (2008). What is the expectation maximization algorithm? Nature Biotechnology.

McLachlan, G. J., & Krishnan, T. (1997). The EM Algorithm and Extensions. Wiley, New York.

http://www.cs.berkeley.edu/~russell/classes/cs194/f11/lectures/CS194%20Fall%202011%20Lecture%2020-EM-proof.pdf

One comment:

  1. The example in "What is the expectation maximization algorithm?" is too "clever" for me.

    Can you explain how the splits (45%/55%, 80%/20%, 73%/27%, 35%/65%, 65%/35%) are derived?
