Expectation Maximization (EM) is a statistical technique for estimating parameters for statistical models when there are latent (or hidden) variables. The parameters found are maximum likelihood estimates (MLE) based on both the data and the latent variables.
But what is a statistical model? Very simple. A statistical model is an equation that describes relationships between variables through probability distributions. These distributions make the relationships stochastic instead of deterministic.
What about hidden variables? Hidden (or latent) variables are variables that are not directly observed in the data and must therefore be inferred from the observed variables. The "magic" of EM is that it handles the hidden variables and the parameters to be estimated together. In practice, it is often useful to introduce hidden variables when modeling a problem.
Maximum likelihood estimate: The likelihood of a set of parameters is, basically, the probability of the given outcomes being generated under those parameters. MLE is a method for inferring the parameters with maximum likelihood. EM is a generalization of MLE for scenarios where there are hidden variables. We usually maximize the log-likelihood instead, because the logarithm is a monotonically increasing function, so the parameters that maximize the likelihood also maximize the log-likelihood.
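As a tiny illustration (my own, with made-up numbers): for a coin flipped 10 times with 7 tails, the bias that maximizes the likelihood over a grid of candidate values is the same one that maximizes the log-likelihood.

```python
import numpy as np

# Minimal sketch: 7 tails out of 10 flips. The grid point that maximizes the
# likelihood also maximizes the log-likelihood, since log is monotone.
tails, flips = 7, 10
biases = np.linspace(0.01, 0.99, 99)

likelihood = biases**tails * (1 - biases)**(flips - tails)
log_likelihood = tails * np.log(biases) + (flips - tails) * np.log(1 - biases)

assert np.argmax(likelihood) == np.argmax(log_likelihood)
print(biases[np.argmax(likelihood)])  # ~0.7, i.e. tails / flips
```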
An example of a problem that can be solved using EM is the following: given two coins, each with an unknown bias, suppose that one of the coins is selected uniformly at random and thrown N times. This experiment is performed M times (each time, one of the coins is selected at random). What are the biases of the coins?
Parameters of the model: The biases of the coins.
Hidden variables: Which coin was selected in each experiment.
What if the choice of coins was known? Then we have a traditional MLE scenario. The parameters (biases) can be estimated as:
estimated bias coin A = # tails coin A / # times coin A was thrown
estimated bias coin B = # tails coin B / # times coin B was thrown
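To make this concrete, here is a minimal sketch (my own illustration with hypothetical data, not part of the original example) of the complete-data MLE when the coin used in each experiment is known; throws are encoded as 1 for tails and 0 for heads.

```python
import numpy as np

# Complete-data MLE sketch: each row of `flips` is one experiment of 10 throws
# (1 = tails, 0 = heads), and `coin_ids` says which coin was thrown.
flips = np.array([[1, 0, 1, 1, 0, 1, 1, 1, 0, 1],
                  [0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
                  [1, 1, 1, 0, 1, 1, 0, 1, 1, 1],
                  [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]])
coin_ids = np.array(['A', 'B', 'A', 'B'])

for coin in ('A', 'B'):
    rows = flips[coin_ids == coin]
    bias = rows.sum() / rows.size   # tails of this coin / throws of this coin
    print(coin, bias)
```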
EM is an iterative procedure that alternates between two main steps: Expectation (E) and Maximization (M). The parameters are first set to some initial values. The Expectation step estimates the hidden variables based on the current parameter values. The Maximization step calculates the MLE of the parameters given the estimated hidden variables. Each iteration is guaranteed not to decrease the likelihood, so the method converges to a local maximum; in practice, we stop when the estimates change by less than some tolerance.
How would EM work for the aforementioned example?
- Set initial values for the biases of the coins.
- Compute a probability distribution over the unknown coin choices based on the current parameters (biases).
- Estimate new parameters using MLE, weighting the data by the probability distribution found in the previous step.
- Go back to step 2 and repeat until convergence (a code sketch of this loop is given below).
I took this example from the article What is the expectation maximization algorithm? (see References), which walks through a complete running example that I found very clever.
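Here is a minimal sketch of that loop in Python for the two-coin problem. It is my own re-implementation of the idea, not the article's code, and it uses hypothetical tail counts and arbitrary initial biases.

```python
import numpy as np

# Each entry of `tails` is the number of tails observed in one experiment of
# n throws (hypothetical data).
tails = np.array([5, 9, 8, 4, 7])
n = 10

theta_A, theta_B = 0.6, 0.5          # step 1: initial guesses for the biases

for _ in range(50):
    # E-step: probability that each experiment used coin A, given the current
    # biases (the binomial coefficient cancels in the ratio, so it is omitted).
    like_A = theta_A**tails * (1 - theta_A)**(n - tails)
    like_B = theta_B**tails * (1 - theta_B)**(n - tails)
    w_A = like_A / (like_A + like_B)
    w_B = 1.0 - w_A

    # M-step: weighted MLE of each bias (expected tails / expected throws).
    theta_A = (w_A * tails).sum() / (w_A * n).sum()
    theta_B = (w_B * tails).sum() / (w_B * n).sum()

print(theta_A, theta_B)              # the two estimated biases
```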
If you are familiar with the K-means algorithm (I am), you will find this explanation of K-means as EM very instructive: the cluster assignment of each point is the latent variable and the centroid of each cluster is a parameter to be estimated. Therefore, the cluster of each point is updated in the expectation step and the centroids are recomputed in the maximization step. However, in K-means the expectation step actually assigns each point to a single cluster instead of computing a probability distribution over assignments.
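To illustrate this hard-assignment view, here is a small sketch (synthetic 2-D data, my own illustration): the "E-step" assigns each point to its nearest centroid and the "M-step" recomputes each centroid as the mean of its assigned points.

```python
import numpy as np

# K-means as "hard" EM on two synthetic 2-D blobs.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 1.0, (50, 2)),
                    rng.normal(5.0, 1.0, (50, 2))])
centroids = points[[0, 99]].copy()   # initialize from one point of each blob

for _ in range(20):
    # "E-step": hard assignment of each point to the closest centroid.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # "M-step": each centroid becomes the mean of the points assigned to it.
    centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])

print(centroids)   # close to the blob centers (0, 0) and (5, 5)
```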
Another example of the application of EM is the following: suppose that you have two Normal distributions, A and B, and that you are given instances sampled from these distributions, but you do not know which distribution each instance was sampled from. How can you compute the means mu_A and mu_B of the distributions if both distributions have the same known variance? Surprisingly, the answer is EM.
What if there is a single distribution instead of two? Then estimating the mean is straightforward: it is just the sample mean of the instances.
What if the distribution each instance was sampled from is known? In this case, mu_A and mu_B are simply the mean values of the instances assigned to each distribution. Still pretty easy.
How do we solve this problem using EM? The means are the parameters, and the distribution from which each instance was sampled is the latent variable.
- Define initial values for mu_A and mu_B.
- Expectation step: Estimate the probability that each instance was generated by A and by B.
- Maximization step: Compute the weighted means of instances for A and B.
- Go back to step 2 (until the estimates converge).
In the expectation step, the probabilities are computed as follows:
Probability that the instance x_i is from A:
P(x = x_i | mu = mu_A) / (P(x = x_i | mu = mu_A) + P(x = x_i | mu = mu_B))
Probability that the instance x_i is from B:
P(x = x_i | mu = mu_B) / (P(x = x_i | mu = mu_A) + P(x = x_i | mu = mu_B))
In the maximization step, the means are updated as follows:
mu_A = Sum_{x_i} P(x_i is from A) * x_i / Sum_{x_i} P(x_i is from A)
mu_B = Sum_{x_i} P(x_i is from B) * x_i / Sum_{x_i} P(x_i is from B)
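Putting the two steps together, here is a minimal sketch for the two-Normal example, assuming synthetic data, a shared known variance, and equal prior probability for A and B (the Gaussian normalization constant cancels in the E-step ratio, so it is omitted).

```python
import numpy as np

# Synthetic data: instances from two Normal distributions with known,
# shared variance but unknown means.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])
sigma2 = 1.0                      # known, shared variance

mu_A, mu_B = -1.0, 1.0            # step 1: initial guesses for the means

for _ in range(100):
    # E-step: P(x_i is from A) and P(x_i is from B) under the current means.
    dens_A = np.exp(-(x - mu_A) ** 2 / (2 * sigma2))
    dens_B = np.exp(-(x - mu_B) ** 2 / (2 * sigma2))
    p_A = dens_A / (dens_A + dens_B)
    p_B = 1.0 - p_A

    # M-step: weighted means, exactly the update formulas above.
    mu_A = (p_A * x).sum() / p_A.sum()
    mu_B = (p_B * x).sum() / p_B.sum()

print(mu_A, mu_B)                 # close to the true means -2 and 3
```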
Alternative approaches to EM include the gradient descent and Newton-Raphson methods.
References:
Do, C. B., & Batzoglou, S. (2008). What is the expectation maximization algorithm? Nature Biotechnology, 26(8), 897–899.
McLachlan, G. J., & Krishnan, T. (1997). The EM Algorithm and Extensions. Wiley, New York.
http://www.cs.berkeley.edu/~russell/classes/cs194/f11/lectures/CS194%20Fall%202011%20Lecture%2020-EM-proof.pdf