Unsupervised Learning - Expectation Maximization Derivation

Expectation Maximization (EM) is a popular algorithm for fitting generative models. It elegantly handles multivariate distributions and tries to find a close approximation to the underlying data-generating distribution.

In the following I will consider a particular member of the exponential family,
the (multivariate) Normal distribution:

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^{T} \Sigma^{-1} (x - \mu) \right)$$

where $\mu$ is the mean and $\Sigma$ is the covariance matrix.
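As a quick sanity check of this formula, here is a minimal NumPy sketch of the density (the function name `gaussian_pdf` and the argument names are my own, chosen for illustration):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Multivariate Normal density N(x | mu, sigma).

    x:     (d,) data point
    mu:    (d,) mean vector
    sigma: (d, d) covariance matrix
    """
    d = mu.shape[0]
    diff = x - mu
    # Normalizing constant: 1 / sqrt((2*pi)^d * |sigma|)
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    # Quadratic form (x - mu)^T sigma^{-1} (x - mu), computed via a linear solve
    quad = diff @ np.linalg.solve(sigma, diff)
    return norm_const * np.exp(-0.5 * quad)
```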

To fully understand the assumptions behind an algorithm, it usually helps to see how it was derived. The following calculations are a bit tedious, with some Aha! moments along the way (and I promise to highlight the interesting points!).

Let us consider that we are trying to explain some data distribution using a mixture of K Gaussians.
That is an important assumption right there: the data is assumed to be explained by such a mixture of Gaussians.

This derivation is done in a Bayesian framework, so first come the priors.
Say the prior probability of Gaussian $k$, i.e. the probability that a data point is generated by Gaussian $k$, is

$$P(z = k) = \pi_k \tag{1}$$

Now, in such a world of K Gaussians, what is the probability of seeing some data point $x$?

$$P(x) = \sum_{k=1}^{K} P(x \mid z = k)\, P(z = k)$$

where
$P(x \mid z = k) = \mathcal{N}(x \mid \mu_k, \Sigma_k)$ is the probability of seeing data $x$ under the distribution defined by Gaussian $k$,
$P(z = k) = \pi_k$ is the prior defined above in (1).

So,

$$P(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k) \tag{2}$$
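Equation (2) translates almost directly into code; here is a minimal sketch using SciPy's multivariate normal (the name `mixture_pdf` and the argument names are my own):

```python
from scipy.stats import multivariate_normal

def mixture_pdf(x, pis, mus, sigmas):
    """P(x) from equation (2): a pi_k-weighted sum of K Gaussian densities."""
    return sum(pi_k * multivariate_normal(mu_k, sigma_k).pdf(x)
               for pi_k, mu_k, sigma_k in zip(pis, mus, sigmas))
```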

It will also be useful to have the reverse conditional handy (it basically comes from Bayes' rule):

$$\gamma_k(x) \equiv P(z = k \mid x) = \frac{\pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x \mid \mu_j, \Sigma_j)} \tag{3}$$
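Equation (3) is exactly what the E-step of EM computes for every data point; a minimal NumPy/SciPy sketch (array shapes and names are my own):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, sigmas):
    """gamma[n, k] = P(z = k | x_n) from equation (3).

    X: (N, d) data; pis: (K,) mixing weights; mus, sigmas: per-Gaussian parameters.
    """
    # Unnormalized posteriors pi_k * N(x_n | mu_k, sigma_k), shape (N, K)
    weighted = np.column_stack([
        pi_k * multivariate_normal(mu_k, sigma_k).pdf(X)
        for pi_k, mu_k, sigma_k in zip(pis, mus, sigmas)
    ])
    # Normalize each row so the responsibilities sum to one over k
    return weighted / weighted.sum(axis=1, keepdims=True)
```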

Now let's define what we are actually trying to accomplish here.
We are basically trying to maximize the likelihood of all the data, $X = \{x_1, \dots, x_N\}$, in a world characterized by some mixture of Gaussians.
That is,

$$P(X \mid \pi, \mu, \Sigma) = \prod_{n=1}^{N} P(x_n) = \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$

It is noteworthy that there is another assumption here: the data points are independent. That is generally not a bad assumption, but something to keep in mind.

Maximizing $P(X \mid \pi, \mu, \Sigma)$ is the same as maximizing $\ln P(X \mid \pi, \mu, \Sigma)$ (as $\log(x)$ is a monotonic function of $x$).
That is, the goal is

$$\max_{\pi, \mu, \Sigma}\; \ln P(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right) \tag{4}$$
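For reference, equation (4) can be evaluated numerically like this (a sketch that uses log-sum-exp for numerical stability, which the derivation itself does not require; names are my own):

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def log_likelihood(X, pis, mus, sigmas):
    """Objective (4): sum_n log( sum_k pi_k * N(x_n | mu_k, sigma_k) )."""
    # log(pi_k) + log N(x_n | mu_k, sigma_k), shape (N, K)
    log_weighted = np.column_stack([
        np.log(pi_k) + multivariate_normal(mu_k, sigma_k).logpdf(X)
        for pi_k, mu_k, sigma_k in zip(pis, mus, sigmas)
    ])
    # Sum over components inside the log, then over data points
    return logsumexp(log_weighted, axis=1).sum()
```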

We have three sets of variables: the mixing weights $\pi_k$, the means $\mu_k$, and the covariances $\Sigma_k$.

We need to maximize the log-likelihood (4) with respect to each of them in turn.

w.r.t. $\pi_k$
Note that $\sum_{k=1}^{K} \pi_k = 1$, which can be incorporated as a constraint using a Lagrange multiplier as follows:

$$\sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)$$

Setting the derivative of this expression with respect to $\pi_k$ to zero and eliminating $\lambda$ gives

$$\pi_k = \frac{N_k}{N}$$

where we have defined $N_k = \sum_{n=1}^{N} \gamma_k(x_n)$, which can be interpreted as the expected number of data points belonging to Gaussian $k$.
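For those who want to verify that step, here is a brief sketch of the intermediate algebra (my own filling-in of the standard argument):

$$\frac{\partial}{\partial \pi_k} \left[ \sum_{n=1}^{N} \ln \left( \sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j) \right) + \lambda \left( \sum_{j=1}^{K} \pi_j - 1 \right) \right]
= \sum_{n=1}^{N} \frac{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} + \lambda = 0$$

Multiplying through by $\pi_k$ gives $N_k + \lambda \pi_k = 0$; summing over $k$ and using $\sum_k \pi_k = 1$ gives $\lambda = -N$, and therefore $\pi_k = N_k / N$.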


w.r.t. $\mu_k$

Setting the derivative of (4) with respect to $\mu_k$ to zero gives

$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_k(x_n)\, x_n$$

which basically is the weighted mean of Gaussian $k$, with each data point weighted by its responsibility $\gamma_k(x_n)$.
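As a small illustration, these first two M-step updates look like this in NumPy, assuming the responsibilities from (3) are already stored in an (N, K) array `gamma` (the function and variable names are my own):

```python
import numpy as np

def m_step_pi_mu(X, gamma):
    """M-step updates for the mixing weights pi_k and means mu_k.

    X:     (N, d) data
    gamma: (N, K) responsibilities from equation (3)
    """
    N_k = gamma.sum(axis=0)              # expected count of points per Gaussian
    pis = N_k / X.shape[0]               # pi_k = N_k / N
    mus = (gamma.T @ X) / N_k[:, None]   # mu_k = (1 / N_k) * sum_n gamma_nk * x_n
    return pis, mus
```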


w.r.t. $\Sigma_k$

Now comes the hardest part: taking the derivative with respect to $\Sigma_k$. Since there are some non-trivial Jacobians involved, I will make use of a couple of shortcuts from The Matrix Cookbook.
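For reference, a sketch of the textbook route: the two Matrix Cookbook identities usually invoked here (for symmetric $\Sigma$) are

$$\frac{\partial}{\partial \Sigma} \ln |\Sigma| = \Sigma^{-1}, \qquad
\frac{\partial}{\partial \Sigma} \left[ (x - \mu)^{T} \Sigma^{-1} (x - \mu) \right] = -\,\Sigma^{-1} (x - \mu)(x - \mu)^{T}\, \Sigma^{-1}$$

Setting the derivative of (4) with respect to $\Sigma_k$ to zero and using these identities yields the standard update

$$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_k(x_n)\, (x_n - \mu_k)(x_n - \mu_k)^{T}$$

i.e. the responsibility-weighted covariance of the points assigned to Gaussian $k$.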

Written on August 4, 2017