# 12.4 – Gaussian Mixture Models

Let’s now look at a more principled method of learning from input data. In Gaussian Mixture Models (GMMs), we assume that the data is a mixture of Gaussians, each with its own mean and variance. What does that mean precisely? It means that we assume that there are $k$ Gaussians that each data point could be generated by. We assume that each data point was generated by a two-step process: first, you select one of the $k$ Gaussians according to a prior distribution, $p(z^{(i)}; G)$, where $z^{(i)} \sim \text{ Multinomial}(\phi)$. We read this as “probability of $z^{(i)}$, parameterized by G, where $z^{(i)}$ has a Multinomial distribution”. Essentially, $G$ is the set of the parameters for our model. After selecting one of the Gaussians, we generate a data point. We assume that this process was repeated for every data point. This method should remind you of the Gaussian Discriminant Analysis model we discussed–however, here, we don’t know the class labels.

We can now write the probability distribution of our inputs. Note that this is simply the law of total probability: $p(x^{(i)}; G) = \sum\limits_{l=1}^k p(x^{(i)} | z^{(i)}=l; G) p(z^{(i)}=l; G)$

The parameters of our model are $G = (\boldsymbol \mu, \boldsymbol \sigma, \boldsymbol \phi)$.

To solve for these parameters, we use an iterative algorithm called the Expectation-Maximization (EM) algorithm. We’ll talk more about this algorithm later, but it provides a general method to find maximum likelihood (ML) or maximum a posteriori (MAP) estimates of parameters, where the model depends on some unobservable variables. In our case, the unobservable (or latent) variables are the priors, since we can only see the $x^{(i)}$s, not the $z^{(i)}$s. In fact, if we had known the $z^{(i)}$s, we could write down the likelihood, and get the same MLEs that we got for GDA. The EM algorithm has two steps:

• Expectation (E) step: Here, we compute the posterior distribution of $z^{(i)}$ for each $x^{(i)}$. For each $i, j$, we set $p(z^{(i)} = j | x^{(i)}; G) = \frac{p(x^{(i)} | z^{(i)}=j; \boldsymbol\mu, \boldsymbol\sigma) p(z^{(i)} = j; \phi)}{p(x^{(i)})} = \frac{\mathcal{N}(\mu_j, \sigma_j) p(z^{(i)}=j; \phi)}{\sum\limits_{l=1}^k \mathcal{N}(\mu_l, \sigma_l) p(z^{(i)}=l; \phi)}$

• Maximization (M) step: In this step, we use the results of the E step to update the parameters of our model. \begin{aligned} \mu_j &= \frac{\sum\limits_{i=1}^m p(z^{(i)} = j | x^{(i)}; G) x^{(i)}}{\sum\limits_{i=1}^m p(z^{(i)} = j | x^{(i)}; G)} \\ \sigma_j^2 &= \frac{\sum\limits_{i=1}^m p(z^{(i)} = j | x^{(i)}; G) \left\vert x^{(i)} - \mu_j \right\vert^2}{\sum\limits_{i=1}^m p(z^{(i)} = j | x^{(i)}; G)} \\ \phi_j &= \frac{1}{m} \sum\limits_{i=1}^m p(z^{(i)} = j | x^{(i)}; G) \end{aligned}

Note that in the M step, the update formulas are similar to the GDA ones, except that the indicator functions have been replaced with the posterior distribution values. Also, note that the EM algorithm expects that you have all the data at once, making it a batch learning algorithm, while k-means is an online learning algorithm, which means you can give the data in parts, and it will work correctly. The EM algorithm, like k-means, is also susceptible to local optima.