Revision 1
1, Model Representation and Objective Function
The Gaussian Mixture Model (GMM) represents data as a mixture of $K$ multivariate Gaussian distributions. Given a dataset $X = \{x_1, x_2, \ldots, x_N\}$ with $x_i \in \mathbb{R}^n$, the likelihood of a single data point $x_i$ under the model is:
\begin{gather*}
p(x_i) = \sum_{k=1}^K \pi_k \, g(x_i \mid \mu_k, \Sigma_k)
\end{gather*}
Where:
$K$ is the number of clusters (components).
$\pi_k \geq 0$ are the mixing coefficients, satisfying $\sum_{k=1}^K \pi_k = 1$.
And $g(x_i \mid \mu_k, \Sigma_k)$ is the $k$-th multivariate Gaussian density, defined as:
\begin{gather*}
g(x_i \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{n/2}|\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)\right)
\end{gather*}
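As a sanity check, the density above can be evaluated numerically. The following is a minimal NumPy sketch (the name `gaussian_pdf` is illustrative; in practice `scipy.stats.multivariate_normal.pdf` computes the same quantity):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Multivariate Gaussian density g(x | mu, Sigma) for x, mu in R^n."""
    n = x.shape[0]
    diff = x - mu
    # 1 / ((2*pi)^(n/2) |Sigma|^(1/2))
    norm_const = 1.0 / (np.power(2 * np.pi, n / 2) * np.sqrt(np.linalg.det(sigma)))
    # exp(-1/2 (x - mu)^T Sigma^{-1} (x - mu)), via a linear solve instead of inv
    exponent = -0.5 * diff @ np.linalg.solve(sigma, diff)
    return norm_const * np.exp(exponent)

# The standard 2-D Gaussian evaluated at its mean has density 1 / (2*pi)
val = gaussian_pdf(np.zeros(2), np.zeros(2), np.eye(2))
```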
The per-component parameters are $\theta_k = \{\mu_k, \Sigma_k\}$, and the full parameter set is $\Theta = \{\pi_k, \theta_k\}^K_{k=1}$. The goal is to estimate $\Theta$ by Maximum Likelihood Estimation (MLE) over the observed dataset $X$, with the likelihood defined as:
\begin{gather*}
L(\Theta; X) = \prod^N_{i=1} p(x_i) = \prod^N_{i=1} \sum^K_{k=1} \pi_k \, g(x_i \mid \theta_k)
\end{gather*}
To simplify computation, the Log-Likelihood is used:
\begin{gather*}
\log L(\Theta; X) = \sum^N_{i=1} \log \left( \sum^K_{k=1} \pi_k \, g(x_i \mid \theta_k) \right)
\end{gather*}
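The log-likelihood is best computed with log densities and the log-sum-exp trick, since the raw products underflow quickly. A sketch (function and variable names are assumptions for illustration):

```python
import numpy as np

def log_likelihood(X, pis, mus, sigmas):
    """log L(Theta; X) = sum_i log( sum_k pi_k g(x_i | mu_k, Sigma_k) )."""
    N, n = X.shape
    K = len(pis)
    logp = np.zeros((N, K))
    for k in range(K):
        diff = X - mus[k]                               # (N, n)
        _, logdet = np.linalg.slogdet(sigmas[k])        # log |Sigma_k|
        maha = np.einsum('ij,ij->i', diff @ np.linalg.inv(sigmas[k]), diff)
        logp[:, k] = np.log(pis[k]) - 0.5 * (n * np.log(2 * np.pi) + logdet + maha)
    # log-sum-exp over the K components, then sum over the N points
    m = logp.max(axis=1, keepdims=True)
    return float(np.sum(m.ravel() + np.log(np.exp(logp - m).sum(axis=1))))

# A single standard-normal component: log L = sum_i log g(x_i | 0, I)
X = np.array([[0.0, 0.0], [1.0, 0.0]])
ll = log_likelihood(X, [1.0], [np.zeros(2)], [np.eye(2)])
```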
So the Objective Function becomes:
\begin{gather*}
\hat{\Theta} = \arg\max_{\Theta} \sum^N_{i=1} \log \left( \sum^K_{k=1} \pi_k \, g(x_i \mid \theta_k) \right)
\end{gather*}
2, Expectation-Maximization (EM) Algorithm
The EM algorithm iteratively optimizes the parameters $\Theta$ by alternating between two steps: the Expectation (E) step and the Maximization (M) step. EM leverages latent variables $Z$, representing the cluster membership of each data point, to simplify optimization.
2.1, Latent Variable Framework
The latent variables $Z = \{Z_1, \ldots, Z_K\}$, with $Z_k = \{z_{1k}, z_{2k}, \ldots, z_{Nk}\}$ and $z_{ik} \in \{0, 1\}$, denote the cluster membership; e.g., if $x_i$ is in cluster 2 out of a total of 4 clusters, then $\{z_{i1}, z_{i2}, z_{i3}, z_{i4}\} = \{0, 1, 0, 0\}$. The complete-data log-likelihood (including $Z$) becomes:
\begin{gather*}
\log L( \Theta ; \{X; Z\} ) = \sum^N_{i=1} \sum^K_{k=1} z_{ik} \log \left( \pi_k \, g(x_i \mid \theta_k) \right)
\end{gather*}
Given the observed data $X$, the posterior probability $\gamma_{ik} = P(z_{ik} = 1 \mid x_i)$ is computed in the E-step and used to update the parameters in the M-step. The update formulas are obtained by differentiating the expected "soft" log-likelihood with respect to each parameter and setting the derivatives to zero.
2.2, EM Algorithm Steps
Step 1 - Initialization: Initialize $\Theta = \{\pi_k, \theta_k\}^K_{k=1}$ with random or heuristic values.
Step 2 - E-Step (Expectation): Calculate the posterior probabilities $\gamma_{ik}$ using Bayes' Theorem:
\begin{gather*}
\gamma_{ik} = \frac{\pi_k \, g(x_i \mid \theta_k)}{\sum_{j=1}^K \pi_j \, g(x_i \mid \theta_j)}
\end{gather*}
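A direct NumPy sketch of the E-step (the helper name `e_step` is an assumption; the densities are evaluated exactly as in the formula above):

```python
import numpy as np

def e_step(X, pis, mus, sigmas):
    """gamma[i, k] = pi_k g(x_i | theta_k) / sum_j pi_j g(x_i | theta_j)."""
    N, n = X.shape
    K = len(pis)
    weighted = np.zeros((N, K))
    for k in range(K):
        diff = X - mus[k]
        # Mahalanobis distance (x_i - mu_k)^T Sigma_k^{-1} (x_i - mu_k), row-wise
        maha = np.einsum('ij,ij->i', diff @ np.linalg.inv(sigmas[k]), diff)
        dens = np.exp(-0.5 * maha) / np.sqrt((2 * np.pi) ** n * np.linalg.det(sigmas[k]))
        weighted[:, k] = pis[k] * dens
    # Normalize rows so each point's responsibilities sum to 1
    return weighted / weighted.sum(axis=1, keepdims=True)

# A point midway between two symmetric components gets gamma = [0.5, 0.5]
X = np.array([[0.0, 0.0]])
gamma = e_step(X, [0.5, 0.5],
               [np.array([-1.0, 0.0]), np.array([1.0, 0.0])],
               [np.eye(2), np.eye(2)])
```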
Step 3 - M-Step (Maximization): Update the parameters using the posterior probabilities (responsibilities):
\begin{gather*}
\pi_k = \frac{\sum_{i=1}^N \gamma_{ik}}{N} \\[15pt]
\mu_k = \frac{\sum_{i=1}^N \gamma_{ik} x_i}{\sum_{i=1}^N \gamma_{ik}} \\[15pt]
\Sigma_k = \frac{\sum_{i=1}^N \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i=1}^N \gamma_{ik}}
\end{gather*}
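The three updates can be written in a few vectorized lines; the following is a sketch assuming a responsibility matrix `gamma` of shape (N, K):

```python
import numpy as np

def m_step(X, gamma):
    """Update (pi_k, mu_k, Sigma_k) from responsibilities gamma of shape (N, K)."""
    N, n = X.shape
    K = gamma.shape[1]
    Nk = gamma.sum(axis=0)                  # effective number of points per cluster
    pis = Nk / N                            # pi_k = (1/N) sum_i gamma_ik
    mus = (gamma.T @ X) / Nk[:, None]       # mu_k = responsibility-weighted mean
    sigmas = []
    for k in range(K):
        diff = X - mus[k]                   # (N, n)
        # Sigma_k = sum_i gamma_ik (x_i - mu_k)(x_i - mu_k)^T / sum_i gamma_ik
        sigmas.append((gamma[:, k, None] * diff).T @ diff / Nk[k])
    return pis, mus, sigmas

# With all responsibility on one cluster, the update reduces to the
# sample mean and the (biased) sample covariance of the data
X = np.array([[0.0, 0.0], [2.0, 0.0]])
gamma = np.ones((2, 1))
pis, mus, sigmas = m_step(X, gamma)
```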
Step 4 - Convergence Check: Repeat the E-step and M-step until the log-likelihood converges, with $\epsilon$ being a predefined tolerance:
\begin{gather*}
\Delta L = L( \Theta^{\text{new}} ; \{X; Z\} ) - L( \Theta ; \{X; Z\} ) < \epsilon
\end{gather*}
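Putting the four steps together, a minimal end-to-end EM loop might look like the following sketch. The initialization strategy and the small ridge added to each covariance for numerical stability are implementation choices, not part of the derivation above:

```python
import numpy as np

def fit_gmm(X, K, max_iter=200, eps=1e-6, seed=0):
    """Minimal EM loop for a GMM: returns (pis, mus, sigmas, log_likelihood_trace)."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    # Step 1 - initialization: uniform weights, K random points as means,
    # the overall data covariance (plus a tiny ridge) for every component
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)].copy()
    sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(n) for _ in range(K)])
    lls = []
    for _ in range(max_iter):
        # Step 2 - E-step: responsibilities via log densities (log-sum-exp for stability)
        logw = np.zeros((N, K))
        for k in range(K):
            diff = X - mus[k]
            _, logdet = np.linalg.slogdet(sigmas[k])
            maha = np.einsum('ij,ij->i', diff @ np.linalg.inv(sigmas[k]), diff)
            logw[:, k] = np.log(pis[k]) - 0.5 * (n * np.log(2 * np.pi) + logdet + maha)
        m = logw.max(axis=1, keepdims=True)
        lse = (m + np.log(np.exp(logw - m).sum(axis=1, keepdims=True))).ravel()
        gamma = np.exp(logw - lse[:, None])
        lls.append(lse.sum())
        # Step 4 - convergence check on the log-likelihood improvement
        if len(lls) > 1 and lls[-1] - lls[-2] < eps:
            break
        # Step 3 - M-step: closed-form parameter updates
        Nk = gamma.sum(axis=0)
        pis = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(n)
    return pis, mus, sigmas, lls

# Two well-separated clusters in 2-D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3.0, 1.0, size=(100, 2)),
               rng.normal(3.0, 1.0, size=(100, 2))])
pis, mus, sigmas, lls = fit_gmm(X, K=2)
```

The log-likelihood trace `lls` should be (numerically) non-decreasing, which is exactly the guarantee discussed in the convergence section.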
3, Guaranteed Convergence
The EM algorithm guarantees a non-decreasing log-likelihood by leveraging Jensen's inequality:
\begin{gather*}
\log \left( \sum^K_{k=1} \pi_k \, g(x_i \mid \theta_k) \right) \geq \sum^K_{k=1} \gamma_{ik} \log \left( \frac{\pi_k \, g(x_i \mid \theta_k)}{\gamma_{ik}} \right) \\[15pt]
\implies \log L(\Theta^{\text{new}}; X) \geq \log L(\Theta; X)
\end{gather*}
Jensen’s inequality ensures that the log-likelihood is lower-bounded by the expected value of the logarithm, thereby guaranteeing a non-decreasing sequence of log-likelihood values during the EM iterations.
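One step worth making explicit: for any responsibilities $\gamma_{ik}$, the gap in Jensen's bound is exactly a KL divergence, since $\pi_k \, g(x_i \mid \theta_k) = p(x_i) \, P(z_{ik} = 1 \mid x_i)$:
\begin{gather*}
\sum^K_{k=1} \gamma_{ik} \log \left( \frac{\pi_k \, g(x_i \mid \theta_k)}{\gamma_{ik}} \right) = \log p(x_i) - \mathrm{KL}\left( \gamma_i \,\|\, P(z_i \mid x_i) \right) \\[15pt]
\mathrm{KL}\left( \gamma_i \,\|\, P(z_i \mid x_i) \right) = \sum^K_{k=1} \gamma_{ik} \log \frac{\gamma_{ik}}{P(z_{ik} = 1 \mid x_i)} \geq 0
\end{gather*}
Setting $\gamma_{ik}$ to the posterior makes the KL term zero, so the E-step makes the bound tight at the current parameters; the M-step can only raise the bound, and hence the log-likelihood never decreases.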
Draft 1
1, Model Representation and Objective Function
Given a dataset $X = \{x_1, x_2, \ldots, x_N\}$ with $x_i \in \mathbb{R}^n$, the GMM fitted on this dataset is represented as:
\begin{gather*}
p(x) = \sum_{k=1}^K \pi_k \, g(x \mid \mu_k, \Sigma_k)
\end{gather*}
Where:
$K$ is the number of clusters / number of Gaussian distributions.
$\pi_k$ represents the mixing coefficient for the $k$-th cluster.
$\int p(x) \, dx = 1$.
$\sum_{k=1}^K \pi_k = 1$.
And $g(x \mid \mu_k, \Sigma_k)$ represents a multivariate Gaussian distribution:
\begin{gather*}
g(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{n/2}|\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right)
\end{gather*}
We establish the parameters:
$\theta = \{\mu_k, \Sigma_k\}^K_{k=1}$.
$\Theta = \{\pi_k, \theta_k\}^K_{k=1}$.
Define the Maximum Likelihood Estimation (MLE) equation:
\begin{gather*}
L(\Theta; X) = \prod^N_{i=1} p(x_i) = \prod^N_{i=1} \sum^K_{k=1} \pi_k \, g(x_i \mid \theta_k)
\end{gather*}
Define the Log-Likelihood of the given MLE equation:
\begin{gather*}
\log L(\Theta; X) = \sum^N_{i=1} \log \left( \sum^K_{k=1} \pi_k \, g(x_i \mid \theta_k) \right)
\end{gather*}
Therefore we have our objective function:
\begin{gather*}
\hat{\Theta} := \arg\max_{\Theta} \prod^N_{i=1} \sum^K_{k=1} \pi_k \, g(x_i \mid \theta_k) \\[5pt]
\iff \hat{\Theta} := \arg\max_{\Theta} \sum^N_{i=1} \log \left( \sum^K_{k=1} \pi_k \, g(x_i \mid \theta_k) \right)
\end{gather*}
2, Expectation-Maximization (EM)
Consider our dataset $X = \{x_1, x_2, \ldots, x_N\}$ with $x_i \in \mathbb{R}^n$.
Introduce the latent variables $Z = \{Z_1, \ldots, Z_K\}$, with $Z_k = \{z_{1k}, z_{2k}, \ldots, z_{Nk}\}$ and $z_{ik} \in \{0, 1\}$,
which denote the cluster membership; e.g., if $x_i$ is in cluster 2 out of a total of 4 clusters, then $\{z_{i1}, z_{i2}, z_{i3}, z_{i4}\} = \{0, 1, 0, 0\}$.
So we formulate the complete dataset $\{X; Z\}$, consisting of the observed data $X$ and the unobserved (latent) variables $Z$.
Foundation and related theorems:
The full log-likelihood of the MLE equation on the complete dataset is:
\begin{gather*}
\log L( \Theta ; \{X; Z\} ) = \sum^N_{i=1} \sum^K_{k=1} z_{ik} \log \left( \pi_k \, g(x_i \mid \theta_k) \right)
\end{gather*}
Define the Q-function, the expectation of the complete-data log-likelihood under the posterior of $Z$ given the current parameters $\Theta^{t}$:
\begin{gather*}
Q(\Theta \mid \Theta^{t}) = \mathbb{E}_{Z \mid X, \Theta^{t}} \left[ \log L( \Theta ; \{X; Z\} ) \right]
\end{gather*}
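Expanding the expectation, $\mathbb{E}[z_{ik}] = \gamma_{ik}$, so the Q-function splits into two independent terms; this is why the M-step updates for $\pi_k$ and for $(\mu_k, \Sigma_k)$ decouple:
\begin{gather*}
Q(\Theta \mid \Theta^{t}) = \sum^N_{i=1} \sum^K_{k=1} \gamma_{ik} \log \pi_k + \sum^N_{i=1} \sum^K_{k=1} \gamma_{ik} \log g(x_i \mid \theta_k)
\end{gather*}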
By Jensen’s inequality:
\begin{gather*}
\log \left( \sum^K_{k=1} \pi_k \, g(x_i \mid \theta_k) \right) \geq \sum^K_{k=1} \gamma_{ik} \log \left( \frac{\pi_k \, g(x_i \mid \theta_k)}{\gamma_{ik}} \right) \\[15pt]
\implies \log L(\Theta^{\text{new}}; X) \geq \log L(\Theta; X)
\end{gather*}
This provides the proof that EM converges to a local maximum of the MLE objective.
The steps of EM are derived by setting the derivatives of the Q-function with respect to each parameter to zero.
E-Step: Calculate the Posterior Probabilities
\begin{gather*}
\text{By Bayes' Theorem:} \\[10pt]
P(z_{ik} = 1 \mid x_i) = \frac{P(x_i \mid z_{ik} = 1) \, P(z_{ik} = 1)}{p(x_i)} \\[10pt]
\implies \gamma_{ik} = P(z_{ik} = 1 \mid x_i) = \frac{\pi_k \, g(x_i \mid \theta_k)}{\sum_{j=1}^K \pi_j \, g(x_i \mid \theta_j)}
\end{gather*}
Where:
$\sum_{k=1}^K \gamma_{ik} = 1$ for each $i$.
M-Step: Update Parameters
\begin{gather*}
\pi_k = \frac{\sum_{i=1}^N \gamma_{ik}}{N} \\[15pt]
\mu_k = \frac{\sum_{i=1}^N \gamma_{ik} x_i}{\sum_{i=1}^N \gamma_{ik}} \\[15pt]
\Sigma_k = \frac{\sum_{i=1}^N \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i=1}^N \gamma_{ik}}
\end{gather*}
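For completeness, the $\pi_k$ update follows from maximizing the Q-function under the constraint $\sum_{k} \pi_k = 1$ with a Lagrange multiplier $\lambda$:
\begin{gather*}
\frac{\partial}{\partial \pi_k} \left[ \sum^N_{i=1} \sum^K_{j=1} \gamma_{ij} \log \pi_j + \lambda \left( \sum^K_{j=1} \pi_j - 1 \right) \right] = \frac{\sum_{i=1}^N \gamma_{ik}}{\pi_k} + \lambda = 0 \\[10pt]
\implies \pi_k = -\frac{\sum_{i=1}^N \gamma_{ik}}{\lambda} = \frac{\sum_{i=1}^N \gamma_{ik}}{N} \quad (\text{summing over } k \text{ gives } \lambda = -N)
\end{gather*}
The $\mu_k$ and $\Sigma_k$ updates follow the same pattern, without the constraint.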
Repeat E-Step and M-Step until convergence:
\begin{gather*}
\Delta L = L( \Theta^{\text{new}} ; \{X; Z\} ) - L( \Theta ; \{X; Z\} ) < \epsilon
\end{gather*}