
Revision 1

1. Model Representation and Objective Function

The Gaussian Mixture Model (GMM) represents data as a mixture of $K$ multivariate Gaussian distributions. Given a dataset $X = \{x_1, x_2, \ldots, x_N\}$ with $x_i \in \mathbb{R}^n$, the likelihood of a single data point $x_i$ under the model is:

\begin{gather*} p(x_i) = \sum_{k=1}^K \pi_k \, g(x_i \mid \mu_k, \Sigma_k) \end{gather*}

Where:

  • $K$ is the number of clusters (components).
  • $\pi_k \geq 0$ are the mixing coefficients, satisfying $\sum_{k=1}^K \pi_k = 1$.

And $g(x_i \mid \mu_k, \Sigma_k)$ is the $k$-th multivariate Gaussian density, defined as:

\begin{gather*} g(x_i \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{n/2}|\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)\right) \end{gather*}
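As a quick sanity check of the density formula, a minimal NumPy sketch (the function name and test point are illustrative, not from the original notes):

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Evaluate g(x | mu, Sigma) for an n-dimensional point x."""
    n = x.shape[0]
    diff = x - mu
    # Normalizer (2*pi)^(n/2) * |Sigma|^(1/2)
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(sigma))
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), via a linear solve
    quad = diff @ np.linalg.solve(sigma, diff)
    return np.exp(-0.5 * quad) / norm

# Standard normal in 2D evaluated at its mean: 1 / (2*pi) ≈ 0.1592
print(gaussian_density(np.zeros(2), np.zeros(2), np.eye(2)))
```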

The per-component parameters are $\theta_k = \{\mu_k, \Sigma_k\}$, and the full parameter set is $\Theta = \{\pi_k, \theta_k\}_{k=1}^K$. The goal is to estimate $\Theta$ via Maximum Likelihood Estimation (MLE) on the observed dataset $X$, with the likelihood defined as:

\begin{gather*} L(\Theta; X) = \prod^N_{i=1} p(x_i) = \prod^N_{i=1} \sum^K_{k=1} \pi_k \, g(x_i \mid \theta_k) \end{gather*}

To simplify computation, the Log-Likelihood is used:

\begin{gather*} \log L(\Theta; X) = \sum^N_{i=1} \log \left( \sum^K_{k=1} \pi_k \, g(x_i \mid \theta_k) \right) \end{gather*}

So the Objective Function becomes:

\begin{gather*} \hat{\Theta} = \arg\max_{\Theta} \sum^N_{i=1} \log \left( \sum^K_{k=1} \pi_k \, g(x_i \mid \theta_k) \right) \end{gather*}
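The objective can be evaluated directly for a given parameter set; a small NumPy sketch (the function name is illustrative):

```python
import numpy as np

def gmm_log_likelihood(X, pis, mus, sigmas):
    """log L(Theta; X) = sum_i log( sum_k pi_k * g(x_i | mu_k, Sigma_k) )."""
    N, n = X.shape
    K = len(pis)
    dens = np.empty((N, K))
    for k in range(K):
        diff = X - mus[k]
        # (x_i - mu_k)^T Sigma_k^{-1} (x_i - mu_k) for all i at once
        quad = np.einsum('ij,jl,il->i', diff, np.linalg.inv(sigmas[k]), diff)
        norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(sigmas[k]))
        dens[:, k] = np.exp(-0.5 * quad) / norm
    return np.log(dens @ pis).sum()

# One standard-normal component, one point at the mean:
# log(1 / sqrt(2*pi)) = -0.5 * log(2*pi) ≈ -0.9189
X = np.array([[0.0]])
print(gmm_log_likelihood(X, np.array([1.0]), np.array([[0.0]]), np.array([[[1.0]]])))
```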

2. Expectation-Maximization (EM) Algorithm

The EM algorithm iteratively optimizes the parameters $\Theta$ by alternating between two steps: the Expectation (E) step and the Maximization (M) step. EM leverages latent variables $Z$, representing the cluster membership of each data point, to simplify optimization.

2.1. Latent Variable Framework

The latent variables $Z = \{Z_1, \ldots, Z_K\}$, with $Z_k = \{z_{1k}, z_{2k}, \ldots, z_{Nk}\}$ and $z_{ik} \in \{0, 1\}$, denote cluster membership; e.g., if $x_i$ belongs to cluster 2 out of a total of 4 clusters, then $(z_{i1}, z_{i2}, z_{i3}, z_{i4}) = (0, 1, 0, 0)$. The complete-data log-likelihood (including $Z$) becomes:

\begin{gather*} \log L( \Theta ; \{X; Z\} ) = \sum^N_{i=1} \sum^K_{k=1} z_{ik} \log \left( \pi_k \, g(x_i \mid \theta_k) \right) \end{gather*}

Given the observed data $X$, the posterior probability $\gamma_{ik} = P(z_{ik} = 1 \mid x_i)$ is computed in the E-step and used to update parameters in the M-step. The update formulas are obtained by differentiating the expected ("soft") complete-data log-likelihood with respect to each parameter and setting the derivatives to zero.

2.2. EM Algorithm Steps
  • Step 1 - Initialization: Initialize $\Theta = \{\pi_k, \theta_k\}_{k=1}^K$ with random or heuristic values.
  • Step 2 - E-Step (Expectation): Calculate the posterior probabilities $\gamma_{ik}$ using Bayes’ Theorem:
\begin{gather*} \gamma_{ik} = \frac{\pi_k \, g(x_i \mid \theta_k)}{\sum_{j=1}^K \pi_j \, g(x_i \mid \theta_j)} \end{gather*}
  • Step 3 - M-Step (Maximization): Update the parameters using the posterior probabilities (responsibilities):
\begin{gather*} \pi_k = \frac{\sum_{i=1}^N \gamma_{ik}}{N} \quad \\[15pt] \mu_k = \frac{\sum_{i=1}^N \gamma_{ik}\, x_i}{\sum_{i=1}^N \gamma_{ik}} \quad \\[15pt] \Sigma_k = \frac{\sum_{i=1}^N \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i=1}^N \gamma_{ik}} \end{gather*}
  • Step 4 - Convergence Check: Repeat the E-Step and M-Step until the log-likelihood converges, with $\epsilon$ being a predefined tolerance:
\begin{gather*} \Delta L = \log L( \Theta^{\text{new}} ; X ) - \log L( \Theta ; X ) < \epsilon \end{gather*}
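The four steps above can be sketched end-to-end in NumPy. This is a minimal illustration, not a production implementation: the function name, the initialization scheme (uniform weights, means at random data points, covariances from the overall data covariance), and the small covariance regularizer are all assumptions:

```python
import numpy as np

def fit_gmm(X, K, max_iter=200, tol=1e-6, seed=0):
    """EM for a GMM on data X (N, n), stopping when the log-likelihood gain < tol."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    # Step 1 - Initialization (illustrative choice)
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)].copy()
    base_cov = np.cov(X.T) + 1e-6 * np.eye(n)
    sigmas = np.array([base_cov.copy() for _ in range(K)])
    prev_ll = -np.inf
    for _ in range(max_iter):
        # Component densities g(x_i | mu_k, Sigma_k), shape (N, K)
        dens = np.empty((N, K))
        for k in range(K):
            diff = X - mus[k]
            quad = np.einsum('ij,jl,il->i', diff, np.linalg.inv(sigmas[k]), diff)
            norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(sigmas[k]))
            dens[:, k] = np.exp(-0.5 * quad) / norm
        mix = dens @ pis                     # p(x_i)
        ll = np.log(mix).sum()               # observed-data log-likelihood
        # Step 2 - E-Step: responsibilities gamma_ik
        gamma = dens * pis / mix[:, None]
        # Step 3 - M-Step: closed-form parameter updates
        Nk = gamma.sum(axis=0)
        pis = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(n)
        # Step 4 - Convergence check
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pis, mus, sigmas, ll
```

The means are responsibility-weighted averages of the data, so they always stay inside the data's bounding box; the `1e-6` jitter keeps each $\Sigma_k$ invertible when a cluster collapses onto few points.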

3. Guaranteed Convergence

The EM algorithm guarantees a non-decreasing log-likelihood by leveraging Jensen’s inequality:

\begin{gather*} \log \left( \sum^K_{k=1} \pi_k \, g(x_i \mid \theta_k) \right) \geq \sum^K_{k=1} \gamma_{ik} \log \left( \frac{\pi_k \, g(x_i \mid \theta_k)}{\gamma_{ik}} \right) \\[20pt] \implies \log L(\Theta^{\text{new}}; X) \geq \log L(\Theta; X) \end{gather*}

Jensen’s inequality ensures that the log-likelihood is lower-bounded by the expected complete-data log-likelihood; since the M-step maximizes this lower bound, the sequence of log-likelihood values is non-decreasing across EM iterations.

Draft 1

1. Model Representation and Objective Function

Given a dataset $X = \{x_1, x_2, \ldots, x_N\}$ with $x_i \in \mathbb{R}^n$, the GMM fitted on this dataset is represented as:

\begin{gather*} p(x) = \sum_{k=1}^K \pi_k \, g(x \mid \mu_k, \Sigma_k) \end{gather*}

Where:

  • $K$ is the number of clusters / number of Gaussian distributions.
  • $\pi_k$ is the mixing coefficient for the $k$-th cluster.
  • $\int p(x) \, dx = 1$.
  • $\sum_{k=1}^K \pi_k = 1$.

And $g(x \mid \mu_k, \Sigma_k)$ represents a multivariate Gaussian distribution:

\begin{gather*} g(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{n/2}|\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right) \end{gather*}

We establish the parameters:

  • $\theta_k = \{\mu_k, \Sigma_k\}$ for each component.
  • $\Theta = \{\pi_k, \theta_k\}_{k=1}^K$.

Define the Maximum Likelihood Estimation (MLE) equation:

\begin{gather*} L(\Theta; X) = \prod^N_{i=1} p(x_i) = \prod^N_{i=1} \sum^K_{k=1} \pi_k \, g(x_i \mid \theta_k) \end{gather*}

Define the Log-Likelihood of the given MLE equation:

\begin{gather*} \log L(\Theta; X) = \sum^N_{i=1} \log \left( \sum^K_{k=1} \pi_k \, g(x_i \mid \theta_k) \right) \end{gather*}

Therefore we have our objective function:

\begin{gather*} \hat{\Theta} := \arg\max_{\Theta} \prod^N_{i=1} \sum^K_{k=1} \pi_k \, g(x_i \mid \theta_k) \\[5pt] = \arg\max_{\Theta} \sum^N_{i=1} \log \left( \sum^K_{k=1} \pi_k \, g(x_i \mid \theta_k) \right) \end{gather*}

2. Expectation-Maximization (EM)

  • Consider our dataset $X = \{x_1, x_2, \ldots, x_N\}$ with $x_i \in \mathbb{R}^n$.
  • Introduce the latent variables $Z = \{Z_1, \ldots, Z_K\}$, with $Z_k = \{z_{1k}, z_{2k}, \ldots, z_{Nk}\}$ and $z_{ik} \in \{0, 1\}$.
  • These denote cluster membership; e.g., if $x_i$ belongs to cluster 2 out of a total of 4 clusters, then $(z_{i1}, z_{i2}, z_{i3}, z_{i4}) = (0, 1, 0, 0)$.
  • So we formulate the complete dataset $\{X; Z\}$ (with $Z$ unobserved) and the observed dataset $X$.

Foundation and related theorems:

The complete-data log-likelihood on $\{X; Z\}$ is:

\begin{gather*} \log L( \Theta ; \{X; Z\} ) = \sum^N_{i=1} \sum^K_{k=1} z_{ik} \log \left( \pi_k \, g(x_i \mid \theta_k) \right) \end{gather*}

Define the Q-Function:

\begin{gather*} Q(\Theta \mid \Theta^{t}) = \mathbb{E}_{Z \mid X, \Theta^{t}} \left[ \log L( \Theta ; \{X; Z\} ) \right] \end{gather*}
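Taking this expectation replaces each indicator $z_{ik}$ by its posterior mean $\gamma_{ik} = \mathbb{E}[z_{ik} \mid x_i, \Theta^t]$, giving the Q-function in closed form:

```latex
\begin{gather*}
Q(\Theta \mid \Theta^{t})
  = \sum^N_{i=1} \sum^K_{k=1} \gamma_{ik}
    \left[ \log \pi_k + \log g(x_i \mid \theta_k) \right]
\end{gather*}
```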

By Jensen’s inequality:

\begin{gather*} \log \left( \sum^K_{k=1} \pi_k \, g(x_i \mid \theta_k) \right) \geq \sum^K_{k=1} \gamma_{ik} \log \left( \frac{\pi_k \, g(x_i \mid \theta_k)}{\gamma_{ik}} \right) \\[20pt] \implies \log L(\Theta^{\text{new}}; X) \geq \log L(\Theta; X) \end{gather*}
  • This guarantees that the log-likelihood is non-decreasing, so EM converges to a local maximum (or saddle point) of the likelihood.
  • The update steps of EM are derived by differentiating the Q-function with respect to each parameter and setting the derivatives to zero.

E-Step: Calculate Posterior Probabilities

By Bayes’ Theorem:
\begin{gather*} P(z_{ik} = 1 \mid x_i) = \frac{P(x_i \mid z_{ik} = 1)\, P(z_{ik} = 1)}{P(x_i)} \\[10pt] \implies \gamma_{ik} = P(z_{ik} = 1 \mid x_i) = \frac{\pi_k \, g(x_i \mid \theta_k)}{\sum_{j=1}^K \pi_j \, g(x_i \mid \theta_j)} \end{gather*}

Where:

  • $\sum_{k=1}^K \gamma_{ik} = 1$ for each $i$.
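This normalization is a one-liner; a minimal NumPy sketch, assuming a precomputed `(N, K)` matrix `dens` of component densities (names are illustrative):

```python
import numpy as np

def e_step(dens, pis):
    """gamma[i, k] = pi_k * g_ik / sum_j pi_j * g_ij, given densities dens (N, K)."""
    weighted = dens * pis                                  # numerator pi_k * g(x_i | theta_k)
    return weighted / weighted.sum(axis=1, keepdims=True)  # normalize each row

# Two points, two components with equal mixing weights
dens = np.array([[0.2, 0.4],
                 [0.1, 0.1]])
gamma = e_step(dens, np.array([0.5, 0.5]))
print(gamma)  # each row sums to 1
```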

M-Step: Update Parameters

\begin{gather*} \pi_k = \frac{\sum_{i=1}^N \gamma_{ik}}{N} \quad \\[15pt] \mu_k = \frac{\sum_{i=1}^N \gamma_{ik}\, x_i}{\sum_{i=1}^N \gamma_{ik}} \quad \\[15pt] \Sigma_k = \frac{\sum_{i=1}^N \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i=1}^N \gamma_{ik}} \end{gather*}
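These updates map directly to array operations; a sketch in NumPy (the function name is illustrative):

```python
import numpy as np

def m_step(X, gamma):
    """Update (pi_k, mu_k, Sigma_k) from data X (N, n) and responsibilities gamma (N, K)."""
    N, n = X.shape
    Nk = gamma.sum(axis=0)                 # effective number of points per cluster
    pis = Nk / N                           # pi_k = N_k / N
    mus = (gamma.T @ X) / Nk[:, None]      # responsibility-weighted means
    sigmas = np.empty((len(Nk), n, n))
    for k in range(len(Nk)):
        diff = X - mus[k]
        # responsibility-weighted outer products, normalized by N_k
        sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return pis, mus, sigmas

# Hard (0/1) assignments recover per-cluster sample statistics
X = np.array([[0., 0.], [0., 0.], [4., 4.]])
gamma = np.array([[1., 0.], [1., 0.], [0., 1.]])
pis, mus, sigmas = m_step(X, gamma)
print(pis)  # pi = [2/3, 1/3]
print(mus)  # cluster means [0, 0] and [4, 4]
```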

Repeat E-Step and M-Step until convergence:

\begin{gather*} \Delta L = \log L( \Theta^{\text{new}} ; X ) - \log L( \Theta ; X ) < \epsilon \end{gather*}