Revision 1
1, Model Representation and Objective Function
The Gaussian Mixture Model (GMM) represents data as a mixture of $K$ multivariate Gaussian distributions. Given a dataset $X = \{x_1, x_2, \ldots, x_N\}$ with $x_i \in \mathbb{R}^n$, the likelihood of a single data point $x_i$ under the model is:
\begin{gather*}
p(x_i) = \sum_{k=1}^K \pi_k \, g(x_i \mid \mu_k, \Sigma_k)
\end{gather*}
Where:
$K$ is the number of clusters (components).
$\pi_k \geq 0$ are the mixing coefficients, satisfying $\sum_{k=1}^K \pi_k = 1$.
And $g(x_i \mid \mu_k, \Sigma_k)$ is the $k$-th multivariate Gaussian density, defined as:
\begin{gather*}
g(x_i \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{n/2}|\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)\right)
\end{gather*}
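As a sanity check, the density above can be evaluated numerically. The following is a minimal NumPy sketch (the name `gaussian_pdf` is illustrative; in practice `scipy.stats.multivariate_normal.pdf` computes the same quantity):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Multivariate Gaussian density g(x | mu, Sigma) for x, mu in R^n."""
    n = x.shape[0]
    diff = x - mu
    # 1 / ((2*pi)^(n/2) |Sigma|^(1/2))
    norm_const = 1.0 / (np.power(2 * np.pi, n / 2) * np.sqrt(np.linalg.det(sigma)))
    # exp(-1/2 (x - mu)^T Sigma^{-1} (x - mu)), via a linear solve instead of inv
    exponent = -0.5 * diff @ np.linalg.solve(sigma, diff)
    return norm_const * np.exp(exponent)

# The standard 2-D Gaussian evaluated at its mean has density 1 / (2*pi)
val = gaussian_pdf(np.zeros(2), np.zeros(2), np.eye(2))
```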
The per-component parameters are $\theta_k = \{\mu_k, \Sigma_k\}$, and the full parameter set is $\Theta = \{\pi_k, \theta_k\}^K_{k=1}$. The goal is to estimate $\Theta$ by Maximum Likelihood Estimation (MLE) over the observed dataset $X$, with the likelihood defined as:
\begin{gather*}
L(\Theta; X) = \prod^N_{i=1} p(x_i) = \prod^N_{i=1} \sum^K_{k=1} \pi_k \, g(x_i \mid \theta_k)
\end{gather*}
To simplify computation, the Log-Likelihood is used:
\begin{gather*}
\log L(\Theta; X) = \sum^N_{i=1} \log \left( \sum^K_{k=1} \pi_k \, g(x_i \mid \theta_k) \right)
\end{gather*}
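The log-likelihood is best computed with log densities and the log-sum-exp trick, since the raw products underflow quickly. A sketch (function and variable names are assumptions for illustration):

```python
import numpy as np

def log_likelihood(X, pis, mus, sigmas):
    """log L(Theta; X) = sum_i log( sum_k pi_k g(x_i | mu_k, Sigma_k) )."""
    N, n = X.shape
    K = len(pis)
    logp = np.zeros((N, K))
    for k in range(K):
        diff = X - mus[k]                               # (N, n)
        _, logdet = np.linalg.slogdet(sigmas[k])        # log |Sigma_k|
        maha = np.einsum('ij,ij->i', diff @ np.linalg.inv(sigmas[k]), diff)
        logp[:, k] = np.log(pis[k]) - 0.5 * (n * np.log(2 * np.pi) + logdet + maha)
    # log-sum-exp over the K components, then sum over the N points
    m = logp.max(axis=1, keepdims=True)
    return float(np.sum(m.ravel() + np.log(np.exp(logp - m).sum(axis=1))))

# A single standard-normal component: log L = sum_i log g(x_i | 0, I)
X = np.array([[0.0, 0.0], [1.0, 0.0]])
ll = log_likelihood(X, [1.0], [np.zeros(2)], [np.eye(2)])
```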
So the Objective Function becomes:
\begin{gather*}
\hat{\Theta} = \arg\max_{\Theta} \sum^N_{i=1} \log \left( \sum^K_{k=1} \pi_k \, g(x_i \mid \theta_k) \right)
\end{gather*}
2, Expectation-Maximization (EM) Algorithm
The EM algorithm iteratively optimizes the parameters $\Theta$ by alternating between two steps: the Expectation (E) step and the Maximization (M) step. EM leverages latent variables $Z$, representing the cluster membership of each data point, to simplify optimization.
2.1, Latent Variable Framework
The latent variables $Z = \{Z_1, \ldots, Z_K\}$, with $Z_k = \{z_{1k}, z_{2k}, \ldots, z_{Nk}\}$ and $z_{ik} \in \{0, 1\}$, denote the cluster membership; e.g., if $x_i$ is in cluster 2 out of a total of 4 clusters, then $\{z_{i1}, z_{i2}, z_{i3}, z_{i4}\} = \{0, 1, 0, 0\}$. The complete-data log-likelihood (including $Z$) becomes:
\begin{gather*}
\log L( \Theta ; \{X; Z\} ) = \sum^N_{i=1} \sum^K_{k=1} z_{ik} \log \left( \pi_k \, g(x_i \mid \theta_k) \right)
\end{gather*}
Given the observed data $X$, the posterior probability $\gamma_{ik} = P(z_{ik} = 1 \mid x_i)$ is computed in the E-step and used to update the parameters in the M-step. The update formulas are obtained by differentiating the expected "soft" log-likelihood with respect to each parameter and setting the derivatives to zero.
2.2, EM Algorithm Steps
Step 1 - Initialization: Initialize $\Theta = \{\pi_k, \theta_k\}^K_{k=1}$ with random or heuristic values.
Step 2 - E-Step (Expectation): Calculate the posterior probabilities $\gamma_{ik}$ using Bayes' Theorem:
\begin{gather*}
\gamma_{ik} = \frac{\pi_k \, g(x_i \mid \theta_k)}{\sum_{j=1}^K \pi_j \, g(x_i \mid \theta_j)}
\end{gather*}
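A direct NumPy sketch of the E-step (the helper name `e_step` is an assumption; the densities are evaluated exactly as in the formula above):

```python
import numpy as np

def e_step(X, pis, mus, sigmas):
    """gamma[i, k] = pi_k g(x_i | theta_k) / sum_j pi_j g(x_i | theta_j)."""
    N, n = X.shape
    K = len(pis)
    weighted = np.zeros((N, K))
    for k in range(K):
        diff = X - mus[k]
        # Mahalanobis distance (x_i - mu_k)^T Sigma_k^{-1} (x_i - mu_k), row-wise
        maha = np.einsum('ij,ij->i', diff @ np.linalg.inv(sigmas[k]), diff)
        dens = np.exp(-0.5 * maha) / np.sqrt((2 * np.pi) ** n * np.linalg.det(sigmas[k]))
        weighted[:, k] = pis[k] * dens
    # Normalize rows so each point's responsibilities sum to 1
    return weighted / weighted.sum(axis=1, keepdims=True)

# A point midway between two symmetric components gets gamma = [0.5, 0.5]
X = np.array([[0.0, 0.0]])
gamma = e_step(X, [0.5, 0.5],
               [np.array([-1.0, 0.0]), np.array([1.0, 0.0])],
               [np.eye(2), np.eye(2)])
```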
Step 3 - M-Step (Maximization): Update the parameters using the posterior probabilities (responsibilities):
\begin{gather*}
\pi_k = \frac{\sum_{i=1}^N \gamma_{ik}}{N} \\[15pt]
\mu_k = \frac{\sum_{i=1}^N \gamma_{ik} x_i}{\sum_{i=1}^N \gamma_{ik}} \\[15pt]
\Sigma_k = \frac{\sum_{i=1}^N \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i=1}^N \gamma_{ik}}
\end{gather*}
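The three updates can be written in a few vectorized lines; the following is a sketch assuming a responsibility matrix `gamma` of shape (N, K):

```python
import numpy as np

def m_step(X, gamma):
    """Update (pi_k, mu_k, Sigma_k) from responsibilities gamma of shape (N, K)."""
    N, n = X.shape
    K = gamma.shape[1]
    Nk = gamma.sum(axis=0)                  # effective number of points per cluster
    pis = Nk / N                            # pi_k = (1/N) sum_i gamma_ik
    mus = (gamma.T @ X) / Nk[:, None]       # mu_k = responsibility-weighted mean
    sigmas = []
    for k in range(K):
        diff = X - mus[k]                   # (N, n)
        # Sigma_k = sum_i gamma_ik (x_i - mu_k)(x_i - mu_k)^T / sum_i gamma_ik
        sigmas.append((gamma[:, k, None] * diff).T @ diff / Nk[k])
    return pis, mus, sigmas

# With all responsibility on one cluster, the update reduces to the
# sample mean and the (biased) sample covariance of the data
X = np.array([[0.0, 0.0], [2.0, 0.0]])
gamma = np.ones((2, 1))
pis, mus, sigmas = m_step(X, gamma)
```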
Step 4 - Convergence Check: Repeat the E-step and M-step until the log-likelihood converges, with $\epsilon$ being a predefined tolerance:
\begin{gather*}
\Delta L = L( \Theta^{\text{new}} ; \{X; Z\} ) - L( \Theta ; \{X; Z\} ) < \epsilon
\end{gather*}
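Putting the four steps together, a minimal end-to-end EM loop might look like the following sketch. The initialization strategy and the small ridge added to each covariance for numerical stability are implementation choices, not part of the derivation above:

```python
import numpy as np

def fit_gmm(X, K, max_iter=200, eps=1e-6, seed=0):
    """Minimal EM loop for a GMM: returns (pis, mus, sigmas, log_likelihood_trace)."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    # Step 1 - initialization: uniform weights, K random points as means,
    # the overall data covariance (plus a tiny ridge) for every component
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)].copy()
    sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(n) for _ in range(K)])
    lls = []
    for _ in range(max_iter):
        # Step 2 - E-step: responsibilities via log densities (log-sum-exp for stability)
        logw = np.zeros((N, K))
        for k in range(K):
            diff = X - mus[k]
            _, logdet = np.linalg.slogdet(sigmas[k])
            maha = np.einsum('ij,ij->i', diff @ np.linalg.inv(sigmas[k]), diff)
            logw[:, k] = np.log(pis[k]) - 0.5 * (n * np.log(2 * np.pi) + logdet + maha)
        m = logw.max(axis=1, keepdims=True)
        lse = (m + np.log(np.exp(logw - m).sum(axis=1, keepdims=True))).ravel()
        gamma = np.exp(logw - lse[:, None])
        lls.append(lse.sum())
        # Step 4 - convergence check on the log-likelihood improvement
        if len(lls) > 1 and lls[-1] - lls[-2] < eps:
            break
        # Step 3 - M-step: closed-form parameter updates
        Nk = gamma.sum(axis=0)
        pis = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(n)
    return pis, mus, sigmas, lls

# Two well-separated clusters in 2-D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3.0, 1.0, size=(100, 2)),
               rng.normal(3.0, 1.0, size=(100, 2))])
pis, mus, sigmas, lls = fit_gmm(X, K=2)
```

The log-likelihood trace `lls` should be (numerically) non-decreasing, which is exactly the guarantee discussed in the convergence section.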
3, Guaranteed Convergence
The EM algorithm guarantees a non-decreasing log-likelihood by leveraging Jensen's inequality:
\begin{gather*}
\log \left( \sum^K_{k=1} \pi_k \, g(x_i \mid \theta_k) \right) \geq \sum^K_{k=1} \gamma_{ik} \log \left( \frac{\pi_k \, g(x_i \mid \theta_k)}{\gamma_{ik}} \right) \\[15pt]
\implies \log L(\Theta^{\text{new}}; X) \geq \log L(\Theta; X)
\end{gather*}
Jensen’s inequality ensures that the log-likelihood is lower-bounded by the expected value of the logarithm, thereby guaranteeing a non-decreasing sequence of log-likelihood values during the EM iterations.
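One step worth making explicit: for any responsibilities $\gamma_{ik}$, the gap in Jensen's bound is exactly a KL divergence, since $\pi_k \, g(x_i \mid \theta_k) = p(x_i) \, P(z_{ik} = 1 \mid x_i)$:
\begin{gather*}
\sum^K_{k=1} \gamma_{ik} \log \left( \frac{\pi_k \, g(x_i \mid \theta_k)}{\gamma_{ik}} \right) = \log p(x_i) - \mathrm{KL}\left( \gamma_i \,\|\, P(z_i \mid x_i) \right) \\[15pt]
\mathrm{KL}\left( \gamma_i \,\|\, P(z_i \mid x_i) \right) = \sum^K_{k=1} \gamma_{ik} \log \frac{\gamma_{ik}}{P(z_{ik} = 1 \mid x_i)} \geq 0
\end{gather*}
Setting $\gamma_{ik}$ to the posterior makes the KL term zero, so the E-step makes the bound tight at the current parameters; the M-step can only raise the bound, and hence the log-likelihood never decreases.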
Draft 1
1, Model Representation and Objective Function
Given a dataset $X = \{x_1, x_2, \ldots, x_N\}$ with $x_i \in \mathbb{R}^n$, the GMM fitted on this dataset is represented as:
\begin{gather*}
p(x) = \sum_{k=1}^K \pi_k \, g(x \mid \mu_k, \Sigma_k)
\end{gather*}
Where:
$K$ is the number of clusters / number of Gaussian distributions.
$\pi_k$ represents the mixing coefficient for the $k$-th cluster.
$\int p(x) \, dx = 1$.
$\sum_{k=1}^K \pi_k = 1$.
And $g(x \mid \mu_k, \Sigma_k)$ represents a multivariate Gaussian distribution:
\begin{gather*}
g(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{n/2}|\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right)
\end{gather*}
We establish the parameters:
$\theta = \{\mu_k, \Sigma_k\}^K_{k=1}$.
$\Theta = \{\pi_k, \theta_k\}^K_{k=1}$.
Define the Maximum Likelihood Estimation (MLE) equation:
\begin{gather*}
L(\Theta; X) = \prod^N_{i=1} p(x_i) = \prod^N_{i=1} \sum^K_{k=1} \pi_k \, g(x_i \mid \theta_k)
\end{gather*}
Define the Log-Likelihood of the given MLE equation:
\begin{gather*}
\log L(\Theta; X) = \sum^N_{i=1} \log \left( \sum^K_{k=1} \pi_k \, g(x_i \mid \theta_k) \right)
\end{gather*}
Therefore we have our objective function:
\begin{gather*}
\hat{\Theta} := \arg\max_{\Theta} \prod^N_{i=1} \sum^K_{k=1} \pi_k \, g(x_i \mid \theta_k) \\[5pt]
\iff \hat{\Theta} := \arg\max_{\Theta} \sum^N_{i=1} \log \left( \sum^K_{k=1} \pi_k \, g(x_i \mid \theta_k) \right)
\end{gather*}
2, Expectation-Maximization (EM)
Consider our dataset $X = \{x_1, x_2, \ldots, x_N\}$ with $x_i \in \mathbb{R}^n$.
Introduce the latent variables $Z = \{Z_1, \ldots, Z_K\}$, with $Z_k = \{z_{1k}, z_{2k}, \ldots, z_{Nk}\}$ and $z_{ik} \in \{0, 1\}$,
which denote the cluster membership; e.g., if $x_i$ is in cluster 2 out of a total of 4 clusters, then $\{z_{i1}, z_{i2}, z_{i3}, z_{i4}\} = \{0, 1, 0, 0\}$.
So we formulate the complete dataset $\{X; Z\}$, consisting of the observed data $X$ and the unobserved (latent) variables $Z$.
Foundation and related theorems:
The full log-likelihood of the MLE equation on the complete dataset is:
\begin{gather*}
\log L( \Theta ; \{X; Z\} ) = \sum^N_{i=1} \sum^K_{k=1} z_{ik} \log \left( \pi_k \, g(x_i \mid \theta_k) \right)
\end{gather*}
Define the Q-function, the expectation of the complete-data log-likelihood under the posterior of $Z$ given the current parameters $\Theta^{t}$:
\begin{gather*}
Q(\Theta \mid \Theta^{t}) = \mathbb{E}_{Z \mid X, \Theta^{t}} \left[ \log L( \Theta ; \{X; Z\} ) \right]
\end{gather*}
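Expanding the expectation, $\mathbb{E}[z_{ik}] = \gamma_{ik}$, so the Q-function splits into two independent terms; this is why the M-step updates for $\pi_k$ and for $(\mu_k, \Sigma_k)$ decouple:
\begin{gather*}
Q(\Theta \mid \Theta^{t}) = \sum^N_{i=1} \sum^K_{k=1} \gamma_{ik} \log \pi_k + \sum^N_{i=1} \sum^K_{k=1} \gamma_{ik} \log g(x_i \mid \theta_k)
\end{gather*}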
By Jensen’s inequality:
\begin{gather*}
\log \left( \sum^K_{k=1} \pi_k \, g(x_i \mid \theta_k) \right) \geq \sum^K_{k=1} \gamma_{ik} \log \left( \frac{\pi_k \, g(x_i \mid \theta_k)}{\gamma_{ik}} \right) \\[15pt]
\implies \log L(\Theta^{\text{new}}; X) \geq \log L(\Theta; X)
\end{gather*}
This provides the proof that EM converges to a local maximum of the MLE objective.
The steps of EM are derived by setting the derivatives of the Q-function with respect to each parameter to zero.
E-Step: Calculate the Posterior Probabilities
\begin{gather*}
\text{By Bayes' Theorem:} \\[10pt]
P(z_{ik} = 1 \mid x_i) = \frac{P(x_i \mid z_{ik} = 1) \, P(z_{ik} = 1)}{p(x_i)} \\[10pt]
\implies \gamma_{ik} = P(z_{ik} = 1 \mid x_i) = \frac{\pi_k \, g(x_i \mid \theta_k)}{\sum_{j=1}^K \pi_j \, g(x_i \mid \theta_j)}
\end{gather*}
Where:
$\sum_{k=1}^K \gamma_{ik} = 1$ for each $i$.
M-Step: Update Parameters
\begin{gather*}
\pi_k = \frac{\sum_{i=1}^N \gamma_{ik}}{N} \\[15pt]
\mu_k = \frac{\sum_{i=1}^N \gamma_{ik} x_i}{\sum_{i=1}^N \gamma_{ik}} \\[15pt]
\Sigma_k = \frac{\sum_{i=1}^N \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i=1}^N \gamma_{ik}}
\end{gather*}
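For completeness, the $\pi_k$ update follows from maximizing the Q-function under the constraint $\sum_{k} \pi_k = 1$ with a Lagrange multiplier $\lambda$:
\begin{gather*}
\frac{\partial}{\partial \pi_k} \left[ \sum^N_{i=1} \sum^K_{j=1} \gamma_{ij} \log \pi_j + \lambda \left( \sum^K_{j=1} \pi_j - 1 \right) \right] = \frac{\sum_{i=1}^N \gamma_{ik}}{\pi_k} + \lambda = 0 \\[10pt]
\implies \pi_k = -\frac{\sum_{i=1}^N \gamma_{ik}}{\lambda} = \frac{\sum_{i=1}^N \gamma_{ik}}{N} \quad (\text{summing over } k \text{ gives } \lambda = -N)
\end{gather*}
The $\mu_k$ and $\Sigma_k$ updates follow the same pattern, without the constraint.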
Repeat E-Step and M-Step until convergence:
\begin{gather*}
\Delta L = L( \Theta^{\text{new}} ; \{X; Z\} ) - L( \Theta ; \{X; Z\} ) < \epsilon
\end{gather*}