K-Means Notes
1, Objective Function
Given dataset \(X = \{ x_1, x_2, \ldots, x_N\}\) and \(x_i \in \mathbb{R}^n\), the objective function is:
\[\begin{gather*} J = \sum^K_{i=1} \sum_{x \in C_i} \| x - \mu_i \|^2 = \sum^K_{i=1} | C_i | \, \text{Var}(C_i), \qquad C^* = \arg\min_{C_1, \ldots, C_K} J \quad \end{gather*}\]
Where:
- \(K\) is the number of clusters.
- \(x \in X\) is a data point in \(X\).
- \(C_i\) is the set of data points in the \(i\)-th cluster.
- \(\mu_i\) is the centroid of the \(i\)-th cluster.
Note that the objective function is non-increasing across iterations and bounded below by zero, so the algorithm is guaranteed to converge to a local minimum.
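As a quick sketch, the objective \(J\) can be computed directly from a set of assignments (NumPy assumed; `kmeans_objective` and the toy data below are illustrative, not from the notes):

```python
import numpy as np

def kmeans_objective(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    return sum(
        np.sum((X[labels == i] - centroids[i]) ** 2)
        for i in range(len(centroids))
    )

# Two 1-D clusters with centroids at their means.
X = np.array([[0.0], [2.0], [10.0], [12.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.0], [11.0]])
print(kmeans_objective(X, labels, centroids))  # → 4.0
```

Each cluster contributes \(1^2 + 1^2 = 2\), giving \(J = 4\).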
2, Algorithm Steps
- Step 1 - Initialization: Randomly initialize \(K\) centroids: \(\{ \mu_1, \mu_2 , \ldots ,\mu_K \}\).
- Step 2 - Assignment: Assign data points to the closest centroid (by Euclidean distance):
\[\begin{gather*} C_i = \{ x_j : \|x_j - \mu_i \|^2 \leq \|x_j - \mu_l \|^2, \forall l = 1, 2, \ldots , K \} \quad \end{gather*}\]
- Step 3 - Update Centroids: Recompute each centroid as the mean of its assigned points:
\[\begin{gather*} \mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x \end{gather*}\]
- Step 4 - Convergence Check: Repeat steps 2 and 3 until the decrease in the objective falls below a predefined tolerance \(\epsilon\):
\[\begin{gather*} \Delta J = J_{\text{old}} - J_{\text{new}} < \epsilon \end{gather*}\]
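The four steps above can be sketched as a single loop (a minimal NumPy implementation of Lloyd's algorithm; the function name, random initialization scheme, and defaults are illustrative choices):

```python
import numpy as np

def kmeans(X, K, tol=1e-6, max_iter=100, seed=0):
    """Lloyd's algorithm: assign points to nearest centroid, recompute means, repeat."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids as K distinct random data points.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    J_old = np.inf
    for _ in range(max_iter):
        # Step 2: squared Euclidean distances (N, K), then nearest-centroid labels.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3: move each non-empty centroid to the mean of its assigned points.
        for i in range(K):
            if np.any(labels == i):
                centroids[i] = X[labels == i].mean(axis=0)
        # Step 4: stop once the objective decreases by less than tol.
        J = d2[np.arange(len(X)), labels].sum()
        if J_old - J < tol:
            break
        J_old = J
    return centroids, labels
```

Initializing centroids from the data itself (rather than arbitrary points) keeps every centroid inside the data's range; in practice, smarter seedings such as k-means++ reduce sensitivity to initialization.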
Extra: Relation to Cosine Similarity
Given Cosine Similarity between two vectors \(u\) and \(v\):
\[\begin{gather*} S_c (u, v) = \cos(\theta) = \frac{\sum u_i v_i}{\sqrt{\sum u_i^2 \sum v_i^2}} \end{gather*}\]
Note that for normalized vectors (i.e. \(\sum u_i^2 = \sum v_i^2 = 1\)), Cosine Similarity is simply the dot product \(u^\top v\).
The relationship between the Cosine Similarity and Euclidean Distance (for normalized vectors) is as follows:
\[\begin{align*} \| u - v \|^2 &= \sum (u_i - v_i)^2 = \sum (u_i^2 + v_i^2 - 2u_i v_i) \\[5pt] &= \sum u_i^2 + \sum v_i^2 - 2 \sum u_i v_i \\[5pt] &= 1 + 1 - 2 \cos(\theta) = 2(1 - \cos(\theta)) \end{align*}\]
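The identity is easy to check numerically on a pair of normalized vectors (NumPy assumed; the vectors are arbitrary examples):

```python
import numpy as np

u = np.array([3.0, 4.0])
v = np.array([1.0, 2.0])
u = u / np.linalg.norm(u)  # normalize so that sum(u_i^2) = 1
v = v / np.linalg.norm(v)

cos_theta = u @ v  # for unit vectors, cosine similarity is just the dot product
# ||u - v||^2 should equal 2 * (1 - cos(theta)) up to floating-point error.
assert np.isclose(np.sum((u - v) ** 2), 2 * (1 - cos_theta))
```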
Substituting back to the objective function, we get:
\[\begin{gather*} J_{\text{cos}} = \sum^K_{i=1} \sum_{x \in C_i} (1-\cos(x, \mu_i)) \quad \end{gather*}\]
So minimizing the Euclidean distance in \(J\) is equivalent to maximizing the Cosine Similarity in \(J_{\text{cos}}\), and the convergence argument for the objective carries over (for normalized vectors).
The assignment step becomes:
\[\begin{gather*} C_i = \{ x_j : \cos(x_j, \mu_i) \geq \cos(x_j, \mu_l), \forall l = 1, 2, \ldots , K \} \quad \end{gather*}\]
And the centroid update step becomes:
\[\begin{gather*} \mu_i = \frac{\sum_{x \in C_i} x}{\| \sum_{x \in C_i} x \|} \end{gather*}\]
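Putting the modified assignment and update steps together gives the spherical variant (a minimal NumPy sketch; the function name and fixed iteration count are illustrative):

```python
import numpy as np

def spherical_kmeans(X, K, n_iter=50, seed=0):
    """K-means with cosine similarity: rows and centroids kept at unit norm."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # normalize every data point
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        sim = X @ centroids.T            # cosine similarities (unit vectors)
        labels = sim.argmax(axis=1)      # assign to the MOST similar centroid
        for i in range(K):
            if np.any(labels == i):
                s = X[labels == i].sum(axis=0)
                centroids[i] = s / np.linalg.norm(s)  # renormalized mean direction
    return centroids, labels
```

Note the update divides by the norm of the summed vectors rather than by \(|C_i|\), so centroids stay on the unit sphere, matching the formula above.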