1, Objective Function

Given a dataset \(X = \{ x_1, x_2, \ldots, x_N\}\) with \(x_i \in \mathbb{R}^n\), the objective function is:

\[\begin{gather*} J = \sum^K_{i=1} \sum_{x \in C_i} \| x - \mu_i \|^2 = \sum^K_{i=1} | C_i | \,\text{Var}(C_i), \qquad \{C_i\}^* = \text{arg}\min\limits_{C_i} J \quad \end{gather*}\]

Where:

  • \(K\) is the number of clusters.
  • \(x \in X\) is a data point in \(X\).
  • \(C_i\) is the set of data points in the \(i\)-th cluster.
  • \(\mu_i\) is the centroid of the \(i\)-th cluster.

Note that the objective function is non-increasing across iterations, which guarantees convergence to a local minimum.
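As a concrete illustration, \(J\) can be computed directly from the cluster assignments (a minimal NumPy sketch; the data points, labels, and centroids below are made-up values):

```python
import numpy as np

def kmeans_objective(X, labels, centroids):
    """Within-cluster sum of squared distances J."""
    return sum(
        np.sum((X[labels == i] - mu) ** 2)
        for i, mu in enumerate(centroids)
    )

# Two tight clusters around (0, 0.5) and (10, 10.5).
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 0.5], [10.0, 10.5]])
print(kmeans_objective(X, labels, centroids))  # prints 1.0
```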

2, Algorithm Steps

  • Step 1 - Initialization: Randomly initialize \(K\) centroids: \(\{ \mu_1, \mu_2 , \ldots ,\mu_K \}\).
  • Step 2 - Assignment: Assign data points to the closest centroid (by Euclidean distance):
\[\begin{gather*} C_i = \{ x_j : \|x_j - \mu_i \|^2 \leq \|x_j - \mu_l \|^2, \forall l = 1, 2, \ldots , K \} \quad \end{gather*}\]
  • Step 3 - Update Centroids: Recompute each centroid as the mean of its assigned points:
\[\begin{gather*} \mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x \end{gather*}\]
  • Step 4 - Convergence Check: Repeat steps 2 and 3 until the decrease in the objective falls below a predefined tolerance \(\epsilon\):
\[\begin{gather*} \Delta J = J_{\text{old}} - J_{\text{new}} < \epsilon \end{gather*}\]
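The four steps above can be sketched end to end (a minimal NumPy implementation; it assumes no cluster ever empties out during the run, which a production version would need to handle):

```python
import numpy as np

def kmeans(X, k, tol=1e-6, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1 - Initialization: pick k distinct data points as centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    prev_J = np.inf
    for _ in range(max_iter):
        # Step 2 - Assignment: nearest centroid by Euclidean distance.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3 - Update: each centroid becomes the mean of its points
        # (assumes every cluster stays non-empty).
        centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Step 4 - Convergence check on the decrease of the objective.
        J = sum(np.sum((X[labels == i] - centroids[i]) ** 2) for i in range(k))
        if prev_J - J < tol:
            break
        prev_J = J
    return labels, centroids

# Two well-separated blobs: the algorithm should split them apart.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centroids = kmeans(X, k=2)
```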

Extra: Relation to Cosine Similarity

Given the Cosine Similarity between two vectors \(u\) and \(v\):

\[\begin{gather*} S_c (u, v) = \cos(\theta) = \frac{\sum u_i v_i}{\sqrt{\sum u_i^2 \sum v_i^2}} \end{gather*}\]

Note that for normalized vectors (i.e. \(\sum u_i^2 = \sum v_i^2 = 1\)), Cosine Similarity is simply the dot product \(u \cdot v\).

The relationship between the Cosine Similarity and Euclidean Distance (for normalized vectors) is as follows:

\[\begin{align} & \| u - v \|^2 = \sum (u_i - v_i)^2 = \sum (u_i^2 + v_i^2 - 2u_i v_i) \quad \\[5pt] & \qquad \qquad = \sum u_i^2 + \sum v_i^2 - 2 \sum u_i v_i \quad \\[5pt] & \qquad \qquad = 1 + 1 - 2 \cos(\theta) = 2(1 - \cos(\theta)) \quad \end{align}\]
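This identity is easy to check numerically (a quick sketch using random unit vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=3), rng.normal(size=3)
u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)  # normalize both vectors

lhs = np.sum((u - v) ** 2)      # squared Euclidean distance ||u - v||^2
rhs = 2 * (1 - np.dot(u, v))    # 2(1 - cos(theta))
print(np.isclose(lhs, rhs))  # prints True
```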

Substituting back into the objective function and dropping the constant factor of 2 (which does not change the minimizer), we get:

\[\begin{gather*} J_{\text{cos}} = \sum^K_{i=1} \sum_{x \in C_i} (1-\cos(x, \mu_i)) \quad \end{gather*}\]

So minimizing the Euclidean distance in \(J\) is equivalent to maximizing the Cosine Similarity in \(J_{\text{cos}}\); the objective function therefore still holds for normalized vectors.

The assignment step becomes:

\[\begin{gather*} C_i = \{ x_j : \cos(x_j, \mu_i) \geq \cos(x_j, \mu_l), \forall l = 1, 2, \ldots , K \} \quad \end{gather*}\]

And the centroid update step becomes:

\[\begin{gather*} \mu_i = \frac{\sum_{x \in C_i} x}{\| \sum_{x \in C_i} x \|} \end{gather*}\]
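Both cosine-based steps together look like this (a sketch of one iteration, assuming the rows of `X` and the centroids are unit-normalized; the example vectors are made up):

```python
import numpy as np

def spherical_step(X, centroids):
    """One assignment + update iteration of cosine-based k-means.

    Assumes rows of X and centroids are unit-normalized, so cosine
    similarity reduces to a plain dot product.
    """
    # Assignment: each point goes to the centroid with largest cosine.
    labels = (X @ centroids.T).argmax(axis=1)
    # Update: sum the cluster members, then renormalize to the unit sphere.
    new_centroids = np.array(
        [X[labels == i].sum(axis=0) for i in range(len(centroids))]
    )
    new_centroids /= np.linalg.norm(new_centroids, axis=1, keepdims=True)
    return labels, new_centroids

# Points near the directions (1, 0) and (0, 1), normalized to unit length.
X = np.array([[1.0, 0.1], [1.0, -0.1], [0.1, 1.0], [-0.1, 1.0]])
X /= np.linalg.norm(X, axis=1, keepdims=True)
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
labels, centroids = spherical_step(X, centroids)
```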