
1. Objective Function

Given a dataset $X = \{ x_1, x_2, \ldots, x_N \}$ with $x_i \in \mathbb{R}^n$, the objective function is:

\begin{gather*} J = \sum^K_{i=1} \sum_{x \in C_i} \| x - \mu_i \|^2 = \sum^K_{i=1} | C_i | \, \text{Var}_{C_i} \end{gather*}

Where:

  • $K$ is the number of clusters.
  • $x \in X$ is a data point in $X$.
  • $C_i$ is the set of data points in the $i$-th cluster.
  • $\mu_i$ is the centroid of the $i$-th cluster.

Note that each iteration does not increase the objective, and $J$ is bounded below by zero, so the algorithm is guaranteed to converge to a local minimum.
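The objective can be evaluated directly for a given partition. A minimal NumPy sketch, where the function name `kmeans_objective` and the toy data are illustrative, not from the text above:

```python
import numpy as np

def kmeans_objective(X, labels, centroids):
    # J: within-cluster sum of squared Euclidean distances.
    return sum(
        np.sum((X[labels == i] - mu) ** 2)
        for i, mu in enumerate(centroids)
    )

# Toy example: two clusters in R^2, each paired with its exact centroid.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 0.5], [10.0, 10.5]])
print(kmeans_objective(X, labels, centroids))  # -> 1.0
```

Each cluster contributes $0.5^2 + 0.5^2 = 0.5$, so $J = 1.0$.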

2. Algorithm Steps

  • Step 1 - Initialization: Randomly initialize $K$ centroids: $\{ \mu_1, \mu_2, \ldots, \mu_K \}$.
  • Step 2 - Assignment: Assign each data point to the closest centroid (by Euclidean distance):
\begin{gather*} C_i = \{ x_j : \|x_j - \mu_i \|^2 \leq \|x_j - \mu_l \|^2, \ \forall l = 1, 2, \ldots, K \} \end{gather*}
  • Step 3 - Update Centroids: Recompute each centroid as the mean of its assigned points:
\begin{gather*} \mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x \end{gather*}
  • Step 4 - Convergence Check: Repeat Steps 2 and 3 until the decrease in the objective falls below a predefined tolerance $\epsilon$:
\begin{gather*} \Delta J = J_{\text{old}} - J_{\text{new}} < \epsilon \end{gather*}
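Putting the four steps together, a compact NumPy sketch (the function name `kmeans` and the choice of initializing centroids by sampling data points are assumptions, not from the text above):

```python
import numpy as np

def kmeans(X, k, tol=1e-6, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids as k distinct random data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    prev_J = np.inf
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest centroid (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each non-empty cluster's centroid to its mean.
        for i in range(k):
            if np.any(labels == i):
                centroids[i] = X[labels == i].mean(axis=0)
        # Step 4: stop when the decrease in J falls below tol.
        J = sum(np.sum((X[labels == i] - centroids[i]) ** 2) for i in range(k))
        if prev_J - J < tol:
            break
        prev_J = J
    return labels, centroids

# Demo: six points forming two well-separated clusters.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans(X, k=2)
print(labels)
```

The cluster indices depend on the random initialization, but the two groups of three points end up in separate clusters.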

Extra: Relation to Cosine Similarity

Given the Cosine Similarity between two vectors $u$ and $v$:

\begin{gather*} S_c (u, v) = \cos(\theta) = \frac{\sum u_i v_i}{\sqrt{\sum u_i^2 \sum v_i^2}} \end{gather*}

Note that for normalized vectors (i.e. $\sum u_i^2 = \sum v_i^2 = 1$), Cosine Similarity is simply the dot product $u \cdot v$.

The relationship between Cosine Similarity and squared Euclidean distance (for normalized vectors) is as follows:

\begin{align} \| u - v \|^2 &= \sum (u_i - v_i)^2 = \sum (u_i^2 + v_i^2 - 2 u_i v_i) \\[5pt] &= \sum u_i^2 + \sum v_i^2 - 2 \sum u_i v_i \\[5pt] &= 1 + 1 - 2 \cos(\theta) = 2(1 - \cos(\theta)) \end{align}
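The identity can be spot-checked numerically. A quick sketch (the random vectors and the dimension 5 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
u = rng.standard_normal(5)
v = rng.standard_normal(5)
u /= np.linalg.norm(u)   # normalize so sum(u_i^2) = 1
v /= np.linalg.norm(v)   # normalize so sum(v_i^2) = 1

lhs = np.sum((u - v) ** 2)          # ||u - v||^2
rhs = 2.0 * (1.0 - np.dot(u, v))    # 2(1 - cos(theta)); dot = cos for unit vectors
print(np.isclose(lhs, rhs))  # -> True
```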

Substituting back into the objective function (dropping the constant factor of 2), we get:

\begin{gather*} J_{\text{cos}} = \sum^K_{i=1} \sum_{x \in C_i} (1 - \cos(x, \mu_i)) \end{gather*}

So minimizing the squared Euclidean distance in $J$ is equivalent to maximizing the Cosine Similarity in $J_{\text{cos}}$; the objective carries over (for normalized vectors).

The assignment step becomes:

\begin{gather*} C_i = \{ x_j : \cos(x_j, \mu_i) \geq \cos(x_j, \mu_l), \ \forall l = 1, 2, \ldots, K \} \end{gather*}

And the centroid update step becomes:

\begin{gather*} \mu_i = \frac{\sum_{x \in C_i} x}{\left\| \sum_{x \in C_i} x \right\|} \end{gather*}
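For unit-norm data, the cosine-based assignment and update steps above can be sketched as follows (the function names and toy vectors are illustrative; the empty-cluster case is not handled in this sketch):

```python
import numpy as np

def assign_by_cosine(X, centroids):
    # For unit vectors, cosine similarity is just the dot product,
    # so pick the centroid with the largest similarity.
    return (X @ centroids.T).argmax(axis=1)

def update_normalized(X, labels, k):
    # Mean direction of each cluster, re-normalized to unit length.
    centroids = np.zeros((k, X.shape[1]))
    for i in range(k):
        s = X[labels == i].sum(axis=0)
        centroids[i] = s / np.linalg.norm(s)
    return centroids

# Toy unit vectors on the circle, grouped into two rough directions.
X = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0],
              [-1.0, 0.0], [-0.8, 0.6]])
centroids = update_normalized(X, np.array([0, 0, 0, 1, 1]), 2)
print(np.linalg.norm(centroids, axis=1))  # each centroid has unit norm
```

Keeping the centroids on the unit sphere is what makes the dot-product assignment equivalent to the Euclidean one derived above.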