1, Objective Function
Given a dataset $X = \{ x_1, x_2, \ldots, x_N \}$ with $x_i \in \mathbb{R}^n$, the objective function is:
$$
J = \sum^K_{i=1} \sum_{x \in C_i} \| x - \mu_i \|^2 = \sum^K_{i=1} | C_i | \, \text{Var}_{C_i}
$$
Where:
- $K$ is the number of clusters.
- $x \in X$ is a data point in $X$.
- $C_i$ is the set of data points in the $i$-th cluster.
- $\mu_i$ is the centroid of the $i$-th cluster.
K-means seeks the partition $\{C_1, \ldots, C_K\}$ that minimizes $J$. Note that the objective function is non-increasing across iterations, which guarantees convergence to a local minimum (not necessarily the global one).
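As a concrete check, here is a minimal NumPy sketch that evaluates $J$ for a given assignment (the names `kmeans_objective`, `X`, `labels`, and `centroids` are illustrative, not from any particular library):

```python
import numpy as np

def kmeans_objective(X, labels, centroids):
    """Within-cluster sum of squared distances J (a.k.a. inertia)."""
    J = 0.0
    for i, mu in enumerate(centroids):
        # Sum of squared Euclidean distances from cluster i's points to mu.
        J += np.sum((X[labels == i] - mu) ** 2)
    return J
```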
2, Algorithm Steps
Step 1 - Initialization: Randomly initialize $K$ centroids $\{ \mu_1, \mu_2, \ldots, \mu_K \}$ (e.g. by sampling $K$ distinct data points).
Step 2 - Assignment: Assign each data point to the closest centroid (by Euclidean distance):
$$
C_i = \{ x_j : \|x_j - \mu_i \|^2 \leq \|x_j - \mu_l \|^2, \ \forall l = 1, 2, \ldots , K \}
$$
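A minimal vectorized sketch of this step (illustrative names, assuming `X` has shape `(N, n)` and `centroids` has shape `(K, n)`):

```python
import numpy as np

def assign_clusters(X, centroids):
    """Step 2: label each point with the index of its nearest centroid."""
    # Pairwise squared Euclidean distances, shape (N, K), via broadcasting.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)
```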
Step 3 - Update Centroids: Recompute each centroid as the mean of its assigned points:
$$
\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x
$$
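And a matching sketch of the update (assumes every cluster keeps at least one point; an empty cluster would need re-seeding):

```python
import numpy as np

def update_centroids(X, labels, K):
    """Step 3: move each centroid to the mean of its assigned points."""
    # Note: X[labels == i] must be non-empty, or the mean is undefined (NaN).
    return np.array([X[labels == i].mean(axis=0) for i in range(K)])
```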
Step 4 - Convergence Check: Repeat steps 2 and 3 until the decrease in $J$ falls below a predefined tolerance $\epsilon$:
$$
\Delta J = J_{\text{old}} - J_{\text{new}} < \epsilon
$$
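Putting steps 1-4 together, here is a sketch of the full loop, reusing `kmeans_objective`, `assign_clusters`, and `update_centroids` from the sketches above (initializing from $K$ sampled data points is one common choice, not the only one):

```python
import numpy as np

def kmeans(X, K, eps=1e-6, max_iter=100, seed=None):
    """Lloyd's algorithm: iterate steps 2-3 until the drop in J is < eps."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids as K distinct data points.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    J = np.inf
    for _ in range(max_iter):
        labels = assign_clusters(X, centroids)          # step 2
        centroids = update_centroids(X, labels, K)      # step 3
        J_new = kmeans_objective(X, labels, centroids)  # step 4
        if J - J_new < eps:  # J is non-increasing, so this is the decrease
            break
        J = J_new
    return labels, centroids, J_new
```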
3, K-Means with Cosine Similarity
Given the Cosine Similarity between two vectors $u$ and $v$:
$$
S_c (u, v) = \cos(\theta) = \frac{\sum u_i v_i}{\sqrt{\sum u_i^2 \sum v_i^2}}
$$
Note that for normalized vectors (i.e. $\sum u_i^2 = \sum v_i^2 = 1$), Cosine Similarity is simply the dot product $u \cdot v$.
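A quick numerical sanity check of this claim (values chosen arbitrarily):

```python
import numpy as np

u, v = np.array([3.0, 4.0]), np.array([1.0, 2.0])
cos_sim = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))  # general formula

u_hat, v_hat = u / np.linalg.norm(u), v / np.linalg.norm(v)  # unit vectors
assert np.isclose(cos_sim, u_hat @ v_hat)  # cosine == plain dot product
```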
The relationship between the Cosine Similarity and Euclidean Distance (for normalized vectors) is as follows:
$$
\begin{aligned}
\| u - v \|^2 &= \sum (u_i - v_i)^2 = \sum (u_i^2 + v_i^2 - 2u_i v_i) \\
&= \sum u_i^2 + \sum v_i^2 - 2 \sum u_i v_i \\
&= 1 + 1 - 2 \cos(\theta) = 2(1 - \cos(\theta))
\end{aligned}
$$
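The identity is easy to verify numerically for random unit vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=5); u /= np.linalg.norm(u)
v = rng.normal(size=5); v /= np.linalg.norm(v)

lhs = np.sum((u - v) ** 2)   # squared Euclidean distance
rhs = 2.0 * (1.0 - u @ v)    # 2(1 - cos θ), since u, v are unit vectors
assert np.isclose(lhs, rhs)
```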
Substituting back into the objective function (dropping the constant factor of 2, which does not affect the minimizer), we get:
$$
J_{\text{cos}} = \sum^K_{i=1} \sum_{x \in C_i} \left( 1 - \cos(x, \mu_i) \right)
$$
So minimizing the Euclidean distance in $J$ is equivalent to maximizing the Cosine Similarity in $J_{\text{cos}}$, and the objective function carries over (for normalized vectors).
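For completeness, a sketch of $J_{\text{cos}}$ in the same style as the earlier objective (assumes the rows of `X` and `centroids` are unit-normalized):

```python
import numpy as np

def cosine_objective(X, labels, centroids):
    """J_cos: total cosine dissimilarity of points to their centroids."""
    # For unit vectors, the row-wise dot product x · mu equals cos(x, mu).
    return np.sum(1.0 - np.einsum('ij,ij->i', X, centroids[labels]))
```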
The assignment step becomes:
$$
C_i = \{ x_j : \cos(x_j, \mu_i) \geq \cos(x_j, \mu_l), \ \forall l = 1, 2, \ldots , K \}
$$
And the centroid update step becomes:
$$
\mu_i = \frac{\sum_{x \in C_i} x}{\left\| \sum_{x \in C_i} x \right\|}
$$
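Combining the two modified steps gives a minimal spherical K-means sketch (illustrative names; assumes the rows of `X` are unit-normalized and no cluster becomes empty):

```python
import numpy as np

def spherical_kmeans(X, K, max_iter=100, seed=None):
    """K-means with cosine similarity on unit-normalized rows of X."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Assignment: for unit vectors, cosine similarity is just X @ centroids.T.
        labels = (X @ centroids.T).argmax(axis=1)
        # Update: sum each cluster's points, then project back onto the unit sphere.
        new_centroids = np.array([X[labels == i].sum(axis=0) for i in range(K)])
        new_centroids /= np.linalg.norm(new_centroids, axis=1, keepdims=True)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```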