K-Means is one of the most widely used algorithms for [[Clustering]] tasks. It minimizes the *within-cluster sum of squares* ("WCSS"), a measure of variance within each cluster, to produce cohesive groups. It can be shown that minimizing WCSS is equivalent to maximizing the *between-cluster sum of squares* (since the two add up to the fixed total sum of squares), ensuring well-separated clusters.
## Algorithm
Randomly select cluster representatives $z_1, \dots, z_k$. Then iterate until the cost no longer changes:
1. Given $z_1, \dots, z_k$, we find for each $x^{(i)}$ the closest $z_j$ and assign the observation to the corresponding $\mathbb C_j$.
$ x^{(i)} \mapsto \mathbb C_j: \arg \min_{j=1,...,k}\big(|| x^{(i)}-z_j||^2\big) $
2. Costs are minimized by updating cluster assignment, given fixed cluster representatives.
$ \text{Cost}(z_1, \dots ,z_k)=\sum_{i=1}^n \min_{j=1, ...,k} ||x^{(i)}-z_j ||^2 $
3. Given the new cluster assignments $\mathbb C_1,\dots,\mathbb C_k$, we find for each $z_j$ a value, that minimizes the sum of squared distances to all its assigned points $x^{(i)} \in \mathbb C_j$.
$ z_j= \arg \min_z \big(\sum_{i \in \mathbb C_j}||x^{(i)}-z||^2 \big) $
4. Costs are minimized by updating each $z_j$ given the new cluster assignments $\mathbb C_1, \dots, \mathbb C_k$.
$ \text{Cost}(\mathbb C_1, \dots, \mathbb C_k)=\min_{z_1, \dots , z_k} \sum_{j=1}^k \sum_{i \in \mathbb C_j} || x^{(i)}-z_j ||^2 $
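The alternating steps above can be sketched in NumPy. This is a minimal illustration, not a production implementation; the initialization (sampling $k$ observations as starting representatives) and the stopping rule are common choices, not prescribed by the note:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-Means sketch. X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialization: randomly pick k observations as representatives z_1, ..., z_k.
    z = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    prev_cost = np.inf
    for _ in range(max_iter):
        # Step 1: assign each x^(i) to the closest representative z_j.
        d2 = ((X[:, None, :] - z[None, :, :]) ** 2).sum(axis=2)  # (n, k) squared distances
        assign = d2.argmin(axis=1)
        cost = d2.min(axis=1).sum()  # WCSS under the current representatives
        if cost == prev_cost:        # stop when the cost no longer changes
            break
        prev_cost = cost
        # Step 3: move each z_j to the centroid of its assigned points.
        for j in range(k):
            if (assign == j).any():
                z[j] = X[assign == j].mean(axis=0)
    return z, assign, cost
```

Each pass first minimizes the cost over assignments (for fixed representatives), then over representatives (for fixed assignments), matching steps 1–4.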
## Centroids
The only step where we need the [[Gradient Descent#Gradient Vector|Gradient Vector]] is the update of the cluster representatives. Setting the gradient to zero shows that the optimal $z_j$ is the cluster centroid (given we use squared Euclidean distance).
$
\begin{align}
\nabla _{z_j}\left(\sum_{i \in \mathbb {C}_ j} \| x^{(i)} - z_ j\| ^2\right) &= 0 \tag{1}\\[10pt]
\sum_{i \in \mathbb C_j} -2(x^{(i)}-z_j) &= 0 \tag{2} \\[8pt]
\sum_{i \in \mathbb C_j} -2x^{(i)}+\sum_{i \in \mathbb C_j} 2z_j&=0 \tag{3}\\[8pt]
\sum_{i \in \mathbb C_j}z_j&=\sum_{i \in \mathbb C_j} x^{(i)} \tag{4}\\[8pt]
|\mathbb C_j| \cdot z_j&=\sum_{i \in \mathbb C_j} x^{(i)} \tag{5}\\[8pt]
z_j&=\frac{\sum_{i \in \mathbb C_j} x^{(i)}}{|\mathbb C_j|} \tag{6}
\end{align}
$
From (4) to (5): the summation on the left side is just $n$ copies of $z_j$, since $z_j$ does not depend on the summation index; here $n$ denotes the number of points in the cluster, i.e. $\vert \mathbb C_j \vert$.
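The result can be checked numerically: for a hypothetical cluster of points, no perturbation of the centroid yields a smaller sum of squared distances. The points below are made up for illustration:

```python
import numpy as np

# Hypothetical cluster C_j with three assigned 2-D points.
C = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])
centroid = C.mean(axis=0)  # z_j = (sum over x^(i) in C_j) / |C_j|

def wcss(z):
    # Sum of squared distances from all points in C_j to a candidate representative z.
    return ((C - z) ** 2).sum()

# No randomly perturbed candidate should beat the centroid.
rng = np.random.default_rng(1)
assert all(wcss(centroid) <= wcss(centroid + rng.normal(size=2)) for _ in range(100))
```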
## Convergence
K-Means is guaranteed to converge: the cost function is non-negative, and each iteration either decreases the WCSS or leaves it unchanged. Since there are only finitely many possible cluster assignments, the algorithm must eventually stop.
However, the algorithm can get stuck in a local minimum due to an unlucky initialization of the $z_j$. It is therefore recommended to run it multiple times with different random initializations $z_1, \dots, z_k$ and keep the run with the lowest final cost.
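Libraries typically build in this restart strategy. As one possible illustration (assuming scikit-learn is installed), `sklearn.cluster.KMeans` exposes the number of restarts as `n_init` and reports the best run's final WCSS as `inertia_`:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated pairs of points (made-up data for illustration).
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])

# n_init=10 runs K-Means from 10 random initializations and keeps
# the run with the lowest within-cluster sum of squares (inertia_).
km = KMeans(n_clusters=2, init="random", n_init=10, random_state=0).fit(X)
```

After fitting, `km.labels_` holds the cluster assignments and `km.inertia_` the WCSS of the best run.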