K-Means is one of the most widely used algorithms for [[Clustering]] tasks. It minimizes the *within-cluster sum of squares* ("WCSS"), a measure of variance within each cluster, to produce cohesive groups. It can be shown that minimizing the WCSS is equivalent to maximizing the *between-cluster sum of squares*, ensuring well-separated clusters.

## Algorithm

Randomly select cluster representatives $z_1, \dots, z_k$. Then iterate until the cost no longer changes:

1. Given $z_1, \dots, z_k$, find for each $x^{(i)}$ the closest $z_j$ and assign the observation to the corresponding cluster $\mathbb C_j$.
   $$ x^{(i)} \mapsto \mathbb C_j: \arg \min_{j=1,\dots,k}\big(\| x^{(i)}-z_j\|^2\big) $$
2. The cost is minimized by updating the cluster assignments, given fixed cluster representatives.
   $$ \text{Cost}(z_1, \dots ,z_k)=\sum_{i=1}^n \min_{j=1, \dots,k} \| x^{(i)}-z_j \|^2 $$
3. Given the new cluster assignments $\mathbb C_1,\dots,\mathbb C_k$, find for each $z_j$ the value that minimizes the sum of squared distances to all its assigned points $x^{(i)} \in \mathbb C_j$.
   $$ z_j= \arg \min_z \big(\sum_{i \in \mathbb C_j}\| x^{(i)}-z\|^2 \big) $$
4. The cost is minimized by updating each $z_j$, given the new cluster assignments $\mathbb C_1, \dots, \mathbb C_k$.
   $$ \text{Cost}(\mathbb C_1, \dots, \mathbb C_k)=\min_{z_1, \dots , z_k} \sum_{j=1}^k \sum_{i \in \mathbb C_j} \| x^{(i)}-z_j \|^2 $$

## Centroids

The only step where we need the [[Gradient Descent#Gradient Vector|Gradient Vector]] is the update of the cluster representatives. It turns out that the update of these $z_j$ returns the cluster centroids (given we use Euclidean distance).
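The alternating procedure above can be sketched in NumPy. This is a minimal illustration, not a reference implementation; the function name `kmeans` and its signature are my own choices:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: alternate assignments and centroid updates."""
    rng = np.random.default_rng(seed)
    # Randomly select k distinct data points as initial representatives z_1..z_k
    z = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    prev_cost = np.inf
    for _ in range(max_iter):
        # Assignment step: map each x^(i) to its closest representative z_j
        d2 = ((X[:, None, :] - z[None, :, :]) ** 2).sum(axis=2)  # (n, k) squared distances
        labels = d2.argmin(axis=1)
        # Current cost: the within-cluster sum of squares (WCSS)
        cost = d2[np.arange(len(X)), labels].sum()
        if cost == prev_cost:  # stop once the cost no longer changes
            break
        prev_cost = cost
        # Update step: move each z_j to the centroid of its assigned points
        for j in range(k):
            if np.any(labels == j):  # keep z_j unchanged if its cluster is empty
                z[j] = X[labels == j].mean(axis=0)
    return z, labels, prev_cost
```

Each iteration performs the assignment step (1) and the centroid update (3); the cost tracked between iterations is exactly the WCSS from step (2).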
$$
\begin{align}
\nabla _{z_j}\left(\sum_{i \in \mathbb C_j} \| x^{(i)} - z_j\| ^2\right) &= 0 \tag{1}\\[10pt]
\sum_{i \in \mathbb C_j} -2(x^{(i)}-z_j) &= 0 \tag{2} \\[8pt]
\sum_{i \in \mathbb C_j} -2x^{(i)}+\sum_{i \in \mathbb C_j} 2z_j&=0 \tag{3}\\[8pt]
\sum_{i \in \mathbb C_j}z_j&=\sum_{i \in \mathbb C_j} x^{(i)} \tag{4}\\[8pt]
|\mathbb C_j| \, z_j&=\sum_{i \in \mathbb C_j} x^{(i)} \tag{5}\\[8pt]
z_j&=\frac{\sum_{i \in \mathbb C_j} x^{(i)}}{|\mathbb C_j|} \tag{6}
\end{align}
$$

Going from (4) to (5): the sum on the left-hand side is just $\vert \mathbb C_j \vert$ copies of $z_j$, since $z_j$ does not depend on the summation index $i$.

## Convergence

K-Means is guaranteed to converge, since the cost function is non-negative and with each iteration the WCSS either decreases or stays the same. However, we can get stuck in a local minimum due to an unlucky initialization of the $z_j$. Therefore, it is recommended to run the algorithm multiple times with different random initializations $z_1, \dots, z_k$.
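The restart strategy can be sketched as follows: run the alternating scheme from several random initializations and keep the solution with the lowest WCSS. Names like `lloyd` and `kmeans_restarts` are illustrative, not from any particular library:

```python
import numpy as np

def wcss(X, z):
    """Within-cluster sum of squares for representatives z."""
    d2 = ((X[:, None, :] - z[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def lloyd(X, k, rng, max_iter=100):
    """One k-means run from a single random initialization."""
    z = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        labels = ((X[:, None, :] - z[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        # Centroid update; keep z_j unchanged if its cluster is empty
        new_z = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else z[j]
                          for j in range(k)])
        if np.allclose(new_z, z):  # fixed point reached
            break
        z = new_z
    return z

def kmeans_restarts(X, k, n_init=10, seed=0):
    """Run k-means n_init times and keep the lowest-cost solution."""
    rng = np.random.default_rng(seed)
    runs = [lloyd(X, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda z: wcss(X, z))
```

Each restart may land in a different local minimum; taking the minimum over runs only ever lowers the final WCSS, at the price of `n_init` times the compute.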