K-Medoids is a [[Clustering]] algorithm designed to address some of the limitations of [[K-Means]]:
- *Centroid placement:* The cluster representatives $z_j$ are almost certainly not actual points from the dataset. It is not even guaranteed that a $z_j$ lies in a populated region of the data (e.g. $z_j$ can end up between two clouds of points); see the sketch after this list.
- *Distance metric:* K-means restricts us to the (squared) Euclidean distance, which is heavily influenced by large outliers.
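
To make the first point concrete, here is a tiny NumPy sketch (the data is made up for illustration) contrasting a K-means centroid, which is a coordinate mean and generally not a member of the dataset, with a medoid, which always is:

```python
import numpy as np

# Toy 2-D data: a tight group of three points plus one outlier.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])

# K-means representative: the coordinate mean, not a dataset point.
centroid = X.mean(axis=0)                      # -> [1.5, 1.5]

# K-medoids representative: the dataset point with the smallest
# summed (here: Euclidean) distance to all other points.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
medoid = X[D.sum(axis=1).argmin()]             # -> [1.0, 0.0]
```

The centroid gets pulled toward the outlier and lands between the two groups, while the medoid stays on an actual observation.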
## Algorithm
Randomly initialize $\{z_1, \dots, z_k\} \subseteq \{x^{(1)}, \dots, x^{(n)}\}$. Then iterate until the cost no longer changes (a minimal implementation sketch follows after these steps):
1. Given $z_1, \dots, z_k$, we find for each $x^{(i)}$ the closest $z_j$ and assign the observation to the corresponding cluster $\mathbb C_j$. Unlike with K-means, we are free to choose any distance measure $\text{dist}(\cdot,\cdot)$ we like.
$ x^{(i)} \mapsto \mathbb C_{j^\ast}, \quad j^\ast = \arg \min_{j=1,\dots,k}\text{dist}\big(x^{(i)}, z_j\big) $
2. The cost is minimized with respect to the cluster assignments, given the fixed cluster representatives.
$ \text{Cost}(z_1, \dots ,z_k)=\sum_{i=1}^n \min_{j=1,\dots,k} \text{dist}\big(x^{(i)}, z_j\big) $
3. Given the new cluster assignments $\mathbb C_1,\dots,\mathbb C_k$, we look for the $z_j \in \{x^{(1)}, \dots ,x^{(n)}\}$ that minimizes the sum of distances to all points assigned to $\mathbb C_j$.
$ z_j= \arg \min_{z \in \{x^{(1)},\dots,x^{(n)}\}} \sum_{i \in \mathbb C_j}\text{dist}\big(x^{(i)}, z\big) $
4. The cost is minimized by updating each $z_j$, given the new cluster assignments $\mathbb C_1, \dots, \mathbb C_k$.
$ \text{Cost}(\mathbb C_1, \dots, \mathbb C_k)=\min_{z_1, \dots , z_k} \sum_{j=1}^k \sum_{i \in \mathbb C_j} \text{dist}\big(x^{(i)}, z_j\big) $
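
Putting the four steps together, here is a minimal sketch of the whole loop, assuming NumPy and SciPy (`scipy.spatial.distance.cdist` supplies the pairwise distances; the function name `k_medoids` and its parameters are my own, not from a library):

```python
import numpy as np
from scipy.spatial.distance import cdist

def k_medoids(X, k, metric="euclidean", max_iter=100, seed=None):
    """Cluster X into k groups; `metric` is any distance cdist accepts."""
    rng = np.random.default_rng(seed)
    n = len(X)
    D = cdist(X, X, metric=metric)                  # all pairwise distances, n x n
    medoids = rng.choice(n, size=k, replace=False)  # init from the data points
    cost = np.inf
    for _ in range(max_iter):
        # Step 1: assign each point to the cluster of its closest medoid.
        assign = D[:, medoids].argmin(axis=1)
        # Step 3: for each cluster, pick as medoid the data point with the
        # smallest summed distance to the cluster's members.
        for j in range(k):
            members = np.flatnonzero(assign == j)
            if members.size:
                medoids[j] = D[:, members].sum(axis=1).argmin()
        # Steps 2/4: evaluate the cost and stop once it no longer decreases.
        new_cost = D[np.arange(n), medoids[assign]].sum()
        if new_cost >= cost:
            break
        cost = new_cost
    return medoids, assign, cost
```

Calling, for instance, `k_medoids(X, k=2, metric="cityblock")` clusters with the Manhattan distance, a choice of $\text{dist}(\cdot,\cdot)$ that plain K-means does not allow.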