Clustering is a type of *unsupervised learning*, where the goal is to group $n$ observations into $K$ clusters based on their similarity, without using any labels. The observations form the set
$ S_n = \{x^{(i)} \mid i=1, \dots, n\} $
## Partitioning
A partition is an assignment of all elements of a set into clusters $\mathbb C_1, \mathbb C_2, \dots , \mathbb C_K$. Each cluster is a set of indices (i.e. the positions of the elements of $S_n$ that belong to that cluster).
**Criteria for Valid Partition:**
- *Complete coverage:* The union of all clusters contains the index of every element of $S_n$.
$ \mathbb C_1 \cup \mathbb C_2 \cup \dots \cup \mathbb C_K = \{1,2, \dots, n\} $
- *No overlap:* The intersection of any two different clusters is the empty set, i.e. each observation belongs to exactly one cluster.
$ \mathbb C_i \cap \mathbb C_j = \emptyset \quad \text{for any } i \neq j \text{ in } \{1, \dots , K\} $
>[!note]
>A trivial but valid partition would be having only a single cluster containing all observations.
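As an illustration, both criteria are straightforward to check in code. Below is a minimal Python sketch (the function name and the example clusters are hypothetical, not from the source); it uses $1$-based indices to match the notation above:

```python
def is_valid_partition(clusters, n):
    """Check that `clusters` (a list of index sets) partitions {1, ..., n}."""
    seen = set()
    for C in clusters:
        # No overlap: an index already seen would lie in two clusters
        if seen & C:
            return False
        seen |= C
    # Complete coverage: every index 1..n appears in some cluster
    return seen == set(range(1, n + 1))

print(is_valid_partition([{1, 3}, {2, 4, 5}], n=5))     # True
print(is_valid_partition([{1, 2}, {2, 3, 4, 5}], n=5))  # False: index 2 overlaps
print(is_valid_partition([set(range(1, 6))], n=5))      # True: trivial partition
```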
## Costs
Instead of a traditional loss function, clustering uses a *cost function* to evaluate the quality of a partition. The total cost is the sum of the costs for individual clusters:
$ \text{Cost}(\mathbb C_1, \dots, \mathbb C_K)= \sum_{j=1}^K \text{Cost}(\mathbb C_j) $
The cost of each cluster can be defined in various ways (each variant is sketched in code below):
- Diameter of the cluster (i.e. the distance between its two farthest points)
- Average distance over all pairs of points in the cluster
- Sum of distances between each point and a cluster representative $z_j$
The last variant can be written as follows, where $x^{(i)}$ is an observed data point and $z_j$ is the representative (e.g. the centroid) of cluster $\mathbb C_j$.
$ \text{Cost}(\mathbb C_j, z_j)=\sum_{i \in \mathbb C_j} \text{distance}(x^{(i)}, z_j) $
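To make the three definitions concrete, here is a minimal NumPy sketch (the function names are my own; `X` is assumed to be the matrix of points belonging to a single cluster):

```python
import numpy as np

def diameter_cost(X):
    """Diameter: distance between the two farthest points in the cluster."""
    diffs = X[:, None, :] - X[None, :, :]  # pairwise difference vectors
    return np.linalg.norm(diffs, axis=-1).max()

def average_pairwise_cost(X):
    """Average distance over all pairs of distinct points (assumes >= 2 points)."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Keep only the upper triangle so each pair is counted once
    return dists[np.triu_indices(len(X), k=1)].mean()

def representative_cost(X, z):
    """Sum of distances between each point and the representative z."""
    return np.linalg.norm(X - z, axis=1).sum()
```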
**Total Clustering Cost:**
Using squared Euclidean distance as the measure of dissimilarity, the total cost can be expressed as follows.
$ \text{Cost}(\mathbb C_1, \dots, \mathbb C_K, z_1, \dots , z_K)=\sum_{j=1}^K \sum_{i \in \mathbb C_j} \|x^{(i)}-z_j \|^2 $
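Putting the pieces together, a sketch of this total cost (the K-means objective) might look like the following; `X`, `clusters`, and the centroid representatives are hypothetical example data, and indices here are $0$-based as is usual in Python:

```python
import numpy as np

def total_cost(X, clusters, Z):
    """Sum of squared Euclidean distances from each point
    to its cluster's representative."""
    return sum(
        np.sum(np.linalg.norm(X[C] - z, axis=1) ** 2)
        for C, z in zip(clusters, Z)
    )

# Hypothetical example: 5 points in 2D split into 2 clusters
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
clusters = [[0, 1, 2], [3, 4]]
Z = [X[C].mean(axis=0) for C in clusters]  # centroids as representatives
print(total_cost(X, clusters, Z))          # ≈ 1.833
```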