**Distance matrix:** A matrix $D \in \mathbb R^{n \times n}$ qualifies as a distance matrix only if the following criteria hold for all $(i,j,k)$. Examples are the Euclidean distance and the Manhattan distance.
|#|Criteria|Description|
|---|---|---|
|1|$D_{ii}=0$|The distance between point $i$ and itself is $0$.|
|2|$D_{ij} \ge 0$|The distance between any two points is non-negative.|
|3|$D_{ij} = D_{ji}$|The distance between points $i$ and $j$ is symmetric.|
|4|$D_{ij} \le D_{ik}+D_{kj}$|The distance between $i$ and $j$ cannot exceed the distance from $i$ to $k$ plus the distance from $k$ to $j$ (triangle inequality).|
**Dissimilarity matrix:** When only the first three criteria are met, we speak of a dissimilarity matrix. An example is dissimilarities derived from rankings.
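As a quick illustration of these criteria, the following minimal sketch checks all four conditions numerically for a given matrix (the helper `is_distance_matrix` is our own naming, not a library function):

```python
import numpy as np

def is_distance_matrix(D, tol=1e-12):
    """Check the four distance-matrix criteria for an n x n matrix D."""
    D = np.asarray(D, dtype=float)
    zero_diag = np.allclose(np.diag(D), 0.0, atol=tol)        # (1) D_ii = 0
    non_negative = np.all(D >= -tol)                           # (2) D_ij >= 0
    symmetric = np.allclose(D, D.T, atol=tol)                  # (3) D_ij = D_ji
    # (4) triangle inequality: D_ij <= D_ik + D_kj for all i, j, k
    triangle = np.all(D[:, None, :] <= D[:, :, None] + D[None, :, :] + tol)
    return zero_diag and non_negative and symmetric and triangle

# Example: pairwise Euclidean distances always satisfy all four criteria.
X = np.random.default_rng(0).normal(size=(10, 3))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(is_distance_matrix(D))  # True
```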
## Approaches
For a given dataset $\mathbb X \in \mathbb R^{n \times p}$ the distance matrix is $\mathbf D \in \mathbb R^{n \times n}$, where each element $D_{ij}= \lVert x_i - x_j \rVert$ is the [[Vector Length]] of the difference between the two points. Instead of $\mathbb X \in \mathbb R^{n \times p}$ we try to find a lower-dimensional embedding $\mathbf Y \in \mathbb R^{n \times q}$, where $q < p$.
There are different variations of MDS (see the code sketch after the list):
- *Classical MDS:* We try to find points $y_1,\dots,y_n$ that minimize the squared differences between the original and the embedded distances. This preserves the distance information in $D$ as closely as possible in the lower-dimensional $y$. Classical MDS assumes a Euclidean distance matrix.
$ \min_{y}\sum_{i=1}^n \sum_{j=1}^n \Big(D_{ij}- \lVert y_i-y_j \rVert\Big)^2 $
- *Weighted MDS:* Classical MDS is extended by weight parameters $w_{ij}$, which are optimized together with $y$. This makes the optimization more complex, but also allows for distance measures other than the Euclidean distance when constructing $D$.
$ \min_{(y, w)}\sum_{i=1}^n \sum_{j=1}^n w_{ij} \Big(D_{ij}- \lVert y_i-y_j \rVert\Big)^2 $
- *Non-metric MDS:* Instead of weights, we add a transformation function $\theta$, where $\theta$ needs to be monotonically increasing. This also allows $D$ to be a dissimilarity matrix.
$ \min_{(y, \theta)}\sum_{i=1}^n \sum_{j=1}^n \Big(\theta(D_{ij})- \lVert y_i-y_j \rVert\Big)^2 $
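As a usage sketch, scikit-learn ships both the metric and the non-metric variant in `sklearn.manifold.MDS`; with `dissimilarity='precomputed'` it accepts a precomputed distance or dissimilarity matrix. Its optimizer (SMACOF) minimizes a stress function that differs slightly in form from the notation above, so treat this as an illustration rather than a one-to-one implementation of the formulas:

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                                # original data, p = 5
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # Euclidean distance matrix

# Metric MDS: embed into q = 2 dimensions while preserving the distances.
metric_mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
Y_metric = metric_mds.fit_transform(D)

# Non-metric MDS: only the rank order of the dissimilarities is preserved.
nonmetric_mds = MDS(n_components=2, metric=False, dissimilarity="precomputed", random_state=0)
Y_nonmetric = nonmetric_mds.fit_transform(D)

print(Y_metric.shape, Y_nonmetric.shape)  # (50, 2) (50, 2)
```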
## Centered Configuration
Classical MDS does not have a unique solution. While a certain configuration $\mathbf Y \in \mathbb R^{n \times q}$ can minimize the objective, the same configuration shifted by a constant, $\mathbf Y+cE$ (where $E \in \mathbb R^{n \times q}$ is a matrix of ones), minimizes it as well. Therefore we center the underlying data $\mathbb X \in \mathbb R^{n \times p}$ along each column (i.e. feature).
$ \sum_{i=1}^n x_k^{(i)}=0 \quad \forall \, k $
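A minimal numerical check of this non-uniqueness (all names below are our own): shifting every embedded point by the same constant leaves all pairwise distances unchanged, which is why the configuration is pinned down by centering each column.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(20, 2))                 # some embedding, q = 2
c = 3.7
E = np.ones_like(Y)                          # matrix of ones, n x q

def pairwise(Z):
    return np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)

# Shifting all points by c*E does not change any pairwise distance ...
print(np.allclose(pairwise(Y), pairwise(Y + c * E)))   # True

# ... so we remove the ambiguity by centering each column (feature) to mean zero.
X = rng.normal(loc=5.0, size=(20, 4))
X_centered = X - X.mean(axis=0)
print(np.allclose(X_centered.sum(axis=0), 0.0))        # column sums are 0
```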
>[!note]
>While MDS preserves the distances between data points, [[Principal Component Analysis|PCA]] preserves the variance of the data (it tries to capture as much variance as possible within the lower dimensions).
With the data centered, the inner products $(XX^T)_{ij}$ can be recovered from the squared distances alone:
$
\begin{align}
D_{ij}^2&=||x_i-x_j||^2 \\[8pt]
D_{ij}^2&=||x_i||^2 + ||x_j||^2 - 2x_i^T x_j \\[8pt]
D_{ij}^2 &= (XX^T)_{ii} + (XX^T)_{jj}-2(XX^T)_{ij} \\[8pt]
(XX^T)_{ij} &=-\frac{1}{2}\Big(D_{ij}^2- \underbrace{\frac{1}{n}\sum_{k=1}^n D_{i k}^2}_{\text{row i average}} -\underbrace{\frac{1}{n}\sum_{k=1}^n D_{k j}^2}_{\text{col j average}} +\underbrace{ \frac{1}{n^2}\sum_{k=1}^n \sum_{l=1}^n D_{kl}^2}_{\text{matrix average}}\Big)
\end{align}
$
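The last line is the classical double-centering step. A short sketch (assuming column-centered data, names our own) verifies that it recovers the inner products $XX^T$ from the squared distances alone:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
X = X - X.mean(axis=0)                                  # centered configuration

D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)     # squared distances D_ij^2
n = D2.shape[0]

# Double centering: subtract row and column means, add back the grand mean.
row_mean = D2.mean(axis=1, keepdims=True)
col_mean = D2.mean(axis=0, keepdims=True)
grand_mean = D2.mean()
B = -0.5 * (D2 - row_mean - col_mean + grand_mean)

print(np.allclose(B, X @ X.T))                          # True: B equals the Gram matrix
```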
**Note:** $XX^T$ contains the inner product for every $(i,j)$ combination. The inner product $(XX^T)_{ij}$ captures both the angle (cosine) between and the lengths of the two vectors.
$ \mathbf a \cdot \mathbf b= \lVert\mathbf a\rVert \, \lVert\mathbf b\rVert \cos(\alpha) $
where $\alpha$ is the angle between $\mathbf a$ and $\mathbf b$.
It turns out that, for centered data, minimizing the differences in distances is equivalent to minimizing the differences in these inner products (angles and lengths) between all points.
$
\begin{align}
&\min_{y} \Big( \sum_{i=1}^n \sum_{j=1}^n \Big(D_{ij}- \lVert y_i-y_j \rVert\Big)^2 \Big) \\[6pt]
\equiv\; &\min_y \Big(\text{trace}\big((XX^T - YY^T)^2\big)\Big)
\end{align}
$
where:
- (1) Element-wise: we take the difference of the distances for every pair, square each difference, and sum up all those squared differences.
- (2) We take the difference of the matrices as a whole; squaring it and taking only the [[Matrix Trace]] is equivalent to the element-wise approach in (1).
The minimum is reached when $YY^T$ is constructed from the [[Eigenvectors]] belonging to the $q$ largest eigenvalues of $XX^T$, i.e. $Y = V_q \Lambda_q^{1/2}$. Given a fixed $q$, this is the representation that most closely mimics the inner products (and hence the distances) of the original data points, as the leading [[Eigenvectors]] capture the directions of maximum variance in the dataset.
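Putting the pieces together, a minimal classical-MDS sketch (our own helper, not a library routine): double-center the squared distances, take the eigendecomposition of the resulting Gram matrix, and build $Y$ from the eigenvectors belonging to the $q$ largest eigenvalues, scaled by the square roots of those eigenvalues.

```python
import numpy as np

def classical_mds(D, q):
    """Classical MDS: embed a Euclidean distance matrix D into q dimensions."""
    D2 = np.asarray(D, dtype=float) ** 2
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ D2 @ J                        # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)         # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:q]          # indices of the q largest eigenvalues
    L = np.clip(eigvals[idx], 0.0, None)         # guard against tiny negative values
    return eigvecs[:, idx] * np.sqrt(L)          # Y = V_q * Lambda_q^(1/2)

# Sanity check: with q = p the original pairwise distances are reproduced exactly.
rng = np.random.default_rng(0)
X = rng.normal(size=(25, 3))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = classical_mds(D, q=3)
D_Y = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
print(np.allclose(D, D_Y))                       # True
```

With $q = p$ and a Euclidean distance matrix, the embedding reproduces the original pairwise distances exactly (up to rotation and translation, to which distances are insensitive).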