The goal of Stochastic Neighbor Embedding ("SNE") is to preserve the identity of close-by neighbors in the data. To do so, we place a [[Gaussian Distribution|Gaussian]] onto each datapoint $i$. The probability $p_{ij}$ then represents the probability that point $i$ would pick point $j$ as its neighbor, if neighbors were picked in proportion to their closeness in distance under a Gaussian centered at $i$.

$ p_{ij}=\frac{\exp (-D_{ij}^2)}{\sum_{k\not = l}\exp (-D_{kl}^2)} $

- *Numerator:* Resembles the $\exp(-\frac{(\theta - \mu)^2}{2\sigma^2})$ we know from the Gaussian. Here $D_{ij}^2$ is the squared distance to the mean (when we calculate $p_{ij}$, the point $i$ acts as the mean).
- *Denominator:* The normalization term that sums the exponentiated negative squared distances over all pairs of points. It ensures that $p_{ij}$ lies between 0 and 1 and that the probabilities sum to 1.

Now, we do the same calculation in a lower-dimensional space, using the low-dimensional coordinates $y_i$:

$ q_{ij}=\frac{\exp (-\lVert y_i-y_j\rVert^2)}{\sum_{k\not = l}\exp (-\lVert y_k - y_l\rVert^2)} $

> [!note]
> Here we simplify by using the same variance for all datapoints. In actual SNE implementations, the [[Variance]] is defined per datapoint and governed by a hyperparameter ("perplexity").

Finally, we want to minimize the difference between these two sets of probabilities $p_{ij}$ and $q_{ij}$. A suitable measure for this is the [[Kullback-Leibler Divergence|KL Divergence]]:

$ \text{KL}(P \| Q)=\sum_{i\not =j}p_{ij} \log\Big(\frac{p_{ij}}{q_{ij}}\Big) $

To minimize this, we use stochastic [[Gradient Descent]]. The output can depend on the path the algorithm takes, as there can be many local optima.

The Gaussian is well-suited for this purpose because of its:

- *Symmetry:* The probability density is highest at the center and decreases symmetrically as we move away, reflecting the intuitive notion that points closer to the center are more likely to be considered "neighbors."
- *Bell curve:* The shape of the distribution ensures that the probability falls off smoothly and rapidly as the distance from the center increases. This means that points very close to $i$ are much more likely to be picked as neighbors than those farther away.
- *Variance:* The spread of the Gaussian is controlled by its variance ($\sigma^2$). In SNE, the variance is adaptively chosen for each point to account for the local density of the data, allowing for flexibility in defining neighborhoods in regions of differing density.

## tSNE

This is a variation of SNE where we use a [[Student T-Distribution]] with $1$ degree of freedom to model the distances in the lower-dimensional space for $q_{ij}$. The distances in $p_{ij}$ are still calculated as before.

t-SNE has the benefit that it models moderate distances more reliably. This is because the t-distribution has fatter tails, whereas under a Gaussian the probabilities at moderate distances are already so small that the optimization cannot really distinguish between them.

$ q_{ij}=\frac{(1+\lVert y_i-y_j \rVert^2)^{-1}}{\sum_{k \not = l}(1+\lVert y_k-y_l \rVert^2)^{-1}} $
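
To make the formulas above concrete, here is a minimal NumPy sketch that computes $p_{ij}$ from pairwise squared distances with a Gaussian kernel, the corresponding $q_{ij}$ (both the Gaussian and the Student-t variant), and the KL divergence between them. It keeps the simplification from the note above (one shared variance for all points); the function names and the toy data are illustrative, not taken from any particular library.

```python
import numpy as np

def squared_distances(Z):
    """Pairwise squared Euclidean distances D_ij^2 between the rows of Z."""
    sq = np.sum(Z ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2 * Z @ Z.T
    return np.maximum(D2, 0.0)  # guard against tiny negative values from rounding

def joint_probabilities(Z, kernel="gaussian"):
    """p_ij (Gaussian kernel, shared variance) or q_ij (Student-t kernel, 1 dof)."""
    D2 = squared_distances(Z)
    if kernel == "gaussian":
        aff = np.exp(-D2)            # exp(-D_ij^2)
    else:
        aff = 1.0 / (1.0 + D2)       # (1 + ||y_i - y_j||^2)^(-1), the t-SNE kernel
    np.fill_diagonal(aff, 0.0)       # a point never picks itself as its neighbor
    return aff / aff.sum()           # normalize over all pairs k != l

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) = sum_{i != j} p_ij * log(p_ij / q_ij)."""
    mask = P > 0
    return np.sum(P[mask] * np.log((P[mask] + eps) / (Q[mask] + eps)))

# Toy usage: 20 points in 5 dimensions, compared against a random 2-D embedding.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                       # high-dimensional data
Y = rng.normal(size=(20, 2))                       # candidate low-dimensional embedding

P = joint_probabilities(X, kernel="gaussian")      # p_ij in the original space
Q_sne = joint_probabilities(Y, kernel="gaussian")  # q_ij for plain SNE
Q_tsne = joint_probabilities(Y, kernel="student")  # q_ij for t-SNE
print(kl_divergence(P, Q_sne), kl_divergence(P, Q_tsne))
```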
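
The note above says the KL objective is minimized with (stochastic) gradient descent. As a rough illustration of what that loop looks like, the sketch below runs plain full-batch gradient descent on the t-SNE objective, using the gradient $\frac{\partial \text{KL}}{\partial y_i}=4\sum_j (p_{ij}-q_{ij})(y_i-y_j)(1+\lVert y_i-y_j\rVert^2)^{-1}$ from the t-SNE paper. It deliberately leaves out the refinements of real implementations (momentum, early exaggeration, perplexity-based variances), and the learning rate and iteration count are arbitrary choices.

```python
import numpy as np

def squared_distances(Z):
    """Pairwise squared Euclidean distances between the rows of Z."""
    sq = np.sum(Z ** 2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2 * Z @ Z.T, 0.0)

def tsne_embed(P, n_components=2, n_iter=500, learning_rate=100.0, seed=0):
    """Minimize KL(P || Q) over low-dimensional points Y with plain gradient descent."""
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    Y = rng.normal(scale=1e-2, size=(n, n_components))    # small random initialization
    for _ in range(n_iter):
        num = 1.0 / (1.0 + squared_distances(Y))          # Student-t kernel values
        np.fill_diagonal(num, 0.0)
        Q = num / num.sum()                                # q_ij, normalized over all pairs
        W = (P - Q) * num                                  # (p_ij - q_ij)(1 + ||y_i - y_j||^2)^(-1)
        grad = 4.0 * (W.sum(axis=1)[:, None] * Y - W @ Y)  # gradient of KL w.r.t. each y_i
        Y -= learning_rate * grad                          # plain gradient step, no momentum
    return Y

# Toy usage: Gaussian p_ij (shared variance) from random high-dimensional data.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 10))
aff = np.exp(-squared_distances(X))
np.fill_diagonal(aff, 0.0)
P = aff / aff.sum()
Y = tsne_embed(P)
print(Y.shape)  # (30, 2)
```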
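
Regarding the per-point variance mentioned in the callout above: real implementations choose each $\sigma_i$ so that the conditional distribution $p_{j|i}$ has a fixed perplexity $2^{H(P_i)}$, where $H$ is the Shannon entropy, and then symmetrize the conditionals to obtain $p_{ij}$. The following is a minimal sketch of the binary search over the precision $\beta_i = 1/(2\sigma_i^2)$; the function name, tolerances, and default perplexity are illustrative.

```python
import numpy as np

def sigma_from_perplexity(D2_row, i, perplexity=30.0, tol=1e-5, max_iter=50):
    """Binary-search the precision beta_i = 1/(2*sigma_i^2) so that the conditional
    distribution p_{j|i} = exp(-beta_i * D_ij^2) / sum_k exp(-beta_i * D_ik^2)
    has Shannon entropy log2(perplexity)."""
    target_entropy = np.log2(perplexity)
    lo, hi = 0.0, np.inf
    beta = 1.0
    for _ in range(max_iter):
        p = np.exp(-beta * D2_row)
        p[i] = 0.0                                # a point never picks itself
        p /= p.sum()
        entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
        if abs(entropy - target_entropy) < tol:
            break
        if entropy > target_entropy:              # too flat -> narrow the Gaussian
            lo = beta
            beta = beta * 2.0 if np.isinf(hi) else (lo + hi) / 2.0
        else:                                     # too peaked -> widen the Gaussian
            hi = beta
            beta = (lo + hi) / 2.0
    return np.sqrt(1.0 / (2.0 * beta)), p         # sigma_i and the conditional p_{j|i}

# Toy usage: per-point sigma for the first datapoint of a random dataset.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
sq = np.sum(X ** 2, axis=1)
D2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
sigma_0, p_cond_0 = sigma_from_perplexity(D2[0], i=0)
print(sigma_0)
```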