There are three equivalent approaches to obtaining the principal components, i.e. the [[Eigenvectors|eigenvectors and eigenvalues]] of the [[Covariance Matrix]]:
1. Find vectors onto which the data is [[Vector Projection|projected]] such that the [[Variance]] of the projected points is maximized.
2. Find vectors onto which the data is projected such that the [[Vector Length]] of the residual (the "orthogonal distance" between each point and its projection) is minimized.
3. Perform [[Spectral Theorem|Spectral Decomposition]] of the covariance matrix (see the sketch right below).
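As a small sketch (not from the original note; the data and variable names are made up), approach 3 can be carried out directly with NumPy by eigendecomposing the covariance matrix of the centered data:

```python
import numpy as np

# Toy data: n = 200 points in p = 3 dimensions (purely illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.5, 0.2]])

Xc = X - X.mean(axis=0)          # center the data
cov = Xc.T @ Xc / len(Xc)        # covariance matrix (1/n convention)

# Spectral decomposition: eigenvalues ascending, columns of eigvecs are eigenvectors
eigvals, eigvecs = np.linalg.eigh(cov)
w = eigvecs[:, -1]               # direction with the largest eigenvalue
print(eigvals[::-1])             # variances along the principal directions
print(w)                         # first principal direction (unit vector)
```

The columns of `eigvecs` are the candidate directions $w$; the derivation below shows why the same directions also solve approaches 1 and 2.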
![[pca.jpeg|center|600]]
The length of the adjacent vector (the projection of $x$ onto the unit vector $w$) is $w^Tx$. Multiplying by $w$ itself, we obtain the adjacent vector $w^Txw$. The residual vector is then the difference between $x$ and that adjacent vector. As we are interested in its length, we take the squared norm.
$ \text{resid}= ||x-w^Txw ||^2 $
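For concreteness, a minimal sketch of this residual computation for one point $x$ and one unit vector $w$ (purely illustrative values):

```python
import numpy as np

x = np.array([2.0, 1.0, 0.5])
w = np.array([1.0, 1.0, 0.0])
w = w / np.linalg.norm(w)        # w must be a unit vector

proj = (w @ x) * w               # adjacent vector: (w^T x) w
resid = np.sum((x - proj) ** 2)  # squared norm of the residual
print(proj, resid)
```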
Given $n$ data points, we can find the first eigenvector $w$ by minimizing the sum of these residuals.
$
\begin{align}
\min_{w \in \mathbb R^p}& \left( \sum_{i=1}^n ||x_i-w^Tx_iw ||^2 \right) \tag{1}\\[10pt]
||x_i-w^Tx_iw ||^2&=(x_i-w^Tx_iw)^T \, (x_i-w^Tx_iw) \tag{2}\\[10pt]
&=x_i^Tx_i-x_i^T(w^Tx_iw)-(w^Tx_iw)^Tx_i+(w^Tx_iw)^T(w^Tx_iw) \tag{3}\\[10pt]
&=x_i^Tx_i-\underbrace{x_i^T(w^Tx_iw)}_{A}-\underbrace{(w^Tx_iw^T)x_i}_{B}+(\underbrace{w^Tx_i}_{\text{scalar}}w^T)(\underbrace{w^Tx_i}_{\text{scalar}}w) \tag{4}\\[12pt]
&=x_i^Tx_i -2(x_i^T(\underbrace{w^Tx_i}_{\text{scalar}}w))+(w^Tx_i)^2\underbrace{(w^Tw)}_{=1} \tag{5}\\[12pt]
&=x_i^Tx_i -2(w^Tx_i)(x_i^Tw)+(w^Tx_i)^2 \tag{6}\\[10pt]
\min_{w \in \mathbb R^p}&\sum_{i=1}^n \Big(x_i^Tx_i -(w^Tx_i)^2\Big) \tag{7}
\end{align}
$
where:
- (2) The squared Euclidean norm of a vector $v$ equals the inner product $v^Tv$. We expand a single summand here and plug the result back into the sum in (7).
- (3) As $w^Tx_i$ is a scalar, $(w^Tx_iw)^T=w^Tx_iw^T$.
- (4) $A$ and $B$ are equal: since a scalar equals its own transpose, $x_i^Tw=w^Tx_i$, so both evaluate to $(w^Tx_i)(x_i^Tw)=(w^Tx_i)^2$.
- (5) Pulling the scalar $w^Tx_i$ out of the last term. As $w$ is a unit vector, we know that $w^Tw=1$.
- (6) Using $x_i^Tw=w^Tx_i$ again, the last two terms combine to $-2(w^Tx_i)^2+(w^Tx_i)^2=-(w^Tx_i)^2$, which gives (7) after summing over all $i$.
When we minimize over $w$, the first term $x_i^Tx_i$ is just a constant. Therefore minimizing the residuals turns out to be equivalent to maximizing the squared lengths of the adjacent (projection) vectors.
$ \max_{w \in \mathbb R^p} \sum_{i=1}^n(w^Tx_i)^2 $
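The equivalence is easy to check numerically. The following sketch (my own toy setup, not part of the note) verifies that the summed residuals and the summed squared projections add up to the constant $\sum_{i=1}^n x_i^Tx_i$ for an arbitrary unit vector $w$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                    # rows are the data points x_i
w = rng.normal(size=3)
w = w / np.linalg.norm(w)                        # any unit vector

resid = np.sum((X - np.outer(X @ w, w)) ** 2)    # sum_i ||x_i - (w^T x_i) w||^2
proj_sq = np.sum((X @ w) ** 2)                   # sum_i (w^T x_i)^2
const = np.sum(X ** 2)                           # sum_i x_i^T x_i, independent of w

print(np.isclose(resid, const - proj_sq))        # True: minimizing resid == maximizing proj_sq
```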
Finally, we can also prove the equivalence to the approach based on the spectral decomposition of the symmetric [[Covariance Matrix]].
$
\begin{align}
\max_{w \in \mathbb R^p}& \sum_{i=1}^n(w^Tx_i)^2 \\
& =\sum_{i=1}^n(w^Tx_i)(w^Tx_i)^T \\
& =\sum_{i=1}^n(w^T x_i)(x_i^Tw) \\
&= w^T \Big(\sum_{i=1}^nx_ix_i^T\Big)w \\[10pt]
\max_{w \in \mathbb R^p}& \Bigg(w^T \Big(\underbrace{\frac{1}{n}\sum_{i=1}^nx_ix_i^T}_{\text{Cov}(X)}\Big)w \Bigg)\\[10pt]
\end{align}
$
where $(w^Tx_i)^T=x_i^Tw$ because the transpose of a scalar is the scalar itself, and $\frac{1}{n}\sum_{i=1}^n x_ix_i^T=\text{Cov}(X)$ for centered data (the constant factor $\frac{1}{n}$ does not change the maximizer). Maximizing $w^T\,\text{Cov}(X)\,w$ under the constraint $w^Tw=1$ is exactly the problem solved by the [[Spectral Theorem|Spectral Decomposition]]: the maximizer is the eigenvector of $\text{Cov}(X)$ with the largest eigenvalue.
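A numerical sketch of this last step (assumed toy data; `projected_variance` is just an illustrative helper): the leading eigenvector of $\text{Cov}(X)$ should beat any random unit vector at maximizing $w^T\,\text{Cov}(X)\,w$, and the attained value should equal the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / len(Xc)

eigvals, eigvecs = np.linalg.eigh(cov)
w_best = eigvecs[:, -1]                           # leading eigenvector

def projected_variance(w):
    return w @ cov @ w                            # w^T Cov(X) w

random_dirs = rng.normal(size=(1000, 4))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
best_random = max(projected_variance(w) for w in random_dirs)

print(projected_variance(w_best) >= best_random)            # True
print(np.isclose(projected_variance(w_best), eigvals[-1]))  # equals the top eigenvalue
```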
>[!note]
>When features are measured in different units, we should either standardize the data first and then use the covariance matrix, or directly use the correlation matrix instead of the [[Covariance Matrix]]. Both approaches are equivalent.
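A quick sketch of that equivalence (illustrative data only): the covariance matrix of the standardized data coincides with the correlation matrix of the raw data.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3)) * np.array([1.0, 10.0, 100.0])  # features on very different scales

Z = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize each feature
cov_of_Z = np.cov(Z, rowvar=False, bias=True)  # covariance of standardized data (1/n convention)
corr_of_X = np.corrcoef(X, rowvar=False)       # correlation matrix of the raw data

print(np.allclose(cov_of_Z, corr_of_X))        # True: both routes give the same matrix
```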
>[!note]
>When PCA is based on the correlation matrix, the eigenvalues average to 1 (their sum equals the number of features). We can therefore keep only the dimensions with "above-average variance", i.e. those whose eigenvalue is > 1. This is one rule for choosing the number of principal components.
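A sketch of this rule, often called the Kaiser criterion (toy data and names are illustrative): compute the eigenvalues of the correlation matrix, confirm they average to 1, and count how many exceed that average.

```python
import numpy as np

rng = np.random.default_rng(4)
latent = rng.normal(size=(500, 2))                       # 2 underlying factors
loadings = rng.normal(size=(2, 5))
X = np.hstack([latent @ loadings + 0.3 * rng.normal(size=(500, 5)),
               rng.normal(size=(500, 2))])               # plus 2 pure-noise features

corr = np.corrcoef(X, rowvar=False)
eigvals = np.linalg.eigvalsh(corr)[::-1]                 # eigenvalues, descending

print(eigvals.mean())                                    # ~1.0 for a correlation matrix
k = int(np.sum(eigvals > 1))                             # "above-average-variance" dimensions
print("principal components kept:", k)
```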