There are three equivalent approaches to obtaining the principal components, i.e. the [[Eigenvectors|eigenvectors and eigenvalues]] of the [[Covariance Matrix]]:
1. Find vectors onto which the data is [[Vector Projection|projected]] such that the [[Variance]] of the projected points is maximized.
2. Find vectors onto which the data is projected such that the [[Vector Length]] of the residual (the "orthogonal distance" between each point and its projection) is minimized.
3. Perform [[Spectral Theorem|Spectral Decomposition]] of the covariance matrix (see the sketch right below).
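As a small sketch (not from the original note; the data and variable names are made up), approach 3 can be carried out directly with NumPy by eigendecomposing the covariance matrix of the centered data:

```python
import numpy as np

# Toy data: n = 200 points in p = 3 dimensions (purely illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.5, 0.2]])

Xc = X - X.mean(axis=0)          # center the data
cov = Xc.T @ Xc / len(Xc)        # covariance matrix (1/n convention)

# Spectral decomposition: eigenvalues ascending, columns of eigvecs are eigenvectors
eigvals, eigvecs = np.linalg.eigh(cov)
w = eigvecs[:, -1]               # direction with the largest eigenvalue
print(eigvals[::-1])             # variances along the principal directions
print(w)                         # first principal direction (unit vector)
```

The columns of `eigvecs` are the candidate directions $w$; the derivation below shows why the same directions also solve approaches 1 and 2.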
![[pca.jpeg|center|600]]
The length of the adjacent vector (the projection of $x$ onto the unit vector $w$) is $w^Tx$. Multiplying by $w$ itself, we obtain the adjacent vector $w^Txw$. The residual vector is then the difference between $x$ and that adjacent vector. As we are interested in its length, we take the squared norm.
$ \text{resid}= ||x-w^Txw ||^2 $
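For concreteness, a minimal sketch of this residual computation for one point $x$ and one unit vector $w$ (purely illustrative values):

```python
import numpy as np

x = np.array([2.0, 1.0, 0.5])
w = np.array([1.0, 1.0, 0.0])
w = w / np.linalg.norm(w)        # w must be a unit vector

proj = (w @ x) * w               # adjacent vector: (w^T x) w
resid = np.sum((x - proj) ** 2)  # squared norm of the residual
print(proj, resid)
```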
Given $n$ data points, we can find the first eigenvector $w$ by minimizing the sum of these residuals.
$
\begin{align}
\min_{w \in \mathbb R^p}& \left( \sum_{i=1}^n ||x_i-w^Tx_iw ||^2 \right) \tag{1}\\[10pt]
||x_i-w^Tx_iw ||^2&=(x_i-w^Tx_iw)^T \, (x_i-w^Tx_iw) \tag{2}\\[10pt]
&=x_i^Tx_i-x_i^T(w^Tx_iw)-(w^Tx_iw)^Tx_i+(w^Tx_iw)^T(w^Tx_iw) \tag{3}\\[10pt]
&=x_i^Tx_i-\underbrace{x_i^T(w^Tx_iw)}_{A}-\underbrace{(w^Tx_iw^T)x_i}_{B}+(\underbrace{w^Tx_i}_{\text{scalar}}w^T)(\underbrace{w^Tx_i}_{\text{scalar}}w) \tag{4}\\[12pt]
&=x_i^Tx_i -2(x_i^T(\underbrace{w^Tx_i}_{\text{scalar}}w))+(w^Tx_i)^2\underbrace{(w^Tw)}_{=1} \tag{5}\\[12pt]
&=x_i^Tx_i -2(w^Tx_i)(x_i^Tw)+(w^Tx_i)^2 \tag{6}\\[10pt]
\min_{w \in \mathbb R^p}&\sum_{i=1}^n \Big(x_i^Tx_i -(w^Tx_i)^2\Big) \tag{7}
\end{align}
$
where:
- (2) The squared Euclidean norm of a vector $v$ equals the inner product $v^Tv$. We expand a single summand here and plug the result back into the sum in (7).
- (3) As $w^Tx_i$ is a scalar, $(w^Tx_iw)^T=w^Tx_iw^T$.
- (4) $A$ and $B$ are equal: since a scalar equals its own transpose, $x_i^Tw=w^Tx_i$, so both evaluate to $(w^Tx_i)(x_i^Tw)=(w^Tx_i)^2$.
- (5) Pulling the scalar $w^Tx_i$ out of the last term. As $w$ is a unit vector, we know that $w^Tw=1$.
- (6) Using $x_i^Tw=w^Tx_i$ again, the last two terms combine to $-2(w^Tx_i)^2+(w^Tx_i)^2=-(w^Tx_i)^2$, which gives (7) after summing over all $i$.
When we minimize over $w$, the first term $x_i^Tx_i$ is just a constant. Therefore minimizing the residuals turns out to be equivalent to maximizing the squared lengths of the adjacent (projection) vectors.
$ \max_{w \in \mathbb R^p} \sum_{i=1}^n(w^Tx_i)^2 $
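The equivalence is easy to check numerically. The following sketch (my own toy setup, not part of the note) verifies that the summed residuals and the summed squared projections add up to the constant $\sum_{i=1}^n x_i^Tx_i$ for an arbitrary unit vector $w$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                    # rows are the data points x_i
w = rng.normal(size=3)
w = w / np.linalg.norm(w)                        # any unit vector

resid = np.sum((X - np.outer(X @ w, w)) ** 2)    # sum_i ||x_i - (w^T x_i) w||^2
proj_sq = np.sum((X @ w) ** 2)                   # sum_i (w^T x_i)^2
const = np.sum(X ** 2)                           # sum_i x_i^T x_i, independent of w

print(np.isclose(resid, const - proj_sq))        # True: minimizing resid == maximizing proj_sq
```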
Finally, we can also prove the equivalence to the approach based on the spectral decomposition of the symmetric [[Covariance Matrix]].
$
\begin{align}
\max_{w \in \mathbb R^p}& \sum_{i=1}^n(w^Tx_i)^2 \\
& =\sum_{i=1}^n(w^Tx_i)(w^Tx_i)^T \\
& =\sum_{i=1}^n(w^T x_i)(x_i^Tw) \\
&= w^T \Big(\sum_{i=1}^nx_ix_i^T\Big)w \\[10pt]
\max_{w \in \mathbb R^p}& \Bigg(w^T \Big(\underbrace{\frac{1}{n}\sum_{i=1}^nx_ix_i^T}_{\text{Cov}(X)}\Big)w \Bigg)\\[10pt]
\end{align}
$
where $(w^Tx_i)^T=x_i^Tw$ because the transpose of a scalar is the scalar itself, and $\frac{1}{n}\sum_{i=1}^n x_ix_i^T=\text{Cov}(X)$ for centered data (the constant factor $\frac{1}{n}$ does not change the maximizer). Maximizing $w^T\,\text{Cov}(X)\,w$ under the constraint $w^Tw=1$ is exactly the problem solved by the [[Spectral Theorem|Spectral Decomposition]]: the maximizer is the eigenvector of $\text{Cov}(X)$ with the largest eigenvalue.
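A numerical sketch of this last step (assumed toy data; `projected_variance` is just an illustrative helper): the leading eigenvector of $\text{Cov}(X)$ should beat any random unit vector at maximizing $w^T\,\text{Cov}(X)\,w$, and the attained value should equal the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / len(Xc)

eigvals, eigvecs = np.linalg.eigh(cov)
w_best = eigvecs[:, -1]                           # leading eigenvector

def projected_variance(w):
    return w @ cov @ w                            # w^T Cov(X) w

random_dirs = rng.normal(size=(1000, 4))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
best_random = max(projected_variance(w) for w in random_dirs)

print(projected_variance(w_best) >= best_random)            # True
print(np.isclose(projected_variance(w_best), eigvals[-1]))  # equals the top eigenvalue
```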
>[!note]
>When features are measured in different units, we should either standardize the data first and then use the covariance matrix, or directly use the correlation matrix instead of the [[Covariance Matrix]]. Both approaches are equivalent.
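A quick sketch of that equivalence (illustrative data only): the covariance matrix of the standardized data coincides with the correlation matrix of the raw data.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3)) * np.array([1.0, 10.0, 100.0])  # features on very different scales

Z = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize each feature
cov_of_Z = np.cov(Z, rowvar=False, bias=True)  # covariance of standardized data (1/n convention)
corr_of_X = np.corrcoef(X, rowvar=False)       # correlation matrix of the raw data

print(np.allclose(cov_of_Z, corr_of_X))        # True: both routes give the same matrix
```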
>[!note]
>When PCA is based on the correlation matrix, the eigenvalues average to 1 (their sum equals the number of features). We can therefore keep only the dimensions with "above-average variance", i.e. those whose eigenvalue is > 1. This is one rule for choosing the number of principal components.
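A sketch of this rule, often called the Kaiser criterion (toy data and names are illustrative): compute the eigenvalues of the correlation matrix, confirm they average to 1, and count how many exceed that average.

```python
import numpy as np

rng = np.random.default_rng(4)
latent = rng.normal(size=(500, 2))                       # 2 underlying factors
loadings = rng.normal(size=(2, 5))
X = np.hstack([latent @ loadings + 0.3 * rng.normal(size=(500, 5)),
               rng.normal(size=(500, 2))])               # plus 2 pure-noise features

corr = np.corrcoef(X, rowvar=False)
eigvals = np.linalg.eigvalsh(corr)[::-1]                 # eigenvalues, descending

print(eigvals.mean())                                    # ~1.0 for a correlation matrix
k = int(np.sum(eigvals > 1))                             # "above-average-variance" dimensions
print("principal components kept:", k)
```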