**Covariance for Two Random Variables:** The [[Covariance]] between two [[Random Variable|r.v.s]] is simply the [[Expectation of a Product]] of $X$ and $Y$ when both r.v.s are centered around their respective means.

$$
\mathrm{Cov}(X,Y)=\mathbb E\Big[\big(X-\mathbb E[X]\big)\big(Y-\mathbb E[Y]\big)\Big]
$$

**Covariance for Multiple Random Variables:** Now, instead of a single pair, we have a [[Vector Operations|Vector]] of attributes $X= \begin{bmatrix} X^{(1)} & \cdots & X^{(d)}\end{bmatrix}^T$ and we are interested in the covariance terms coming from all pairwise combinations of its components. We therefore take the outer product of the centered vector with itself, written $XX^T$: multiplying a $(d \times 1)$ vector by a $(1 \times d)$ vector yields the $(d \times d)$ covariance matrix $\Sigma$.

$$
\begin{align}
\mathrm{Cov}(X) = \Sigma &= \mathbb E\Big[(X-\mathbb E[X])(X-\mathbb E[X])^T\Big] \\[4pt]
\Sigma_{ij} &= \mathbb E\Big[(X^{(i)}-\mathbb E[X^{(i)}])(X^{(j)}-\mathbb E[X^{(j)}])\Big]
\end{align}
$$

Each diagonal element $\Sigma_{ii}$ measures the covariance of $X^{(i)}$ with itself. Thus the diagonal of $\Sigma$ holds the variance of each $X^{(i)}$.

## Matrix Representation

For a dataset (design matrix) $\mathbf X$, we can compute the empirical covariance matrix. $\mathbf X$ should always be constructed so that each row represents one observation and each column collects all observations of a single feature. Thus $\mathbf X$ has shape $(n \times d)$ with $n$ observations and $d$ features.

$$
\mathbf X = \begin{pmatrix} \leftarrow&\mathbf X_1^T & \rightarrow \\ \vdots & \vdots & \vdots \\[3pt] \leftarrow&\mathbf X_n^T & \rightarrow \end{pmatrix}
$$

The covariance matrix can also be written as follows, where:

- $\mathbf I_n$ is an $(n \times n)$ identity matrix
- $\mathbf 1_n$ is an $(n \times 1)$ column vector where all entries are $1$.

$$
\begin{align}
\Sigma &= \frac{1}{n}\mathbf X^T \big(\mathbf I_n-\tfrac{1}{n} \mathbf 1_n \mathbf 1_n^T\big) \mathbf X \\[8pt]
\Sigma &= \frac{1}{n}\mathbf X^T \Big(\underbrace{\big(\mathbf I_n \mathbf X\big)- \underbrace{\big(\tfrac{1}{n} \mathbf 1_n \mathbf 1_n^T \mathbf X\big)}_{\bar{\mathbf X}}}_{\mathbf X-\bar{\mathbf X}}\Big)
\end{align}
$$

This closely mirrors the definition above, where we multiply $(X- \mathbb E[X])$ by its transpose and then take the expectation. Here we take averages instead, since we compute the empirical covariance matrix.

## Affine Transformation

Recall that when $X$ was a univariate r.v., we could apply transformations to $X$ and still make direct statements about the [[Variance after Linear Transformation]].

$$
\mathrm{Var}(aX+b) = a^2 \mathrm{Var}(X)
$$

We can apply such transformations to the random vector $X$ as well. Assume we transform $X$ by a matrix $\mathbf A$ and a constant vector $\mathbf B$ and want to know how $\Sigma$ looks afterwards. Just as $b$ dropped out above, the constant shift $\mathbf B$ has no effect on the covariance.

$$
\mathrm{Cov}(\mathbf AX+ \mathbf B) = \mathrm{Cov}(\mathbf AX)= \mathbf A\,\mathrm{Cov}(X)\,\mathbf A^T = \mathbf A \Sigma \mathbf A^T
$$

>[!note]
>We arrange $\mathbf A$ and $\mathbf A^T$ so that the matrix multiplication is conformable, with the $(d \times d)$ covariance matrix $\Sigma$ in the middle: for a $(k \times d)$ matrix $\mathbf A$, the product $(k \times d)(d \times d)(d \times k)$ yields a $(k \times k)$ covariance matrix.
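
To make the matrix representation above concrete, here is a minimal NumPy sketch; the design matrix is randomly generated purely for illustration. It builds the centering matrix $\mathbf I_n - \frac{1}{n}\mathbf 1_n \mathbf 1_n^T$ explicitly and checks that $\frac{1}{n}\mathbf X^T\big(\mathbf I_n - \frac{1}{n}\mathbf 1_n \mathbf 1_n^T\big)\mathbf X$ agrees both with centering the columns directly and with NumPy's `np.cov` under the $1/n$ normalisation.

```python
import numpy as np

# Illustrative design matrix: n = 5 observations, d = 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
n = X.shape[0]

# Centering matrix H = I_n - (1/n) * 1 1^T
ones = np.ones((n, 1))
H = np.eye(n) - (ones @ ones.T) / n

# Empirical covariance: Sigma = (1/n) * X^T H X
Sigma = X.T @ H @ X / n

# Equivalent: subtract the column means, then take (1/n) * (X - X_bar)^T (X - X_bar)
X_centered = X - X.mean(axis=0)
Sigma_direct = X_centered.T @ X_centered / n

# Both agree with NumPy's covariance (bias=True selects the 1/n normalisation)
assert np.allclose(Sigma, Sigma_direct)
assert np.allclose(Sigma, np.cov(X, rowvar=False, bias=True))
```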
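
The affine-transformation rule can be checked numerically in the same spirit. The matrix $\mathbf A$ and offset $\mathbf B$ below are arbitrary illustrative choices mapping $d = 3$ features to $k = 2$, applied to the same random design matrix as in the previous snippet.

```python
import numpy as np

# Same illustrative data and empirical covariance as above
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Sigma = np.cov(X, rowvar=False, bias=True)

# Arbitrary (k x d) matrix A and constant offset B, chosen only for illustration
A = np.array([[1.0, 0.5, -1.0],
              [0.0, 2.0,  1.0]])
B = np.array([3.0, -1.0])

# Apply the affine map row-wise: Y_i = A X_i + B
Y = X @ A.T + B

# The covariance of the transformed data equals A Sigma A^T;
# the constant shift B drops out entirely.
assert np.allclose(np.cov(Y, rowvar=False, bias=True), A @ Sigma @ A.T)
```

The assertion holds up to floating-point error because the shift $\mathbf B$ cancels when the columns of $\mathbf Y$ are centered, leaving only the linear part $\mathbf A$ acting on the centered data.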