**Covariance for Two Random Variables:**
The [[Covariance]] between two [[Random Variable|r.v.'s]] is simply the [[Expectation of a Product]] of $X$ and $Y$ after each has been centered around its respective mean.
$ \mathrm{Cov}(X,Y)=\mathbb E\Big[\big(X-\mathbb E[X]\big)\big( Y-\mathbb E[Y]\big)\Big] $
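A minimal NumPy sketch (the synthetic data and variable names are purely illustrative): it estimates the expectation above with a sample mean and compares the result against `np.cov` using the $1/n$ normalisation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(size=1000)          # y is correlated with x

# Cov(X, Y) = E[(X - E[X]) (Y - E[Y])], estimated by the sample mean
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# np.cov defaults to the unbiased 1/(n-1) estimator; bias=True matches the 1/n version above
print(cov_xy, np.cov(x, y, bias=True)[0, 1])
```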
**Covariance for Multiple Random Variables:**
Now, however, we have a [[Vector Operations|Vector]] of attributes $X= \begin{bmatrix} X^{(1)} & \cdots & X^{(d)}\end{bmatrix}^T$ and we are interested in the covariance terms coming from all pairwise combinations of its components. We therefore take an outer product of the centered vector with itself.
Multiplying the $(d \times 1)$ vector $\big(X-\mathbb E[X]\big)$ by its $(1 \times d)$ transpose yields a $(d \times d)$ matrix, the covariance matrix $\Sigma$.
$ \begin{align}
\mathbf{Cov}(X) = \Sigma &= \mathbb E\Big[\big(X-\mathbb E[X]\big)\big(X-\mathbb E[X]\big)^T\Big] \\[4pt]
\Sigma_{ij} &= \mathbb E\Big[\big(X^{(i)}-\mathbb E[X^{(i)}]\big)\big(X^{(j)}-\mathbb E[X^{(j)}]\big)\Big]
\end{align}
$
Each diagonal element $\Sigma_{ii}$ measures the covariance of $X^{(i)}$ with itself, i.e. its variance. Thus the variance of each component appears along the diagonal.
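A small sketch of this outer-product view (synthetic data, illustrative names only): averaging the outer products of the centered sample vectors reproduces `np.cov`, and the diagonal recovers the per-feature variances.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 3
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # correlated features

Xc = X - X.mean(axis=0)                                  # center each feature
# average the outer products (x_k - mean)(x_k - mean)^T over all samples
Sigma = sum(np.outer(row, row) for row in Xc) / n

print(np.allclose(Sigma, np.cov(X, rowvar=False, bias=True)))  # True
print(np.allclose(np.diag(Sigma), Xc.var(axis=0)))             # diagonal = variances
```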
## Matrix Representation
For a dataset (design matrix) $\mathbf X$, we can compute the empirical covariance matrix. $\mathbf X$ is constructed so that each row represents one observation and each column collects all observations of a single feature. Thus $\mathbf X$ has shape $(n \times d)$ with $n$ observations and $d$ features.
$
\mathbf X = \begin{pmatrix}
\leftarrow&\mathbf X_1^T & \rightarrow \\ \vdots & \vdots & \vdots \\[3pt] \leftarrow&\mathbf X_n^T & \rightarrow
\end{pmatrix}
$
The covariance matrix can also be written as follows, where:
- $\mathbf I_n$ is an $(n \times n)$ identity matrix
- $\mathbf 1_n$ is an $(n \times 1)$ column vector where all entries are $1$.
$
\begin{align}
\Sigma &= \frac{1}{n}\mathbf X^T \Big(\mathbf I_n-\frac{1}{n} \mathbf 1_n \mathbf 1_n^T\Big) \mathbf X \\[8pt]
\Sigma &= \frac{1}{n}\mathbf X^T \Big(\underbrace{\big(\mathbf I_n \mathbf X\big)- \underbrace{\big(\tfrac{1}{n} \mathbf 1_n \mathbf 1_n^T \mathbf X\big)}_{\bar{\mathbf X}}}_{\mathbf X-\bar{\mathbf X}}\Big)
\end{align}
$
This closely resembles the first definition, where we multiply $(X- \mathbb E[X])$ by its transpose and then take the expectation. Here we take averages instead, since we compute the empirical covariance matrix.
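The centering-matrix identity can be checked numerically. The following sketch (with made-up random data) builds $\mathbf I_n - \frac{1}{n}\mathbf 1_n \mathbf 1_n^T$ explicitly and compares the result against centering the columns directly.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 4
X = rng.normal(size=(n, d))

I = np.eye(n)
ones = np.ones((n, 1))
H = I - ones @ ones.T / n                       # centering matrix I_n - (1/n) 1 1^T

Sigma = X.T @ H @ X / n                         # (1/n) X^T (I_n - (1/n) 1 1^T) X
Xc = X - X.mean(axis=0)                         # subtract the column means directly
Sigma_direct = Xc.T @ Xc / n

print(np.allclose(Sigma, Sigma_direct))         # True
```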
## Affine Transformation
Recall that when $X$ was a univariate r.v., we could apply transformations to $X$ and still make direct statements about the [[Variance after Linear Transformation]].
$ \mathrm{Var}(aX+b) = a^2 \mathrm{Var}(X) $
We can make an analogous statement for the covariance matrix $\Sigma$. Assume we apply an affine transformation to the vector $X$, multiplying by a matrix $\mathbf A$ and adding a constant vector $\mathbf B$, and want to know how $\Sigma$ looks afterwards. The additive constant only shifts the mean, so it drops out, just as $b$ did in the univariate case.
$
\mathrm{Cov}(\mathbf AX+ \mathbf B) = \mathrm{Cov}(\mathbf AX)= \mathbf A\,\mathrm{Cov}(X)\mathbf A^T = \mathbf A \Sigma \mathbf A^T
$
>[!note]
>We arrange $\mathbf A$ and $\mathbf A^T$ so that the matrix multiplication is dimensionally consistent, with the $(d \times d)$ covariance matrix $\Sigma$ in the middle.
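A quick numerical check of the affine-transformation rule (again with illustrative random data): for empirical covariances the identity $\mathrm{Cov}(\mathbf A X + \mathbf B) = \mathbf A \Sigma \mathbf A^T$ holds exactly up to floating-point error, because centering removes the additive constant.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 10_000, 3
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # samples with some covariance

A = rng.normal(size=(2, d))          # maps d-dimensional vectors to 2-dimensional ones
b = rng.normal(size=2)               # constant shift, drops out of the covariance

Y = X @ A.T + b                      # apply the affine map row-wise
Sigma_X = np.cov(X, rowvar=False, bias=True)
Sigma_Y = np.cov(Y, rowvar=False, bias=True)

print(np.allclose(Sigma_Y, A @ Sigma_X @ A.T))   # True
```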