## Distribution of $\hat \beta$

We start the derivation from the [[Multivariate Linear Regression#Closed-Form Solution|Closed-Form Solution]] of [[Multivariate Linear Regression]] for $\hat \beta$, obtained during least-squares estimation (or equivalently [[Maximum Likelihood Estimation|MLE]]).

$
\begin{align}
\hat \beta &= (\mathbb X^T \mathbb X)^{-1} \mathbb X^T \mathbf Y \tag{1}\\[6pt]
&= (\mathbb X^T \mathbb X)^{-1} \mathbb X^T (\mathbb X \beta^\star + \epsilon) \tag{2}\\[6pt]
&= (\mathbb X^T \mathbb X)^{-1} \mathbb X^T \mathbb X \beta^\star + (\mathbb X^T \mathbb X)^{-1} \mathbb X^T \epsilon \tag{3}\\[6pt]
&= \beta^\star + \mathcal N_p\Big(0, (\mathbb X^T \mathbb X)^{-1} \mathbb X^T (\sigma^2 \mathbf I_n) \mathbb X(\mathbb X^T \mathbb X)^{-1} \Big) \tag{4}\\[2pt]
&= \beta^\star + \mathcal N_p\Big(0,\sigma^2\big((\mathbb X^T \mathbb X)^{-1} \mathbb X^T \mathbb X(\mathbb X^T \mathbb X)^{-1} \big)\Big) \tag{5}\\[2pt]
&= \beta^\star + \mathcal N_p\Big(0,\sigma^2(\mathbb X^T \mathbb X)^{-1}\Big) \tag{6}\\[2pt]
&= \mathcal N_p\Big(\beta^\star,\sigma^2(\mathbb X^T \mathbb X)^{-1}\Big) \tag{7}
\end{align}
$
^540ccb

where:

- (3) The product of the matrix $(\mathbb X^T \mathbb X)$ with its [[Inverse Matrix]] equals the identity matrix $\mathbf I_p$, which is why the first term reduces to $\beta^\star$ in (4).
- (4) We can model the second term as a [[Gaussian Distribution|Gaussian]] with the following behavior:
    - Since $\epsilon$ is Gaussian, the product of the deterministic matrix $(\mathbb X^T \mathbb X)^{-1} \mathbb X^T$ with $\epsilon$ is also Gaussian.
    - Since $\mathbb E[\epsilon]=0$, the [[Expectation]] of this product is also $0$.
    - The $\mathrm{Var}\big((\mathbb X^T \mathbb X)^{-1} \mathbb X^T \epsilon\big)$ can be obtained from the fact $\mathrm{Var}(Ax) = A \, \mathrm{Var}(x) \, A^T$, applied with $A = (\mathbb X^T \mathbb X)^{-1} \mathbb X^T$ and $\mathrm{Var}(\epsilon) = \sigma^2 \mathbf I_n$.
- (7) Since the Gaussian term has mean $0$ and $\beta^\star$ is a deterministic vector, $\hat \beta$ is Gaussian with mean $\beta^\star$, i.e. $\hat \beta$ is an unbiased estimator of $\beta^\star$, which is a desired outcome.

**Conclusion:**

$
\hat \beta \sim \mathcal N_p\Big(\beta^\star,\sigma^2(\mathbb X^T \mathbb X)^{-1}\Big)
$

To understand the [[Variance]] term, we can evaluate $(\mathbb X^T \mathbb X)^{-1}$ in the simplest case of a single covariate ($p=1$).

- In this case, $\mathbb X^T$ is just a row vector of all observations for this covariate. This makes $\mathbb X^T \mathbb X$ equal to the sum of squared observed values for this particular covariate.
- Since we then need to take its inverse, the larger the sum of squared observations for a given covariate, the smaller the variance of $\hat \beta$. This makes sense, as more spread-out values of $\mathbb X$ give a more stable regression line (see the example below).

![[linear-regression-line.png|center|500]]
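To make the conclusion concrete, below is a minimal NumPy simulation sketch (the values of $n$, $p$, $\sigma$, $\beta^\star$ and the design matrix are made up for illustration): it keeps $\mathbb X$ fixed, redraws the noise many times, and checks that the empirical mean and covariance of $\hat \beta$ are close to $\beta^\star$ and $\sigma^2(\mathbb X^T \mathbb X)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 3                              # observations, coefficients (made up)
sigma = 2.0                                # true noise standard deviation (made up)
beta_star = np.array([1.0, -0.5, 2.0])     # true coefficients (made up)

X = rng.normal(size=(n, p))                # fixed design matrix
XtX_inv = np.linalg.inv(X.T @ X)

# Redraw the noise many times and recompute the closed-form estimate each time.
betas = []
for _ in range(20_000):
    eps = rng.normal(scale=sigma, size=n)
    Y = X @ beta_star + eps
    betas.append(XtX_inv @ X.T @ Y)        # beta_hat = (X^T X)^{-1} X^T Y
betas = np.array(betas)

print(betas.mean(axis=0))                  # ≈ beta_star (unbiasedness)
print(np.cov(betas, rowvar=False))         # ≈ sigma^2 (X^T X)^{-1}
print(sigma**2 * XtX_inv)
```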
## Other Properties

**Degrees of freedom:** In linear regression the degrees of freedom are $(n-p)$, the number of observations minus the number of parameters (coefficients) that we want to estimate. Since each $\beta$ parameter uses information from the $n$ observations, every estimated parameter reduces the degrees of freedom by one.

- *Example:* When we know the sample average $\bar X_n$ and we know all of $X_1, \dots, X_{n-1}$, then $X_n$ is not free to vary anymore.

**Unbiased estimator of $\sigma^2$:** We need to divide by $(n-p)$, the remaining degrees of freedom, rather than by $n$. The more parameters we estimate, the more the fit shrinks the residuals, so dividing by $n$ would underestimate $\sigma^2$.

$
\hat \sigma^2 = \frac{\| \mathbf Y-\mathbb X \hat \beta \| _2^2}{n-p} = \frac{1}{n-p} \sum_{i=1}^n \hat \epsilon_i^2
$

**Prediction error:** The quantity $\| \mathbf Y-\mathbb X \hat \beta \|_2^2$ is also known as the residual sum of squares (RSS). Taking the expectation of this squared [[Vector Length#Norm Notation|Norm]] of the residual vector gives

$
\mathbb E \big[ \| \mathbf Y- \mathbb X \hat \beta \|_2^2 \big]=\mathbb E[\hat \sigma^2]\cdot(n-p)=\sigma^2 \cdot (n-p)
$
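As a quick numerical check of both claims, here is a minimal NumPy sketch in the same spirit as above (again with made-up values): it compares the average RSS to $\sigma^2(n-p)$ and shows that dividing by $(n-p)$, unlike dividing by $n$, gives an approximately unbiased estimate of $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(1)

n, p = 50, 4                           # observations, coefficients (made up)
sigma = 1.5                            # true noise standard deviation (made up)
beta_star = rng.normal(size=p)         # arbitrary true coefficients
X = rng.normal(size=(n, p))            # fixed design matrix
XtX_inv = np.linalg.inv(X.T @ X)

rss = []
for _ in range(20_000):
    eps = rng.normal(scale=sigma, size=n)
    Y = X @ beta_star + eps
    beta_hat = XtX_inv @ X.T @ Y
    residuals = Y - X @ beta_hat
    rss.append(residuals @ residuals)  # ||Y - X beta_hat||_2^2
rss = np.array(rss)

print(rss.mean(), sigma**2 * (n - p))  # expected RSS ≈ sigma^2 (n - p)
print((rss / (n - p)).mean())          # ≈ sigma^2  (unbiased: divide by n - p)
print((rss / n).mean())                # < sigma^2  (biased: divide by n)
```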