There are several key assumptions that need to be respected in [[Univariate Linear Regression]] and [[Multivariate Linear Regression]].
## Linearity
The relationship between the independent variables $\mathbb X$ and the dependent variable $\mathbf Y$ is linear. This can be assessed with a plot of the residuals against the fitted values $\hat{\mathbf Y}$: under linearity, the residuals scatter randomly around zero without a systematic pattern.
$ \mathbb E[Y \vert X] = X \beta^\star$
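Below is a minimal sketch of such a residual plot, assuming a small synthetic dataset and an OLS fit via NumPy's least-squares solver (all variable names and values are illustrative):
```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic data: one covariate plus an intercept column
n = 200
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=n)

X = np.column_stack([np.ones(n), x])            # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ beta_hat                            # fitted values
residuals = y - y_hat

# Under linearity the residuals scatter around zero with no visible pattern
plt.scatter(y_hat, residuals, s=10)
plt.axhline(0, color="black", linewidth=1)
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()
```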
## Assumptions on Design Matrix
**Deterministic Observations:**
We view observations in $\mathbb X$ as deterministic, and not as [[Random Variable|Random Variables]]. This avoids complications in the matrix algebra for the [[Univariate Linear Regression#Least Squares Estimator|Least Squares Estimator]].
**Full Rank:**
The design matrix must have [[Matrix Rank|rank]] $p$. This means [[Linearly Independent Vectors|linear independence]] across all covariates ("features"). Hence none of them can be replicated by a linear combination of the other covariates.
$ \text{Rank}(\mathbb X)=p $
- *Example:* When we include the same feature twice, there are infinitely many ways to set their coefficients $(\beta_1, \beta_2)$ that all yield exactly the same squared loss (see the sketch after this example), e.g.
$
\begin{align}
&\beta_1 =0;&& \beta_2=1 \\
&\beta_1 =1;&& \beta_2=0 \\
&\beta_1 =0.3;&& \beta_2=0.7
\end{align}
$
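A small numerical sketch of this non-identifiability, assuming a synthetic dataset in which the same feature is included twice (names and values are illustrative):
```python
import numpy as np

rng = np.random.default_rng(1)

n = 100
x = rng.normal(size=n)
y = 1.0 + 1.0 * x + rng.normal(0, 0.1, size=n)

# Design matrix with an intercept and the same feature included twice
X = np.column_stack([np.ones(n), x, x])
print(np.linalg.matrix_rank(X))                 # 2, not 3 -> X^T X is singular

def squared_loss(beta):
    return np.sum((y - X @ beta) ** 2)

# Any split of the coefficient mass between the duplicated columns
# gives exactly the same squared loss
print(squared_loss(np.array([1.0, 0.0, 1.0])))
print(squared_loss(np.array([1.0, 1.0, 0.0])))
print(squared_loss(np.array([1.0, 0.3, 0.7])))
```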
**No Underdetermined System:**
The number of observations $n$ must be at least the number of parameters $p$. In the case where $n<p$, the problem has infinitely many solutions with squared loss $=0$.
- *Example:* When $n=1$ and $p=2$ and the single observation is $\{x=3, y=5\}$, there are infinitely many solutions to the single equation (see the sketch below).
$ \beta_0+3\beta_1=5 $
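A sketch of the same issue for $n<p$, using the single observation from the example above:
```python
import numpy as np

# One observation, two parameters: beta_0 + 3 * beta_1 = 5
X = np.array([[1.0, 3.0]])   # intercept column and x = 3
y = np.array([5.0])

def squared_loss(beta):
    return np.sum((y - X @ beta) ** 2)

# Infinitely many parameter vectors achieve squared loss 0
print(squared_loss(np.array([5.0, 0.0])))       # 0.0
print(squared_loss(np.array([2.0, 1.0])))       # 0.0
print(squared_loss(np.array([-1.0, 2.0])))      # 0.0
```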
>[!note:]
>Full rank and sufficient observations are required **only for the closed-form solution** of the least squares estimator. When using iterative methods like [[Gradient Descent]], these conditions can be relaxed.
>[!note:]
>When we deal with an underdetermined system or linear dependence, we can introduce [[Ridge Regression]] ($L_2$ penalty), which does not rely on the invertibility of $\mathbb X^T\mathbb X$. The penalty on the $\beta$ coefficients makes $\mathbb X^T\mathbb X + \lambda \mathbf I$ invertible, so a unique solution still exists.
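A minimal sketch of how the ridge penalty restores a unique closed-form solution for a rank-deficient design matrix (the penalty strength $\lambda$ and all names are illustrative):
```python
import numpy as np

rng = np.random.default_rng(2)

n = 100
x = rng.normal(size=n)
y = 1.0 + 1.0 * x + rng.normal(0, 0.1, size=n)

# Rank-deficient design: the same feature appears twice
X = np.column_stack([np.ones(n), x, x])

lam = 0.1
p = X.shape[1]

# X^T X is singular, but X^T X + lam * I is invertible,
# so the ridge estimator is unique
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_ridge)   # the duplicated columns receive identical coefficients
```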
## Assumptions on Residuals
**Independence:**
The residuals are all independent of each other. In the context of time series data, this means that there is no autocorrelation in residuals.
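One common check in the time-series setting is the Durbin-Watson statistic; a sketch using statsmodels on synthetic data with independent errors (values near 2 suggest no first-order autocorrelation, values toward 0 or 4 indicate positive or negative autocorrelation):
```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)

# Synthetic time-series style data with independent errors
n = 300
t = np.arange(n, dtype=float)
y = 0.5 + 0.02 * t + rng.normal(0, 1.0, size=n)

X = np.column_stack([np.ones(n), t])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

# A value close to 2 suggests no first-order autocorrelation
print(durbin_watson(residuals))
```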
**Mean Zero:**
The residuals must have an expectation of zero. This ensures that the regression model is unbiased.
$ \mathbb E[\epsilon]=0$
**Homoscedasticity:**
The variance of the residuals is constant across all levels of the independent variables. Thus, as the value of any covariate $\mathbb X^{(i)}$ increases or decreases, the spread of the residuals should remain the same and not show any trend. This can be assessed with residual plots against each covariate.
$ \mathrm{Var}(\epsilon_i \vert X)=\sigma^2 \quad \forall i $
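Beyond visual inspection, a Breusch-Pagan test could be used; a sketch with statsmodels on deliberately heteroscedastic synthetic data (names and values are illustrative):
```python
import numpy as np
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(4)

n = 500
x = rng.uniform(1, 10, size=n)
# Error variance grows with x -> heteroscedastic on purpose
y = 2.0 + 0.5 * x + rng.normal(0, 0.3 * x)

X = np.column_stack([np.ones(n), x])            # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

# het_breuschpagan regresses the squared residuals on the covariates;
# a small p-value indicates that the constant-variance assumption is violated
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(residuals, X)
print(lm_pvalue)
```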
**Normality:**
The residuals follow a [[Gaussian Distribution]].
$ \epsilon \stackrel{iid}{\sim} \mathcal N_n(0, \sigma^2 \,\mathbf I_n) $
- Due to homoscedasticity, all residuals have the same variance $\sigma^2$.
- Due to independence, the residuals form a [[Multivariate Gaussian]], whose [[Covariance]] is just $\sigma^2$ times the identity matrix $\mathbf I_n$ of size $(n \times n)$.
- Due to construction of the least-squares estimator, the expectation of the residuals is zero.
We can assess normality of residuals visually via [[QQ-Plots]] or analytically via the [[Shapiro-Wilk Test]].
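A sketch of both checks on the fitted residuals using SciPy, assuming synthetic data (names are illustrative):
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(5)

n = 200
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=n)

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

# QQ plot: points close to the reference line support normality
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()

# Shapiro-Wilk: a large p-value means normality is not rejected
stat, p_value = stats.shapiro(residuals)
print(stat, p_value)
```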
>[!note:]
>In this notation we look at all epsilon terms (each as a separate r.v.) at once, assembled into a single vector. The diagonal covariance $\sigma^2 \mathbf I_n$ expresses that there is no interdependence between them.
**Uncorrelation with Covariates (Exogeneity of Error Terms):**
The residuals must be uncorrelated with the independent variables. If they are correlated, it suggests that some influence on $\mathbf Y$ is not reflected by the current selection of covariates in the model.
$\mathrm{Cov}(\mathbb X, \epsilon)=0$
>[!note:]
>While the least-squares estimator ensures $\mathrm{Cov}(\mathbb X, \hat \epsilon)=0$, the “no endogeneity” assumption requires $\mathrm{Cov}(\mathbb X, \epsilon)=0$ regarding the true/unknown $\epsilon$.
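A small numerical illustration of this note: the fitted residuals $\hat\epsilon$ satisfy $\mathbb X^T \hat\epsilon = 0$ by construction of the normal equations (synthetic data, illustrative names):
```python
import numpy as np

rng = np.random.default_rng(6)

n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(0, 1.0, size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

# The normal equations force X^T (y - X beta_hat) = 0,
# so the fitted residuals are uncorrelated with every covariate
print(X.T @ residuals)                          # ~ zero vector
print(np.corrcoef(x1, residuals)[0, 1])         # ~ 0
print(np.corrcoef(x2, residuals)[0, 1])         # ~ 0
```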
## Dependent Variable
Since we have assumed that $\mathbb X$ is deterministic, $\mathbf Y$ has the same distribution shape as $\epsilon$, with its expectation shifted by $\mathbb X\beta^\star$.
$ \mathbf Y \sim \mathcal N_n(\mathbb X \beta^\star, \sigma^2 \, \mathbf I_n) $
For each individual observation:
$ \mathbf Y_i\sim\mathcal N(X_i^T\beta^\star, \sigma^2) $
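A sketch of this generative view, sampling $\mathbf Y$ for a fixed, deterministic design matrix (parameter values are illustrative):
```python
import numpy as np

rng = np.random.default_rng(7)

# Fixed (deterministic) design matrix and true parameters
n = 5
X = np.column_stack([np.ones(n), np.arange(1.0, n + 1.0)])
beta_star = np.array([2.0, 0.5])
sigma = 1.0

# Y = X beta* + eps, with eps ~ N(0, sigma^2 I_n)
eps = rng.normal(0, sigma, size=n)
Y = X @ beta_star + eps

# Each Y_i is then N(X_i^T beta*, sigma^2)
print(X @ beta_star)   # the vector of expectations
print(Y)               # one realisation around those expectations
```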