There are several key assumptions that need to be respected in [[Univariate Linear Regression]] and [[Multivariate Linear Regression]].

## Linearity

The relationship between the independent variables $\mathbb X$ and the dependent variable $\mathbf Y$ is linear. This can be assessed with a residual plot along the different values of $\mathbf Y$.

$$ \mathbb E[\mathbf Y \vert \mathbb X] = \mathbb X \beta^\star $$

## Assumptions on Design Matrix

**Deterministic Observations:** We view the observations in $\mathbb X$ as deterministic, not as [[Random Variable|Random Variables]]. This avoids complications in the matrix algebra for the [[Univariate Linear Regression#Least Squares Estimator|Least Squares Estimator]].

**Full Rank:** The design matrix must have [[Matrix Rank|rank]] $p$. This means [[Linearly Independent Vectors|linear independence]] across all covariates ("features"); hence none of them can be replicated by a linear combination of the other covariates.

$$ \text{Rank}(\mathbb X)=p $$

- *Example:* When we include the same feature twice, there are infinitely many ways to set the two coefficients $(\beta_1, \beta_2)$ and retrieve the exact same squared loss.

$$ \begin{align} &\beta_1 =0;&& \beta_2=1 \\ &\beta_1 =1;&& \beta_2=0 \\ &\beta_1 =0.3;&& \beta_2=0.7 \end{align} $$

**No Underdetermined System:** The number of observations $n$ needs to be $\ge$ the number of parameters $p$. In the case where $n<p$, the problem has infinitely many solutions with squared loss $=0$.

- *Example:* When $n=1$ and $p=2$ and the single observation is $\{x=3, y=5\}$, there are infinitely many solutions to the equation

$$ \beta_0+3\beta_1=5 $$

>[!note]
>Full rank and sufficient observations are required **only for the closed-form solution** of the least squares estimator. When using iterative methods like [[Gradient Descent]], these conditions can be relaxed.

>[!note]
>When we deal with an underdetermined system or linear dependence, we can introduce [[Ridge Regression]] (L2), which does not rely on the invertibility of $\mathbb X^T\mathbb X$. It still finds a unique solution due to the introduced penalty on the $\beta$ coefficients.

## Assumptions on Residuals

**Independence:** The residuals are all independent of each other. In the context of time series data, this means that there is no autocorrelation in the residuals.

**Mean Zero:** The residuals must have an expectation of zero. This ensures that the regression model is unbiased.

$$ \mathbb E[\epsilon]=0 $$

**Homoscedasticity:** The variance of the residuals is constant across all levels of the independent variables. Thus, as the value of any covariate $\mathbb X^{(i)}$ increases or decreases, the spread of the residuals should remain the same and not show any trend. This can be assessed with residual plots along each covariate.

$$ \mathrm{Var}(\epsilon_i \vert X)=\sigma^2 \quad \forall i $$

**Normality:** The residuals follow a [[Gaussian Distribution]].

$$ \epsilon \stackrel{iid}{\sim} \mathcal N_n(0, \sigma^2 \,\mathbf I_n) $$

- Due to homoscedasticity, all residuals have the same variance $\sigma^2$.
- Due to independence, the residuals form a [[Multivariate Gaussian]], whose [[Covariance]] is just $\sigma^2$ times the identity matrix $\mathbf I_n$ of size $(n \times n)$.
- Due to the construction of the least-squares estimator, the expectation of the residuals is zero.

We can assess normality of residuals visually via [[QQ-Plots]] or analytically via the [[Shapiro-Wilk Test]].

>[!note]
>In this notation we look at all epsilon terms (each as a separate r.v.) at once, assembled in a single vector. This highlights the (non-existing) interdependence between them, stated by the $\sigma^2 \mathbf I_n$ term.
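The residual checks above can be scripted. Below is a minimal sketch in Python (not from the original note; it assumes `numpy` and `scipy` are available, and the toy data and variable names are hypothetical) that fits least squares, verifies the mean-zero property, and runs the Shapiro-Wilk test on the residuals.

```python
import numpy as np
from scipy import stats

# Toy data: n observations, an intercept column plus p random covariates.
rng = np.random.default_rng(0)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Least squares fit and residuals.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

# Mean zero: numerically close to 0 by construction of the estimator.
print("mean(residuals):", residuals.mean())

# Normality: Shapiro-Wilk test (null hypothesis: residuals are Gaussian).
stat, p_value = stats.shapiro(residuals)
print("Shapiro-Wilk statistic:", stat, "p-value:", p_value)

# Homoscedasticity / linearity: inspect residuals against each covariate,
# e.g. with matplotlib: plt.scatter(X[:, 1], residuals)
```

A large Shapiro-Wilk p-value gives no evidence against normality; the residual plots against each covariate should show no trend if homoscedasticity holds.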
**Uncorrelatedness with Covariates (Exogeneity of Error Terms):** The residuals must be uncorrelated with the independent variables. If they are correlated, it suggests that some causal relationship on $\mathbf Y$ is not reflected by the current selection of covariates in the model.

$$ \mathrm{Cov}(\mathbb X, \epsilon)=0 $$

>[!note]
>While the least-squares estimator ensures $\mathrm{Cov}(\mathbb X, \hat \epsilon)=0$, the "no endogeneity" assumption requires $\mathrm{Cov}(\mathbb X, \epsilon)=0$ regarding the true/unknown $\epsilon$.

## Dependent Variable

Since we have assumed that $\mathbb X$ is deterministic, we can state that $\mathbf Y$ has the same distribution shape as $\epsilon$, but with its expectation shifted by $\mathbb X\beta^\star$.

$$ \mathbf Y \stackrel{iid}{\sim} \mathcal N_n(\mathbb X \beta^\star, \sigma^2 \, \mathbf I_n) $$

For each individual observation:

$$ \mathbf Y_i\sim\mathcal N(X_i^T\beta^\star, \sigma^2) $$
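As an illustration of this last point, the sketch below (hypothetical example, assuming only `numpy`) repeatedly draws $\mathbf Y$ from $\mathcal N_n(\mathbb X \beta^\star, \sigma^2 \mathbf I_n)$ with a fixed design matrix and shows that the closed-form least squares estimator recovers $\beta^\star$ on average.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 200, 0.5
# Fixed (deterministic) design matrix: intercept plus one covariate.
X = np.column_stack([np.ones(n), rng.uniform(-1.0, 1.0, size=n)])
beta_star = np.array([2.0, -1.5])

estimates = []
for _ in range(1000):
    # Y has the same Gaussian shape as the noise, shifted by X @ beta_star.
    Y = X @ beta_star + rng.normal(scale=sigma, size=n)
    # Closed-form least squares estimator (X^T X)^{-1} X^T Y.
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    estimates.append(beta_hat)

print("average estimate:", np.mean(estimates, axis=0))  # close to beta_star
```

Because $\mathbb X$ is held fixed across draws, all randomness in $\mathbf Y$ comes from $\epsilon$, which is exactly what the distributional statement above expresses.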