Compared to [[Univariate Linear Regression]], we have multiple independent variables (covariates) in the multivariate setup. For shorter notation we make use of [[Vector Operations|Vectors]] (bold) and matrices (blackboard letters).

$ \mathbf Y = \mathbb X \beta^\star+ \epsilon $

| Term          | Size         | Name                                  | Description                                                                                           |
| ------------- | ------------ | ------------------------------------- | ----------------------------------------------------------------------------------------------------- |
| $\mathbf Y$   | $n \times 1$ | Dependent variable                    | $\mathbf Y_i$ is the outcome for the $i$-th observation.                                               |
| $\mathbb X$   | $n \times p$ | Independent variables (design matrix) | $\mathbf X_i$ is the $i$-th row of the matrix. It collects the covariates for the $i$-th observation.  |
| $\beta^\star$ | $p \times 1$ | Parameters                            | $\beta^\star_j$ is the slope coefficient for the $j$-th covariate.                                     |
| $\epsilon$    | $n \times 1$ | Error term                            | $\epsilon_i$ is the unobserved noise for the $i$-th observation.                                       |

![[multivariate-linear-regression-1.png|center|500]]

**Intercept:** To include an intercept, we add an *additional column vector* $(n \times 1)$ filled with ones to the design matrix $\mathbb X$ (usually as the first column). We also extend $\beta^\star$ by an additional element (also at the first position) corresponding to the intercept.

![[multivariate-linear-regression-2.png|center|500]]

## Closed-Form Solution

We look for the $\hat \beta$ where the squared difference between the prediction and the actual outcome is minimal. To find it, we take the derivative of the squared loss and set it to zero. This yields a closed-form solution with a unique optimum, provided the conditions below are met.

$ \hat \beta = \arg \min_{\beta \in \mathbf R^p} \, \lVert \mathbf Y-\mathbb X \beta \rVert_2^2 $

To take the derivative with respect to the $(p \times 1)$ vector $\beta$, we switch from matrix form to the explicit notation.

$ \begin{aligned} f(\beta)&= \lVert \mathbf Y- \mathbb X\beta \rVert_2^2\\[6pt] f(\beta)&=\sum_{i=1}^n(y_i-x_0^{(i)}\beta_0- \dots -x_p^{(i)}\beta_p)^2 \end{aligned} $ ^535c58

From there it becomes obvious how to compute the [[Partial Derivative]] of each $\beta$ component separately. We collect all partial derivatives in a [[Gradient Descent#Gradient Vector|Gradient Vector]] $\nabla f(\beta)$ and set it to zero.

$ \nabla f(\beta) = \begin{pmatrix} \frac{\partial f}{\partial \beta_0} \\ \vdots \\[3pt] \frac{\partial f}{\partial \beta_p}\end{pmatrix} \stackrel{!}{=}0 $

Finally we can merge all partial derivatives together to get back into matrix form.

$ \begin{rcases} \frac{\partial f}{\partial \beta_0}&=\sum_{i=1}^n2*(\cdots)*(-x_0^{(i)}) \\ \,\,\, \vdots &= \,\,\,\vdots\\[3pt] \frac{\partial f}{\partial \beta_p}&=\sum_{i=1}^n2*(\cdots)*(-x_p^{(i)}) \end{rcases} =-2\,\mathbb X^T(\mathbf Y - \mathbb X\beta) $

Setting the matrix form to zero corresponds to a zero vector (a vector filled with zeros) with $p$ elements.

$ \begin{align} -2 \mathbb X^T(\mathbf Y- \mathbb X\hat \beta)&\stackrel{!}{=}0 \tag{1}\\[6pt] -2\mathbb X^T\mathbf Y&= -2\mathbb X^T\mathbb X\hat \beta \tag{2}\\[6pt] (\mathbb X^T\mathbb X)^{-1}\,\, \mathbb X^T\mathbf Y&= {(\mathbb X^T\mathbb X)^{-1} \,\, \mathbb X^T\mathbb X}\hat \beta \tag{3}\\[6pt] (\underbrace{\mathbb X^T\mathbb X}_{p \times p})^{-1} \mathbb X^T \mathbf Y &= \hat \beta \tag{4} \end{align} $

(3) We multiply both sides with $(\mathbb X^T\mathbb X)^{-1}$, because this cancels out the $\mathbb X^T\mathbb X$ on the right side. For that step we require $\mathbb X^T\mathbb X$ to be invertible.
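As a small illustration of the closed-form estimator $\hat \beta = (\mathbb X^T\mathbb X)^{-1}\mathbb X^T\mathbf Y$, here is a minimal NumPy sketch on hypothetical toy data (the data, dimensions, and variable names are only assumptions for the example); it solves the linear system instead of forming the inverse explicitly.

```python
import numpy as np

# Hypothetical toy data: n = 100 observations, 3 covariates plus an intercept column of ones.
rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])   # design matrix with intercept
beta_star = np.array([2.0, 0.5, -1.0, 3.0])                  # assumed "true" parameters (intercept first)
Y = X @ beta_star + rng.normal(scale=0.1, size=n)            # Y = X beta* + eps

# Closed-form solution: beta_hat = (X^T X)^{-1} X^T Y.
# np.linalg.solve solves (X^T X) beta_hat = X^T Y, avoiding an explicit matrix inversion.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)  # should be close to beta_star
```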
> [!note]
> For requirements specific to the closed-form approach see [[Linear Regression Assumptions#Assumptions on Design Matrix|Assumptions on Design Matrix]].

> [!note]
> For further details, see [[Linear Regression with LSE]] and [[Linear Regression with MLE]].

## Gradient Descent

When $\mathbb X$ is too large, it can become infeasible to find the optimal $\beta$ with the closed-form solution, as the matrix inversion of $\mathbb X^T \mathbb X$ is computationally expensive. Alternatively, we can apply [[Gradient Descent]] with the following update rule:

$ \beta^{t+1} \leftarrow \beta^t - \eta_t \nabla f(\beta^t) $ ^574a4e

The derivative of the sum of squared losses is equal to the sum of the derivatives of the squared losses:

$ \nabla_\beta \left( \sum_{i=1}^n(y_i-x_i*\beta)^2\right) = \sum_{i=1}^n \nabla_\beta ( y_i-x_i*\beta)^2 = -2\sum_{i=1}^n(y_i-x_i*\beta)*x_i^T $

Plugging the gradient into the update rule:

$ \beta^{t+1} \leftarrow \beta^t+2*\eta_t \sum_{i=1}^n(y_i-\underbrace{x_i*\beta^t}_{\hat y_i})*x_i^T $

The $\beta^t$ is effectively updated by a weighted sum of the covariate vectors $x_i^T$, where the weights are the positive or negative residuals $(y_i - \hat y_i)$.

**Example:**
1. When we are underpredicting $(y_i>\hat y_i)$,
2. the residual $(y_i -x_i*\beta)$ will be positive.
3. A positively scaled version of $x_i^T$ is therefore added to the $\beta$ vector.
4. This makes $\beta^{t+1}$ more similar to $x_i^T$ itself. As both vectors are now more similar, their [[Dot Product]] $(\hat y_i)$ will increase.
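A minimal NumPy sketch of this update rule, assuming the same kind of toy data as in the closed-form example; the step size `eta` and number of steps are arbitrary illustrative choices and would need tuning in practice.

```python
import numpy as np

def gradient_descent(X, Y, eta=1e-3, steps=10_000):
    """Minimize ||Y - X beta||_2^2 with the rule beta^{t+1} <- beta^t - eta * grad f(beta^t)."""
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        residuals = Y - X @ beta         # (y_i - x_i * beta) for every observation
        grad = -2 * X.T @ residuals      # -2 * sum_i (y_i - x_i * beta) * x_i^T
        beta = beta - eta * grad
    return beta

# Hypothetical toy data; the estimate should approach the closed-form solution.
rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
Y = X @ np.array([2.0, 0.5, -1.0, 3.0]) + rng.normal(scale=0.1, size=n)
print(gradient_descent(X, Y))
```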