Ridge regression extends [[Multivariate Linear Regression|Linear Regression]] by adding an *L2 regularization* penalty to the loss function. This penalty discourages large values in the parameter vector $\beta$, improving stability and mitigating overfitting.

## Objective Function

The ridge regression objective function combines the squared loss from linear regression with a regularization term:

$ J_{n, \lambda}(\beta)=\frac{\lambda}{2} \lVert \beta \rVert^2+ \frac{1}{n} \sum_{i=1}^n \frac{1}{2}(y_i - x_i \cdot \beta)^2$

where:

- $\lambda$: Regularization parameter controlling the strength of the penalty
- $x_i$: Vector of all features of the $i$-th observation ($i$-th row of the design matrix $\mathbb X$)
- $y_i$: Response variable of the $i$-th observation

## Closed Form

As with [[Multivariate Linear Regression#Closed-Form Solution|linear regression]], we can solve for the optimal parameters $\beta$ in closed form. To do so, we differentiate $J_{n, \lambda}$ w.r.t. $\beta$ and set the gradient to zero. This yields:

$ \hat \beta = (\mathbb X^T \mathbb X+ \lambda \mathbf I)^{-1} \mathbb X^T \mathbf Y$

where $\mathbf I$ is the $p \times p$ identity matrix. (Strictly, for the averaged objective $J_{n,\lambda}$ above the penalty enters as $n\lambda \mathbf I$; the form shown corresponds to the summed squared loss, and the two differ only by a rescaling of $\lambda$.) The additional term $\lambda \mathbf I$ ensures that the matrix $\mathbb X^T \mathbb X + \lambda \mathbf I$ is invertible, even if $\mathbb X^T \mathbb X$ is not.

## Gradient Descent

For large datasets, where computing the closed-form solution is computationally expensive, [[Gradient Descent]] provides an iterative alternative. The gradient of the ridge regression loss w.r.t. $\beta$ is:

$ \nabla_\beta = \lambda \beta - \frac{1}{n}\sum_{i=1}^n\big(y^{(i)}-x^{(i)}\cdot \beta\big)\,x^{(i)}$

For a single observation $(x^{(i)}, y^{(i)})$ the gradient is:

$ \nabla_\beta^{(i)} = \lambda\beta -\big(y^{(i)}-x^{(i)}\cdot \beta \big)\,x^{(i)} $

We initialize $\beta=0$, then repeatedly pick a random observation and update $\beta$ in the direction of the negative gradient, scaled by the learning rate $\eta$.

![[Multivariate Linear Regression#^574a4e]]

Substituting the gradient expression:

$ \begin{align} &=\beta- \eta\big(\lambda\beta-(y^{(i)}-x^{(i)}\cdot \beta)\,x^{(i)}\big) \\[6pt] &=(1-\eta\lambda)\beta+ \eta\,(y^{(i)}-x^{(i)}\cdot \beta)\,x^{(i)} \end{align} $
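
To make the closed-form solution above concrete, here is a minimal NumPy sketch of $\hat \beta = (\mathbb X^T \mathbb X+ \lambda \mathbf I)^{-1} \mathbb X^T \mathbf Y$. The function name `ridge_closed_form` and the assumed shapes of `X` ($n \times p$) and `y` (length $n$) are illustrative choices, not part of the original note.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Closed-form ridge estimate: (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    # The lam * I term keeps X^T X + lam * I invertible even when X^T X is singular.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

Using `np.linalg.solve` instead of explicitly inverting the matrix is the usual numerically stabler choice; to match the averaged objective $J_{n,\lambda}$ exactly, pass `n * lam` in place of `lam`.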
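
Likewise, a minimal sketch of the stochastic update derived above, assuming the same `X` and `y`; the name `ridge_sgd` and the `n_steps` / `seed` parameters are illustrative, and a fixed learning rate $\eta$ is used for simplicity.

```python
import numpy as np

def ridge_sgd(X, y, lam, eta, n_steps=10_000, seed=0):
    """Stochastic gradient descent using the per-observation ridge gradient."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)                       # initialize beta = 0
    for _ in range(n_steps):
        i = rng.integers(n)                  # pick a random observation
        residual = y[i] - X[i] @ beta        # y^(i) - x^(i) . beta
        grad = lam * beta - residual * X[i]  # per-observation gradient
        beta -= eta * grad                   # i.e. (1 - eta*lam) * beta + eta * residual * X[i]
    return beta
```

The update line is exactly the last line of the derivation: the $(1-\eta\lambda)$ factor shrinks $\beta$ toward zero at every step, which is where the regularization acts.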