Ridge regression extends [[Multivariate Linear Regression|Linear Regression]] by adding an *L2 regularization* penalty to the loss function. This penalty discourages large values in the parameter vector $\beta$, improving numerical stability and mitigating overfitting.
## Objective Function
The ridge regression objective function combines the squared loss from linear regression with a regularization term:
$ J_{n, \lambda}(\beta)=\frac{\lambda}{2} \lVert \beta \rVert^2+ \frac{1}{n} \sum_{i=1}^n \frac{1}{2}(y_i - x_i \cdot \beta)^2$
where:
- $\lambda$: Regularization parameter controlling the strength of the penalty ($\lambda \ge 0$)
- $x_i$: Vector of all features of the $i$-th observation ($i$-th row in design matrix $\mathbb X$)
- $y_i$: Response variable of the $i$-th observation
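As a quick illustration, here is a minimal NumPy sketch of this objective (the function and variable names are my own choices, not from any particular library):

```python
import numpy as np

def ridge_objective(beta, X, y, lam):
    """Ridge objective J_{n,lambda}(beta): L2 penalty plus mean squared loss."""
    penalty = 0.5 * lam * np.dot(beta, beta)       # (lambda/2) * ||beta||^2
    residuals = y - X @ beta                       # y_i - x_i . beta for all i
    squared_loss = 0.5 * np.mean(residuals ** 2)   # (1/n) * sum of (1/2) * residual^2
    return penalty + squared_loss
```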
## Closed Form
Like [[Multivariate Linear Regression#Closed-Form Solution|linear regression]], we can solve for the optimal $\beta$ parameters in closed form. To do so, we differentiate $J_{n, \lambda}$ w.r.t. $\beta$, set the gradient to zero, and solve for $\beta$, which yields:
$ \hat \beta = (\mathbb X^T \mathbb X+ \lambda \mathbf I)^{-1} \mathbb X^T \mathbf Y$
where $\mathbf I$ is the $p \times p$ identity matrix. For $\lambda > 0$, the additional term $\lambda \mathbf I$ ensures that the matrix $\mathbb X^T \mathbb X + \lambda \mathbf I$ is invertible, even if $\mathbb X^T \mathbb X$ is not.
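A minimal sketch of this estimate, assuming the NumPy setup from above and solving the linear system instead of forming the inverse explicitly:

```python
def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lambda I) beta = X^T y for beta."""
    p = X.shape[1]
    A = X.T @ X + lam * np.eye(p)    # X^T X + lambda I  (p x p)
    b = X.T @ y                      # X^T y
    return np.linalg.solve(A, b)     # numerically preferable to np.linalg.inv(A) @ b
```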
## Gradient Descent
For large datasets, where computing the closed-form solution is computationally expensive, [[Gradient Descent]] provides an iterative alternative. The gradient of the ridge regression loss w.r.t. $\beta$ is:
$ \nabla_\beta = \lambda \beta - \frac{1}{n}\sum_{i=1}^n(y_i-x_i\cdot \beta)\, x_i$
For a single observation $(x_i, y_i)$ the gradient is:
$ \nabla_\beta^{(i)} = \lambda\beta -\big(y_i-x_i\cdot \beta \big)\, x_i $
We initialize $\beta=0$, repeatedly pick an observation $(x_i, y_i)$ at random, and update $\beta$ by the negative gradient, scaled by the learning rate $\eta$.
![[Multivariate Linear Regression#^574a4e]]
Substituting the gradient expression:
$
\begin{align}
\beta &\leftarrow \beta- \eta\big(\lambda\beta-(y_i-x_i\cdot \beta)\, x_i\big) \\[6pt]
&=(1-\eta\lambda)\beta+ \eta\,(y_i-x_i\cdot \beta)\, x_i
\end{align}
$
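A minimal SGD sketch following this update rule (the learning rate, step count, and function name are illustrative assumptions, not taken from the note above):

```python
def ridge_sgd(X, y, lam, eta=0.01, n_steps=10_000, seed=0):
    """Stochastic gradient descent for ridge regression using the derived update."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)                    # initialize beta = 0
    for _ in range(n_steps):
        i = rng.integers(n)               # pick a random observation
        residual = y[i] - X[i] @ beta     # y_i - x_i . beta
        # beta <- (1 - eta*lambda) * beta + eta * residual * x_i
        beta = (1 - eta * lam) * beta + eta * residual * X[i]
    return beta
```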