The [[Newton-Raphson Method]] uses the slope of a univariate function $f$ to iteratively locate one of its roots. We extend this iterative approach of linear approximations to:
- Multivariate functions instead of univariate functions
- Finding minima/maxima instead of roots
## Gradient Vector
For a multivariate function $f(x,y)$ that takes multiple input variables and returns a scalar, we calculate its gradient vector $\nabla f$. The gradient collects all partial derivatives of $f$, each representing the rate of change of $f$ along one input variable:
$ \underbrace{\nabla f}_{\text{Gradient}}= \begin{bmatrix} \frac{\partial f}{\partial x}\\[6pt] \frac{\partial f}{\partial y} \end{bmatrix} $
![[gradient-descent.png|center|400]]
The gradient can be seen as a special case of the [[Jacobian]], when the function is scalar-valued (i.e. has a scalar as output).
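As a concrete check, here is a minimal Python sketch (not from the original note) that approximates the gradient of the example function $f(x,y)=x^2+y^2$ with central finite differences; the helper `numerical_gradient` and the test function are illustrative choices.

```python
import numpy as np

def f(p):
    """Illustrative scalar-valued function f(x, y) = x^2 + y^2."""
    x, y = p
    return x**2 + y**2

def numerical_gradient(f, p, h=1e-6):
    """Approximate the gradient of f at point p with central finite differences."""
    p = np.asarray(p, dtype=float)
    grad = np.zeros_like(p)
    for i in range(p.size):
        step = np.zeros_like(p)
        step[i] = h
        # df/dp_i  ≈  (f(p + h*e_i) - f(p - h*e_i)) / (2h)
        grad[i] = (f(p + step) - f(p - step)) / (2 * h)
    return grad

print(numerical_gradient(f, [1.0, 2.0]))  # ~ [2., 4.], matching the analytic gradient [2x, 2y]
```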
## Steepest Slope
The gradient $\nabla f$ at any point points in the direction of steepest ascent (i.e. the maximum increase in $f$). Intuitively, the gradient combines the slopes (partial derivatives) along each axis into one overall direction of steepest increase; the example and the directional-derivative argument below make this precise.
**Example:** For a function $f(x,y)$, assume that at the point $[x_0, y_0]$ we have $\frac{\partial f}{\partial x}=5$ and $\frac{\partial f}{\partial y}=0$. The gradient becomes $\nabla f=[5,0]$, pointing entirely in the $x$-direction.
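To make the steepest-ascent claim precise (a standard argument, added here for completeness): the rate of change of $f$ in a unit direction $\vec{u}$ is the directional derivative
$ D_{\vec{u}}f = \nabla f \cdot \vec{u} = \lVert \nabla f \rVert \, \lVert \vec{u} \rVert \cos\theta = \lVert \nabla f \rVert \cos\theta $
where $\theta$ is the angle between $\vec{u}$ and $\nabla f$. This is maximal for $\theta = 0$ ($\vec{u}$ parallel to $\nabla f$) and most negative for $\theta = \pi$, which is exactly why gradient descent steps against the gradient.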
## Gradient Descent Method
We start at some initial point $s_0$ and iteratively move in the direction of the negative gradient (evaluated at the current point $s_n$), scaled by a step-size parameter $\gamma$:
$ s_{n+1}=s_n-\gamma \,\nabla f(s_n) $
In the machine learning setting, $f$ usually represents the loss function. Since we want to minimize it and the gradient points towards the steepest upward slope, we have to step along the negative gradient.
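A minimal Python sketch of this update rule (an illustrative example, not part of the original note), using the same test function $f(x,y)=x^2+y^2$ whose analytic gradient is $\nabla f = [2x,\, 2y]$; names like `gradient_descent`, `gamma`, and `n_steps` are arbitrary.

```python
import numpy as np

def grad_f(s):
    """Analytic gradient of the illustrative function f(x, y) = x^2 + y^2."""
    x, y = s
    return np.array([2 * x, 2 * y])

def gradient_descent(grad_f, s0, gamma=0.1, n_steps=100):
    """Iterate s_{n+1} = s_n - gamma * grad_f(s_n), starting from s0."""
    s = np.asarray(s0, dtype=float)
    for _ in range(n_steps):
        s = s - gamma * grad_f(s)  # step against the gradient (steepest descent)
    return s

print(gradient_descent(grad_f, [3.0, -4.0]))  # converges towards the minimum at [0, 0]
```

For this particular function and $\gamma = 0.1$, each update shrinks the current point by a factor of $0.8$, so the iterates converge geometrically to the minimum at the origin; too large a $\gamma$ would instead overshoot and diverge.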