The [[Newton-Raphson Method]] uses the slope of a univariate function $f$ to iteratively locate one of its roots. We extend this iterative approach of linear approximations to:
- Multivariate functions instead of univariate functions
- Finding minima/maxima instead of roots
## Gradient Vector
For a multivariate function $f(x,y)$ that takes multiple input variables and returns a scalar, we calculate its gradient vector $\nabla f$. The gradient collects all partial derivatives of $f$, each representing the rate of change of $f$ along one input variable:
$ \underbrace{\nabla f}_{\text{Gradient}}= \begin{bmatrix} \frac{\partial f}{\partial x}\\[6pt] \frac{\partial f}{\partial y} \end{bmatrix} $
![[gradient-descent.png|center|400]]
The gradient can be seen as a special case of the [[Jacobian]], when the function is scalar-valued (i.e. has a scalar as output).
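As a concrete check, here is a minimal Python sketch (not from the original note) that approximates the gradient of the example function $f(x,y)=x^2+y^2$ with central finite differences; the helper `numerical_gradient` and the test function are illustrative choices.

```python
import numpy as np

def f(p):
    """Illustrative scalar-valued function f(x, y) = x^2 + y^2."""
    x, y = p
    return x**2 + y**2

def numerical_gradient(f, p, h=1e-6):
    """Approximate the gradient of f at point p with central finite differences."""
    p = np.asarray(p, dtype=float)
    grad = np.zeros_like(p)
    for i in range(p.size):
        step = np.zeros_like(p)
        step[i] = h
        # df/dp_i  ≈  (f(p + h*e_i) - f(p - h*e_i)) / (2h)
        grad[i] = (f(p + step) - f(p - step)) / (2 * h)
    return grad

print(numerical_gradient(f, [1.0, 2.0]))  # ~ [2., 4.], matching the analytic gradient [2x, 2y]
```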
## Steepest Slope
The gradient $\nabla f$ at any point points in the direction of steepest ascent (i.e. the maximum increase in $f$). Intuitively, the gradient combines the slopes (partial derivatives) along each axis into one overall direction of steepest increase; the example and the directional-derivative argument below make this precise.
**Example:** For a function $f(x,y)$, assume that at the point $[x_0, y_0]$ we have $\frac{\partial f}{\partial x}=5$ and $\frac{\partial f}{\partial y}=0$. The gradient becomes $\nabla f=[5,0]$, pointing entirely in the $x$-direction.
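To make the steepest-ascent claim precise (a standard argument, added here for completeness): the rate of change of $f$ in a unit direction $\vec{u}$ is the directional derivative
$ D_{\vec{u}}f = \nabla f \cdot \vec{u} = \lVert \nabla f \rVert \, \lVert \vec{u} \rVert \cos\theta = \lVert \nabla f \rVert \cos\theta $
where $\theta$ is the angle between $\vec{u}$ and $\nabla f$. This is maximal for $\theta = 0$ ($\vec{u}$ parallel to $\nabla f$) and most negative for $\theta = \pi$, which is exactly why gradient descent steps against the gradient.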
## Gradient Descent Method
We start at some initial point $s_0$ and iteratively move in the direction of the negative gradient (evaluated at the current point $s_n$), scaled by a step-size parameter $\gamma$:
$ s_{n+1}=s_n-\gamma \,\nabla f(s_n) $
In the machine learning setting, $f$ usually represents the loss function. Since we want to minimize it and the gradient points towards the steepest upward slope, we have to step along the negative gradient.
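A minimal Python sketch of this update rule (an illustrative example, not part of the original note), using the same test function $f(x,y)=x^2+y^2$ whose analytic gradient is $\nabla f = [2x,\, 2y]$; names like `gradient_descent`, `gamma`, and `n_steps` are arbitrary.

```python
import numpy as np

def grad_f(s):
    """Analytic gradient of the illustrative function f(x, y) = x^2 + y^2."""
    x, y = s
    return np.array([2 * x, 2 * y])

def gradient_descent(grad_f, s0, gamma=0.1, n_steps=100):
    """Iterate s_{n+1} = s_n - gamma * grad_f(s_n), starting from s0."""
    s = np.asarray(s0, dtype=float)
    for _ in range(n_steps):
        s = s - gamma * grad_f(s)  # step against the gradient (steepest descent)
    return s

print(gradient_descent(grad_f, [3.0, -4.0]))  # converges towards the minimum at [0, 0]
```

For this particular function and $\gamma = 0.1$, each update shrinks the current point by a factor of $0.8$, so the iterates converge geometrically to the minimum at the origin; too large a $\gamma$ would instead overshoot and diverge.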