While the [[Newton-Raphson Method]] is a root-finding algorithm, there is also the related Newton's Method for finding minima/maxima of a function $f$.

## Truncated Taylor Series

To locate a minimum of $f$, we approximate it locally using a truncated [[Taylor Series]] around a reference point $w_0$. The second-order approximation $g(u)$ of $f(u)$, truncating all terms beyond the second derivative, is given as:

$ g(u) = f(w_0) + f'(w_0) (u - w_0) + \frac{1}{2} f^{\prime \prime}(w_0) (u - w_0)^2 $

## Minimization

To find the minimum $u$ of the approximation $g(u)$, we take the derivative $\frac{\partial g}{\partial u}$ and set it to $0$. The derivatives of the three terms, taken separately, are:

$ \begin{aligned} \frac{\partial}{\partial u}\, f(w_0) &= 0, \\[6pt] \frac{\partial}{\partial u}\, f^\prime(w_0)(u-w_0) &= f^\prime(w_0), \\[6pt] \frac{\partial}{\partial u}\, \frac{1}{2} f^{\prime \prime}(w_0) (u - w_0)^2 &= \frac{1}{2} \cdot 2 \cdot f^{\prime \prime}(w_0)(u-w_0) \cdot 1 = f^{\prime \prime}(w_0)(u-w_0) \end{aligned} $

Resulting in the overall derivative:

$ \frac{\partial g}{\partial u} = f'(w_0) + f^{\prime \prime }(w_0) (u - w_0) \stackrel{!}{=} 0 $

Rearranging to solve for $u$, we get:

$ u = w_0 - \frac{f'(w_0)}{f^{\prime \prime }(w_0)} $

We can also write this as an iterative update formula, replacing $w_0 \mapsto w_t$ and $u \mapsto w_{t+1}$:

$ w_{t+1} = w_t - \frac{f^\prime(w_t)}{f^{\prime \prime }(w_t)} $

## Multi-Dimensional Newton's Method

For a multi-dimensional function $f: \mathbb{R}^d \to \mathbb{R}$ with $d > 1$, we generalize the update formula using gradient and [[Hessian]] notation:

$ \mathbf w_{t+1} = \mathbf w_t - \Big[\underbrace{(\nabla \nabla f)}_{\mathbf H}(\mathbf w_t) \Big]^{-1} (\nabla f)(\mathbf w_t) $

The second derivative $\nabla \nabla f$ is the Hessian matrix $\mathbf H$. While this term sat in the denominator in the one-dimensional case, there is no division in matrix form; instead we multiply by the inverse of the Hessian to achieve the same transformation.

## Gradient Descent

When computing the second derivative $\nabla \nabla f$ is too expensive, we can approximate Newton's Method with [[Gradient Descent]]. Instead of using the Hessian, we choose a fixed step size $\frac{1}{L}$ as a stand-in for the curvature:

$ \mathbf w_{t+1} = \mathbf w_t - \frac{1}{L} (\nabla f)(\mathbf w_t) $

|             | Newton's Method                                                           | Gradient Descent                                                                        |
| ----------- | ------------------------------------------------------------------------- | --------------------------------------------------------------------------------------- |
| Derivatives | First and second derivative                                                | Only first derivative                                                                    |
| Computation | More expensive to compute $\mathbf H$ and its inverse                      | Less expensive                                                                           |
| Step Size   | Implicitly governed by the curvature captured in $\mathbf H$               | Step size (learning rate) $\frac{1}{L}$ needs to be set manually                         |
| Convergence | Usually quicker for well-behaved functions due to the adaptive step size   | Takes more steps to converge, but might be desirable for complex, high-dimensional data  |
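
To make the one-dimensional update concrete, here is a minimal Python sketch of the iteration $w_{t+1} = w_t - \frac{f'(w_t)}{f''(w_t)}$. The example function $f(w) = (w-3)^2 + 1$, the starting point, and the step count are illustrative assumptions, not part of the note.

```python
# Minimal sketch of the 1D Newton update w_{t+1} = w_t - f'(w_t) / f''(w_t).
def newton_1d(f_prime, f_double_prime, w0, steps=10):
    """Iterate the Newton update starting from w0 and return the final point."""
    w = w0
    for _ in range(steps):
        w = w - f_prime(w) / f_double_prime(w)
    return w

# Illustrative example: f(w) = (w - 3)^2 + 1 has its minimum at w = 3.
f_prime = lambda w: 2 * (w - 3)    # f'(w)
f_double_prime = lambda w: 2.0     # f''(w)

print(newton_1d(f_prime, f_double_prime, w0=0.0))  # prints 3.0
```

Because the quadratic approximation $g$ is exact when $f$ itself is quadratic, the method lands on the minimum after a single step in this example.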
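
For the multi-dimensional case, the following NumPy sketch compares a Newton step against the gradient-descent approximation on an assumed quadratic $f(\mathbf w) = \frac{1}{2}\mathbf w^\top \mathbf A \mathbf w - \mathbf b^\top \mathbf w$; the matrix $\mathbf A$, the vector $\mathbf b$, and the choice of $L$ are illustrative assumptions.

```python
import numpy as np

# f(w) = 1/2 w^T A w - b^T w, so grad f(w) = A w - b and the Hessian H = A.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])   # assumed Hessian (constant for a quadratic)
b = np.array([1.0, -2.0])    # assumed linear term

grad = lambda w: A @ w - b   # gradient of f
hess = lambda w: A           # Hessian of f

def newton_step(w):
    # Newton update: solve H d = grad f rather than forming the inverse explicitly.
    return w - np.linalg.solve(hess(w), grad(w))

def gd_step(w, L):
    # Gradient-descent approximation with fixed step size 1/L.
    return w - (1.0 / L) * grad(w)

w_newton = newton_step(np.zeros(2))   # exact minimiser A^{-1} b in one step
L = np.linalg.eigvalsh(A).max()       # one common choice: the largest curvature of A
w_gd = np.zeros(2)
for _ in range(100):                  # gradient descent needs many steps instead
    w_gd = gd_step(w_gd, L)

print(w_newton)   # equals A^{-1} b
print(w_gd)       # approaches the same point after many iterations
```

This mirrors the table above: the Newton step uses the curvature in $\mathbf H$ and reaches the minimiser of the quadratic in one iteration, while gradient descent with the manually chosen step size $\frac{1}{L}$ only gets there gradually.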